githubDataSampler/plan.md
2026-04-22 03:05:00 -07:00

GitHub Commit Pipeline Plan

Summary

Build a Python pipeline with two separate CLI commands:

  • sample-repos: collect repositories from GitHub Search using config-backed defaults and optional runtime query overrides, then save repository metadata.
  • fetch-commits: read the saved repositories and fetch commit history from each repository's current default branch.

Randomized sampling stays out of scope for v1, but phase 1 should preserve enough metadata and modularity to support future sampling improvements without changing phase 2 behavior.

Key Changes

Phase 1 behavior

  • Use GET /search/repositories as the primary discovery endpoint.
  • Build the effective search query from:
    • default search constants stored outside the CLI
    • optional runtime --query override supplied by the user
  • Start with default filters equivalent to:
    • is:public
    • stars:>10
    • size:>1000
    • archived:false
    • exclude forks by default
  • Keep repository selection deterministic/simple for v1:
    • fetch search results with normal pagination
    • save accepted repositories in returned order
    • no randomization logic in v1
  • Deduplicate repositories by GitHub repo_id.
  • Persist:
    • raw repo dataset
    • seen-repo index keyed by repo_id
    • run metadata containing the final resolved search query and CLI inputs
  • Default repo count for phase 1 is n=10 when the user does not provide a count.
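
The selection rules above (returned order, deduplication by repo_id, stop at the requested count) can be sketched as follows. This is a minimal illustration, not the planned implementation: the function name collect_unique_repos and the pages argument (an iterable of already-fetched result pages) are assumptions made for the example, and the actual HTTP calls are left out.

```python
def collect_unique_repos(pages, count=10, seen_ids=None):
    """Accept repos in returned order, skipping repo_ids already seen.

    `pages` is any iterable of lists of search result items (hypothetical
    input shape); each item needs at least an "id" key.
    """
    seen = set(seen_ids or ())
    accepted = []
    if count <= 0:
        return accepted, seen
    for page_number, items in enumerate(pages, start=1):
        for item in items:
            repo_id = item["id"]
            if repo_id in seen:
                continue  # duplicate within this run or from an earlier run
            seen.add(repo_id)
            # Remember which result page the repo came from (sample_page).
            accepted.append({**item, "sample_page": page_number})
            if len(accepted) >= count:
                return accepted, seen
    return accepted, seen
```

Passing the seen-repo index from a previous run as seen_ids gives deduplication across reruns; leaving it empty gives deduplication within a single run only.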

Phase 2 behavior

  • Refresh repository metadata with GET /repos/{owner}/{repo} before commit extraction.
  • Fetch commits only from the repository's current default branch using GET /repos/{owner}/{repo}/commits.
  • Treat storage identity for commit records as repo_id + sha so the same SHA in different repository contexts is not accidentally collapsed.
  • Write one normalized commit record per JSONL line.
  • Prototype with a default truncation cap of 1 page per repository:
    • default per_page=100
    • default max_pages_per_repo=1
  • If a repository hits the page cap:
    • mark it success_with_warning
    • set truncated=true
    • record truncation_reason=max_pages
  • If a repository becomes inaccessible or repeatedly fails:
    • log the repo and failure reason
    • mark it failed in run state
    • continue processing remaining repositories
  • Support rerun policy via flag:
    • default: refresh from scratch
    • optional resume mode: skip repos already marked complete
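
A rough sketch of the page cap and truncation marking described above. The fetch_page callable is a hypothetical stand-in for the real request to GET /repos/{owner}/{repo}/commits, and the result dictionary shape is an assumption for illustration.

```python
def fetch_commits_capped(fetch_page, per_page=100, max_pages_per_repo=1):
    """Fetch up to max_pages_per_repo pages and report truncation.

    `fetch_page(page_number, per_page)` is a hypothetical callable that
    returns the list of commit items for one page.
    """
    commits = []
    for page_number in range(1, max_pages_per_repo + 1):
        batch = fetch_page(page_number, per_page)
        commits.extend(batch)
        if len(batch) < per_page:
            # A short page means the history ended before the cap.
            return {
                "commits": commits,
                "status": "complete",
                "truncated": False,
                "truncation_reason": None,
            }
    # Every allowed page came back full, so the cap was hit. This sketch
    # cannot tell whether more commits exist without one more request.
    return {
        "commits": commits,
        "status": "success_with_warning",
        "truncated": True,
        "truncation_reason": "max_pages",
    }
```

One consequence of this simplification: a repository whose history is exactly per_page × max_pages_per_repo commits long is also marked truncated. Inspecting the pagination Link header of the last response would be one way to avoid that.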

CLI and storage

  • Expose two commands:
    • sample-repos
    • fetch-commits
  • Keep v1 flags focused:
    • sample-repos: --count defaulting to 10, output root, rerun/dedupe mode, optional --query
    • fetch-commits: input repo dataset or run folder, output root, refresh-vs-resume mode, --max-pages-per-repo, retry settings
  • Organize outputs under a run_id folder with separate phase 1 and phase 2 subfolders.
  • Keep phase 1 and phase 2 datasets separate.
  • Write phase 2 output as chunked or partitioned JSONL files rather than one monolithic file.
  • Include runnable test commands in project docs/help output so phase 1 and phase 2 can be exercised from the terminal immediately after setup.
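
One possible way to write phase 2 output as chunked JSONL, shown as a sketch. The file naming scheme (commits-00000.jsonl and so on), the chunk_size default, and the function name are assumptions; the plan does not fix any of them.

```python
import json
from pathlib import Path


def write_chunked_jsonl(records, out_dir, chunk_size=1000, prefix="commits"):
    """Write records as JSONL, starting a new file every chunk_size records.

    Returns the list of file paths written. File names are hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    handle = None
    try:
        for index, record in enumerate(records):
            if index % chunk_size == 0:
                if handle:
                    handle.close()
                path = out_dir / f"{prefix}-{index // chunk_size:05d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            # One normalized record per line.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    finally:
        if handle:
            handle.close()
    return paths
```

Because records can be any iterable, commits can be streamed to disk repository by repository without holding the whole dataset in memory.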

Public Interfaces / Data Shape

CLI examples to include for testing

  • Default phase 1 run:
    • python main.py sample-repos
  • Phase 1 with explicit repo count:
    • python main.py sample-repos --count 25
  • Phase 1 with runtime search override:
    • python main.py sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
  • Default phase 2 run against a run folder:
    • python main.py fetch-commits --run-id <run_id>
  • Phase 2 with resume and higher page cap:
    • python main.py fetch-commits --run-id <run_id> --resume --max-pages-per-repo 3
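
The commands above could be wired up with argparse subcommands along these lines. This is a sketch under assumptions: --output-root and its default value "runs" are hypothetical names for the "output root" option, and the retry and rerun/dedupe flags are omitted because the plan does not name them yet.

```python
import argparse


def build_parser():
    """Build a two-command CLI matching the documented examples."""
    parser = argparse.ArgumentParser(prog="main.py")
    commands = parser.add_subparsers(dest="command", required=True)

    sample = commands.add_parser("sample-repos", help="phase 1: collect repositories")
    sample.add_argument("--count", type=int, default=10)
    sample.add_argument("--query", default=None)
    sample.add_argument("--output-root", default="runs")  # hypothetical flag name

    fetch = commands.add_parser("fetch-commits", help="phase 2: fetch commit history")
    fetch.add_argument("--run-id", required=True)
    fetch.add_argument("--resume", action="store_true")  # default is refresh from scratch
    fetch.add_argument("--max-pages-per-repo", type=int, default=1)
    fetch.add_argument("--output-root", default="runs")  # hypothetical flag name
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

argparse converts dashes in flag names to underscores, so --max-pages-per-repo is read back as args.max_pages_per_repo.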

Phase 1 repo record

Each repo record should include at minimum:

  • run_id
  • repo_id
  • full_name
  • html_url
  • api_url
  • default_branch
  • language
  • description
  • stargazers_count
  • size_kb
  • fork
  • archived
  • visibility
  • sample_query
  • sample_page
  • sampled_at
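
As an illustration, the record above could be derived from a single search result item roughly like this. The input keys (id, full_name, url, size, and so on) follow the repository objects returned by the GitHub REST API; the function name and its parameters are assumptions for the example.

```python
from datetime import datetime, timezone


def to_repo_record(item, run_id, sample_query, sample_page, sampled_at=None):
    """Map one search result item to the phase 1 repo record."""
    return {
        "run_id": run_id,
        "repo_id": item["id"],
        "full_name": item["full_name"],
        "html_url": item.get("html_url"),
        "api_url": item.get("url"),
        "default_branch": item.get("default_branch"),
        "language": item.get("language"),
        "description": item.get("description"),
        "stargazers_count": item.get("stargazers_count"),
        "size_kb": item.get("size"),  # the API reports size in kilobytes
        "fork": item.get("fork"),
        "archived": item.get("archived"),
        "visibility": item.get("visibility"),
        "sample_query": sample_query,
        "sample_page": sample_page,
        # UTC ISO 8601 timestamp unless the caller supplies one.
        "sampled_at": sampled_at or datetime.now(timezone.utc).isoformat(),
    }
```

Only id and full_name are treated as required here; the other fields fall back to None so that a sparse response does not abort the run.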

Phase 2 commit record

Each commit record should include at minimum:

  • run_id
  • repo_id
  • repo_full_name
  • repo_html_url
  • default_branch_at_fetch
  • sha
  • parent_shas
  • author_name
  • author_email
  • author_date
  • committer_name
  • committer_email
  • committer_date
  • message
  • html_url
  • fetched_at
  • source_endpoint
  • page_number
  • truncated
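
A sketch of how one item from the commits listing might be normalized into this record, together with the repo_id + sha storage identity. The nested input keys (commit.author, commit.committer, parents) follow the GitHub REST API response; the helper names, the repo argument (a phase 1 repo record), and the "repo_id:sha" key format are assumptions.

```python
from datetime import datetime, timezone


def to_commit_record(item, repo, run_id, page_number, truncated=False, fetched_at=None):
    """Map one commit list item to the phase 2 commit record."""
    commit = item.get("commit") or {}
    author = commit.get("author") or {}
    committer = commit.get("committer") or {}
    return {
        "run_id": run_id,
        "repo_id": repo["repo_id"],
        "repo_full_name": repo["full_name"],
        "repo_html_url": repo["html_url"],
        "default_branch_at_fetch": repo["default_branch"],
        "sha": item["sha"],
        "parent_shas": [parent["sha"] for parent in item.get("parents") or []],
        "author_name": author.get("name"),
        "author_email": author.get("email"),
        "author_date": author.get("date"),
        "committer_name": committer.get("name"),
        "committer_email": committer.get("email"),
        "committer_date": committer.get("date"),
        "message": commit.get("message"),
        "html_url": item.get("html_url"),
        "fetched_at": fetched_at or datetime.now(timezone.utc).isoformat(),
        "source_endpoint": f"/repos/{repo['full_name']}/commits",
        "page_number": page_number,
        "truncated": truncated,
    }


def commit_key(record):
    """Storage identity: repo_id + sha (the key format is hypothetical)."""
    return f"{record['repo_id']}:{record['sha']}"
```

Keying on both fields keeps the same SHA in a fork and in its parent repository as two distinct records.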

Run state / config shape

Persist lightweight state to support resume and future extensibility:

  • resolved search query after merging defaults and runtime override
  • per-run repo status: pending, complete, failed, success_with_warning
  • failure reason if failed
  • seen repo index keyed by repo_id
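
To make the resume behavior concrete, here is a sketch of a possible state layout and the selection logic built on it. The JSON key names (resolved_query, repos, status, failure_reason) are assumptions for illustration; only the status values come from the plan.

```python
import json

# Hypothetical on-disk state; repo_ids become strings once serialized as JSON keys.
EXAMPLE_STATE = json.loads("""
{
  "resolved_query": "is:public stars:>10 size:>1000 archived:false",
  "repos": {
    "101": {"status": "complete"},
    "102": {"status": "failed", "failure_reason": "404 Not Found"},
    "103": {"status": "success_with_warning"},
    "104": {"status": "pending"}
  }
}
""")


def repos_to_process(repo_ids, state, resume=False):
    """Return the repo_ids that phase 2 should fetch in this run.

    Refresh mode (the default) processes everything; resume mode skips
    only repos whose status is exactly "complete".
    """
    if not resume:
        return list(repo_ids)
    statuses = state.get("repos", {})
    return [
        repo_id
        for repo_id in repo_ids
        if statuses.get(str(repo_id), {}).get("status") != "complete"
    ]
```

In this sketch, repos marked success_with_warning are fetched again on resume, which lets a later run with a higher page cap fill in truncated history. Whether that is the desired policy is a decision the plan leaves open.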

Test Plan

  • Phase 1 uses default count=10 when no repo count is supplied.
  • Phase 1 accepts an explicit repo count and stops after collecting that many unique repos.
  • Phase 1 resolves the effective search query correctly from config defaults plus optional --query.
  • Phase 1 calls GitHub Search with the expected final filters.
  • Phase 1 deduplicates repositories by repo_id within a run and across reruns.
  • Phase 1 persists repo records and seen-index state correctly.
  • Phase 2 refreshes repo metadata before commit extraction.
  • Phase 2 fetches only default-branch history.
  • Phase 2 stops after 1 page by default and marks capped repos as success_with_warning.
  • Phase 2 writes one commit per JSONL line with truncation metadata when capped.
  • Phase 2 logs and skips inaccessible repositories without failing the whole run.
  • Resume mode skips repos already marked complete.
  • Refresh mode re-fetches repos from scratch.
  • The documented CLI example commands run successfully with defaults and with explicit overrides.

Assumptions and Defaults

  • Randomized sampling is out of scope for v1.
  • V1 phase 1 is a search-and-save workflow, not a statistically rigorous sampler.
  • Search defaults live outside the CLI in a config/constants source.
  • --query is supported at runtime for phase 1 and is treated as the user-supplied search query input.
  • Default phase 2 behavior is refresh from scratch; resume is opt-in.
  • Default phase 2 truncation is 1 page per repository for prototyping and storage control.
  • JSONL is the raw storage format for v1.
  • Future sampling improvements should be additive and should not require changing the repo dataset schema, commit dataset schema, or phase 2 contract.
  • Remaining CLI clarification to lock before implementation: whether --query should fully replace the default search query or be appended on top of the default filters. If unspecified, the implementation should default to replacing the search query entirely and record the resolved final query in run metadata.
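
The two candidate semantics for --query can be compared with a small sketch. The names DEFAULT_QUERY, resolve_query, and the mode parameter are hypothetical; the default of "replace" mirrors the fallback stated above.

```python
# Default filters from the plan. No fork qualifier is included because
# GitHub repository search leaves forks out unless fork:true or fork:only is added.
DEFAULT_QUERY = "is:public stars:>10 size:>1000 archived:false"


def resolve_query(override=None, mode="replace", default=DEFAULT_QUERY):
    """Return the final search query to send and to record in run metadata."""
    if override is None or not override.strip():
        return default
    override = override.strip()
    if mode == "replace":
        # The user-supplied query is used as-is; defaults are dropped.
        return override
    if mode == "append":
        # The user-supplied qualifiers are added on top of the defaults.
        return f"{default} {override}"
    raise ValueError(f"unknown query mode: {mode!r}")
```

With replace semantics, `--query "language:python"` drops the star, size, and archived filters unless the user repeats them; with append semantics they stay in place. Recording the resolved string in run metadata keeps either choice auditable.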