GitHub Commit Pipeline Plan
Summary
Build a Python pipeline with two separate CLI commands:
- `sample-repos`: collect repositories from GitHub Search using config-backed defaults and optional runtime query overrides, then save repository metadata.
- `fetch-commits`: read the saved repositories and fetch commit history from each repository’s current default branch.
Randomized sampling stays out of scope for v1, but phase 1 should preserve enough metadata and modularity to support future sampling improvements without changing phase 2 behavior.
Key Changes
Phase 1 behavior
- Use `GET /search/repositories` as the primary discovery endpoint.
- Build the effective search query from:
  - default search constants stored outside the CLI
  - optional runtime `--query` override supplied by the user
- Start with default filters equivalent to:
  - `is:public`
  - `stars:>10`
  - `size:>1000`
  - `archived:false`
  - exclude forks by default
- Keep repository selection deterministic/simple for v1:
  - fetch search results with normal pagination
  - save accepted repositories in returned order
  - no randomization logic in v1
- Deduplicate repositories by GitHub `repo_id`.
- Persist:
  - raw repo dataset
  - seen-repo index keyed by `repo_id`
  - run metadata containing the final resolved search query and CLI inputs
- Default repo count for phase 1 is `n=10` when the user does not provide a count.
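The selection and deduplication rules above can be sketched as a small pure function. This is a minimal sketch, not the planned implementation: `collect_repos` and its `fetch_page` callable are hypothetical names, and `fetch_page(page)` is assumed to return the `items` list of one `GET /search/repositories` result page (an empty list when results run out).

```python
from typing import Callable, Iterable


def collect_repos(
    fetch_page: Callable[[int], list[dict]],
    count: int = 10,
    seen_ids: Iterable[int] = (),
) -> list[dict]:
    """Collect up to `count` unique repositories, keeping the returned order."""
    seen = set(seen_ids)  # repo_ids already saved by earlier runs, if any
    accepted: list[dict] = []
    page = 1
    while len(accepted) < count:
        items = fetch_page(page)
        if not items:  # no more search results
            break
        for repo in items:
            repo_id = repo["id"]
            if repo_id in seen:
                continue  # duplicate within this run or from a previous run
            seen.add(repo_id)
            accepted.append(repo)
            if len(accepted) == count:
                break
        page += 1
    return accepted


# Usage with two fake result pages that overlap on repo 2.
pages = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 2}, {"id": 3}]}
repos = collect_repos(lambda p: pages.get(p, []), count=3)
print([r["id"] for r in repos])  # [1, 2, 3]
```

Passing the page fetcher in as a callable keeps the HTTP client out of the selection logic, which should make it easier to swap in a different sampling strategy later without touching phase 2.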
Phase 2 behavior
- Refresh repository metadata with `GET /repos/{owner}/{repo}` before commit extraction.
- Fetch commits only from the repository’s current default branch using `GET /repos/{owner}/{repo}/commits`.
- Treat storage identity for commit records as `repo_id + sha` so the same SHA in different repository contexts is not accidentally collapsed.
- Write one normalized commit record per JSONL line.
- Prototype with a default truncation cap of `1` page per repository:
  - default `per_page=100`
  - default `max_pages_per_repo=1`
- If a repository hits the page cap:
  - mark it `success_with_warning`
  - set `truncated=true`
  - record `truncation_reason=max_pages`
- If a repository becomes inaccessible or repeatedly fails:
  - log the repo and failure reason
  - mark it failed in run state
  - continue processing remaining repositories
- Support rerun policy via flag:
  - default: refresh from scratch
  - optional resume mode: skip repos already marked complete
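The page cap and truncation marking described above might look like the following. It is a sketch under stated assumptions: `fetch_commits_capped`, `commit_key`, and the `fetch_page(page, per_page)` callable are hypothetical names, and "a full page at the cap" is used as a heuristic for "more history remains". A repository whose history ends exactly on a page boundary would be flagged as truncated; checking the response's `Link` header for a next page would be more precise.

```python
from typing import Callable


def commit_key(repo_id: int, sha: str) -> str:
    """Storage identity: the same SHA under two repo_ids yields two keys."""
    return f"{repo_id}:{sha}"


def fetch_commits_capped(
    fetch_page: Callable[[int, int], list[dict]],
    per_page: int = 100,
    max_pages_per_repo: int = 1,
) -> dict:
    """Fetch commit pages until history ends or the page cap is reached."""
    commits: list[dict] = []
    truncated = False
    page = 1
    while True:
        items = fetch_page(page, per_page)
        commits.extend(items)
        if len(items) < per_page:
            break  # short page: history is exhausted
        if page >= max_pages_per_repo:
            truncated = True  # heuristic: a full page at the cap
            break
        page += 1
    return {
        "commits": commits,
        "status": "success_with_warning" if truncated else "complete",
        "truncated": truncated,
        "truncation_reason": "max_pages" if truncated else None,
    }
```

The failure path (log, mark failed, continue) is left out here; it would wrap the call per repository so that one exception does not stop the run.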
CLI and storage
- Expose two commands:
  - `sample-repos`
  - `fetch-commits`
- Keep v1 flags focused:
  - `sample-repos`: `--count` defaulting to `10`, output root, rerun/dedupe mode, optional `--query`
  - `fetch-commits`: input repo dataset or run folder, output root, refresh-vs-resume mode, `--max-pages-per-repo`, retry settings
- Organize outputs under a `run_id` folder with separate phase 1 and phase 2 subfolders.
- Keep phase 1 and phase 2 datasets separate.
- Write phase 2 output as chunked or partitioned JSONL files rather than one monolithic file.
- Include runnable test commands in project docs/help output so phase 1 and phase 2 can be exercised from the terminal immediately after setup.
Public Interfaces / Data Shape
CLI examples to include for testing
- Default phase 1 run:
python main.py sample-repos
- Phase 1 with explicit repo count:
python main.py sample-repos --count 25
- Phase 1 with runtime search override:
python main.py sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
- Default phase 2 run against a run folder:
python main.py fetch-commits --run-id <run_id>
- Phase 2 with resume and higher page cap:
python main.py fetch-commits --run-id <run_id> --resume --max-pages-per-repo 3
Phase 1 repo record
Each repo record should include at minimum:
- `run_id`
- `repo_id`
- `full_name`
- `html_url`
- `api_url`
- `default_branch`
- `language`
- `description`
- `stargazers_count`
- `size_kb`
- `fork`
- `archived`
- `visibility`
- `sample_query`
- `sample_page`
- `sampled_at`
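As an illustration of how a search result item could be mapped onto this record, here is a minimal sketch. The function name `to_repo_record` is hypothetical. The mapping assumes the GitHub repository object exposes `id`, `url`, and `size` (reported in kilobytes) under those keys, and it stamps `sampled_at` with the current UTC time.

```python
from datetime import datetime, timezone


def to_repo_record(item: dict, run_id: str, sample_query: str, sample_page: int) -> dict:
    """Normalize one search result item into a phase 1 repo record."""
    return {
        "run_id": run_id,
        "repo_id": item["id"],
        "full_name": item["full_name"],
        "html_url": item["html_url"],
        "api_url": item["url"],  # assumed: the API URL lives under "url"
        "default_branch": item.get("default_branch"),
        "language": item.get("language"),
        "description": item.get("description"),
        "stargazers_count": item.get("stargazers_count"),
        "size_kb": item.get("size"),  # assumed: "size" is in kilobytes
        "fork": item.get("fork"),
        "archived": item.get("archived"),
        "visibility": item.get("visibility"),
        "sample_query": sample_query,
        "sample_page": sample_page,
        "sampled_at": datetime.now(timezone.utc).isoformat(),
    }
```

Optional fields use `.get()` so a missing `language` or `description` becomes `None` instead of failing the whole record.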
Phase 2 commit record
Each commit record should include at minimum:
- `run_id`
- `repo_id`
- `repo_full_name`
- `repo_html_url`
- `default_branch_at_fetch`
- `sha`
- `parent_shas`
- `author_name`
- `author_email`
- `author_date`
- `committer_name`
- `committer_email`
- `committer_date`
- `message`
- `html_url`
- `fetched_at`
- `source_endpoint`
- `page_number`
- `truncated`
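A sketch of the normalization step for one item returned by `GET /repos/{owner}/{repo}/commits`. The helper names `to_commit_record` and `to_jsonl_line` are hypothetical, `repo` is assumed to be a refreshed phase 1 repo record, and the nested `commit.author` / `commit.committer` objects are treated as possibly missing.

```python
import json


def to_commit_record(item: dict, repo: dict, run_id: str, page_number: int,
                     truncated: bool, fetched_at: str) -> dict:
    """Normalize one commit list item into a phase 2 commit record."""
    commit = item["commit"]
    author = commit.get("author") or {}
    committer = commit.get("committer") or {}
    return {
        "run_id": run_id,
        "repo_id": repo["repo_id"],
        "repo_full_name": repo["full_name"],
        "repo_html_url": repo["html_url"],
        "default_branch_at_fetch": repo["default_branch"],
        "sha": item["sha"],
        "parent_shas": [p["sha"] for p in item.get("parents", [])],
        "author_name": author.get("name"),
        "author_email": author.get("email"),
        "author_date": author.get("date"),
        "committer_name": committer.get("name"),
        "committer_email": committer.get("email"),
        "committer_date": committer.get("date"),
        "message": commit.get("message"),
        "html_url": item.get("html_url"),
        "fetched_at": fetched_at,
        "source_endpoint": f"/repos/{repo['full_name']}/commits",
        "page_number": page_number,
        "truncated": truncated,
    }


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line."""
    return json.dumps(record, ensure_ascii=False) + "\n"
```

`json.dumps` escapes newlines inside commit messages, so each record stays on one physical line.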
Run state / config shape
Persist lightweight state to support resume and future extensibility:
- resolved search query after merging defaults and runtime override
- per-run repo status: pending, complete, failed, success_with_warning
- failure reason if failed
- seen repo index keyed by `repo_id`
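One way the run state could be persisted and consulted for resume is a single JSON file per run. Everything about the layout here is an assumption for illustration: the keys `resolved_query`, `repos`, and `seen_repo_ids`, and the helpers `load_state`, `save_state`, and `repos_to_process`. Repo ids are stored as string keys because JSON object keys are always strings.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Load run state, or return an empty state if the file does not exist."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"resolved_query": None, "repos": {}, "seen_repo_ids": []}


def save_state(path: Path, state: dict) -> None:
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")


def repos_to_process(state: dict, repo_ids: list[int], resume: bool) -> list[int]:
    """Refresh mode processes everything; resume mode skips complete repos."""
    if not resume:
        return list(repo_ids)
    return [
        rid for rid in repo_ids
        if state["repos"].get(str(rid), {}).get("status") != "complete"
    ]
```

In this sketch, repos marked `failed` or `success_with_warning` are retried on resume, since only `complete` is skipped; whether capped repos should also be skipped is a decision the plan leaves open.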
Test Plan
- Phase 1 uses default `count=10` when no repo count is supplied.
- Phase 1 accepts an explicit repo count and stops after collecting that many unique repos.
- Phase 1 resolves the effective search query correctly from config defaults plus optional `--query`.
- Phase 1 calls GitHub Search with the expected final filters.
- Phase 1 deduplicates repositories by `repo_id` within a run and across reruns.
- Phase 1 persists repo records and seen-index state correctly.
- Phase 2 refreshes repo metadata before commit extraction.
- Phase 2 fetches only default-branch history.
- Phase 2 stops after 1 page by default and marks capped repos as `success_with_warning`.
- Phase 2 writes one commit per JSONL line with truncation metadata when capped.
- Phase 2 logs and skips inaccessible repositories without failing the whole run.
- Resume mode skips repos already marked complete.
- Refresh mode re-fetches repos from scratch.
- The documented CLI example commands run successfully with defaults and with explicit overrides.
Assumptions and Defaults
- Randomized sampling is out of scope for v1.
- V1 phase 1 is a search-and-save workflow, not a statistically rigorous sampler.
- Search defaults live outside the CLI in a config/constants source.
- `--query` is supported at runtime for phase 1 and is treated as the user-supplied search query input.
- Default phase 2 behavior is refresh from scratch; resume is opt-in.
- Default phase 2 truncation is `1` page per repository for prototyping and storage control.
- JSONL is the raw storage format for v1.
- Future sampling improvements should be additive and should not require changing the repo dataset schema, commit dataset schema, or phase 2 contract.
- Remaining CLI clarification to lock before implementation: whether `--query` should fully replace the default search query or be appended on top of the default filters. If unspecified, the implementation should default to replacing the search query entirely and record the resolved final query in run metadata.
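The open replace-versus-append question can be made concrete with a small resolver that supports both and defaults to replace. This is a sketch only: `resolve_query`, its `mode` parameter, and the `DEFAULT_QUERY` constant are hypothetical names. The default string carries no fork qualifier on the assumption that GitHub repository search leaves forks out unless `fork:true` or `fork:only` is given; that should be verified before relying on it.

```python
from typing import Optional

# Assumed default filters from the plan. Forks are assumed to be excluded
# by GitHub repository search itself, so no fork qualifier is added here.
DEFAULT_QUERY = "is:public stars:>10 size:>1000 archived:false"


def resolve_query(override: Optional[str] = None, mode: str = "replace",
                  default: str = DEFAULT_QUERY) -> str:
    """Return the final search query to record in run metadata."""
    if override is None or not override.strip():
        return default
    override = override.strip()
    if mode == "replace":
        return override
    if mode == "append":
        return f"{default} {override}"
    raise ValueError(f"unknown mode: {mode!r}")


print(resolve_query())                                   # default filters
print(resolve_query("is:public stars:>50"))              # replace (default mode)
print(resolve_query("language:python", mode="append"))   # defaults plus override
```

Keeping the mode as an explicit argument means the decision can be changed later in one place, and the resolved string written to run metadata stays the single source of truth for what was actually searched.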