GitHub Commit Pipeline Plan

Summary

Build a Python pipeline with two separate CLI commands:

sample-repos: collect repositories from GitHub Search using config-backed defaults and optional runtime query overrides, then save repository metadata.
fetch-commits: read the saved repositories and fetch commit history from each repository’s current default branch.

Randomized sampling stays out of scope for v1, but phase 1 should preserve enough metadata and modularity to support future sampling improvements without changing phase 2 behavior.

Key Changes

Phase 1 behavior

Use GET /search/repositories as the primary discovery endpoint.
Build the effective search query from:
- default search constants stored outside the CLI
- optional runtime --query override supplied by the user
Start with default filters equivalent to:
- is:public
- stars:>10
- size:>1000 (GitHub Search measures repository size in kilobytes)
- archived:false
- exclude forks by default
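
A minimal sketch of query resolution, assuming the replace semantics that the Assumptions section names as the fallback. DEFAULT_QUERY_TERMS and resolve_query are hypothetical names. No fork qualifier appears in the defaults because, to my understanding, GitHub repository search omits forks unless fork:true or fork:only is supplied; verify that before relying on it.

```python
from typing import Optional

# Hypothetical constants module content; mirrors the default filters above.
# Forks are assumed to be excluded by GitHub Search unless fork:true or
# fork:only is added, so no fork qualifier is listed here.
DEFAULT_QUERY_TERMS = ("is:public", "stars:>10", "size:>1000", "archived:false")


def resolve_query(override: Optional[str] = None) -> str:
    """Return the effective search query.

    Assumption: a non-empty --query value replaces the defaults entirely.
    """
    if override is not None and override.strip():
        # Collapse repeated whitespace so the recorded query is stable.
        return " ".join(override.split())
    return " ".join(DEFAULT_QUERY_TERMS)
```

The returned string is what would be stored in run metadata as the final resolved search query.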
Keep repository selection deterministic and simple for v1:
- fetch search results with normal pagination
- save accepted repositories in returned order
- no randomization logic in v1
Deduplicate repositories by GitHub repo_id.
Persist:
- raw repo dataset
- seen-repo index keyed by repo_id
- run metadata containing the final resolved search query and CLI inputs
Default repo count for phase 1 is n=10 when the user does not provide a count.
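
The collection loop above could look roughly like this. It is a sketch only: collect_repos is a hypothetical name, and fetch_page stands in for whatever function wraps GET /search/repositories and returns the items list for a given page. The max_pages default of 10 is an assumption based on the Search API returning at most 1,000 results per query at 100 per page.

```python
from typing import Callable, Dict, Iterable, List, Set


def collect_repos(
    fetch_page: Callable[[int], List[Dict]],
    count: int = 10,
    seen_ids: Iterable[int] = (),
    max_pages: int = 10,
) -> List[Dict]:
    """Collect up to `count` unique repositories in returned order."""
    seen: Set[int] = set(seen_ids)  # repo_ids from earlier runs, if any
    accepted: List[Dict] = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # no more search results
        for item in items:
            repo_id = item["id"]
            if repo_id in seen:
                continue  # deduplicate by GitHub repo_id
            seen.add(repo_id)
            accepted.append({**item, "sample_page": page})
            if len(accepted) >= count:
                return accepted
    return accepted
```

Passing the persisted seen-repo index as seen_ids gives deduplication across reruns; leaving it empty limits deduplication to the current run.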

Phase 2 behavior

Refresh repository metadata with GET /repos/{owner}/{repo} before commit extraction.
Fetch commits only from the repository’s current default branch using GET /repos/{owner}/{repo}/commits.
Use repo_id + sha as the storage identity for commit records so that the same SHA appearing in different repositories (for example, a fork and its parent) is not collapsed into a single record.
Write one normalized commit record per JSONL line.
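
As an illustration of the identity rule and the one-record-per-line format, the following sketch deduplicates on the (repo_id, sha) pair while writing JSONL. The function name and the in-memory seen set are assumptions, not part of the plan.

```python
import json
from typing import Dict, Iterable, Set, TextIO, Tuple


def write_commits_jsonl(
    records: Iterable[Dict],
    out: TextIO,
    seen: Set[Tuple[int, str]],
) -> int:
    """Write one JSON object per line, skipping repeated (repo_id, sha) keys."""
    written = 0
    for record in records:
        key = (record["repo_id"], record["sha"])
        if key in seen:
            continue  # same commit already stored for this repository
        seen.add(key)
        out.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
        written += 1
    return written
```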
Prototype with a default truncation cap of 1 page per repository:
- default per_page=100
- default max_pages_per_repo=1
If a repository hits the page cap:
- mark it success_with_warning
- set truncated=true
- record truncation_reason=max_pages
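
A sketch of the page cap, under stated assumptions: fetch_page is a hypothetical callable wrapping GET /repos/{owner}/{repo}/commits for one page, and a full page at the cap is treated as hitting the cap because the sketch cannot tell whether more commits exist without another request. An implementation could instead inspect the response's Link header for a rel="next" entry.

```python
from typing import Callable, Dict, List


def fetch_commit_pages(
    fetch_page: Callable[[int, int], List[Dict]],
    per_page: int = 100,
    max_pages_per_repo: int = 1,
) -> Dict:
    """Fetch commit pages until history ends or the page cap is reached."""
    commits: List[Dict] = []
    truncated = False
    page = 1
    while True:
        items = fetch_page(page, per_page)
        commits.extend(items)
        if len(items) < per_page:
            break  # short or empty page: history is exhausted
        if page >= max_pages_per_repo:
            truncated = True  # full page at the cap: assume more exists
            break
        page += 1
    return {
        "commits": commits,
        "status": "success_with_warning" if truncated else "complete",
        "truncated": truncated,
        "truncation_reason": "max_pages" if truncated else None,
    }
```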
If a repository becomes inaccessible or repeatedly fails:
- log the repo and failure reason
- mark it failed in run state
- continue processing remaining repositories
Support rerun policy via flag:
- default: refresh from scratch
- optional resume mode: skip repos already marked complete
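
The failure handling and rerun policy above can be sketched as a single loop. Everything named here is hypothetical: process_repos, the fetch_one callable (which returns a status string or raises), the max_attempts default, and the dict-shaped state keyed by repo_id as a string. Catching Exception broadly is a simplification; a real implementation would likely narrow it to HTTP and network errors.

```python
import logging
from typing import Callable, Dict, Iterable

logger = logging.getLogger(__name__)


def process_repos(
    repos: Iterable[Dict],
    fetch_one: Callable[[Dict], str],
    state: Dict[str, Dict],
    resume: bool = False,
    max_attempts: int = 3,
) -> Dict[str, Dict]:
    """Process each repo, isolating failures so the run continues."""
    if not resume:
        state.clear()  # default policy: refresh from scratch
    for repo in repos:
        key = str(repo["repo_id"])
        if resume and state.get(key, {}).get("status") == "complete":
            continue  # resume mode: skip repos already marked complete
        last_error = None
        for _ in range(max_attempts):
            try:
                state[key] = {"status": fetch_one(repo)}
                break
            except Exception as exc:  # sketch only; narrow this in practice
                last_error = exc
        else:
            # Every attempt failed: record the reason and move on.
            state[key] = {"status": "failed", "failure_reason": repr(last_error)}
            logger.warning("repo %s failed: %r", key, last_error)
    return state
```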

CLI and storage

Expose two commands:
- sample-repos
- fetch-commits
Keep v1 flags focused:
- sample-repos: --count defaulting to 10, output root, rerun/dedupe mode, optional --query
- fetch-commits: input repo dataset or run folder, output root, refresh-vs-resume mode, --max-pages-per-repo, retry settings
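
One possible argparse layout for the two commands is sketched below. The flags --count, --query, --run-id, --resume, and --max-pages-per-repo come from this plan; --output-root and --max-retries are hypothetical names for the "output root" and "retry settings" items, and their defaults are placeholders. Making --run-id required is also an assumption, since the plan allows either a repo dataset or a run folder as input.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sample = sub.add_parser("sample-repos", help="phase 1: collect repositories")
    sample.add_argument("--count", type=int, default=10)
    sample.add_argument("--query", default=None)
    sample.add_argument("--output-root", default="runs")  # hypothetical flag

    fetch = sub.add_parser("fetch-commits", help="phase 2: fetch commit history")
    fetch.add_argument("--run-id", required=True)  # assumption: run folder input only
    fetch.add_argument("--resume", action="store_true")
    fetch.add_argument("--max-pages-per-repo", type=int, default=1)
    fetch.add_argument("--max-retries", type=int, default=3)  # hypothetical flag
    fetch.add_argument("--output-root", default="runs")  # hypothetical flag
    return parser
```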
Organize outputs under a run_id folder with separate phase 1 and phase 2 subfolders.
Keep phase 1 and phase 2 datasets separate.
Write phase 2 output as chunked or partitioned JSONL files rather than one monolithic file.
Include runnable test commands in project docs/help output so phase 1 and phase 2 can be exercised from the terminal immediately after setup.
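
For the chunked phase 2 output, a simple approach is to roll over to a new file every fixed number of records. This is a sketch: the function name, the chunk_size default, and the commits-00000.jsonl naming pattern are all assumptions.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, List


def write_chunked_jsonl(
    records: Iterable[Dict],
    out_dir: Path,
    chunk_size: int = 1000,
    prefix: str = "commits",
) -> List[Path]:
    """Write records as JSONL, starting a new file every chunk_size records."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    try:
        for index, record in enumerate(records):
            if index % chunk_size == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"{prefix}-{len(paths):05d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Partitioning by repository instead of by record count would also satisfy the plan; the fixed-size variant is shown only because it is the shortest to write down.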

Public Interfaces / Data Shape

CLI examples to include for testing

Default phase 1 run:
- python main.py sample-repos
Phase 1 with explicit repo count:
- python main.py sample-repos --count 25
Phase 1 with runtime search override:
- python main.py sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
Default phase 2 run against a run folder:
- python main.py fetch-commits --run-id <run_id>
Phase 2 with resume and higher page cap:
- python main.py fetch-commits --run-id <run_id> --resume --max-pages-per-repo 3

Phase 1 repo record

Each repo record should include at minimum:

run_id
repo_id
full_name
html_url
api_url
default_branch
language
description
stargazers_count
size_kb
fork
archived
visibility
sample_query
sample_page
sampled_at
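
A sketch of mapping one item from the /search/repositories response onto the fields above. to_repo_record is a hypothetical helper; the source keys (id, full_name, html_url, url, size, and so on) are the names GitHub uses in its repository objects, with size renamed to size_kb to make the unit explicit.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_repo_record(
    item: Dict[str, Any],
    run_id: str,
    sample_query: str,
    sample_page: int,
) -> Dict[str, Any]:
    """Normalize one search result item into a phase 1 repo record."""
    return {
        "run_id": run_id,
        "repo_id": item["id"],
        "full_name": item["full_name"],
        "html_url": item["html_url"],
        "api_url": item["url"],
        "default_branch": item.get("default_branch"),
        "language": item.get("language"),
        "description": item.get("description"),
        "stargazers_count": item.get("stargazers_count"),
        "size_kb": item.get("size"),  # GitHub reports size in kilobytes
        "fork": item.get("fork"),
        "archived": item.get("archived"),
        "visibility": item.get("visibility"),
        "sample_query": sample_query,
        "sample_page": sample_page,
        "sampled_at": datetime.now(timezone.utc).isoformat(),
    }
```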

Phase 2 commit record

Each commit record should include at minimum:

run_id
repo_id
repo_full_name
repo_html_url
default_branch_at_fetch
sha
parent_shas
author_name
author_email
author_date
committer_name
committer_email
committer_date
message
html_url
fetched_at
source_endpoint
page_number
truncated
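
Similarly, one item from the commits list response could be flattened as below. to_commit_record is a hypothetical helper, and repo is assumed to be the refreshed phase 1 style record (repo_id, full_name, html_url, default_branch). The nested keys (commit.author, commit.committer, parents) follow the shape GitHub returns for commit list items.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def to_commit_record(
    item: Dict[str, Any],
    repo: Dict[str, Any],
    run_id: str,
    page_number: int,
    truncated: bool = False,
) -> Dict[str, Any]:
    """Normalize one commit list item into a phase 2 commit record."""
    commit = item.get("commit") or {}
    author = commit.get("author") or {}
    committer = commit.get("committer") or {}
    return {
        "run_id": run_id,
        "repo_id": repo["repo_id"],
        "repo_full_name": repo["full_name"],
        "repo_html_url": repo["html_url"],
        "default_branch_at_fetch": repo["default_branch"],
        "sha": item["sha"],
        "parent_shas": [p["sha"] for p in item.get("parents") or []],
        "author_name": author.get("name"),
        "author_email": author.get("email"),
        "author_date": author.get("date"),
        "committer_name": committer.get("name"),
        "committer_email": committer.get("email"),
        "committer_date": committer.get("date"),
        "message": commit.get("message"),
        "html_url": item.get("html_url"),
        "fetched_at": datetime.now(timezone.utc).isoformat(),
        "source_endpoint": f"/repos/{repo['full_name']}/commits",
        "page_number": page_number,
        "truncated": truncated,
    }
```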

Run state / config shape

Persist lightweight state to support resume and future extensibility:

resolved search query after merging defaults and runtime override
per-run repo status: pending, complete, failed, success_with_warning
failure reason if failed
seen repo index keyed by repo_id
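
One way this state could be laid out and persisted as JSON is sketched below. The file layout, key names, and helper names are assumptions. The seen index is stored as a sorted list because JSON has no set type, and the write goes through a temporary file so an interrupted run does not leave a half-written state file.

```python
import json
import os
from pathlib import Path
from typing import Dict, Iterable

VALID_STATUSES = ("pending", "complete", "failed", "success_with_warning")


def new_run_state(run_id: str, resolved_query: str, repo_ids: Iterable[int]) -> Dict:
    """Build an initial state with every repo marked pending."""
    ids = sorted(set(repo_ids))
    return {
        "run_id": run_id,
        "resolved_query": resolved_query,
        "repos": {str(i): {"status": "pending", "failure_reason": None} for i in ids},
        "seen_repo_ids": ids,
    }


def save_state(state: Dict, path: Path) -> None:
    """Write state to a temporary file, then swap it into place."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2, sort_keys=True), encoding="utf-8")
    os.replace(tmp, path)


def load_state(path: Path) -> Dict:
    return json.loads(path.read_text(encoding="utf-8"))
```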

Test Plan

Phase 1 uses default count=10 when no repo count is supplied.
Phase 1 accepts an explicit repo count and stops after collecting that many unique repos.
Phase 1 resolves the effective search query correctly from config defaults plus optional --query.
Phase 1 calls GitHub Search with the expected final filters.
Phase 1 deduplicates repositories by repo_id within a run and across reruns.
Phase 1 persists repo records and seen-index state correctly.
Phase 2 refreshes repo metadata before commit extraction.
Phase 2 fetches only default-branch history.
Phase 2 stops after 1 page by default and marks capped repos as success_with_warning.
Phase 2 writes one commit per JSONL line with truncation metadata when capped.
Phase 2 logs and skips inaccessible repositories without failing the whole run.
Resume mode skips repos already marked complete.
Refresh mode re-fetches repos from scratch.
The documented CLI example commands run successfully with defaults and with explicit overrides.

Assumptions and Defaults

Randomized sampling is out of scope for v1.
V1 phase 1 is a search-and-save workflow, not a statistically rigorous sampler.
Search defaults live outside the CLI in a config/constants source.
--query is supported at runtime for phase 1 and is treated as the user-supplied search query input.
Default phase 2 behavior is refresh from scratch; resume is opt-in.
Default phase 2 truncation is 1 page per repository for prototyping and storage control.
JSONL is the raw storage format for v1.
Future sampling improvements should be additive and should not require changing the repo dataset schema, commit dataset schema, or phase 2 contract.
Open question to settle before implementation: whether --query fully replaces the default search query or is appended to the default filters. Until that is decided, the implementation should replace the default query entirely and record the resolved final query in run metadata.
