# GitHub Commit Pipeline Plan

## Summary
Build a Python pipeline with two separate CLI commands:

- `sample-repos`: collect repositories from GitHub Search using config-backed defaults and optional runtime query overrides, then save repository metadata.
- `fetch-commits`: read the saved repositories and fetch commit history from each repository’s current default branch.

Randomized sampling stays out of scope for v1, but phase 1 should preserve enough metadata and modularity to support future sampling improvements without changing phase 2 behavior.

## Key Changes
### Phase 1 behavior
- Use `GET /search/repositories` as the primary discovery endpoint.
- Build the effective search query from:
  - default search constants stored outside the CLI
  - optional runtime `--query` override supplied by the user
- Start with default filters equivalent to:
  - `is:public`
  - `stars:>10`
  - `size:>1000` (GitHub measures repository size in kilobytes)
  - `archived:false`
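  - exclude forks by default (GitHub repository search already omits forks unless the query contains `fork:true` or `fork:only`)
- Keep repository selection simple and deterministic for v1: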
  - fetch search results with normal pagination
  - save accepted repositories in returned order
  - no randomization logic in v1
- Deduplicate repositories by GitHub `repo_id`.
- Persist:
  - raw repo dataset
  - seen-repo index keyed by `repo_id`
  - run metadata containing the final resolved search query and CLI inputs
- Default repo count for phase 1 is `n=10` when the user does not provide a count.
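A minimal sketch of the query resolution and deduplication steps above. `DEFAULT_QUALIFIERS`, `resolve_query`, and `accept_unique` are hypothetical names, and the sketch assumes that a non-empty `--query` replaces the defaults entirely, which is still an open decision in this plan.

```python
from typing import Iterable, List, Mapping, Optional, Set

# Hypothetical constants; the plan keeps these in a config/constants source
# outside the CLI. No fork qualifier is listed because GitHub repository
# search omits forks unless the query contains fork:true or fork:only.
DEFAULT_QUALIFIERS = ("is:public", "stars:>10", "size:>1000", "archived:false")


def resolve_query(override: Optional[str] = None) -> str:
    """Return the effective query, assuming an override replaces the defaults."""
    if override is not None and override.strip():
        return override.strip()
    return " ".join(DEFAULT_QUALIFIERS)


def accept_unique(items: Iterable[Mapping], seen_ids: Set[int],
                  count: int = 10) -> List[Mapping]:
    """Keep results in returned order, skipping repo ids already seen.

    `seen_ids` is updated in place so it can double as the seen-repo index.
    """
    accepted: List[Mapping] = []
    for item in items:
        if len(accepted) >= count:
            break
        repo_id = item["id"]  # `id` is the repository id in GitHub API responses
        if repo_id in seen_ids:
            continue
        seen_ids.add(repo_id)
        accepted.append(item)
    return accepted
```

The resolved string is what would be stored in run metadata, so a later reader can see exactly which query produced the dataset.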

### Phase 2 behavior
- Refresh repository metadata with `GET /repos/{owner}/{repo}` before commit extraction.
- Fetch commits only from the repository’s current default branch using `GET /repos/{owner}/{repo}/commits`.
- Treat storage identity for commit records as `repo_id + sha` so the same SHA in different repository contexts is not accidentally collapsed.
- Write one normalized commit record per JSONL line.
- Prototype with a default truncation cap of `1` page per repository:
  - default `per_page=100`
  - default `max_pages_per_repo=1`
- If a repository hits the page cap:
  - mark it `success_with_warning`
  - set `truncated=true`
  - record `truncation_reason=max_pages`
- If a repository becomes inaccessible or still fails after the configured retries are exhausted:
  - log the repo and failure reason
  - mark it failed in run state
  - continue processing remaining repositories
- Support rerun policy via flag:
  - default: refresh from scratch
  - optional resume mode: skip repos already marked complete
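The page cap and status rules above can be sketched as a small loop. The function name and the injected `fetch_page(page, per_page)` callable are assumptions made for illustration; in the real pipeline that callable would wrap `GET /repos/{owner}/{repo}/commits`.

```python
from typing import Any, Callable, Dict, List

# Hypothetical signature: (page, per_page) -> list of commit objects.
FetchPage = Callable[[int, int], List[Dict[str, Any]]]


def fetch_commits_capped(fetch_page: FetchPage, per_page: int = 100,
                         max_pages: int = 1) -> Dict[str, Any]:
    """Fetch up to `max_pages` pages (assumed >= 1) and report truncation."""
    commits: List[Dict[str, Any]] = []
    truncated = False
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        commits.extend({"page_number": page, "commit": c} for c in batch)
        if len(batch) < per_page:
            break  # a short page means the history is exhausted
        if page >= max_pages:
            # A full page at the cap is treated as truncated. This can be a
            # false positive when the history ends exactly on a page boundary.
            truncated = True
            break
        page += 1
    return {
        "commits": commits,
        "status": "success_with_warning" if truncated else "complete",
        "truncated": truncated,
        "truncation_reason": "max_pages" if truncated else None,
    }
```

Injecting the page fetcher keeps the cap logic testable without network access, which fits the test plan items on truncation.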

### CLI and storage
- Expose two commands:
  - `sample-repos`
  - `fetch-commits`
- Keep v1 flags focused:
  - `sample-repos`: `--count` defaulting to `10`, output root, rerun/dedupe mode, optional `--query`
  - `fetch-commits`: input repo dataset or run folder, output root, refresh-vs-resume mode, `--max-pages-per-repo`, retry settings
- Organize outputs under a `run_id` folder with separate phase 1 and phase 2 subfolders.
- Keep phase 1 and phase 2 datasets separate.
- Write phase 2 output as chunked or partitioned JSONL files rather than one monolithic file.
- Include runnable test commands in project docs/help output so phase 1 and phase 2 can be exercised from the terminal immediately after setup.
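One way to produce chunked JSONL output, as a sketch only. The function name, the `chunk_size` default, and the `commits-00000.jsonl` naming pattern are assumptions; the plan does not fix a chunking scheme.

```python
import json
from pathlib import Path
from typing import Any, Iterable, List, Mapping, Union


def write_jsonl_chunks(records: Iterable[Mapping[str, Any]],
                       out_dir: Union[str, Path],
                       chunk_size: int = 1000,
                       prefix: str = "commits") -> List[Path]:
    """Write one JSON object per line, starting a new file every `chunk_size` records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    try:
        for index, record in enumerate(records):
            if index % chunk_size == 0:
                if handle is not None:
                    handle.close()
                path = out_dir / f"{prefix}-{index // chunk_size:05d}.jsonl"
                handle = path.open("w", encoding="utf-8")
                paths.append(path)
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Passing a phase-specific subfolder of the `run_id` directory as `out_dir` would keep the phase 1 and phase 2 datasets separate.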

## Public Interfaces / Data Shape
### CLI examples to include for testing
- Default phase 1 run:
  - `python main.py sample-repos`
- Phase 1 with explicit repo count:
  - `python main.py sample-repos --count 25`
- Phase 1 with runtime search override:
  - `python main.py sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"`
- Default phase 2 run against a run folder:
  - `python main.py fetch-commits --run-id <run_id>`
- Phase 2 with resume and higher page cap:
  - `python main.py fetch-commits --run-id <run_id> --resume --max-pages-per-repo 3`
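The commands above could be wired up with `argparse` roughly as follows. This is a sketch covering only the flags shown in the examples; `build_parser` is a hypothetical name, and making `--run-id` required is an assumption, since the plan also allows pointing at a repo dataset directly.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a two-command CLI matching the example invocations."""
    parser = argparse.ArgumentParser(prog="main.py")
    sub = parser.add_subparsers(dest="command", required=True)

    sample = sub.add_parser("sample-repos", help="collect repositories from GitHub Search")
    sample.add_argument("--count", type=int, default=10,
                        help="number of unique repositories to collect")
    sample.add_argument("--query", default=None,
                        help="runtime search query override")

    fetch = sub.add_parser("fetch-commits", help="fetch default-branch commits")
    fetch.add_argument("--run-id", required=True,  # assumption: required in this sketch
                       help="run folder produced by sample-repos")
    fetch.add_argument("--resume", action="store_true",
                       help="skip repos already marked complete")
    fetch.add_argument("--max-pages-per-repo", type=int, default=1,
                       help="page cap per repository")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```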

### Phase 1 repo record
Each repo record should include at minimum:
- `run_id`
- `repo_id`
- `full_name`
- `html_url`
- `api_url`
- `default_branch`
- `language`
- `description`
- `stargazers_count`
- `size_kb`
- `fork`
- `archived`
- `visibility`
- `sample_query`
- `sample_page`
- `sampled_at`
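As an illustration, a repo record with these fields could be derived from one item of the search response like this. `to_repo_record` is a hypothetical helper; the mapping of `api_url` to the API's `url` field and `size_kb` to its `size` field is an assumption about how the plan's names line up with GitHub's.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Mapping, Optional


def to_repo_record(item: Mapping[str, Any], run_id: str, sample_query: str,
                   sample_page: int, sampled_at: Optional[str] = None) -> Dict[str, Any]:
    """Flatten one search result item into the phase 1 record shape."""
    if sampled_at is None:
        sampled_at = datetime.now(timezone.utc).isoformat()
    return {
        "run_id": run_id,
        "repo_id": item["id"],
        "full_name": item["full_name"],
        "html_url": item.get("html_url"),
        "api_url": item.get("url"),    # assumed mapping
        "default_branch": item.get("default_branch"),
        "language": item.get("language"),
        "description": item.get("description"),
        "stargazers_count": item.get("stargazers_count"),
        "size_kb": item.get("size"),   # assumed mapping; GitHub reports size in KB
        "fork": item.get("fork"),
        "archived": item.get("archived"),
        "visibility": item.get("visibility"),
        "sample_query": sample_query,
        "sample_page": sample_page,
        "sampled_at": sampled_at,
    }
```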

### Phase 2 commit record
Each commit record should include at minimum:
- `run_id`
- `repo_id`
- `repo_full_name`
- `repo_html_url`
- `default_branch_at_fetch`
- `sha`
- `parent_shas`
- `author_name`
- `author_email`
- `author_date`
- `committer_name`
- `committer_email`
- `committer_date`
- `message`
- `html_url`
- `fetched_at`
- `source_endpoint`
- `page_number`
- `truncated`
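A sketch of normalizing one item from the commits listing into this shape, plus the `repo_id + sha` storage identity. `to_commit_record` and `storage_key` are hypothetical helpers, and the `:` separator in the key is an assumption. The `repo` argument is taken to be the refreshed phase 1 record.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Mapping, Optional


def to_commit_record(item: Mapping[str, Any], repo: Mapping[str, Any], run_id: str,
                     page_number: int, truncated: bool,
                     fetched_at: Optional[str] = None) -> Dict[str, Any]:
    """Flatten one commit list item into the phase 2 record shape."""
    commit = item.get("commit") or {}
    author = commit.get("author") or {}        # git author block; may be missing
    committer = commit.get("committer") or {}  # git committer block; may be missing
    return {
        "run_id": run_id,
        "repo_id": repo["repo_id"],
        "repo_full_name": repo["full_name"],
        "repo_html_url": repo.get("html_url"),
        "default_branch_at_fetch": repo.get("default_branch"),
        "sha": item["sha"],
        "parent_shas": [p["sha"] for p in item.get("parents") or []],
        "author_name": author.get("name"),
        "author_email": author.get("email"),
        "author_date": author.get("date"),
        "committer_name": committer.get("name"),
        "committer_email": committer.get("email"),
        "committer_date": committer.get("date"),
        "message": commit.get("message"),
        "html_url": item.get("html_url"),
        "fetched_at": fetched_at or datetime.now(timezone.utc).isoformat(),
        "source_endpoint": f"/repos/{repo['full_name']}/commits",
        "page_number": page_number,
        "truncated": truncated,
    }


def storage_key(record: Mapping[str, Any]) -> str:
    """Identity for a stored commit: repo id plus SHA (separator is an assumption)."""
    return f"{record['repo_id']}:{record['sha']}"
```

Keying on both values means the same SHA seen in a fork and in its parent yields two distinct stored records.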

### Run state / config shape
Persist lightweight state to support resume and future extensibility:
- resolved search query after merging defaults and runtime override
- per-run repo status: `pending`, `complete`, `failed`, `success_with_warning`
- failure reason if failed
- seen repo index keyed by `repo_id`
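The state above is enough to drive resume mode. The sketch below uses hypothetical helpers (`mark`, `repos_to_process`) and plain dictionaries keyed by `repo_id`. It follows the plan literally, so only `complete` repos are skipped and `success_with_warning` repos are fetched again on resume.

```python
from typing import Dict, Iterable, List, Optional

VALID_STATUSES = {"pending", "complete", "failed", "success_with_warning"}


def mark(statuses: Dict[int, str], failures: Dict[int, str], repo_id: int,
         status: str, reason: Optional[str] = None) -> None:
    """Record a repo's status; keep a failure reason only for failed repos."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status}")
    statuses[repo_id] = status
    if status == "failed":
        failures[repo_id] = reason or "unknown"
    else:
        failures.pop(repo_id, None)


def repos_to_process(repo_ids: Iterable[int], statuses: Dict[int, str],
                     resume: bool = False) -> List[int]:
    """Refresh mode returns every repo; resume mode skips completed ones."""
    if not resume:
        return list(repo_ids)
    return [rid for rid in repo_ids if statuses.get(rid, "pending") != "complete"]
```

If this state is persisted as JSON, note that integer `repo_id` keys are serialized as strings and need converting back on load.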

## Test Plan
- Phase 1 uses default `count=10` when no repo count is supplied.
- Phase 1 accepts an explicit repo count and stops after collecting that many unique repos.
- Phase 1 resolves the effective search query correctly from config defaults plus optional `--query`.
- Phase 1 calls GitHub Search with the expected final filters.
- Phase 1 deduplicates repositories by `repo_id` within a run and across reruns.
- Phase 1 persists repo records and seen-index state correctly.
- Phase 2 refreshes repo metadata before commit extraction.
- Phase 2 fetches only default-branch history.
- Phase 2 stops after 1 page by default and marks capped repos as `success_with_warning`.
- Phase 2 writes one commit per JSONL line with truncation metadata when capped.
- Phase 2 logs inaccessible repositories, marks them failed in run state, and continues without failing the whole run.
- Resume mode skips repos already marked complete.
- Refresh mode re-fetches repos from scratch.
- The documented CLI example commands run successfully with defaults and with explicit overrides.

## Assumptions and Defaults
- Randomized sampling is out of scope for v1.
- V1 phase 1 is a search-and-save workflow, not a statistically rigorous sampler.
- Search defaults live outside the CLI in a config/constants source.
- `--query` is supported at runtime for phase 1 and carries the user-supplied search query.
- Default phase 2 behavior is refresh from scratch; resume is opt-in.
- Default phase 2 truncation is `1` page per repository for prototyping and storage control.
- JSONL is the raw storage format for v1.
- Future sampling improvements should be additive and should not require changing the repo dataset schema, commit dataset schema, or phase 2 contract.
- Open question to settle before implementation: should `--query` fully replace the default search query, or be appended to the default filters? Until this is decided, the implementation should replace the default query entirely and record the resolved final query in run metadata.