Add phase 2 - download default branch commits for repositories collected in phase 1

2026-04-23 19:16:50 -07:00
parent 96906ee00f
commit 98696ddb29
18 changed files with 991 additions and 381 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,121 @@
+# GitHub Data Pipeline
+
+This project builds a two-phase GitHub data pipeline:
+
+- Phase 1 samples repositories from GitHub Search and stores repository metadata.
+- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.
+
+The project uses Poetry for dependency management and a `src` package layout for Python modules.
+
+## Project Layout
+
+- `src/github_datapipe/core/`
+  Shared config, GitHub API, runtime, and file IO helpers.
+- `src/github_datapipe/phases/phase1_repository_sampling/`
+  Phase 1 repository discovery and persistence.
+- `src/github_datapipe/phases/phase2_commit_ingestion/`
+  Phase 2 commit fetching and persistence.
+- `tests/`
+  Automated tests for the pipeline behavior.
+
+## Prerequisites
+
+- Python 3.12 installed locally
+- Poetry installed locally
+- GitHub personal access token stored in `.env` as:
+
+```env
+github_token=YOUR_TOKEN_HERE
+```
+
+## Install Dependencies
+
+From the project root, install the local environment with:
+
+```powershell
+poetry install
+```
+
+## Run Phase 1
+
+The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.
+
+```powershell
+poetry run github-datapipe sample-repos
+```
+
+Useful overrides:
+
+```powershell
+poetry run github-datapipe sample-repos --count 25
+poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
+poetry run github-datapipe sample-repos --mode fresh
+```
+
+### Phase 1 Outputs
+
+After phase 1 completes, check:
+
+- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
+- `runs/<run_id>/phase1_repository_sampling/manifest.json`
+- `runs/seen_repo_ids.json`
+
+The command prints the generated `run_id`, which you will use for phase 2.
+
+## Run Phase 2
+
+Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.
+
+Run against a phase 1 run:
+
+```powershell
+poetry run github-datapipe fetch-commits --run-id <run_id>
+```
+
+Useful overrides:
+
+```powershell
+poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
+poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
+poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
+```
+
+### Phase 2 Defaults
+
+- `mode=refresh`
+- `per_page=100`
+- `max_pages_per_repo=1`
+
+The default `max_pages_per_repo=1` keeps the prototype small: combined with `per_page=100`, phase 2 downloads at most one page (up to 100 commits) for each repository.
+
+### Phase 2 Outputs
+
+After phase 2 completes, check:
+
+- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
+- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
+- `runs/<run_id>/phase2_commit_ingestion/manifest.json`
+
+## Run Tests
+
+Run the automated tests with:
+
+```powershell
+poetry run pytest -q
+```
+
+## Alternate CLI Invocation
+
+If the Poetry script entrypoint is not available yet, run the CLI module directly:
+
+```powershell
+poetry run python -m github_datapipe.cli sample-repos
+poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
+```
+
+## Notes
+
+- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
+- Phase 2 stores one normalized commit per JSONL line.
+- Resume mode in phase 2 skips repositories already marked `complete`.
+- Repositories whose commit history is cut off by the page cap (`max_pages_per_repo`) are marked as `success_with_warning`.
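
The resume behavior described in the notes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the `repo_id` and `status` field names and both helper functions are hypothetical, since the README only says that resume mode skips repositories already marked `complete` and that statuses live in `repo_status.jsonl`.

```python
import json
from pathlib import Path


def load_complete_repo_ids(status_path: Path) -> set:
    """Collect IDs of repositories whose status record says `complete`.

    Assumes one JSON object per line with hypothetical `repo_id` and
    `status` keys; the real schema may differ.
    """
    complete = set()
    if not status_path.exists():
        # No status file yet means nothing has been processed.
        return complete
    with status_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the JSONL file
            record = json.loads(line)
            if record.get("status") == "complete":
                complete.add(record["repo_id"])
    return complete


def pending_repo_ids(all_repo_ids, status_path: Path) -> list:
    """Return the repository IDs that a resume run would still process."""
    complete = load_complete_repo_ids(status_path)
    return [repo_id for repo_id in all_repo_ids if repo_id not in complete]
```

In this sketch a repository marked `success_with_warning` is not skipped, because only `complete` is treated as done; whether the actual pipeline revisits truncated repositories on resume is not stated in the README.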