# GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.

## Project Layout

- `src/github_datapipe/core/`
  Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`
  Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`
  Phase 2 commit fetching and persistence.
- `tests/`
  Automated tests for the pipeline behavior.

## Prerequisites

- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:

```env
github_token=YOUR_TOKEN_HERE
```

## Install Dependencies

From the project root, install the local environment with:

```powershell
poetry install
```

## CLI Help

To see the available commands and their descriptions, run:

```powershell
poetry run github-datapipe --help
```

To see the available arguments and options for a specific command, use the `--help` flag with that command:

```powershell
poetry run github-datapipe sample-repos --help
poetry run github-datapipe fetch-commits --help
```

## Run Phase 1

The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.

```powershell
poetry run github-datapipe sample-repos
```

Useful overrides:

```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```

### Phase 1 Outputs

After phase 1 completes, check:

- `runs//phase1_repository_sampling/repos.jsonl`
- `runs//phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`

The command prints the generated `run_id`, which you will use for phase 2.

## Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

```powershell
poetry run github-datapipe fetch-commits --run-id
```

Useful overrides:

```powershell
poetry run github-datapipe fetch-commits --run-id --mode resume
poetry run github-datapipe fetch-commits --run-id --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id --per-page 50
```

### Phase 2 Defaults

- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

The default `max_pages_per_repo=1` keeps the prototype small and limits commit downloads to the first page for each repository.

### Phase 2 Outputs

After phase 2 completes, check:

- `runs//phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs//phase2_commit_ingestion/repo_status.jsonl`
- `runs//phase2_commit_ingestion/manifest.json`

## Run Tests

Run the automated tests with:

```powershell
poetry run pytest -q
```

## Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id
```

## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 skips repositories already marked `complete`.
- Truncated repositories are marked as `success_with_warning` when the page cap is reached.
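
The resume behavior noted above can be illustrated with a small sketch. This is not the project's actual implementation: the helper names (`read_jsonl`, `completed_repo_ids`, `pending_repos`) and the record fields (`repo_id`, `status`, `id`) are hypothetical, chosen only to show how a JSONL status file could drive the skip decision.

```python
import json
from pathlib import Path


def read_jsonl(path: Path) -> list[dict]:
    """Parse a JSONL file: one JSON object per non-empty line.

    Returns an empty list when the file does not exist yet.
    """
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def completed_repo_ids(status_records: list[dict]) -> set[int]:
    """Collect ids of repositories whose status is exactly `complete`.

    The `repo_id` and `status` field names are assumptions for illustration.
    """
    return {
        record["repo_id"]
        for record in status_records
        if record.get("status") == "complete"
    }


def pending_repos(repos: list[dict], done: set[int]) -> list[dict]:
    """Keep only repositories that are not already marked complete."""
    return [repo for repo in repos if repo["id"] not in done]


if __name__ == "__main__":
    # In-memory stand-ins for status records and saved repositories.
    statuses = [
        {"repo_id": 1, "status": "complete"},
        {"repo_id": 2, "status": "success_with_warning"},
    ]
    repos = [{"id": 1}, {"id": 2}, {"id": 3}]
    for repo in pending_repos(repos, completed_repo_ids(statuses)):
        print("would fetch commits for repo", repo["id"])
```

In this sketch only records marked `complete` are skipped, so any other status is treated as still pending. Whether the real pipeline handles `success_with_warning` the same way is not stated here.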