Add phase 2 - download default branch commits for repositories collected in phase 1
121
README.md
Normal file
@@ -0,0 +1,121 @@
# GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.

## Project Layout

- `src/github_datapipe/core/`
  Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`
  Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`
  Phase 2 commit fetching and persistence.
- `tests/`
  Automated tests for the pipeline behavior.

## Prerequisites

- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:

```env
github_token=YOUR_TOKEN_HERE
```

## Install Dependencies

From the project root, install the local environment with:

```powershell
poetry install
```

## Run Phase 1

The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.

```powershell
poetry run github-datapipe sample-repos
```

Useful overrides:

```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```

### Phase 1 Outputs

After phase 1 completes, check:

- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`

The command prints the generated `run_id`, which you will use for phase 2.

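To inspect the phase 1 output by hand, a small reader along these lines may help. It is a sketch that assumes `repos.jsonl` holds one JSON object per line, as the `.jsonl` extension suggests; the helper names `iter_jsonl` and `phase1_repos_path` are hypothetical, and the fields inside each record are not specified here.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a JSONL file."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def phase1_repos_path(runs_dir: Path, run_id: str) -> Path:
    """Build the repos.jsonl path listed under Phase 1 Outputs."""
    return runs_dir / run_id / "phase1_repository_sampling" / "repos.jsonl"
```

Passing `phase1_repos_path(Path("runs"), run_id)` to `iter_jsonl` walks the saved repositories for the `run_id` printed by the command.
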
## Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id>
```

Useful overrides:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```

### Phase 2 Defaults

- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

The default `max_pages_per_repo=1` keeps the prototype small and limits commit downloads to the first page for each repository.

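The interaction between `per_page` and `max_pages_per_repo` can be sketched as a page-capped loop. This is an illustration under assumptions, not the project's implementation: `fetch_capped` and the `fetch_page` callback are hypothetical names, and the real code may detect the end of history differently (for example through response headers).

```python
from typing import Callable

# Hypothetical callback: (page_number, per_page) -> list of commit dicts.
FetchPage = Callable[[int, int], list[dict]]


def fetch_capped(
    fetch_page: FetchPage,
    per_page: int = 100,
    max_pages_per_repo: int = 1,
) -> tuple[list[dict], bool]:
    """Collect commits page by page and return (commits, truncated)."""
    commits: list[dict] = []
    for page in range(1, max_pages_per_repo + 1):
        batch = fetch_page(page, per_page)
        commits.extend(batch)
        # A short page means the history is exhausted.
        if len(batch) < per_page:
            return commits, False
    # The cap was reached while pages were still full, so more commits may exist.
    return commits, True
```

With the defaults, at most 100 commits are collected per repository; with `per_page=50` and `max_pages_per_repo=3`, the cap is 150.
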
### Phase 2 Outputs

After phase 2 completes, check:

- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`

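A quick sanity check on the commit output is to count the stored records. The sketch below assumes that the numbered file name `commits-0001.jsonl` implies further files matching `commits-*.jsonl` may appear in the same directory; that pattern and the helper name `count_commit_records` are assumptions, not documented behavior.

```python
from pathlib import Path


def count_commit_records(phase2_dir: Path) -> int:
    """Count non-empty lines across all commits-*.jsonl files.

    Assumes `phase2_dir` is runs/<run_id>/phase2_commit_ingestion.
    """
    total = 0
    for shard in sorted((phase2_dir / "commits").glob("commits-*.jsonl")):
        with shard.open(encoding="utf-8") as handle:
            total += sum(1 for line in handle if line.strip())
    return total
```

Comparing this count with the totals recorded in `manifest.json`, if it reports any, is one way to spot an incomplete run.
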
## Run Tests

Run the automated tests with:

```powershell
poetry run pytest -q
```

## Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```

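For orientation, the command surface shown in this README could be modeled with `argparse` roughly as follows. This is only a sketch: the real `github_datapipe.cli` module may use a different library, and the choices to make `--run-id` required and to leave the phase 1 `--query` and `--mode` defaults unset are assumptions. The numeric defaults mirror the values stated in this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the commands and flags in this README."""
    parser = argparse.ArgumentParser(prog="github-datapipe")
    sub = parser.add_subparsers(dest="command", required=True)

    # Phase 1: repository sampling.
    sample = sub.add_parser("sample-repos")
    sample.add_argument("--count", type=int, default=10)
    sample.add_argument("--query", default=None)  # None: fall back to config
    sample.add_argument("--mode", default=None)   # default not documented here

    # Phase 2: commit ingestion for a phase 1 run.
    fetch = sub.add_parser("fetch-commits")
    fetch.add_argument("--run-id", required=True)
    fetch.add_argument("--mode", default="refresh")
    fetch.add_argument("--per-page", type=int, default=100)
    fetch.add_argument("--max-pages-per-repo", type=int, default=1)
    return parser
```

`build_parser().parse_args(["fetch-commits", "--run-id", "<run_id>"])` yields a namespace whose `per_page` and `max_pages_per_repo` attributes carry the phase 2 defaults.
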
## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 skips repositories already marked `complete`.
- Truncated repositories are marked as `success_with_warning` when the page cap is reached.

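The resume behavior in the notes above can be pictured as a filter over the phase 1 repositories. In this sketch the function name `repos_to_process` and the record fields `id`, `repo_id`, and `status` are hypothetical, and treating every mode other than `resume` as "process everything" is an assumption.

```python
def repos_to_process(
    repos: list[dict],
    statuses: list[dict],
    mode: str = "refresh",
) -> list[dict]:
    """Select repositories for a phase 2 run.

    In resume mode, repositories whose status record says `complete` are
    skipped. Field names are illustrative, not the project's schema.
    """
    if mode != "resume":
        # Assumption: other modes reprocess every repository.
        return list(repos)
    done = {s["repo_id"] for s in statuses if s.get("status") == "complete"}
    return [repo for repo in repos if repo["id"] not in done]
```

Under this reading, a repository recorded as `success_with_warning` is not skipped on resume, because only `complete` counts as done.
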