# GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.

## Project Layout

- `src/github_datapipe/core/`: Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`: Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`: Phase 2 commit fetching and persistence.
- `tests/`: Automated tests for the pipeline behavior.

## Prerequisites

- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:

```env
github_token=YOUR_TOKEN_HERE
```

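To illustrate the `.env` format above, the token line can be read with a few lines of standard-library Python. This is a minimal sketch, not the project's actual loader (this README does not describe it). The helper name `read_env_value` is hypothetical, and the parser handles only simple `KEY=VALUE` lines.

```python
from __future__ import annotations

from pathlib import Path


def read_env_value(env_path: str, key: str) -> str | None:
    """Return the value for `key` from a simple KEY=VALUE file, or None.

    Hypothetical helper for illustration; the pipeline may load `.env` differently.
    """
    path = Path(env_path)
    if not path.exists():
        return None
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes around the value.
            return value.strip().strip("\"'")
    return None
```

Calling `read_env_value(".env", "github_token")` from the project root returns the token string, or `None` if the file or the key is missing, which makes for a quick check before running either phase.
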
## Install Dependencies

From the project root, install the local environment with:

```powershell
poetry install
```

## CLI Help

To see the available commands and their descriptions, run:

```powershell
poetry run github-datapipe --help
```

To see the available arguments and options for a specific command, use the `--help` flag with that command:

```powershell
poetry run github-datapipe sample-repos --help
poetry run github-datapipe fetch-commits --help
```

## Run Phase 1

The default Phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.

```powershell
poetry run github-datapipe sample-repos
```

Useful overrides:

```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```

### Phase 1 Outputs

After Phase 1 completes, check:

- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`

The command prints the generated `run_id`, which you pass to Phase 2 with `--run-id`.

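As a quick sanity check on the Phase 1 output, `repos.jsonl` can be loaded and counted. This is a minimal sketch that assumes only what the `.jsonl` extension implies, namely one JSON object per line. It makes no assumption about field names, and `load_jsonl` is a hypothetical helper rather than part of the package.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_jsonl(path: str) -> list[dict]:
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records


# Example (replace the placeholder with a real run ID):
# repos = load_jsonl("runs/<run_id>/phase1_repository_sampling/repos.jsonl")
# print(len(repos), "repositories saved")
```

With the default command, the count should match the 10 repositories Phase 1 collects, or the value passed to `--count`.
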
## Run Phase 2

Phase 2 consumes the saved Phase 1 repositories and downloads commit history from each repository's current default branch.

Run it against a Phase 1 run, replacing `<run_id>` with the ID printed by Phase 1:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id>
```

Useful overrides:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```

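The `--mode resume` override skips repositories that are already marked `complete` (see Notes). One way such a filter could work is sketched below. This is an illustration under stated assumptions, not the pipeline's implementation: the record fields `full_name` and `status` are hypothetical, and the sketch assumes that the last status line for a repository wins.

```python
from __future__ import annotations

import json
from typing import Iterable


def completed_repos(
    status_lines: Iterable[str],
    name_key: str = "full_name",  # hypothetical field name
    status_key: str = "status",   # hypothetical field name
) -> set[str]:
    """Return repositories whose most recent status record is 'complete'."""
    latest: dict[str, str] = {}
    for line in status_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumption: later lines override earlier ones for the same repository.
        latest[record[name_key]] = record[status_key]
    return {name for name, status in latest.items() if status == "complete"}


def repos_to_fetch(all_repos: Iterable[str], status_lines: Iterable[str]) -> list[str]:
    """Keep only the repositories that a resumed run still needs to process."""
    done = completed_repos(status_lines)
    return [name for name in all_repos if name not in done]
```

Under these assumptions, a repository recorded as `success_with_warning` would be fetched again on resume, because only `complete` counts as finished.
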
### Phase 2 Defaults

- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

The default `max_pages_per_repo=1` keeps the prototype small: only the first page of commits is downloaded for each repository, which is at most 100 commits with the default `per_page=100`.

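The two paging settings multiply into an upper bound on the commits downloaded per repository. The sketch below shows only that arithmetic. The function name is hypothetical, and it assumes both values are positive integers.

```python
def max_commits_per_repo(per_page: int = 100, max_pages_per_repo: int = 1) -> int:
    """Upper bound on commits downloaded for a single repository."""
    # Each page holds at most `per_page` commits, and at most
    # `max_pages_per_repo` pages are requested.
    return per_page * max_pages_per_repo


print(max_commits_per_repo())       # documented defaults: 100
print(max_commits_per_repo(50, 3))  # --per-page 50 --max-pages-per-repo 3: 150
```

A repository with more commits than this bound has its history truncated at the page cap.
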
### Phase 2 Outputs

After Phase 2 completes, check:

- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`

## Run Tests

Run the automated tests with:

```powershell
poetry run pytest -q
```

## Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```

## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in Phase 2 skips repositories already marked `complete`.
- Repositories whose commit history is truncated by the page cap are marked `success_with_warning`.