# GitHub Data Pipeline
This project builds a two-phase GitHub data pipeline:
- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.
## Project Layout
- `src/github_datapipe/core/`
  Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`
  Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`
  Phase 2 commit fetching and persistence.
- `tests/`
  Automated tests for the pipeline behavior.
## Prerequisites
- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:
```env
github_token=YOUR_TOKEN_HERE
```
## Install Dependencies
From the project root, install the local environment with:
```powershell
poetry install
```
## Run Phase 1
The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.
```powershell
poetry run github-datapipe sample-repos
```
Useful overrides:
```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```
### Phase 1 Outputs
After phase 1 completes, check:
- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`

The command prints the generated `run_id`, which you will use for phase 2.
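To inspect the phase 1 output, the JSONL file can be read one record per line. This is a minimal sketch: the `load_jsonl` helper is hypothetical, and it makes no assumption about which fields each repository record contains.

```python
import json
from pathlib import Path


def load_jsonl(path: Path) -> list[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```

For example, `len(load_jsonl(Path("runs") / run_id / "phase1_repository_sampling" / "repos.jsonl"))` should match the number of repositories collected, where `run_id` is the value printed by the command.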
## Run Phase 2
Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.
Run against a phase 1 run:
```powershell
poetry run github-datapipe fetch-commits --run-id <run_id>
```
Useful overrides:
```powershell
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```
### Phase 2 Defaults
- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

The default `max_pages_per_repo=1` keeps the prototype small and limits commit downloads to the first page for each repository, which is at most 100 commits with the default `per_page=100`.
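The interaction between `per_page` and `max_pages_per_repo` can be sketched as a capped pagination loop. This is an illustration, not the project's implementation: `fetch_page` is a hypothetical stand-in for the GitHub API call, and treating "cap reached with a full last page" as truncated is an assumption.

```python
from collections.abc import Callable


def fetch_capped(
    fetch_page: Callable[[int, int], list[dict]],
    per_page: int = 100,
    max_pages: int = 1,
) -> tuple[list[dict], bool]:
    """Collect up to `max_pages` pages; return (commits, truncated)."""
    commits: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        commits.extend(batch)
        if len(batch) < per_page:
            # A short page means the history is exhausted.
            return commits, False
    # Cap reached while pages were still full: more commits may exist
    # (assumed truncation rule, not confirmed by the project).
    return commits, True
```

With the defaults, a repository with more than 100 commits yields exactly 100 commits and `truncated=True`; raising `max_pages` to 3 allows up to 300.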
### Phase 2 Outputs
After phase 2 completes, check:
- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`
## Run Tests
Run the automated tests with:
```powershell
poetry run pytest -q
```
## Alternate CLI Invocation
If the Poetry script entrypoint is not available yet, run the CLI module directly:
```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```
## Notes
- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 skips repositories already marked `complete`.
- Truncated repositories are marked as `success_with_warning` when the page cap is reached.
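
The resume behavior noted above can be sketched as a filter over status records such as those in `repo_status.jsonl`. The field names `repo_id` and `status` are assumptions made for illustration; the actual record schema may differ.

```python
def repos_to_process(repo_ids: list[int], status_records: list[dict]) -> list[int]:
    """Return the repo ids that resume mode would still need to fetch.

    Assumes each status record has hypothetical `repo_id` and `status` keys.
    """
    # Repositories already marked `complete` are skipped.
    done = {
        record["repo_id"]
        for record in status_records
        if record.get("status") == "complete"
    }
    return [repo_id for repo_id in repo_ids if repo_id not in done]
```

Under this sketch, a repository recorded with any status other than `complete` is fetched again on resume.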