# GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.

## Project Layout

- `src/github_datapipe/core/`
  Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`
  Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`
  Phase 2 commit fetching and persistence.
- `tests/`
  Automated tests for the pipeline behavior.

## Prerequisites

- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:

```env
github_token=YOUR_TOKEN_HERE
```
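To check that the token is readable before running the pipeline, a minimal sketch like the one below can parse the `.env` file directly. It is not the project's own loader: the helper name `read_env_value` is hypothetical, and it assumes the simple `KEY=VALUE` format shown above.

```python
from pathlib import Path
from typing import Optional


def read_env_value(env_path: Path, key: str) -> Optional[str]:
    """Return the value stored under `key` in a simple KEY=VALUE file, or None."""
    if not env_path.exists():
        return None
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, separator, value = line.partition("=")
        if separator and name.strip() == key:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip('"').strip("'")
    return None
```

Calling `read_env_value(Path(".env"), "github_token")` returns the token string, or `None` if the file or key is missing.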

## Install Dependencies

From the project root, install the local environment with:

```powershell
poetry install
```

## Run Phase 1

The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.

```powershell
poetry run github-datapipe sample-repos
```

Useful overrides:

```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```

### Phase 1 Outputs

After phase 1 completes, check:

- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json`

The command prints the generated `run_id`; pass it to phase 2 with `--run-id`.

## Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id>
```

Useful overrides:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```

### Phase 2 Defaults

- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

With these defaults, each repository contributes at most 100 commits (one page of `per_page=100`), which keeps the prototype small.
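The effect of the page cap can be sketched as follows. This is an illustration rather than the pipeline's actual code: `fetch_capped` and `fake_fetch` are hypothetical names, and the rule used here (a repository counts as truncated when every allowed page came back full) is an assumption about how the cap is detected.

```python
from typing import Callable, List, Tuple


def fetch_capped(
    fetch_page: Callable[[int, int], List[dict]],
    per_page: int = 100,
    max_pages: int = 1,
) -> Tuple[List[dict], bool]:
    """Collect at most per_page * max_pages items and report whether the cap was hit."""
    items: List[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        items.extend(batch)
        # A short or empty page means the listing is exhausted.
        if len(batch) < per_page:
            return items, False
    # Every allowed page was full, so more items may exist beyond the cap.
    return items, True


def fake_fetch(page: int, per_page: int) -> List[dict]:
    """Hypothetical stand-in for an API call: a repository with 250 commits."""
    start = (page - 1) * per_page
    return [{"n": i} for i in range(start, min(start + per_page, 250))]


commits, truncated = fetch_capped(fake_fetch, per_page=100, max_pages=1)
print(len(commits), truncated)  # 100 True
```

Raising `max_pages` to 3 in this sketch returns all 250 items with `truncated` set to `False`, because the third page comes back short.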

### Phase 2 Outputs

After phase 2 completes, check:

- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`
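To inspect the commit output, a small reader along these lines may help. It is a sketch under two assumptions: any additional shard files follow the same `commits-NNNN.jsonl` naming as `commits-0001.jsonl`, and each non-empty line is one JSON object. The function name `iter_commit_records` is hypothetical, and no commit field names are assumed.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_commit_records(run_dir: Path) -> Iterator[dict]:
    """Yield one dict per JSONL line across all commit shards of a run."""
    commits_dir = run_dir / "phase2_commit_ingestion" / "commits"
    # Sorting keeps shards in numeric order thanks to the zero-padded suffix.
    for shard in sorted(commits_dir.glob("commits-*.jsonl")):
        with shard.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # ignore blank lines
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_commit_records(Path("runs") / run_id))` counts the stored commits for a run, where `run_id` is the value printed by phase 1.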

## Run Tests

Run the automated tests with:

```powershell
poetry run pytest -q
```

## Alternate CLI Invocation

If the Poetry script entry point is not available yet, run the CLI module directly:

```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```

## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 (`--mode resume`) skips repositories already marked `complete`.
- Repositories whose commit history is cut off by the page cap (`max_pages_per_repo`) are marked `success_with_warning`.
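
The resume behavior described above could look roughly like this. It is a sketch, not the project's implementation: the function names are hypothetical, and the record fields `repo_id`, `status`, and `id` are assumed for illustration. Only the status values `complete` and `success_with_warning` come from this README.

```python
import json
from pathlib import Path
from typing import Dict, List, Set


def completed_repo_ids(status_path: Path) -> Set[int]:
    """Return ids of repositories whose most recent status record is 'complete'."""
    latest: Dict[int, str] = {}
    if status_path.exists():
        with status_path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    record = json.loads(line)
                    # Assumed field names; later records override earlier ones.
                    latest[record["repo_id"]] = record["status"]
    return {repo_id for repo_id, status in latest.items() if status == "complete"}


def pending_repos(repos: List[dict], done: Set[int]) -> List[dict]:
    """Keep only repositories that resume mode would still need to fetch."""
    return [repo for repo in repos if repo["id"] not in done]
```

Under this reading, a repository recorded as `success_with_warning` is not skipped and would be fetched again on resume.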