Add phase 2 - download default branch commits for repositories collected in phase 1
121
README.md
Normal file
@@ -0,0 +1,121 @@
# GitHub Data Pipeline

This project builds a two-phase GitHub data pipeline:

- Phase 1 samples repositories from GitHub Search and stores repository metadata.
- Phase 2 reads the saved repositories, refreshes repository metadata, and downloads commit history from the current default branch.

The project uses Poetry for dependency management and a `src` package layout for Python modules.

## Project Layout

- `src/github_datapipe/core/`
  Shared config, GitHub API, runtime, and file IO helpers.
- `src/github_datapipe/phases/phase1_repository_sampling/`
  Phase 1 repository discovery and persistence.
- `src/github_datapipe/phases/phase2_commit_ingestion/`
  Phase 2 commit fetching and persistence.
- `tests/`
  Automated tests for the pipeline behavior.

## Prerequisites

- Python 3.12 installed locally
- Poetry installed locally
- GitHub personal access token stored in `.env` as:

```env
github_token=YOUR_TOKEN_HERE
```
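
As a rough illustration of how the token might be consumed, the sketch below reads `github_token` from the process environment and fails early when it is missing. It assumes the `.env` file has already been loaded into the environment (the project imports `python-dotenv` for that); the helper name `read_github_token` is hypothetical.

```python
import os


def read_github_token(env_var: str = "github_token") -> str:
    """Return the token from the environment, or raise if it is unset or empty."""
    token = os.getenv(env_var, "").strip()
    if not token:
        raise ValueError(f"GitHub token is not available as `{env_var}`.")
    return token
```

Failing before any network call keeps a missing token from surfacing later as an authentication error from the API.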

## Install Dependencies

From the project root, install the local environment with:

```powershell
poetry install
```

## Run Phase 1

The default phase 1 command collects 10 repositories using the default search query from `src/github_datapipe/core/config.py`.

```powershell
poetry run github-datapipe sample-repos
```

Useful overrides:

```powershell
poetry run github-datapipe sample-repos --count 25
poetry run github-datapipe sample-repos --count 25 --query "is:public stars:>50 size:>5000 archived:false"
poetry run github-datapipe sample-repos --mode fresh
```
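
The `--mode` flag controls deduplication: the default `append-deduped` mode skips repositories whose IDs are already recorded in `runs/seen_repo_ids.json`, while `fresh` ignores that index. A minimal sketch of the idea follows; `load_seen_ids` and `filter_unseen` are hypothetical helpers for illustration, not the project's own functions.

```python
import json
from pathlib import Path


def load_seen_ids(path: Path) -> set[int]:
    """Load previously seen repository IDs; a missing file means nothing was seen."""
    if not path.exists():
        return set()
    return {int(repo_id) for repo_id in json.loads(path.read_text(encoding="utf-8"))}


def filter_unseen(items: list[dict], seen: set[int]) -> list[dict]:
    """Keep only items whose `id` is new, also dropping repeats within `items`."""
    accepted: list[dict] = []
    local_seen: set[int] = set()
    for item in items:
        repo_id = int(item["id"])
        if repo_id in seen or repo_id in local_seen:
            continue
        local_seen.add(repo_id)
        accepted.append(item)
    return accepted
```

In `fresh` mode the same filter could simply be called with an empty `seen` set.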

### Phase 1 Outputs

After phase 1 completes, check:

- `runs/<run_id>/phase1_repository_sampling/repos.jsonl`
- `runs/<run_id>/phase1_repository_sampling/manifest.json`
- `runs/seen_repo_ids.json` (not updated when `--mode fresh` is used)

The command prints the generated `run_id`, which you will use for phase 2.
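
To inspect the phase 1 output, something like the following can load `repos.jsonl` line by line. It is a sketch only: `load_repos` and `summarize` are hypothetical helpers, and the `full_name` and `default_branch` fields are assumed to be present in each record.

```python
import json
from pathlib import Path


def load_repos(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line of a JSONL file."""
    rows: list[dict] = []
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rows.append(json.loads(line))
    return rows


def summarize(rows: list[dict]) -> list[str]:
    """Format each record as `full_name@default_branch` (assumed field names)."""
    return [f"{row.get('full_name')}@{row.get('default_branch')}" for row in rows]
```

A call such as `summarize(load_repos(Path("runs") / run_id / "phase1_repository_sampling" / "repos.jsonl"))` would then list what phase 2 is about to consume.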

## Run Phase 2

Phase 2 consumes the saved phase 1 repositories and downloads commit history from each repository's current default branch.

Run against a phase 1 run:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id>
```

Useful overrides:

```powershell
poetry run github-datapipe fetch-commits --run-id <run_id> --mode resume
poetry run github-datapipe fetch-commits --run-id <run_id> --max-pages-per-repo 3
poetry run github-datapipe fetch-commits --run-id <run_id> --per-page 50
```

### Phase 2 Defaults

- `mode=refresh`
- `per_page=100`
- `max_pages_per_repo=1`

The default `max_pages_per_repo=1` keeps the prototype small and limits commit downloads to the first page for each repository.
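
The page cap can be pictured as a bounded pagination loop. The sketch below is an assumption about the general shape, not the project's implementation: `fetch_page` stands in for a call to a paginated commit listing, `fetch_capped` is a hypothetical name, and the rule "a full final page means the history may be truncated" is one plausible way to detect truncation. The status strings follow the ones named in this README's notes.

```python
from typing import Callable


def fetch_capped(
    fetch_page: Callable[[int, int], list[dict]],
    per_page: int = 100,
    max_pages: int = 1,
) -> tuple[list[dict], str]:
    """Fetch up to `max_pages` pages and report whether the history was cut short."""
    commits: list[dict] = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page, per_page)
        commits.extend(batch)
        if len(batch) < per_page:
            # A short (or empty) page means the history is exhausted.
            return commits, "complete"
    # Every allowed page was full, so more commits may exist beyond the cap.
    return commits, "success_with_warning"
```

With the defaults, any repository that returns a full first page of 100 commits would be reported as truncated under this rule.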

### Phase 2 Outputs

After phase 2 completes, check:

- `runs/<run_id>/phase2_commit_ingestion/commits/commits-0001.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/repo_status.jsonl`
- `runs/<run_id>/phase2_commit_ingestion/manifest.json`

## Run Tests

Run the automated tests with:

```powershell
poetry run pytest -q
```

## Alternate CLI Invocation

If the Poetry script entrypoint is not available yet, run the CLI module directly:

```powershell
poetry run python -m github_datapipe.cli sample-repos
poetry run python -m github_datapipe.cli fetch-commits --run-id <run_id>
```

## Notes

- Phase 1 currently uses deterministic GitHub Search traversal rather than randomized sampling.
- Phase 2 stores one normalized commit per JSONL line.
- Resume mode in phase 2 skips repositories already marked `complete`.
- Truncated repositories are marked as `success_with_warning` when the page cap is reached.
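
Resume mode can be thought of as a filter over the status file. The following sketch assumes each line of `repo_status.jsonl` carries `full_name` and `status` keys; those field names, and the helpers `completed_repos` and `repos_to_process`, are hypothetical and only illustrate the idea.

```python
import json
from pathlib import Path


def completed_repos(status_path: Path) -> set[str]:
    """Collect names of repositories whose latest status line is `complete`."""
    latest: dict[str, str] = {}
    if status_path.exists():
        for line in status_path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                row = json.loads(line)
                latest[row["full_name"]] = row["status"]  # later lines win
    return {name for name, status in latest.items() if status == "complete"}


def repos_to_process(repos: list[dict], status_path: Path, mode: str = "refresh") -> list[dict]:
    """In `resume` mode drop completed repositories; in `refresh` mode keep all."""
    if mode != "resume":
        return list(repos)
    done = completed_repos(status_path)
    return [repo for repo in repos if repo["full_name"] not in done]
```

Under this reading, a repository marked `success_with_warning` is not treated as complete and would be fetched again on resume.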
@@ -16,8 +16,12 @@ dependencies = [
]

[project.scripts]
-github-datapipe = "github_datapipe.extract_repos:main"
+github-datapipe = "github_datapipe.cli:main"

[tool.poetry]
packages = [
    { include = "github_datapipe", from = "src" }
]

[build-system]
requires = ["poetry-core>=2.0.0,<3.0.0"]
@@ -0,0 +1 @@
"""GitHub data pipeline package."""
165
src/github_datapipe/cli.py
Normal file
@@ -0,0 +1,165 @@
from __future__ import annotations

import argparse
from pathlib import Path

from github_datapipe.core.config import GithubConfig
# Phase 1 queries GitHub for repository metadata.
from github_datapipe.phases.phase1_repository_sampling.service import (
    SampleReposOptions,
    resolve_query,
    sample_repositories,
)
# Phase 2 downloads commit data for the repositories collected in phase 1.
from github_datapipe.phases.phase2_commit_ingestion.service import (
    FetchCommitsOptions,
    fetch_commits,
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="GitHub data pipeline CLI")
    subparsers = parser.add_subparsers(dest="command", required=True)

    sample_parser = subparsers.add_parser(
        "sample-repos",
        help="Collect repositories from GitHub Search and save phase 1 outputs.",
    )
    sample_parser.add_argument(
        "--count",
        type=int,
        default=GithubConfig.default_repo_count,
        help=f"Number of repositories to sample. Defaults to {GithubConfig.default_repo_count}.",
    )
    sample_parser.add_argument(
        "--query",
        type=str,
        default=None,
        help="Optional raw GitHub repository search query. Replaces config defaults when provided.",
    )
    sample_parser.add_argument(
        "--output-root",
        type=Path,
        default=Path(GithubConfig.default_output_root),
        help=f"Directory where run outputs are stored. Defaults to `{GithubConfig.default_output_root}`.",
    )
    sample_parser.add_argument(
        "--mode",
        choices=("append-deduped", "fresh"),
        default="append-deduped",
        help="Whether to dedupe against the persisted seen-repo index or start a fresh phase-1 run.",
    )
    sample_parser.add_argument(
        "--per-page",
        type=int,
        default=GithubConfig.default_per_page,
        help=f"GitHub Search page size. Defaults to {GithubConfig.default_per_page}.",
    )
    sample_parser.add_argument(
        "--run-id",
        type=str,
        default=None,
        help="Optional run identifier. If omitted, a run id is generated automatically.",
    )

    commit_parser = subparsers.add_parser(
        "fetch-commits",
        help="Fetch commit history for repositories collected in phase 1.",
    )
    commit_input = commit_parser.add_mutually_exclusive_group(required=True)
    commit_input.add_argument(
        "--run-id",
        type=str,
        help="Run identifier whose phase 1 repository dataset should be consumed.",
    )
    commit_input.add_argument(
        "--repos-file",
        type=Path,
        help="Path to a phase 1 repos.jsonl file.",
    )
    commit_parser.add_argument(
        "--output-root",
        type=Path,
        default=Path(GithubConfig.default_output_root),
        help=f"Directory where run outputs are stored. Defaults to `{GithubConfig.default_output_root}`.",
    )
    commit_parser.add_argument(
        "--mode",
        choices=("refresh", "resume"),
        default="refresh",
        help="Whether to fetch all repositories from scratch or skip those already marked complete.",
    )
    commit_parser.add_argument(
        "--max-pages-per-repo",
        type=int,
        default=GithubConfig.default_max_pages_per_repo,
        help=(
            "Maximum number of commit pages to fetch per repository. "
            f"Defaults to {GithubConfig.default_max_pages_per_repo}."
        ),
    )
    commit_parser.add_argument(
        "--per-page",
        type=int,
        default=GithubConfig.default_per_page,
        help=f"GitHub commit page size. Defaults to {GithubConfig.default_per_page}.",
    )
    commit_parser.add_argument(
        "--retry-count",
        type=int,
        default=GithubConfig.default_retry_count,
        help=f"Number of retries for repository metadata and commit requests. Defaults to {GithubConfig.default_retry_count}.",
    )

    return parser


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    if args.command == "sample-repos":
        options = SampleReposOptions(
            count=args.count,
            output_root=args.output_root,
            query=resolve_query(args.query),
            per_page=args.per_page,
            mode=args.mode,
            run_id=args.run_id,
        )
        result = sample_repositories(options)
        print(f"Run ID: {result['run_id']}")
        print(f"Collected repositories: {result['count_collected']}")
        print(f"Repositories file: {result['repos_path']}")
        print(f"Manifest file: {result['manifest_path']}")
        if result["seen_index_path"] is not None:
            print(f"Seen repo index: {result['seen_index_path']}")
        return 0

    if args.command == "fetch-commits":
        options = FetchCommitsOptions(
            output_root=args.output_root,
            run_id=args.run_id,
            repos_file=args.repos_file,
            mode=args.mode,
            per_page=args.per_page,
            max_pages_per_repo=args.max_pages_per_repo,
            retry_count=args.retry_count,
        )
        result = fetch_commits(options)
        print(f"Run ID: {result['run_id']}")
        print(f"Processed repositories: {result['processed_repositories']}")
        print(f"Completed repositories: {result['completed_repositories']}")
        print(f"Warning repositories: {result['warning_repositories']}")
        print(f"Failed repositories: {result['failed_repositories']}")
        print(f"Commits file: {result['commits_path']}")
        print(f"Status file: {result['status_path']}")
        print(f"Manifest file: {result['manifest_path']}")
        return 0

    parser.error(f"Unsupported command: {args.command}")
    return 1


if __name__ == "__main__":
    raise SystemExit(main())
1
src/github_datapipe/core/__init__.py
Normal file
@@ -0,0 +1 @@
"""Shared building blocks for GitHub pipeline phases."""
@@ -4,7 +4,7 @@ from dataclasses import dataclass
from pathlib import Path


-@dataclass(frozen=True) # This makes the class immutable so in the process, we do not accidently change the variable values of this
+@dataclass(frozen=True)
class GithubConfig:
    base_url: str = "https://api.github.com"
    api_version: str = "2022-11-28"
@@ -13,9 +13,9 @@ class GithubConfig:
    default_per_page: int = 100
    default_output_root: str = "runs"
    default_query: str = "is:public stars:>10 size:>1000 archived:false fork:false created:>2025-07-15"
    default_max_pages_per_repo: int = 1
    default_retry_count: int = 2
    user_agent: str = "github-datapipe/0.1.0"




-ROOT_DIR = Path(__file__).resolve().parent
+ROOT_DIR = Path(__file__).resolve().parents[2]
54
src/github_datapipe/core/github_api.py
Normal file
@@ -0,0 +1,54 @@
from __future__ import annotations

from typing import Any

import requests

from github_datapipe.core.config import GithubConfig


class GithubApiError(RuntimeError):
    """Raised when GitHub responds with an error payload."""


class GithubApiClient:
    def __init__(self, token: str, session: requests.Session | None = None) -> None:
        self._session = session or requests.Session()
        self._session.headers.update(
            {
                "Accept": "application/vnd.github+json",
                "Authorization": f"Bearer {token}",
                "X-GitHub-Api-Version": GithubConfig.api_version,
                "User-Agent": GithubConfig.user_agent,
            }
        )

    def search_repositories(self, query: str, page: int, per_page: int) -> dict[str, Any]:
        return self._get_json(
            GithubConfig.search_repositories_endpoint,
            params={"q": query, "page": page, "per_page": per_page},
        )

    def get_repository(self, full_name: str) -> dict[str, Any]:
        return self._get_json(f"/repos/{full_name}")

    def list_commits(self, full_name: str, branch: str, page: int, per_page: int) -> list[dict[str, Any]]:
        return self._get_json(
            f"/repos/{full_name}/commits",
            params={"sha": branch, "page": page, "per_page": per_page},
        )

    def _get_json(self, path: str, params: dict[str, Any] | None = None) -> Any:
        response = self._session.get(
            f"{GithubConfig.base_url}{path}",
            params=params,
            timeout=30,
        )
        if response.status_code >= 400:
            try:
                payload = response.json()
            except ValueError:
                payload = {"message": response.text}
            message = payload.get("message", "GitHub API request failed")
            raise GithubApiError(f"GitHub request failed ({response.status_code}): {message}")
        return response.json()
43
src/github_datapipe/core/io.py
Normal file
@@ -0,0 +1,43 @@
from __future__ import annotations

import json
from pathlib import Path
from typing import Any, Iterable


def write_json(path: Path, payload: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def write_jsonl(path: Path, rows: Iterable[dict[str, Any]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row))
            handle.write("\n")


def append_jsonl(path: Path, rows: Iterable[dict[str, Any]]) -> int:
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("a", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row))
            handle.write("\n")
            count += 1
    return count


def read_json(path: Path) -> Any:
    return json.loads(path.read_text(encoding="utf-8"))


def read_jsonl(path: Path) -> list[dict[str, Any]]:
    if not path.exists():
        return []
    rows: list[dict[str, Any]] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            rows.append(json.loads(line))
    return rows
24
src/github_datapipe/core/runtime.py
Normal file
@@ -0,0 +1,24 @@
from __future__ import annotations

import os
from datetime import UTC, datetime
from uuid import uuid4

from dotenv import load_dotenv

load_dotenv()


def require_github_token() -> str:
    github_token = os.getenv("github_token")
    if not github_token:
        raise ValueError("GitHub token is not available in .env as `github_token`.")
    return github_token


def build_run_id() -> str:
    return f"run-{datetime.now(tz=UTC).strftime('%Y%m%dT%H%M%SZ')}-{uuid4().hex[:8]}"


def utc_now() -> str:
    return datetime.now(tz=UTC).isoformat()
@@ -1,286 +0,0 @@
from __future__ import annotations

import json
import os
from dataclasses import dataclass
from datetime import UTC, datetime
from pathlib import Path
from typing import Any
from uuid import uuid4

import requests
from dotenv import load_dotenv

from github_datapipe.config import GithubConfig

load_dotenv()


class GithubApiError(RuntimeError):
    """Raised when GitHub responds with an error payload."""


@dataclass(frozen=True)
class SampleReposOptions:
    """
    To set configuration options for the GitHub repository sampling run.
    Attributes:
        1. count (int): The total number of unique repositories to collect. Must be > 0. Defaults to GithubConfig.default_repo_count
        2. output_root: The base directory where the run folder will be created to store the search repository data.
        3. query: The github search string using github offered query params.
        4. per_page: The number of results to fetch per API call. Max is 100.
        5. mode: 'fresh' (starting a fresh run) or "append-deduped" (if you want to resume half-done previous run and avoid duplicates).
        6. run_id: To keep a trace of current run by giving it a unique id.
    """
    count: int = GithubConfig.default_repo_count
    output_root: Path = Path(GithubConfig.default_output_root)
    query: str = GithubConfig.default_query
    per_page: int = GithubConfig.default_per_page
    mode: str = "append-deduped"
    run_id: str | None = None


def resolve_query(query_override: str | None) -> str:
    """ To override default query params with user inserted query params. """
    return query_override.strip() if query_override else GithubConfig.default_query


class GithubRepoSampler:
    def __init__(self, token: str, session: requests.Session | None = None) -> None:
        """
        Everytime an instance of GithubRepoSampler is created, this method will be called first to open a session and set headers.
        Args:
            token (str): A valid GitHub Personal Access Token (PAT).
            session (requests.Session | None, optional): An existing session to use. If None, a new session is created.
        """
        self._session = session or requests.Session()
        self._session.headers.update(
            {
                "Accept": "application/vnd.github+json",
                "Authorization": f"Bearer {token}",
                "X-GitHub-Api-Version": GithubConfig.api_version,
                "User-Agent": GithubConfig.user_agent,
            }
        )

    def search_repositories(self, query: str, page: int, per_page: int) -> dict[str, Any]:
        """
        Calls github's /search/repositories endpoint passing query params and other.

        Args:
            query (str): The GitHub-formatted search query string.
            page (int): The specific page of results to fetch.
            per_page (int): The number of items to return per page (max 100).

        Returns:
            dict[str, Any]: The raw JSON payload returned by the GitHub API.

        Raises:
            GithubApiError: If the API responds with an HTTP status code >= 400
        """
        response = self._session.get(
            f"{GithubConfig.base_url}{GithubConfig.search_repositories_endpoint}",
            params={"q": query, "page": page, "per_page": per_page},
            timeout=30,
        )
        if response.status_code >= 400:
            try:
                payload = response.json()
            except ValueError:
                payload = {"message": response.text}
            message = payload.get("message", "GitHub API request failed")
            raise GithubApiError(f"GitHub search failed ({response.status_code}): {message}")
        return response.json()


def sample_repositories(options: SampleReposOptions) -> dict[str, Any]:
    """
    Orchestrates the end-to-end execution of a GitHub repository data pull.

    This function handles the complete lifecycle of a sampling run: validating credentials,
    setting up the local file system for the run, managing state to prevent duplicate
    downloads, paginating through the GitHub Search API, and saving the normalized data.

    Effects:
        - Creates a timestamped run directory inside `options.output_root`.
        - Writes newly fetched repository data to `repos.jsonl`.
        - Writes or updates a global `seen_repo_ids.json` tracker file.
        - Writes a `manifest.json` logging the run's parameters and results.

    Args:
        options (SampleReposOptions): The configuration parameters for the run.

    Returns:
        dict[str, Any]: A summary dictionary containing the `run_id`, the total
        `count_collected`, and the absolute paths to the generated output files.
    """
    # 1. SETUP: Validate the token and creds
    github_token = os.getenv("github_token")
    if not github_token:
        raise ValueError("GitHub token is not available in .env as `github_token`.")

    if options.count <= 0:
        raise ValueError("`count` must be greater than 0.")

    # 2. FILE SYSTEM: Create isolated folders for this specific run
    run_id = options.run_id or build_run_id()
    output_root = options.output_root.resolve()  # returns an absolute path
    run_root = output_root / run_id  # isolated folder created for one single execution (or "run") of this script.
    phase1_root = run_root / "phase1"  # the search phase is phase 1
    phase1_root.mkdir(parents=True, exist_ok=True)

    # Define exact file paths for where data will live
    seen_index_path = output_root / "seen_repo_ids.json"  # Notes the last seen repo's id
    repos_path = phase1_root / "repos.jsonl"  # Actual repo metadata
    manifest_path = phase1_root / "manifest.json"  # log entry

    # 3. STATE MGMT: seen_repo_ids will make sure that duplicated data doesn't end up in the result
    seen_repo_ids = set() if options.mode == "fresh" else load_seen_repo_ids(seen_index_path)
    current_run_seen_repo_ids: set[int] = set()

    # 4. NETWORK SETUP: Initialize the session wrapper
    sampler = GithubRepoSampler(token=github_token)
    accepted_repos: list[dict[str, Any]] = []
    page = 1
    total_count: int | None = None

    # 5. THE MAIN LOOP: Keep fetching until the requested target count is hit
    while len(accepted_repos) < options.count:
        payload = sampler.search_repositories(
            query=options.query,
            page=page,
            per_page=options.per_page,
        )
        # Github Search API response = JSON Object containing { search metadata, items: [{ repo obj1 }, { repo obj 2} ...] }
        items = payload.get("items", [])
        total_count = payload.get("total_count", total_count)  # the number of the repos matching your query at exact moment in time

        # Break safety: If GitHub returns an empty list, we've exhausted all results of current api call
        if not items:
            break

        sampled_at = utc_now()  # to keep note of current time incase we need to revise it

        # Process each repository in the current result present in items
        for repo in items:
            repo_id = int(repo["id"])
            if repo_id in seen_repo_ids or repo_id in current_run_seen_repo_ids:
                continue

            normalized_repo = normalize_repo_record(
                repo=repo,
                run_id=run_id,
                sample_query=options.query,
                sample_page=page,
                sampled_at=sampled_at,
            )
            accepted_repos.append(normalized_repo)
            current_run_seen_repo_ids.add(repo_id)

            if len(accepted_repos) >= options.count:
                break

        page += 1

    write_jsonl(repos_path, accepted_repos)
    if options.mode != "fresh":
        seen_repo_ids.update(current_run_seen_repo_ids)
        write_json(seen_index_path, sorted(seen_repo_ids))

    # manifest is a log object that denotes what kind of input was given to achieve certain output.
    manifest = {
        "run_id": run_id,
        "command": "sample-repos",
        "resolved_query": options.query,
        "count_requested": options.count,
        "count_collected": len(accepted_repos),
        "per_page": options.per_page,
        "mode": options.mode,
        "page_count_scanned": max(page - 1, 0),
        "github_total_count": total_count,
        "output_files": {
            "repos_jsonl": str(repos_path),
            "seen_repo_ids_json": str(seen_index_path) if options.mode != "fresh" else None,
        },
        "sampled_at": utc_now(),
    }
    write_json(manifest_path, manifest)

    return {
        "run_id": run_id,
        "count_collected": len(accepted_repos),
        "repos_path": repos_path,
        "manifest_path": manifest_path,
        "seen_index_path": seen_index_path if options.mode != "fresh" else None,
    }


def normalize_repo_record(
    repo: dict[str, Any],
    run_id: str,
    sample_query: str,
    sample_page: int,
    sampled_at: str,
) -> dict[str, Any]:

    """
    Extracts and flattens relevant fields from a raw GitHub repository payload.

    Args:
        repo (dict[str, Any]): The raw JSON dictionary for a single repository
            as returned by the GitHub API.
        run_id (str): The unique identifier for the current sampling run.
        sample_query (str): The search string used to find this repository.
        sample_page (int): The pagination page number this repository was found on.
        sampled_at (str): ISO-8601 formatted timestamp of when the record was fetched.

    Returns:
        dict[str, Any]: A flat dictionary containing the filtered repository metrics
        (e.g., repo_id, language, stargazers_count, size_kb) and traceability data.
    """
    return {
        "run_id": run_id,
        "repo_id": repo["id"],
        "full_name": repo["full_name"],
        "html_url": repo["html_url"],
        "api_url": repo["url"],
        "default_branch": repo.get("default_branch"),
        "language": repo.get("language"),
        "description": repo.get("description"),
        "stargazers_count": repo.get("stargazers_count"),
        "size_kb": repo.get("size"),
        "fork": repo.get("fork"),
        "archived": repo.get("archived"),
        "visibility": repo.get("visibility"),
        "sample_query": sample_query,
        "sample_page": sample_page,
        "sampled_at": sampled_at,
    }


def load_seen_repo_ids(path: Path) -> set[int]:
    if not path.exists():
        return set()
    payload = json.loads(path.read_text(encoding="utf-8"))
    return {int(repo_id) for repo_id in payload}


def write_json(path: Path, payload: Any) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def write_jsonl(path: Path, rows: list[dict[str, Any]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row))
            handle.write("\n")


def build_run_id() -> str:
    return f"run-{datetime.now(tz=UTC).strftime('%Y%m%dT%H%M%SZ')}-{uuid4().hex[:8]}"


def utc_now() -> str:
    return datetime.now(tz=UTC).isoformat()
@@ -1,86 +0,0 @@
from __future__ import annotations

import argparse
from pathlib import Path

from github_datapipe.config import GithubConfig
from github_datapipe.extract_repos import SampleReposOptions, resolve_query, sample_repositories

### parser that unpacks the command line args passed as queries and
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="GitHub data pipeline CLI")
    subparsers = parser.add_subparsers(dest="command", required=True)

    sample_parser = subparsers.add_parser(
        "sample-repos",
        help="Collect repositories from GitHub Search and save phase 1 outputs.",
    )
    sample_parser.add_argument(
        "--count",
        type=int,
        default=GithubConfig.default_repo_count,
        help=f"Number of repositories to sample. Defaults to {GithubConfig.default_repo_count}.",
    )
    sample_parser.add_argument(
        "--query",
        type=str,
        default=None,
        help="Optional raw GitHub repository search query. Replaces config defaults when provided.",
    )
    sample_parser.add_argument(
        "--output-root",
        type=Path,
        default=Path(GithubConfig.default_output_root),
        help=f"Directory where run outputs are stored. Defaults to `{GithubConfig.default_output_root}`.",
    )
    sample_parser.add_argument(
        "--mode",
        choices=("append-deduped", "fresh"),
        default="append-deduped",
        help="Whether to dedupe against the persisted seen-repo index or start a fresh phase-1 run.",
    )
    sample_parser.add_argument(
        "--per-page",
        type=int,
        default=GithubConfig.default_per_page,
        help=f"GitHub Search page size. Defaults to {GithubConfig.default_per_page}.",
    )
    sample_parser.add_argument(
        "--run-id",
        type=str,
        default=None,
        help="Optional run identifier. If omitted, a run id is generated automatically.",
    )

    return parser


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    if args.command == "sample-repos":
        options = SampleReposOptions(
            count=args.count,
            output_root=args.output_root,
            query=resolve_query(args.query),
            per_page=args.per_page,
            mode=args.mode,
            run_id=args.run_id,
        )
        # result will return output from the extract repo's sample_repositories
        result = sample_repositories(options)
        print(f"Run ID: {result['run_id']}")
        print(f"Collected repositories: {result['count_collected']}")
        print(f"Repositories file: {result['repos_path']}")
        print(f"Manifest file: {result['manifest_path']}")
        if result["seen_index_path"] is not None:
            print(f"Seen repo index: {result['seen_index_path']}")
        return 0

    parser.error(f"Unsupported command: {args.command}")
    return 1


if __name__ == "__main__":
    raise SystemExit(main())
1
src/github_datapipe/phases/__init__.py
Normal file
@@ -0,0 +1 @@
"""Pipeline phases for repository sampling and commit ingestion."""
@@ -0,0 +1 @@
"""Phase 1: repository discovery and persistence."""
148
src/github_datapipe/phases/phase1_repository_sampling/service.py
Normal file
@@ -0,0 +1,148 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from github_datapipe.core.config import GithubConfig
|
||||
from github_datapipe.core.github_api import GithubApiClient
|
||||
from github_datapipe.core.io import read_json, write_json, write_jsonl
|
||||
from github_datapipe.core.runtime import build_run_id, require_github_token, utc_now
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class SampleReposOptions:
|
||||
count: int = GithubConfig.default_repo_count
|
||||
output_root: Path = Path(GithubConfig.default_output_root)
|
||||
query: str = GithubConfig.default_query
|
||||
per_page: int = GithubConfig.default_per_page
|
||||
mode: str = "append-deduped"
|
||||
run_id: str | None = None
|
||||
|
||||
|
||||
def resolve_query(query_override: str | None) -> str:
|
||||
# Fall back to the default when the override is missing or whitespace-only.
stripped = (query_override or "").strip()
return stripped or GithubConfig.default_query
|
||||
|
||||
|
||||
def sample_repositories(options: SampleReposOptions) -> dict[str, Any]:
|
||||
token = require_github_token()
|
||||
if options.count <= 0:
|
||||
raise ValueError("`count` must be greater than 0.")
|
||||
|
||||
run_id = options.run_id or build_run_id()
|
||||
output_root = options.output_root.resolve()
|
||||
run_root = output_root / run_id
|
||||
phase_root = run_root / "phase1_repository_sampling"
|
||||
phase_root.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
seen_index_path = output_root / "seen_repo_ids.json"
|
||||
repos_path = phase_root / "repos.jsonl"
|
||||
manifest_path = phase_root / "manifest.json"
|
||||
|
||||
seen_repo_ids = set() if options.mode == "fresh" else load_seen_repo_ids(seen_index_path)
|
||||
current_run_seen_repo_ids: set[int] = set()
|
||||
|
||||
client = GithubApiClient(token=token)
|
||||
accepted_repos: list[dict[str, Any]] = []
|
||||
page = 1
|
||||
total_count: int | None = None
|
||||
|
||||
while len(accepted_repos) < options.count:
|
||||
payload = client.search_repositories(
|
||||
query=options.query,
|
||||
page=page,
|
||||
per_page=options.per_page,
|
||||
)
|
||||
items = payload.get("items", [])
|
||||
total_count = payload.get("total_count", total_count)
|
||||
|
||||
if not items:
|
||||
break
|
||||
|
||||
sampled_at = utc_now()
|
||||
for repo in items:
|
||||
repo_id = int(repo["id"])
|
||||
if repo_id in seen_repo_ids or repo_id in current_run_seen_repo_ids:
|
||||
continue
|
||||
|
||||
accepted_repos.append(
|
||||
normalize_repo_record(
|
||||
repo=repo,
|
||||
run_id=run_id,
|
||||
sample_query=options.query,
|
||||
sample_page=page,
|
||||
sampled_at=sampled_at,
|
||||
)
|
||||
)
|
||||
current_run_seen_repo_ids.add(repo_id)
|
||||
|
||||
if len(accepted_repos) >= options.count:
|
||||
break
|
||||
|
||||
page += 1
|
||||
|
||||
write_jsonl(repos_path, accepted_repos)
|
||||
if options.mode != "fresh":
|
||||
seen_repo_ids.update(current_run_seen_repo_ids)
|
||||
write_json(seen_index_path, sorted(seen_repo_ids))
|
||||
|
||||
manifest = {
|
||||
"run_id": run_id,
|
||||
"phase_name": "phase1_repository_sampling",
|
||||
"command": "sample-repos",
|
||||
"resolved_query": options.query,
|
||||
"count_requested": options.count,
|
||||
"count_collected": len(accepted_repos),
|
||||
"per_page": options.per_page,
|
||||
"mode": options.mode,
|
||||
"page_count_scanned": max(page - 1, 0),
|
||||
"github_total_count": total_count,
|
||||
"output_files": {
|
||||
"repos_jsonl": str(repos_path),
|
||||
"seen_repo_ids_json": str(seen_index_path) if options.mode != "fresh" else None,
|
||||
},
|
||||
"sampled_at": utc_now(),
|
||||
}
|
||||
write_json(manifest_path, manifest)
|
||||
|
||||
return {
|
||||
"run_id": run_id,
|
||||
"count_collected": len(accepted_repos),
|
||||
"repos_path": repos_path,
|
||||
"manifest_path": manifest_path,
|
||||
"seen_index_path": seen_index_path if options.mode != "fresh" else None,
|
||||
}
|
||||
|
||||
|
||||
def normalize_repo_record(
|
||||
repo: dict[str, Any],
|
||||
run_id: str,
|
||||
sample_query: str,
|
||||
sample_page: int,
|
||||
sampled_at: str,
|
||||
) -> dict[str, Any]:
|
||||
return {
|
||||
"run_id": run_id,
|
||||
"repo_id": repo["id"],
|
||||
"full_name": repo["full_name"],
|
||||
"html_url": repo["html_url"],
|
||||
"api_url": repo["url"],
|
||||
"default_branch": repo.get("default_branch"),
|
||||
"language": repo.get("language"),
|
||||
"description": repo.get("description"),
|
||||
"stargazers_count": repo.get("stargazers_count"),
|
||||
"size_kb": repo.get("size"),
|
||||
"fork": repo.get("fork"),
|
||||
"archived": repo.get("archived"),
|
||||
"visibility": repo.get("visibility"),
|
||||
"sample_query": sample_query,
|
||||
"sample_page": sample_page,
|
||||
"sampled_at": sampled_at,
|
||||
}
|
||||
|
||||
|
||||
def load_seen_repo_ids(path: Path) -> set[int]:
|
||||
if not path.exists():
|
||||
return set()
|
||||
payload = read_json(path)
|
||||
return {int(repo_id) for repo_id in payload}
|
||||
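The sampling loop in `sample_repositories` above combines pagination with two layers of de-duplication: IDs seen in earlier runs and IDs accepted earlier in the current run. A minimal, self-contained sketch of that control flow follows. It assumes plain integer IDs and an in-memory `fake_fetch` in place of `GithubApiClient.search_repositories`; `collect_unique_ids`, `fake_fetch`, and `PAGES` are hypothetical names introduced for illustration only.

```python
from typing import Callable


def collect_unique_ids(
    fetch_page: Callable[[int], list[int]],
    seen_ids: set[int],
    count: int,
) -> list[int]:
    """Page through results until `count` unseen IDs are accepted."""
    accepted: list[int] = []
    current_run_ids: set[int] = set()
    page = 1
    while len(accepted) < count:
        items = fetch_page(page)
        if not items:
            break  # the source ran out of results
        for repo_id in items:
            if repo_id in seen_ids or repo_id in current_run_ids:
                continue  # sampled in an earlier run or on an earlier page
            accepted.append(repo_id)
            current_run_ids.add(repo_id)
            if len(accepted) >= count:
                break
        page += 1
    return accepted


# Hypothetical in-memory pages standing in for GitHub Search responses.
PAGES = {1: [100, 101], 2: [101, 102], 3: [103]}


def fake_fetch(page: int) -> list[int]:
    return PAGES.get(page, [])


if __name__ == "__main__":
    print(collect_unique_ids(fake_fetch, seen_ids={100}, count=3))
```

Keeping the current-run set separate from the persisted set mirrors the `fresh` mode in the source, where the persisted index is neither read nor written but duplicates across pages are still filtered.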
@@ -0,0 +1 @@
|
||||
"""Phase 2: commit history retrieval and persistence."""
|
||||
293
src/github_datapipe/phases/phase2_commit_ingestion/service.py
Normal file
@@ -0,0 +1,293 @@
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from github_datapipe.core.config import GithubConfig
|
||||
from github_datapipe.core.github_api import GithubApiClient, GithubApiError
|
||||
from github_datapipe.core.io import append_jsonl, read_jsonl, write_json, write_jsonl
|
||||
from github_datapipe.core.runtime import require_github_token, utc_now
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class FetchCommitsOptions:
|
||||
output_root: Path = Path(GithubConfig.default_output_root)
|
||||
run_id: str | None = None
|
||||
repos_file: Path | None = None
|
||||
mode: str = "refresh"
|
||||
per_page: int = GithubConfig.default_per_page
|
||||
max_pages_per_repo: int = GithubConfig.default_max_pages_per_repo
|
||||
retry_count: int = GithubConfig.default_retry_count
|
||||
|
||||
|
||||
def fetch_commits(options: FetchCommitsOptions) -> dict[str, Any]:
|
||||
token = require_github_token()
|
||||
if options.max_pages_per_repo <= 0:
|
||||
raise ValueError("`max_pages_per_repo` must be greater than 0.")
|
||||
if options.per_page <= 0:
|
||||
raise ValueError("`per_page` must be greater than 0.")
|
||||
|
||||
output_root = options.output_root.resolve()
|
||||
repos_path, run_id = resolve_repo_input(options, output_root)
|
||||
|
||||
run_root = output_root / run_id
|
||||
phase_root = run_root / "phase2_commit_ingestion"
|
||||
commits_root = phase_root / "commits"
|
||||
commits_path = commits_root / "commits-0001.jsonl"
|
||||
status_path = phase_root / "repo_status.jsonl"
|
||||
manifest_path = phase_root / "manifest.json"
|
||||
|
||||
repos = read_jsonl(repos_path)
|
||||
if not repos:
|
||||
raise ValueError(f"No repositories found in {repos_path}.")
|
||||
|
||||
prior_statuses = load_status_index(status_path) if options.mode == "resume" else {}
|
||||
|
||||
if options.mode == "refresh" and commits_path.exists():
|
||||
commits_path.unlink()
|
||||
status_rows: list[dict[str, Any]] = []
|
||||
if options.mode == "resume":
|
||||
status_rows.extend(read_jsonl(status_path))
|
||||
|
||||
processed_repositories = 0
|
||||
completed_repositories = 0
|
||||
warning_repositories = 0
|
||||
failed_repositories = 0
|
||||
written_commit_count = 0 if options.mode == "refresh" else count_jsonl_rows(commits_path)
|
||||
client: GithubApiClient | None = None
|
||||
|
||||
for repo in repos:
|
||||
repo_id = int(repo["repo_id"])
|
||||
previous_status = prior_statuses.get(repo_id)
|
||||
if options.mode == "resume" and previous_status == "complete":
|
||||
continue
|
||||
|
||||
if client is None:
|
||||
client = GithubApiClient(token=token)
|
||||
|
||||
processed_repositories += 1
|
||||
status_record, commit_rows = process_repository(
|
||||
client=client,
|
||||
repo=repo,
|
||||
run_id=run_id,
|
||||
per_page=options.per_page,
|
||||
max_pages_per_repo=options.max_pages_per_repo,
|
||||
retry_count=options.retry_count,
|
||||
)
|
||||
|
||||
status_rows.append(status_record)
|
||||
written_commit_count += append_jsonl(commits_path, commit_rows)
|
||||
|
||||
if status_record["status"] == "complete":
|
||||
completed_repositories += 1
|
||||
elif status_record["status"] == "success_with_warning":
|
||||
warning_repositories += 1
|
||||
else:
|
||||
failed_repositories += 1
|
||||
|
||||
write_jsonl(status_path, dedupe_status_rows(status_rows))
|
||||
|
||||
manifest = {
|
||||
"run_id": run_id,
|
||||
"phase_name": "phase2_commit_ingestion",
|
||||
"command": "fetch-commits",
|
||||
"mode": options.mode,
|
||||
"per_page": options.per_page,
|
||||
"max_pages_per_repo": options.max_pages_per_repo,
|
||||
"retry_count": options.retry_count,
|
||||
"processed_repositories": processed_repositories,
|
||||
"completed_repositories": completed_repositories,
|
||||
"warning_repositories": warning_repositories,
|
||||
"failed_repositories": failed_repositories,
|
||||
"written_commit_count": written_commit_count,
|
||||
"input_files": {"repos_jsonl": str(repos_path)},
|
||||
"output_files": {
|
||||
"commits_jsonl": str(commits_path),
|
||||
"repo_status_jsonl": str(status_path),
|
||||
},
|
||||
"fetched_at": utc_now(),
|
||||
}
|
||||
write_json(manifest_path, manifest)
|
||||
|
||||
return {
|
||||
"run_id": run_id,
|
||||
"processed_repositories": processed_repositories,
|
||||
"completed_repositories": completed_repositories,
|
||||
"warning_repositories": warning_repositories,
|
||||
"failed_repositories": failed_repositories,
|
||||
"commits_path": commits_path,
|
||||
"status_path": status_path,
|
||||
"manifest_path": manifest_path,
|
||||
}
|
||||
|
||||
|
||||
def resolve_repo_input(options: FetchCommitsOptions, output_root: Path) -> tuple[Path, str]:
|
||||
if options.run_id is not None:
|
||||
return (
|
||||
output_root / options.run_id / "phase1_repository_sampling" / "repos.jsonl",
|
||||
options.run_id,
|
||||
)
|
||||
|
||||
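# Raise explicitly: assert statements are skipped when Python runs with -O.
if options.repos_file is None:
raise ValueError("Either `run_id` or `repos_file` must be provided.")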
|
||||
repos_path = options.repos_file.resolve()
|
||||
repos = read_jsonl(repos_path)
|
||||
if not repos:
|
||||
raise ValueError(f"No repositories found in {repos_path}.")
|
||||
return repos_path, str(repos[0]["run_id"])
|
||||
|
||||
|
||||
def process_repository(
|
||||
client: GithubApiClient,
|
||||
repo: dict[str, Any],
|
||||
run_id: str,
|
||||
per_page: int,
|
||||
max_pages_per_repo: int,
|
||||
retry_count: int,
|
||||
) -> tuple[dict[str, Any], list[dict[str, Any]]]:
|
||||
full_name = repo["full_name"]
|
||||
fetched_at = utc_now()
|
||||
|
||||
try:
|
||||
repository = with_retries(
|
||||
lambda: client.get_repository(full_name),
|
||||
retry_count=retry_count,
|
||||
)
|
||||
branch = repository["default_branch"]
|
||||
commit_rows: list[dict[str, Any]] = []
|
||||
page = 1
|
||||
last_page_size = 0
|
||||
|
||||
while page <= max_pages_per_repo:
|
||||
commits = with_retries(
|
||||
lambda page_number=page: client.list_commits(
|
||||
full_name=full_name,
|
||||
branch=branch,
|
||||
page=page_number,
|
||||
per_page=per_page,
|
||||
),
|
||||
retry_count=retry_count,
|
||||
)
|
||||
if not commits:
|
||||
break
|
||||
|
||||
last_page_size = len(commits)
|
||||
commit_rows.extend(
|
||||
normalize_commit_record(
|
||||
repo=repo,
|
||||
repository=repository,
|
||||
commit_payload=commit_payload,
|
||||
run_id=run_id,
|
||||
branch=branch,
|
||||
page_number=page,
|
||||
fetched_at=fetched_at,
|
||||
)
|
||||
for commit_payload in commits
|
||||
)
|
||||
|
||||
if len(commits) < per_page:
|
||||
break
|
||||
|
||||
page += 1
|
||||
|
||||
truncated = bool(commit_rows) and page > max_pages_per_repo and last_page_size == per_page
|
||||
status = "success_with_warning" if truncated else "complete"
|
||||
status_record = {
|
||||
"run_id": run_id,
|
||||
"repo_id": repo["repo_id"],
|
||||
"repo_full_name": full_name,
|
||||
"status": status,
|
||||
"commit_count": len(commit_rows),
|
||||
"default_branch_at_fetch": branch,
|
||||
"truncated": truncated,
|
||||
"truncation_reason": "max_pages" if truncated else None,
|
||||
"failure_reason": None,
|
||||
"fetched_at": fetched_at,
|
||||
}
|
||||
return status_record, commit_rows
|
||||
except GithubApiError as exc:
|
||||
status_record = {
|
||||
"run_id": run_id,
|
||||
"repo_id": repo["repo_id"],
|
||||
"repo_full_name": full_name,
|
||||
"status": "failed",
|
||||
"commit_count": 0,
|
||||
"default_branch_at_fetch": None,
|
||||
"truncated": False,
|
||||
"truncation_reason": None,
|
||||
"failure_reason": str(exc),
|
||||
"fetched_at": fetched_at,
|
||||
}
|
||||
return status_record, []
|
||||
|
||||
|
||||
def normalize_commit_record(
|
||||
repo: dict[str, Any],
|
||||
repository: dict[str, Any],
|
||||
commit_payload: dict[str, Any],
|
||||
run_id: str,
|
||||
branch: str,
|
||||
page_number: int,
|
||||
fetched_at: str,
|
||||
) -> dict[str, Any]:
|
||||
commit = commit_payload["commit"]
|
||||
author = commit.get("author") or {}
|
||||
committer = commit.get("committer") or {}
|
||||
return {
|
||||
"run_id": run_id,
|
||||
"repo_id": repo["repo_id"],
|
||||
"repo_full_name": repo["full_name"],
|
||||
"repo_html_url": repo["html_url"],
|
||||
"default_branch_at_fetch": branch,
|
||||
"sha": commit_payload["sha"],
|
||||
"commit_key": f"{repo['repo_id']}:{commit_payload['sha']}",
|
||||
"parent_shas": [parent["sha"] for parent in commit_payload.get("parents", [])],
|
||||
"author_name": author.get("name"),
|
||||
"author_email": author.get("email"),
|
||||
"author_date": author.get("date"),
|
||||
"committer_name": committer.get("name"),
|
||||
"committer_email": committer.get("email"),
|
||||
"committer_date": committer.get("date"),
|
||||
"message": commit.get("message"),
|
||||
"html_url": commit_payload.get("html_url"),
|
||||
"fetched_at": fetched_at,
|
||||
"source_endpoint": f"/repos/{repository['full_name']}/commits",
|
||||
"page_number": page_number,
|
||||
"truncated": False,
|
||||
}
|
||||
|
||||
|
||||
def with_retries(operation: Any, retry_count: int) -> Any:
|
||||
last_error: Exception | None = None
|
||||
for _ in range(retry_count + 1):
|
||||
try:
|
||||
return operation()
|
||||
except GithubApiError as exc:
|
||||
last_error = exc
|
||||
assert last_error is not None
|
||||
raise last_error
|
||||
|
||||
|
||||
def load_status_index(path: Path) -> dict[int, str]:
|
||||
rows = read_jsonl(path)
|
||||
latest_by_repo: dict[int, str] = {}
|
||||
for row in rows:
|
||||
latest_by_repo[int(row["repo_id"])] = row["status"]
|
||||
return latest_by_repo
|
||||
|
||||
|
||||
def dedupe_status_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
|
||||
latest_by_repo: dict[int, dict[str, Any]] = {}
|
||||
order: list[int] = []
|
||||
for row in rows:
|
||||
repo_id = int(row["repo_id"])
|
||||
if repo_id not in latest_by_repo:
|
||||
order.append(repo_id)
|
||||
latest_by_repo[repo_id] = row
|
||||
return [latest_by_repo[repo_id] for repo_id in order]
|
||||
|
||||
|
||||
def count_jsonl_rows(path: Path) -> int:
|
||||
if not path.exists():
|
||||
return 0
|
||||
return len(path.read_text(encoding="utf-8").splitlines())
|
||||
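The truncation rule in `process_repository` above (`page > max_pages_per_repo and last_page_size == per_page`) is a heuristic: it flags a repository whenever the page cap is hit with a full final page, which also happens when the history ends exactly on a page boundary. The sketch below reproduces the same loop over an in-memory list so the edge cases can be checked. `fetch_with_page_cap` and `make_lister` are hypothetical helpers, not part of the commit, and no GitHub API calls are made.

```python
from typing import Callable


def fetch_with_page_cap(
    list_page: Callable[[int, int], list[str]],
    per_page: int,
    max_pages: int,
) -> tuple[list[str], bool]:
    """Return (rows, truncated) using the same rule as the source loop."""
    rows: list[str] = []
    page = 1
    last_page_size = 0
    while page <= max_pages:
        batch = list_page(page, per_page)
        if not batch:
            break
        last_page_size = len(batch)
        rows.extend(batch)
        if len(batch) < per_page:
            break  # a short page means the history is exhausted
        page += 1
    # True only when the cap was reached and the last page was full.
    truncated = bool(rows) and page > max_pages and last_page_size == per_page
    return rows, truncated


def make_lister(total: int) -> Callable[[int, int], list[str]]:
    """Build a fake paginated endpoint over `total` synthetic SHAs."""
    commits = [f"sha-{i}" for i in range(total)]

    def list_page(page: int, per_page: int) -> list[str]:
        start = (page - 1) * per_page
        return commits[start:start + per_page]

    return list_page


if __name__ == "__main__":
    for total in (3, 4, 5):
        rows, truncated = fetch_with_page_cap(make_lister(total), per_page=2, max_pages=2)
        print(total, len(rows), truncated)
```

With `per_page=2` and `max_pages=2`, a four-commit history is fully fetched yet still reported as truncated, so `success_with_warning` is better read as "possibly truncated". Requesting one extra page before deciding would remove the ambiguity at the cost of one more API call per capped repository.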
@@ -5,8 +5,12 @@ from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
-from github_datapipe.config import GithubConfig
-from github_datapipe.extract_repos import SampleReposOptions, resolve_query, sample_repositories
+from github_datapipe.core.config import GithubConfig
+from github_datapipe.phases.phase1_repository_sampling.service import (
+    SampleReposOptions,
+    resolve_query,
+    sample_repositories,
+)
|
||||
|
||||
|
||||
def test_resolve_query_uses_default_when_missing() -> None:
|
||||
@@ -49,7 +53,7 @@ def test_sample_repositories_dedupes_and_persists(monkeypatch, tmp_path: Path) -
|
||||
return {"total_count": 3, "items": [fake_repo(102, "owner/repo-three")]}
|
||||
|
||||
# Inject the mock sampler into the main code
|
||||
-monkeypatch.setattr("github_datapipe.extract_repos.GithubRepoSampler", FakeSampler)
+monkeypatch.setattr("github_datapipe.phases.phase1_repository_sampling.service.GithubApiClient", FakeSampler)
|
||||
|
||||
options = SampleReposOptions(
|
||||
count=3,
|
||||
|
||||
121
tests/test_phase2.py
Normal file
@@ -0,0 +1,121 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
import requests
|
||||
|
||||
from github_datapipe.phases.phase2_commit_ingestion.service import (
|
||||
FetchCommitsOptions,
|
||||
fetch_commits,
|
||||
)
|
||||
|
||||
|
||||
def test_fetch_commits_writes_normalized_records_and_warning(monkeypatch, tmp_path: Path) -> None:
|
||||
monkeypatch.setenv("github_token", "token")
|
||||
repos_path = tmp_path / "run-test" / "phase1_repository_sampling" / "repos.jsonl"
|
||||
repos_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
repos_path.write_text(json.dumps(fake_repo_record()) + "\n", encoding="utf-8")
|
||||
|
||||
class FakeClient:
|
||||
def __init__(self, token: str, session: requests.Session | None = None) -> None:
|
||||
self.token = token
|
||||
|
||||
def get_repository(self, full_name: str) -> dict:
|
||||
return {"full_name": full_name, "default_branch": "main"}
|
||||
|
||||
def list_commits(self, full_name: str, branch: str, page: int, per_page: int) -> list[dict]:
|
||||
if page == 1:
|
||||
return [fake_commit_payload("a" * 40), fake_commit_payload("b" * 40)]
|
||||
return []
|
||||
|
||||
monkeypatch.setattr("github_datapipe.phases.phase2_commit_ingestion.service.GithubApiClient", FakeClient)
|
||||
|
||||
result = fetch_commits(
|
||||
FetchCommitsOptions(
|
||||
output_root=tmp_path,
|
||||
run_id="run-test",
|
||||
mode="refresh",
|
||||
per_page=2,
|
||||
max_pages_per_repo=1,
|
||||
retry_count=0,
|
||||
)
|
||||
)
|
||||
|
||||
commits_path = Path(result["commits_path"])
|
||||
commit_rows = [json.loads(line) for line in commits_path.read_text(encoding="utf-8").splitlines()]
|
||||
assert len(commit_rows) == 2
|
||||
assert commit_rows[0]["commit_key"] == f"{fake_repo_record()['repo_id']}:{'a' * 40}"
|
||||
|
||||
status_path = Path(result["status_path"])
|
||||
status_rows = [json.loads(line) for line in status_path.read_text(encoding="utf-8").splitlines()]
|
||||
assert status_rows[0]["status"] == "success_with_warning"
|
||||
assert status_rows[0]["truncated"] is True
|
||||
|
||||
|
||||
def test_fetch_commits_resume_skips_completed_repos(monkeypatch, tmp_path: Path) -> None:
|
||||
monkeypatch.setenv("github_token", "token")
|
||||
run_root = tmp_path / "run-test"
|
||||
repos_path = run_root / "phase1_repository_sampling" / "repos.jsonl"
|
||||
repos_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
repos_path.write_text(json.dumps(fake_repo_record()) + "\n", encoding="utf-8")
|
||||
|
||||
phase2_root = run_root / "phase2_commit_ingestion"
|
||||
phase2_root.mkdir(parents=True, exist_ok=True)
|
||||
(phase2_root / "repo_status.jsonl").write_text(
|
||||
json.dumps(
|
||||
{
|
||||
"repo_id": fake_repo_record()["repo_id"],
|
||||
"status": "complete",
|
||||
}
|
||||
)
|
||||
+ "\n",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
class FakeClient:
|
||||
def __init__(self, token: str, session: requests.Session | None = None) -> None:
|
||||
raise AssertionError("Client should not be initialized for already completed repos")
|
||||
|
||||
monkeypatch.setattr("github_datapipe.phases.phase2_commit_ingestion.service.GithubApiClient", FakeClient)
|
||||
|
||||
result = fetch_commits(
|
||||
FetchCommitsOptions(
|
||||
output_root=tmp_path,
|
||||
run_id="run-test",
|
||||
mode="resume",
|
||||
retry_count=0,
|
||||
)
|
||||
)
|
||||
|
||||
assert result["processed_repositories"] == 0
|
||||
|
||||
|
||||
def fake_repo_record() -> dict:
|
||||
return {
|
||||
"run_id": "run-test",
|
||||
"repo_id": 100,
|
||||
"full_name": "owner/repo-one",
|
||||
"html_url": "https://github.com/owner/repo-one",
|
||||
}
|
||||
|
||||
|
||||
def fake_commit_payload(sha: str) -> dict:
|
||||
return {
|
||||
"sha": sha,
|
||||
"html_url": f"https://github.com/owner/repo-one/commit/{sha}",
|
||||
"parents": [{"sha": "parent-sha"}],
|
||||
"commit": {
|
||||
"author": {
|
||||
"name": "Alice",
|
||||
"email": "alice@example.com",
|
||||
"date": "2024-01-01T00:00:00Z",
|
||||
},
|
||||
"committer": {
|
||||
"name": "Bob",
|
||||
"email": "bob@example.com",
|
||||
"date": "2024-01-01T00:00:00Z",
|
||||
},
|
||||
"message": "Initial commit",
|
||||
},
|
||||
}
|
||||
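The resume test above depends on a last-row-wins reading of `repo_status.jsonl`: the most recent status recorded for a repository decides whether it is skipped. A minimal sketch of that rule, independent of the project's IO helpers, is shown here. `latest_status_by_repo` and `repos_to_process` are hypothetical names, and the sample rows are invented for illustration.

```python
import json
from typing import Any


def latest_status_by_repo(jsonl_text: str) -> dict[int, str]:
    """Map each repo_id to the status on its last row in the JSONL text."""
    latest: dict[int, str] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        row: dict[str, Any] = json.loads(line)
        latest[int(row["repo_id"])] = row["status"]
    return latest


def repos_to_process(repo_ids: list[int], statuses: dict[int, str]) -> list[int]:
    """Keep every repo whose latest status is anything other than 'complete'."""
    return [repo_id for repo_id in repo_ids if statuses.get(repo_id) != "complete"]


# Invented status history: repo 100 failed once, then completed on a retry.
SAMPLE = "\n".join(
    json.dumps(row)
    for row in (
        {"repo_id": 100, "status": "failed"},
        {"repo_id": 101, "status": "failed"},
        {"repo_id": 100, "status": "complete"},
    )
)


if __name__ == "__main__":
    statuses = latest_status_by_repo(SAMPLE)
    print(repos_to_process([100, 101, 102], statuses))
```

Under this rule, repositories marked `failed` or `success_with_warning` are retried on resume, which matches the `previous_status == "complete"` check in `fetch_commits`.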