18
0

datasets/: split parquet scripts; share logic in dumps_helper.py

Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
2026-05-25 16:51:41 -07:00
parent 8965a251b6
commit 33150243cd
10 changed files with 386 additions and 328 deletions

View File

@@ -7,14 +7,18 @@ consumes.
The pipeline has two stages:
| Script | What it does |
| Stage | What it does |
|---|---|
| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
Both scripts use a single fire CLI with `comments` and `submissions`
subcommands, so the comments and submissions paths share all of their
plumbing — only the schema and the JSON parser differ.
Each stage has a thin entry-point script per dump type:
| Script | Notes |
|---|---|
| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
## The two workflows
@@ -50,11 +54,14 @@ runner and execute it standalone. For example:
```sh
# parse one specific file (skipping the rest of the workflow)
python3 parquet_part1.py comments parse_dump RC_2025-03.zst
python3 comments_part1.py parse_dump RC_2025-03.zst
# override default dump/output paths from the CLI
python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
python3 comments_part1.py parse_dump RC_2025-03.zst \
--dumpdir=/tmp/test --outdir=/tmp/out
# regenerate just the task list
python3 submissions_part1.py gen_task_list
```
The Spark Part 2 step is launched via `start_spark_and_run.sh` (a