datasets/: split parquet scripts; share logic in dumps_helper.py

Follows the helper-module pattern used in similarities/.

Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
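
The handler dispatch and tuple expansion described in the message can be sketched roughly as follows. This is an illustration only, not the code in dumps_helper.py: it uses the standard-library json module in place of simdjson, and the field list, the DERIVED_FIELDS set, and the body of the handler are assumptions made for the example.

```python
import json

# Hypothetical field list; the real FIELDS lists live in dumps_helper.py
# and are said to match the parquet schemas exactly.
FIELDS = ["id", "author", "edited", "time_edited"]

# Hypothetical: fields that are filled in by another field's handler.
DERIVED_FIELDS = {"time_edited"}


def _edited(record):
    """Return an (edited, time_edited) pair from the raw 'edited' value.

    Assumption: 'edited' is either a boolean or a Unix timestamp.
    """
    raw = record.get("edited", False)
    if raw is None or raw is False:
        return (False, None)
    if raw is True:
        return (True, None)
    return (True, int(raw))


# Fields without an entry fall back to a plain dict lookup.
HANDLERS = {"edited": _edited}


def parse_record(line):
    """Turn one JSON line into a row with exactly len(FIELDS) elements."""
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # Error path: one None per field, so the row still lines up
        # with the schema.
        return [None] * len(FIELDS)

    row = []
    for field in FIELDS:
        if field in DERIVED_FIELDS:
            continue  # already produced by a tuple-returning handler
        handler = HANDLERS.get(field)
        value = handler(record) if handler else record.get(field)
        if isinstance(value, tuple):
            row.extend(value)  # expand (edited, time_edited)
        else:
            row.append(value)
    return row
```

In this sketch the error-path row is built from `len(FIELDS)`, so it cannot drift out of step with the field list the way the commit message describes for the original scripts.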
@@ -7,14 +7,18 @@ consumes.
 
 The pipeline has two stages:
 
-| Script | What it does |
+| Stage | What it does |
 |---|---|
-| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
-| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
+| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
+| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
 
-Both scripts use a single fire CLI with `comments` and `submissions`
-subcommands, so the comments and submissions paths share all of their
-plumbing — only the schema and the JSON parser differ.
+Each stage has a thin entry-point script per dump type:
+
+| Script | Notes |
+|---|---|
+| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
+| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
+| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
 
 ## The two workflows
 
@@ -50,11 +54,14 @@ runner and execute it standalone. For example:
 
 ```sh
 # parse one specific file (skipping the rest of the workflow)
-python3 parquet_part1.py comments parse_dump RC_2025-03.zst
+python3 comments_part1.py parse_dump RC_2025-03.zst
 
 # override default dump/output paths from the CLI
-python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
+python3 comments_part1.py parse_dump RC_2025-03.zst \
 --dumpdir=/tmp/test --outdir=/tmp/out
+
+# regenerate just the task list
+python3 submissions_part1.py gen_task_list
 ```
 
 The Spark Part 2 step is launched via `start_spark_and_run.sh` (a