datasets/: split parquet scripts; share logic in dumps_helper.py

Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00
parent 8965a251b6
commit 33150243cd
10 changed files with 386 additions and 328 deletions
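The handler dispatch and the error-row fix described in the commit message can be illustrated with a minimal sketch. This is not the code in dumps_helper.py: the field list, the handler dict name (`COMMENT_HANDLERS`), the `parse_record` signature, and the all-`None` error row are assumptions made for illustration, and the stdlib `json` module stands in for simdjson so the sketch has no third-party dependencies.

```python
import json  # stand-in for simdjson so this sketch needs only the stdlib

# Hypothetical column list. The point of the fix is that it matches the
# schema column-for-column, including time_edited.
FIELDS = ["id", "author", "edited", "time_edited"]


def _edited(record):
    """Return an (edited, time_edited) tuple from the raw `edited` value.

    Assumes `edited` is either a boolean or an epoch timestamp.
    """
    value = record.get("edited", False)
    if value is None or isinstance(value, bool):
        return (bool(value), None)
    return (True, int(value))


# Hypothetical per-type handler dict: one entry per source field.
# A handler may return a single value or a tuple spanning several columns.
COMMENT_HANDLERS = {
    "id": lambda rec: rec.get("id"),
    "author": lambda rec: rec.get("author"),
    "edited": _edited,
}


def parse_record(line, handlers, fields):
    """Parse one JSON line into a row with exactly len(fields) elements."""
    try:
        record = json.loads(line)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        record = None
    if not isinstance(record, dict):
        # Error path: pad to the full schema width so the row still aligns.
        return [None] * len(fields)

    row = []
    for handler in handlers.values():
        out = handler(record)
        if isinstance(out, tuple):
            row.extend(out)  # expand multi-column handlers such as _edited
        else:
            row.append(out)
    return row


good = parse_record(
    '{"id": "abc", "author": "alice", "edited": 1700000000.0}',
    COMMENT_HANDLERS,
    FIELDS,
)
bad = parse_record("{not json", COMMENT_HANDLERS, FIELDS)
print(good)  # ['abc', 'alice', True, 1700000000]
print(bad)   # [None, None, None, None]
```

Because the error row is built from `len(fields)` rather than from a separately maintained list, a parse failure cannot produce a row that is narrower than the schema, which is the class of mismatch the commit message describes.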
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -7,14 +7,18 @@ consumes.

 The pipeline has two stages:

-| Script | What it does |
+| Stage | What it does |
 |---|---|
-| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
-| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
+| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
+| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |

-Both scripts use a single fire CLI with `comments` and `submissions`
-subcommands, so the comments and submissions paths share all of their
-plumbing — only the schema and the JSON parser differ.
+Each stage has a thin entry-point script per dump type:
+
+| Script | Notes |
+|---|---|
+| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
+| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
+| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |

 ## The two workflows

@@ -50,11 +54,14 @@ runner and execute it standalone. For example:

 ```sh
 # parse one specific file (skipping the rest of the workflow)
-python3 parquet_part1.py comments parse_dump RC_2025-03.zst
+python3 comments_part1.py parse_dump RC_2025-03.zst

 # override default dump/output paths from the CLI
-python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
+python3 comments_part1.py parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out
+
+# regenerate just the task list
+python3 submissions_part1.py gen_task_list
 ```

 The Spark Part 2 step is launched via `start_spark_and_run.sh` (a