Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:
- dumps_helper.py: schemas, the simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and the parse_dump /
  gen_task_list / sort_and_write workers. The only per-type code is the
  field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS)
  at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.
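
A minimal sketch of the lazy pyspark import described above. The
function signatures and the body of gen_task_list are assumptions made
for illustration, not the actual code in dumps_helper.py:

```python
def gen_task_list(dump_paths):
    """Part 1 style helper (hypothetical body): needs no Spark at all."""
    return sorted(dump_paths)


def sort_and_write(input_path, output_path, sort_cols):
    """Part 2 worker (hypothetical signature): sort a parquet dataset."""
    # Deferred import: pyspark is only loaded when this function is
    # called, so importing the module for Part 1 stays cheap.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(input_path)
    df.sort(*sort_cols).write.parquet(output_path, mode="overwrite")
```

With this layout a Part 1 entry point can import the helper module and
call gen_task_list on a machine that has no pyspark installed; the
import only runs once sort_and_write is invoked.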
Unifies both types on simdjson and drops the json import; simdjson is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.
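
The handler-dict idea can be sketched roughly as follows. The field
names, the converters, and the shape of the config dicts are all
hypothetical, and the stdlib json module stands in for simdjson so the
sketch runs without extra dependencies:

```python
import json  # stand-in for simdjson in this sketch only


def _as_str(value):
    return None if value is None else str(value)


def _as_int(value):
    return None if value is None else int(value)


# Hypothetical handler dicts: one entry per field, in column order.
COMMENT_HANDLERS = {"id": _as_str, "author": _as_str,
                    "body": _as_str, "score": _as_int}
SUBMISSION_HANDLERS = {"id": _as_str, "author": _as_str,
                       "title": _as_str, "score": _as_int}

# Hypothetical type configs: everything type-specific lives here.
COMMENTS = {"name": "comments", "handlers": COMMENT_HANDLERS}
SUBMISSIONS = {"name": "submissions", "handlers": SUBMISSION_HANDLERS}


def parse_record(line, config):
    """Generic loop: look up each field and apply its handler."""
    record = json.loads(line)
    return [handler(record.get(field))
            for field, handler in config["handlers"].items()]


row = parse_record(
    '{"id": "c1", "author": "alice", "body": "hi", "score": "3"}',
    COMMENTS,
)
```

Under this layout, adding a field means adding one entry to a handler
dict, and adding a type means adding one handler dict plus one config
dict; parse_record itself does not change.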
Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns an (edited, time_edited) tuple that the generic parse loop
expands.
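
A rough sketch of the alignment fix, under several assumptions: the
field list is shortened, the raw edited value is taken to be either a
boolean or an edit timestamp, the error row is filled with None, and
stdlib json stands in for simdjson. None of these details are confirmed
by the actual code:

```python
import json  # stand-in for simdjson in this sketch only

# Hypothetical, shortened column list. It now includes time_edited,
# so it has the same length as the schema.
FIELDS = ["id", "body", "edited", "time_edited"]


def _as_str(value):
    return None if value is None else str(value)


def _edited(value):
    """Return an (edited, time_edited) pair from the raw field.

    Assumes the raw value is missing, a boolean, or a timestamp.
    """
    if value is None:
        return (False, None)
    if isinstance(value, bool):
        return (value, None)
    return (True, int(value))


# time_edited has no handler of its own: _edited produces it.
HANDLERS = {"id": _as_str, "body": _as_str, "edited": _edited}


def parse_record(line):
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        # Error-path row: one slot per FIELDS entry, so it lines up
        # with the schema.
        return [None] * len(FIELDS)
    row = []
    for field, handler in HANDLERS.items():
        value = handler(record.get(field))
        if isinstance(value, tuple):
            row.extend(value)  # expand multi-column handlers
        else:
            row.append(value)
    return row


good = parse_record('{"id": "c1", "body": "hi", "edited": 1700000000}')
bad = parse_record("{not json")
```

Both rows have len(FIELDS) elements, which is the invariant the
original FIELDS lists broke for error-path rows.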
Updates the runners and README for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>