
datasets/: split parquet scripts; share logic in dumps_helper.py

Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.
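
The lazy-import arrangement in the last bullet can be sketched roughly as below. This is a minimal illustration, not the code in dumps_helper.py: the function bodies, the sort_cols parameter, and the path arguments are assumptions made for the example.

```python
import sys


def parse_dump(path):
    """Part 1 stand-in (hypothetical body): needs no Spark at all."""
    return f"parsed {path}"


def sort_and_write(in_path, out_path, sort_cols=("subreddit",)):
    """Part 2 stand-in (hypothetical body and parameters)."""
    # Deferred import: pyspark is loaded only when this function is called,
    # so importing this module for Part 1 work never touches it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(in_path)
    df.sort(*sort_cols).write.parquet(out_path, mode="overwrite")


if __name__ == "__main__":
    parse_dump("RC_2020-01.zst")  # hypothetical file name
    # Reports whether pyspark was loaded as a side effect of Part 1 work.
    print("pyspark" in sys.modules)
```

With this layout a Part 1 entry point can import the shared module and run without pyspark being loaded, and it would not even need pyspark installed on that node.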

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.
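
A rough sketch of the dispatch and tuple expansion described above, under stated assumptions: it uses the stdlib json module instead of simdjson to stay self-contained, the FIELDS list is shortened and hypothetical, the DERIVED set is an invented helper, and the error row of all None values is a guess at the shape rather than the real error-path contents.

```python
import json

# Hypothetical, shortened field list; it must match the schema column for column.
FIELDS = ["id", "author", "edited", "time_edited"]

# Columns filled in by another field's handler rather than read directly (assumed helper).
DERIVED = {"time_edited"}


def _edited(record):
    """Return (edited, time_edited), assuming `edited` is a bool or an epoch timestamp."""
    value = record.get("edited")
    if value is None or isinstance(value, bool):
        return (bool(value), None)
    return (True, value)


# Field name -> handler; fields without a handler are copied straight from the record.
HANDLERS = {"edited": _edited}


def parse_record(line):
    try:
        record = json.loads(line)
    except ValueError:
        # Error path: one slot per FIELDS entry, so the row still aligns with the schema.
        return [None] * len(FIELDS)

    row = []
    for field in FIELDS:
        if field in DERIVED:
            continue  # already emitted by the handler that owns it
        handler = HANDLERS.get(field)
        value = handler(record) if handler else record.get(field)
        if isinstance(value, tuple):
            row.extend(value)  # expand multi-column handler results
        else:
            row.append(value)
    return row
```

Because both the normal path and the error path are driven by the same FIELDS list, a row cannot end up shorter than the schema, which is the mismatch the fix above addresses.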

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00
parent 8965a251b6
commit 33150243cd
10 changed files with 386 additions and 328 deletions


@@ -12,7 +12,7 @@
#
# Prerequisites:
# - raw .zst dumps already staged in the dumpdir locations (see the
-# parquet_part1.py defaults, or override via --dumpdir)
+# defaults in dumps_helper.py, or override via --dumpdir)
# - GNU parallel installed
# - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
#
@@ -31,7 +31,7 @@ TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/reddit_submissions.parquet"
rm -rf "$TEMP_COMMENTS"
# generate the per-file parse task list
-python3 parquet_part1.py comments gen_task_list
+python3 comments_part1.py gen_task_list
# run all comments parse tasks in parallel
parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_task_list
@@ -42,7 +42,7 @@ parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_t
rm -rf "$TEMP_SUBMISSIONS"
# generate the per-file parse task list
-python3 parquet_part1.py submissions gen_task_list
+python3 submissions_part1.py gen_task_list
# run all submissions parse tasks in parallel
parallel --joblog submissions_joblog.txt --results submissions_logs < parse_submissions_task_list
@@ -50,7 +50,7 @@ parallel --joblog submissions_joblog.txt --results submissions_logs < parse_subm
# --- Part 2: spark sort + repartition --------------------------------------
# sort comments and write reddit_comments_by_{subreddit,author}.parquet
-start_spark_and_run.sh 1 parquet_part2.py comments
+start_spark_and_run.sh 1 comments_part2.py
# sort submissions and write reddit_submissions_by_{subreddit,author}.parquet
-start_spark_and_run.sh 1 parquet_part2.py submissions
+start_spark_and_run.sh 1 submissions_part2.py