Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
57 lines
2.0 KiB
Bash
Executable File
#!/usr/bin/env bash
#
# Build the sorted, partitioned Reddit parquet datasets from scratch.
#
# Wipes the per-source temp directories, processes every RC_* and RS_* dump
# in the raw_data dumps directory through Part 1 (per-file, parallel), then
# runs the Part 2 Spark sort + repartition for both comments and submissions.
#
# Every command below is independently runnable; to debug a single stage,
# copy the line out and run it directly. Run the whole script end-to-end
# only when you trust each step.
#
# Prerequisites:
#   - raw .zst dumps already staged in the dumpdir locations (see the
#     defaults in dumps_helper.py, or override via --dumpdir)
#   - GNU parallel installed
#   - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
#
# To add one new month to an existing build instead of rebuilding from
# scratch, use add_new_month.sh.

set -e
cd "$(dirname "$0")"

TEMP_COMMENTS="/gscratch/comdata/output/temp/reddit_comments.parquet"
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/reddit_submissions.parquet"

# --- Part 1a: comments ------------------------------------------------------

# wipe any existing comments temp output
rm -rf "$TEMP_COMMENTS"

# generate the per-file parse task list
python3 comments_part1.py gen_task_list

# run all comments parse tasks in parallel
parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_task_list

# --- Part 1b: submissions ---------------------------------------------------

# wipe any existing submissions temp output
rm -rf "$TEMP_SUBMISSIONS"

# generate the per-file parse task list
python3 submissions_part1.py gen_task_list

# run all submissions parse tasks in parallel
parallel --joblog submissions_joblog.txt --results submissions_logs < parse_submissions_task_list

# --- Part 2: spark sort + repartition --------------------------------------

# sort comments and write reddit_comments_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 comments_part2.py

# sort submissions and write reddit_submissions_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 submissions_part2.py
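
The script records each Part 1 run in a GNU parallel joblog (comments_joblog.txt, submissions_joblog.txt). When debugging a single stage, it can help to list which parse tasks failed before moving on to Part 2. The following is a minimal sketch, not part of the commit: the function name failed_jobs is hypothetical, and it assumes the standard tab-separated joblog layout, where column 7 is Exitval, column 8 is Signal, and column 9 is the command.

```bash
# Sketch (hypothetical helper, not part of the repository):
# print the command of every job in a GNU parallel --joblog file
# that exited non-zero or was killed by a signal.
failed_jobs() {
    # $1 is the joblog path. The first row is the header, so skip it
    # (NR > 1). Fields are tab-separated: 7 = Exitval, 8 = Signal,
    # 9 = Command.
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$1"
}
```

For example, `failed_jobs comments_joblog.txt` would print one line per failed comments parse task, and no output would suggest every task succeeded. GNU parallel normally exits non-zero when any job fails, so `set -e` is expected to stop the script at that point, and a helper like this could then be run by hand against the joblog.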