Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
49 lines
1.6 KiB
Bash
Executable File
#!/usr/bin/env bash
#
# Add a single new month of dumps to the existing parquet datasets.
#
# Processes only the RC_<month>.zst and RS_<month>.zst files (Part 1),
# leaving the existing per-source temp parquet files untouched, then
# re-runs the Part 2 Spark sort + repartition over the full temp dir so
# the final by_subreddit / by_author datasets pick up the new data.
#
# Usage:
#   add_new_month.sh YYYY-MM
#
# Example:
#   add_new_month.sh 2025-03
#
# Every command below is independently runnable; to debug, copy a line
# out and run it directly. For a full rebuild instead, see
# build_from_scratch.sh.
#
# Note on cost: Part 2 always re-sorts the full corpus (the sort is global,
# not incremental), so this gets slightly slower each month. For the
# monthly cadence this is fine; if the sort becomes a bottleneck we'd
# need to rearchitect Part 2 to merge-append instead of re-sort.

set -e
cd "$(dirname "$0")"

MONTH="${1:-}"
if [ -z "$MONTH" ]; then
    echo "Usage: $0 YYYY-MM" >&2
    exit 1
fi

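# Optional sketch, not part of the original script: reject a malformed month
# before any parsing starts. is_valid_month is a hypothetical helper, and the
# assumption that dump names always use a zero-padded YYYY-MM suffix is taken
# from the usage line above.
is_valid_month() {
    # Four-digit year, a dash, then a month from 01 to 12.
    local re='^[0-9]{4}-(0[1-9]|1[0-2])$'
    [[ "$1" =~ $re ]]
}

# The empty case is already handled by the usage check above.
if [ -n "${MONTH:-}" ] && ! is_valid_month "$MONTH"; then
    echo "Invalid month '$MONTH'; expected YYYY-MM, e.g. 2025-03" >&2
    exit 1
fi
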
# --- Part 1: parse the new month's dumps (no wipe) -------------------------

# parse the new comments file
python3 comments_part1.py parse_dump "RC_${MONTH}.zst"

# parse the new submissions file
python3 submissions_part1.py parse_dump "RS_${MONTH}.zst"

# --- Part 2: re-sort the full corpus including the new data ---------------

# sort comments and overwrite reddit_comments_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 comments_part2.py

# sort submissions and overwrite reddit_submissions_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 submissions_part2.py
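
# Optional sketch, not part of the original script: log the total wall-clock
# time so the month-over-month growth of the full re-sort (see the note on
# cost in the header) can be tracked. SECONDS is a bash builtin that counts
# seconds since the shell started; the message format is an assumption.
echo "add_new_month.sh ${MONTH:-} finished in ${SECONDS}s" >&2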