Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing; only the schema and JSON parser differ between types. Drop the per-source part rolling: each input zst now produces one parquet, since Spark handles big parquet files via internal row groups.

Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything, add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging.

Also fixes several lurking bugs in the original code: the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2; the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort; references to a missing parse_submissions.sh in the old .sh runners; and the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reddit dumps → sorted parquet datasets
This directory holds the pipeline that turns compressed Reddit dump files
(RC_YYYY-MM.zst for comments, RS_YYYY-MM.zst for submissions) into the
sorted, repartitioned parquet datasets that the rest of the project
consumes.
The pipeline has two stages:
| Script | What it does |
|---|---|
| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
Both scripts use a single fire CLI with comments and submissions
subcommands, so the comments and submissions paths share all of their
plumbing — only the schema and the JSON parser differ.
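As a rough illustration of that design, the sketch below keeps one shared parsing loop and swaps in a per-type column list and line parser. It is not the code in `parquet_part1.py`: the names `DumpType`, `make_parser`, `TYPES`, and `parse_lines`, and the short column lists, are hypothetical and chosen only to show the shape. It also uses only the standard library, so the fire dispatch and the parquet writing are left out.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


def make_parser(columns: Tuple[str, ...]) -> Callable[[str], Optional[tuple]]:
    """Build a JSON-line parser that extracts `columns` in order."""
    def parse(line: str) -> Optional[tuple]:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None  # skip malformed lines rather than aborting the file
        if not isinstance(record, dict):
            return None
        # Missing fields become None so every row has the same width.
        return tuple(record.get(name) for name in columns)
    return parse


@dataclass(frozen=True)
class DumpType:
    """Everything that differs between comments and submissions."""
    prefix: str                      # "RC" or "RS"
    columns: Tuple[str, ...]         # illustrative subset, not the real schema
    parse_line: Callable[[str], Optional[tuple]]


# Hypothetical, abbreviated column lists for illustration only.
COMMENT_COLUMNS = ("id", "subreddit", "author", "created_utc", "body")
SUBMISSION_COLUMNS = ("id", "subreddit", "author", "created_utc", "title")

TYPES = {
    "comments": DumpType("RC", COMMENT_COLUMNS, make_parser(COMMENT_COLUMNS)),
    "submissions": DumpType("RS", SUBMISSION_COLUMNS, make_parser(SUBMISSION_COLUMNS)),
}


def parse_lines(kind: str, lines: Iterable[str]) -> List[tuple]:
    """Shared plumbing: the same loop serves both dump types."""
    dump_type = TYPES[kind]
    rows = []
    for line in lines:
        row = dump_type.parse_line(line)
        if row is not None:
            rows.append(row)
    return rows
```

With this layout, supporting another dump type would mean adding one entry to `TYPES` rather than copying a script.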
The two workflows
There are two ways to run the pipeline; pick the one that matches your situation.
Build from scratch — build_from_scratch.sh
Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every RC_* / RS_* dump in the
raw dumps directory through Part 1, then runs the Part 2 Spark sort.
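The file discovery in this workflow can be pictured with the sketch below, which pairs every `RC_*` / `RS_*` dump with one parquet target. The function name `plan_part1`, the `comments/` and `submissions/` subdirectories, and the `<stem>.parquet` naming are assumptions for illustration; the actual layout used by `build_from_scratch.sh` may differ.

```python
from pathlib import Path
from typing import List, Tuple


def plan_part1(dumpdir, outdir) -> List[Tuple[str, Path, Path]]:
    """List (kind, dump file, parquet target) for every dump in dumpdir.

    One parquet target per input .zst; nothing is read or written here.
    """
    dumpdir, outdir = Path(dumpdir), Path(outdir)
    plan = []
    for prefix, kind in (("RC", "comments"), ("RS", "submissions")):
        # sorted() gives a stable, chronological order for YYYY-MM names.
        for dump in sorted(dumpdir.glob(f"{prefix}_*.zst")):
            # Hypothetical naming: RC_2025-03.zst -> comments/RC_2025-03.parquet
            target = outdir / kind / (dump.stem + ".parquet")
            plan.append((kind, dump, target))
    return plan
```

Because each entry in the plan is independent, the Part 1 invocations can run in parallel, one per dump.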
Add a new month — add_new_month.sh YYYY-MM
Use this when one or more months of new dump files have arrived and you
just want to bring the existing datasets up to date. Processes only the
specified month's RC_<MONTH>.zst and RS_<MONTH>.zst files through
Part 1 (the existing per-source parquet files are left in place), then
re-runs the Part 2 Spark sort over the full temp directory so the final
datasets pick up the new data.
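The mapping from the `YYYY-MM` argument to the two dump names can be sketched as follows. The helper `dump_names_for_month` and its validation rule are hypothetical; `add_new_month.sh` may check its argument differently or not at all.

```python
import re

# Four-digit year, a hyphen, and a month from 01 to 12.
_MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_names_for_month(month: str) -> dict:
    """Return the comment and submission dump names for one YYYY-MM month."""
    if not _MONTH_RE.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return {
        "comments": f"RC_{month}.zst",
        "submissions": f"RS_{month}.zst",
    }
```

For example, `dump_names_for_month("2025-03")` yields `RC_2025-03.zst` and `RS_2025-03.zst`, the two files that Part 1 would parse for that month.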
The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That's fine for a monthly cadence; it would need a rearchitecture if the cost became a problem.
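A toy illustration of why a new month cannot simply be appended: rows from the new month interleave with existing rows under the sort key, so concatenating two sorted pieces does not give a sorted whole. The plain Python lists and the `(subreddit, month)` tuples below are stand-ins for the Spark datasets and are not taken from the pipeline.

```python
def is_sorted(rows):
    """True if rows are in non-decreasing order."""
    return all(a <= b for a, b in zip(rows, rows[1:]))


# Existing output and the new month, each sorted by subreddit on its own.
existing = sorted([("askscience", "2025-02"), ("python", "2025-02")])
new_month = sorted([("aww", "2025-03"), ("python", "2025-03")])

# Appending puts "aww" after "python", which breaks the global order.
appended = existing + new_month

# Re-sorting everything restores it, which is what Part 2 does each run.
resorted = sorted(appended)
```

Here `is_sorted(appended)` is false while `is_sorted(resorted)` is true.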
Running steps individually
Both .sh runners are written so that every meaningful step is a separate,
self-contained command. If something fails partway through, or you want
to inspect intermediate state, you can copy any single line out of the
runner and execute it standalone. For example:
```
# parse one specific file (skipping the rest of the workflow)
python3 parquet_part1.py comments parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out
```
The Spark Part 2 step is launched via start_spark_and_run.sh (a
Hyak-provided wrapper not included in this repo); see the wiki for the
launch convention.
See also
The CDSC wiki page CommunityData:CDSC_Reddit documents the surrounding workflow — where the raw dump files come from (currently ArcticShift via academic torrents), how to stage them on Hyak, and how to run Spark jobs on the cluster.