Follows the helper-module pattern used in similarities/.

Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reddit dumps → sorted parquet datasets
This directory holds the pipeline that turns compressed Reddit dump files
(RC_YYYY-MM.zst for comments, RS_YYYY-MM.zst for submissions) into the
sorted, repartitioned parquet datasets that the rest of the project
consumes.
The pipeline has two stages:
| Stage | What it does |
|---|---|
| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final reddit_*_by_*.parquet datasets. Always re-sorts the full corpus. |
Each stage has a thin entry-point script per dump type:
| Script | Notes |
|---|---|
| comments_part1.py, submissions_part1.py | Per-file parse. parse_dump <file> and gen_task_list subcommands via fire. |
| comments_part2.py, submissions_part2.py | Spark sort. Launched via start_spark_and_run.sh. |
| dumps_helper.py | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
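
The generic parse loop with per-field handlers can be pictured roughly as follows. This is a minimal sketch, not the code in dumps_helper.py: it uses the standard-library json module as a stand-in for simdjson, and the field list, the DERIVED set, and the None-filled error row are assumptions made for illustration. It also shows the property the error path needs: every row has exactly as many values as FIELDS, including rows that failed to parse.

```python
import json

# Hypothetical field list for illustration; the real FIELDS lists are
# defined per dump type in dumps_helper.py.
FIELDS = ["id", "subreddit", "author", "edited", "time_edited"]


def _edited(record):
    """Split Reddit's mixed-type `edited` value into (edited, time_edited)."""
    value = record.get("edited", False)
    # bool is a subclass of int, so rule it out before treating the value
    # as an edit timestamp.
    if value is None or isinstance(value, bool):
        return bool(value), None
    return True, int(value)


# Fields that need more than a plain lookup. A handler that returns a tuple
# fills as many columns as the tuple has elements.
HANDLERS = {"edited": _edited}

# Columns already filled by another field's handler (assumed mechanism).
DERIVED = {"time_edited"}


def parse_record(line):
    """Return one row with exactly len(FIELDS) values, even on a parse error."""
    try:
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError("expected a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return [None] * len(FIELDS)

    row = []
    for field in FIELDS:
        if field in DERIVED:
            continue
        handler = HANDLERS.get(field)
        value = handler(record) if handler else record.get(field)
        if isinstance(value, tuple):
            row.extend(value)  # expand multi-column handlers in place
        else:
            row.append(value)
    return row
```

With this shape, supporting a new field means adding its name to FIELDS and, if it needs special treatment, one entry in HANDLERS.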
The two workflows
There are two ways to run the pipeline; pick the one that matches your situation.
Build from scratch — build_from_scratch.sh
Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every RC_* / RS_* dump in the
raw dumps directory through Part 1, then runs the Part 2 Spark sort.
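
As a rough illustration of the "process every dump through Part 1" step, a task list can be built by globbing the raw dumps directory and emitting one parse_dump command per file. The function name build_task_list, its parameters, and the one-command-per-line output are hypothetical; the real gen_task_list worker in dumps_helper.py may differ.

```python
from pathlib import Path


def build_task_list(dumpdir, prefix, script):
    """Return one Part 1 command per matching dump file, sorted by filename.

    prefix is "RC" for comments or "RS" for submissions; script is the
    matching Part 1 entry point.
    """
    dumps = sorted(Path(dumpdir).glob(f"{prefix}_*.zst"))
    return [f"python3 {script} parse_dump {dump.name}" for dump in dumps]
```

Because each command handles exactly one file, the resulting lines can be run in any order or in parallel.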
Add a new month — add_new_month.sh YYYY-MM
Use this when one or more months of new dump files have arrived and you
just want to bring the existing datasets up to date. Processes only the
specified month's RC_<MONTH>.zst and RS_<MONTH>.zst files through
Part 1 (the existing per-source parquet files are left in place), then
re-runs the Part 2 Spark sort over the full temp directory so the final
datasets pick up the new data.
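
The mapping from the YYYY-MM argument to the two dump files is simple enough to sketch. The runner itself is a shell script; the Python below only illustrates the naming rule, and dump_names_for_month is a hypothetical helper that does not exist in this repo.

```python
import re

# Four-digit year, a dash, then a two-digit month from 01 to 12.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_names_for_month(month):
    """Return the comments and submissions dump filenames for one month."""
    if not MONTH_RE.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return [f"RC_{month}.zst", f"RS_{month}.zst"]
```

Validating the argument up front avoids silently running Part 1 against a filename that cannot exist, such as RC_2025-3.zst.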
The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That's fine for a monthly cadence; it would need a rearchitecture if the cost became a problem.
Running steps individually
Both .sh runners are written so that every meaningful step is a separate,
self-contained command. If something fails partway through, or you want
to inspect intermediate state, you can copy any single line out of the
runner and execute it standalone. For example:
```shell
# parse one specific file (skipping the rest of the workflow)
python3 comments_part1.py parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 comments_part1.py parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out

# regenerate just the task list
python3 submissions_part1.py gen_task_list
```
The Spark Part 2 step is launched via start_spark_and_run.sh (a
Hyak-provided wrapper not included in this repo); see the wiki for the
launch convention.
See also
The CDSC wiki page CommunityData:CDSC_Reddit documents the surrounding workflow — where the raw dump files come from (currently ArcticShift via academic torrents), how to stage them on Hyak, and how to run Spark jobs on the cluster.