cdsc_reddit/datasets/README.md
Benjamin Mako Hill 8965a251b6 refactor datasets/ pipeline; add build/add-month workflows
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source part
rolling; one parquet per input zst, since Spark handles big parquet
files via internal row groups.

Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.

Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:30:54 -07:00


Reddit dumps → sorted parquet datasets

This directory holds the pipeline that turns compressed Reddit dump files (RC_YYYY-MM.zst for comments, RS_YYYY-MM.zst for submissions) into the sorted, repartitioned parquet datasets that the rest of the project consumes.

The pipeline has two stages:

| Script | What it does |
| --- | --- |
| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |

Both scripts use a single fire CLI with comments and submissions subcommands, so the comments and submissions paths share all of their plumbing — only the schema and the JSON parser differ.
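
As a rough illustration of that design, the sketch below keeps everything type-specific in one small record and routes both types through the same loop. It is not the code in parquet_part1.py: the names SourceType, SOURCES and parse_lines, and the column lists, are hypothetical stand-ins for the real schemas, and the fire wiring and parquet writing are left out so the example runs with the standard library alone.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class SourceType:
    """Everything that differs between comments and submissions."""
    prefix: str                     # dump filename prefix: RC or RS
    columns: Tuple[str, ...]        # hypothetical stand-in for the real schema
    parse: Callable[[str], dict]    # one JSON line -> one row dict


def _parser_for(columns: Tuple[str, ...]) -> Callable[[str], dict]:
    def parse(line: str) -> dict:
        obj = json.loads(line)
        # Missing fields become None so every row has the same columns.
        return {col: obj.get(col) for col in columns}
    return parse


# Illustrative column lists only; the real schemas are much longer.
COMMENT_COLS = ("id", "subreddit", "author", "body")
SUBMISSION_COLS = ("id", "subreddit", "author", "title")

SOURCES: Dict[str, SourceType] = {
    "comments": SourceType("RC", COMMENT_COLS, _parser_for(COMMENT_COLS)),
    "submissions": SourceType("RS", SUBMISSION_COLS, _parser_for(SUBMISSION_COLS)),
}


def parse_lines(kind: str, lines: Iterable[str]) -> Iterator[dict]:
    """Shared plumbing: identical for both types apart from the lookup."""
    source = SOURCES[kind]
    for line in lines:
        if line.strip():            # skip blank lines
            yield source.parse(line)
```

With this shape, adding a third source type would mean adding one entry to SOURCES rather than another script.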

The two workflows

There are two ways to run the pipeline; pick the one that matches your situation.

Build from scratch — build_from_scratch.sh

Use this when there is no existing parquet output, or when the upstream data has changed in a way that requires reparsing everything. Wipes the per-source temp directories, processes every RC_* / RS_* dump in the raw dumps directory through Part 1, then runs the Part 2 Spark sort.
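
The Part 1 sweep can be pictured as pairing every dump with the single parquet file it produces. The sketch below is an assumption-laden illustration, not the logic in build_from_scratch.sh: plan_part1 is a hypothetical helper, and the output layout (one subdirectory per type, same basename with a .parquet suffix) is a guess made for the example.

```python
from pathlib import Path


def plan_part1(dumpdir, outdir):
    """Pair every RC_*/RS_* dump with the parquet file it would produce."""
    dumpdir, outdir = Path(dumpdir), Path(outdir)
    jobs = []
    for prefix, kind in (("RC", "comments"), ("RS", "submissions")):
        # Sorted so the plan is deterministic across runs.
        for dump in sorted(dumpdir.glob(f"{prefix}_*.zst")):
            # Assumed naming: RC_2025-03.zst -> <outdir>/comments/RC_2025-03.parquet
            target = outdir / kind / (dump.stem + ".parquet")
            jobs.append((kind, dump, target))
    return jobs
```

Because each job touches only its own input and output file, the resulting list can be handed to any parallel runner.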

Add a new month — add_new_month.sh YYYY-MM

Use this when new dump files have arrived and you just want to bring the existing datasets up to date. The script takes a single month (YYYY-MM) per invocation. It processes only that month's RC_<MONTH>.zst and RS_<MONTH>.zst files through Part 1 (the existing per-source parquet files are left in place), then re-runs the Part 2 Spark sort over the full temp directory so the final datasets pick up the new data.

The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That's fine for a monthly cadence; it would need a rearchitecture if the cost became a problem.
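
To make the month-to-file mapping concrete, here is a minimal sketch of how a YYYY-MM argument could be validated and turned into the two dump filenames. The helper dumps_for_month is hypothetical and the validation rule is an assumption; add_new_month.sh may check its argument differently or not at all.

```python
import re

# Four-digit year, a dash, and a two-digit month from 01 to 12.
_MONTH_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dumps_for_month(month):
    """Return the comment and submission dump names for one YYYY-MM month."""
    # fullmatch rejects trailing characters, including a stray newline.
    if not _MONTH_PATTERN.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return {
        "comments": f"RC_{month}.zst",
        "submissions": f"RS_{month}.zst",
    }
```

Rejecting a malformed month up front avoids starting the full Part 2 re-sort after a Part 1 step that found nothing to parse.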

Running steps individually

Both .sh runners are written so that every meaningful step is a separate, self-contained command. If something fails partway through, or you want to inspect intermediate state, you can copy any single line out of the runner and execute it standalone. For example:

# parse one specific file (skipping the rest of the workflow)
python3 parquet_part1.py comments parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out

The Spark Part 2 step is launched via start_spark_and_run.sh (a Hyak-provided wrapper not included in this repo); see the wiki for the launch convention.

See also

The CDSC wiki page CommunityData:CDSC_Reddit documents the surrounding workflow — where the raw dump files come from (currently ArcticShift via academic torrents), how to stage them on Hyak, and how to run Spark jobs on the cluster.