cdsc_reddit/datasets/README.md
Benjamin Mako Hill 33150243cd datasets/: split parquet scripts; share logic in dumps_helper.py
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00


Reddit dumps → sorted parquet datasets

This directory holds the pipeline that turns compressed Reddit dump files (RC_YYYY-MM.zst for comments, RS_YYYY-MM.zst for submissions) into the sorted, repartitioned parquet datasets that the rest of the project consumes.

The pipeline has two stages:

| Stage | What it does |
|-------|--------------|
| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final reddit_*_by_*.parquet datasets. Always re-sorts the full corpus. |

Each stage has a thin entry-point script per dump type:

| Script | Notes |
|--------|-------|
| comments_part1.py, submissions_part1.py | Per-file parse. parse_dump <file> and gen_task_list subcommands via fire. |
| comments_part2.py, submissions_part2.py | Spark sort. Launched via start_spark_and_run.sh. |
| dumps_helper.py | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
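
The handler-dispatch idea behind the shared parse loop can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in dumps_helper.py: the field list, the DERIVED set, the all-None error row, and the use of the standard-library json module (swapped in for simdjson so the sketch runs without extra dependencies) are all hypothetical. It aims to show two things: a handler may return a tuple that the generic loop expands into several columns, and every row, including the one produced on a JSON parse error, has exactly as many elements as the field list.

```python
import json
from datetime import datetime, timezone

# Hypothetical field list for illustration; the real schemas live in dumps_helper.py.
FIELDS = ["id", "subreddit", "author", "edited", "time_edited"]


def _edited(record):
    """Return an (edited, time_edited) pair.

    In the Reddit dumps, 'edited' is either a boolean or an epoch timestamp.
    """
    value = record.get("edited", False)
    if value is None or isinstance(value, bool):
        return (bool(value), None)
    return (True, datetime.fromtimestamp(float(value), tz=timezone.utc))


# Handlers keyed by output field. A handler may return a tuple that
# covers several consecutive output columns.
HANDLERS = {"edited": _edited}

# Columns filled in by another field's handler (assumption of this sketch).
DERIVED = {"time_edited"}


def parse_record(line):
    """Parse one JSON line into a row with exactly len(FIELDS) elements."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    if not isinstance(record, dict):
        # Error path: keep the row the same width as the schema.
        return tuple([None] * len(FIELDS))

    row = []
    for field in FIELDS:
        if field in DERIVED:
            continue  # already emitted by the handler that owns it
        handler = HANDLERS.get(field)
        if handler is None:
            row.append(record.get(field))
            continue
        result = handler(record)
        if isinstance(result, tuple):
            row.extend(result)  # expand multi-column handlers
        else:
            row.append(result)
    return tuple(row)
```

With this shape, adding a field means adding its name to the field list and, if it needs special treatment, one entry in the handler dict.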

The two workflows

There are two ways to run the pipeline; pick the one that matches your situation.

Build from scratch — build_from_scratch.sh

Use this when there is no existing parquet output, or when the upstream data has changed in a way that requires reparsing everything. It wipes the per-source temp directories, processes every RC_* / RS_* dump in the raw dumps directory through Part 1, then runs the Part 2 Spark sort.

Add a new month — add_new_month.sh YYYY-MM

Use this when one or more months of new dump files have arrived and you just want to bring the existing datasets up to date. Processes only the specified month's RC_<MONTH>.zst and RS_<MONTH>.zst files through Part 1 (the existing per-source parquet files are left in place), then re-runs the Part 2 Spark sort over the full temp directory so the final datasets pick up the new data.

The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That cost is acceptable at a monthly cadence; if it became a problem, Part 2 would need to be rearchitected to sort incrementally.
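
As a rough illustration of how a YYYY-MM argument maps to the two dump files for that month, here is a small sketch. The helper name dump_files_for_month and the dumpdir parameter are hypothetical and are not part of add_new_month.sh; only the RC_<MONTH>.zst / RS_<MONTH>.zst naming comes from the description above.

```python
import re
from pathlib import Path

# Accept only well-formed months such as "2025-03".
MONTH_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")


def dump_files_for_month(month, dumpdir="."):
    """Return the comment (RC) and submission (RS) dump paths for one month."""
    if not MONTH_RE.match(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return [Path(dumpdir) / f"{prefix}_{month}.zst" for prefix in ("RC", "RS")]


if __name__ == "__main__":
    for path in dump_files_for_month("2025-03"):
        print(path)
```

Validating the month up front means a typo fails before any Part 1 work starts, rather than after the full re-sort has been queued.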

Running steps individually

Both .sh runners are written so that every meaningful step is a separate, self-contained command. If something fails partway through, or you want to inspect intermediate state, you can copy any single line out of the runner and execute it standalone. For example:

# parse one specific file (skipping the rest of the workflow)
python3 comments_part1.py parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 comments_part1.py parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out

# regenerate just the task list
python3 submissions_part1.py gen_task_list
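
One way to picture a task list is as one parse_dump command per dump file, so that each line can be handed to a parallel job runner. The sketch below assumes that shape; the real output format of gen_task_list is not documented here, and the function name sketch_task_list and its parameters are hypothetical.

```python
from pathlib import Path


def sketch_task_list(dumpdir, script="comments_part1.py", pattern="RC_*.zst"):
    """Build one 'parse_dump' command line per matching dump file.

    Assumption: one task per file, in sorted filename order.
    """
    files = sorted(Path(dumpdir).glob(pattern))
    return [f"python3 {script} parse_dump {f.name}" for f in files]


if __name__ == "__main__":
    # Print the tasks for comment dumps in the current directory.
    for task in sketch_task_list("."):
        print(task)
```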

The Spark Part 2 step is launched via start_spark_and_run.sh (a Hyak-provided wrapper not included in this repo); see the wiki for the launch convention.

See also

The CDSC wiki page CommunityData:CDSC_Reddit documents the surrounding workflow — where the raw dump files come from (currently ArcticShift via academic torrents), how to stage them on Hyak, and how to run Spark jobs on the cluster.