# Reddit dumps → sorted parquet datasets

This directory holds the pipeline that turns compressed Reddit dump files (`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the sorted, repartitioned parquet datasets that the rest of the project consumes.

The pipeline has two stages:

| Script | What it does |
|---|---|
| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |

Both scripts use a single fire CLI with `comments` and `submissions` subcommands, so the comments and submissions paths share all of their plumbing; only the schema and the JSON parser differ.

## The two workflows

There are two ways to run the pipeline; pick the one that matches your situation.

### Build from scratch: `build_from_scratch.sh`

Use this when there is no existing parquet output, or when the upstream data has changed in a way that requires reparsing everything. It wipes the per-source temp directories, processes every `RC_*` / `RS_*` dump in the raw dumps directory through Part 1, then runs the Part 2 Spark sort.

### Add a new month: `add_new_month.sh YYYY-MM`

Use this when a new month of dump files has arrived and you just want to bring the existing datasets up to date; run it once per new month. It processes only the specified month's `RC_YYYY-MM.zst` and `RS_YYYY-MM.zst` files through Part 1 (the existing per-source parquet files are left in place), then re-runs the Part 2 Spark sort over the full temp directory so the final datasets pick up the new data.

The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That's fine for a monthly cadence; it would need a rearchitecture if the cost became a problem.
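As a rough illustration of which files a monthly add touches, the sketch below maps a `YYYY-MM` argument to the two dump names under the `RC_` / `RS_` convention described at the top of this README. The helpers `dumps_for_month` and `missing_dumps` are hypothetical and not part of this repo; the real selection logic lives in `add_new_month.sh` and may differ.

```python
import re
from pathlib import Path

# Hypothetical helpers for illustration only; not part of this repo.
# Accepts a four-digit year and a month from 01 to 12.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dumps_for_month(month: str) -> dict[str, str]:
    """Return the comments and submissions dump names for one month."""
    if not MONTH_RE.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return {
        "comments": f"RC_{month}.zst",
        "submissions": f"RS_{month}.zst",
    }


def missing_dumps(month: str, dumpdir: Path) -> list[str]:
    """List the expected dump files that are not present in dumpdir."""
    return [
        name
        for name in dumps_for_month(month).values()
        if not (dumpdir / name).exists()
    ]
```

A check like `missing_dumps("2025-03", Path("/path/to/dumps"))` before starting Part 1 would catch a month where only one of the two dumps has been staged, which matters because the Part 2 re-sort over the full corpus is the expensive step.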
## Running steps individually

Both `.sh` runners are written so that every meaningful step is a separate, self-contained command. If something fails partway through, or you want to inspect intermediate state, you can copy any single line out of the runner and execute it standalone. For example:

```sh
# parse one specific file (skipping the rest of the workflow)
python3 parquet_part1.py comments parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out
```

The Spark Part 2 step is launched via `start_spark_and_run.sh` (a Hyak-provided wrapper not included in this repo); see the wiki for the launch convention.

## See also

The CDSC wiki page [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit) documents the surrounding workflow: where the raw dump files come from (currently ArcticShift via academic torrents), how to stage them on Hyak, and how to run Spark jobs on the cluster.