# Reddit dumps → sorted parquet datasets This directory holds the pipeline that turns compressed Reddit dump files (`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the sorted, repartitioned parquet datasets that the rest of the project consumes. ## Pipeline overview The raw dumps are huge compressed json files with a lot of metadata that we may not need. They aren't indexed so it's expensive to pull data from just a handful of subreddits. It also turns out that it's a pain to read these compressed files straight into spark. Extracting useful variables from the dumps and building parquet datasets makes them easier to work with. This happens in two steps: 1. Extracting json into (temporary, unpartitioned) parquet files using pyarrow. 2. Repartitioning and sorting the data using pyspark. Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors. Sorting it means that you can efficiently compute aggregations at the subreddit or user level. More documentation on using these files is available on the [CDSC wiki][hyak-datasets]. The final datasets are in `/gscratch/comdata/output`: - `reddit_comments_by_author.parquet` has comments partitioned and sorted by username (lowercase). - `reddit_comments_by_subreddit.parquet` has comments partitioned and sorted by subreddit name (lowercase). - `reddit_submissions_by_author.parquet` has submissions partitioned and sorted by username (lowercase). - `reddit_submissions_by_subreddit.parquet` has submissions partitioned and sorted by subreddit name (lowercase). [hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets ## Scripts | Script | Role | |---|---| | `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump ` and `gen_task_list` subcommands via fire. | | `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. | | `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. | | `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). | ## The two workflows There are two ways to run the pipeline; pick the one that matches your situation. ### Build from scratch — `build_from_scratch.sh` Use this when there is no existing parquet output, or when the upstream data has changed in a way that requires reparsing everything. Wipes the per-source temp directories, processes every `RC_*` / `RS_*` dump in the raw dumps directory through Part 1, then runs the Part 2 Spark sort. ### Add a new month — `add_new_month.sh YYYY-MM` Use this when one or more months of new dump files have arrived and you just want to bring the existing datasets up to date. Processes only the specified month's `RC_.zst` and `RS_.zst` files through Part 1 (the existing per-source parquet files are left in place), then re-runs the Part 2 Spark sort over the full temp directory so the final datasets pick up the new data. The Part 2 sort is global and not incremental, so each monthly add re-sorts the entire corpus. That's fine for a monthly cadence; it would need a rearchitecture if the cost became a problem. ## Running steps individually Both `.sh` runners are written so that every meaningful step is a separate, self-contained command. If something fails partway through, or you want to inspect intermediate state, you can copy any single line out of the runner and execute it standalone. For example: ```sh # parse one specific file (skipping the rest of the workflow) python3 comments_part1.py parse_dump RC_2025-03.zst # override default dump/output paths from the CLI python3 comments_part1.py parse_dump RC_2025-03.zst \ --dumpdir=/tmp/test --outdir=/tmp/out # regenerate just the task list python3 submissions_part1.py gen_task_list ``` The Spark Part 2 step is launched via `start_spark_and_run.sh` (a Hyak-provided wrapper not included in this repo); see the wiki for the launch convention. ## Detailed walkthrough: refreshing the data on Hyak This walkthrough describes the process we went through updating Reddit data from the PushShift cutoff up to the end of 2024. Adapting it for newer data should just involve using different academic torrent files that start from 2025 onwards. For a single-month update, the `add_new_month.sh` workflow above is much shorter; this walkthrough is for the bulk-refresh case. ### Prerequisites - [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config and `.bashrc`) - [Go through the Hyak Getting Started tutorial][hyak-syllabus] Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`): - [Watchful1's reddit explanation][watchful1-explainer] (separated by subreddit), the [dataset not divided by subreddits][watchful1-bulk], and the [GitHub repo with scripts for analyzing data][watchful1-repo] - [RaiderBDev monthly dumps][raiderbdev-monthly] and [RaiderBDev's ArcticShift API][arctic-shift] - The [2005-06 to 2024-12 academic torrent][academic-torrent] used for the 2005-2024 refresh CDSC and Hyak docs: - [Hyak docs — how to work with modules][hyak-modules] - [CDSC — how to download Python or R packages][cdsc-pkgs] - [CDSC — Hyak datasets information][hyak-datasets] - [CDSC — Hyak Spark information][hyak-spark] [hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak [hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/ [watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/ [watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/ [watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master [raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/ [arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift [academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4 [hyak-modules]: https://hyak.uw.edu/docs/tools/modules [cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages [hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark ### Step 1: data download on Nada and Hyak We downloaded the [2005-2024 academic torrent][academic-torrent] and put it on Nada (~2 days of downloading). We copied the raw data over to Hyak's scrubbed directory in a new directory, `/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw data sorted into `/comments` or `/submissions`. The `/submissions` directory shows `RS_20*.zst` files and the `/comments` shows `RC_20*.zst` files. (There are no earlier zip files, such as `.bz2` or `.xz`, to deal with.) ### Step 2: clone the repo on Hyak On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the working directory next to the raw data, e.g. `/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant code lives entirely in `datasets/`: - `dumps_helper.py` — shared parsing and Spark logic - `helper.py` — file-open helpers - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points - `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts The Spark wrapper scripts (`start_spark_and_run.sh`, `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo; they are part of the CDSC Hyak environment and should already be on PATH. ### Step 3: smoke-test Part 1 on a single file Check out `any_machine`. We'll test submissions Part 1 with just one file: ```sh python3 submissions_part1.py parse_dump RS_2005-06.zst ``` To verify, go to your output directory and examine the start of the file: ```sh python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())" ``` You should see columns like `id`, `author`, `subreddit`, and `title` printed out. Repeat the process with `comments_part1.py`; you should see columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out. **Note**: you may have to install relevant libraries before successfully running the file: ```sh pip install --user pyarrow simdjson zstandard fire ``` ### Step 4: Part 1 — converting `.zst` to `.parquet` files Now we'll convert all of our `.zst` compressed Reddit data to `.parquet` files. First, to generate our task list, we'll run ```sh python3 submissions_part1.py gen_task_list ``` There should be a script, `parse_submissions_task_list`, in the working directory. Check the script (`less parse_submissions_task_list`); it should have many lines that look like our earlier test command, `python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of our `.zst` files. Do the same process with comments to generate `parse_comments_task_list`. From a login node, run `tmux` to keep our job running and then `any_machine` to check out a node to do computational work. We'll run our tasks (from the task list) in parallel to optimize. Start with submissions: ```sh parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list ``` The `--joblog` flag creates a text file where you can see which tasks completed successfully, and the `--results` flag creates a directory where each task has its own stderr output to see the specific error (this is best practice for debugging). Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`). We'll ssh into our computational node (`ssh n1234` — you can get the node name by running `ourjobs`) and run `htop` ([more details on htop][htop-explainer]). You should see that the machine's CPUs are getting close to 100% usage. If all looks good, create a new window and repeat the process for comments. [htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/ Once the job has successfully completed, you'll see that your CPUs are closer to 0% usage in `htop` and your `submissions_joblog.txt` file should show an `exitval` of 0 for all commands. Kill your node by running `scancel 12345678` (the job ID can be found from `ourjobs`). ### Step 5: verify the per-source parquet files We'll want to verify our `.parquet` files at this point. We compared the new files' number of columns and rows to the old data: from the `/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet` directory, run ```sh diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet) ``` and confirm there are no differences (same process with submissions). This may or may not be relevant if we continue using the same academic torrent to update data and have nothing to compare to, but you can still check that the new data's number of columns and rows are fairly continuous with the most recent data we already have. ### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark If the `.parquet` files reasonably appear to be complete, we can now sort them by author and subreddit. The most efficient way to do so is by using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node splits into up to six slices (four in our current case) so the tasks will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog] are good resources for further information). Run `tmux` on a login node, then grab the whole node for up to a week with: ```sh salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G ``` [hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/ Once Slurm drops you onto the compute node, run ```sh ./start_spark_and_run.sh submissions_part2.py ``` Monitor via `htop` (as described in Step 4); the CPUs may not always show high usage but you should see that memory is being used. Repeat for the comments. Successful jobs will result in `/gscratch/comdata/output` having four new directories: `reddit_submissions_by_author.parquet`, `reddit_submissions_by_subreddit.parquet`, `reddit_comments_by_author.parquet`, and `reddit_comments_by_subreddit.parquet`. Each should contain many `snappy.parquet` files (e.g. `part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`) and `_SUCCESS`. ### Step 7: data verification Verify and make sure the new data is reasonably complete before deleting any of the old data. Do a simple time series to see how many posts there are per day and make sure things don't fall off. It is also useful to have lab members test out anything they're working on again with the new parquet files. ## See also The CDSC wiki page [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit) is the landing page for this project on the wiki and provides cross-links to related CDSC and Hyak documentation. The walkthrough above used to live there; it now lives here so that doc and code stay in sync.