move dataset + similarity docs from wiki into repo READMEs

The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe: - datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_*.sh / check_*_shas.py). - similarities/README.md (new): port the wiki's Subreddit Similarity section — TF-IDF math, PMI phrase detection, cosine similarity — with MediaWiki math converted to markdown LaTeX and script references updated to current paths. The wiki page has been trimmed to a landing page that points at these README files in gitea. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
parent 33150243cd
commit 1851132a06
2 changed files with 425 additions and 19 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -5,20 +5,47 @@ This directory holds the pipeline that turns compressed Reddit dump files
 sorted, repartitioned parquet datasets that the rest of the project
 consumes.

-The pipeline has two stages:
+## Pipeline overview

-| Stage | What it does |
+The raw dumps are huge compressed json files with a lot of metadata that
+we may not need. They aren't indexed so it's expensive to pull data from
+just a handful of subreddits. It also turns out that it's a pain to read
+these compressed files straight into spark. Extracting useful variables
+from the dumps and building parquet datasets makes them easier to work
+with. This happens in two steps:
+
+1. Extracting json into (temporary, unpartitioned) parquet files using
+   pyarrow.
+2. Repartitioning and sorting the data using pyspark.
+
+Breaking this down into two steps is useful because it allows us to
+decompress and parse the dumps in the backfill queue and then sort them
+in spark. Partitioning the data makes it possible to efficiently read
+data for specific subreddits or authors. Sorting it means that you can
+efficiently compute aggregations at the subreddit or user level. More
+documentation on using these files is available on the [CDSC wiki][hyak-datasets].
+
+The final datasets are in `/gscratch/comdata/output`:
+
+- `reddit_comments_by_author.parquet` has comments partitioned and sorted
+  by username (lowercase).
+- `reddit_comments_by_subreddit.parquet` has comments partitioned and
+  sorted by subreddit name (lowercase).
+- `reddit_submissions_by_author.parquet` has submissions partitioned and
+  sorted by username (lowercase).
+- `reddit_submissions_by_subreddit.parquet` has submissions partitioned
+  and sorted by subreddit name (lowercase).
+
+[hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets
+
+## Scripts
+
+| Script | Role |
 |---|---|
-| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
-| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
-
-Each stage has a thin entry-point script per dump type:
-
-| Script | Notes |
-|---|---|
-| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
-| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
-| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
+| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
+| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. |
+| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
+| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |

 ## The two workflows

@@ -47,10 +74,10 @@ need a rearchitecture if the cost became a problem.

 ## Running steps individually

-Both `.sh` runners are written so that every meaningful step is a separate,
-self-contained command. If something fails partway through, or you want
-to inspect intermediate state, you can copy any single line out of the
-runner and execute it standalone. For example:
+Both `.sh` runners are written so that every meaningful step is a
+separate, self-contained command. If something fails partway through, or
+you want to inspect intermediate state, you can copy any single line out
+of the runner and execute it standalone. For example:

 ```sh
 # parse one specific file (skipping the rest of the workflow)
@@ -68,10 +95,214 @@ The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
 Hyak-provided wrapper not included in this repo); see the wiki for the
 launch convention.

+## Detailed walkthrough: refreshing the data on Hyak
+
+This walkthrough describes the process we went through updating Reddit
+data from the PushShift cutoff up to the end of 2024. Adapting it for
+newer data should just involve using different academic torrent files
+that start from 2025 onwards. For a single-month update, the
+`add_new_month.sh` workflow above is much shorter; this walkthrough is
+for the bulk-refresh case.
+
+### Prerequisites
+
+- [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config
+  and `.bashrc`)
+- [Go through the Hyak Getting Started tutorial][hyak-syllabus]
+
+Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`):
+
+- [Watchful1's reddit explanation][watchful1-explainer] (separated by
+  subreddit), the [dataset not divided by subreddits][watchful1-bulk],
+  and the [GitHub repo with scripts for analyzing data][watchful1-repo]
+- [RaiderBDev monthly dumps][raiderbdev-monthly] and
+  [RaiderBDev's ArcticShift API][arctic-shift]
+- The [2005-06 to 2024-12 academic torrent][academic-torrent] used for
+  the 2005-2024 refresh
+
+CDSC and Hyak docs:
+
+- [Hyak docs — how to work with modules][hyak-modules]
+- [CDSC — how to download Python or R packages][cdsc-pkgs]
+- [CDSC — Hyak datasets information][hyak-datasets]
+- [CDSC — Hyak Spark information][hyak-spark]
+
+[hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak
+[hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/
+[watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/
+[watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/
+[watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master
+[raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/
+[arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift
+[academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
+[hyak-modules]: https://hyak.uw.edu/docs/tools/modules
+[cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages
+[hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark
+
+### Step 1: data download on Nada and Hyak
+
+We downloaded the [2005-2024 academic torrent][academic-torrent] and put
+it on Nada (~2 days of downloading). We copied the raw data over to
+Hyak's scrubbed directory in a new directory,
+`/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw
+data sorted into `/comments` or `/submissions`. The `/submissions`
+directory shows `RS_20*.zst` files and the `/comments` shows `RC_20*.zst`
+files. (There are no earlier zip files, such as `.bz2` or `.xz`, to deal
+with.)
+
+### Step 2: clone the repo on Hyak
+
+On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the
+working directory next to the raw data, e.g.
+`/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant
+code lives entirely in `datasets/`:
+
+- `dumps_helper.py` — shared parsing and Spark logic
+- `helper.py` — file-open helpers
+- `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
+- `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
+- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+
+The Spark wrapper scripts (`start_spark_and_run.sh`,
+`start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
+they are part of the CDSC Hyak environment and should already be on
+PATH.
+
+### Step 3: smoke-test Part 1 on a single file
+
+Check out `any_machine`. We'll test submissions Part 1 with just one
+file:
+
+```sh
+python3 submissions_part1.py parse_dump RS_2005-06.zst
+```
+
+To verify, go to your output directory and examine the start of the
+file:
+
+```sh
+python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())"
+```
+
+You should see columns like `id`, `author`, `subreddit`, and `title`
+printed out. Repeat the process with `comments_part1.py`; you should see
+columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out.
+
+**Note**: you may have to install relevant libraries before successfully
+running the file:
+
+```sh
+pip install --user pyarrow simdjson zstandard fire
+```
+
+### Step 4: Part 1 — converting `.zst` to `.parquet` files
+
+Now we'll convert all of our `.zst` compressed Reddit data to `.parquet`
+files. First, to generate our task list, we'll run
+
+```sh
+python3 submissions_part1.py gen_task_list
+```
+
+There should be a script, `parse_submissions_task_list`, in the working
+directory. Check the script (`less parse_submissions_task_list`); it
+should have many lines that look like our earlier test command,
+`python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of
+our `.zst` files. Do the same process with comments to generate
+`parse_comments_task_list`.
+
+From a login node, run `tmux` to keep our job running and then
+`any_machine` to check out a node to do computational work. We'll run
+our tasks (from the task list) in parallel to optimize. Start with
+submissions:
+
+```sh
+parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list
+```
+
+The `--joblog` flag creates a text file where you can see which tasks
+completed successfully, and the `--results` flag creates a directory
+where each task has its own stderr output to see the specific error
+(this is best practice for debugging).
+
+Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`).
+We'll ssh into our computational node (`ssh n1234` — you can get the
+node name by running `ourjobs`) and run `htop`
+([more details on htop][htop-explainer]). You should see that the
+machine's CPUs are getting close to 100% usage. If all looks good,
+create a new window and repeat the process for comments.
+
+[htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/
+
+Once the job has successfully completed, you'll see that your CPUs are
+closer to 0% usage in `htop` and your `submissions_joblog.txt` file
+should show an `exitval` of 0 for all commands. Kill your node by
+running `scancel 12345678` (the job ID can be found from `ourjobs`).
+
+### Step 5: verify the per-source parquet files
+
+We'll want to verify our `.parquet` files at this point. We compared the
+new files' number of columns and rows to the old data: from the
+`/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet`
+directory, run
+
+```sh
+diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet)
+```
+
+and confirm there are no differences (same process with submissions).
+This may or may not be relevant if we continue using the same academic
+torrent to update data and have nothing to compare to, but you can still
+check that the new data's number of columns and rows are fairly
+continuous with the most recent data we already have.
+
+### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
+
+If the `.parquet` files reasonably appear to be complete, we can now
+sort them by author and subreddit. The most efficient way to do so is by
+using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node
+splits into up to six slices (four in our current case) so the tasks
+will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog]
+are good resources for further information). Run `tmux` on a login
+node, then grab the whole node for up to a week with:
+
+```sh
+salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G
+```
+
+[hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
+
+Once Slurm drops you onto the compute node, run
+
+```sh
+./start_spark_and_run.sh submissions_part2.py
+```
+
+Monitor via `htop` (as described in Step 4); the CPUs may not always
+show high usage but you should see that memory is being used. Repeat
+for the comments. Successful jobs will result in
+`/gscratch/comdata/output` having four new directories:
+`reddit_submissions_by_author.parquet`,
+`reddit_submissions_by_subreddit.parquet`,
+`reddit_comments_by_author.parquet`, and
+`reddit_comments_by_subreddit.parquet`. Each should contain many
+`snappy.parquet` files (e.g.
+`part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`)
+and `_SUCCESS`.
+
+### Step 7: data verification
+
+Verify and make sure the new data is reasonably complete before deleting
+any of the old data. Do a simple time series to see how many posts there
+are per day and make sure things don't fall off. It is also useful to
+have lab members test out anything they're working on again with the
+new parquet files.
+
 ## See also

 The CDSC wiki page
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
-documents the surrounding workflow — where the raw dump files come from
-(currently ArcticShift via academic torrents), how to stage them on
-Hyak, and how to run Spark jobs on the cluster.
+is the landing page for this project on the wiki and provides
+cross-links to related CDSC and Hyak documentation. The walkthrough
+above used to live there; it now lives here so that doc and code stay
+in sync.