move dataset + similarity docs from wiki into repo READMEs

The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe: - datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_*.sh / check_*_shas.py). - similarities/README.md (new): port the wiki's Subreddit Similarity section — TF-IDF math, PMI phrase detection, cosine similarity — with MediaWiki math converted to markdown LaTeX and script references updated to current paths. The wiki page has been trimmed to a landing page that points at these README files in gitea. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
parent 33150243cd
commit 1851132a06
2 changed files with 425 additions and 19 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -5,20 +5,47 @@ This directory holds the pipeline that turns compressed Reddit dump files
 sorted, repartitioned parquet datasets that the rest of the project
 consumes.
-The pipeline has two stages:
+## Pipeline overview
-| Stage | What it does |
+The raw dumps are huge compressed json files with a lot of metadata that
 we may not need. They aren't indexed so it's expensive to pull data from
 just a handful of subreddits. It also turns out that it's a pain to read
 these compressed files straight into spark. Extracting useful variables
 from the dumps and building parquet datasets makes them easier to work
 with. This happens in two steps:
 1. Extracting json into (temporary, unpartitioned) parquet files using
   pyarrow.
 2. Repartitioning and sorting the data using pyspark.
 Breaking this down into two steps is useful because it allows us to
 decompress and parse the dumps in the backfill queue and then sort them
 in spark. Partitioning the data makes it possible to efficiently read
 data for specific subreddits or authors. Sorting it means that you can
 efficiently compute aggregations at the subreddit or user level. More
 documentation on using these files is available on the [CDSC wiki][hyak-datasets].
 The final datasets are in `/gscratch/comdata/output`:
 - `reddit_comments_by_author.parquet` has comments partitioned and sorted
  by username (lowercase).
 - `reddit_comments_by_subreddit.parquet` has comments partitioned and
  sorted by subreddit name (lowercase).
 - `reddit_submissions_by_author.parquet` has submissions partitioned and
  sorted by username (lowercase).
 - `reddit_submissions_by_subreddit.parquet` has submissions partitioned
  and sorted by subreddit name (lowercase).
 [hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets
 ## Scripts
 | Script | Role |
 |---|---|
-| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
+| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
-| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |
+| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. |
-
+| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
-Each stage has a thin entry-point script per dump type:
+| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |
 | Script | Notes |
 |---|---|
 | `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
 | `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
 | `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
 ## The two workflows
@@ -47,10 +74,10 @@ need a rearchitecture if the cost became a problem.
 ## Running steps individually
-Both `.sh` runners are written so that every meaningful step is a separate,
+Both `.sh` runners are written so that every meaningful step is a
-self-contained command. If something fails partway through, or you want
+separate, self-contained command. If something fails partway through, or
-to inspect intermediate state, you can copy any single line out of the
+you want to inspect intermediate state, you can copy any single line out
-runner and execute it standalone. For example:
+of the runner and execute it standalone. For example:
 ```sh
 # parse one specific file (skipping the rest of the workflow)
@@ -68,10 +95,214 @@ The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
 Hyak-provided wrapper not included in this repo); see the wiki for the
 launch convention.
 ## Detailed walkthrough: refreshing the data on Hyak
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
 that start from 2025 onwards. For a single-month update, the
 `add_new_month.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 ### Prerequisites
 - [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config
  and `.bashrc`)
 - [Go through the Hyak Getting Started tutorial][hyak-syllabus]
 Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`):
 - [Watchful1's reddit explanation][watchful1-explainer] (separated by
  subreddit), the [dataset not divided by subreddits][watchful1-bulk],
  and the [GitHub repo with scripts for analyzing data][watchful1-repo]
 - [RaiderBDev monthly dumps][raiderbdev-monthly] and
  [RaiderBDev's ArcticShift API][arctic-shift]
 - The [2005-06 to 2024-12 academic torrent][academic-torrent] used for
  the 2005-2024 refresh
 CDSC and Hyak docs:
 - [Hyak docs — how to work with modules][hyak-modules]
 - [CDSC — how to download Python or R packages][cdsc-pkgs]
 - [CDSC — Hyak datasets information][hyak-datasets]
 - [CDSC — Hyak Spark information][hyak-spark]
 [hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak
 [hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/
 [watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/
 [watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/
 [watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master
 [raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/
 [arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift
 [academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
 [hyak-modules]: https://hyak.uw.edu/docs/tools/modules
 [cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages
 [hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark
 ### Step 1: data download on Nada and Hyak
 We downloaded the [2005-2024 academic torrent][academic-torrent] and put
 it on Nada (~2 days of downloading). We copied the raw data over to
 Hyak's scrubbed directory in a new directory,
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw
 data sorted into `/comments` or `/submissions`. The `/submissions`
 directory shows `RS_20*.zst` files and the `/comments` shows `RC_20*.zst`
 files. (There are no earlier zip files, such as `.bz2` or `.xz`, to deal
 with.)
 ### Step 2: clone the repo on Hyak
 On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the
 working directory next to the raw data, e.g.
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant
 code lives entirely in `datasets/`:
 - `dumps_helper.py` — shared parsing and Spark logic
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
 - `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
 they are part of the CDSC Hyak environment and should already be on
 PATH.
 ### Step 3: smoke-test Part 1 on a single file
 Check out `any_machine`. We'll test submissions Part 1 with just one
 file:
 ```sh
 python3 submissions_part1.py parse_dump RS_2005-06.zst
 ```
 To verify, go to your output directory and examine the start of the
 file:
 ```sh
 python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())"
 ```
 You should see columns like `id`, `author`, `subreddit`, and `title`
 printed out. Repeat the process with `comments_part1.py`; you should see
 columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out.
 **Note**: you may have to install relevant libraries before successfully
 running the file:
 ```sh
 pip install --user pyarrow simdjson zstandard fire
 ```
 ### Step 4: Part 1 — converting `.zst` to `.parquet` files
 Now we'll convert all of our `.zst` compressed Reddit data to `.parquet`
 files. First, to generate our task list, we'll run
 ```sh
 python3 submissions_part1.py gen_task_list
 ```
 There should be a script, `parse_submissions_task_list`, in the working
 directory. Check the script (`less parse_submissions_task_list`); it
 should have many lines that look like our earlier test command,
 `python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of
 our `.zst` files. Do the same process with comments to generate
 `parse_comments_task_list`.
 From a login node, run `tmux` to keep our job running and then
 `any_machine` to check out a node to do computational work. We'll run
 our tasks (from the task list) in parallel to optimize. Start with
 submissions:
 ```sh
 parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list
 ```
 The `--joblog` flag creates a text file where you can see which tasks
 completed successfully, and the `--results` flag creates a directory
 where each task has its own stderr output to see the specific error
 (this is best practice for debugging).
 Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`).
 We'll ssh into our computational node (`ssh n1234` — you can get the
 node name by running `ourjobs`) and run `htop`
 ([more details on htop][htop-explainer]). You should see that the
 machine's CPUs are getting close to 100% usage. If all looks good,
 create a new window and repeat the process for comments.
 [htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/
 Once the job has successfully completed, you'll see that your CPUs are
 closer to 0% usage in `htop` and your `submissions_joblog.txt` file
 should show an `exitval` of 0 for all commands. Kill your node by
 running `scancel 12345678` (the job ID can be found from `ourjobs`).
 ### Step 5: verify the per-source parquet files
 We'll want to verify our `.parquet` files at this point. We compared the
 new files' number of columns and rows to the old data: from the
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet`
 directory, run
 ```sh
 diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet)
 ```
 and confirm there are no differences (same process with submissions).
 This may or may not be relevant if we continue using the same academic
 torrent to update data and have nothing to compare to, but you can still
 check that the new data's number of columns and rows are fairly
 continuous with the most recent data we already have.
 ### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
 If the `.parquet` files reasonably appear to be complete, we can now
 sort them by author and subreddit. The most efficient way to do so is by
 using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node
 splits into up to six slices (four in our current case) so the tasks
 will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog]
 are good resources for further information). Run `tmux` on a login
 node, then grab the whole node for up to a week with:
 ```sh
 salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G
 ```
 [hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
 Once Slurm drops you onto the compute node, run
 ```sh
 ./start_spark_and_run.sh submissions_part2.py
 ```
 Monitor via `htop` (as described in Step 4); the CPUs may not always
 show high usage but you should see that memory is being used. Repeat
 for the comments. Successful jobs will result in
 `/gscratch/comdata/output` having four new directories:
 `reddit_submissions_by_author.parquet`,
 `reddit_submissions_by_subreddit.parquet`,
 `reddit_comments_by_author.parquet`, and
 `reddit_comments_by_subreddit.parquet`. Each should contain many
 `snappy.parquet` files (e.g.
 `part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`)
 and `_SUCCESS`.
 ### Step 7: data verification
 Verify and make sure the new data is reasonably complete before deleting
 any of the old data. Do a simple time series to see how many posts there
 are per day and make sure things don't fall off. It is also useful to
 have lab members test out anything they're working on again with the
 new parquet files.
 ## See also
 The CDSC wiki page
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
-documents the surrounding workflow — where the raw dump files come from
+is the landing page for this project on the wiki and provides
-(currently ArcticShift via academic torrents), how to stage them on
+cross-links to related CDSC and Hyak documentation. The walkthrough
-Hyak, and how to run Spark jobs on the cluster.
+above used to live there; it now lives here so that doc and code stay
 in sync.
--- a/similarities/README.md
+++ b/similarities/README.md
@@ -0,0 +1,175 @@
 # Subreddit similarity
 This directory holds the code that computes pairwise similarities between
 subreddits — both term-based (from TF-IDF over comment text) and
 author-based (from overlapping commenter sets). Similarity matrices
 produced here feed downstream clustering (`../clustering/`) and density
 analysis (`../density/`).
 ## Datasets
 Subreddit similarity datasets based on comment terms and comment authors
 are available on Hyak in `/gscratch/comdata/output/reddit_similarity`.
 The overall approach to subreddit similarity seems to work reasonably
 well and the code is stabilizing. If you want help using these
 similarities in a project, just reach out to
 [Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29).
 By default, the scripts here take a `TopN` parameter which selects the
 subreddits to include in the similarity dataset according to how many
 total comments they have. You can alternatively pass a value to the
 `included_subreddits` parameter for a file with the names of the
 subreddits you would like to include on each line.
 ## Scripts
 | Script | What it does |
 |---|---|
 | `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. |
 | `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. |
 | `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. |
 | `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. |
 | `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. |
 | `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
 | `Makefile` | Wires everything together with the canonical Hyak output paths. |
 ## Methods
 [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and
 simple information-retrieval technique that we can use to quantify the
 topic of a subreddit. The goal of TF-IDF is to build a vector for each
 subreddit that scores every term (or phrase) according to how
 characteristic it is of the overall lexicon used in that subreddit. For
 example, the most characteristic terms in the subreddit `/r/christianity`
 in the current version of the TF-IDF model are:
 | Term         | tf_idf |
 |:------------:|:------:|
 | christians   | 0.581  |
 | christianity | 0.569  |
 | kjv          | 0.568  |
 | bible        | 0.557  |
 | scripture    | 0.55   |
 TF-IDF stands for "term frequency — inverse document frequency" because
 it is the product of two terms "term frequency" and "inverse document
 frequency." Term frequency quantifies the amount that a term appears in
 a subreddit (document). Inverse document frequency quantifies how much
 that term appears in other subreddits (documents). As you can see on
 the Wikipedia page, there are many possible ways of constructing and
 combining these terms.
 I chose to normalize term frequency by the maximum (raw) term frequency
 for each subreddit:
 $$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$
 I use the log inverse document frequency:
 $$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
 I then combine them using some smoothing to get:
 $$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
 (Other normalization strategies are worth trying — see the note in
 `TODO`.)
 ### Building TF-IDF vectors
 The process for building TF-IDF vectors has four steps:
 1. Extracting terms using `../ngrams/tf_comments.py`
 2. Detecting common phrases using `../ngrams/top_comment_phrases.py`
 3. Extracting terms and common phrases using
   `../ngrams/tf_comments.py --mwe-pass='second'`
 4. Building IDF and TF-IDF scores in `tfidf.py`
 #### Running `tf_comments.py` on the backfill queue
 The main reason that I did it in four steps instead of one is to take
 advantage of the backfill queue for running `tf_comments.py`. This step
 requires reading all of the text in every comment and converting it to
 a bag of words at the subreddit level. This is a lot of computation
 that is easily parallelizable. The script `../ngrams/run_tf_jobs.sh`
 partially automates running steps 1 (or 3) on the backfill queue.
 #### Phrase detection using pointwise mutual information
 TF-IDF is simple, but only uses single words (unigrams). Sequences of
 multiple words can be important to account for how words have different
 meanings in different contexts or how sequences of words refer to
 distinct things like names. Dealing with context or longer sequences of
 words is a common challenge in natural language processing since the
 number of possible n-grams grows like crazy as n gets bigger. Phrase
 detection helps this problem by limiting the set of n-grams to those
 most informative.
 But how do we detect phrases? I implemented [pointwise mutual
 information](https://en.wikipedia.org/wiki/Pointwise_mutual_information),
 which is a pretty simple way but seems to work pretty well.
 PMI is a quantity derived from information theory. The intuition is
 that if two words occur together quite frequently compared to how often
 they appear separately then the cooccurrance is likely to be
 informative.
 $$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
 In `../ngrams/tf_comments.py` if `--mwe-pass=first` then a 10% sample
 of 1-4-grams (sequences of terms up to length 4) will be written to a
 file to be consumed by `../ngrams/top_comment_phrases.py`.
 `top_comment_phrases.py` computes the PMI for these possible phrases
 and writes those that occur at least 3500 times in the sample of
 n-grams and have a PMI of at least 3 (about 65000 expressions).
 `tf_comments.py --mwe-pass=second` then uses the detected phrases and
 adds them to the term frequency data.
 ## Cosine similarity
 Once the TF-IDF vectors are built, making a similarity score between
 two subreddits is straightforward using cosine similarity.
 $$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
 Intuitively, we represent two subreddits as lines in a high-dimensional
 space (TF-IDF vectors). In linear algebra, the dot product ($\cdot$)
 between two vectors takes their weighted sum (e.g. linear regression is
 a dot product of a vector of covariates and a vector of weights). The
 vectors might have different lengths — if one subreddit has more words
 in comments than the other — so in cosine similarity the dot product
 is normalized by the magnitude (length) of the vectors. It turns out
 that this is equivalent to taking the cosine of the two vectors. So
 cosine similarity in essence quantifies the angle between the two lines
 in high-dimensional space. If the cosine similarity between two
 subreddits is greater then their TF-IDF vectors are more correlated.
 Cosine similarity with TF-IDF is popular (indeed it has been applied to
 Reddit in research several times before) because it quantifies the
 correlation between the most characteristic terms for two communities.
 Compared to other approaches to similarity like those using word
 embeddings or topic models it may struggle to handle polysemy, synonymy,
 or correlations between different terms. Using phrase detection helps
 with this a little bit. The advantages of this approach are simplicity
 and scalability. I'm thinking about using [latent semantic
 analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
 intermediate step to improve upon similarities based on raw TF-IDFs.
 Even still, computing similarities between a large number of subreddits
 is computationally expensive and requires $n(n-1)/2$ dot-product
 evaluations. This can be sped up by passing
 `similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I
 used a cosine similarity function that's built into the spark matrix
 library which supports the `DIMSUM` algorithm for approximating
 matrix-matrix products. This algorithm is commonly used in industry
 (i.e. at Twitter, Google) for large-scale similarity scoring.
 ## See also
 The CDSC wiki page
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
 is the landing page for this project on the wiki. The methods writeup
 above used to live there; it now lives here so that doc and code stay
 in sync.