From 1851132a06cc43b987526ee9d476ff6588cae070 Mon Sep 17 00:00:00 2001 From: Benjamin Mako Hill Date: Mon, 25 May 2026 17:20:21 -0700 Subject: [PATCH] move dataset + similarity docs from wiki into repo READMEs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe: - datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_*.sh / check_*_shas.py). - similarities/README.md (new): port the wiki's Subreddit Similarity section — TF-IDF math, PMI phrase detection, cosine similarity — with MediaWiki math converted to markdown LaTeX and script references updated to current paths. The wiki page has been trimmed to a landing page that points at these README files in gitea. Co-Authored-By: Claude Sonnet 4.6 --- datasets/README.md | 269 ++++++++++++++++++++++++++++++++++++++--- similarities/README.md | 175 +++++++++++++++++++++++++++ 2 files changed, 425 insertions(+), 19 deletions(-) create mode 100644 similarities/README.md diff --git a/datasets/README.md b/datasets/README.md index 66c10e3..f5a6a4b 100644 --- a/datasets/README.md +++ b/datasets/README.md @@ -5,20 +5,47 @@ This directory holds the pipeline that turns compressed Reddit dump files sorted, repartitioned parquet datasets that the rest of the project consumes. -The pipeline has two stages: +## Pipeline overview -| Stage | What it does | +The raw dumps are huge compressed json files with a lot of metadata that +we may not need. They aren't indexed so it's expensive to pull data from +just a handful of subreddits. It also turns out that it's a pain to read +these compressed files straight into spark. Extracting useful variables +from the dumps and building parquet datasets makes them easier to work +with. This happens in two steps: + +1. Extracting json into (temporary, unpartitioned) parquet files using + pyarrow. +2. Repartitioning and sorting the data using pyspark. + +Breaking this down into two steps is useful because it allows us to +decompress and parse the dumps in the backfill queue and then sort them +in spark. Partitioning the data makes it possible to efficiently read +data for specific subreddits or authors. Sorting it means that you can +efficiently compute aggregations at the subreddit or user level. More +documentation on using these files is available on the [CDSC wiki][hyak-datasets]. + +The final datasets are in `/gscratch/comdata/output`: + +- `reddit_comments_by_author.parquet` has comments partitioned and sorted + by username (lowercase). +- `reddit_comments_by_subreddit.parquet` has comments partitioned and + sorted by subreddit name (lowercase). +- `reddit_submissions_by_author.parquet` has submissions partitioned and + sorted by username (lowercase). +- `reddit_submissions_by_subreddit.parquet` has submissions partitioned + and sorted by subreddit name (lowercase). + +[hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets + +## Scripts + +| Script | Role | |---|---| -| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. | -| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. | - -Each stage has a thin entry-point script per dump type: - -| Script | Notes | -|---|---| -| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump ` and `gen_task_list` subcommands via fire. | -| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. | -| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. | +| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump ` and `gen_task_list` subcommands via fire. | +| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. | +| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. | +| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). | ## The two workflows @@ -47,10 +74,10 @@ need a rearchitecture if the cost became a problem. ## Running steps individually -Both `.sh` runners are written so that every meaningful step is a separate, -self-contained command. If something fails partway through, or you want -to inspect intermediate state, you can copy any single line out of the -runner and execute it standalone. For example: +Both `.sh` runners are written so that every meaningful step is a +separate, self-contained command. If something fails partway through, or +you want to inspect intermediate state, you can copy any single line out +of the runner and execute it standalone. For example: ```sh # parse one specific file (skipping the rest of the workflow) @@ -68,10 +95,214 @@ The Spark Part 2 step is launched via `start_spark_and_run.sh` (a Hyak-provided wrapper not included in this repo); see the wiki for the launch convention. +## Detailed walkthrough: refreshing the data on Hyak + +This walkthrough describes the process we went through updating Reddit +data from the PushShift cutoff up to the end of 2024. Adapting it for +newer data should just involve using different academic torrent files +that start from 2025 onwards. For a single-month update, the +`add_new_month.sh` workflow above is much shorter; this walkthrough is +for the bulk-refresh case. + +### Prerequisites + +- [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config + and `.bashrc`) +- [Go through the Hyak Getting Started tutorial][hyak-syllabus] + +Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`): + +- [Watchful1's reddit explanation][watchful1-explainer] (separated by + subreddit), the [dataset not divided by subreddits][watchful1-bulk], + and the [GitHub repo with scripts for analyzing data][watchful1-repo] +- [RaiderBDev monthly dumps][raiderbdev-monthly] and + [RaiderBDev's ArcticShift API][arctic-shift] +- The [2005-06 to 2024-12 academic torrent][academic-torrent] used for + the 2005-2024 refresh + +CDSC and Hyak docs: + +- [Hyak docs — how to work with modules][hyak-modules] +- [CDSC — how to download Python or R packages][cdsc-pkgs] +- [CDSC — Hyak datasets information][hyak-datasets] +- [CDSC — Hyak Spark information][hyak-spark] + +[hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak +[hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/ +[watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/ +[watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/ +[watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master +[raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/ +[arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift +[academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4 +[hyak-modules]: https://hyak.uw.edu/docs/tools/modules +[cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages +[hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark + +### Step 1: data download on Nada and Hyak + +We downloaded the [2005-2024 academic torrent][academic-torrent] and put +it on Nada (~2 days of downloading). We copied the raw data over to +Hyak's scrubbed directory in a new directory, +`/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw +data sorted into `/comments` or `/submissions`. The `/submissions` +directory shows `RS_20*.zst` files and the `/comments` shows `RC_20*.zst` +files. (There are no earlier zip files, such as `.bz2` or `.xz`, to deal +with.) + +### Step 2: clone the repo on Hyak + +On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the +working directory next to the raw data, e.g. +`/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant +code lives entirely in `datasets/`: + +- `dumps_helper.py` — shared parsing and Spark logic +- `helper.py` — file-open helpers +- `comments_part1.py`, `submissions_part1.py` — Part 1 entry points +- `comments_part2.py`, `submissions_part2.py` — Part 2 entry points +- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts + +The Spark wrapper scripts (`start_spark_and_run.sh`, +`start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo; +they are part of the CDSC Hyak environment and should already be on +PATH. + +### Step 3: smoke-test Part 1 on a single file + +Check out `any_machine`. We'll test submissions Part 1 with just one +file: + +```sh +python3 submissions_part1.py parse_dump RS_2005-06.zst +``` + +To verify, go to your output directory and examine the start of the +file: + +```sh +python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())" +``` + +You should see columns like `id`, `author`, `subreddit`, and `title` +printed out. Repeat the process with `comments_part1.py`; you should see +columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out. + +**Note**: you may have to install relevant libraries before successfully +running the file: + +```sh +pip install --user pyarrow simdjson zstandard fire +``` + +### Step 4: Part 1 — converting `.zst` to `.parquet` files + +Now we'll convert all of our `.zst` compressed Reddit data to `.parquet` +files. First, to generate our task list, we'll run + +```sh +python3 submissions_part1.py gen_task_list +``` + +There should be a script, `parse_submissions_task_list`, in the working +directory. Check the script (`less parse_submissions_task_list`); it +should have many lines that look like our earlier test command, +`python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of +our `.zst` files. Do the same process with comments to generate +`parse_comments_task_list`. + +From a login node, run `tmux` to keep our job running and then +`any_machine` to check out a node to do computational work. We'll run +our tasks (from the task list) in parallel to optimize. Start with +submissions: + +```sh +parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list +``` + +The `--joblog` flag creates a text file where you can see which tasks +completed successfully, and the `--results` flag creates a directory +where each task has its own stderr output to see the specific error +(this is best practice for debugging). + +Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`). +We'll ssh into our computational node (`ssh n1234` — you can get the +node name by running `ourjobs`) and run `htop` +([more details on htop][htop-explainer]). You should see that the +machine's CPUs are getting close to 100% usage. If all looks good, +create a new window and repeat the process for comments. + +[htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/ + +Once the job has successfully completed, you'll see that your CPUs are +closer to 0% usage in `htop` and your `submissions_joblog.txt` file +should show an `exitval` of 0 for all commands. Kill your node by +running `scancel 12345678` (the job ID can be found from `ourjobs`). + +### Step 5: verify the per-source parquet files + +We'll want to verify our `.parquet` files at this point. We compared the +new files' number of columns and rows to the old data: from the +`/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet` +directory, run + +```sh +diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet) +``` + +and confirm there are no differences (same process with submissions). +This may or may not be relevant if we continue using the same academic +torrent to update data and have nothing to compare to, but you can still +check that the new data's number of columns and rows are fairly +continuous with the most recent data we already have. + +### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark + +If the `.parquet` files reasonably appear to be complete, we can now +sort them by author and subreddit. The most efficient way to do so is by +using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node +splits into up to six slices (four in our current case) so the tasks +will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog] +are good resources for further information). Run `tmux` on a login +node, then grab the whole node for up to a week with: + +```sh +salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G +``` + +[hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/ + +Once Slurm drops you onto the compute node, run + +```sh +./start_spark_and_run.sh submissions_part2.py +``` + +Monitor via `htop` (as described in Step 4); the CPUs may not always +show high usage but you should see that memory is being used. Repeat +for the comments. Successful jobs will result in +`/gscratch/comdata/output` having four new directories: +`reddit_submissions_by_author.parquet`, +`reddit_submissions_by_subreddit.parquet`, +`reddit_comments_by_author.parquet`, and +`reddit_comments_by_subreddit.parquet`. Each should contain many +`snappy.parquet` files (e.g. +`part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`) +and `_SUCCESS`. + +### Step 7: data verification + +Verify and make sure the new data is reasonably complete before deleting +any of the old data. Do a simple time series to see how many posts there +are per day and make sure things don't fall off. It is also useful to +have lab members test out anything they're working on again with the +new parquet files. + ## See also The CDSC wiki page [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit) -documents the surrounding workflow — where the raw dump files come from -(currently ArcticShift via academic torrents), how to stage them on -Hyak, and how to run Spark jobs on the cluster. +is the landing page for this project on the wiki and provides +cross-links to related CDSC and Hyak documentation. The walkthrough +above used to live there; it now lives here so that doc and code stay +in sync. diff --git a/similarities/README.md b/similarities/README.md new file mode 100644 index 0000000..24da777 --- /dev/null +++ b/similarities/README.md @@ -0,0 +1,175 @@ +# Subreddit similarity + +This directory holds the code that computes pairwise similarities between +subreddits — both term-based (from TF-IDF over comment text) and +author-based (from overlapping commenter sets). Similarity matrices +produced here feed downstream clustering (`../clustering/`) and density +analysis (`../density/`). + +## Datasets + +Subreddit similarity datasets based on comment terms and comment authors +are available on Hyak in `/gscratch/comdata/output/reddit_similarity`. +The overall approach to subreddit similarity seems to work reasonably +well and the code is stabilizing. If you want help using these +similarities in a project, just reach out to +[Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29). + +By default, the scripts here take a `TopN` parameter which selects the +subreddits to include in the similarity dataset according to how many +total comments they have. You can alternatively pass a value to the +`included_subreddits` parameter for a file with the names of the +subreddits you would like to include on each line. + +## Scripts + +| Script | What it does | +|---|---| +| `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. | +| `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. | +| `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. | +| `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. | +| `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. | +| `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. | +| `Makefile` | Wires everything together with the canonical Hyak output paths. | + +## Methods + +[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and +simple information-retrieval technique that we can use to quantify the +topic of a subreddit. The goal of TF-IDF is to build a vector for each +subreddit that scores every term (or phrase) according to how +characteristic it is of the overall lexicon used in that subreddit. For +example, the most characteristic terms in the subreddit `/r/christianity` +in the current version of the TF-IDF model are: + +| Term | tf_idf | +|:------------:|:------:| +| christians | 0.581 | +| christianity | 0.569 | +| kjv | 0.568 | +| bible | 0.557 | +| scripture | 0.55 | + +TF-IDF stands for "term frequency — inverse document frequency" because +it is the product of two terms "term frequency" and "inverse document +frequency." Term frequency quantifies the amount that a term appears in +a subreddit (document). Inverse document frequency quantifies how much +that term appears in other subreddits (documents). As you can see on +the Wikipedia page, there are many possible ways of constructing and +combining these terms. + +I chose to normalize term frequency by the maximum (raw) term frequency +for each subreddit: + +$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$ + +I use the log inverse document frequency: + +$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$ + +I then combine them using some smoothing to get: + +$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$ + +(Other normalization strategies are worth trying — see the note in +`TODO`.) + +### Building TF-IDF vectors + +The process for building TF-IDF vectors has four steps: + +1. Extracting terms using `../ngrams/tf_comments.py` +2. Detecting common phrases using `../ngrams/top_comment_phrases.py` +3. Extracting terms and common phrases using + `../ngrams/tf_comments.py --mwe-pass='second'` +4. Building IDF and TF-IDF scores in `tfidf.py` + +#### Running `tf_comments.py` on the backfill queue + +The main reason that I did it in four steps instead of one is to take +advantage of the backfill queue for running `tf_comments.py`. This step +requires reading all of the text in every comment and converting it to +a bag of words at the subreddit level. This is a lot of computation +that is easily parallelizable. The script `../ngrams/run_tf_jobs.sh` +partially automates running steps 1 (or 3) on the backfill queue. + +#### Phrase detection using pointwise mutual information + +TF-IDF is simple, but only uses single words (unigrams). Sequences of +multiple words can be important to account for how words have different +meanings in different contexts or how sequences of words refer to +distinct things like names. Dealing with context or longer sequences of +words is a common challenge in natural language processing since the +number of possible n-grams grows like crazy as n gets bigger. Phrase +detection helps this problem by limiting the set of n-grams to those +most informative. + +But how do we detect phrases? I implemented [pointwise mutual +information](https://en.wikipedia.org/wiki/Pointwise_mutual_information), +which is a pretty simple way but seems to work pretty well. + +PMI is a quantity derived from information theory. The intuition is +that if two words occur together quite frequently compared to how often +they appear separately then the cooccurrance is likely to be +informative. + +$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$ + +In `../ngrams/tf_comments.py` if `--mwe-pass=first` then a 10% sample +of 1-4-grams (sequences of terms up to length 4) will be written to a +file to be consumed by `../ngrams/top_comment_phrases.py`. +`top_comment_phrases.py` computes the PMI for these possible phrases +and writes those that occur at least 3500 times in the sample of +n-grams and have a PMI of at least 3 (about 65000 expressions). + +`tf_comments.py --mwe-pass=second` then uses the detected phrases and +adds them to the term frequency data. + +## Cosine similarity + +Once the TF-IDF vectors are built, making a similarity score between +two subreddits is straightforward using cosine similarity. + +$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$ + +Intuitively, we represent two subreddits as lines in a high-dimensional +space (TF-IDF vectors). In linear algebra, the dot product ($\cdot$) +between two vectors takes their weighted sum (e.g. linear regression is +a dot product of a vector of covariates and a vector of weights). The +vectors might have different lengths — if one subreddit has more words +in comments than the other — so in cosine similarity the dot product +is normalized by the magnitude (length) of the vectors. It turns out +that this is equivalent to taking the cosine of the two vectors. So +cosine similarity in essence quantifies the angle between the two lines +in high-dimensional space. If the cosine similarity between two +subreddits is greater then their TF-IDF vectors are more correlated. + +Cosine similarity with TF-IDF is popular (indeed it has been applied to +Reddit in research several times before) because it quantifies the +correlation between the most characteristic terms for two communities. + +Compared to other approaches to similarity like those using word +embeddings or topic models it may struggle to handle polysemy, synonymy, +or correlations between different terms. Using phrase detection helps +with this a little bit. The advantages of this approach are simplicity +and scalability. I'm thinking about using [latent semantic +analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an +intermediate step to improve upon similarities based on raw TF-IDFs. + +Even still, computing similarities between a large number of subreddits +is computationally expensive and requires $n(n-1)/2$ dot-product +evaluations. This can be sped up by passing +`similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I +used a cosine similarity function that's built into the spark matrix +library which supports the `DIMSUM` algorithm for approximating +matrix-matrix products. This algorithm is commonly used in industry +(i.e. at Twitter, Google) for large-scale similarity scoring. + +## See also + +The CDSC wiki page +[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit) +is the landing page for this project on the wiki. The methods writeup +above used to live there; it now lives here so that doc and code stay +in sync.