Utilities for Reddit Data Science

cdsc_reddit is a collection of tools for working with Reddit data on the Hyak supercomputing system at the University of Washington. It is built around PySpark and pyarrow so that the underlying pipelines scale to the full Pushshift archive.

The project was originally developed by Nate TeBlunthuis and is now maintained by a rotating set of researchers in the Community Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and others.

At a high level, the codebase covers four kinds of work:

  • Ingest. Turning Pushshift comment and submission dumps into partitioned Parquet datasets that are fast to query by subreddit or by author.
  • Text features. Building per-subreddit TF-IDF vectors over comment text, including a phrase-detection pass based on pointwise mutual information.
  • Similarity, clustering, and density. Computing cosine similarities between subreddits (by terms or by overlapping authors), clustering the resulting similarity matrices, and summarizing how dense each neighborhood is.
  • Time series and visualization. Pulling activity time series per subreddit and producing t-SNE plots of the clustering output.

Several pieces are still rough. The command-line interfaces for many of the scripts assume familiarity with the project, and the TF-IDF pipeline does not yet strip hyperlinks or bot comments, so subreddits with similar automod messages can look misleadingly similar.

Repository layout

Directory        What's in it
datasets/        Scripts that convert the raw dumps into partitioned, sorted Parquet datasets.
ngrams/          Term extraction from comments, phrase detection via PMI, and supporting batch scripts.
similarities/    TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant.
clustering/      Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization.
density/         Per-subreddit overlap density measures derived from the similarity matrices.
timeseries/      Per-subreddit activity time series, plus tooling for choosing among clustering runs.
visualization/   Altair-based interactive plots of subreddit clusters.
bots/            Heuristics for flagging likely bot accounts.
examples/        Small standalone examples using pyarrow.

Sourcing the dumps

Pushshift was effectively wound down after Reddit cut off third-party API access in 2023, and the original files.pushshift.io archive is gone. Collection of new Reddit comment and submission data has since been picked up by ArcticShift, which publishes both the historical Pushshift archive and the new data it continues to collect, with monthly updates redistributed as academic torrents by Reddit users u/Watchful1 and u/RaiderBDev. Fetching the dumps from a torrent client is a manual prerequisite to running the rest of this pipeline; step-by-step instructions for the current CDSC workflow — including which torrents to pull and how to stage the .zst files on Hyak — live on the CDSC wiki at CommunityData:CDSC_Reddit. The earlier dumps/ directory of pull_pushshift_*.sh and SHA-check scripts has been removed since the URLs they pointed at no longer resolve.

Building Parquet datasets

The raw dumps are huge compressed JSON files with a lot of metadata that we usually don't need. They aren't indexed, so it's expensive to pull data for just a handful of subreddits, and they are awkward to read directly into Spark. Extracting the useful fields and rewriting the data as Parquet makes everything downstream cheaper. The conversion happens in two steps:

  1. Extracting JSON into temporary, unpartitioned Parquet files using pyarrow (comments_2_parquet_part1.py, submissions_2_parquet_part1.py).
  2. Repartitioning and sorting the data using PySpark (comments_2_parquet_part2.py, submissions_2_parquet_part2.py).

The final datasets live in /gscratch/comdata/output/:

  • reddit_comments_by_author.parquet — comments partitioned and sorted by author (lowercase).
  • reddit_comments_by_subreddit.parquet — comments partitioned and sorted by subreddit (lowercase).
  • reddit_submissions_by_author.parquet — submissions partitioned and sorted by author (lowercase).
  • reddit_submissions_by_subreddit.parquet — submissions partitioned and sorted by subreddit (lowercase).

Splitting the work this way lets us decompress and parse the dumps in the Hyak backfill queue and then sort them in Spark. Partitioning makes it possible to read data for specific subreddits or authors efficiently, and sorting makes per-subreddit or per-user aggregations cheap. More documentation on using these files lives on the CDSC wiki.
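
As a rough illustration of the two-step shape described above, here is a minimal standard-library sketch. It is not the code in datasets/: the real scripts use pyarrow and PySpark and write Parquet. The helper names (extract, partition_and_sort), the set of fields kept, and the choice to sort each partition by created_utc are all assumptions made for the example.

```python
import json
from collections import defaultdict

# Fields kept from each raw comment. These are standard Reddit JSON keys, but
# the exact subset the real scripts keep is an assumption for this sketch.
KEEP = ("id", "author", "subreddit", "created_utc", "body")


def extract(lines):
    """Step 1: parse newline-delimited JSON and keep only the useful fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = json.loads(line)
        row = {k: raw.get(k) for k in KEEP}
        # Lowercase the partition key so "Python" and "python" land together.
        row["subreddit"] = (row["subreddit"] or "").lower()
        yield row


def partition_and_sort(rows, key="subreddit"):
    """Step 2: group rows by `key`, then sort each group by timestamp."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key]].append(row)
    return {
        k: sorted(v, key=lambda r: r["created_utc"])
        for k, v in sorted(parts.items())
    }


# Toy stand-in for a decompressed dump: one JSON object per line.
sample = [
    '{"id": "c1", "author": "alice", "subreddit": "Python", "created_utc": 20, "body": "hi", "score": 3}',
    '{"id": "c2", "author": "bob", "subreddit": "AskReddit", "created_utc": 5, "body": "why?", "score": 1}',
    '{"id": "c3", "author": "carol", "subreddit": "python", "created_utc": 10, "body": "hello", "score": 7}',
]
parts = partition_and_sort(extract(sample))
```

Step 1 touches each line independently, which is why it can be spread across many backfill jobs. Step 2 needs to see all rows for a key at once, which is the part handed to Spark in the real pipeline.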

TF-IDF subreddit similarity

TF-IDF is a simple information-retrieval technique we use to quantify the topic of a subreddit. The goal is to build a vector for each subreddit that scores every term (or phrase) according to how characteristic it is of the lexicon used there. For example, the most characteristic terms in /r/christianity in the current model are:

Term           tf_idf
christians     0.581
christianity   0.569
kjv            0.568
bible          0.557
scripture      0.55
TF-IDF is the product of two pieces: term frequency (how often a term appears in a subreddit) and inverse document frequency (how rare the term is across other subreddits). There are many ways to construct and combine these; the Wikipedia page catalogs the common variants.

We normalize term frequency by the maximum raw term frequency for each subreddit:

\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}

and use the log inverse document frequency:

\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}

combined with a smoothing term:

\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}

(Other normalization strategies are worth trying — see the note in similarities/TODO.)
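
The three formulas above can be checked on a toy corpus with a few lines of Python. This is a sketch only, not the Spark implementation in similarities/tfidf.py: the function name tfidf and the input layout (one Counter of raw term counts per subreddit) are assumptions, and the natural log is used because the formulas do not fix a base.

```python
import math
from collections import Counter


def tfidf(term_counts):
    """Map {subreddit: Counter({term: raw count})} to {subreddit: {term: score}}.

    Assumes every subreddit has at least one term.
    """
    n_docs = len(term_counts)
    # Document frequency: number of subreddits in which each term appears.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    scores = {}
    for sub, counts in term_counts.items():
        max_f = max(counts.values())  # maximum raw term frequency in this subreddit
        scores[sub] = {
            # (0.5 + 0.5 * tf) * idf, with tf = f / max_f and idf = log(N / df)
            t: (0.5 + 0.5 * f / max_f) * math.log(n_docs / df[t])
            for t, f in counts.items()
        }
    return scores


counts = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "python": Counter({"import": 6, "the": 6}),
}
scores = tfidf(counts)
```

In this toy corpus "the" appears in both subreddits, so its IDF is log(2/2) = 0 and its score is zero everywhere, while "bible" and "import" each appear in only one subreddit and keep a positive score.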

Building TF-IDF vectors

The pipeline has four steps:

  1. Extract terms with ngrams/tf_comments.py.
  2. Detect common phrases with ngrams/top_comment_phrases.py.
  3. Re-extract terms together with detected phrases via ngrams/tf_comments.py --mwe-pass=second.
  4. Compute IDF and TF-IDF scores in similarities/tfidf.py.

Running tf_comments.py on the backfill queue

The main reason for the four-step layout is that tf_comments.py is trivially parallel — it reads every comment and rewrites each subreddit as a bag of words — so it benefits from being farmed out to the Hyak backfill queue. ngrams/run_tf_jobs.sh partially automates the dispatch.

Phrase detection using pointwise mutual information

TF-IDF over unigrams misses the fact that sequences of words often carry distinct meaning (names, fixed expressions, in-jokes). Considering every possible n-gram is prohibitive because the candidate set explodes with n, so we use phrase detection to limit ourselves to informative n-grams.

We use pointwise mutual information (PMI), which is simple and works well in practice. The intuition is that if two words co-occur much more often than their marginal frequencies would predict, the pair is probably meaningful:

\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}

When tf_comments.py is run with --mwe-pass=first, it writes a 10% sample of 1- to 4-grams to a file. top_comment_phrases.py then computes PMI over that sample and keeps phrases that occur at least 3,500 times and have PMI of at least 3 — roughly 65,000 expressions. A second pass of tf_comments.py --mwe-pass=second folds those phrases back into the term-frequency data.
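
To make the PMI filter concrete, here is a small sketch restricted to bigrams. It is not top_comment_phrases.py: the function name pmi_phrases is hypothetical, probabilities are estimated from raw counts in a single token list, the natural log is assumed, and the toy thresholds stand in for the production values of 3,500 occurrences and PMI of 3.

```python
import math
from collections import Counter


def pmi_phrases(tokens, min_count=2, min_pmi=1.8):
    """Return {bigram: pmi} for bigrams that pass both thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    kept = {}
    for (x, y), c in bigrams.items():
        if c < min_count:
            continue  # too rare to trust the estimate
        p_xy = c / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            kept[(x, y)] = pmi
    return kept


tokens = "new york is big and new york is old and the dog is big".split()
phrases = pmi_phrases(tokens)
```

Here ("new", "york"), ("york", "is"), and ("is", "big") each occur twice, but only ("new", "york") clears the PMI threshold, because "is" is common enough on its own that its pairs are less surprising.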

Cosine similarity

Once the TF-IDF vectors are built, computing a similarity score between two subreddits is straightforward with cosine similarity:

\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}

Each subreddit is a vector in a high-dimensional term space. The dot product gives a weighted sum of shared terms, and dividing by the vector magnitudes removes the effect of differing vocabulary size — what remains is the cosine of the angle between the two vectors. Cosine similarity with TF-IDF is popular (and has been used on Reddit several times in prior research) because it captures correlation between the most characteristic terms of two communities.
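
For two subreddits, the formula above reduces to a few lines over sparse vectors. The sketch below stores each TF-IDF vector as a {term: weight} dict, which is an assumption made for readability; the actual similarity scripts work on Spark matrices.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


a = {"bible": 3.0, "scripture": 4.0}
b = {"bible": 4.0, "scripture": 3.0}
c = {"import": 1.0}

sim_ab = cosine(a, b)  # shared vocabulary, different weights
sim_ac = cosine(a, c)  # no shared terms
```

Vectors with no terms in common score exactly zero, and scaling every weight in a vector by the same positive factor leaves the score unchanged, which is the sense in which vocabulary size is normalized away.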

Compared to approaches based on word embeddings or topic models, this method can struggle with polysemy, synonymy, and correlations between related terms. Phrase detection helps a little. The trade-off is simplicity and scalability. Adding latent semantic analysis as an intermediate step is on the wish list for improving on raw TF-IDF similarities.

Even with these simplifications, computing similarities among a large number of subreddits is expensive: a naive approach needs on the order of n^2 dot products for n subreddits. Passing --similarity-threshold=X (with X>0) to the similarity scripts lets Spark's built-in matrix library use the DIMSUM approximation, which is the same algorithm Twitter and Google have used for large-scale similarity scoring.

Clustering, density, and time series

The similarity matrices feed three follow-on analyses:

  • clustering/clustering.py clusters a similarity matrix using affinity propagation; clustering/selection.py and clustering/fit_tsne.py are supporting scripts for hyperparameter selection and 2-D embeddings.
  • density/overlap_density.py computes a per-subreddit overlap density measure from the similarity matrix.
  • timeseries/cluster_timeseries.py and timeseries/choose_clusters.py pull subreddit-level activity time series and join them against clustering output.

visualization/tsne_vis.py renders interactive Altair plots of the clustering output — see the prebuilt HTML files in visualization/ for examples.

Bot detection

bots/good_bad_bot.py computes user-level features (compression rate of comment text, frequency of self-identification as a bot, etc.) that are useful for filtering bot accounts out of downstream analyses. This is preliminary work; nothing in the pipeline currently consumes it automatically.
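
The two features named above can be sketched with the standard library. This is an illustration of the idea rather than the code in bots/good_bad_bot.py: the function names, the use of zlib, and the marker string "i am a bot" are all assumptions.

```python
import zlib


def compression_ratio(comments):
    """Compressed size divided by raw size for one user's comments.

    Highly repetitive text (typical of templated bot replies) compresses
    well, so a lower ratio suggests more repetition.
    """
    raw = "\n".join(comments).encode("utf-8")
    if not raw:
        return 1.0
    return len(zlib.compress(raw)) / len(raw)


def self_id_rate(comments, marker="i am a bot"):
    """Fraction of comments in which the account identifies itself as a bot."""
    if not comments:
        return 0.0
    return sum(marker in c.lower() for c in comments) / len(comments)


# Toy accounts: one posts the same template 40 times, one posts varied text.
bot = ["I am a bot, and this action was performed automatically."] * 40
human = [f"thread {i}: score was {i * i * 7919 % 10007}" for i in range(40)]

bot_features = (compression_ratio(bot), self_id_rate(bot))
human_features = (compression_ratio(human), self_id_rate(human))
```

Neither feature is decisive on its own. A prolific human who quotes the same rule text will also compress well, so in practice features like these would feed a threshold or classifier chosen against labeled accounts.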
