Utilities for Reddit Data Science
cdsc_reddit is a collection of tools for working with Reddit data on the
Hyak supercomputing system at the University of Washington. It is built
around PySpark and pyarrow so that the underlying pipelines scale to the
full Pushshift archive.
The project was originally developed by Nate TeBlunthuis and is now maintained by a rotating set of researchers in the Community Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and others.
At a high level, the codebase covers four kinds of work:
- Ingest. Turning Pushshift comment and submission dumps into partitioned Parquet datasets that are fast to query by subreddit or by author.
- Text features. Building per-subreddit TF-IDF vectors over comment text, including a phrase-detection pass based on pointwise mutual information.
- Similarity, clustering, and density. Computing cosine similarities between subreddits (by terms or by overlapping authors), clustering the resulting similarity matrices, and summarizing how dense each neighborhood is.
- Time series and visualization. Pulling activity time series per subreddit and producing t-SNE plots of the clustering output.
Several pieces are still rough — the user interfaces for many of the scripts assume familiarity with the project, and the TF-IDF pipeline does not yet strip hyperlinks or bot comments, so subreddits with similar automod messages can look misleadingly similar.
Repository layout
| Directory | What's in it |
|---|---|
| datasets/ | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
| ngrams/ | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
| similarities/ | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
| clustering/ | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
| density/ | Per-subreddit overlap density measures derived from the similarity matrices. |
| timeseries/ | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
| visualization/ | Altair-based interactive plots of subreddit clusters. |
| bots/ | Heuristics for flagging likely bot accounts. |
| examples/ | Small standalone examples using pyarrow. |
Sourcing the dumps
Pushshift was effectively wound down after Reddit cut off third-party API
access in 2023, and the original files.pushshift.io archive is gone.
Collection of new Reddit comment and submission data has since been
picked up by ArcticShift,
which publishes both the historical Pushshift archive and the new data
it continues to collect, with monthly updates redistributed as academic
torrents by Reddit users u/Watchful1 and u/RaiderBDev. Fetching the
dumps from a torrent client is a manual prerequisite to running the rest
of this pipeline; step-by-step instructions for the current CDSC
workflow — including which torrents to pull and how to stage the .zst
files on Hyak — live on the CDSC wiki at
CommunityData:CDSC_Reddit.
The earlier dumps/ directory of pull_pushshift_*.sh and SHA-check
scripts has been removed since the URLs they pointed at no longer
resolve.
Building Parquet datasets
The raw dumps are huge compressed JSON files with a lot of metadata that we usually don't need. They aren't indexed, so it's expensive to pull data for just a handful of subreddits, and they are awkward to read directly into Spark. Extracting the useful fields and rewriting the data as Parquet makes everything downstream cheaper. The conversion happens in two steps:
- Extracting JSON into temporary, unpartitioned Parquet files using pyarrow (comments_2_parquet_part1.py, submissions_2_parquet_part1.py).
- Repartitioning and sorting the data using PySpark (comments_2_parquet_part2.py, submissions_2_parquet_part2.py).
The final datasets live in /gscratch/comdata/output/:
- reddit_comments_by_author.parquet: comments partitioned and sorted by author (lowercase).
- reddit_comments_by_subreddit.parquet: comments partitioned and sorted by subreddit (lowercase).
- reddit_submissions_by_author.parquet: submissions partitioned and sorted by author (lowercase).
- reddit_submissions_by_subreddit.parquet: submissions partitioned and sorted by subreddit (lowercase).
Splitting the work this way lets us decompress and parse the dumps in the Hyak backfill queue and then sort them in Spark. Partitioning makes it possible to read data for specific subreddits or authors efficiently, and sorting makes per-subreddit or per-user aggregations cheap. More documentation on using these files lives on the CDSC wiki.
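As a rough illustration of the idea (not the project's actual pyarrow/PySpark code), the sketch below does both steps in plain Python on three made-up comment records. The field list in KEEP, the helper names, and the secondary sort on created_utc are assumptions made for the example.

```python
import json
from itertools import groupby

# Fields kept for illustration; the real scripts may keep a different set.
KEEP = ("id", "author", "subreddit", "created_utc", "body")


def extract(line):
    """Parse one JSON line and keep only the fields listed in KEEP."""
    record = json.loads(line)
    row = {k: record.get(k) for k in KEEP}
    # Partition keys are lowercased, as in the final datasets.
    row["subreddit"] = (row["subreddit"] or "").lower()
    row["author"] = (row["author"] or "").lower()
    return row


def partition_by(rows, key, then_by):
    """Group rows by `key`; within each group, rows are ordered by `then_by`."""
    ordered = sorted(rows, key=lambda r: (r[key], r[then_by]))
    return {k: list(g) for k, g in groupby(ordered, key=lambda r: r[key])}


# Made-up records standing in for lines of a decompressed dump.
raw_lines = [
    '{"id": "c1", "author": "Alice", "subreddit": "AskReddit", "created_utc": 30, "body": "hi", "score": 1}',
    '{"id": "c2", "author": "bob", "subreddit": "askreddit", "created_utc": 10, "body": "yo", "score": 5}',
    '{"id": "c3", "author": "alice", "subreddit": "Python", "created_utc": 20, "body": "ok", "score": 2}',
]

rows = [extract(line) for line in raw_lines]
by_subreddit = partition_by(rows, "subreddit", "created_utc")
by_author = partition_by(rows, "author", "created_utc")
```

After this runs, by_subreddit["askreddit"] holds only the two comments from that subreddit, which is the property that makes per-subreddit reads cheap in the real partitioned datasets.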
TF-IDF subreddit similarity
TF-IDF is a simple
information-retrieval technique we use to quantify the topic of a
subreddit. The goal is to build a vector for each subreddit that scores
every term (or phrase) according to how characteristic it is of the
lexicon used there. For example, the most characteristic terms in
/r/christianity in the current model are:
| Term | tf_idf |
|---|---|
| christians | 0.581 |
| christianity | 0.569 |
| kjv | 0.568 |
| bible | 0.557 |
| scripture | 0.550 |
TF-IDF is the product of two pieces: term frequency (how often a term appears in a subreddit) and inverse document frequency (how rare the term is across other subreddits). There are many ways to construct and combine these; the Wikipedia page catalogs the common variants.
We normalize term frequency by the maximum raw term frequency for each subreddit:
\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}
and use the log inverse document frequency:
\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}
combined with a smoothing term:
\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}
(Other normalization strategies are worth trying — see the note in
similarities/TODO.)
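The three formulas above can be sketched in a few lines of plain Python. This is a toy version for checking the arithmetic, not the Spark code in similarities/tfidf.py; the function name tf_idf and the two-subreddit input are made up for the example, and every document is assumed to contain at least one term.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs maps a subreddit name to its list of terms; returns per-term scores."""
    counts = {name: Counter(terms) for name, terms in docs.items()}
    n_docs = len(counts)
    # Document frequency: the number of subreddits in which each term appears.
    df = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        max_f = max(c.values())  # maximum raw term frequency in this subreddit
        scores[name] = {
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / df[term])
            for term, f in c.items()
        }
    return scores


# Made-up toy corpus with two "subreddits".
toy = {
    "christianity": ["bible", "bible", "scripture", "the"],
    "python": ["code", "the", "the"],
}
scores = tf_idf(toy)
```

In this toy corpus "the" appears in both subreddits, so its idf is log(2/2) = 0 and its score is 0 everywhere, while "bible" is the most frequent term in its subreddit and appears nowhere else, so it scores log(2).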
Building TF-IDF vectors
The pipeline has four steps:
- Extract terms with ngrams/tf_comments.py.
- Detect common phrases with ngrams/top_comment_phrases.py.
- Re-extract terms together with detected phrases via ngrams/tf_comments.py --mwe-pass=second.
- Compute IDF and TF-IDF scores in similarities/tfidf.py.
Running tf_comments.py on the backfill queue
The main reason for the four-step layout is that tf_comments.py is
trivially parallel — it reads every comment and rewrites each subreddit
as a bag of words — so it benefits from being farmed out to the Hyak
backfill queue. ngrams/run_tf_jobs.sh partially automates the dispatch.
Phrase detection using pointwise mutual information
TF-IDF over unigrams misses the fact that sequences of words often carry
distinct meaning (names, fixed expressions, in-jokes). Considering every
possible n-gram is prohibitive because the candidate set explodes with
n, so we use phrase detection to limit ourselves to informative
n-grams.
We use pointwise mutual information (PMI), which is simple and works well in practice. The intuition is that if two words co-occur much more often than their marginal frequencies would predict, the pair is probably meaningful:
\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}
When tf_comments.py is run with --mwe-pass=first, it writes a 10%
sample of 1- to 4-grams to a file. top_comment_phrases.py then
computes PMI over that sample and keeps phrases that occur at least
3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
A second pass of tf_comments.py --mwe-pass=second folds those phrases
back into the term-frequency data.
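A minimal sketch of the PMI filter, restricted to bigrams for brevity (the pipeline itself considers 1- to 4-grams and works on a sample of the full corpus). The function name bigram_pmi, its parameters, and the toy sentence are assumptions made for illustration; the thresholds are far smaller than the 3,500 occurrences and PMI of 3 used in practice.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1, min_pmi=0.0):
    """Return {bigram: pmi} for adjacent word pairs that pass both thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    kept = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[x] / n_uni    # marginal probability of each word
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            kept[(x, y)] = pmi
    return kept


tokens = "new york is big and it is old and it is far from new york".split()
phrases = bigram_pmi(tokens, min_count=2, min_pmi=2.0)
```

Here ("new", "york") passes because both words occur almost only together, while ("it", "is") occurs just as often but is filtered out because "is" is common on its own. On a corpus this small some uninformative pairs also pass, which is part of why the real pipeline requires a high minimum count.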
Cosine similarity
Once the TF-IDF vectors are built, computing a similarity score between two subreddits is straightforward with cosine similarity:
\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}
Each subreddit is a vector in a high-dimensional term space. The dot product gives a weighted sum of shared terms, and dividing by the vector magnitudes removes the effect of differing vocabulary size — what remains is the cosine of the angle between the two vectors. Cosine similarity with TF-IDF is popular (and has been used on Reddit several times in prior research) because it captures correlation between the most characteristic terms of two communities.
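For a single pair of subreddits, the formula can be sketched directly on sparse vectors stored as dictionaries. This is only an illustration of the arithmetic; the function name, the zero-vector convention, and the example weights are assumptions, and the project computes similarities at scale in Spark.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Made-up TF-IDF weights for three hypothetical subreddits.
a = {"bible": 3.0, "scripture": 4.0}
b = {"bible": 6.0, "scripture": 8.0}
c = {"python": 1.0}

same_direction = cosine_similarity(a, b)
no_overlap = cosine_similarity(a, c)
```

Vectors a and b point in the same direction even though b has twice the magnitude, so their similarity is 1; a and c share no terms, so their similarity is 0.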
Compared to approaches based on word embeddings or topic models, this method can struggle with polysemy, synonymy, and correlations between related terms. Phrase detection helps a little. The trade-off is simplicity and scalability. Adding latent semantic analysis as an intermediate step is on the wish-list for improving on raw TF-IDF similarities.
Even with these simplifications, similarity between a large number of
subreddits is expensive — naively n^2 dot-products. Passing
--similarity-threshold=X (with X>0) to the similarity scripts lets
Spark's built-in matrix library use the DIMSUM approximation, which is
the same algorithm Twitter and Google have used for large-scale
similarity scoring.
Clustering, density, and time series
The similarity matrices feed three follow-on analyses:
- clustering/clustering.py clusters a similarity matrix using affinity propagation; clustering/selection.py and clustering/fit_tsne.py are supporting scripts for hyperparameter selection and 2-D embeddings.
- density/overlap_density.py computes a per-subreddit overlap density measure from the similarity matrix.
- timeseries/cluster_timeseries.py and timeseries/choose_clusters.py pull subreddit-level activity time series and join them against clustering output.
visualization/tsne_vis.py renders interactive Altair plots of the
clustering output — see the prebuilt HTML files in visualization/ for
examples.
Bot detection
bots/good_bad_bot.py computes user-level features (compression rate
of comment text, frequency of self-identification as a bot, etc.) that
are useful for filtering bot accounts out of downstream analyses. This
is preliminary work; nothing in the pipeline currently consumes it
automatically.
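To give a sense of why compression is a useful signal, here is a small sketch using zlib. It assumes "compression rate" means compressed size divided by raw size over an author's concatenated comments; the exact definition in bots/good_bad_bot.py may differ, and the function name and sample comments are made up.

```python
import zlib


def compression_ratio(comments):
    """Compressed size divided by raw size for an author's concatenated comments."""
    raw = "\n".join(comments).encode("utf-8")
    if not raw:
        return 1.0  # convention chosen here for authors with no text
    return len(zlib.compress(raw)) / len(raw)


# A templated, bot-like author versus a more varied one (both made up).
templated = ["I am a bot, and this action was performed automatically."] * 50
varied = [
    "Has anyone tried the backfill queue on weekends?",
    "My sourdough starter finally doubled overnight.",
    "That goal in the 89th minute was unreal.",
    "Parquet files are much nicer to query than raw JSON.",
]

bot_like = compression_ratio(templated)
human_like = compression_ratio(varied)
```

Repetitive, templated text compresses far better than varied text, so a low ratio is one hint (among others) that an account may be automated.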