The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe:
- datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity section (TF-IDF math, PMI phrase detection, cosine similarity) with MediaWiki math converted to markdown LaTeX and script references updated to current paths.
The wiki page has been trimmed to a landing page that points at these README files in gitea.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Subreddit similarity
This directory holds the code that computes pairwise similarities between
subreddits — both term-based (from TF-IDF over comment text) and
author-based (from overlapping commenter sets). Similarity matrices
produced here feed downstream clustering (../clustering/) and density
analysis (../density/).
Datasets
Subreddit similarity datasets based on comment terms and comment authors
are available on Hyak in /gscratch/comdata/output/reddit_similarity.
The overall approach to subreddit similarity seems to work reasonably
well and the code is stabilizing. If you want help using these
similarities in a project, just reach out to
Nate.
By default, the scripts here take a TopN parameter, which selects the
subreddits to include in the similarity dataset according to how many
total comments they have. Alternatively, you can pass the
included_subreddits parameter the path to a file that lists the
subreddits you would like to include, one name per line.
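The selection logic described above can be sketched in a few lines of Python. This is only an illustration, not the code in similarities_helper.py: the function name `select_subreddits`, its arguments, and the choice to let an explicit list override the top-N cutoff are all assumptions made for the example.

```python
def select_subreddits(comment_counts, top_n=None, included=None):
    """Hypothetical helper (not part of this repo): choose subreddits to keep.

    comment_counts: dict mapping subreddit name -> total number of comments.
    top_n: keep only the top_n subreddits ranked by comment count.
    included: optional iterable of names (e.g. the lines of a text file);
              assumed here to override top_n when given.
    """
    if included is not None:
        # Strip whitespace/newlines and ignore blank lines from the file.
        wanted = {name.strip() for name in included if name.strip()}
        return sorted(name for name in comment_counts if name in wanted)
    # Rank by descending comment count, breaking ties alphabetically.
    ranked = sorted(comment_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]


counts = {"askreddit": 1000, "christianity": 50, "seattle": 200}
top_two = select_subreddits(counts, top_n=2)
explicit = select_subreddits(counts, included=["seattle\n", "christianity\n"])
```

Passing the lines of a file directly as `included` mirrors the one-name-per-line format described above.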
Scripts
| Script | What it does |
|---|---|
| tfidf.py | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for authors, terms, authors_weekly, terms_weekly. |
| cosine_similarities.py | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands author, term, author-tf. |
| weekly_cosine_similarities.py | Same idea but operating on the weekly TF-IDF vectors. |
| wang_similarity.py | A variant similarity computation based on user overlaps in the style of Wang et al. |
| top_subreddits_by_comments.py | Produces the subreddits_by_num_comments.csv ranking used to pick the top-N subreddits for the similarity matrices. |
| similarities_helper.py | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
| Makefile | Wires everything together with the canonical Hyak output paths. |
Methods
TF-IDF is a common and
simple information-retrieval technique that we can use to quantify the
topic of a subreddit. The goal of TF-IDF is to build a vector for each
subreddit that scores every term (or phrase) according to how
characteristic it is of the overall lexicon used in that subreddit. For
example, the most characteristic terms in the subreddit /r/christianity
in the current version of the TF-IDF model are:
| Term | tf_idf |
|---|---|
| christians | 0.581 |
| christianity | 0.569 |
| kjv | 0.568 |
| bible | 0.557 |
| scripture | 0.55 |
TF-IDF stands for "term frequency-inverse document frequency" because it is the product of two factors: term frequency and inverse document frequency. Term frequency quantifies how often a term appears in a subreddit (document). Inverse document frequency quantifies how rare that term is across all subreddits (documents): it is large for terms that appear in few subreddits and small for terms that appear in most of them. As you can see on the Wikipedia page, there are many possible ways of constructing and combining these two factors.
I chose to normalize term frequency by the maximum (raw) term frequency for each subreddit, where $f_{t,d}$ is the raw count of term $t$ in subreddit $d$:
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$
I use the log inverse document frequency, where $D$ is the set of all subreddits and $N = |D|$ is their number:
$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
I then combine them using some smoothing to get:
$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
(Other normalization strategies are worth trying — see the note in
TODO.)
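To make the three formulas above concrete, here is a minimal pure-Python sketch that applies them to toy data. It is not the Spark code in tfidf.py; the function name `tf_idf`, the input layout (one `Counter` per subreddit), and the use of the natural logarithm are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(term_counts):
    """Illustrative only: term_counts maps subreddit -> Counter of raw counts."""
    n_docs = len(term_counts)  # N in the idf formula
    # Document frequency: number of subreddits in which each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(term for term, f in counts.items() if f > 0)

    scores = {}
    for subreddit, counts in term_counts.items():
        max_f = max(counts.values())  # max raw term frequency in this subreddit
        scores[subreddit] = {
            # (0.5 + 0.5 * tf) * idf, with tf = f / max_f and idf = log(N / df)
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
            if f > 0
        }
    return scores


toy = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "seattle": Counter({"rain": 3, "the": 6}),
}
scores = tf_idf(toy)
```

In this toy corpus "the" appears in every subreddit, so its idf is log(1) = 0 and its score vanishes, while "bible" and "rain" each appear in only one subreddit and keep a positive score.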
Building TF-IDF vectors
The process for building TF-IDF vectors has four steps:
1. Extracting terms using ../ngrams/tf_comments.py
2. Detecting common phrases using ../ngrams/top_comment_phrases.py
3. Extracting terms and common phrases using ../ngrams/tf_comments.py --mwe-pass='second'
4. Building IDF and TF-IDF scores in tfidf.py
Running tf_comments.py on the backfill queue
The main reason that I did it in four steps instead of one is to take
advantage of the backfill queue for running tf_comments.py. This step
requires reading all of the text in every comment and converting it to
a bag of words at the subreddit level. This is a lot of computation,
but it is easily parallelizable. The script ../ngrams/run_tf_jobs.sh
partially automates running step 1 (or step 3) on the backfill queue.
Phrase detection using pointwise mutual information
TF-IDF is simple, but on its own it only uses single words (unigrams). Sequences of multiple words can be important, both because words have different meanings in different contexts and because some sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing, since the number of possible n-grams grows exponentially as n gets bigger. Phrase detection helps with this problem by limiting the set of n-grams to the most informative ones.
But how do we detect phrases? I implemented pointwise mutual information (PMI), which is a simple approach that seems to work pretty well.
PMI is a quantity derived from information theory. The intuition is that if two words occur together quite frequently compared to how often they appear separately, then the co-occurrence is likely to be informative.
$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
When ../ngrams/tf_comments.py is run with --mwe-pass=first, a 10% sample
of 1-4-grams (sequences of terms up to length 4) is written to a
file to be consumed by ../ngrams/top_comment_phrases.py.
top_comment_phrases.py computes the PMI for these candidate phrases
and writes out those that occur at least 3500 times in the sample of
n-grams and have a PMI of at least 3 (about 65000 expressions).
tf_comments.py --mwe-pass=second then uses the detected phrases and
adds them to the term frequency data.
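The PMI filter can be illustrated on a toy token stream. The sketch below is a simplification and not the code in top_comment_phrases.py: it handles only bigrams rather than 1-4-grams, estimates probabilities from simple relative frequencies, uses the natural logarithm, and the name `bigram_pmi` and its parameters are hypothetical. The `min_count` and `min_pmi` arguments stand in for the thresholds of 3500 and 3 mentioned above.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1, min_pmi=0.0):
    """Illustrative only: return {(x, y): pmi} for bigrams passing both filters."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_xy = count / n_bi          # p(x, y)
        p_x = unigrams[x] / n_uni    # p(x)
        p_y = unigrams[y] / n_uni    # p(y)
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            phrases[(x, y)] = pmi
    return phrases


tokens = "new york is big and new york is old and it is new".split()
phrases = bigram_pmi(tokens, min_count=2)
```

With `min_count=2`, only the bigrams that repeat in the toy text survive, which is the same idea as requiring a phrase to occur many times in the n-gram sample before its PMI is considered.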
Cosine similarity
Once the TF-IDF vectors are built, computing a similarity score between two subreddits is straightforward using cosine similarity, where $\mathbf{A}$ and $\mathbf{B}$ are the TF-IDF vectors of the two subreddits and $n$ is the number of terms:
$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
Intuitively, we represent two subreddits as vectors (their TF-IDF
vectors) in a high-dimensional space with one dimension per term. In
linear algebra, the dot product ($\cdot$) of two vectors multiplies
their corresponding entries and sums the results (e.g. linear
regression is a dot product of a vector of covariates and a vector of
weights). The vectors might have different magnitudes, for instance if
one subreddit has more words in comments than the other, so in cosine
similarity the dot product is normalized by the magnitudes (lengths) of
the two vectors. It turns out that this is equal to the cosine of the
angle between the two vectors, so cosine similarity in essence
quantifies the angle between them in high-dimensional space. The
greater the cosine similarity between two subreddits, the more closely
aligned (loosely, the more correlated) their TF-IDF vectors are.
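As a small illustration of the formula, the sketch below computes cosine similarity between two sparse vectors stored as dictionaries. It is not the implementation in cosine_similarities.py; the function name, the dictionary representation, and the convention of returning 0.0 for an all-zero vector are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Illustrative only: a and b map term -> TF-IDF score (sparse vectors)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(score * b[term] for term, score in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_a * norm_b)


small = {"bible": 3.0, "scripture": 4.0}
large = {"bible": 6.0, "scripture": 8.0}   # same direction, twice the magnitude
other = {"rain": 2.0}                      # no terms in common

same_direction = cosine_similarity(small, large)
no_overlap = cosine_similarity(small, other)
```

The vectors `small` and `large` differ only in magnitude, so normalizing by their lengths gives them the maximum similarity, while vectors with no shared terms have a dot product of zero.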
Cosine similarity with TF-IDF is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.
Compared to other approaches to similarity like those using word embeddings or topic models it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little bit. The advantages of this approach are simplicity and scalability. I'm thinking about using latent semantic analysis as an intermediate step to improve upon similarities based on raw TF-IDFs.
Even so, computing similarities between a large number of subreddits
is computationally expensive: comparing n subreddits requires
n(n-1)/2 dot-product evaluations. This can be sped up by passing
similarity-threshold=X where X>0 into cosine_similarities.py. I
used a cosine similarity function built into the Spark matrix
library, which supports the DIMSUM algorithm for approximating
matrix-matrix products. This algorithm is commonly used in industry
(e.g. at Twitter and Google) for large-scale similarity scoring.
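The n(n-1)/2 cost and the effect of a threshold can be seen in a brute-force sketch. Note that this is not DIMSUM: DIMSUM uses sampling to avoid computing most low-similarity pairs, whereas the hypothetical `pairwise_similarities` function below computes every pair exactly and only filters the output. The names and the dictionary-based vectors are assumptions for illustration.

```python
import math
from itertools import combinations


def _cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pairwise_similarities(vectors, threshold=0.0):
    """Illustrative only: exact all-pairs similarities above a threshold.

    vectors: dict mapping subreddit name -> sparse TF-IDF vector (dict).
    Visits each unordered pair once, i.e. n(n-1)/2 evaluations.
    """
    pairs = {}
    for name_a, name_b in combinations(sorted(vectors), 2):
        sim = _cosine(vectors[name_a], vectors[name_b])
        if sim > threshold:
            pairs[(name_a, name_b)] = sim
    return pairs


vectors = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0},
    "c": {"z": 1.0},
}
n_pairs = len(list(combinations(vectors, 2)))  # n(n-1)/2 for n = 3
kept = pairwise_similarities(vectors)
strict = pairwise_similarities(vectors, threshold=0.8)
```

Because the number of pairs grows quadratically with the number of subreddits, this exact approach becomes impractical quickly, which is the motivation for a thresholded approximate method.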
See also
The CDSC wiki page CommunityData:CDSC_Reddit is the landing page for this project on the wiki. The methods writeup above used to live there; it now lives here so that doc and code stay in sync.