The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code.

Move both into the repo so they stay in sync with the scripts they describe:

- datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity section (TF-IDF math, PMI phrase detection, cosine similarity) with MediaWiki math converted to markdown LaTeX and script references updated to current paths.

The wiki page has been trimmed to a landing page that points at these README files in gitea.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

176 lines
8.4 KiB
Markdown
# Subreddit similarity

This directory holds the code that computes pairwise similarities between
subreddits, both term-based (from TF-IDF over comment text) and
author-based (from overlapping commenter sets). Similarity matrices
produced here feed downstream clustering (`../clustering/`) and density
analysis (`../density/`).

## Datasets

Subreddit similarity datasets based on comment terms and comment authors
are available on Hyak in `/gscratch/comdata/output/reddit_similarity`.
The overall approach to subreddit similarity seems to work reasonably
well and the code is stabilizing. If you want help using these
similarities in a project, just reach out to
[Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29).

By default, the scripts here take a `TopN` parameter, which selects the
subreddits to include in the similarity dataset according to how many
total comments they have. Alternatively, you can point the
`included_subreddits` parameter at a file that lists the subreddits you
would like to include, one name per line.

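The two selection modes described above can be sketched as follows. This is an illustration only: the column names `subreddit` and `n_comments`, the sample rows, the case-insensitive matching, and the function names are all assumptions, not the actual layout of `subreddits_by_num_comments.csv` or the logic in `similarities_helper.py`.

```python
import csv
import io

# Hypothetical ranking data; the real subreddits_by_num_comments.csv
# may use different column names and will be far larger.
RANKING_CSV = """subreddit,n_comments
AskReddit,900
politics,500
christianity,120
Cooking,80
"""


def load_ranking(text):
    """Parse the ranking CSV into (subreddit, comment count) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["subreddit"], int(row["n_comments"])) for row in reader]


def select_subreddits(ranking, top_n=None, included=None):
    """Pick subreddits from an explicit list if given, else by comment count."""
    if included is not None:
        # Assumption: names are matched case-insensitively, blank lines ignored.
        wanted = {name.strip().lower() for name in included if name.strip()}
        return [name for name, _ in ranking if name.lower() in wanted]
    ranked = sorted(ranking, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


ranking = load_ranking(RANKING_CSV)
top_two = select_subreddits(ranking, top_n=2)
# Lines as they might be read from an included-subreddits file.
chosen = select_subreddits(ranking, included=["christianity\n", "cooking\n"])
```
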
## Scripts

| Script | What it does |
|---|---|
| `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. |
| `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. |
| `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. |
| `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. |
| `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. |
| `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
| `Makefile` | Wires everything together with the canonical Hyak output paths. |

## Methods

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and
simple information-retrieval technique that we can use to quantify the
topic of a subreddit. The goal of TF-IDF is to build a vector for each
subreddit that scores every term (or phrase) according to how
characteristic it is of the overall lexicon used in that subreddit. For
example, the most characteristic terms in the subreddit `/r/christianity`
in the current version of the TF-IDF model are:

| Term         | tf_idf |
|:------------:|:------:|
| christians   | 0.581  |
| christianity | 0.569  |
| kjv          | 0.568  |
| bible        | 0.557  |
| scripture    | 0.55   |

TF-IDF stands for "term frequency, inverse document frequency" because
it is the product of two quantities: term frequency and inverse document
frequency. Term frequency quantifies how often a term appears in a
subreddit (document). Inverse document frequency quantifies how rare
that term is across subreddits (documents): the more subreddits it
appears in, the lower its score. As you can see on the Wikipedia page,
there are many possible ways of constructing and combining these
quantities.

I chose to normalize term frequency by the maximum (raw) term frequency
for each subreddit:

$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$

where $f_{t,d}$ is the raw count of term $t$ in subreddit $d$.

I use the log inverse document frequency:

$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$

where $N$ is the number of subreddits in the corpus $D$.

I then combine them using some smoothing to get:

$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$

(Other normalization strategies are worth trying; see the note in
`TODO`.)

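A minimal sketch of the three formulas above on toy data, assuming the per-subreddit term counts already fit in memory as `Counter` objects. The function name `tfidf_vectors` and the sample counts are made up for illustration and are not taken from `tfidf.py`.

```python
import math
from collections import Counter


def tfidf_vectors(term_counts):
    """Map {subreddit: Counter of raw term counts} to {subreddit: {term: tfidf}}."""
    n_docs = len(term_counts)

    # Document frequency: in how many subreddits does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # idf_t = log(N / |{d : t in d}|)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = {}
    for subreddit, counts in term_counts.items():
        if not counts:
            vectors[subreddit] = {}
            continue
        max_f = max(counts.values())
        # tfidf_{t,d} = (0.5 + 0.5 * f_{t,d} / max_f) * idf_t
        vectors[subreddit] = {
            term: (0.5 + 0.5 * f / max_f) * idf[term]
            for term, f in counts.items()
        }
    return vectors


# Toy corpus of two "subreddits" (hypothetical counts).
toy_counts = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "cooking": Counter({"recipe": 3, "the": 6}),
}
toy_vectors = tfidf_vectors(toy_counts)
```

In this toy corpus "the" appears in both subreddits, so its IDF is $\log(2/2) = 0$ and it scores zero everywhere, while "bible" appears in only one and keeps a positive score.
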
### Building TF-IDF vectors

The process for building TF-IDF vectors has four steps:

1. Extracting terms using `../ngrams/tf_comments.py`
2. Detecting common phrases using `../ngrams/top_comment_phrases.py`
3. Extracting terms and common phrases using
   `../ngrams/tf_comments.py --mwe-pass='second'`
4. Building IDF and TF-IDF scores in `tfidf.py`

#### Running `tf_comments.py` on the backfill queue

The main reason I split this into four steps instead of one is to take
advantage of the backfill queue for running `tf_comments.py`. This step
requires reading all of the text in every comment and converting it to
a bag of words at the subreddit level. This is a lot of computation,
but it is easily parallelizable. The script `../ngrams/run_tf_jobs.sh`
partially automates running step 1 (or step 3) on the backfill queue.

#### Phrase detection using pointwise mutual information

TF-IDF is simple, but only uses single words (unigrams). Sequences of
multiple words can be important to account for how words have different
meanings in different contexts or how sequences of words refer to
distinct things like names. Dealing with context or longer sequences of
words is a common challenge in natural language processing, since the
number of possible n-grams grows explosively as n gets bigger. Phrase
detection helps with this problem by limiting the set of n-grams to the
most informative ones.

But how do we detect phrases? I implemented [pointwise mutual
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
(PMI), which is a fairly simple method but seems to work well.

PMI is a quantity derived from information theory. The intuition is
that if two words occur together quite frequently compared to how often
they appear separately, then the co-occurrence is likely to be
informative.

$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$

In `../ngrams/tf_comments.py`, if `--mwe-pass=first` is set, a 10% sample
of 1-4-grams (sequences of up to 4 terms) is written to a file to be
consumed by `../ngrams/top_comment_phrases.py`.
`top_comment_phrases.py` computes the PMI for these candidate phrases
and writes out those that occur at least 3500 times in the sample of
n-grams and have a PMI of at least 3 (about 65000 expressions).

`tf_comments.py --mwe-pass=second` then uses the detected phrases and
adds them to the term frequency data.

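To make the PMI filter concrete, here is a small sketch restricted to bigrams. It is not the code in `top_comment_phrases.py`: the function names, the toy sentence, the use of the natural logarithm, and the tiny thresholds are all assumptions chosen so the example fits on a page, and the real pipeline handles n-grams up to length 4 with far larger counts.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return ({bigram: pmi}, bigram counts) estimated from one token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[x] / n_uni    # marginal probability of each word
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores, bigrams


def detect_phrases(tokens, min_count, min_pmi):
    """Keep bigrams that are both frequent enough and have high enough PMI."""
    scores, bigrams = bigram_pmi(tokens)
    return {
        bigram
        for bigram, pmi in scores.items()
        if bigrams[bigram] >= min_count and pmi >= min_pmi
    }


tokens = "new york is big and new york is old and the city is new".split()
scores, bigram_counts = bigram_pmi(tokens)
# Toy thresholds; the pipeline described above uses 3500 occurrences and PMI >= 3.
phrases = detect_phrases(tokens, min_count=2, min_pmi=1.0)
```

The count threshold matters as much as the PMI threshold: a pair seen only once, such as `("the", "city")` here, can have a high PMI purely by chance, which is why both conditions are applied together.
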
## Cosine similarity

Once the TF-IDF vectors are built, computing a similarity score between
two subreddits is straightforward using cosine similarity.

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$

Intuitively, we represent two subreddits as lines in a high-dimensional
space (TF-IDF vectors). In linear algebra, the dot product ($\cdot$)
between two vectors is the sum of the products of their corresponding
entries (e.g. a linear regression prediction is a dot product of a
vector of covariates and a vector of weights). The vectors might have
different magnitudes, for example if one subreddit has more words in
comments than the other, so in cosine similarity the dot product is
normalized by the magnitude (length) of the vectors. It turns out that
this is equivalent to taking the cosine of the angle between the two
vectors. So cosine similarity in essence quantifies the angle between
the two lines in high-dimensional space. The greater the cosine
similarity between two subreddits, the more correlated their TF-IDF
vectors are.

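The formula can be checked on a pair of small sparse vectors. The sketch below assumes each TF-IDF vector is a dictionary from term to weight, with missing terms treated as zero; the function name and the sample weights are illustrative and do not come from `cosine_similarities.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the normalization removes the difference.
small = {"bible": 3.0, "church": 4.0}
large = {"bible": 6.0, "church": 8.0}
unrelated = {"recipe": 1.0}

same_direction = cosine_similarity(small, large)
no_overlap = cosine_similarity(small, unrelated)
```
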
Cosine similarity with TF-IDF is popular (indeed it has been applied to
Reddit in research several times before) because it quantifies the
correlation between the most characteristic terms for two communities.

Compared to other approaches to similarity, like those using word
embeddings or topic models, it may struggle to handle polysemy, synonymy,
or correlations between different terms. Using phrase detection helps
with this a little bit. The advantages of this approach are simplicity
and scalability. I'm thinking about using [latent semantic
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
intermediate step to improve upon similarities based on raw TF-IDFs.

Even so, computing similarities between a large number of subreddits
is computationally expensive: comparing $n$ subreddits requires
$n(n-1)/2$ dot-product evaluations. This can be sped up by passing
`similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I
used a cosine similarity function that's built into the Spark matrix
library, which supports the `DIMSUM` algorithm for approximating
matrix-matrix products. This algorithm is commonly used in industry
(e.g. at Twitter and Google) for large-scale similarity scoring.

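The quadratic cost is easy to see in a brute-force version. The sketch below is an exact all-pairs computation in plain Python, not DIMSUM and not the Spark call used by `cosine_similarities.py`: here the threshold only discards low-scoring pairs after they have been computed, whereas a sampling-based approach uses the threshold to avoid doing that work in the first place. The function name and the toy vectors are assumptions.

```python
import math
from itertools import combinations


def pairwise_similarities(vectors, threshold=0.0):
    """Exact cosine similarity for every pair, keeping scores above threshold."""
    norms = {
        name: math.sqrt(sum(w * w for w in vec.values()))
        for name, vec in vectors.items()
    }
    sims = {}
    # combinations() yields each unordered pair once: n * (n - 1) / 2 pairs.
    for a, b in combinations(sorted(vectors), 2):
        if norms[a] == 0 or norms[b] == 0:
            continue
        shorter, longer = vectors[a], vectors[b]
        if len(shorter) > len(longer):
            shorter, longer = longer, shorter
        # Iterate over the shorter vector to limit dictionary lookups.
        dot = sum(w * longer[t] for t, w in shorter.items() if t in longer)
        sim = dot / (norms[a] * norms[b])
        if sim > threshold:
            sims[(a, b)] = sim
    return sims


toy = {
    "a": {"x": 1.0, "y": 1.0},
    "b": {"x": 1.0},
    "c": {"z": 2.0},
}
n_pairs = len(list(combinations(toy, 2)))
all_sims = pairwise_similarities(toy)
strict_sims = pairwise_similarities(toy, threshold=0.8)
```

With three toy subreddits there are three pairs to score; with tens of thousands of subreddits the pair count runs into the hundreds of millions, which is why an approximate, thresholded method is attractive.
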
## See also

The CDSC wiki page
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
is the landing page for this project on the wiki. The methods writeup
above used to live there; it now lives here so that doc and code stay
in sync.