---
title: Utilities for Reddit Data Science
---

`cdsc_reddit` is a collection of tools for working with Reddit data on the
Hyak supercomputing system at the University of Washington. It is built
around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
pipelines scale to the full Pushshift archive.

The project was originally developed by [Nate
TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
and is now maintained by a rotating set of researchers in the Community
Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
others.

At a high level, the codebase covers four kinds of work:

- **Ingest.** Turning Pushshift comment and submission dumps into
  partitioned Parquet datasets that are fast to query by subreddit or by
  author.
- **Text features.** Building per-subreddit TF-IDF vectors over comment
  text, including a phrase-detection pass based on pointwise mutual
  information.
- **Similarity, clustering, and density.** Computing cosine similarities
  between subreddits (by terms or by overlapping authors), clustering the
  resulting similarity matrices, and summarizing how dense each
  neighborhood is.
- **Time series and visualization.** Pulling activity time series per
  subreddit and producing t-SNE plots of the clustering output.

Several pieces are still rough — the user interfaces for many of the
scripts assume familiarity with the project, and the TF-IDF pipeline does
not yet strip hyperlinks or bot comments, so subreddits with similar
automod messages can look misleadingly similar.

## Repository layout

| Directory | What's in it |
|---|---|
| `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
| `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
| `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
| `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
| `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
| `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
| `visualization/` | Altair-based interactive plots of subreddit clusters. |
| `bots/` | Heuristics for flagging likely bot accounts. |
| `examples/` | Small standalone examples using pyarrow. |

## Sourcing the dumps

Pushshift was effectively wound down after Reddit cut off third-party API
access in 2023, and the original `files.pushshift.io` archive is gone.
Collection of new Reddit comment and submission data has since been
picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
which publishes both the historical Pushshift archive and the new data
it continues to collect. Monthly updates are redistributed as academic
torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`.

Fetching the dumps with a torrent client is a manual prerequisite to
running the rest of this pipeline. Step-by-step instructions for the
current CDSC workflow, including which torrents to pull and how to stage
the `.zst` files on Hyak, live on the CDSC wiki at
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
scripts has been removed because the URLs they pointed at no longer
resolve.

## Building Parquet datasets

The raw dumps are huge compressed JSON files with a lot of metadata that
we usually don't need. They aren't indexed, so it's expensive to pull data
for just a handful of subreddits, and they are awkward to read directly
into Spark. Extracting the useful fields and rewriting the data as
Parquet makes everything downstream cheaper. The conversion happens in
two steps:

1. Extracting JSON into temporary, unpartitioned Parquet files using
   pyarrow (`comments_2_parquet_part1.py`,
   `submissions_2_parquet_part1.py`).
2. Repartitioning and sorting the data using PySpark
   (`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).

The final datasets live in `/gscratch/comdata/output/`:

- `reddit_comments_by_author.parquet` — comments partitioned and sorted by
  author (lowercase).
- `reddit_comments_by_subreddit.parquet` — comments partitioned and sorted
  by subreddit (lowercase).
- `reddit_submissions_by_author.parquet` — submissions partitioned and
  sorted by author (lowercase).
- `reddit_submissions_by_subreddit.parquet` — submissions partitioned and
  sorted by subreddit (lowercase).

Splitting the work this way lets us decompress and parse the dumps in the
Hyak backfill queue and then sort them in Spark. Partitioning makes it
possible to read data for specific subreddits or authors efficiently, and
sorting makes per-subreddit or per-user aggregations cheap. More
documentation on using these files lives on the [CDSC
wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).
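
As a rough illustration of the extraction idea behind step 1, the sketch
below keeps a handful of fields from newline-delimited JSON and lowercases
the partition keys. It is not the code in `comments_2_parquet_part1.py`:
the `KEEP_FIELDS` list and the `extract_rows` helper are hypothetical, the
input is an in-memory sample rather than a `.zst` dump, and the output is
plain dicts rather than Parquet.

```python
import io
import json

# Hypothetical subset of fields to keep; the real scripts may keep more.
KEEP_FIELDS = ("id", "author", "subreddit", "body", "created_utc")


def extract_rows(lines, fields=KEEP_FIELDS):
    """Yield one slim dict per JSON line, skipping lines that fail to parse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate the occasional corrupt line
        if not isinstance(record, dict):
            continue
        row = {field: record.get(field) for field in fields}
        # Assumption: partition keys are lowercased so that "AskReddit"
        # and "askreddit" end up in the same partition.
        for key in ("author", "subreddit"):
            if isinstance(row.get(key), str):
                row[key] = row[key].lower()
        yield row


sample = io.StringIO(
    '{"id": "c1", "author": "Alice", "subreddit": "AskReddit", '
    '"body": "hello", "created_utc": 1, "score": 5}\n'
    "not json\n"
    '{"id": "c2", "author": "bob", "subreddit": "Python", '
    '"body": "hi", "created_utc": 2}\n'
)

# Sorting by the partition key mimics, on a toy scale, what step 2 does in Spark.
rows = sorted(extract_rows(sample), key=lambda r: (r["subreddit"], r["created_utc"]))
```

Dropping unused fields early is what keeps the temporary files small enough
for the later sort to be practical.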

## TF-IDF subreddit similarity

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
information-retrieval technique we use to quantify the topic of a
subreddit. The goal is to build a vector for each subreddit that scores
every term (or phrase) according to how characteristic it is of the
lexicon used there. For example, the most characteristic terms in
`/r/christianity` in the current model are:

| Term         | tf_idf |
|:------------:|:------:|
| christians   | 0.581  |
| christianity | 0.569  |
| kjv          | 0.568  |
| bible        | 0.557  |
| scripture    | 0.550  |

TF-IDF is the product of two pieces: *term frequency* (how often a term
appears in a subreddit) and *inverse document frequency* (how rare the
term is across other subreddits). There are many ways to construct and
combine these; the [Wikipedia
page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
variants.

We normalize term frequency by the maximum raw term frequency for each
subreddit:

$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$$

and use the log inverse document frequency:

$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$

combined with a smoothing term:

$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$

(Other normalization strategies are worth trying — see the note in
`similarities/TODO`.)
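
The three formulas above can be sketched in plain Python as follows. This
is a minimal illustration on toy counts, not the Spark code in
`similarities/tfidf.py`; the `tfidf` function and the example counts are
made up for this sketch, and the natural logarithm is an assumption.

```python
import math
from collections import Counter


def tfidf(term_counts):
    """Map {subreddit: Counter({term: raw count})} to {subreddit: {term: score}}."""
    n_docs = len(term_counts)

    # Document frequency: number of subreddits in which each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(term for term, c in counts.items() if c > 0)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    scores = {}
    for sub, counts in term_counts.items():
        # Normalize by the most frequent term in this subreddit.
        max_f = max(counts.values(), default=0)
        scores[sub] = {
            term: (0.5 + 0.5 * f / max_f) * idf[term]
            for term, f in counts.items()
            if f > 0
        }
    return scores


counts = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "python": Counter({"import": 2, "the": 4}),
}
scores = tfidf(counts)
```

In this toy input, `the` appears in every subreddit, so its IDF (and hence
its score) is zero, while `bible` and `import` each appear in only one
subreddit and receive positive scores.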

### Building TF-IDF vectors

The pipeline has four steps:

1. Extract terms with `ngrams/tf_comments.py --mwe-pass=first`.
2. Detect common phrases with `ngrams/top_comment_phrases.py`.
3. Re-extract terms together with detected phrases via
   `ngrams/tf_comments.py --mwe-pass=second`.
4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.

#### Running `tf_comments.py` on the backfill queue

The main reason for the four-step layout is that `tf_comments.py` is
trivially parallel — it reads every comment and rewrites each subreddit
as a bag of words — so it benefits from being farmed out to the Hyak
backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.

#### Phrase detection using pointwise mutual information

TF-IDF over unigrams misses the fact that sequences of words often carry
distinct meaning (names, fixed expressions, in-jokes). Considering every
possible n-gram is prohibitive because the candidate set explodes with
`n`, so we use phrase detection to limit ourselves to informative
n-grams.

We use [pointwise mutual
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
(PMI), which is simple and works well in practice. The intuition is that
if two words co-occur much more often than their marginal frequencies
would predict, the pair is probably meaningful:

$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$

When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
computes PMI over that sample and keeps phrases that occur at least
3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
A second pass of `tf_comments.py --mwe-pass=second` folds those phrases
back into the term-frequency data.
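
The filtering logic can be sketched for the bigram case as follows. This is
an illustration only: `top_comment_phrases.py` works on 1- to 4-grams from a
sample of the full corpus, while the hypothetical `bigram_pmi` helper below
handles only adjacent word pairs in a toy token list, uses the natural
logarithm (the base behind the threshold of 3 is not stated here), and
estimates probabilities from simple relative frequencies.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1, min_pmi=0.0):
    """Return {(x, y): pmi} for adjacent pairs that pass both thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    kept = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            kept[(x, y)] = pmi
    return kept


tokens = "new york is big and new york is old and the sky is big".split()
phrases = bigram_pmi(tokens, min_count=2, min_pmi=2.0)
```

With these toy thresholds only `("new", "york")` survives: `("is", "big")`
occurs just as often, but `is` is common enough on its own that the pair's
PMI falls below the cutoff.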

### Cosine similarity

Once the TF-IDF vectors are built, computing a similarity score between
two subreddits is straightforward with cosine similarity:

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$

Each subreddit is a vector in a high-dimensional term space. The dot
product gives a weighted sum of shared terms, and dividing by the
vector magnitudes removes the effect of differing vector length (for
example, from a larger vocabulary), so what remains is the cosine of the
angle between the two vectors. Cosine similarity with TF-IDF is popular
(and has been used on Reddit several times in prior research) because it
captures correlation between the *most characteristic* terms of two
communities.
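
For two sparse TF-IDF vectors, the formula above reduces to a few lines of
Python. This sketch assumes each vector is stored as a `{term: weight}`
dict; the `cosine_similarity` helper and the example vectors are
illustrative and are not taken from the similarity scripts, which operate
on Spark matrices.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller vector; absent terms contribute zero to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty or all-zero vectors
    return dot / (norm_a * norm_b)


small = {"bible": 0.5, "scripture": 0.5}
large = {"bible": 1.0, "scripture": 1.0}   # same direction, twice the length
other = {"python": 1.0}                    # no shared terms

same_direction = cosine_similarity(small, large)
no_overlap = cosine_similarity(small, other)
```

Here `small` and `large` point in the same direction, so their similarity
is 1 up to floating-point error despite the difference in magnitude, while
vectors with no shared terms score 0.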

Compared to approaches based on word embeddings or topic models, this
method can struggle with polysemy, synonymy, and correlations between
related terms. Phrase detection helps a little. The trade-off is
simplicity and scalability. Adding [latent semantic
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
intermediate step is on the wish-list for improving on raw TF-IDF
similarities.

Even with these simplifications, computing similarities among a large
number of subreddits is expensive: naively, on the order of $n^2$ dot
products for $n$ subreddits. Passing `--similarity-threshold=X` (with
`X>0`) to the similarity scripts lets Spark's built-in matrix library use
the DIMSUM approximation, which is the same algorithm Twitter and Google
have used for large-scale similarity scoring.

## Clustering, density, and time series

The similarity matrices feed three follow-on analyses:

- `clustering/clustering.py` clusters a similarity matrix using
  affinity propagation; `clustering/selection.py` and
  `clustering/fit_tsne.py` are supporting scripts for hyperparameter
  selection and 2-D embeddings.
- `density/overlap_density.py` computes a per-subreddit overlap density
  measure from the similarity matrix.
- `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
  pull subreddit-level activity time series and join them against
  clustering output.

`visualization/tsne_vis.py` renders interactive Altair plots of the
clustering output — see the prebuilt HTML files in `visualization/` for
examples.

## Bot detection

`bots/good_bad_bot.py` computes user-level features (compression rate
of comment text, frequency of self-identification as a bot, etc.) that
are useful for filtering bot accounts out of downstream analyses. This
is preliminary work; nothing in the pipeline currently consumes it
automatically.
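
To give a sense of why compression rate is a useful signal, here is a
minimal sketch using `zlib` from the standard library. The definition used
here (compressed bytes divided by raw bytes over a user's concatenated
comments) and the `compression_rate` helper are assumptions made for
illustration; `bots/good_bad_bot.py` may define the feature differently.

```python
import zlib


def compression_rate(comments):
    """Compressed size divided by raw size of a user's concatenated comments."""
    raw = "\n".join(comments).encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress; treat as incompressible
    return len(zlib.compress(raw, 9)) / len(raw)


# A templated account repeats itself, so its text compresses very well.
bot_like = ["I am a bot, and this action was performed automatically."] * 50

# Varied text shares few long substrings and compresses much less.
human_like = [
    "Has anyone tried the new trail by the reservoir?",
    "My sourdough starter finally doubled overnight.",
    "That second-half comeback was unbelievable.",
    "Is there a good intro book on Bayesian statistics?",
    "Traffic on the bridge was awful this morning.",
]

bot_rate = compression_rate(bot_like)
human_rate = compression_rate(human_like)
```

A lower rate means more repetitive text, so in this toy comparison the
templated account scores far lower than the varied one.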