---
title: Utilities for Reddit Data Science
---

`cdsc_reddit` is a collection of tools for working with Reddit data on the
Hyak supercomputing system at the University of Washington. It is built
around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
pipelines scale to the full Pushshift archive.

The project was originally developed by [Nate
TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
and is now maintained by a rotating set of researchers in the Community
Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
others.

At a high level, the codebase covers four kinds of work:

- **Ingest.** Turning Pushshift comment and submission dumps into
  partitioned Parquet datasets that are fast to query by subreddit or by
  author.
- **Text features.** Building per-subreddit TF-IDF vectors over comment
  text, including a phrase-detection pass based on pointwise mutual
  information.
- **Similarity, clustering, and density.** Computing cosine similarities
  between subreddits (by terms or by overlapping authors), clustering the
  resulting similarity matrices, and summarizing how dense each
  neighborhood is.
- **Time series and visualization.** Pulling activity time series per
  subreddit and producing t-SNE plots of the clustering output.

Several pieces are still rough. The user interfaces for many of the
scripts assume familiarity with the project, and the TF-IDF pipeline does
not yet strip hyperlinks or bot comments, so subreddits with similar
automod messages can look misleadingly similar.

## Repository layout

| Directory | What's in it |
|---|---|
| `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
| `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
| `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
| `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
| `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
| `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
| `visualization/` | Altair-based interactive plots of subreddit clusters. |
| `bots/` | Heuristics for flagging likely bot accounts. |
| `examples/` | Small standalone examples using pyarrow. |

## Sourcing the dumps

Pushshift was effectively wound down after Reddit cut off third-party API
access in 2023, and the original `files.pushshift.io` archive is gone.
Collection of new Reddit comment and submission data has since been
picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
which publishes both the historical Pushshift archive and the new data
it continues to collect. Monthly updates are redistributed as academic
torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`.

Fetching the dumps with a torrent client is a manual prerequisite to
running the rest of this pipeline. Step-by-step instructions for the
current CDSC workflow, including which torrents to pull and how to stage
the `.zst` files on Hyak, live on the CDSC wiki at
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
scripts has been removed because the URLs they pointed at no longer
resolve.

## Building Parquet datasets

The raw dumps are huge compressed JSON files with a lot of metadata that
we usually don't need. They aren't indexed, so it's expensive to pull data
for just a handful of subreddits, and they are awkward to read directly
into Spark. Extracting the useful fields and rewriting the data as
Parquet makes everything downstream cheaper. The conversion happens in
two steps:

1. Extracting JSON into temporary, unpartitioned Parquet files using
   pyarrow (`comments_2_parquet_part1.py`,
   `submissions_2_parquet_part1.py`).
2. Repartitioning and sorting the data using PySpark
   (`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).

The final datasets live in `/gscratch/comdata/output/`:

- `reddit_comments_by_author.parquet`: comments partitioned and sorted by
  author (lowercase).
- `reddit_comments_by_subreddit.parquet`: comments partitioned and sorted
  by subreddit (lowercase).
- `reddit_submissions_by_author.parquet`: submissions partitioned and
  sorted by author (lowercase).
- `reddit_submissions_by_subreddit.parquet`: submissions partitioned and
  sorted by subreddit (lowercase).

Splitting the work this way lets us decompress and parse the dumps in the
Hyak backfill queue and then sort them in Spark. Partitioning makes it
possible to read data for specific subreddits or authors efficiently, and
sorting makes per-subreddit or per-user aggregations cheap. More
documentation on using these files lives on the [CDSC
wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).

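The two-step idea (extract the useful fields, then partition and sort) can be sketched in plain Python. This is only an illustration of the logic, not the project's code: the real scripts use pyarrow and PySpark, the field names in `KEEP` are assumed for the example, and the secondary sort by `created_utc` is an assumption as well.

```python
import json
from collections import defaultdict

# Fields to keep; assumed for illustration, not the project's actual schema.
KEEP = ("id", "author", "subreddit", "created_utc", "body")

def extract(lines):
    """Step 1: parse newline-delimited JSON and keep only the useful fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {key: record.get(key) for key in KEEP}

def partition_and_sort(rows, key="subreddit"):
    """Step 2: group rows by the lowercased key, then sort each group by time."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[str(row[key]).lower()].append(row)
    for group in partitions.values():
        group.sort(key=lambda r: r["created_utc"])
    # Return partitions ordered by key so lookups and scans are predictable.
    return dict(sorted(partitions.items()))

# Three made-up comment records standing in for one dump file.
raw = [
    '{"id": "c1", "author": "Alice", "subreddit": "Python", "created_utc": 20, "body": "hi", "score": 3}',
    '{"id": "c2", "author": "bob", "subreddit": "python", "created_utc": 10, "body": "yo", "score": 1}',
    '{"id": "c3", "author": "Alice", "subreddit": "AskScience", "created_utc": 15, "body": "why?", "score": 7}',
]
by_subreddit = partition_and_sort(extract(raw))
```

Passing `key="author"` to `partition_and_sort` gives the by-author layout in the same way.
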
## TF-IDF subreddit similarity

[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
information-retrieval technique we use to quantify the topic of a
subreddit. The goal is to build a vector for each subreddit that scores
every term (or phrase) according to how characteristic it is of the
lexicon used there. For example, the most characteristic terms in
`/r/christianity` in the current model are:

| Term         | tf_idf |
|:------------:|:------:|
| christians   | 0.581  |
| christianity | 0.569  |
| kjv          | 0.568  |
| bible        | 0.557  |
| scripture    | 0.55   |

TF-IDF is the product of two pieces: *term frequency* (how often a term
appears in a subreddit) and *inverse document frequency* (how rare the
term is across other subreddits). There are many ways to construct and
combine these; the [Wikipedia
page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
variants.

Here each document $d$ is a subreddit and $f_{t,d}$ is the raw count of
term $t$ in $d$. We normalize term frequency by the maximum raw term
frequency for each subreddit:

$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d} f_{t',d}}$$

and use the log inverse document frequency, where $D$ is the set of
subreddits and $N = |D|$:

$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$

combined with a smoothing term:

$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$

(Other normalization strategies are worth trying; see the note in
`similarities/TODO`.)

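To make the three formulas above concrete, here is a minimal pure-Python sketch on toy counts. It is not the code in `similarities/tfidf.py`; the function name, the input layout (a dict of `Counter` objects) and the use of the natural logarithm are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf(counts_by_doc):
    """Score terms per document; counts_by_doc maps subreddit -> Counter of raw counts."""
    n_docs = len(counts_by_doc)
    # Document frequency: in how many subreddits does each term appear?
    doc_freq = Counter()
    for counts in counts_by_doc.values():
        doc_freq.update(counts.keys())
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    scores = {}
    for doc, counts in counts_by_doc.items():
        max_f = max(counts.values())  # max-normalization; assumes a non-empty document
        scores[doc] = {
            term: (0.5 + 0.5 * f / max_f) * idf[term]
            for term, f in counts.items()
        }
    return scores

# Made-up counts for two subreddits.
toy = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "python": Counter({"code": 3, "the": 6}),
}
scores = tfidf(toy)
```

In this toy input, "the" appears in both subreddits, so its IDF is $\log(2/2) = 0$ and its score vanishes, while "bible" gets $(0.5 + 0.5 \cdot 4/8) \cdot \log 2$.
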
### Building TF-IDF vectors

The pipeline has four steps:

1. Extract terms with `ngrams/tf_comments.py`.
2. Detect common phrases with `ngrams/top_comment_phrases.py`.
3. Re-extract terms together with detected phrases via
   `ngrams/tf_comments.py --mwe-pass=second`.
4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.

#### Running `tf_comments.py` on the backfill queue

The main reason for the four-step layout is that `tf_comments.py` is
trivially parallel: it reads every comment and rewrites each subreddit
as a bag of words, so it benefits from being farmed out to the Hyak
backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.

#### Phrase detection using pointwise mutual information

TF-IDF over unigrams misses the fact that sequences of words often carry
distinct meaning (names, fixed expressions, in-jokes). Considering every
possible n-gram is prohibitive because the candidate set explodes with
`n`, so we use phrase detection to limit ourselves to informative
n-grams.

We use [pointwise mutual
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
(PMI), which is simple and works well in practice. The intuition is that
if two words co-occur much more often than their marginal frequencies
would predict, the pair is probably meaningful:

$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$

When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
computes PMI over that sample and keeps phrases that occur at least
3,500 times and have a PMI of at least 3, which leaves roughly 65,000
expressions. A second pass of `tf_comments.py --mwe-pass=second` folds
those phrases back into the term-frequency data.

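The count-and-PMI filter described above can be sketched for bigrams as follows. This is a simplified illustration, not `top_comment_phrases.py`: it handles only 2-grams, estimates probabilities from raw counts in a single token list, uses the natural logarithm, and the function names and toy thresholds are made up for the example.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Return bigram counts and PMI for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # joint probability of the pair
        p_x = unigrams[x] / n_uni    # marginal probability of each word
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return bigrams, pmi

def keep_phrases(bigrams, pmi, min_count, min_pmi):
    """Keep pairs that are both frequent enough and strongly associated."""
    return sorted(
        pair for pair in bigrams
        if bigrams[pair] >= min_count and pmi[pair] >= min_pmi
    )

tokens = "new york is big new york is old the city is big".split()
bigrams, pmi = bigram_pmi(tokens)
# Toy thresholds; the pipeline described above uses 3,500 occurrences and PMI >= 3.
phrases = keep_phrases(bigrams, pmi, min_count=2, min_pmi=1.5)
```

On this toy text, "york is" and "is big" are as frequent as "new york", but "is" is common enough on its own that their PMI falls below the threshold, so only "new york" survives.
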
### Cosine similarity

Once the TF-IDF vectors are built, computing a similarity score between
two subreddits is straightforward with cosine similarity:

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$

Each subreddit is a vector in a high-dimensional term space. The dot
product gives a weighted sum of shared terms, and dividing by the
vector magnitudes removes the effect of differing vocabulary size, so
what remains is the cosine of the angle between the two vectors. Cosine
similarity with TF-IDF is popular (and has been used on Reddit several
times in prior research) because it captures correlation between the
*most characteristic* terms of two communities.

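As a small illustration of the formula above, the following sketch computes cosine similarity between two sparse vectors stored as dicts mapping terms to weights. The representation, the function name, and the toy weights are assumptions for the example; the project's scripts compute this at scale in Spark.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)

# Made-up TF-IDF weights for three hypothetical subreddits.
sub_a = {"bible": 3.0, "church": 4.0}
sub_b = {"bible": 1.0}
sub_c = {"code": 1.0}

sim_ab = cosine_similarity(sub_a, sub_b)  # share one term
sim_ac = cosine_similarity(sub_a, sub_c)  # share no terms
```

Subreddits with no terms in common score 0, and a subreddit compared with itself scores 1, regardless of how many terms each vector contains.
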
Compared to approaches based on word embeddings or topic models, this
method can struggle with polysemy, synonymy, and correlations between
related terms. Phrase detection helps a little. The trade-off is
simplicity and scalability. Adding [latent semantic
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
intermediate step is on the wish-list for improving on raw TF-IDF
similarities.

Even with these simplifications, similarity between a large number of
subreddits is expensive: naively, $n^2$ dot products. Passing
`--similarity-threshold=X` (with `X>0`) to the similarity scripts lets
Spark's built-in matrix library use the DIMSUM approximation, which is
the same algorithm Twitter and Google have used for large-scale
similarity scoring.

## Clustering, density, and time series

The similarity matrices feed three follow-on analyses:

- `clustering/clustering.py` clusters a similarity matrix using
  affinity propagation; `clustering/selection.py` and
  `clustering/fit_tsne.py` are supporting scripts for hyperparameter
  selection and 2-D embeddings.
- `density/overlap_density.py` computes a per-subreddit overlap density
  measure from the similarity matrix.
- `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
  pull subreddit-level activity time series and join them against
  clustering output.

`visualization/tsne_vis.py` renders interactive Altair plots of the
clustering output; see the prebuilt HTML files in `visualization/` for
examples.

## Bot detection

`bots/good_bad_bot.py` computes user-level features (compression rate
of comment text, frequency of self-identification as a bot, etc.) that
are useful for filtering bot accounts out of downstream analyses. This
is preliminary work; nothing in the pipeline currently consumes it
automatically.

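The two features named above can be sketched with the standard library. This is not the logic in `bots/good_bad_bot.py`: the definitions here (compressed size divided by raw size using `zlib`, and the share of comments containing the phrase "i am a bot") are assumptions chosen to illustrate why such features separate repetitive bot output from ordinary writing.

```python
import zlib

def compression_rate(comments):
    """Compressed size over raw size of a user's comments; lower means more repetitive."""
    text = "\n".join(comments).encode("utf-8")
    if not text:
        return 1.0  # convention chosen here for users with no text
    return len(zlib.compress(text)) / len(text)

def self_identification_rate(comments, phrase="i am a bot"):
    """Fraction of comments in which the user calls itself a bot (assumed phrase)."""
    if not comments:
        return 0.0
    hits = sum(phrase in comment.lower() for comment in comments)
    return hits / len(comments)

# Made-up comment histories for a bot-like and a human-like account.
bot_like = ["I am a bot, and this action was performed automatically."] * 50
human_like = [
    "Has anyone tried the new trail by the reservoir?",
    "My sourdough starter finally doubled overnight.",
    "That match went to penalties, unbelievable finish.",
    "Looking for book recommendations on medieval history.",
]

bot_rate = compression_rate(bot_like)
human_rate = compression_rate(human_like)
```

A templated account compresses to a small fraction of its raw size, while varied human text compresses far less, so a low compression rate combined with a high self-identification rate is a reasonable flag for manual review.
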