rewrite README, remove dead pushshift scripts and old/

Pushshift's files.pushshift.io archive has been gone since Reddit cut off third-party API access in 2023, so the dumps/ pull and SHA-check scripts no longer work. The old/ directory of pre-refactor scripts was likewise superseded by current versions in similarities/.

README rewritten to credit Nate as original developer, name current maintainers, document the directory layout, point at the CDSC wiki for the ArcticShift/torrent-based workflow, fix several stale script paths, and correct the tf-normalization formula (max, not sum).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 15:53:33 -07:00
parent 53f5b8c03c
commit d201930951
13 changed files with 183 additions and 455 deletions
--- a/README.md
+++ b/README.md
@@ -2,51 +2,111 @@
 title: Utilities for Reddit Data Science
 ---

+`cdsc_reddit` is a collection of tools for working with Reddit data on the
+Hyak supercomputing system at the University of Washington. It is built
+around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
+and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
+pipelines scale to the full Pushshift archive.

-The reddit_cdsc project contains tools for working with Reddit data.  The project is designed for the hyak super computing system at The University of Washington.  It consists of a set of python and bash scripts and uses the [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html "Pyspark documentation") and [pyarrow](https://arrow.apache.org/docs/python/ "documentation of python arrow bindings") to process large datasets.  As of November 1st 2020, the project is under active development by [Nate TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Washington.29 "Nate's profile on the Community Data Science Collective Wiki") and provides scripts for:
+The project was originally developed by [Nate
+TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
+and is now maintained by a rotating set of researchers in the Community
+Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
+others.

- Pulling and updating dumps from [Pushshift](https://pushshift.io "Pushshift.io") in `pull_pushshift_comments.sh` and `pull_pushshift_submissions.sh`.
- Uncompressing and parsing the dumps into [Parquet](https://parquet.apache.org/ "apahce parquet website") [datasets](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
- Running text analysis based on [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") including 
-  - Extracting terms from Reddit comments in `tf_comments.py`
-  - Detecting common phrases based on [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) "Wikipedia article on pointwise mutual information")
-  - Building TF-IDF vectors for each subreddit `idf_comments.py` and (more experimentally) at the subreddit-week level `idf_comments_weekly.py` 
-  - Computing cosine similarities between subreddits based on TF-IDF `term_cosine_similarity.py`. 
+At a high level, the codebase covers four kinds of work:

-Right now, two steps are still in earlier stages of progress:
+- **Ingest.** Turning Pushshift comment and submission dumps into
+  partitioned Parquet datasets that are fast to query by subreddit or by
+  author.
+- **Text features.** Building per-subreddit TF-IDF vectors over comment
+  text, including a phrase-detection pass based on pointwise mutual
+  information.
+- **Similarity, clustering, and density.** Computing cosine similarities
+  between subreddits (by terms or by overlapping authors), clustering the
+  resulting similarity matrices, and summarizing how dense each
+  neighborhood is.
+- **Time series and visualization.** Pulling activity time series per
+  subreddit and producing t-SNE plots of the clustering output.

- Approach comparable to tf-idf for similarity between subreddits in terms of comment authors. 
- Clustering subreddits based on cosine-similarities using [power iteration clustering (PIC)](http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf "Paper on power iteration clustering")
+Several pieces are still rough — the user interfaces for many of the
+scripts assume familiarity with the project, and the TF-IDF pipeline does
+not yet strip hyperlinks or bot comments, so subreddits with similar
+automod messages can look misleadingly similar.

-The TF-IDF for comments still has some kinks to iron out to remove hyper links and bot comments. Right now subreddits that have similar automoderation messages appear very similar.
+## Repository layout

-The user interfaces for most of the scripts are pretty crappy and need to be refined for re-use by others. 
+| Directory | What's in it |
+|---|---|
+| `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
+| `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
+| `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
+| `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
+| `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
+| `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
+| `visualization/` | Altair-based interactive plots of subreddit clusters. |
+| `bots/` | Heuristics for flagging likely bot accounts. |
+| `examples/` | Small standalone examples using pyarrow. |

-## Pulling data from [Pushshift](https://pushshift.io "Pushshift.io") ##
+## Sourcing the dumps

- `pull_pushshift_comments.sh` uses wget to download comment dumps to  `/gscratch/comdata/raw_data/reddit_dumps/comments`. It doesn't download files that already exists and runs `check_comments_shas.sh` to verify the files downloaded correctly. 
+Pushshift was effectively wound down after Reddit cut off third-party API
+access in 2023, and the original `files.pushshift.io` archive is gone.
+Collection of new Reddit comment and submission data has since been
+picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
+which publishes both the historical Pushshift archive and the new data
+it continues to collect. Monthly updates are redistributed as academic
+torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`. Fetching the
+dumps with a torrent client is a manual prerequisite to running the rest
+of this pipeline. Step-by-step instructions for the current CDSC
+workflow, including which torrents to pull and how to stage the `.zst`
+files on Hyak, live on the CDSC wiki at
+[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
+The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
+scripts has been removed since the URLs they pointed at no longer
+resolve.

- `pull_pushshift_submissions.sh` does the same for submissions and puts them in `/gscratch/comdata/raw_data/reddit_dumps/comments`.
+## Building Parquet datasets

-## Building Parquet Datasets ##
+The raw dumps are huge compressed JSON files with a lot of metadata that
+we usually don't need. They aren't indexed, so it's expensive to pull data
+for just a handful of subreddits, and they are awkward to read directly
+into Spark. Extracting the useful fields and rewriting the data as
+Parquet makes everything downstream cheaper. The conversion happens in
+two steps:

-Pushshift dumps are huge compressed json files with a lot of metadata that we may not need. It isn't indexed so it's expensive to pull data from just a handful of subreddits. It also turns out that it's a pain to read these compressed files straight into spark. Extracting useful variables from the dumps and building parquet datasets will make them easier to work with.  This happens in two steps:
+1. Extracting JSON into temporary, unpartitioned Parquet files using
+   pyarrow (`comments_2_parquet_part1.py`,
+   `submissions_2_parquet_part1.py`).
+2. Repartitioning and sorting the data using PySpark
+   (`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).

-1. Extracting json into (temporary, unpartitioned) parquet files using pyarrow.
-2. Repartitioning and sorting the data using pyspark.
+The final datasets live in `/gscratch/comdata/output/`:

-The final datasets are in `/gscratch/comdata/output.`
+- `reddit_comments_by_author.parquet` — comments partitioned and sorted by
+  author (lowercase).
+- `reddit_comments_by_subreddit.parquet` — comments partitioned and sorted
+  by subreddit (lowercase).
+- `reddit_submissions_by_author.parquet` — submissions partitioned and
+  sorted by author (lowercase).
+- `reddit_submissions_by_subreddit.parquet` — submissions partitioned and
+  sorted by subreddit (lowercase).

- `reddit_comments_by_author.parquet` has comments partitioned and sorted by username (lowercase).
- `reddit_comments_by_subreddit.parquet` has comments partitioned and sorted by subreddit name (lowercase).
- `reddit_submissions_by_author.parquet` has submissions partitioned and sorted by username (lowercase).
- `reddit_submissions_by_subreddit.parquet` has submissions partitioned and sorted by subreddit name (lowercase).
+Splitting the work this way lets us decompress and parse the dumps in the
+Hyak backfill queue and then sort them in Spark. Partitioning makes it
+possible to read data for specific subreddits or authors efficiently, and
+sorting makes per-subreddit or per-user aggregations cheap. More
+documentation on using these files lives on the [CDSC
+wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).

-Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors.  Sorting it means that you can efficiently compute agreggations at the subreddit or user level. More documentation on using these files is available [here](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
+## TF-IDF subreddit similarity

-## TF-IDF Subreddit Similarity ##
-
-[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") is common and simple information retrieval technique that we can use to quantify the topic of a subreddit.  The goal of TF-IDF is to build a vector for each subreddit that scores every term (or phrase) according to how characteristic it is of the overall lexicon used in that subreddit. For example, the most characteristic terms in the subreddit /r/christianity in the current version of the TF-IDF model are:
+[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
+information-retrieval technique we use to quantify the topic of a
+subreddit. The goal is to build a vector for each subreddit that scores
+every term (or phrase) according to how characteristic it is of the
+lexicon used there. For example, the most characteristic terms in
+`/r/christianity` in the current model are:

 | Term         | tf_idf |
 |:------------:|:------:|
@@ -56,61 +116,121 @@ Breaking this down into two steps is useful because it allows us to decompress a
 | bible        | 0.557  |
 | scripture    | 0.55   |

-TF-IDF stands for "term frequency - inverse document frequency" because it is the product of two terms "term frequency" and "inverse document frequency." Term frequency quantifies the amount that a term appears in a subreddit (document). Inverse document frequency quantifies how much that term appears in other subreddits (documents). As you can see on the Wikipedia page, there are many possible ways of constructing and combining these terms. 
+TF-IDF is the product of two pieces: *term frequency* (how often a term
+appears in a subreddit) and *inverse document frequency* (how rare the
+term is across other subreddits). There are many ways to construct and
+combine these; the [Wikipedia
+page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
+variants.

-$x + y = z_{1,d}$ 
+We normalize term frequency by the maximum raw term frequency for each
+subreddit:

-I chose to normalize term frequency by the maximum (raw) term frequency for each subreddit:
-$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{t^{'} \in d}{f_{t^{'},d}}}$ 
+$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$$

-I use the log inverse document frequency:
-$\mathrm{idf}_{t} = log\frac{N}{| {d \in D : t \in d} |}$
+and use the log inverse document frequency:

-I then combine them using some smoothing to get:
+$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$

-$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$ 
+combined with a smoothing term:

-### Building TF-IDF vectors ###
+$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$

-The process for building TF-IDF vectors has four steps:
+(Other normalization strategies are worth trying — see the note in
+`similarities/TODO`.)
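
As a rough illustration of the three formulas above, here is a minimal pure-Python sketch on a toy corpus. It is not the code in `similarities/tfidf.py`, which runs on Spark; the function name `toy_tfidf`, the toy data, and the choice of natural log are assumptions made for the example.

```python
import math
from collections import Counter


def toy_tfidf(docs):
    """Score terms per document (hypothetical helper, not part of the repo).

    docs maps a subreddit name to a non-empty list of its terms.
    """
    counts = {name: Counter(terms) for name, terms in docs.items()}
    n_docs = len(counts)

    # Document frequency: in how many subreddits does each term appear?
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    scores = {}
    for name, term_counts in counts.items():
        # Normalize by the maximum raw term frequency, not the sum.
        max_f = max(term_counts.values())
        scores[name] = {
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in term_counts.items()
        }
    return scores


toy = {
    "christianity": ["bible", "bible", "scripture", "the"],
    "python": ["code", "code", "code", "the"],
}
scores = toy_tfidf(toy)
for term, score in sorted(scores["christianity"].items()):
    print(term, round(score, 3))
```

A term that appears in every subreddit gets an IDF of log(1) = 0, so its score is zero regardless of the smoothing term.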

-1. Extracting terms using `tf_comments.py`
-2. Detecting common phrases using `top_comment_phrases.py`
-3. Extracting terms and common phrases using `tf_comments.py --mwe-pass='second'`
-4. Building idf and tf-idf scores in `idf_comments.py`
+### Building TF-IDF vectors

-#### Running `tf_comments.py` on the backfill queue ####
+The pipeline has four steps:

-The main reason that I did it in 4 steps instead of one is to take advantage of the backfill queue for running `tf_comments.py`.  This step requires reading all of the text in every comment and converting it to a bag of words at the subreddit-level.  This is a lot of computation that is easily parallelizable. The script `run_tf_jobs.sh` partially automates running steps 1 (or 3) on the backfill queue. 
+1. Extract terms with `ngrams/tf_comments.py`.
+2. Detect common phrases with `ngrams/top_comment_phrases.py`.
+3. Re-extract terms together with detected phrases via
+   `ngrams/tf_comments.py --mwe-pass=second`.
+4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.

-#### Phrase detection using Pointwise Mutual Information ####
+#### Running `tf_comments.py` on the backfill queue

-TF-IDF is simple, but only uses single words (unigrams).  Sequences of multiple words can be important to account for how words have different meanings in different contexts or how sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing since the number of possible n-grams grows like crazy as n gets bigger. Phrase detection helps this  problem by limiting the set of n-grams to those most informative. 
+The main reason for the four-step layout is that `tf_comments.py` is
+trivially parallel — it reads every comment and rewrites each subreddit
+as a bag of words — so it benefits from being farmed out to the Hyak
+backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.

-But how do we detect phrases?  I implemented [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) "Wikipedia article on pointwise mutual information"), which is a pretty simple way, but seems to work pretty well. 
+#### Phrase detection using pointwise mutual information

-PMI is an quantity derived from information theory. The intuition is that if two words occur together quite frequently compared to how often they appear separately then the cooccurrance is likely to be informative. 
+TF-IDF over unigrams misses the fact that sequences of words often carry
+distinct meaning (names, fixed expressions, in-jokes). Considering every
+possible n-gram is prohibitive because the candidate set explodes with
+`n`, so we use phrase detection to limit ourselves to informative
+n-grams.

-$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.$
+We use [pointwise mutual
+information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
+(PMI), which is simple and works well in practice. The intuition is that
+if two words co-occur much more often than their marginal frequencies
+would predict, the pair is probably meaningful:

-In `tf_comments.py` if `--mwe-pass=first` then a 10\% sample of 1-4-grams (sequences of terms up to length 4) will be written to a file to be consumed by `top_comment_phrases.py`.  `top_comment_phrases.py` computes the PMI for these possible phrases and writes those that occur at least 3500 times in the sample of n-grams and have a PWMI of at least 3 (about 65000 expressions). 
+$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$

-`tf_comments.py --mwe-pass=second` then uses the detected phrases and adds them to the term frequency data. 
+When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
+sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
+computes PMI over that sample and keeps phrases that occur at least
+3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
+A second pass of `tf_comments.py --mwe-pass=second` folds those phrases
+back into the term-frequency data.
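
The PMI filter can be sketched in a few lines of plain Python for the bigram case. This is only an illustration under assumptions: the function name `bigram_pmi`, its parameters, the toy sentence, and the use of natural log are all made up for the example, and the real `top_comment_phrases.py` works on a sampled n-gram file rather than an in-memory list.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1, min_pmi=float("-inf")):
    """Return {(x, y): pmi} for adjacent word pairs (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    kept = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # frequency floor, analogous to the 3,500 threshold
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            kept[(x, y)] = pmi
    return kept


tokens = "new york is big and new york is old".split()
print(bigram_pmi(tokens))
print(bigram_pmi(tokens, min_count=2))
```

On the toy sentence, the one-off pair `("big", "and")` scores higher than `("new", "york")` until the minimum count is applied, which illustrates why a frequency floor is useful alongside the PMI cutoff.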

-### Cosine Similarity ###
+### Cosine similarity

-Once the tf-idf vectors are built, making a similarity score between two subreddits is straightforward using cosine similarity. 
+Once the TF-IDF vectors are built, computing a similarity score between
+two subreddits is straightforward with cosine similarity:

-$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$
+$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$

-Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). 
-In linear algebra, the dot product ($\cdot$) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).  
-The vectors might have different lengths like if one subreddit has words in comments than the other, so in cosine similarity the dot product is normalized by the magnitude (lengths) of the vectors. 
-It turns out that this is equivalent to taking the cosine of the two vectors.  So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space.  If the cosine similarity between two subreddits is greater then their tf-idf vectors are more correlated. 
+Each subreddit is a vector in a high-dimensional term space. The dot
+product gives a weighted sum of shared terms, and dividing by the
+vector magnitudes removes the effect of differing vocabulary size — what
+remains is the cosine of the angle between the two vectors. Cosine
+similarity with TF-IDF is popular (and has been used on Reddit several
+times in prior research) because it captures correlation between the
+*most characteristic* terms of two communities.
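
For intuition, the formula above can be written directly over sparse term-to-score dictionaries. This is a minimal sketch, not the Spark implementation used by the similarity scripts; the function name `cosine_similarity` and the toy vectors are assumptions for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


sub_a = {"bible": 1.0, "faith": 1.0}
sub_b = {"bible": 1.0, "code": 1.0}
print(cosine_similarity(sub_a, sub_b))
```

Because the result depends only on direction, scaling every score in one vector by a constant leaves the similarity unchanged.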

-Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.
+Compared to approaches based on word embeddings or topic models, this
+method can struggle with polysemy, synonymy, and correlations between
+related terms. Phrase detection helps a little. The trade-off is
+simplicity and scalability. Adding [latent semantic
+analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
+intermediate step is on the wish-list for improving on raw TF-IDF
+similarities.

-Compared to other approach to similarity like those using word embeddings or topic models it may struggle to handle polysemy, synonymy, or correlations between different terms.  Using phrase detection helps with this a little bit.  The advantages of this approach are simplicity and scalability.  I'm thinking about using [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis "Wikipedia article on Latent semantic analysis") as an intermediate step to improve upon similarities based on raw tf-idfs. 
+Even with these simplifications, computing similarities among a large
+number of subreddits is expensive: naively, it takes $n^2$ dot products. Passing
+`--similarity-threshold=X` (with `X>0`) to the similarity scripts lets
+Spark's built-in matrix library use the DIMSUM approximation, which is
+the same algorithm Twitter and Google have used for large-scale
+similarity scoring.

-Even still, computing similarities between a large number of subreddits is computationally expensive and requires $n^2$ dot-product evaluations. 
-This can be sped up by passing `similarity-threshold=X` where $X>0$ into `term_comment_similarity.py`.  I used a cosine similarity function that's built into the spark matrix library which supports the `DIMSUM` algorithm for approximating matrix-matrix products.  This algorithm is commonly used in industry (i.e. at Twitter, Google) for large-scale similarity scoring.
+## Clustering, density, and time series
+
+The similarity matrices feed three follow-on analyses:
+
+- `clustering/clustering.py` clusters a similarity matrix using
+  affinity propagation; `clustering/selection.py` and
+  `clustering/fit_tsne.py` are supporting scripts for hyperparameter
+  selection and 2-D embeddings.
+- `density/overlap_density.py` computes a per-subreddit overlap density
+  measure from the similarity matrix.
+- `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
+  pull subreddit-level activity time series and join them against
+  clustering output.
+
+`visualization/tsne_vis.py` renders interactive Altair plots of the
+clustering output — see the prebuilt HTML files in `visualization/` for
+examples.
+
+## Bot detection
+
+`bots/good_bad_bot.py` computes user-level features (compression rate
+of comment text, frequency of self-identification as a bot, etc.) that
+are useful for filtering bot accounts out of downstream analyses. This
+is preliminary work; nothing in the pipeline currently consumes it
+automatically.