Compare commits
18 Commits
charliepat
...
master
| Author | SHA1 | Date | |
|---|---|---|---|
| | 2390d2d10c | | |
| | 0ea57b2377 | | |
| | 6c6e05c360 | | |
| | 4854d4f537 | | |
| | bf6ccbc84a | | |
| | 18925dfe5b | | |
| | 926e9bc364 | | |
| | 526dc03732 | | |
| | 6b18840604 | | |
| | 2d1d760142 | | |
| | 1851132a06 | | |
| | 33150243cd | | |
| | 8965a251b6 | | |
| | d201930951 | | |
| | 53f5b8c03c | | |
| | 14ab979f59 | | |
| | c6122bb429 | | |
| | 596e1ff339 | | |
246
README.md
@@ -2,51 +2,111 @@
|
|||||||
title: Utilities for Reddit Data Science
|
title: Utilities for Reddit Data Science
|
||||||
---
|
---
|
||||||
|
|
||||||
|
`cdsc_reddit` is a collection of tools for working with Reddit data on the
|
||||||
|
Hyak super computing system at the University of Washington. It is built
|
||||||
|
around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
|
||||||
|
and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
|
||||||
|
pipelines scale to the full Pushshift archive.
|
||||||
|
|
||||||
The reddit_cdsc project contains tools for working with Reddit data. The project is designed for the hyak super computing system at The University of Washington. It consists of a set of python and bash scripts and uses the [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html "Pyspark documentation") and [pyarrow](https://arrow.apache.org/docs/python/ "documentation of python arrow bindings") to process large datasets. As of November 1st 2020, the project is under active development by [Nate TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Washington.29 "Nate's profile on the Community Data Science Collective Wiki") and provides scripts for:
|
The project was originally developed by [Nate
|
||||||
|
TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
|
||||||
|
and is now maintained by a rotating set of researchers in the Community
|
||||||
|
Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
|
||||||
|
others.
|
||||||
|
|
||||||
- Pulling and updating dumps from [Pushshift](https://pushshift.io "Pushshift.io") in `pull_pushshift_comments.sh` and `pull_pushshift_submissions.sh`.
|
At a high level, the codebase covers four kinds of work:
|
||||||
- Uncompressing and parsing the dumps into [Parquet](https://parquet.apache.org/ "Apache Parquet website") [datasets](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
|
|
||||||
- Running text analysis based on [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") including
|
|
||||||
- Extracting terms from Reddit comments in `tf_comments.py`
|
|
||||||
- Detecting common phrases based on [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information "Wikipedia article on pointwise mutual information")
|
|
||||||
- Building TF-IDF vectors for each subreddit `idf_comments.py` and (more experimentally) at the subreddit-week level `idf_comments_weekly.py`
|
|
||||||
- Computing cosine similarities between subreddits based on TF-IDF `term_cosine_similarity.py`.
|
|
||||||
|
|
||||||
Right now, two steps are still in earlier stages of progress:
|
- **Ingest.** Turning Pushshift comment and submission dumps into
|
||||||
|
partitioned Parquet datasets that are fast to query by subreddit or by
|
||||||
|
author.
|
||||||
|
- **Text features.** Building per-subreddit TF-IDF vectors over comment
|
||||||
|
text, including a phrase-detection pass based on pointwise mutual
|
||||||
|
information.
|
||||||
|
- **Similarity, clustering, and density.** Computing cosine similarities
|
||||||
|
between subreddits (by terms or by overlapping authors), clustering the
|
||||||
|
resulting similarity matrices, and summarizing how dense each
|
||||||
|
neighborhood is.
|
||||||
|
- **Time series and visualization.** Pulling activity time series per
|
||||||
|
subreddit and producing t-SNE plots of the clustering output.
|
||||||
|
|
||||||
- Approach comparable to tf-idf for similarity between subreddits in terms of comment authors.
|
Several pieces are still rough — the user interfaces for many of the
|
||||||
- Clustering subreddits based on cosine-similarities using [power iteration clustering (PIC)](http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf "Paper on power iteration clustering")
|
scripts assume familiarity with the project, and the TF-IDF pipeline does
|
||||||
|
not yet strip hyperlinks or bot comments, so subreddits with similar
|
||||||
|
automod messages can look misleadingly similar.
|
||||||
|
|
||||||
The TF-IDF for comments still has some kinks to iron out to remove hyper links and bot comments. Right now subreddits that have similar automoderation messages appear very similar.
|
## Repository layout
|
||||||
|
|
||||||
The user interfaces for most of the scripts are pretty crappy and need to be refined for re-use by others.
|
| Directory | What's in it |
|
||||||
|
|---|---|
|
||||||
|
| `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
|
||||||
|
| `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
|
||||||
|
| `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
|
||||||
|
| `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
|
||||||
|
| `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
|
||||||
|
| `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
|
||||||
|
| `visualization/` | Altair-based interactive plots of subreddit clusters. |
|
||||||
|
| `bots/` | Heuristics for flagging likely bot accounts. |
|
||||||
|
| `examples/` | Small standalone examples using pyarrow. |
|
||||||
|
|
||||||
## Pulling data from [Pushshift](https://pushshift.io "Pushshift.io") ##
|
## Sourcing the dumps
|
||||||
|
|
||||||
- `pull_pushshift_comments.sh` uses wget to download comment dumps to `/gscratch/comdata/raw_data/reddit_dumps/comments`. It skips files that already exist and runs `check_comments_shas.sh` to verify that the files downloaded correctly.
|
Pushshift was effectively wound down after Reddit cut off third-party API
|
||||||
|
access in 2023, and the original `files.pushshift.io` archive is gone.
|
||||||
|
Collection of new Reddit comment and submission data has since been
|
||||||
|
picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
|
||||||
|
which publishes both the historical Pushshift archive and the new data
|
||||||
|
it continues to collect, with monthly updates redistributed as academic
|
||||||
|
torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`. Fetching the
|
||||||
|
dumps from a torrent client is a manual prerequisite to running the rest
|
||||||
|
of this pipeline; step-by-step instructions for the current CDSC
|
||||||
|
workflow — including which torrents to pull and how to stage the `.zst`
|
||||||
|
files on Hyak — live on the CDSC wiki at
|
||||||
|
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
|
||||||
|
The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
|
||||||
|
scripts has been removed since the URLs they pointed at no longer
|
||||||
|
resolve.
|
||||||
|
|
||||||
- `pull_pushshift_submissions.sh` does the same for submissions and puts them in `/gscratch/comdata/raw_data/reddit_dumps/comments`.
|
## Building Parquet datasets
|
||||||
|
|
||||||
## Building Parquet Datasets ##
|
The raw dumps are huge compressed JSON files with a lot of metadata that
|
||||||
|
we usually don't need. They aren't indexed, so it's expensive to pull data
|
||||||
|
for just a handful of subreddits, and they are awkward to read directly
|
||||||
|
into Spark. Extracting the useful fields and rewriting the data as
|
||||||
|
Parquet makes everything downstream cheaper. The conversion happens in
|
||||||
|
two steps:
|
||||||
|
|
||||||
Pushshift dumps are huge compressed json files with a lot of metadata that we may not need. It isn't indexed so it's expensive to pull data from just a handful of subreddits. It also turns out that it's a pain to read these compressed files straight into spark. Extracting useful variables from the dumps and building parquet datasets will make them easier to work with. This happens in two steps:
|
1. Extracting JSON into temporary, unpartitioned Parquet files using
|
||||||
|
pyarrow (`comments_2_parquet_part1.py`,
|
||||||
|
`submissions_2_parquet_part1.py`).
|
||||||
|
2. Repartitioning and sorting the data using PySpark
|
||||||
|
(`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).
|
||||||
|
|
||||||
1. Extracting json into (temporary, unpartitioned) parquet files using pyarrow.
|
The final datasets live in `/gscratch/comdata/output/`:
|
||||||
2. Repartitioning and sorting the data using pyspark.
|
|
||||||
|
|
||||||
The final datasets are in `/gscratch/comdata/output.`
|
- `reddit_comments_by_author.parquet` — comments partitioned and sorted by
|
||||||
|
author (lowercase).
|
||||||
|
- `reddit_comments_by_subreddit.parquet` — comments partitioned and sorted
|
||||||
|
by subreddit (lowercase).
|
||||||
|
- `reddit_submissions_by_author.parquet` — submissions partitioned and
|
||||||
|
sorted by author (lowercase).
|
||||||
|
- `reddit_submissions_by_subreddit.parquet` — submissions partitioned and
|
||||||
|
sorted by subreddit (lowercase).
|
||||||
|
|
||||||
- `reddit_comments_by_author.parquet` has comments partitioned and sorted by username (lowercase).
|
Splitting the work this way lets us decompress and parse the dumps in the
|
||||||
- `reddit_comments_by_subreddit.parquet` has comments partitioned and sorted by subreddit name (lowercase).
|
Hyak backfill queue and then sort them in Spark. Partitioning makes it
|
||||||
- `reddit_submissions_by_author.parquet` has submissions partitioned and sorted by username (lowercase).
|
possible to read data for specific subreddits or authors efficiently, and
|
||||||
- `reddit_submissions_by_subreddit.parquet` has submissions partitioned and sorted by subreddit name (lowercase).
|
sorting makes per-subreddit or per-user aggregations cheap. More
|
||||||
|
documentation on using these files lives on the [CDSC
|
||||||
|
wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).
|
||||||
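The two-step conversion can be illustrated with a small in-memory stand-in. This is only a sketch: it uses the standard library instead of pyarrow and PySpark, the sample records and the `KEEP` field list are hypothetical, and the real scripts may select different columns.

```python
import json
from itertools import groupby
from operator import itemgetter

# Hypothetical sample standing in for a decompressed dump (one JSON object per line).
RAW_LINES = [
    '{"author": "Alice", "subreddit": "AskScience", "body": "hi", "created_utc": 30, "extra": "x"}',
    '{"author": "bob", "subreddit": "python", "body": "spam", "created_utc": 10, "extra": "y"}',
    '{"author": "alice", "subreddit": "Python", "body": "eggs", "created_utc": 20, "extra": "z"}',
]

# Assumed subset of fields worth keeping; everything else is dropped.
KEEP = ("author", "subreddit", "body", "created_utc")


def extract(lines):
    """Step 1: parse JSON and keep only the useful fields (names lowercased)."""
    for line in lines:
        record = json.loads(line)
        row = {key: record.get(key) for key in KEEP}
        row["subreddit"] = row["subreddit"].lower()
        row["author"] = row["author"].lower()
        yield row


def partition_and_sort(rows, key):
    """Step 2: sort by the partition key, then by time, and group into partitions."""
    ordered = sorted(rows, key=itemgetter(key, "created_utc"))
    return {name: list(group) for name, group in groupby(ordered, key=itemgetter(key))}


by_subreddit = partition_and_sort(extract(RAW_LINES), "subreddit")
python_comments = by_subreddit["python"]  # reads one partition, not the whole dump
```

Because each partition holds a single subreddit in time order, a query for a handful of subreddits never has to scan the rest of the data, which is the property the Parquet layout is meant to provide.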
|
|
||||||
Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in Spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors. Sorting it means that you can efficiently compute aggregations at the subreddit or user level. More documentation on using these files is available [here](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
|
## TF-IDF subreddit similarity
|
||||||
|
|
||||||
## TF-IDF Subreddit Similarity ##
|
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
|
||||||
|
information-retrieval technique we use to quantify the topic of a
|
||||||
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") is a common and simple information retrieval technique that we can use to quantify the topic of a subreddit. The goal of TF-IDF is to build a vector for each subreddit that scores every term (or phrase) according to how characteristic it is of the overall lexicon used in that subreddit. For example, the most characteristic terms in the subreddit /r/christianity in the current version of the TF-IDF model are:
|
subreddit. The goal is to build a vector for each subreddit that scores
|
||||||
|
every term (or phrase) according to how characteristic it is of the
|
||||||
|
lexicon used there. For example, the most characteristic terms in
|
||||||
|
`/r/christianity` in the current model are:
|
||||||
|
|
||||||
| Term | tf_idf |
|
| Term | tf_idf |
|
||||||
|:------------:|:------:|
|
|:------------:|:------:|
|
||||||
@@ -56,61 +116,121 @@ Breaking this down into two steps is useful because it allows us to decompress a
|
|||||||
| bible | 0.557 |
|
| bible | 0.557 |
|
||||||
| scripture | 0.55 |
|
| scripture | 0.55 |
|
||||||
|
|
||||||
TF-IDF stands for "term frequency - inverse document frequency" because it is the product of two terms "term frequency" and "inverse document frequency." Term frequency quantifies the amount that a term appears in a subreddit (document). Inverse document frequency quantifies how much that term appears in other subreddits (documents). As you can see on the Wikipedia page, there are many possible ways of constructing and combining these terms.
|
TF-IDF is the product of two pieces: *term frequency* (how often a term
|
||||||
|
appears in a subreddit) and *inverse document frequency* (how rare the
|
||||||
|
term is across other subreddits). There are many ways to construct and
|
||||||
|
combine these; the [Wikipedia
|
||||||
|
page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
|
||||||
|
variants.
|
||||||
|
|
||||||
$x + y = z_{1,d}$
|
We normalize term frequency by the maximum raw term frequency for each
|
||||||
|
subreddit:
|
||||||
|
|
||||||
I chose to normalize term frequency by the maximum (raw) term frequency for each subreddit:
|
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$$
|
||||||
$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{t^{'} \in d}{f_{t^{'},d}}}$
|
|
||||||
|
|
||||||
I use the log inverse document frequency:
|
and use the log inverse document frequency:
|
||||||
$\mathrm{idf}_{t} = log\frac{N}{| {d \in D : t \in d} |}$
|
|
||||||
|
|
||||||
I then combine them using some smoothing to get:
|
$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
|
||||||
|
|
||||||
$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$
|
combined with a smoothing term:
|
||||||
|
|
||||||
### Building TF-IDF vectors ###
|
$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
|
||||||
|
|
||||||
The process for building TF-IDF vectors has four steps:
|
(Other normalization strategies are worth trying — see the note in
|
||||||
|
`similarities/TODO`.)
|
||||||
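The three formulas above can be combined in a few lines. The following is a minimal sketch, assuming raw term counts are already available per subreddit; the function name `tfidf_vectors` and the toy counts are hypothetical and are not taken from `similarities/tfidf.py`.

```python
import math
from collections import Counter


def tfidf_vectors(term_counts):
    """Map {subreddit: Counter of raw counts f_{t,d}} to {subreddit: {term: tfidf}}."""
    n_docs = len(term_counts)

    # Document frequency: the number of subreddits in which each term appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    vectors = {}
    for subreddit, counts in term_counts.items():
        max_f = max(counts.values())  # maximum raw term frequency in this subreddit
        vectors[subreddit] = {
            # (0.5 + 0.5 * tf) * idf, with tf = f / max_f and idf = log(N / df)
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in counts.items()
        }
    return vectors


# Toy data: "the" appears everywhere, so its idf (and tf-idf) is zero.
toy_counts = {
    "christianity": Counter({"bible": 4, "the": 8}),
    "python": Counter({"import": 3, "the": 6}),
}
vectors = tfidf_vectors(toy_counts)
```

In this toy case `bible` has tf = 4/8 and idf = log(2/1), so its score is 0.75 · log 2, while `the` scores zero because it occurs in every subreddit.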
|
|
||||||
1. Extracting terms using `tf_comments.py`
|
### Building TF-IDF vectors
|
||||||
2. Detecting common phrases using `top_comment_phrases.py`
|
|
||||||
3. Extracting terms and common phrases using `tf_comments.py --mwe-pass='second'`
|
|
||||||
4. Building idf and tf-idf scores in `idf_comments.py`
|
|
||||||
|
|
||||||
#### Running `tf_comments.py` on the backfill queue ####
|
The pipeline has four steps:
|
||||||
|
|
||||||
The main reason that I did it in 4 steps instead of one is to take advantage of the backfill queue for running `tf_comments.py`. This step requires reading all of the text in every comment and converting it to a bag of words at the subreddit-level. This is a lot of computation that is easily parallelizable. The script `run_tf_jobs.sh` partially automates running steps 1 (or 3) on the backfill queue.
|
1. Extract terms with `ngrams/tf_comments.py`.
|
||||||
|
2. Detect common phrases with `ngrams/top_comment_phrases.py`.
|
||||||
|
3. Re-extract terms together with detected phrases via
|
||||||
|
`ngrams/tf_comments.py --mwe-pass=second`.
|
||||||
|
4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.
|
||||||
|
|
||||||
#### Phrase detection using Pointwise Mutual Information ####
|
#### Running `tf_comments.py` on the backfill queue
|
||||||
|
|
||||||
TF-IDF is simple, but only uses single words (unigrams). Sequences of multiple words can be important to account for how words have different meanings in different contexts or how sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing since the number of possible n-grams grows like crazy as n gets bigger. Phrase detection helps this problem by limiting the set of n-grams to those most informative.
|
The main reason for the four-step layout is that `tf_comments.py` is
|
||||||
|
trivially parallel — it reads every comment and rewrites each subreddit
|
||||||
|
as a bag of words — so it benefits from being farmed out to the Hyak
|
||||||
|
backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.
|
||||||
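Why this step parallelizes so easily can be seen in a small sketch: term counts computed on separate shards of comments can simply be added together. The regular-expression tokenizer and the shard contents below are assumptions for illustration and do not reflect how `tf_comments.py` actually tokenizes text.

```python
import re
from collections import Counter, defaultdict

# Simplistic tokenizer (an assumption, not the project's tokenizer).
TOKEN = re.compile(r"[a-z0-9']+")


def count_terms(comments):
    """Turn an iterable of (subreddit, body) pairs into {subreddit: Counter of terms}."""
    counts = defaultdict(Counter)
    for subreddit, body in comments:
        counts[subreddit.lower()].update(TOKEN.findall(body.lower()))
    return counts


def merge(partials):
    """Combine per-shard results; counting is additive, so order does not matter."""
    total = defaultdict(Counter)
    for partial in partials:
        for subreddit, counter in partial.items():
            total[subreddit].update(counter)
    return total


# Two hypothetical shards, as if processed by two independent backfill jobs.
shard_a = [("python", "Spam and eggs"), ("Python", "more spam")]
shard_b = [("python", "eggs"), ("askscience", "why is the sky blue")]
merged = merge([count_terms(shard_a), count_terms(shard_b)])
```

Merging the per-shard counters gives the same result as counting everything in one pass, so each job can work on its own slice of the data without coordination.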
|
|
||||||
But how do we detect phrases? I implemented [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information "Wikipedia article on pointwise mutual information"), which is a pretty simple method but seems to work pretty well.
|
#### Phrase detection using pointwise mutual information
|
||||||
|
|
||||||
PMI is a quantity derived from information theory. The intuition is that if two words occur together quite frequently compared to how often they appear separately, then the co-occurrence is likely to be informative.
|
TF-IDF over unigrams misses the fact that sequences of words often carry
|
||||||
|
distinct meaning (names, fixed expressions, in-jokes). Considering every
|
||||||
|
possible n-gram is prohibitive because the candidate set explodes with
|
||||||
|
`n`, so we use phrase detection to limit ourselves to informative
|
||||||
|
n-grams.
|
||||||
|
|
||||||
$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.$
|
We use [pointwise mutual
|
||||||
|
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
|
||||||
|
(PMI), which is simple and works well in practice. The intuition is that
|
||||||
|
if two words co-occur much more often than their marginal frequencies
|
||||||
|
would predict, the pair is probably meaningful:
|
||||||
|
|
||||||
When `tf_comments.py` is run with `--mwe-pass=first`, a 10% sample of 1- to 4-grams (sequences of up to four terms) is written to a file for `top_comment_phrases.py` to consume. `top_comment_phrases.py` computes the PMI of these candidate phrases and keeps those that occur at least 3500 times in the sample of n-grams and have a PMI of at least 3 (about 65000 expressions).
|
$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
|
||||||
|
|
||||||
`tf_comments.py --mwe-pass=second` then uses the detected phrases and adds them to the term frequency data.
|
When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
|
||||||
|
sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
|
||||||
|
computes PMI over that sample and keeps phrases that occur at least
|
||||||
|
3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
|
||||||
|
A second pass of `tf_comments.py --mwe-pass=second` folds those phrases
|
||||||
|
back into the term-frequency data.
|
||||||
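A stripped-down version of the phrase filter might look like the sketch below. It is restricted to bigrams, uses the natural logarithm, and takes the count and PMI thresholds as parameters. The function name, the toy comments, and the small thresholds are all hypothetical; `top_comment_phrases.py` may estimate the probabilities differently.

```python
import math
from collections import Counter


def bigram_pmi(token_lists, min_count=1, min_pmi=0.0):
    """Return {(x, y): pmi} for bigrams that pass both thresholds."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs within a comment

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            scores[(x, y)] = pmi
    return scores


comments = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["is", "it", "big"],
]
phrases = bigram_pmi(comments, min_count=2, min_pmi=1.0)
```

On this toy input only `("new", "york")` survives: it is the one bigram seen at least twice, and it co-occurs far more often than the individual word frequencies would predict.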
|
|
||||||
### Cosine Similarity ###
|
### Cosine similarity
|
||||||
|
|
||||||
Once the tf-idf vectors are built, making a similarity score between two subreddits is straightforward using cosine similarity.
|
Once the TF-IDF vectors are built, computing a similarity score between
|
||||||
|
two subreddits is straightforward with cosine similarity:
|
||||||
|
|
||||||
$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$
|
$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
|
||||||
|
|
||||||
Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors).
|
Each subreddit is a vector in a high-dimensional term space. The dot
|
||||||
In linear algebra, the dot product ($\cdot$) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).
|
product gives a weighted sum of shared terms, and dividing by the
|
||||||
The vectors might have different lengths, for example if one subreddit has more words in its comments than the other, so in cosine similarity the dot product is normalized by the magnitudes (lengths) of the vectors.
|
vector magnitudes removes the effect of differing vocabulary size — what
|
||||||
It turns out that this is equivalent to taking the cosine of the angle between the two vectors. So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors are.
|
remains is the cosine of the angle between the two vectors. Cosine
|
||||||
|
similarity with TF-IDF is popular (and has been used on Reddit several
|
||||||
|
times in prior research) because it captures correlation between the
|
||||||
|
*most characteristic* terms of two communities.
|
||||||
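For sparse vectors the formula reduces to a sum over shared terms. Here is a minimal sketch that stores each TF-IDF vector as a `{term: weight}` dictionary; the representation and the toy weights are assumptions for illustration, not the format the similarity scripts use.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF weights.
christianity = {"bible": 0.5, "scripture": 0.5}
catholicism = {"bible": 0.5, "pope": 0.5}
python = {"import": 1.0}

shared = cosine_similarity(christianity, catholicism)
unrelated = cosine_similarity(christianity, python)
```

The first pair shares one of two equally weighted terms and scores 0.5; the second pair shares no terms and scores 0.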
|
|
||||||
Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.
|
Compared to approaches based on word embeddings or topic models, this
|
||||||
|
method can struggle with polysemy, synonymy, and correlations between
|
||||||
|
related terms. Phrase detection helps a little. The trade-off is
|
||||||
|
simplicity and scalability. Adding [latent semantic
|
||||||
|
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
|
||||||
|
intermediate step is on the wish-list for improving on raw TF-IDF
|
||||||
|
similarities.
|
||||||
|
|
||||||
Compared to other approaches to similarity, like those using word embeddings or topic models, it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little bit. The advantages of this approach are simplicity and scalability. I'm thinking about using [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis "Wikipedia article on Latent semantic analysis") as an intermediate step to improve upon similarities based on raw tf-idfs.
|
Even with these simplifications, similarity between a large number of
|
||||||
|
subreddits is expensive — naively $n^2$ dot-products. Passing
|
||||||
|
`--similarity-threshold=X` (with `X>0`) to the similarity scripts lets
|
||||||
|
Spark's built-in matrix library use the DIMSUM approximation, which is
|
||||||
|
the same algorithm Twitter and Google have used for large-scale
|
||||||
|
similarity scoring.
|
||||||
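To make the role of the threshold concrete, the sketch below computes every pairwise similarity exactly and discards pairs at or below the cutoff. This is brute force, not DIMSUM: DIMSUM gets its speed by sampling so that low-similarity pairs are mostly never evaluated in the first place. The function name and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def thresholded_similarities(vectors, threshold=0.0):
    """Exact all-pairs cosine similarity, keeping only pairs above `threshold`."""
    norms = {
        name: math.sqrt(sum(w * w for w in vec.values()))
        for name, vec in vectors.items()
    }
    kept = {}
    # n * (n - 1) / 2 pairs: this quadratic loop is what an approximation avoids.
    for a, b in combinations(sorted(vectors), 2):
        if norms[a] == 0.0 or norms[b] == 0.0:
            continue
        dot = sum(w * vectors[b].get(term, 0.0) for term, w in vectors[a].items())
        similarity = dot / (norms[a] * norms[b])
        if similarity > threshold:
            kept[(a, b)] = similarity
    return kept


vectors = {
    "christianity": {"bible": 0.5, "scripture": 0.5},
    "catholicism": {"bible": 0.5, "pope": 0.5},
    "python": {"import": 1.0},
}
pairs = thresholded_similarities(vectors, threshold=0.1)
```

Only the one related pair remains in the output, which is the kind of sparse result a positive threshold is meant to produce.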
|
|
||||||
Even still, computing similarities between a large number of subreddits is computationally expensive and requires $n^2$ dot-product evaluations.
|
## Clustering, density, and time series
|
||||||
This can be sped up by passing `similarity-threshold=X` where $X>0$ into `term_comment_similarity.py`. I used a cosine similarity function that's built into the spark matrix library which supports the `DIMSUM` algorithm for approximating matrix-matrix products. This algorithm is commonly used in industry (i.e. at Twitter, Google) for large-scale similarity scoring.
|
|
||||||
|
The similarity matrices feed three follow-on analyses:
|
||||||
|
|
||||||
|
- `clustering/clustering.py` clusters a similarity matrix using
|
||||||
|
affinity propagation; `clustering/selection.py` and
|
||||||
|
`clustering/fit_tsne.py` are supporting scripts for hyperparameter
|
||||||
|
selection and 2-D embeddings.
|
||||||
|
- `density/overlap_density.py` computes a per-subreddit overlap density
|
||||||
|
measure from the similarity matrix.
|
||||||
|
- `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
|
||||||
|
pull subreddit-level activity time series and join them against
|
||||||
|
clustering output.
|
||||||
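A rough sketch of the clustering step on a toy matrix follows. It assumes scikit-learn and NumPy are installed and that `preference_quantile` is turned into a preference by taking a quantile of the similarity matrix; that reading is a guess from the parameter name, and `cluster_similarities` is a hypothetical helper, not the function in `clustering/clustering.py`.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def cluster_similarities(mat, damping=0.9, max_iter=1000, convergence_iter=30,
                         preference_quantile=0.5, random_state=1968):
    """Cluster a precomputed similarity matrix with affinity propagation."""
    # Assumption: the preference is a quantile of all similarity values.
    preference = np.quantile(mat, preference_quantile)
    model = AffinityPropagation(
        damping=damping,
        max_iter=max_iter,
        convergence_iter=convergence_iter,
        preference=preference,
        affinity="precomputed",  # mat already holds similarities, not features
        random_state=random_state,
    )
    return model.fit(mat).labels_


# Toy similarity matrix: two groups of three subreddits each.
mat = np.full((6, 6), 0.1)
mat[:3, :3] = 0.9
mat[3:, 3:] = 0.9
np.fill_diagonal(mat, 1.0)

labels = cluster_similarities(mat)  # one cluster label per subreddit
```

Lower preference quantiles tend to yield fewer, larger clusters, which is why the selection scripts sweep a grid of preference, damping, and convergence settings.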
|
|
||||||
|
`visualization/tsne_vis.py` renders interactive Altair plots of the
|
||||||
|
clustering output — see the prebuilt HTML files in `visualization/` for
|
||||||
|
examples.
|
||||||
|
|
||||||
|
## Bot detection
|
||||||
|
|
||||||
|
`bots/good_bad_bot.py` computes user-level features (compression rate
|
||||||
|
of comment text, frequency of self-identification as a bot, etc.) that
|
||||||
|
are useful for filtering bot accounts out of downstream analyses. This
|
||||||
|
is preliminary work; nothing in the pipeline currently consumes it
|
||||||
|
automatically.
|
||||||
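As an illustration of the kind of features described, the sketch below computes a compression ratio and a self-identification rate for one account's comments. The regular expression, the feature names, and the choice of `zlib` are assumptions; `bots/good_bad_bot.py` may define these features differently.

```python
import re
import zlib

# Assumed self-identification pattern; real bots phrase this in many ways.
BOT_PHRASE = re.compile(r"\bI am a bot\b", re.IGNORECASE)


def bot_features(comments):
    """Compute two simple user-level features from a list of comment bodies."""
    text = "\n".join(comments).encode("utf-8")
    compressed = zlib.compress(text)
    self_ids = sum(bool(BOT_PHRASE.search(comment)) for comment in comments)
    return {
        # Repetitive text compresses well, so a low ratio suggests templated output.
        "compression_ratio": len(compressed) / len(text) if text else 1.0,
        "self_id_rate": self_ids / len(comments) if comments else 0.0,
    }


bot = ["I am a bot, and this action was performed automatically."] * 50
human = [
    "Has anyone tried the new release?",
    "I think the sky looks odd today.",
    "Thanks, that fixed it!",
]
bot_stats = bot_features(bot)
human_stats = bot_features(human)
```

The templated account compresses to a small fraction of its original size and identifies itself in every comment, while the human account does neither.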
|
|||||||
@@ -2,41 +2,20 @@
|
|||||||
srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
|
srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
|
||||||
similarity_data=/gscratch/comdata/output/reddit_similarity
|
similarity_data=/gscratch/comdata/output/reddit_similarity
|
||||||
clustering_data=/gscratch/comdata/output/reddit_clustering
|
clustering_data=/gscratch/comdata/output/reddit_clustering
|
||||||
kmeans_selection_grid="--max_iter=3000 --n_init=[10] --n_clusters=[100,500,1000,1500,2000,2500,3000,2350,3500,3570,4000]"
|
selection_grid="--max_iter=3000 --convergence_iter=15,30,100 --damping=0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99, --preference_quantile=0.1,0.3,0.5,0.7,0.9"
|
||||||
#selection_grid="--max_iter=3000 --convergence_iter=[15] --preference_quantile=[0.5] --damping=[0.99]"
|
#selection_grid="--max_iter=3000 --convergence_iter=[15] --preference_quantile=[0.5] --damping=[0.99]"
|
||||||
all:$(clustering_data)/subreddit_comment_authors_10k/kmeans/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/kmeans/selection_data.csv $(clustering_data)/subreddit_comment_terms_10k/kmeans/selection_data.csv $(clustering_data)/subreddit_comment_terms_10k/affinity/selection_data.csv $(clustering_data)/subreddit_comment_authors_10k/affinity/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/affinity/selection_data.csv
|
all:$(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv
|
||||||
# $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS
|
# $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS
|
||||||
# $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS
|
# $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS
|
||||||
|
|
||||||
$(clustering_data)/subreddit_comment_authors_10k/kmeans/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_authors_10k.feather clustering.py
|
$(clustering_data)/subreddit_comment_authors_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_authors_10k.feather clustering.py
|
||||||
$(srun_singularity) python3 selection.py kmeans $(similarity_data)/subreddit_comment_authors_10k.feather $(clustering_data)/subreddit_comment_authors_10k/kmeans $(clustering_data)/subreddit_comment_authors_10k/kmeans/selection_data.csv $(kmeans_selection_grid)
|
$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_10k.feather $(clustering_data)/subreddit_comment_authors_10k $(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(selection_grid) -J 20
|
||||||
|
|
||||||
$(clustering_data)/subreddit_comment_terms_10k/kmeans/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_terms_10k.feather clustering.py
|
$(clustering_data)/subreddit_comment_terms_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_terms_10k.feather clustering.py
|
||||||
$(srun_singularity) python3 selection.py kmeans $(similarity_data)/subreddit_comment_terms_10k.feather $(clustering_data)/subreddit_comment_terms_10k/kmeans $(clustering_data)/subreddit_comment_terms_10k/kmeans/selection_data.csv $(kmeans_selection_grid)
|
$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_terms_10k.feather $(clustering_data)/subreddit_comment_terms_10k $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv $(selection_grid) -J 20
|
||||||
|
|
||||||
$(clustering_data)/subreddit_comment_authors-tf_10k/kmeans/selection_data.csv:clustering.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather
|
$(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv:clustering.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather
|
||||||
$(srun_singularity) python3 selection.py kmeans $(similarity_data)/subreddit_comment_authors-tf_10k.feather $(clustering_data)/subreddit_comment_authors-tf_10k/kmeans $(clustering_data)/subreddit_comment_authors-tf_10k/kmeans/selection_data.csv $(kmeans_selection_grid)
|
$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather $(clustering_data)/subreddit_comment_authors-tf_10k $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(selection_grid) -J 20
|
||||||
|
|
||||||
|
|
||||||
affinity_selection_grid="--max_iter=3000 --convergence_iter=[15] --preference_quantile=[0.5] --damping=[0.99]"
|
|
||||||
$(clustering_data)/subreddit_comment_authors_10k/affinity/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_authors_10k.feather clustering.py
|
|
||||||
$(srun_singularity) python3 selection.py affinity $(similarity_data)/subreddit_comment_authors_10k.feather $(clustering_data)/subreddit_comment_authors_10k/affinity $(clustering_data)/subreddit_comment_authors_10k/affinity/selection_data.csv $(affinity_selection_grid) -J 20
|
|
||||||
|
|
||||||
$(clustering_data)/subreddit_comment_terms_10k/affinity/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_terms_10k.feather clustering.py
|
|
||||||
$(srun_singularity) python3 selection.py affinity $(similarity_data)/subreddit_comment_terms_10k.feather $(clustering_data)/subreddit_comment_terms_10k/affinity $(clustering_data)/subreddit_comment_terms_10k/affinity/selection_data.csv $(affinity_selection_grid) -J 20
|
|
||||||
|
|
||||||
$(clustering_data)/subreddit_comment_authors-tf_10k/affinity/selection_data.csv:clustering.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather
|
|
||||||
$(srun_singularity) python3 selection.py affinity $(similarity_data)/subreddit_comment_authors-tf_10k.feather $(clustering_data)/subreddit_comment_authors-tf_10k/affinity $(clustering_data)/subreddit_comment_authors-tf_10k/affinity/selection_data.csv $(affinity_selection_grid) -J 20
|
|
||||||
|
|
||||||
clean:
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_authors-tf_10k/affinity/selection_data.csv
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_authors_10k/affinity/selection_data.csv
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_terms_10k/affinity/selection_data.csv
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_authors-tf_10k/kmeans/selection_data.csv
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_authors_10k/kmeans/selection_data.csv
|
|
||||||
rm -f $(clustering_data)/subreddit_comment_terms_10k/kmeans/selection_data.csv
|
|
||||||
|
|
||||||
.PHONY: clean
|
|
||||||
|
|
||||||
# $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS:selection.py $(similarity_data)/subreddit_comment_authors_30k.feather clustering.py
|
# $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS:selection.py $(similarity_data)/subreddit_comment_authors_30k.feather clustering.py
|
||||||
# $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_30k.feather $(clustering_data)/subreddit_comment_authors_30k $(selection_grid) -J 10 && touch $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS
|
# $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_30k.feather $(clustering_data)/subreddit_comment_authors_30k $(selection_grid) -J 10 && touch $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS
|
||||||
|
|||||||
@@ -3,23 +3,24 @@
|
|||||||
import sys
|
import sys
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
import numpy as np
|
import numpy as np
|
||||||
from sklearn.cluster import AffinityPropagation, KMeans
|
from sklearn.cluster import AffinityPropagation
|
||||||
import fire
|
import fire
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from multiprocessing import cpu_count
|
|
||||||
from dataclasses import dataclass
|
|
||||||
from clustering_base import sim_to_dist, process_clustering_result, clustering_result, read_similarity_mat
|
|
||||||
|
|
||||||
def affinity_clustering(similarities, output, *args, **kwargs):
|
def read_similarity_mat(similarities, use_threads=True):
|
||||||
|
df = pd.read_feather(similarities, use_threads=use_threads)
|
||||||
|
mat = np.array(df.drop('_subreddit',1))
|
||||||
|
n = mat.shape[0]
|
||||||
|
mat[range(n),range(n)] = 1
|
||||||
|
return (df._subreddit,mat)
|
||||||
|
|
||||||
|
def affinity_clustering(similarities, *args, **kwargs):
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
subreddits, mat = read_similarity_mat(similarities)
|
||||||
clustering = _affinity_clustering(mat, *args, **kwargs)
|
return _affinity_clustering(mat, subreddits, *args, **kwargs)
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
cluster_data['algorithm'] = 'affinity'
|
|
||||||
return(cluster_data)
|
|
||||||
|
|
||||||
def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
|
def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
|
||||||
'''
|
'''
|
||||||
similarities: matrix of similarity scores
|
similarities: feather file with a dataframe of similarity scores
|
||||||
preference_quantile: parameter controlling how many clusters to make. higher values = more clusters. 0.85 is a good value with 3000 subreddits.
|
preference_quantile: parameter controlling how many clusters to make. higher values = more clusters. 0.85 is a good value with 3000 subreddits.
|
||||||
damping: parameter controlling how iterations are merged. Higher values make convergence faster and more dependable. 0.85 is a good value for the 10000 subreddits by author.
|
damping: parameter controlling how iterations are merged. Higher values make convergence faster and more dependable. 0.85 is a good value for the 10000 subreddits by author.
|
||||||
'''
|
'''
|
||||||
@@ -39,32 +40,25 @@ def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000,
|
|||||||
verbose=verbose,
|
verbose=verbose,
|
||||||
random_state=random_state).fit(mat)
|
random_state=random_state).fit(mat)
|
||||||
|
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
output = Path(output)
|
print(f"clustering took {clustering.n_iter_} iterations")
|
||||||
output.parent.mkdir(parents=True,exist_ok=True)
|
clusters = clustering.labels_
|
||||||
|
|
||||||
|
print(f"found {len(set(clusters))} clusters")
|
||||||
|
|
||||||
|
cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
|
||||||
|
|
||||||
|
cluster_sizes = cluster_data.groupby("cluster").count()
|
||||||
|
print(f"the largest cluster has {cluster_sizes.subreddit.max()} members")
|
||||||
|
|
||||||
|
print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
|
||||||
|
|
||||||
|
print(f"{(cluster_sizes.subreddit==1).sum()} clusters have 1 member")
|
||||||
|
|
||||||
|
sys.stdout.flush()
|
||||||
cluster_data.to_feather(output)
|
cluster_data.to_feather(output)
|
||||||
print(f"saved {output}")
|
print(f"saved {output}")
|
||||||
return clustering
|
return clustering
|
||||||
|
|
||||||
def kmeans_clustering(similarities, *args, **kwargs):
|
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
clustering = _kmeans_clustering(mat, *args, **kwargs)
|
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
return(cluster_data)
|
|
||||||
|
|
||||||
def _kmeans_clustering(mat, output, n_clusters, n_init=10, max_iter=100000, random_state=1968, verbose=True):
|
|
||||||
|
|
||||||
clustering = KMeans(n_clusters=n_clusters,
|
|
||||||
n_init=n_init,
|
|
||||||
max_iter=max_iter,
|
|
||||||
random_state=random_state,
|
|
||||||
verbose=verbose
|
|
||||||
).fit(mat)
|
|
||||||
|
|
||||||
return clustering
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
fire.Fire(affinity_clustering)
|
fire.Fire(affinity_clustering)
|
||||||
|
|||||||
@@ -1,49 +0,0 @@
|
|||||||
from pathlib import Path
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
from dataclasses import dataclass
|
|
||||||
|
|
||||||
def sim_to_dist(mat):
|
|
||||||
dist = 1-mat
|
|
||||||
dist[dist < 0] = 0
|
|
||||||
np.fill_diagonal(dist,0)
|
|
||||||
return dist
|
|
||||||
|
|
||||||
def process_clustering_result(clustering, subreddits):
|
|
||||||
|
|
||||||
if hasattr(clustering,'n_iter_'):
|
|
||||||
print(f"clustering took {clustering.n_iter_} iterations")
|
|
||||||
|
|
||||||
clusters = clustering.labels_
|
|
||||||
|
|
||||||
print(f"found {len(set(clusters))} clusters")
|
|
||||||
|
|
||||||
cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
|
|
||||||
|
|
||||||
cluster_sizes = cluster_data.groupby("cluster").count().reset_index()
|
|
||||||
print(f"the largest cluster has {cluster_sizes.loc[cluster_sizes.cluster!=-1].subreddit.max()} members")
|
|
||||||
|
|
||||||
print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
|
|
||||||
|
|
||||||
print(f"{(cluster_sizes.subreddit==1).sum()} clusters have 1 member")
|
|
||||||
|
|
||||||
print(f"{(cluster_sizes.loc[cluster_sizes.cluster==-1,['subreddit']])} subreddits are in cluster -1",flush=True)
|
|
||||||
|
|
||||||
return cluster_data
|
|
||||||
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class clustering_result:
|
|
||||||
outpath:Path
|
|
||||||
max_iter:int
|
|
||||||
silhouette_score:float
|
|
||||||
alt_silhouette_score:float
|
|
||||||
name:str
|
|
||||||
n_clusters:int
|
|
||||||
|
|
||||||
def read_similarity_mat(similarities, use_threads=True):
|
|
||||||
df = pd.read_feather(similarities, use_threads=use_threads)
|
|
||||||
mat = np.array(df.drop('_subreddit',1))
|
|
||||||
n = mat.shape[0]
|
|
||||||
mat[range(n),range(n)] = 1
|
|
||||||
return (df._subreddit,mat)
|
|
||||||
@@ -1,172 +0,0 @@
|
|||||||
from clustering_base import sim_to_dist, process_clustering_result, clustering_result, read_similarity_mat
|
|
||||||
from dataclasses import dataclass
|
|
||||||
import hdbscan
|
|
||||||
from sklearn.neighbors import NearestNeighbors
|
|
||||||
import plotnine as pn
|
|
||||||
import numpy as np
|
|
||||||
from itertools import product, starmap
|
|
||||||
import pandas as pd
|
|
||||||
from sklearn.metrics import silhouette_score, silhouette_samples
|
|
||||||
from pathlib import Path
|
|
||||||
from multiprocessing import Pool, cpu_count
|
|
||||||
import fire
|
|
||||||
from pyarrow.feather import write_feather
|
|
||||||
|
|
||||||
def test_select_hdbscan_clustering():
|
|
||||||
select_hdbscan_clustering("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI",
|
|
||||||
"test_hdbscan_author30k",
|
|
||||||
min_cluster_sizes=[2],
|
|
||||||
min_samples=[1,2],
|
|
||||||
cluster_selection_epsilons=[0,0.05,0.1,0.15],
|
|
||||||
cluster_selection_methods=['eom','leaf'],
|
|
||||||
lsi_dimensions='all')
|
|
||||||
inpath = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI"
|
|
||||||
outpath = "test_hdbscan";
|
|
||||||
min_cluster_sizes=[2,3,4];
|
|
||||||
min_samples=[1,2,3];
|
|
||||||
cluster_selection_epsilons=[0,0.1,0.3,0.5];
|
|
||||||
cluster_selection_methods=['eom'];
|
|
||||||
lsi_dimensions='all'
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class hdbscan_clustering_result(clustering_result):
|
|
||||||
min_cluster_size:int
|
|
||||||
min_samples:int
|
|
||||||
cluster_selection_epsilon:float
|
|
||||||
cluster_selection_method:str
|
|
||||||
lsi_dimensions:int
|
|
||||||
n_isolates:int
|
|
||||||
silhouette_samples:str
|
|
||||||
|
|
||||||
def select_hdbscan_clustering(inpath,
|
|
||||||
outpath,
|
|
||||||
outfile=None,
|
|
||||||
min_cluster_sizes=[2],
|
|
||||||
min_samples=[1],
|
|
||||||
cluster_selection_epsilons=[0],
|
|
||||||
cluster_selection_methods=['eom'],
|
|
||||||
lsi_dimensions='all'
|
|
||||||
):
|
|
||||||
|
|
||||||
inpath = Path(inpath)
|
|
||||||
outpath = Path(outpath)
|
|
||||||
outpath.mkdir(exist_ok=True, parents=True)
|
|
||||||
|
|
||||||
if lsi_dimensions == 'all':
|
|
||||||
lsi_paths = list(inpath.glob("*"))
|
|
||||||
|
|
||||||
else:
|
|
||||||
lsi_paths = [inpath / (dim + '.feather') for dim in lsi_dimensions]
|
|
||||||
|
|
||||||
lsi_nums = [p.stem for p in lsi_paths]
|
|
||||||
grid = list(product(lsi_nums,
|
|
||||||
min_cluster_sizes,
|
|
||||||
min_samples,
|
|
||||||
cluster_selection_epsilons,
|
|
||||||
cluster_selection_methods))
|
|
||||||
|
|
||||||
# fix the output file names
|
|
||||||
names = list(map(lambda t:'_'.join(map(str,t)),grid))
|
|
||||||
|
|
||||||
grid = [(inpath/(str(t[0])+'.feather'),outpath/(name + '.feather'), t[0], name) + t[1:] for t, name in zip(grid, names)]
|
|
||||||
|
|
||||||
with Pool(int(cpu_count()/4)) as pool:
|
|
||||||
mods = starmap(hdbscan_clustering, grid)
|
|
||||||
|
|
||||||
res = pd.DataFrame(mods)
|
|
||||||
if outfile is None:
|
|
||||||
outfile = outpath / "selection_data.csv"
|
|
||||||
|
|
||||||
res.to_csv(outfile)
|
|
||||||
|
|
||||||
def hdbscan_clustering(similarities, output, lsi_dim, name, min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0, cluster_selection_method='eom'):
|
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
clustering = _hdbscan_clustering(mat,
|
|
||||||
min_cluster_size=min_cluster_size,
|
|
||||||
min_samples=min_samples,
|
|
||||||
cluster_selection_epsilon=cluster_selection_epsilon,
|
|
||||||
cluster_selection_method=cluster_selection_method,
|
|
||||||
metric='precomputed',
|
|
||||||
core_dist_n_jobs=cpu_count()
|
|
||||||
)
|
|
||||||
|
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
isolates = clustering.labels_ == -1
|
|
||||||
scoremat = mat[~isolates][:,~isolates]
|
|
||||||
score = silhouette_score(scoremat, clustering.labels_[~isolates], metric='precomputed')
|
|
||||||
cluster_data.to_feather(output)
|
|
||||||
|
|
||||||
silhouette_samp = silhouette_samples(mat, clustering.labels_, metric='precomputed')
|
|
||||||
silhouette_samp = pd.DataFrame({'subreddit':subreddits,'score':silhouette_samp})
|
|
||||||
silsampout = output.parent / ("silhouette_samples" + output.name)
|
|
||||||
silhouette_samp.to_feather(silsampout)
|
|
||||||
|
|
||||||
result = hdbscan_clustering_result(outpath=output,
|
|
||||||
max_iter=None,
|
|
||||||
silhouette_samples=silsampout,
|
|
||||||
silhouette_score=score,
|
|
||||||
alt_silhouette_score=score,
|
|
||||||
name=name,
|
|
||||||
min_cluster_size=min_cluster_size,
|
|
||||||
min_samples=min_samples,
|
|
||||||
cluster_selection_epsilon=cluster_selection_epsilon,
|
|
||||||
cluster_selection_method=cluster_selection_method,
|
|
||||||
lsi_dimensions=lsi_dim,
|
|
||||||
n_isolates=isolates.sum(),
|
|
||||||
n_clusters=len(set(clustering.labels_))
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
return(result)
|
|
||||||
|
|
||||||
# for all runs we should try cluster_selection_epsilon = None
|
|
||||||
# for terms we should try cluster_selection_epsilon around 0.56-0.66
|
|
||||||
# for authors we should try cluster_selection_epsilon around 0.98-0.99
|
|
||||||
def _hdbscan_clustering(mat, *args, **kwargs):
|
|
||||||
print(f"running hdbscan clustering. args:{args}. kwargs:{kwargs}")
|
|
||||||
|
|
||||||
print(mat)
|
|
||||||
clusterer = hdbscan.HDBSCAN(*args,
|
|
||||||
**kwargs,
|
|
||||||
)
|
|
||||||
|
|
||||||
clustering = clusterer.fit(mat.astype('double'))
|
|
||||||
|
|
||||||
return(clustering)
|
|
||||||
|
|
||||||
def KNN_distances_plot(mat,outname,k=2):
|
|
||||||
nbrs = NearestNeighbors(n_neighbors=k,algorithm='auto',metric='precomputed').fit(mat)
|
|
||||||
distances, indices = nbrs.kneighbors(mat)
|
|
||||||
d2 = distances[:,-1]
|
|
||||||
df = pd.DataFrame({'dist':d2})
|
|
||||||
df = df.sort_values("dist",ascending=False)
|
|
||||||
df['idx'] = np.arange(0,d2.shape[0]) + 1
|
|
||||||
p = pn.qplot(x='idx',y='dist',data=df,geom='line') + pn.scales.scale_y_continuous(minor_breaks = np.arange(0,50)/50,
|
|
||||||
breaks = np.arange(0,10)/10)
|
|
||||||
p.save(outname,width=16,height=10)
|
|
||||||
|
|
||||||
def make_KNN_plots():
|
|
||||||
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10k.feather"
|
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
|
|
||||||
KNN_distances_plot(mat,k=2,outname='terms_knn_dist2.png')
|
|
||||||
|
|
||||||
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10k.feather"
|
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
KNN_distances_plot(mat,k=2,outname='authors_knn_dist2.png')
|
|
||||||
|
|
||||||
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k.feather"
|
|
||||||
subreddits, mat = read_similarity_mat(similarities)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
KNN_distances_plot(mat,k=2,outname='authors-tf_knn_dist2.png')
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
df = pd.read_csv("test_hdbscan/selection_data.csv")
|
|
||||||
test_select_hdbscan_clustering()
|
|
||||||
check_clusters = pd.read_feather("test_hdbscan/500_2_2_0.1_eom.feather")
|
|
||||||
silscores = pd.read_feather("test_hdbscan/silhouette_samples500_2_2_0.1_eom.feather")
|
|
||||||
c = check_clusters.merge(silscores,on='subreddit')# fire.Fire(select_hdbscan_clustering)
|
|
||||||
@@ -1,132 +0,0 @@
|
|||||||
from sklearn.metrics import silhouette_score
|
|
||||||
from sklearn.cluster import AffinityPropagation
|
|
||||||
from functools import partial
|
|
||||||
from dataclasses import dataclass
|
|
||||||
from clustering import _affinity_clustering, read_similarity_mat, sim_to_dist, process_clustering_result, clustering_result
|
|
||||||
from multiprocessing import Pool, cpu_count, Array, Process
|
|
||||||
from pathlib import Path
|
|
||||||
from itertools import product, starmap
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
import fire
|
|
||||||
import sys
|
|
||||||
|
|
||||||
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.
|
|
||||||
@dataclass
|
|
||||||
class affinity_clustering_result(clustering_result):
|
|
||||||
damping:float
|
|
||||||
convergence_iter:int
|
|
||||||
preference_quantile:float
|
|
||||||
|
|
||||||
def do_affinity_clustering(damping, convergence_iter, preference_quantile, name, mat, subreddits, max_iter, outdir:Path, random_state, verbose, alt_mat, overwrite=False):
|
|
||||||
if name is None:
|
|
||||||
name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
|
|
||||||
print(name)
|
|
||||||
sys.stdout.flush()
|
|
||||||
outpath = outdir / (str(name) + ".feather")
|
|
||||||
outpath.parent.mkdir(parents=True,exist_ok=True)
|
|
||||||
print(outpath)
|
|
||||||
clustering = _affinity_clustering(mat, outpath, damping, max_iter, convergence_iter, preference_quantile, random_state, verbose)
|
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
mat = sim_to_dist(clustering.affinity_matrix_)
|
|
||||||
|
|
||||||
try:
|
|
||||||
score = silhouette_score(mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
score = None
|
|
||||||
|
|
||||||
if alt_mat is not None:
|
|
||||||
alt_distances = sim_to_dist(alt_mat)
|
|
||||||
try:
|
|
||||||
alt_score = silhouette_score(alt_mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
alt_score = None
|
|
||||||
|
|
||||||
res = affinity_clustering_result(outpath=outpath,
|
|
||||||
damping=damping,
|
|
||||||
max_iter=max_iter,
|
|
||||||
convergence_iter=convergence_iter,
|
|
||||||
preference_quantile=preference_quantile,
|
|
||||||
silhouette_score=score,
|
|
||||||
alt_silhouette_score=score,
|
|
||||||
name=str(name))
|
|
||||||
|
|
||||||
return res
|
|
||||||
|
|
||||||
def do_affinity_clustering(damping, convergence_iter, preference_quantile, name, mat, subreddits, max_iter, outdir:Path, random_state, verbose, alt_mat, overwrite=False):
|
|
||||||
if name is None:
|
|
||||||
name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
|
|
||||||
print(name)
|
|
||||||
sys.stdout.flush()
|
|
||||||
outpath = outdir / (str(name) + ".feather")
|
|
||||||
outpath.parent.mkdir(parents=True,exist_ok=True)
|
|
||||||
print(outpath)
|
|
||||||
clustering = _affinity_clustering(mat, subreddits, outpath, damping, max_iter, convergence_iter, preference_quantile, random_state, verbose)
|
|
||||||
mat = sim_to_dist(clustering.affinity_matrix_)
|
|
||||||
|
|
||||||
try:
|
|
||||||
score = silhouette_score(mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
score = None
|
|
||||||
|
|
||||||
if alt_mat is not None:
|
|
||||||
alt_distances = sim_to_dist(alt_mat)
|
|
||||||
try:
|
|
||||||
alt_score = silhouette_score(alt_mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
alt_score = None
|
|
||||||
|
|
||||||
res = clustering_result(outpath=outpath,
|
|
||||||
damping=damping,
|
|
||||||
max_iter=max_iter,
|
|
||||||
convergence_iter=convergence_iter,
|
|
||||||
preference_quantile=preference_quantile,
|
|
||||||
silhouette_score=score,
|
|
||||||
alt_silhouette_score=score,
|
|
||||||
name=str(name))
|
|
||||||
|
|
||||||
return res
|
|
||||||
|
|
||||||
|
|
||||||
# alt similarities is for checking the silhouette coefficient of an alternative measure of similarity (e.g., topic similarities for user clustering).
|
|
||||||
|
|
||||||
def select_affinity_clustering(similarities, outdir, outinfo, damping=[0.9], max_iter=100000, convergence_iter=[30], preference_quantile=[0.5], random_state=1968, verbose=True, alt_similarities=None, J=None):
|
|
||||||
|
|
||||||
damping = list(map(float,damping))
|
|
||||||
convergence_iter = convergence_iter = list(map(int,convergence_iter))
|
|
||||||
preference_quantile = list(map(float,preference_quantile))
|
|
||||||
|
|
||||||
if type(outdir) is str:
|
|
||||||
outdir = Path(outdir)
|
|
||||||
|
|
||||||
outdir.mkdir(parents=True,exist_ok=True)
|
|
||||||
|
|
||||||
subreddits, mat = read_similarity_mat(similarities,use_threads=True)
|
|
||||||
|
|
||||||
if alt_similarities is not None:
|
|
||||||
alt_mat = read_similarity_mat(alt_similarities,use_threads=True)
|
|
||||||
else:
|
|
||||||
alt_mat = None
|
|
||||||
|
|
||||||
if J is None:
|
|
||||||
J = cpu_count()
|
|
||||||
pool = Pool(J)
|
|
||||||
|
|
||||||
# get list of tuples: the combinations of hyperparameters
|
|
||||||
hyper_grid = product(damping, convergence_iter, preference_quantile)
|
|
||||||
hyper_grid = (t + (str(i),) for i, t in enumerate(hyper_grid))
|
|
||||||
|
|
||||||
_do_clustering = partial(do_affinity_clustering, mat=mat, subreddits=subreddits, outdir=outdir, max_iter=max_iter, random_state=random_state, verbose=verbose, alt_mat=alt_mat)
|
|
||||||
|
|
||||||
# similarities = Array('d', mat)
|
|
||||||
# call pool.starmap
|
|
||||||
print("running clustering selection")
|
|
||||||
clustering_data = pool.starmap(_do_clustering, hyper_grid)
|
|
||||||
clustering_data = pd.DataFrame(list(clustering_data))
|
|
||||||
clustering_data.to_csv(outinfo)
|
|
||||||
|
|
||||||
|
|
||||||
return clustering_data
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
x = fire.Fire(select_affinity_clustering)
|
|
||||||
@@ -1,92 +0,0 @@
|
|||||||
from sklearn.metrics import silhouette_score
|
|
||||||
from sklearn.cluster import AffinityPropagation
|
|
||||||
from functools import partial
|
|
||||||
from clustering import _kmeans_clustering, read_similarity_mat, sim_to_dist, process_clustering_result, clustering_result
|
|
||||||
from dataclasses import dataclass
|
|
||||||
from multiprocessing import Pool, cpu_count, Array, Process
|
|
||||||
from pathlib import Path
|
|
||||||
from itertools import product, starmap
|
|
||||||
import numpy as np
|
|
||||||
import pandas as pd
|
|
||||||
import fire
|
|
||||||
import sys
|
|
||||||
|
|
||||||
@dataclass
|
|
||||||
class kmeans_clustering_result(clustering_result):
|
|
||||||
n_clusters:int
|
|
||||||
n_init:int
|
|
||||||
|
|
||||||
|
|
||||||
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.
|
|
||||||
|
|
||||||
def do_clustering(n_clusters, n_init, name, mat, subreddits, max_iter, outdir:Path, random_state, verbose, alt_mat, overwrite=False):
|
|
||||||
if name is None:
|
|
||||||
name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
|
|
||||||
print(name)
|
|
||||||
sys.stdout.flush()
|
|
||||||
outpath = outdir / (str(name) + ".feather")
|
|
||||||
print(outpath)
|
|
||||||
mat = sim_to_dist(mat)
|
|
||||||
clustering = _kmeans_clustering(mat, outpath, n_clusters, n_init, max_iter, random_state, verbose)
|
|
||||||
|
|
||||||
outpath.parent.mkdir(parents=True,exist_ok=True)
|
|
||||||
cluster_data.to_feather(outpath)
|
|
||||||
cluster_data = process_clustering_result(clustering, subreddits)
|
|
||||||
|
|
||||||
try:
|
|
||||||
score = silhouette_score(mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
score = None
|
|
||||||
|
|
||||||
if alt_mat is not None:
|
|
||||||
alt_distances = sim_to_dist(alt_mat)
|
|
||||||
try:
|
|
||||||
alt_score = silhouette_score(alt_mat, clustering.labels_, metric='precomputed')
|
|
||||||
except ValueError:
|
|
||||||
alt_score = None
|
|
||||||
|
|
||||||
res = kmeans_clustering_result(outpath=outpath,
|
|
||||||
max_iter=max_iter,
|
|
||||||
n_clusters=n_clusters,
|
|
||||||
n_init = n_init,
|
|
||||||
silhouette_score=score,
|
|
||||||
alt_silhouette_score=score,
|
|
||||||
name=str(name))
|
|
||||||
|
|
||||||
return res
|
|
||||||
|
|
||||||
|
|
||||||
# alt similarities is for checking the silhouette coefficient of an alternative measure of similarity (e.g., topic similarities for user clustering).
|
|
||||||
def select_kmeans_clustering(similarities, outdir, outinfo, n_clusters=[1000], max_iter=100000, n_init=10, random_state=1968, verbose=True, alt_similarities=None):
|
|
||||||
|
|
||||||
n_clusters = list(map(int,n_clusters))
|
|
||||||
n_init = list(map(int,n_init))
|
|
||||||
|
|
||||||
if type(outdir) is str:
|
|
||||||
outdir = Path(outdir)
|
|
||||||
|
|
||||||
outdir.mkdir(parents=True,exist_ok=True)
|
|
||||||
|
|
||||||
subreddits, mat = read_similarity_mat(similarities,use_threads=True)
|
|
||||||
|
|
||||||
if alt_similarities is not None:
|
|
||||||
alt_mat = read_similarity_mat(alt_similarities,use_threads=True)
|
|
||||||
else:
|
|
||||||
alt_mat = None
|
|
||||||
|
|
||||||
# get list of tuples: the combinations of hyperparameters
|
|
||||||
hyper_grid = product(n_clusters, n_init)
|
|
||||||
hyper_grid = (t + (str(i),) for i, t in enumerate(hyper_grid))
|
|
||||||
|
|
||||||
_do_clustering = partial(do_clustering, mat=mat, subreddits=subreddits, outdir=outdir, max_iter=max_iter, random_state=random_state, verbose=verbose, alt_mat=alt_mat)
|
|
||||||
|
|
||||||
# call starmap
|
|
||||||
print("running clustering selection")
|
|
||||||
clustering_data = starmap(_do_clustering, hyper_grid)
|
|
||||||
clustering_data = pd.DataFrame(list(clustering_data))
|
|
||||||
clustering_data.to_csv(outinfo)
|
|
||||||
|
|
||||||
return clustering_data
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
x = fire.Fire(select_kmeans_clustering)
|
|
||||||
@@ -1,7 +1,101 @@
|
|||||||
|
from sklearn.metrics import silhouette_score
|
||||||
|
from sklearn.cluster import AffinityPropagation
|
||||||
|
from functools import partial
|
||||||
|
from clustering import _affinity_clustering, read_similarity_mat
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from multiprocessing import Pool, cpu_count, Array, Process
|
||||||
|
from pathlib import Path
|
||||||
|
from itertools import product, starmap
|
||||||
|
import numpy as np
|
||||||
|
import pandas as pd
|
||||||
import fire
|
import fire
|
||||||
from select_affinity import select_affinity_clustering
|
import sys
|
||||||
from select_kmeans import select_kmeans_clustering
|
|
||||||
|
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class clustering_result:
|
||||||
|
outpath:Path
|
||||||
|
damping:float
|
||||||
|
max_iter:int
|
||||||
|
convergence_iter:int
|
||||||
|
preference_quantile:float
|
||||||
|
silhouette_score:float
|
||||||
|
alt_silhouette_score:float
|
||||||
|
name:str
|
||||||
|
|
||||||
|
|
||||||
|
def sim_to_dist(mat):
|
||||||
|
dist = 1-mat
|
||||||
|
dist[dist < 0] = 0
|
||||||
|
np.fill_diagonal(dist,0)
|
||||||
|
return dist
|
||||||
|
|
||||||
|
def do_clustering(damping, convergence_iter, preference_quantile, name, mat, subreddits, max_iter, outdir:Path, random_state, verbose, alt_mat, overwrite=False):
|
||||||
|
if name is None:
|
||||||
|
name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
|
||||||
|
print(name)
|
||||||
|
sys.stdout.flush()
|
||||||
|
outpath = outdir / (str(name) + ".feather")
|
||||||
|
print(outpath)
|
||||||
|
clustering = _affinity_clustering(mat, subreddits, outpath, damping, max_iter, convergence_iter, preference_quantile, random_state, verbose)
|
||||||
|
mat = sim_to_dist(clustering.affinity_matrix_)
|
||||||
|
|
||||||
|
score = silhouette_score(mat, clustering.labels_, metric='precomputed')
|
||||||
|
|
||||||
|
if alt_mat is not None:
|
||||||
|
alt_distances = sim_to_dist(alt_mat)
|
||||||
|
alt_score = silhouette_score(alt_mat, clustering.labels_, metric='precomputed')
|
||||||
|
|
||||||
|
res = clustering_result(outpath=outpath,
|
||||||
|
damping=damping,
|
||||||
|
max_iter=max_iter,
|
||||||
|
convergence_iter=convergence_iter,
|
||||||
|
preference_quantile=preference_quantile,
|
||||||
|
silhouette_score=score,
|
||||||
|
alt_silhouette_score=score,
|
||||||
|
name=str(name))
|
||||||
|
|
||||||
|
return res
|
||||||
|
|
||||||
|
# alt similarities is for checking the silhouette coefficient of an alternative measure of similarity (e.g., topic similarities for user clustering).
|
||||||
|
|
||||||
|
def select_affinity_clustering(similarities, outdir, outinfo, damping=[0.9], max_iter=100000, convergence_iter=[30], preference_quantile=[0.5], random_state=1968, verbose=True, alt_similarities=None, J=None):
|
||||||
|
|
||||||
|
damping = list(map(float,damping))
|
||||||
|
convergence_iter = convergence_iter = list(map(int,convergence_iter))
|
||||||
|
preference_quantile = list(map(float,preference_quantile))
|
||||||
|
|
||||||
|
if type(outdir) is str:
|
||||||
|
outdir = Path(outdir)
|
||||||
|
|
||||||
|
outdir.mkdir(parents=True,exist_ok=True)
|
||||||
|
|
||||||
|
subreddits, mat = read_similarity_mat(similarities,use_threads=True)
|
||||||
|
|
||||||
|
if alt_similarities is not None:
|
||||||
|
alt_mat = read_similarity_mat(alt_similarities,use_threads=True)
|
||||||
|
else:
|
||||||
|
alt_mat = None
|
||||||
|
|
||||||
|
if J is None:
|
||||||
|
J = cpu_count()
|
||||||
|
pool = Pool(J)
|
||||||
|
|
||||||
|
# get list of tuples: the combinations of hyperparameters
|
||||||
|
hyper_grid = product(damping, convergence_iter, preference_quantile)
|
||||||
|
hyper_grid = (t + (str(i),) for i, t in enumerate(hyper_grid))
|
||||||
|
|
||||||
|
_do_clustering = partial(do_clustering, mat=mat, subreddits=subreddits, outdir=outdir, max_iter=max_iter, random_state=random_state, verbose=verbose, alt_mat=alt_mat)
|
||||||
|
|
||||||
|
# similarities = Array('d', mat)
|
||||||
|
# call pool.starmap
|
||||||
|
print("running clustering selection")
|
||||||
|
clustering_data = pool.starmap(_do_clustering, hyper_grid)
|
||||||
|
clustering_data = pd.DataFrame(list(clustering_data))
|
||||||
|
clustering_data.to_csv(outinfo)
|
||||||
|
|
||||||
|
return clustering_data
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
fire.Fire({"kmeans":select_kmeans_clustering,
|
x = fire.Fire(select_affinity_clustering)
|
||||||
"affinity":select_affinity_clustering})
|
|
||||||
|
|||||||
381
datasets/README.md
Normal file
@@ -0,0 +1,381 @@
|
|||||||
|
# Reddit dumps → sorted parquet datasets
|
||||||
|
|
||||||
|
This directory holds the pipeline that turns compressed Reddit dump files
|
||||||
|
(`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the
|
||||||
|
sorted, repartitioned parquet datasets that the rest of the project
|
||||||
|
consumes.
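
As a rough illustration of the naming convention above, the sketch below classifies a dump file name and extracts its month. The names `DUMP_NAME`, `DumpInfo`, and `classify_dump` are hypothetical and are not part of this repository; the only thing taken from the text is the `RC_YYYY-MM.zst` / `RS_YYYY-MM.zst` pattern.

```python
import re
from typing import NamedTuple, Optional

# RC_ marks comment dumps and RS_ marks submission dumps, as described above.
DUMP_NAME = re.compile(r"^(RC|RS)_(\d{4})-(\d{2})\.zst$")


class DumpInfo(NamedTuple):
    kind: str   # "comments" or "submissions"
    year: int
    month: int


def classify_dump(filename: str) -> Optional[DumpInfo]:
    """Return the kind and month of a dump file name, or None if it does not match."""
    m = DUMP_NAME.match(filename)
    if m is None:
        return None
    prefix, year, month = m.groups()
    if not 1 <= int(month) <= 12:
        # Reject impossible months such as RC_2020-13.zst.
        return None
    kind = "comments" if prefix == "RC" else "submissions"
    return DumpInfo(kind, int(year), int(month))
```

A check like this could be used to skip stray files in the raw dumps directory before any parsing starts.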
|
||||||
|
|
||||||
|
## Pipeline overview
|
||||||
|
|
||||||
|
The raw dumps are huge compressed JSON files with a lot of metadata that
|
||||||
|
we may not need. They aren't indexed, so it's expensive to pull data from
|
||||||
|
just a handful of subreddits. It also turns out that it's a pain to read
|
||||||
|
these compressed files straight into Spark. Extracting useful variables
|
||||||
|
from the dumps and building parquet datasets makes them easier to work
|
||||||
|
with. This happens in two steps:
|
||||||
|
|
||||||
|
1. Extracting JSON into (temporary, unpartitioned) parquet files using
|
||||||
|
pyarrow.
|
||||||
|
2. Repartitioning and sorting the data using PySpark.
|
||||||
|
|
||||||
|
Breaking this down into two steps is useful because it allows us to
|
||||||
|
decompress and parse the dumps in the backfill queue and then sort them
|
||||||
|
in Spark. Partitioning the data makes it possible to efficiently read
|
||||||
|
data for specific subreddits or authors. Sorting it means that you can
|
||||||
|
efficiently compute aggregations at the subreddit or user level. More
|
||||||
|
documentation on using these files is available on the [CDSC wiki][hyak-datasets].
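
To make the point about sorting concrete, here is a toy, pure-Python sketch (not the Spark job itself) of why sorted data makes per-subreddit aggregation cheap: once rows are ordered by key, each group can be reduced in a single pass. The function name `count_by_key_sorted` and the sample rows are made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def count_by_key_sorted(rows, key):
    """Count rows per key in one pass, assuming rows are already sorted by that key."""
    # groupby only merges adjacent rows, so it depends on the sort order
    # and never needs to hold more than one group's state at a time.
    return [(k, sum(1 for _ in group)) for k, group in groupby(rows, key=itemgetter(key))]


# Hypothetical rows, already sorted by subreddit.
comments = [
    {"subreddit": "askscience", "author": "a"},
    {"subreddit": "askscience", "author": "b"},
    {"subreddit": "python", "author": "a"},
]
counts = count_by_key_sorted(comments, "subreddit")
```

If the rows were not sorted, the same subreddit could show up in several separate groups, which is why the sort step matters for this style of aggregation.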
|
||||||
|
|
||||||
|
The final datasets are in `/gscratch/comdata/output`:
|
||||||
|
|
||||||
|
- `reddit_comments_by_author.parquet` has comments partitioned and sorted
|
||||||
|
by username (lowercase).
|
||||||
|
- `reddit_comments_by_subreddit.parquet` has comments partitioned and
|
||||||
|
sorted by subreddit name (lowercase).
|
||||||
|
- `reddit_submissions_by_author.parquet` has submissions partitioned and
|
||||||
|
sorted by username (lowercase).
|
||||||
|
- `reddit_submissions_by_subreddit.parquet` has submissions partitioned
|
||||||
|
and sorted by subreddit name (lowercase).
|
||||||
|
|
||||||
|
[hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets
|
||||||
|
|
||||||
|
## Scripts
|
||||||
|
|
||||||
|
| Script | Role |
|
||||||
|
|---|---|
|
||||||
|
| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file and exposes `parse_dump <file>` and `gen_task_list` subcommands via `fire`. |
|
||||||
|
| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads a directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Accepts `--indir` and `--mode` to support layered appends; defaults match the build-from-scratch workflow. |
|
||||||
|
| `comments_merge.py`, `submissions_merge.py` | Merge entry points. Each is a Spark job that collapses all accumulated layers in the final datasets into a single clean layer. Launched via `start_spark_and_run.sh`. |
|
||||||
|
| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` / `merge_layers` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
|
||||||
|
| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |
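
To make the role of the file-open helpers concrete, here is a minimal sketch of opening a dump by file extension. It is not the code in `helper.py`: the function name `open_dump` is hypothetical, and the `.zst` branch assumes the third-party `zstandard` package with a large `max_window_size`, which Pushshift-style dumps generally need.

```python
import bz2
import gzip
import io
import lzma


def open_dump(path):
    """Return a text-mode file object for a compressed dump (hypothetical helper)."""
    if path.endswith(".zst"):
        # Imported lazily so the other formats work without zstandard installed.
        import zstandard

        raw = open(path, "rb")
        # Assumption: the dumps were compressed with a long window (up to 2**31 bytes).
        reader = zstandard.ZstdDecompressor(max_window_size=2**31).stream_reader(raw)
        return io.TextIOWrapper(reader, encoding="utf-8")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8")
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    raise ValueError(f"unsupported dump extension: {path}")
```

Each branch yields lines of JSON text, so a parse loop can iterate over the returned object without caring which compression format was used.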
|
||||||
|
|
||||||
|
## The three workflows
|
||||||
|
|
||||||
|
### Build from scratch — `build_from_scratch.sh`
|
||||||
|
|
||||||
|
Use this when there is no existing parquet output, or when the upstream
|
||||||
|
data has changed in a way that requires reparsing everything. Wipes the
|
||||||
|
per-source temp directories, processes every `RC_*` / `RS_*` dump in the
|
||||||
|
raw dumps directory through Part 1 (in parallel via GNU parallel), then
|
||||||
|
runs the Part 2 Spark sort.
|
||||||
|
|
||||||
|
### Add new months — `add_months.sh YYYY-MM [YYYY-MM ...]`
|
||||||
|
|
||||||
|
> **NOTE: written but not yet tested. Remove this notice after a
|
||||||
|
> successful end-to-end run.**
|
||||||
|
|
||||||
|
Use this for routine incremental updates. Runs Part 1 on only the
|
||||||
|
specified months, then appends the sorted output as a new layer of
|
||||||
|
partition files alongside the existing ones. No existing data is
|
||||||
|
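rewritten. Note that the script itself stops once the sorted layer is in
the staging directories; copying it into the live datasets is a manual
step, run after you have checked the staging output.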
|
||||||
|
|
||||||
|
Each run adds one layer to each final dataset directory. Spark and DuckDB
|
||||||
|
read all layers together correctly. At a yearly update cadence the number
|
||||||
|
of layers stays small; use `merge_layers.sh` to collapse them when
|
||||||
|
needed.
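
To decide whether a merge is due, it can help to count how many layers a dataset directory currently holds. The sketch below is one way to do that, assuming each Spark write job stamps all of its part files with a single UUID (as in `part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`). The function name `count_layers` is hypothetical and not part of this repo.

```python
import os
import re
from collections import Counter

# Part files look like part-<n>-<uuid>-c000.snappy.parquet. The assumption is
# that every file written by one Spark job carries the same UUID.
PART_RE = re.compile(
    r"^part-\d+-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})[-.]"
)


def count_layers(dataset_dir):
    """Map each write-job UUID found in dataset_dir to its number of part files."""
    layers = Counter()
    for name in os.listdir(dataset_dir):
        match = PART_RE.match(name)
        if match:
            layers[match.group(1)] += 1
    return layers
```

For example, `len(count_layers("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"))` would give the number of layers in the comments-by-subreddit dataset under that assumption.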
|
||||||
|
|
||||||
|
#### Environment setup
|
||||||
|
|
||||||
|
The Python environment runs inside a Singularity container. Set `PYTHON`
|
||||||
|
to the full path of the venv interpreter so that `parallel` jobs use the
|
||||||
|
right Python (fresh shells spawned by `parallel` don't inherit the active
|
||||||
|
venv):
|
||||||
|
|
||||||
|
```sh
|
||||||
|
export PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3
|
||||||
|
```
|
||||||
|
|
||||||
|
The `.zst` decompression uses the `zstandard` Python library rather than
|
||||||
|
the system `zstd` binary, which is inaccessible from inside the container.
|
||||||
|
|
||||||
|
#### Dump directory
|
||||||
|
|
||||||
|
The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
|
||||||
|
`SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
|
||||||
|
via environment variables if the files are not in the standard locations:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
COMMENTS_DUMPDIR=/path/to/new/comments \
|
||||||
|
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./datasets/add_months.sh 2025-01 2025-02
|
||||||
|
```
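
Before launching a long run it may be worth confirming that every expected dump is actually in place. The following is a minimal pre-flight sketch, assuming the `RC_YYYY-MM.zst` / `RS_YYYY-MM.zst` naming described in `add_months.sh`. The function name `missing_dumps` is hypothetical and not part of this repo.

```python
import os
import re

# Accepts months such as 2025-01; rejects 2025-13 or 25-1.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def missing_dumps(months, comments_dir, submissions_dir):
    """Return the expected dump paths that do not exist on disk."""
    missing = []
    for month in months:
        if not MONTH_RE.fullmatch(month):
            raise ValueError(f"not a YYYY-MM month: {month!r}")
        for directory, prefix in ((comments_dir, "RC"), (submissions_dir, "RS")):
            path = os.path.join(directory, f"{prefix}_{month}.zst")
            if not os.path.isfile(path):
                missing.append(path)
    return missing
```

An empty return value means every requested month has both a comments and a submissions dump in the given directories.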
|
||||||
|
|
||||||
|
#### Running as a Slurm job
|
||||||
|
|
||||||
|
The recommended way to run `add_months.sh` is via `srun` on a fat
|
||||||
|
`cpu-g2` node. Using `srun` (rather than `salloc`) means the node is
|
||||||
|
released automatically as soon as the script finishes, regardless of the
|
||||||
|
walltime. Run from a login node inside a `tmux` session so the terminal
|
||||||
|
survives disconnections:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
tmux new -s add_months
|
||||||
|
|
||||||
|
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
|
||||||
|
bash -l -c "
|
||||||
|
cd /mmfs1/gscratch/comdata/users/makohill/cdsc_reddit && \
|
||||||
|
PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3 \
|
||||||
|
COMMENTS_DUMPDIR=/path/to/new/comments \
|
||||||
|
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
|
||||||
|
./datasets/add_months.sh --clean 2025-01 2025-02 ... YYYY-MM
|
||||||
|
" 2>&1 | tee /gscratch/comdata/users/makohill/add_months_run.log
|
||||||
|
```
|
||||||
|
|
||||||
|
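The `-l` flag runs `bash` as a login shell, which reads your login startup files (typically `~/.bash_profile`, which usually sources `.bashrc`) on the compute node so the Spark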
|
||||||
|
environment is available. The `tee` command writes output to both the
|
||||||
|
terminal and a log file so you can review it later.
|
||||||
|
|
||||||
|
Detach from tmux with `Ctrl-b d` and reattach with `tmux attach -t add_months`.
|
||||||
|
|
||||||
|
For a multi-node Spark cluster instead, use `add_months_multinode.sh`
|
||||||
|
from a login node — it takes the number of nodes as its first argument.
|
||||||
|
|
||||||
|
### Merge layers — `merge_layers.sh`
|
||||||
|
|
||||||
|
> **NOTE: written but not yet tested. Remove this notice after a
|
||||||
|
> successful end-to-end run.**
|
||||||
|
|
||||||
|
Use this to collapse accumulated layers from incremental adds into a
|
||||||
|
single clean layer. Reads the existing final datasets, re-sorts
|
||||||
|
everything, writes to `.merging` temp paths, then atomically replaces the
|
||||||
|
originals via rename.
|
||||||
|
|
||||||
|
Run this when query performance has degraded due to many layers, or any
|
||||||
|
time you want a clean single-file-per-partition layout. The existing
|
||||||
|
datasets are safe until the rename step completes; see `merge_layers.sh`
|
||||||
|
for recovery notes if interrupted. As with `add_months.sh`, Part 2 can
|
||||||
|
run on a single fat node or via `start_spark_and_run.sh`.
|
||||||
|
|
||||||
|
## Running steps individually
|
||||||
|
|
||||||
|
The `.sh` runners are written so that every meaningful step is a
|
||||||
|
separate, self-contained command. If something fails partway through, or
|
||||||
|
you want to inspect intermediate state, you can copy any single line out
|
||||||
|
of the runner and execute it standalone. For example:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
# parse one specific file (skipping the rest of the workflow)
|
||||||
|
python3 comments_part1.py parse_dump RC_2025-03.zst
|
||||||
|
|
||||||
|
# override default dump/output paths from the CLI
|
||||||
|
python3 comments_part1.py parse_dump RC_2025-03.zst \
|
||||||
|
--dumpdir=/tmp/test --outdir=/tmp/out
|
||||||
|
|
||||||
|
# regenerate just the task list
|
||||||
|
python3 submissions_part1.py gen_task_list
|
||||||
|
```
|
||||||
|
|
||||||
|
The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
|
||||||
|
Hyak-provided wrapper not included in this repo); see the wiki for the
|
||||||
|
launch convention.
|
||||||
|
|
||||||
|
## Detailed walkthrough: refreshing the data on Hyak
|
||||||
|
|
||||||
|
This walkthrough describes the process we went through updating Reddit
|
||||||
|
data from the Pushshift cutoff up to the end of 2024. Adapting it for
|
||||||
|
newer data should just involve using different academic torrent files
|
||||||
|
that start from 2025 onwards. For incremental updates, the
|
||||||
|
`add_months.sh` workflow above is much shorter; this walkthrough is
|
||||||
|
for the bulk-refresh case.
|
||||||
|
|
||||||
|
### Prerequisites
|
||||||
|
|
||||||
|
- [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config
|
||||||
|
and `.bashrc`)
|
||||||
|
- [Go through the Hyak Getting Started tutorial][hyak-syllabus]
|
||||||
|
|
||||||
|
Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`):
|
||||||
|
|
||||||
|
- Watchful1's [Reddit post explaining the per-subreddit dump files][watchful1-explainer],
  the [bulk dataset not divided by subreddit][watchful1-bulk],
  and the [GitHub repo with scripts for analyzing the data][watchful1-repo]
|
||||||
|
- [RaiderBDev monthly dumps][raiderbdev-monthly] and
|
||||||
|
[RaiderBDev's ArcticShift API][arctic-shift]
|
||||||
|
- The [2005-06 to 2024-12 academic torrent][academic-torrent] used for
|
||||||
|
the 2005-2024 refresh
|
||||||
|
|
||||||
|
CDSC and Hyak docs:
|
||||||
|
|
||||||
|
- [Hyak docs — how to work with modules][hyak-modules]
|
||||||
|
- [CDSC — how to download Python or R packages][cdsc-pkgs]
|
||||||
|
- [CDSC — Hyak datasets information][hyak-datasets]
|
||||||
|
- [CDSC — Hyak Spark information][hyak-spark]
|
||||||
|
|
||||||
|
[hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak
|
||||||
|
[hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/
|
||||||
|
[watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/
|
||||||
|
[watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/
|
||||||
|
[watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master
|
||||||
|
[raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/
|
||||||
|
[arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift
|
||||||
|
[academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
|
||||||
|
[hyak-modules]: https://hyak.uw.edu/docs/tools/modules
|
||||||
|
[cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages
|
||||||
|
[hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark
|
||||||
|
|
||||||
|
### Step 1: data download on Nada and Hyak
|
||||||
|
|
||||||
|
We downloaded the [2005-2024 academic torrent][academic-torrent] and put
|
||||||
|
it on Nada (~2 days of downloading). We then copied the raw data to a
new directory in Hyak's scrubbed storage,
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with the
raw data sorted into `/comments` and `/submissions`. The `/submissions`
directory contains `RS_20*.zst` files and `/comments` contains
`RC_20*.zst` files. (There are no older compression formats, such as
`.bz2` or `.xz`, to deal with.)
|
||||||
|
|
||||||
|
### Step 2: clone the repo on Hyak
|
||||||
|
|
||||||
|
On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the
|
||||||
|
working directory next to the raw data, e.g.
|
||||||
|
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant
|
||||||
|
code lives entirely in `datasets/`:
|
||||||
|
|
||||||
|
- `dumps_helper.py` — shared parsing and Spark logic
|
||||||
|
- `helper.py` — file-open helpers
|
||||||
|
- `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
|
||||||
|
- `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
|
||||||
|
- `comments_merge.py`, `submissions_merge.py` — merge entry points
|
||||||
|
- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
|
||||||
|
|
||||||
|
The Spark wrapper scripts (`start_spark_and_run.sh`,
|
||||||
|
`start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
|
||||||
|
they are part of the CDSC Hyak environment and should already be on
|
||||||
|
PATH.
|
||||||
|
|
||||||
|
### Step 3: smoke-test Part 1 on a single file
|
||||||
|
|
||||||
|
Check out a compute node with `any_machine`. We'll test submissions Part 1 with just one
|
||||||
|
file:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
python3 submissions_part1.py parse_dump RS_2005-06.zst
|
||||||
|
```
|
||||||
|
|
||||||
|
To verify, go to your output directory and examine the start of the
|
||||||
|
file:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())"
|
||||||
|
```
|
||||||
|
|
||||||
|
You should see columns like `id`, `author`, `subreddit`, and `title`
|
||||||
|
printed out. Repeat the process with `comments_part1.py`; you should see
|
||||||
|
columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out.
|
||||||
|
|
||||||
|
**Note**: you may have to install relevant libraries before successfully
|
||||||
|
running the file:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
pip install --user pyarrow simdjson zstandard fire
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Part 1 — converting `.zst` to `.parquet` files
|
||||||
|
|
||||||
|
Now we'll convert all of our `.zst` compressed Reddit data to `.parquet`
|
||||||
|
files. First, to generate our task list, we'll run
|
||||||
|
|
||||||
|
```sh
|
||||||
|
python3 submissions_part1.py gen_task_list
|
||||||
|
```
|
||||||
|
|
||||||
|
There should now be a task list file, `parse_submissions_task_list`, in the working
directory. Check it (`less parse_submissions_task_list`); it
|
||||||
|
should have many lines that look like our earlier test command,
|
||||||
|
`python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of
|
||||||
|
our `.zst` files. Do the same process with comments to generate
|
||||||
|
`parse_comments_task_list`.
|
||||||
|
|
||||||
|
From a login node, run `tmux` to keep our job running and then
|
||||||
|
`any_machine` to check out a node to do computational work. We'll run
|
||||||
|
our tasks (from the task list) in parallel so that all of the node's cores are used. Start with
|
||||||
|
submissions:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list
|
||||||
|
```
|
||||||
|
|
||||||
|
The `--joblog` flag creates a text file where you can see which tasks
|
||||||
|
completed successfully, and the `--results` flag creates a directory
that holds each task's stdout and stderr, so you can see the specific
error when a task fails (which makes debugging much easier).
|
||||||
|
|
||||||
|
Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`).
|
||||||
|
We'll ssh into our computational node (`ssh n1234` — you can get the
|
||||||
|
node name by running `ourjobs`) and run `htop`
|
||||||
|
([more details on htop][htop-explainer]). You should see that the
|
||||||
|
machine's CPUs are getting close to 100% usage. If all looks good,
|
||||||
|
create a new window and repeat the process for comments.
|
||||||
|
|
||||||
|
[htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/
|
||||||
|
|
||||||
|
Once the job has successfully completed, you'll see that your CPUs are
|
||||||
|
closer to 0% usage in `htop` and your `submissions_joblog.txt` file
|
||||||
|
should show an `Exitval` of 0 for all commands. Release your node by
|
||||||
|
running `scancel 12345678` (the job ID can be found from `ourjobs`).
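
With hundreds of tasks, scanning the joblog by eye is error-prone. The sketch below lists the jobs that did not finish cleanly. It relies on the tab-separated layout that GNU parallel writes with `--joblog` (columns `Seq`, `Host`, `Starttime`, `JobRuntime`, `Send`, `Receive`, `Exitval`, `Signal`, `Command`). The function name `failed_jobs` is hypothetical and not part of this repo.

```python
import csv


def failed_jobs(joblog_path):
    """Return (seq, exitval, signal, command) for every job that did not exit cleanly."""
    failures = []
    with open(joblog_path, newline="") as handle:
        # QUOTE_NONE because the logged commands may contain quote characters.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["Exitval"] != "0" or row["Signal"] != "0":
                failures.append(
                    (int(row["Seq"]), int(row["Exitval"]), int(row["Signal"]), row["Command"])
                )
    return failures
```

An empty list from `failed_jobs("submissions_joblog.txt")` means every task exited with status 0; otherwise the returned commands can be rerun individually.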
|
||||||
|
|
||||||
|
### Step 5: verify the per-source parquet files
|
||||||
|
|
||||||
|
We'll want to verify our `.parquet` files at this point. We compared the
|
||||||
|
new files' number of columns and rows to the old data: from the
|
||||||
|
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet`
|
||||||
|
directory, run
|
||||||
|
|
||||||
|
```sh
|
||||||
|
diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet)
|
||||||
|
```
|
||||||
|
|
||||||
|
and confirm there are no differences (same process with submissions).
|
||||||
|
This may or may not be relevant if we continue using the same academic
|
||||||
|
torrent to update data and have nothing to compare to, but you can still
|
||||||
|
check that the new data's number of columns and rows are fairly
|
||||||
|
continuous with the most recent data we already have.
|
||||||
|
|
||||||
|
### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
|
||||||
|
|
||||||
|
If the `.parquet` files reasonably appear to be complete, we can now
|
||||||
|
sort them by author and subreddit. The most efficient way to do so is via
|
||||||
|
`srun` on a `cpu-g2` node (128 CPUs, ~1 TB RAM). Using `srun` releases
|
||||||
|
the node automatically when the job finishes. Run from a login node
|
||||||
|
inside `tmux`:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
|
||||||
|
bash -l -c "
|
||||||
|
cd /path/to/cdsc_reddit/datasets && \
|
||||||
|
source \$SPARK_CONF_DIR/spark-env.sh && \
|
||||||
|
start_spark_cluster.sh && \
|
||||||
|
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT submissions_part2.py && \
|
||||||
|
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT comments_part2.py && \
|
||||||
|
stop-all.sh
|
||||||
|
"
|
||||||
|
```
|
||||||
|
|
||||||
|
[hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
|
||||||
|
|
||||||
|
Monitor via `htop` (as described in Step 4); the CPUs may not always
|
||||||
|
show high usage but you should see that memory is being used. The command
above runs submissions and then comments, so no separate comments run is
needed. Successful jobs will result in
|
||||||
|
`/gscratch/comdata/output` having four new directories:
|
||||||
|
`reddit_submissions_by_author.parquet`,
|
||||||
|
`reddit_submissions_by_subreddit.parquet`,
|
||||||
|
`reddit_comments_by_author.parquet`, and
|
||||||
|
`reddit_comments_by_subreddit.parquet`. Each should contain many
|
||||||
|
`snappy.parquet` files (e.g.
|
||||||
|
`part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`)
|
||||||
|
and `_SUCCESS`.
|
||||||
|
|
||||||
|
### Step 7: data verification
|
||||||
|
|
||||||
|
Make sure the new data is reasonably complete before deleting
any of the old data. Plot a simple time series of posts per day and
check that the counts do not drop off unexpectedly. It is also useful to
|
||||||
|
have lab members test out anything they're working on again with the
|
||||||
|
new parquet files.
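
The posts-per-day check can be sketched in a few lines. The example below assumes the `created_utc` values have already been loaded as Python `datetime` objects (for instance from a sample of part files); at full-archive scale the same aggregation would more sensibly be done in Spark or DuckDB. The function names and the `drop_ratio` threshold are hypothetical choices for illustration.

```python
from collections import Counter
from datetime import timedelta


def posts_per_day(timestamps):
    """Count items per calendar day from an iterable of datetime objects."""
    return Counter(ts.date() for ts in timestamps)


def suspicious_days(counts, drop_ratio=0.5):
    """Return (day, count) for days with no posts or a sharp drop from the day before."""
    if not counts:
        return []
    flagged = []
    day, last = min(counts), max(counts)
    previous = None
    while day <= last:
        n = counts.get(day, 0)
        # Flag empty days, and days below drop_ratio times the previous day's count.
        if n == 0 or (previous and n < drop_ratio * previous):
            flagged.append((day, n))
        previous = n
        day += timedelta(days=1)
    return flagged
```

Any day this returns is a candidate for a closer look, for example a month whose dump was truncated or skipped in Part 1.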
|
||||||
|
|
||||||
|
## See also
|
||||||
|
|
||||||
|
The CDSC wiki page
|
||||||
|
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
|
||||||
|
is the landing page for this project on the wiki and provides
|
||||||
|
cross-links to related CDSC and Hyak documentation. The walkthrough
|
||||||
|
above used to live there; it now lives here so that doc and code stay
|
||||||
|
in sync.
|
||||||
162
datasets/add_months.sh
Executable file
@@ -0,0 +1,162 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
#
|
||||||
|
# Add one or more new months to the existing parquet datasets using a
|
||||||
|
# layered append. Designed to run on a single fat node (e.g. cpu-g2 with
|
||||||
|
# 128 cores / ~1TB RAM). For a multi-node Spark cluster instead, see
|
||||||
|
# add_months_multinode.sh.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# add_months.sh [--clean] YYYY-MM [YYYY-MM ...]
|
||||||
|
#
|
||||||
|
# Example:
|
||||||
|
# add_months.sh 2025-01 2025-02 2025-03
|
||||||
|
#
|
||||||
|
# If temp or staging directories from a previous run exist, the script
|
||||||
|
# will exit with an error. Pass --clean to wipe them before starting.
|
||||||
|
#
|
||||||
|
# The new .zst dump files must live at:
|
||||||
|
# $COMMENTS_DUMPDIR/RC_YYYY-MM.zst
|
||||||
|
# $SUBMISSIONS_DUMPDIR/RS_YYYY-MM.zst
|
||||||
|
#
|
||||||
|
# Override the dump directories via environment variables if the new files
|
||||||
|
# are not in the standard locations:
|
||||||
|
#
|
||||||
|
# COMMENTS_DUMPDIR=/path/to/new/comments \
|
||||||
|
# SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
|
||||||
|
# ./add_months.sh 2025-01 2025-02
|
||||||
|
#
|
||||||
|
# Workflow:
|
||||||
|
# Part 1 — parse new .zst files into per-month parquets (parallel)
|
||||||
|
# Part 2 — sort into staging directories, not the live datasets (Spark)
|
||||||
|
# [script exits here — verify staging before continuing]
|
||||||
|
# Copy — move staging files into live datasets (run manually after verify)
|
||||||
|
# Cleanup — remove temp and staging dirs (run manually after copy)
|
||||||
|
#
|
||||||
|
# NOTE: This script and its workflow are written but not yet tested.
|
||||||
|
# Remove this notice after a successful end-to-end run.
|
||||||
|
#
|
||||||
|
# Every command below is independently runnable for debugging.
|
||||||
|
|
||||||
|
set -e
|
||||||
|
cd "$(dirname "$0")"
|
||||||
|
|
||||||
|
CLEAN=0
|
||||||
|
if [ "${1:-}" = "--clean" ]; then
|
||||||
|
CLEAN=1
|
||||||
|
shift
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $# -eq 0 ]; then
|
||||||
|
echo "Usage: $0 [--clean] YYYY-MM [YYYY-MM ...]" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/comments}"
|
||||||
|
SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/submissions}"
|
||||||
|
PYTHON="${PYTHON:-python3}"
|
||||||
|
|
||||||
|
# Part 1 temp dirs (per-month parquets, parsed from .zst)
|
||||||
|
TEMP_COMMENTS="/gscratch/comdata/output/temp/add_months_comments.parquet"
|
||||||
|
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/add_months_submissions.parquet"
|
||||||
|
|
||||||
|
# Staging dirs (sorted new layer; inspected before copying to live)
|
||||||
|
STAGING_COMMENTS_SUB="/gscratch/comdata/output/temp/new_layer_comments_by_subreddit.parquet"
|
||||||
|
STAGING_COMMENTS_AUTH="/gscratch/comdata/output/temp/new_layer_comments_by_author.parquet"
|
||||||
|
STAGING_SUBMISSIONS_SUB="/gscratch/comdata/output/temp/new_layer_submissions_by_subreddit.parquet"
|
||||||
|
STAGING_SUBMISSIONS_AUTH="/gscratch/comdata/output/temp/new_layer_submissions_by_author.parquet"
|
||||||
|
|
||||||
|
# Live dataset dirs
|
||||||
|
LIVE_COMMENTS_SUB="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"
|
||||||
|
LIVE_COMMENTS_AUTH="/gscratch/comdata/output/reddit_comments_by_author.parquet"
|
||||||
|
LIVE_SUBMISSIONS_SUB="/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet"
|
||||||
|
LIVE_SUBMISSIONS_AUTH="/gscratch/comdata/output/reddit_submissions_by_author.parquet"
|
||||||
|
|
||||||
|
# --- Check for leftover output from a previous run --------------------------
|
||||||
|
|
||||||
|
EXISTING=()
|
||||||
|
for d in "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS" \
|
||||||
|
"$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH" \
|
||||||
|
"$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"; do
|
||||||
|
[ -e "$d" ] && EXISTING+=("$d")
|
||||||
|
done
|
||||||
|
|
||||||
|
if [ ${#EXISTING[@]} -gt 0 ]; then
|
||||||
|
if [ $CLEAN -eq 1 ]; then
|
||||||
|
echo "Removing leftover files from previous run..."
|
||||||
|
rm -rf "${EXISTING[@]}"
|
||||||
|
rm -f add_months_tasks.txt add_months_joblog.txt
|
||||||
|
rm -rf add_months_logs/
|
||||||
|
else
|
||||||
|
echo "Error: leftover files from a previous run exist:" >&2
|
||||||
|
printf ' %s\n' "${EXISTING[@]}" >&2
|
||||||
|
echo "Re-run with --clean to remove them before starting." >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
# --- Part 1: parse new months in parallel (comments and submissions together) -
|
||||||
|
|
||||||
|
printf "$PYTHON comments_part1.py parse_dump RC_%s.zst --dumpdir=\"$COMMENTS_DUMPDIR\" --outdir=\"$TEMP_COMMENTS\"\n" "$@" \
|
||||||
|
> add_months_tasks.txt
|
||||||
|
|
||||||
|
printf "$PYTHON submissions_part1.py parse_dump RS_%s.zst --dumpdir=\"$SUBMISSIONS_DUMPDIR\" --outdir=\"$TEMP_SUBMISSIONS\"\n" "$@" \
|
||||||
|
>> add_months_tasks.txt
|
||||||
|
|
||||||
|
parallel --joblog add_months_joblog.txt --results add_months_logs \
|
||||||
|
< add_months_tasks.txt
|
||||||
|
|
||||||
|
# --- Part 2: sort new months into staging (Spark, single fat node) ----------
|
||||||
|
|
||||||
|
source "$SPARK_CONF_DIR/spark-env.sh"
|
||||||
|
start_spark_cluster.sh
|
||||||
|
|
||||||
|
spark-submit --master "spark://$(hostname):$SPARK_MASTER_PORT" \
|
||||||
|
comments_part2.py \
|
||||||
|
--indir="$TEMP_COMMENTS" \
|
||||||
|
--out_by_subreddit="$STAGING_COMMENTS_SUB" \
|
||||||
|
--out_by_author="$STAGING_COMMENTS_AUTH"
|
||||||
|
|
||||||
|
spark-submit --master "spark://$(hostname):$SPARK_MASTER_PORT" \
|
||||||
|
submissions_part2.py \
|
||||||
|
--indir="$TEMP_SUBMISSIONS" \
|
||||||
|
--out_by_subreddit="$STAGING_SUBMISSIONS_SUB" \
|
||||||
|
--out_by_author="$STAGING_SUBMISSIONS_AUTH"
|
||||||
|
|
||||||
|
stop-all.sh
|
||||||
|
|
||||||
|
# --- Verify: inspect staging before copying to live -------------------------
|
||||||
|
#
|
||||||
|
# The script stops here. Check the staging output looks right before running
|
||||||
|
# the copy step manually. The live datasets are untouched at this point.
|
||||||
|
# Example checks:
|
||||||
|
#
|
||||||
|
# ls -lah "$STAGING_COMMENTS_SUB" | head
|
||||||
|
# python3 -c "
|
||||||
|
# import pyarrow.parquet as pq, os
|
||||||
|
# f = sorted(os.listdir('$STAGING_COMMENTS_SUB'))[0]
|
||||||
|
# t = pq.read_table('$STAGING_COMMENTS_SUB/' + f, columns=['created_utc'])
|
||||||
|
# print(t.column('created_utc')[0].as_py(), t.column('created_utc')[-1].as_py())
|
||||||
|
# "
|
||||||
|
|
||||||
|
exit 0
|
||||||
|
|
||||||
|
# --- Copy: add staging files into live datasets -----------------------------
|
||||||
|
#
|
||||||
|
# Run these lines manually after verifying staging. This is the only step
|
||||||
|
# that touches the live datasets. It only adds new files: existing data files
# are never deleted or overwritten (same-named bookkeeping files such as
# Spark's _SUCCESS marker are replaced).
|
||||||
|
|
||||||
|
find "$STAGING_COMMENTS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_SUB"/ \;
|
||||||
|
find "$STAGING_COMMENTS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_AUTH"/ \;
|
||||||
|
find "$STAGING_SUBMISSIONS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_SUB"/ \;
|
||||||
|
find "$STAGING_SUBMISSIONS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_AUTH"/ \;
|
||||||
|
|
||||||
|
# --- Cleanup: remove temp and staging dirs ----------------------------------
|
||||||
|
#
|
||||||
|
# Run after confirming the copy succeeded and the live datasets look right.
|
||||||
|
|
||||||
|
rm -f add_months_tasks.txt add_months_joblog.txt
|
||||||
|
rm -rf add_months_logs/
|
||||||
|
rm -rf "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS"
|
||||||
|
rm -rf "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH"
|
||||||
|
rm -rf "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"
|
||||||
89
datasets/add_months_multinode.sh
Executable file
@@ -0,0 +1,89 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
#
|
||||||
|
# Multi-node variant of add_months.sh. Uses start_spark_and_run.sh to
|
||||||
|
# allocate a Spark cluster across multiple nodes via salloc. Run this
|
||||||
|
# from a login node.
|
||||||
|
#
|
||||||
|
# For the common single-fat-node case, use add_months.sh instead.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# add_months_multinode.sh NODES YYYY-MM [YYYY-MM ...]
|
||||||
|
#
|
||||||
|
# Example (2 nodes, 3 months):
|
||||||
|
# add_months_multinode.sh 2 2025-01 2025-02 2025-03
|
||||||
|
#
|
||||||
|
# Override dump directories via environment variables if needed:
|
||||||
|
#
|
||||||
|
# COMMENTS_DUMPDIR=/path/to/new/comments \
|
||||||
|
# SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
|
||||||
|
# ./add_months_multinode.sh 2 2025-01 2025-02
|
||||||
|
#
|
||||||
|
# NOTE: This script and its workflow are written but not yet tested.
|
||||||
|
# Remove this notice after a successful end-to-end run.
|
||||||
|
|
||||||
|
set -e
|
||||||
|
cd "$(dirname "$0")"
|
||||||
|
|
||||||
|
NODES="${1:-}"
|
||||||
|
if [ -z "$NODES" ] || [ $# -lt 2 ]; then
|
||||||
|
echo "Usage: $0 NODES YYYY-MM [YYYY-MM ...]" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
shift
|
||||||
|
MONTHS=("$@")
|
||||||
|
|
||||||
|
COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/comments}"
|
||||||
|
SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/submissions}"
|
||||||
|
PYTHON="${PYTHON:-python3}"
|
||||||
|
|
||||||
|
TEMP_COMMENTS="/gscratch/comdata/output/temp/add_months_comments.parquet"
|
||||||
|
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/add_months_submissions.parquet"
|
||||||
|
STAGING_COMMENTS_SUB="/gscratch/comdata/output/temp/new_layer_comments_by_subreddit.parquet"
|
||||||
|
STAGING_COMMENTS_AUTH="/gscratch/comdata/output/temp/new_layer_comments_by_author.parquet"
|
||||||
|
STAGING_SUBMISSIONS_SUB="/gscratch/comdata/output/temp/new_layer_submissions_by_subreddit.parquet"
|
||||||
|
STAGING_SUBMISSIONS_AUTH="/gscratch/comdata/output/temp/new_layer_submissions_by_author.parquet"
|
||||||
|
LIVE_COMMENTS_SUB="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"
|
||||||
|
LIVE_COMMENTS_AUTH="/gscratch/comdata/output/reddit_comments_by_author.parquet"
|
||||||
|
LIVE_SUBMISSIONS_SUB="/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet"
|
||||||
|
LIVE_SUBMISSIONS_AUTH="/gscratch/comdata/output/reddit_submissions_by_author.parquet"
|
||||||
|
|
||||||
|
# --- Part 1: parse new months in parallel -----------------------------------
|
||||||
|
|
||||||
|
printf "$PYTHON comments_part1.py parse_dump RC_%s.zst --dumpdir=\"$COMMENTS_DUMPDIR\" --outdir=\"$TEMP_COMMENTS\"\n" "${MONTHS[@]}" \
|
||||||
|
> add_months_comments_tasks.txt
|
||||||
|
|
||||||
|
printf "$PYTHON submissions_part1.py parse_dump RS_%s.zst --dumpdir=\"$SUBMISSIONS_DUMPDIR\" --outdir=\"$TEMP_SUBMISSIONS\"\n" "${MONTHS[@]}" \
|
||||||
|
> add_months_submissions_tasks.txt
|
||||||
|
|
||||||
|
parallel --joblog add_months_comments_joblog.txt --results add_months_comments_logs \
|
||||||
|
< add_months_comments_tasks.txt
|
||||||
|
|
||||||
|
parallel --joblog add_months_submissions_joblog.txt --results add_months_submissions_logs \
|
||||||
|
< add_months_submissions_tasks.txt
|
||||||
|
|
||||||
|
# --- Part 2: sort new months into staging (multi-node Spark cluster) --------
|
||||||
|
|
||||||
|
start_spark_and_run.sh "$NODES" comments_part2.py \
|
||||||
|
--indir="$TEMP_COMMENTS" \
|
||||||
|
--out_by_subreddit="$STAGING_COMMENTS_SUB" \
|
||||||
|
--out_by_author="$STAGING_COMMENTS_AUTH"
|
||||||
|
|
||||||
|
start_spark_and_run.sh "$NODES" submissions_part2.py \
|
||||||
|
--indir="$TEMP_SUBMISSIONS" \
|
||||||
|
--out_by_subreddit="$STAGING_SUBMISSIONS_SUB" \
|
||||||
|
--out_by_author="$STAGING_SUBMISSIONS_AUTH"
|
||||||
|
|
||||||
|
# --- Verify staging, then copy and cleanup manually -------------------------
|
||||||
|
#
|
||||||
|
# See add_months.sh for verify/copy/cleanup commands — they are identical.
|
||||||
|
|
||||||
|
exit 0
|
||||||
|
|
||||||
|
find "$STAGING_COMMENTS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_SUB"/ \;
|
||||||
|
find "$STAGING_COMMENTS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_AUTH"/ \;
|
||||||
|
find "$STAGING_SUBMISSIONS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_SUB"/ \;
|
||||||
|
find "$STAGING_SUBMISSIONS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_AUTH"/ \;
|
||||||
|
|
||||||
|
rm -rf "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS"
|
||||||
|
rm -rf "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH"
|
||||||
|
rm -rf "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"
|
||||||
56
datasets/build_from_scratch.sh
Executable file
@@ -0,0 +1,56 @@
#!/usr/bin/env bash
#
# Build the sorted, partitioned Reddit parquet datasets from scratch.
#
# Wipes the per-source temp directories, processes every RC_* and RS_* dump
# in the raw_data dumps directory through Part 1 (per-file, parallel), then
# runs the Part 2 Spark sort + repartition for both comments and submissions.
#
# Every command below is independently runnable — to debug a single stage,
# copy the line out and run it directly. Run the whole script end-to-end
# only when you trust each step.
#
# Prerequisites:
#   - raw .zst dumps already staged in the dumpdir locations (see the
#     defaults in dumps_helper.py, or override via --dumpdir)
#   - GNU parallel installed
#   - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
#
# To add new months to an existing build without rebuilding from scratch,
# use add_months.sh.

set -e
cd "$(dirname "$0")"

TEMP_COMMENTS="/gscratch/comdata/output/temp/reddit_comments.parquet"
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/reddit_submissions.parquet"

# --- Part 1a: comments ------------------------------------------------------

# wipe any existing comments temp output
rm -rf "$TEMP_COMMENTS"

# generate the per-file parse task list
python3 comments_part1.py gen_task_list

# run all comments parse tasks in parallel
parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_task_list

# --- Part 1b: submissions ---------------------------------------------------

# wipe any existing submissions temp output
rm -rf "$TEMP_SUBMISSIONS"

# generate the per-file parse task list
python3 submissions_part1.py gen_task_list

# run all submissions parse tasks in parallel
parallel --joblog submissions_joblog.txt --results submissions_logs < parse_submissions_task_list

# --- Part 2: spark sort + repartition --------------------------------------

# sort comments and write reddit_comments_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 comments_part2.py

# sort submissions and write reddit_submissions_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 submissions_part2.py
||||||
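Review note: the `gen_task_list` step in this script writes one shell command per dump file, and GNU parallel then reads those commands from stdin. The sketch below illustrates that idea only. `write_task_list` is a hypothetical stand-in, and it uses plain `glob` instead of the repository's `find_dumps` helper, whose matching rules are not shown here.

```python
import glob
import os
import tempfile


def write_task_list(dumpdir, tasklist, script_name, pattern):
    """Write one `python3 <script> parse_dump <file>` line per matching dump.

    Hypothetical stand-in for gen_task_list; uses glob, not find_dumps.
    Returns the number of tasks written.
    """
    paths = sorted(glob.glob(os.path.join(dumpdir, pattern)))
    with open(tasklist, "w") as out:
        for path in paths:
            partition = os.path.basename(path)
            out.write(f"python3 {script_name} parse_dump {partition}\n")
    return len(paths)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Two fake dumps plus one file the pattern should ignore.
        for name in ("RC_2018-08.zst", "RC_2018-09.zst", "notes.txt"):
            open(os.path.join(tmp, name), "w").close()
        tasklist = os.path.join(tmp, "parse_comments_task_list")
        write_task_list(tmp, tasklist, "comments_part1.py", "RC_20*.*")
        with open(tasklist) as fh:
            print(fh.read(), end="")
```

Because each line is a complete command, a failed month can be re-run by copying its line out of the task list, which matches the script's "every command is independently runnable" convention.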
@@ -1,10 +0,0 @@
|
|||||||
## needs to be run by hand since i don't have a nice way of waiting on a parallel-sql job to complete
|
|
||||||
|
|
||||||
#!/usr/bin/env bash
|
|
||||||
echo "#!/usr/bin/bash" > job_script.sh
|
|
||||||
#echo "source $(pwd)/../bin/activate" >> job_script.sh
|
|
||||||
echo "python3 $(pwd)/comments_2_parquet_part1.py" >> job_script.sh
|
|
||||||
|
|
||||||
srun -p comdata -A comdata --nodes=1 --mem=120G --time=48:00:00 --pty job_script.sh
|
|
||||||
|
|
||||||
start_spark_and_run.sh 1 $(pwd)/comments_2_parquet_part2.py
|
|
||||||
@@ -1,115 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
import json
|
|
||||||
from datetime import datetime
|
|
||||||
from multiprocessing import Pool
|
|
||||||
from itertools import islice
|
|
||||||
from helper import find_dumps, open_fileset
|
|
||||||
import pandas as pd
|
|
||||||
import pyarrow as pa
|
|
||||||
import pyarrow.parquet as pq
|
|
||||||
|
|
||||||
def parse_comment(comment, names= None):
|
|
||||||
if names is None:
|
|
||||||
names = ["id","subreddit","link_id","parent_id","created_utc","author","ups","downs","score","edited","subreddit_type","subreddit_id","stickied","is_submitter","body","error"]
|
|
||||||
|
|
||||||
try:
|
|
||||||
comment = json.loads(comment)
|
|
||||||
except json.decoder.JSONDecodeError as e:
|
|
||||||
print(e)
|
|
||||||
print(comment)
|
|
||||||
row = [None for _ in names]
|
|
||||||
row[-1] = "json.decoder.JSONDecodeError|{0}|{1}".format(e,comment)
|
|
||||||
return tuple(row)
|
|
||||||
|
|
||||||
row = []
|
|
||||||
for name in names:
|
|
||||||
if name == 'created_utc':
|
|
||||||
row.append(datetime.fromtimestamp(int(comment['created_utc']),tz=None))
|
|
||||||
elif name == 'edited':
|
|
||||||
val = comment[name]
|
|
||||||
if type(val) == bool:
|
|
||||||
row.append(val)
|
|
||||||
row.append(None)
|
|
||||||
else:
|
|
||||||
row.append(True)
|
|
||||||
row.append(datetime.fromtimestamp(int(val),tz=None))
|
|
||||||
elif name == "time_edited":
|
|
||||||
continue
|
|
||||||
elif name not in comment:
|
|
||||||
row.append(None)
|
|
||||||
|
|
||||||
else:
|
|
||||||
row.append(comment[name])
|
|
||||||
|
|
||||||
return tuple(row)
|
|
||||||
|
|
||||||
|
|
||||||
# conf = sc._conf.setAll([('spark.executor.memory', '20g'), ('spark.app.name', 'extract_reddit_timeline'), ('spark.executor.cores', '26'), ('spark.cores.max', '26'), ('spark.driver.memory','84g'),('spark.driver.maxResultSize','0'),('spark.local.dir','/gscratch/comdata/spark_tmp')])
|
|
||||||
|
|
||||||
dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/comments/"
|
|
||||||
|
|
||||||
files = list(find_dumps(dumpdir, base_pattern="RC_20*"))
|
|
||||||
|
|
||||||
pool = Pool(28)
|
|
||||||
|
|
||||||
stream = open_fileset(files)
|
|
||||||
|
|
||||||
N = int(1e4)
|
|
||||||
|
|
||||||
rows = pool.imap_unordered(parse_comment, stream, chunksize=int(N/28))
|
|
||||||
|
|
||||||
schema = pa.schema([
|
|
||||||
pa.field('id', pa.string(), nullable=True),
|
|
||||||
pa.field('subreddit', pa.string(), nullable=True),
|
|
||||||
pa.field('link_id', pa.string(), nullable=True),
|
|
||||||
pa.field('parent_id', pa.string(), nullable=True),
|
|
||||||
pa.field('created_utc', pa.timestamp('ms'), nullable=True),
|
|
||||||
pa.field('author', pa.string(), nullable=True),
|
|
||||||
pa.field('ups', pa.int64(), nullable=True),
|
|
||||||
pa.field('downs', pa.int64(), nullable=True),
|
|
||||||
pa.field('score', pa.int64(), nullable=True),
|
|
||||||
pa.field('edited', pa.bool_(), nullable=True),
|
|
||||||
pa.field('time_edited', pa.timestamp('ms'), nullable=True),
|
|
||||||
pa.field('subreddit_type', pa.string(), nullable=True),
|
|
||||||
pa.field('subreddit_id', pa.string(), nullable=True),
|
|
||||||
pa.field('stickied', pa.bool_(), nullable=True),
|
|
||||||
pa.field('is_submitter', pa.bool_(), nullable=True),
|
|
||||||
pa.field('body', pa.string(), nullable=True),
|
|
||||||
pa.field('error', pa.string(), nullable=True),
|
|
||||||
])
|
|
||||||
|
|
||||||
from pathlib import Path
|
|
||||||
p = Path("/gscratch/comdata/output/reddit_comments.parquet_temp2")
|
|
||||||
|
|
||||||
if not p.is_dir():
|
|
||||||
if p.exists():
|
|
||||||
p.unlink()
|
|
||||||
p.mkdir()
|
|
||||||
|
|
||||||
else:
|
|
||||||
list(map(Path.unlink,p.glob('*')))
|
|
||||||
|
|
||||||
part_size = int(1e7)
|
|
||||||
part = 1
|
|
||||||
n_output = 0
|
|
||||||
writer = pq.ParquetWriter(f"/gscratch/comdata/output/reddit_comments.parquet_temp2/part_{part}.parquet",schema=schema,compression='snappy',flavor='spark')
|
|
||||||
|
|
||||||
while True:
|
|
||||||
if n_output > part_size:
|
|
||||||
if part > 1:
|
|
||||||
writer.close()
|
|
||||||
|
|
||||||
part = part + 1
|
|
||||||
n_output = 0
|
|
||||||
|
|
||||||
writer = pq.ParquetWriter(f"/gscratch/comdata/output/reddit_comments.parquet_temp2/part_{part}.parquet",schema=schema,compression='snappy',flavor='spark')
|
|
||||||
|
|
||||||
n_output += N
|
|
||||||
chunk = islice(rows,N)
|
|
||||||
pddf = pd.DataFrame(chunk, columns=schema.names)
|
|
||||||
table = pa.Table.from_pandas(pddf,schema=schema)
|
|
||||||
if table.shape[0] == 0:
|
|
||||||
break
|
|
||||||
writer.write_table(table)
|
|
||||||
|
|
||||||
|
|
||||||
@@ -1,29 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
# spark script to make sorted, and partitioned parquet files
|
|
||||||
|
|
||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_comments.parquet_temp2",compression='snappy')
|
|
||||||
|
|
||||||
df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
|
|
||||||
df = df.drop('subreddit')
|
|
||||||
df = df.withColumnRenamed('subreddit_2','subreddit')
|
|
||||||
|
|
||||||
df = df.withColumnRenamed("created_utc","CreatedAt")
|
|
||||||
df = df.withColumn("Month",f.month(f.col("CreatedAt")))
|
|
||||||
df = df.withColumn("Year",f.year(f.col("CreatedAt")))
|
|
||||||
df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
|
|
||||||
|
|
||||||
df = df.repartition('subreddit')
|
|
||||||
df2 = df.sort(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
|
|
||||||
df2 = df2.sortWithinPartitions(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
|
|
||||||
df2.write.parquet("/gscratch/comdata/users/nathante/reddit_comments_by_subreddit.parquet_new", mode='overwrite', compression='snappy')
|
|
||||||
|
|
||||||
df = df.repartition('author')
|
|
||||||
df3 = df.sort(["author","CreatedAt","subreddit","link_id","parent_id","Year","Month","Day"],ascending=True)
|
|
||||||
df3 = df3.sortWithinPartitions(["author","CreatedAt","subreddit","link_id","parent_id","Year","Month","Day"],ascending=True)
|
|
||||||
df3.write.parquet("/gscratch/comdata/users/nathante/reddit_comments_by_author.parquet_new", mode='overwrite',compression='snappy')
|
|
||||||
14
datasets/comments_merge.py
Normal file
@@ -0,0 +1,14 @@
#!/usr/bin/env python3
"""Collapse all layers in the comments final datasets into a single clean layer.

Must be launched from a login node via the Hyak-provided wrapper:
    start_spark_and_run.sh 1 comments_merge.py

See merge_layers.sh and dumps_helper.merge_layers for details.
"""

from dumps_helper import COMMENTS, merge_layers


if __name__ == "__main__":
    merge_layers(COMMENTS)
24
datasets/comments_part1.py
Executable file
@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""Part 1 for comments: parse one RC_*.zst dump into a parquet file.

CLI:
    comments_part1.py parse_dump RC_2018-08.zst
    comments_part1.py gen_task_list
    comments_part1.py parse_dump RC_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
"""

import fire
from dumps_helper import COMMENTS, parse_dump, gen_task_list


def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(COMMENTS, partition, dumpdir=dumpdir, outdir=outdir)


def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(COMMENTS, 'comments_part1.py', dumpdir=dumpdir, tasklist=tasklist)


if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})
||||||
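Review note: Part 1 delegates to the shared `parse_dump` helper, which reads rows in fixed-size chunks so that memory stays bounded regardless of dump size. A minimal sketch of that chunking pattern follows. `iter_chunks` is a hypothetical name used for illustration; the helper in this change inlines the same loop around a `ParquetWriter`.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of up to `chunk_size` items until `rows` is exhausted.

    Only one chunk is materialized at a time, so peak memory depends on
    `chunk_size`, not on the length of the input stream.
    """
    rows = iter(rows)  # accept any iterable, including generators
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    for chunk in iter_chunks(range(7), 3):
        print(chunk)
```

Materializing each chunk with `list(...)` before testing it matters: an `islice` object is always truthy, so checking the iterator itself would never terminate the loop.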
21
datasets/comments_part2.py
Executable file
@@ -0,0 +1,21 @@
#!/usr/bin/env python3
"""Part 2 for comments: Spark sort + repartition into the final datasets.

Must be launched from a login node via the Hyak-provided wrapper:
    start_spark_and_run.sh 1 comments_part2.py
    start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets

--indir defaults to the temp comments dir in dumps_helper.py.
--out_by_subreddit and --out_by_author default to the live dataset paths;
override them to write to staging directories first (see add_months.sh).
"""

import fire
from dumps_helper import COMMENTS, sort_and_write


if __name__ == "__main__":
    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
              sort_and_write(COMMENTS, indir=indir,
                             out_by_subreddit=out_by_subreddit,
                             out_by_author=out_by_author))
343
datasets/dumps_helper.py
Normal file
@@ -0,0 +1,343 @@
"""Shared logic for the comments and submissions dump-to-parquet pipeline.

Used by comments_part1.py / submissions_part1.py (Part 1: one compressed
dump file → one parquet file) and comments_part2.py / submissions_part2.py
(Part 2: Spark sort + repartition of the per-source parquets).

The two dump types only differ in their schemas and a handful of
field-specific extractors. The parse loop, the file I/O wrapping, the
task-list generator, and the Spark sort are all shared here.
"""

import os
import shutil
from datetime import datetime
from itertools import islice

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import simdjson

from helper import find_dumps, open_fileset


_json = simdjson.Parser()


# --- field-level extractors ------------------------------------------------

def _ts(name):
    """Extractor for a unix-timestamp field (or None if missing)."""
    def handler(record):
        val = record.get(name)
        if val is None:
            return None
        return datetime.fromtimestamp(int(val), tz=None)
    return handler


def _edited(record):
    """Returns (edited, time_edited). The dump packs both into one `edited`
    field that is either a bool (never edited / unknown timestamp) or a
    unix timestamp."""
    val = record.get('edited')
    if isinstance(val, bool):
        return (val, None)
    if val is None:
        return (None, None)
    return (True, datetime.fromtimestamp(int(val), tz=None))


def _has_media(record):
    """Submissions don't have a `has_media` field directly — derive it."""
    return record.get('media') is not None


# --- generic parse loop ----------------------------------------------------

def parse_record(line, fields, handlers):
    """Parse one JSON line into a tuple aligned with `fields`.

    `handlers` maps field name → callable(record) returning either a single
    value (one column) or a tuple of values (multiple consecutive columns,
    consuming the next len(tuple)-1 entries in `fields`).
    Fields without a handler are pulled from the record by name, with
    missing keys yielding None.
    The last field in `fields` is reserved for an error message string
    and is set to None on success.
    """
    try:
        record = _json.parse(line)
    except (ValueError, KeyError) as e:
        row = [None] * len(fields)
        row[-1] = f"parse error|{e}|{line}"
        return tuple(row)

    row = []
    skip_next = 0
    for name in fields:
        if skip_next > 0:
            skip_next -= 1
            continue
        handler = handlers.get(name)
        if handler is None:
            try:
                row.append(record[name])
            except KeyError:
                row.append(None)
        else:
            result = handler(record)
            if isinstance(result, tuple):
                row.extend(result)
                skip_next = len(result) - 1
            else:
                row.append(result)
    return tuple(row)


# --- comments schema -------------------------------------------------------

COMMENT_FIELDS = [
    'id', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'author',
    'ups', 'downs', 'score', 'edited', 'time_edited', 'subreddit_type',
    'subreddit_id', 'stickied', 'is_submitter', 'body', 'error',
]

COMMENT_SCHEMA = pa.schema([
    pa.field('id', pa.string(), nullable=True),
    pa.field('subreddit', pa.string(), nullable=True),
    pa.field('link_id', pa.string(), nullable=True),
    pa.field('parent_id', pa.string(), nullable=True),
    pa.field('created_utc', pa.timestamp('ms'), nullable=True),
    pa.field('author', pa.string(), nullable=True),
    pa.field('ups', pa.int64(), nullable=True),
    pa.field('downs', pa.int64(), nullable=True),
    pa.field('score', pa.int64(), nullable=True),
    pa.field('edited', pa.bool_(), nullable=True),
    pa.field('time_edited', pa.timestamp('ms'), nullable=True),
    pa.field('subreddit_type', pa.string(), nullable=True),
    pa.field('subreddit_id', pa.string(), nullable=True),
    pa.field('stickied', pa.bool_(), nullable=True),
    pa.field('is_submitter', pa.bool_(), nullable=True),
    pa.field('body', pa.string(), nullable=True),
    pa.field('error', pa.string(), nullable=True),
])

COMMENT_HANDLERS = {
    'created_utc': _ts('created_utc'),
    'edited': _edited,
}


# --- submissions schema ----------------------------------------------------

SUBMISSION_FIELDS = [
    'id', 'author', 'subreddit', 'title', 'created_utc', 'permalink', 'url',
    'domain', 'score', 'ups', 'downs', 'over_18', 'has_media', 'selftext',
    'retrieved_on', 'num_comments', 'gilded', 'edited', 'time_edited',
    'subreddit_type', 'subreddit_id', 'subreddit_subscribers', 'name',
    'is_self', 'stickied', 'quarantine', 'error',
]

SUBMISSION_SCHEMA = pa.schema([
    pa.field('id', pa.string(), nullable=True),
    pa.field('author', pa.string(), nullable=True),
    pa.field('subreddit', pa.string(), nullable=True),
    pa.field('title', pa.string(), nullable=True),
    pa.field('created_utc', pa.timestamp('ms'), nullable=True),
    pa.field('permalink', pa.string(), nullable=True),
    pa.field('url', pa.string(), nullable=True),
    pa.field('domain', pa.string(), nullable=True),
    pa.field('score', pa.int64(), nullable=True),
    pa.field('ups', pa.int64(), nullable=True),
    pa.field('downs', pa.int64(), nullable=True),
    pa.field('over_18', pa.bool_(), nullable=True),
    pa.field('has_media', pa.bool_(), nullable=True),
    pa.field('selftext', pa.string(), nullable=True),
    pa.field('retrieved_on', pa.timestamp('ms'), nullable=True),
    pa.field('num_comments', pa.int64(), nullable=True),
    pa.field('gilded', pa.int64(), nullable=True),
    pa.field('edited', pa.bool_(), nullable=True),
    pa.field('time_edited', pa.timestamp('ms'), nullable=True),
    pa.field('subreddit_type', pa.string(), nullable=True),
    pa.field('subreddit_id', pa.string(), nullable=True),
    pa.field('subreddit_subscribers', pa.int64(), nullable=True),
    pa.field('name', pa.string(), nullable=True),
    pa.field('is_self', pa.bool_(), nullable=True),
    pa.field('stickied', pa.bool_(), nullable=True),
    pa.field('quarantine', pa.bool_(), nullable=True),
    pa.field('error', pa.string(), nullable=True),
])

SUBMISSION_HANDLERS = {
    'created_utc': _ts('created_utc'),
    'retrieved_on': _ts('retrieved_on'),
    'edited': _edited,
    'has_media': _has_media,
}


# --- per-type configuration ------------------------------------------------

# Defaults that the entry-point scripts pass through, exposed here so the
# field/schema/handler triplet, the canonical paths, and the dump filename
# pattern all live in one place.
COMMENTS = {
    'fields': COMMENT_FIELDS,
    'schema': COMMENT_SCHEMA,
    'handlers': COMMENT_HANDLERS,
    'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/comments",
    'outdir': "/gscratch/comdata/output/temp/reddit_comments.parquet",
    'file_pattern': 'RC_20*.*',
    'task_list': 'parse_comments_task_list',
    'output_by_subreddit': "/gscratch/comdata/output/reddit_comments_by_subreddit.parquet",
    'output_by_author': "/gscratch/comdata/output/reddit_comments_by_author.parquet",
    'subreddit_sort_keys': ["subreddit", "CreatedAt", "link_id", "parent_id", "Year", "Month", "Day"],
    'author_sort_keys': ["author", "CreatedAt", "subreddit", "link_id", "parent_id", "Year", "Month", "Day"],
    'app_name': "Reddit comments to parquet",
}

SUBMISSIONS = {
    'fields': SUBMISSION_FIELDS,
    'schema': SUBMISSION_SCHEMA,
    'handlers': SUBMISSION_HANDLERS,
    'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/submissions",
    'outdir': "/gscratch/comdata/output/temp/reddit_submissions.parquet",
    'file_pattern': 'RS_20*.*',
    'task_list': 'parse_submissions_task_list',
    'output_by_subreddit': "/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet",
    'output_by_author': "/gscratch/comdata/output/reddit_submissions_by_author.parquet",
    'subreddit_sort_keys': ["subreddit", "CreatedAt", "id"],
    'author_sort_keys': ["author", "CreatedAt", "id"],
    'app_name': "Reddit submissions to parquet",
}


# --- Part 1: parse one dump file -> one parquet ----------------------------

def parse_dump(config, partition, dumpdir=None, outdir=None, chunk_size=10000):
    """Read one compressed dump from `dumpdir/partition` and write a parquet
    file to `outdir/<basename>.parquet`. Streams chunks of `chunk_size`
    rows so memory stays bounded."""
    dumpdir = dumpdir or config['dumpdir']
    outdir = outdir or config['outdir']
    schema = config['schema']
    fields = config['fields']
    handlers = config['handlers']

    stream = open_fileset([os.path.join(dumpdir, partition)])
    rows = (parse_record(line, fields, handlers) for line in stream)

    os.makedirs(outdir, exist_ok=True)
    outfile = os.path.join(outdir, os.path.splitext(partition)[0] + ".parquet")

    with pq.ParquetWriter(outfile, schema=schema, compression='snappy', flavor='spark') as writer:
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            pddf = pd.DataFrame(chunk, columns=schema.names)
            table = pa.Table.from_pandas(pddf, schema=schema)
            writer.write_table(table)


def gen_task_list(config, script_name, dumpdir=None, tasklist=None):
    """Write a parallel-friendly task list of `script_name parse_dump <file>`
    lines, one per dump file found under `dumpdir`."""
    dumpdir = dumpdir or config['dumpdir']
    tasklist = tasklist or config['task_list']
    files = list(find_dumps(dumpdir, base_pattern=config['file_pattern']))
    with open(tasklist, 'w') as of:
        for fpath in files:
            partition = os.path.split(fpath)[1]
            of.write(f'python3 {script_name} parse_dump {partition}\n')


# --- Part 2: spark sort + repartition --------------------------------------

def sort_and_write(config, indir=None, out_by_subreddit=None, out_by_author=None):
    """Read a directory of per-source parquets, sort and repartition twice
    (once by subreddit, once by author), and write the two output datasets.

    indir defaults to config['outdir'].
    out_by_subreddit and out_by_author default to config['output_by_subreddit']
    and config['output_by_author']. Override them to write to staging directories
    instead of the live datasets (see add_months.sh).

    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession, functions as f

    indir = indir or config['outdir']
    out_by_subreddit = out_by_subreddit or config['output_by_subreddit']
    out_by_author = out_by_author or config['output_by_author']

    spark = SparkSession.builder.appName(config['app_name']).getOrCreate()

    df = spark.read.parquet(indir, compression='snappy')

    df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
    df = df.drop('subreddit')
    df = df.withColumnRenamed('subreddit_2', 'subreddit')

    df = df.withColumnRenamed("created_utc", "CreatedAt")
    df = df.withColumn("Month", f.month(f.col("CreatedAt")))
    df = df.withColumn("Year", f.year(f.col("CreatedAt")))
    df = df.withColumn("Day", f.dayofmonth(f.col("CreatedAt")))

    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(out_by_subreddit, mode='overwrite', compression='snappy')

    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(out_by_author, mode='overwrite', compression='snappy')


def merge_layers(config):
    """Collapse all accumulated layers in the final datasets into a single
    clean layer. Reads the existing by_subreddit dataset (which contains all
    layers), re-sorts twice, writes to temp paths, then atomically replaces
    the originals by renaming.

    Safe to interrupt after the writes complete but before the renames — the
    originals are untouched until the .merging directories exist. The .old
    directories are left behind if the process is interrupted after renaming;
    delete them manually once satisfied.

    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(config['app_name'] + ' merge layers').getOrCreate()

    # Both final datasets have identical rows; read from by_subreddit.
    df = spark.read.parquet(config['output_by_subreddit'])

    tmp_sub = config['output_by_subreddit'] + '.merging'
    tmp_auth = config['output_by_author'] + '.merging'

    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(tmp_sub, mode='overwrite', compression='snappy')

    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(tmp_auth, mode='overwrite', compression='snappy')

    # Atomic swap: rename old → .old, then .merging → final, then delete .old.
    old_sub = config['output_by_subreddit'] + '.old'
    old_auth = config['output_by_author'] + '.old'
    os.rename(config['output_by_subreddit'], old_sub)
    os.rename(tmp_sub, config['output_by_subreddit'])
    os.rename(config['output_by_author'], old_auth)
    os.rename(tmp_auth, config['output_by_author'])
    shutil.rmtree(old_sub)
    shutil.rmtree(old_auth)
||||||
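Review note: the least obvious part of `dumps_helper.py` is how `parse_record` lets one handler fill several consecutive columns (`edited` also fills `time_edited`). The sketch below reproduces that mechanism under two stated assumptions: it uses the standard-library `json` module instead of `simdjson`, and it converts timestamps with `tz=timezone.utc` so the result does not depend on the machine's timezone, whereas the code in this change passes `tz=None` (local time). The five-field `FIELDS` list is a shortened, illustrative version of `COMMENT_FIELDS`.

```python
import json
from datetime import datetime, timezone


def edited(record):
    """Split the packed `edited` field into (edited, time_edited)."""
    val = record.get("edited")
    if isinstance(val, bool):
        return (val, None)
    if val is None:
        return (None, None)
    # Assumption: UTC here; the reviewed code uses tz=None (local time).
    return (True, datetime.fromtimestamp(int(val), tz=timezone.utc))


def parse_record(line, fields, handlers):
    """Parse one JSON line into a tuple aligned with `fields`."""
    try:
        record = json.loads(line)
    except ValueError as e:  # json.JSONDecodeError subclasses ValueError
        row = [None] * len(fields)
        row[-1] = f"parse error|{e}|{line}"
        return tuple(row)

    row = []
    skip_next = 0
    for name in fields:
        if skip_next > 0:
            # This column was already filled by the previous tuple handler.
            skip_next -= 1
            continue
        handler = handlers.get(name)
        if handler is None:
            row.append(record.get(name))
        else:
            result = handler(record)
            if isinstance(result, tuple):
                row.extend(result)
                skip_next = len(result) - 1
            else:
                row.append(result)
    return tuple(row)


# Illustrative subset of the comment fields; `error` must stay last.
FIELDS = ["id", "edited", "time_edited", "body", "error"]
HANDLERS = {"edited": edited}

if __name__ == "__main__":
    print(parse_record('{"id": "a", "edited": false, "body": "hi"}', FIELDS, HANDLERS))
    print(parse_record('{not json', FIELDS, HANDLERS))
```

The design relies on field order: a tuple-returning handler only works if the columns it fills sit directly after its own name in the field list, which is why `time_edited` immediately follows `edited` in both schemas.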
@@ -3,6 +3,9 @@ import re
|
|||||||
from collections import defaultdict
|
from collections import defaultdict
|
||||||
from os import path
|
from os import path
|
||||||
import glob
|
import glob
|
||||||
|
import io
|
||||||
|
import zstandard
|
||||||
|
|
||||||
|
|
||||||
def find_dumps(dumpdir, base_pattern):
|
def find_dumps(dumpdir, base_pattern):
|
||||||
|
|
||||||
@@ -28,24 +31,28 @@ def open_fileset(files):
|
|||||||
yield line
|
yield line
|
||||||
|
|
||||||
def open_input_file(input_filename):
|
def open_input_file(input_filename):
|
||||||
|
# .zst handled via the zstandard library to avoid subprocess/container issues
|
||||||
|
if re.match(r'.*\.zst$', input_filename):
|
||||||
|
fh = open(input_filename, 'rb')
|
||||||
|
dctx = zstandard.ZstdDecompressor()
|
||||||
|
return io.TextIOWrapper(dctx.stream_reader(fh), encoding='utf-8')
|
||||||
|
|
||||||
if re.match(r'.*\.7z$', input_filename):
|
if re.match(r'.*\.7z$', input_filename):
|
||||||
cmd = ["7za", "x", "-so", input_filename, '*']
|
cmd = ["7za", "x", "-so", input_filename, '*']
|
||||||
elif re.match(r'.*\.gz$', input_filename):
|
|
||||||
cmd = ["zcat", input_filename]
|
|
||||||
elif re.match(r'.*\.bz2$', input_filename):
|
elif re.match(r'.*\.bz2$', input_filename):
|
||||||
cmd = ["bzcat", "-dk", input_filename]
|
cmd = ["bzcat", "-dk", input_filename]
|
||||||
elif re.match(r'.*\.bz', input_filename):
|
elif re.match(r'.*\.bz', input_filename):
|
||||||
cmd = ["bzcat", "-dk", input_filename]
|
cmd = ["bzcat", "-dk", input_filename]
|
||||||
elif re.match(r'.*\.xz', input_filename):
|
elif re.match(r'.*\.xz', input_filename):
|
||||||
cmd = ["xzcat",'-dk', '-T 20',input_filename]
|
cmd = ["xzcat", '-dk', '-T 20', input_filename]
|
||||||
elif re.match(r'.*\.zst',input_filename):
|
elif re.match(r'.*\.gz', input_filename):
|
||||||
cmd = ['zstd','-dck', input_filename]
|
cmd = ["zcat", input_filename]
|
||||||
elif re.match(r'.*\.gz',input_filename):
|
else:
|
||||||
cmd = ['gzip','-dc', input_filename]
|
return open(input_filename, 'r')
|
||||||
|
|
||||||
try:
|
try:
|
||||||
input_file = Popen(cmd, stdout=PIPE).stdout
|
return Popen(cmd, stdout=PIPE).stdout
|
||||||
except NameError as e:
|
except NameError as e:
|
||||||
print(e)
|
print(e)
|
||||||
input_file = open(input_filename, 'r')
|
return open(input_filename, 'r')
|
||||||
return input_file
|
|
||||||
|
|
||||||
|
|||||||
@@ -1,4 +0,0 @@
|
|||||||
#!/usr/bin/bash
|
|
||||||
start_spark_cluster.sh
|
|
||||||
spark-submit --master spark://$(hostname):18899 weekly_cosine_similarities.py term --outfile=/gscratch/comdata/users/nathante/subreddit_term_similarity_weekly_5000.parquet --topN=5000
|
|
||||||
stop-all.sh
|
|
||||||
32
datasets/merge_layers.sh
Executable file
@@ -0,0 +1,32 @@
#!/usr/bin/env bash
#
# Collapse all accumulated layers in the final parquet datasets into a
# single clean layer. Use this after several incremental adds via
# add_months.sh when you want to reduce the number of partition files.
#
# Reads the existing by_subreddit / by_author datasets, re-sorts everything,
# writes to temp paths, then atomically replaces the originals via rename.
# The old directories are removed once the new ones are in place.
#
# If the process is interrupted after writing the .merging directories but
# before the renames complete, re-run — the .merging directories will be
# overwritten and the originals are still intact. If interrupted after the
# renames, the .old directories are left behind; delete them manually once
# satisfied with the output.
#
# To add new months without merging, use add_months.sh.
# To rebuild everything from raw dumps, use build_from_scratch.sh.
#
# NOTE: This script and its workflow are written but not yet tested.
# Remove this notice after a successful end-to-end run.
#
# Every command below is independently runnable for debugging.

set -e
cd "$(dirname "$0")"

# merge and collapse comments layers
start_spark_and_run.sh 1 comments_merge.py

# merge and collapse submissions layers
start_spark_and_run.sh 1 submissions_merge.py
||||||
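Reviewer note: the header comment of `merge_layers.sh` describes a write-to-temp, rename, then delete sequence. The actual logic lives in `dumps_helper.merge_layers`, which is not shown in this diff, so the following is only a sketch of that sequence under stated assumptions. `replace_dataset` and the `write_new` callback are hypothetical names. The `.merging` and `.old` suffixes come from the comment above.

```python
import os
import shutil


def replace_dataset(live, write_new):
    """Swap a freshly written dataset directory in place of `live`.

    `write_new` is a hypothetical callback that writes the merged
    dataset into the directory path it is given.
    """
    merging = live + ".merging"
    old = live + ".old"

    # A leftover .merging directory from an interrupted run is discarded;
    # the live dataset has not been touched at this point.
    if os.path.exists(merging):
        shutil.rmtree(merging)
    write_new(merging)

    # Two renames: move the live data aside, then move the new data in.
    # Each rename is atomic on a POSIX filesystem, but the pair is not.
    os.rename(live, old)
    os.rename(merging, live)

    # Only now is the previous copy removed.
    shutil.rmtree(old)
```

If a non-empty `.old` directory is left behind by an earlier interrupted run, the first `os.rename` in this sketch raises `OSError`, which matches the comment's advice to delete `.old` directories manually before re-running.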
@@ -1,9 +0,0 @@
|
|||||||
## this should be run manually since we don't have a nice way to wait on parallel_sql jobs
|
|
||||||
|
|
||||||
#!/usr/bin/env bash
|
|
||||||
|
|
||||||
./parse_submissions.sh
|
|
||||||
|
|
||||||
start_spark_and_run.sh 1 $(pwd)/submissions_2_parquet_part2.py
|
|
||||||
|
|
||||||
|
|
||||||
@@ -1,118 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
# two stages:
|
|
||||||
# 1. from gz to arrow parquet (this script)
|
|
||||||
# 2. from arrow parquet to spark parquet (submissions_2_parquet_part2.py)
|
|
||||||
|
|
||||||
from datetime import datetime
|
|
||||||
from multiprocessing import Pool
|
|
||||||
from itertools import islice
|
|
||||||
from helper import find_dumps, open_fileset
|
|
||||||
import pandas as pd
|
|
||||||
import pyarrow as pa
|
|
||||||
import pyarrow.parquet as pq
|
|
||||||
import simdjson
|
|
||||||
import fire
|
|
||||||
import os
|
|
||||||
|
|
||||||
parser = simdjson.Parser()
|
|
||||||
|
|
||||||
def parse_submission(post, names = None):
|
|
||||||
if names is None:
|
|
||||||
names = ['id','author','subreddit','title','created_utc','permalink','url','domain','score','ups','downs','over_18','has_media','selftext','retrieved_on','num_comments','gilded','edited','time_edited','subreddit_type','subreddit_id','subreddit_subscribers','name','is_self','stickied','quarantine','error']
|
|
||||||
|
|
||||||
try:
|
|
||||||
post = parser.parse(post)
|
|
||||||
except (ValueError) as e:
|
|
||||||
# print(e)
|
|
||||||
# print(post)
|
|
||||||
row = [None for _ in names]
|
|
||||||
row[-1] = "Error parsing json|{0}|{1}".format(e,post)
|
|
||||||
return tuple(row)
|
|
||||||
|
|
||||||
row = []
|
|
||||||
|
|
||||||
for name in names:
|
|
||||||
if name == 'created_utc' or name == 'retrieved_on':
|
|
||||||
val = post.get(name,None)
|
|
||||||
if val is not None:
|
|
||||||
row.append(datetime.fromtimestamp(int(post[name]),tz=None))
|
|
||||||
else:
|
|
||||||
row.append(None)
|
|
||||||
elif name == 'edited':
|
|
||||||
val = post[name]
|
|
||||||
if type(val) == bool:
|
|
||||||
row.append(val)
|
|
||||||
row.append(None)
|
|
||||||
else:
|
|
||||||
row.append(True)
|
|
||||||
row.append(datetime.fromtimestamp(int(val),tz=None))
|
|
||||||
elif name == "time_edited":
|
|
||||||
continue
|
|
||||||
elif name == 'has_media':
|
|
||||||
row.append(post.get('media',None) is not None)
|
|
||||||
|
|
||||||
elif name not in post:
|
|
||||||
row.append(None)
|
|
||||||
else:
|
|
||||||
row.append(post[name])
|
|
||||||
return tuple(row)
|
|
||||||
|
|
||||||
def parse_dump(partition):
|
|
||||||
|
|
||||||
N=10000
|
|
||||||
stream = open_fileset([f"/gscratch/comdata/raw_data/reddit_dumps/submissions/{partition}"])
|
|
||||||
rows = map(parse_submission,stream)
|
|
||||||
schema = pa.schema([
|
|
||||||
pa.field('id', pa.string(),nullable=True),
|
|
||||||
pa.field('author', pa.string(),nullable=True),
|
|
||||||
pa.field('subreddit', pa.string(),nullable=True),
|
|
||||||
pa.field('title', pa.string(),nullable=True),
|
|
||||||
pa.field('created_utc', pa.timestamp('ms'),nullable=True),
|
|
||||||
pa.field('permalink', pa.string(),nullable=True),
|
|
||||||
pa.field('url', pa.string(),nullable=True),
|
|
||||||
pa.field('domain', pa.string(),nullable=True),
|
|
||||||
pa.field('score', pa.int64(),nullable=True),
|
|
||||||
pa.field('ups', pa.int64(),nullable=True),
|
|
||||||
pa.field('downs', pa.int64(),nullable=True),
|
|
||||||
pa.field('over_18', pa.bool_(),nullable=True),
|
|
||||||
pa.field('has_media',pa.bool_(),nullable=True),
|
|
||||||
pa.field('selftext',pa.string(),nullable=True),
|
|
||||||
pa.field('retrieved_on', pa.timestamp('ms'),nullable=True),
|
|
||||||
pa.field('num_comments', pa.int64(),nullable=True),
|
|
||||||
pa.field('gilded',pa.int64(),nullable=True),
|
|
||||||
pa.field('edited',pa.bool_(),nullable=True),
|
|
||||||
pa.field('time_edited',pa.timestamp('ms'),nullable=True),
|
|
||||||
pa.field('subreddit_type',pa.string(),nullable=True),
|
|
||||||
pa.field('subreddit_id',pa.string(),nullable=True),
|
|
||||||
pa.field('subreddit_subscribers',pa.int64(),nullable=True),
|
|
||||||
pa.field('name',pa.string(),nullable=True),
|
|
||||||
pa.field('is_self',pa.bool_(),nullable=True),
|
|
||||||
pa.field('stickied',pa.bool_(),nullable=True),
|
|
||||||
pa.field('quarantine',pa.bool_(),nullable=True),
|
|
||||||
pa.field('error',pa.string(),nullable=True)])
|
|
||||||
|
|
||||||
if not os.path.exists("/gscratch/comdata/output/temp/reddit_submissions.parquet/"):
|
|
||||||
os.mkdir("/gscratch/comdata/output/temp/reddit_submissions.parquet/")
|
|
||||||
|
|
||||||
with pq.ParquetWriter(f"/gscratch/comdata/output/temp/reddit_submissions.parquet/{partition}",schema=schema,compression='snappy',flavor='spark') as writer:
|
|
||||||
while True:
|
|
||||||
chunk = islice(rows,N)
|
|
||||||
pddf = pd.DataFrame(chunk, columns=schema.names)
|
|
||||||
table = pa.Table.from_pandas(pddf,schema=schema)
|
|
||||||
if table.shape[0] == 0:
|
|
||||||
break
|
|
||||||
writer.write_table(table)
|
|
||||||
|
|
||||||
writer.close()
|
|
||||||
|
|
||||||
def gen_task_list(dumpdir="/gscratch/comdata/raw_data/reddit_dumps/submissions"):
|
|
||||||
files = list(find_dumps(dumpdir,base_pattern="RS_20*.*"))
|
|
||||||
with open("parse_submissions_task_list",'w') as of:
|
|
||||||
for fpath in files:
|
|
||||||
partition = os.path.split(fpath)[1]
|
|
||||||
of.write(f'python3 submissions_2_parquet_part1.py parse_dump {partition}\n')
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
fire.Fire({'parse_dump':parse_dump,
|
|
||||||
'gen_task_list':gen_task_list})
|
|
||||||
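Reviewer note: the removed `parse_dump` writes the parquet file in batches of `N` rows by repeatedly calling `islice` on one shared iterator and stopping at the first empty batch. A minimal sketch of that batching pattern in isolation, with the hypothetical name `batches`, is:

```python
from itertools import islice


def batches(rows, n):
    """Yield lists of up to `n` items from `rows` until it is exhausted."""
    # islice must consume from a single shared iterator; calling it on a
    # list each time would return the first n items over and over.
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, n))
        if not chunk:
            return
        yield chunk
```

Each yielded list could then be turned into a table and passed to the writer, so memory use stays bounded by the batch size rather than by the size of the dump.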
@@ -1,42 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
|
|
||||||
# spark script to make sorted, and partitioned parquet files
|
|
||||||
|
|
||||||
import pyspark
|
|
||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
import os
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
|
|
||||||
sc = spark.sparkContext
|
|
||||||
|
|
||||||
conf = pyspark.SparkConf().setAppName("Reddit submissions to parquet")
|
|
||||||
conf = conf.set("spark.sql.shuffle.partitions",2000)
|
|
||||||
conf = conf.set('spark.sql.crossJoin.enabled',"true")
|
|
||||||
conf = conf.set('spark.debug.maxToStringFields',200)
|
|
||||||
sqlContext = pyspark.SQLContext(sc)
|
|
||||||
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/temp/reddit_submissions.parquet/")
|
|
||||||
|
|
||||||
df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
|
|
||||||
df = df.drop('subreddit')
|
|
||||||
df = df.withColumnRenamed('subreddit_2','subreddit')
|
|
||||||
df = df.withColumnRenamed("created_utc","CreatedAt")
|
|
||||||
df = df.withColumn("Month",f.month(f.col("CreatedAt")))
|
|
||||||
df = df.withColumn("Year",f.year(f.col("CreatedAt")))
|
|
||||||
df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
|
|
||||||
df = df.withColumn("subreddit_hash",f.sha2(f.col("subreddit"), 256)[0:3])
|
|
||||||
|
|
||||||
# next we gotta resort it all.
|
|
||||||
df = df.repartition("subreddit")
|
|
||||||
df2 = df.sort(["subreddit","CreatedAt","id"],ascending=True)
|
|
||||||
df2 = df.sortWithinPartitions(["subreddit","CreatedAt","id"],ascending=True)
|
|
||||||
df2.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_subreddit.parquet2", mode='overwrite',compression='snappy')
|
|
||||||
|
|
||||||
|
|
||||||
# # we also want to have parquet files sorted by author then reddit.
|
|
||||||
df = df.repartition("author")
|
|
||||||
df3 = df.sort(["author","CreatedAt","id"],ascending=True)
|
|
||||||
df3 = df.sortWithinPartitions(["author","CreatedAt","id"],ascending=True)
|
|
||||||
df3.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_author.parquet2", mode='overwrite',compression='snappy')
|
|
||||||
14
datasets/submissions_merge.py
Normal file
@@ -0,0 +1,14 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Collapse all layers in the submissions final datasets into a single clean layer.
|
||||||
|
|
||||||
|
Must be launched from a login node via the Hyak-provided wrapper:
|
||||||
|
start_spark_and_run.sh 1 submissions_merge.py
|
||||||
|
|
||||||
|
See merge_layers.sh and dumps_helper.merge_layers for details.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from dumps_helper import SUBMISSIONS, merge_layers
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
merge_layers(SUBMISSIONS)
|
||||||
24
datasets/submissions_part1.py
Executable file
@@ -0,0 +1,24 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Part 1 for submissions: parse one RS_*.zst dump into a parquet file.
|
||||||
|
|
||||||
|
CLI:
|
||||||
|
submissions_part1.py parse_dump RS_2018-08.zst
|
||||||
|
submissions_part1.py gen_task_list
|
||||||
|
submissions_part1.py parse_dump RS_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
|
||||||
|
"""
|
||||||
|
|
||||||
|
import fire
|
||||||
|
from dumps_helper import SUBMISSIONS, parse_dump, gen_task_list
|
||||||
|
|
||||||
|
|
||||||
|
def _parse_dump(partition, dumpdir=None, outdir=None):
|
||||||
|
parse_dump(SUBMISSIONS, partition, dumpdir=dumpdir, outdir=outdir)
|
||||||
|
|
||||||
|
|
||||||
|
def _gen_task_list(dumpdir=None, tasklist=None):
|
||||||
|
gen_task_list(SUBMISSIONS, 'submissions_part1.py', dumpdir=dumpdir, tasklist=tasklist)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
fire.Fire({'parse_dump': _parse_dump,
|
||||||
|
'gen_task_list': _gen_task_list})
|
||||||
21
datasets/submissions_part2.py
Executable file
@@ -0,0 +1,21 @@
|
|||||||
|
#!/usr/bin/env python3
|
||||||
|
"""Part 2 for submissions: Spark sort + repartition into the final datasets.
|
||||||
|
|
||||||
|
Must be launched from a login node via the Hyak-provided wrapper:
|
||||||
|
start_spark_and_run.sh 1 submissions_part2.py
|
||||||
|
start_spark_and_run.sh 1 submissions_part2.py --indir=/path/to/parquets --mode=append
|
||||||
|
|
||||||
|
--indir defaults to the temp submissions dir in dumps_helper.py.
|
||||||
|
--out_by_subreddit and --out_by_author default to the live dataset paths;
|
||||||
|
override them to write to staging directories first (see add_months.sh).
|
||||||
|
"""
|
||||||
|
|
||||||
|
import fire
|
||||||
|
from dumps_helper import SUBMISSIONS, sort_and_write
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
|
||||||
|
sort_and_write(SUBMISSIONS, indir=indir,
|
||||||
|
out_by_subreddit=out_by_subreddit,
|
||||||
|
out_by_author=out_by_author))
|
||||||
@@ -1,33 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
# run from a build_machine
|
|
||||||
|
|
||||||
import requests
|
|
||||||
from os import path
|
|
||||||
import hashlib
|
|
||||||
|
|
||||||
shasums1 = requests.get("https://files.pushshift.io/reddit/comments/sha256sum.txt").text
|
|
||||||
shasums2 = requests.get("https://files.pushshift.io/reddit/comments/daily/sha256sum.txt").text
|
|
||||||
|
|
||||||
shasums = shasums1 + shasums2
|
|
||||||
dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/comments"
|
|
||||||
|
|
||||||
for l in shasums.strip().split('\n'):
|
|
||||||
sha256_hash = hashlib.sha256()
|
|
||||||
parts = l.split(' ')
|
|
||||||
|
|
||||||
correct_sha256 = parts[0]
|
|
||||||
filename = parts[-1]
|
|
||||||
print(f"checking {filename}")
|
|
||||||
fpath = path.join(dumpdir,filename)
|
|
||||||
if path.isfile(fpath):
|
|
||||||
with open(fpath,'rb') as f:
|
|
||||||
for byte_block in iter(lambda: f.read(4096),b""):
|
|
||||||
sha256_hash.update(byte_block)
|
|
||||||
|
|
||||||
if sha256_hash.hexdigest() == correct_sha256:
|
|
||||||
print(f"{filename} checks out")
|
|
||||||
else:
|
|
||||||
print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
|
|
||||||
else:
|
|
||||||
print(f"Skipping {filename} as it doesn't exist")
|
|
||||||
|
|
||||||
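Reviewer note: the removed checksum script hashes each dump in 4096-byte blocks and compares the digest with the published `sha256sum`-style listing. The same two steps, pulled out as small functions, might look like this sketch. `sha256_of_file` and `parse_sums` are hypothetical names, and unlike the script, `parse_sums` splits on any whitespace rather than on single spaces.

```python
import hashlib


def sha256_of_file(fpath, block_size=4096):
    """Hash a file in fixed-size blocks so large dumps never sit in memory."""
    digest = hashlib.sha256()
    with open(fpath, 'rb') as f:
        # iter(callable, sentinel) keeps reading until read() returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def parse_sums(text):
    """Parse sha256sum-style lines into a {filename: hex digest} dict."""
    sums = {}
    for line in text.strip().split('\n'):
        parts = line.split()
        if len(parts) >= 2:
            sums[parts[-1]] = parts[0]
    return sums
```

A caller would compare `sha256_of_file(path)` with `parse_sums(listing)[filename]` and report a mismatch, as the script's `ERROR!` branch does.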
@@ -1,31 +0,0 @@
|
|||||||
#!/usr/bin/env python3
|
|
||||||
# run from a build_machine
|
|
||||||
|
|
||||||
import requests
|
|
||||||
from os import path
|
|
||||||
import hashlib
|
|
||||||
|
|
||||||
file1 = requests.get("https://files.pushshift.io/reddit/submissions/sha256sums.txt").text
|
|
||||||
file2 = requests.get("https://files.pushshift.io/reddit/submissions/old_v1_data/sha256sums.txt").text
|
|
||||||
dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/submissions"
|
|
||||||
|
|
||||||
for l in file1.strip().split('\n') + file2.strip().split('\n'):
|
|
||||||
sha256_hash = hashlib.sha256()
|
|
||||||
parts = l.split(' ')
|
|
||||||
|
|
||||||
correct_sha256 = parts[0]
|
|
||||||
filename = parts[-1]
|
|
||||||
print(f"checking {filename}")
|
|
||||||
fpath = path.join(dumpdir,filename)
|
|
||||||
if path.isfile(fpath):
|
|
||||||
with open(fpath,'rb') as f:
|
|
||||||
for byte_block in iter(lambda: f.read(4096),b""):
|
|
||||||
sha256_hash.update(byte_block)
|
|
||||||
|
|
||||||
if sha256_hash.hexdigest() == correct_sha256:
|
|
||||||
print(f"{filename} checks out")
|
|
||||||
else:
|
|
||||||
print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
|
|
||||||
else:
|
|
||||||
print(f"Skipping {filename} as it doesn't exist")
|
|
||||||
|
|
||||||
@@ -1,14 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
user_agent='nathante teblunthuis <nathante@uw.edu>'
|
|
||||||
output_dir='/gscratch/comdata/raw_data/reddit_dumps/comments'
|
|
||||||
base_url='https://files.pushshift.io/reddit/comments/'
|
|
||||||
|
|
||||||
wget -r --no-parent -A 'RC_201*.bz2' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
wget -r --no-parent -A 'RC_201*.xz' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
wget -r --no-parent -A 'RC_201*.zst' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
|
|
||||||
# starting in 2020 we use daily dumps not monthly dumps
|
|
||||||
wget -r --no-parent -A 'RC_202*.gz' -U $user_agent -P $output_dir -nd -nc $base_url/daily/
|
|
||||||
|
|
||||||
./check_comments_shas.py
|
|
||||||
@@ -1,14 +0,0 @@
|
|||||||
#!/bin/bash
|
|
||||||
|
|
||||||
user_agent='nathante teblunthuis <nathante@uw.edu>'
|
|
||||||
output_dir='/gscratch/comdata/raw_data/reddit_dumps/submissions'
|
|
||||||
base_url='https://files.pushshift.io/reddit/submissions/'
|
|
||||||
|
|
||||||
wget -r --no-parent -A 'RS_20*.bz2' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
wget -r --no-parent -A 'RS_20*.xz' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
wget -r --no-parent -A 'RS_20*.zst' -U $user_agent -P $output_dir -nd -nc $base_url
|
|
||||||
wget -r --no-parent -A 'RS_20*.bz2' -U $user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
|
|
||||||
wget -r --no-parent -A 'RS_20*.xz' -U $user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
|
|
||||||
wget -r --no-parent -A 'RS_20*.zst' -U $user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
|
|
||||||
|
|
||||||
./check_submission_shas.py
|
|
||||||
@@ -1,21 +0,0 @@
|
|||||||
from pyspark.sql import SparkSession
|
|
||||||
from similarities_helper import build_tfidf_dataset
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
# remove [deleted] and AutoModerator (TODO remove other bots)
|
|
||||||
df = df.filter(df.author != '[deleted]')
|
|
||||||
df = df.filter(df.author != 'AutoModerator')
|
|
||||||
|
|
||||||
df = build_tfidf_dataset(df, include_subs, 'author')
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
|
|
||||||
|
|
||||||
spark.stop()
|
|
||||||
@@ -1,27 +0,0 @@
|
|||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
from pyspark.sql import Window
|
|
||||||
from similarities_helper import build_weekly_tfidf_dataset
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
|
|
||||||
## TODO:need to exclude automoderator / bot posts.
|
|
||||||
## TODO: need to better handle hyperlinks.
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
# remove [deleted] and AutoModerator (TODO remove other bots)
|
|
||||||
# df = df.filter(df.author != '[deleted]')
|
|
||||||
# df = df.filter(df.author != 'AutoModerator')
|
|
||||||
|
|
||||||
df = build_weekly_tfidf_dataset(df, include_subs, 'term')
|
|
||||||
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
|
|
||||||
spark.stop()
|
|
||||||
|
|
||||||
@@ -1,106 +0,0 @@
|
|||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
from pyspark.sql import Window
|
|
||||||
import numpy as np
|
|
||||||
import pyarrow
|
|
||||||
import pandas as pd
|
|
||||||
import fire
|
|
||||||
from itertools import islice
|
|
||||||
from pathlib import Path
|
|
||||||
from similarities_helper import *
|
|
||||||
|
|
||||||
#tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/subreddit_terms.parquet')
|
|
||||||
def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
conf = spark.sparkContext.getConf()
|
|
||||||
print(outfile)
|
|
||||||
tfidf = spark.read.parquet(tfidf_path)
|
|
||||||
|
|
||||||
if included_subreddits is None:
|
|
||||||
included_subreddits = select_topN_subreddits(topN)
|
|
||||||
|
|
||||||
else:
|
|
||||||
included_subreddits = set(open(included_subreddits))
|
|
||||||
|
|
||||||
print("creating temporary parquet with matrix indices")
|
|
||||||
tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, included_subreddits)
|
|
||||||
|
|
||||||
tfidf = spark.read.parquet(tempdir.name)
|
|
||||||
|
|
||||||
# the ids can change each week.
|
|
||||||
subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
|
|
||||||
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
|
||||||
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
|
||||||
spark.stop()
|
|
||||||
|
|
||||||
weeks = list(subreddit_names.week.drop_duplicates())
|
|
||||||
for week in weeks:
|
|
||||||
print("loading matrix")
|
|
||||||
mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
|
|
||||||
print('computing similarities')
|
|
||||||
sims = column_similarities(mat)
|
|
||||||
del mat
|
|
||||||
|
|
||||||
names = subreddit_names.loc[subreddit_names.week==week]
|
|
||||||
|
|
||||||
sims = sims.rename({i:sr for i, sr in enumerate(names.subreddit.values)},axis=1)
|
|
||||||
sims['subreddit'] = names.subreddit.values
|
|
||||||
write_weekly_similarities(outfile, sims, week)
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
def cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500):
|
|
||||||
'''
|
|
||||||
Compute similarities between subreddits based on tf-idf vectors of author comments
|
|
||||||
|
|
||||||
included_subreddits : string
|
|
||||||
Text file containing a list of subreddits to include (one per line) if included_subreddits is None then do the top 500 subreddits
|
|
||||||
|
|
||||||
min_df : int (default = 0.1 * (number of included_subreddits))
|
|
||||||
exclude terms that appear in fewer than this number of documents.
|
|
||||||
|
|
||||||
outfile: string
|
|
||||||
where to output csv and feather outputs
|
|
||||||
'''
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
conf = spark.sparkContext.getConf()
|
|
||||||
print(outfile)
|
|
||||||
|
|
||||||
tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet')
|
|
||||||
|
|
||||||
if included_subreddits is None:
|
|
||||||
included_subreddits = select_topN_subreddits(topN)
|
|
||||||
|
|
||||||
else:
|
|
||||||
included_subreddits = set(open(included_subreddits))
|
|
||||||
|
|
||||||
print("creating temporary parquet with matrix indices")
|
|
||||||
tempdir = prep_tfidf_entries(tfidf, 'author', min_df, included_subreddits)
|
|
||||||
tfidf = spark.read.parquet(tempdir.name)
|
|
||||||
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
|
|
||||||
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
|
||||||
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
|
||||||
spark.stop()
|
|
||||||
|
|
||||||
print("loading matrix")
|
|
||||||
mat = read_tfidf_matrix(tempdir.name,'author')
|
|
||||||
print('computing similarities')
|
|
||||||
sims = column_similarities(mat)
|
|
||||||
del mat
|
|
||||||
|
|
||||||
sims = pd.DataFrame(sims.todense())
|
|
||||||
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
|
|
||||||
sims['subreddit'] = subreddit_names.subreddit.values
|
|
||||||
|
|
||||||
p = Path(outfile)
|
|
||||||
|
|
||||||
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
|
|
||||||
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
|
|
||||||
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
|
|
||||||
|
|
||||||
sims.to_feather(outfile)
|
|
||||||
tempdir.cleanup()
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
fire.Fire(author_cosine_similarities)
|
|
||||||
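Reviewer note: the removed scripts call `column_similarities(mat)` from `similarities_helper`, which is not shown in this diff. Assuming it computes cosine similarity between the columns of a term-by-subreddit tf-idf matrix (an assumption, based on the function and file names), the computation can be sketched in plain Python as below. The helper in the repository presumably works on sparse matrices; this dense version, with the hypothetical name `column_cosine_similarities`, is only meant to show the arithmetic.

```python
import math


def column_cosine_similarities(mat):
    """Cosine similarity between columns of `mat` (a list of equal-length rows).

    Rows stand for terms or authors, columns for subreddits.
    """
    n_cols = len(mat[0])
    cols = [[row[j] for row in mat] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]

    sims = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(n_cols):
            # An all-zero column has no direction; leave its similarity at 0.
            if norms[i] and norms[j]:
                dot = sum(a * b for a, b in zip(cols[i], cols[j]))
                sims[i][j] = dot / (norms[i] * norms[j])
    return sims
```

The result is a square matrix with one row and one column per subreddit, which is the shape the scripts then label with subreddit names before writing to disk.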
@@ -1,61 +0,0 @@
|
|||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
from pyspark.sql import Window
|
|
||||||
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix
|
|
||||||
import numpy as np
|
|
||||||
import pyarrow
|
|
||||||
import pandas as pd
|
|
||||||
import fire
|
|
||||||
from itertools import islice
|
|
||||||
from pathlib import Path
|
|
||||||
from similarities_helper import prep_tfidf_entries, read_tfidf_matrix, column_similarities, select_topN
|
|
||||||
import scipy
|
|
||||||
|
|
||||||
# outfile='test_similarities_500.feather';
|
|
||||||
# min_df = None;
|
|
||||||
# included_subreddits=None; topN=100; exclude_phrases=True;
|
|
||||||
def term_cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500, exclude_phrases=False):
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
conf = spark.sparkContext.getConf()
|
|
||||||
print(outfile)
|
|
||||||
print(exclude_phrases)
|
|
||||||
|
|
||||||
tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_terms.parquet')
|
|
||||||
|
|
||||||
if included_subreddits is None:
|
|
||||||
included_subreddits = select_topN_subreddits(topN)
|
|
||||||
else:
|
|
||||||
included_subreddits = set(open(included_subreddits))
|
|
||||||
|
|
||||||
if exclude_phrases == True:
|
|
||||||
tfidf = tfidf.filter(~f.col(term).contains("_"))
|
|
||||||
|
|
||||||
print("creating temporary parquet with matrix indices")
|
|
||||||
tempdir = prep_tfidf_entries(tfidf, 'term', min_df, included_subreddits)
|
|
||||||
tfidf = spark.read.parquet(tempdir.name)
|
|
||||||
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
|
|
||||||
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
|
||||||
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
|
||||||
spark.stop()
|
|
||||||
|
|
||||||
print("loading matrix")
|
|
||||||
mat = read_tfidf_matrix(tempdir.name,'term')
|
|
||||||
print('computing similarities')
|
|
||||||
sims = column_similarities(mat)
|
|
||||||
del mat
|
|
||||||
|
|
||||||
sims = pd.DataFrame(sims.todense())
|
|
||||||
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
|
|
||||||
sims['subreddit'] = subreddit_names.subreddit.values
|
|
||||||
|
|
||||||
p = Path(outfile)
|
|
||||||
|
|
||||||
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
|
|
||||||
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
|
|
||||||
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
|
|
||||||
|
|
||||||
sims.to_feather(outfile)
|
|
||||||
tempdir.cleanup()
|
|
||||||
|
|
||||||
if __name__ == '__main__':
|
|
||||||
fire.Fire(term_cosine_similarities)
|
|
||||||
@@ -1,21 +0,0 @@
|
|||||||
from pyspark.sql import SparkSession
|
|
||||||
from similarities_helper import build_tfidf_dataset
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
# remove [deleted] and AutoModerator (TODO remove other bots)
|
|
||||||
df = df.filter(df.author != '[deleted]')
|
|
||||||
df = df.filter(df.author != 'AutoModerator')
|
|
||||||
|
|
||||||
df = build_tfidf_dataset(df, include_subs, 'author')
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
|
|
||||||
|
|
||||||
spark.stop()
|
|
||||||
@@ -1,21 +0,0 @@
|
|||||||
from pyspark.sql import SparkSession
|
|
||||||
from similarities_helper import build_weekly_tfidf_dataset
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
# remove [deleted] and AutoModerator (TODO remove other bots)
|
|
||||||
df = df.filter(df.author != '[deleted]')
|
|
||||||
df = df.filter(df.author != 'AutoModerator')
|
|
||||||
|
|
||||||
df = build_weekly_tfidf_dataset(df, include_subs, 'author')
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet', mode='overwrite', compression='snappy')
|
|
||||||
|
|
||||||
spark.stop()
|
|
||||||
@@ -1,18 +0,0 @@
|
|||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
from pyspark.sql import Window
|
|
||||||
from similarities_helper import build_tfidf_dataset
|
|
||||||
|
|
||||||
## TODO:need to exclude automoderator / bot posts.
|
|
||||||
## TODO: need to better handle hyperlinks.
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
df = build_tfidf_dataset(df, include_subs, 'term')
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/reddit_similarity/subreddit_terms.parquet',mode='overwrite',compression='snappy')
|
|
||||||
spark.stop()
|
|
||||||
@@ -1,27 +0,0 @@
|
|||||||
from pyspark.sql import functions as f
|
|
||||||
from pyspark.sql import SparkSession
|
|
||||||
from pyspark.sql import Window
|
|
||||||
from similarities_helper import build_weekly_tfidf_dataset
|
|
||||||
import pandas as pd
|
|
||||||
|
|
||||||
|
|
||||||
## TODO:need to exclude automoderator / bot posts.
|
|
||||||
## TODO: need to better handle hyperlinks.
|
|
||||||
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
|
|
||||||
|
|
||||||
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
|
|
||||||
|
|
||||||
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
|
|
||||||
|
|
||||||
# remove [deleted] and AutoModerator (TODO remove other bots)
|
|
||||||
# df = df.filter(df.author != '[deleted]')
|
|
||||||
# df = df.filter(df.author != 'AutoModerator')
|
|
||||||
|
|
||||||
df = build_weekly_tfidf_dataset(df, include_subs, 'term')
|
|
||||||
|
|
||||||
|
|
||||||
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
|
|
||||||
spark.stop()
|
|
||||||
|
|
||||||
@@ -1,130 +1,25 @@
|
|||||||
#all: /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors_130k.parquet
|
all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms.parquet
|
||||||
srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
|
|
||||||
srun_singularity_huge=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity_huge.sh
|
|
||||||
base_data=/gscratch/comdata/output/
|
|
||||||
similarity_data=${base_data}/reddit_similarity
|
|
||||||
tfidf_data=${similarity_data}/tfidf
|
|
||||||
tfidf_weekly_data=${similarity_data}/tfidf_weekly
|
|
||||||
similarity_weekly_data=${similarity_data}/weekly
|
|
||||||
lsi_components=[10,50,100,200,300,400,500,600,700,850,1000,1500]
|
|
||||||
|
|
||||||
lsi_similarities: ${similarity_data}/subreddit_comment_terms_10k_LSI ${similarity_data}/subreddit_comment_authors-tf_10k_LSI ${similarity_data}/subreddit_comment_authors_10k_LSI ${similarity_data}/subreddit_comment_terms_30k_LSI ${similarity_data}/subreddit_comment_authors-tf_30k_LSI ${similarity_data}/subreddit_comment_authors_30k_LSI
|
|
||||||
|
|
||||||
all: ${tfidf_data}/comment_terms_100k.parquet ${tfidf_data}/comment_terms_30k.parquet ${tfidf_data}/comment_terms_10k.parquet ${tfidf_data}/comment_authors_100k.parquet ${tfidf_data}/comment_authors_30k.parquet ${tfidf_data}/comment_authors_10k.parquet ${similarity_data}/subreddit_comment_authors_30k.feather ${similarity_data}/subreddit_comment_authors_10k.feather ${similarity_data}/subreddit_comment_terms_10k.feather ${similarity_data}/subreddit_comment_terms_30k.feather ${similarity_data}/subreddit_comment_authors-tf_30k.feather ${similarity_data}/subreddit_comment_authors-tf_10k.feather ${similarity_data}/subreddit_comment_terms_100k.feather ${similarity_data}/subreddit_comment_authors_100k.feather ${similarity_data}/subreddit_comment_authors-tf_100k.feather ${similarity_weekly_data}/comment_terms.parquet
|
|
||||||
|
|
||||||
#${tfidf_weekly_data}/comment_terms_100k.parquet ${tfidf_weekly_data}/comment_authors_100k.parquet ${tfidf_weekly_data}/comment_terms_30k.parquet ${tfidf_weekly_data}/comment_authors_30k.parquet ${similarity_weekly_data}/comment_terms_100k.parquet ${similarity_weekly_data}/comment_authors_100k.parquet ${similarity_weekly_data}/comment_terms_30k.parquet ${similarity_weekly_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
# /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_weekly_130k.parquet
|
|
||||||
|
|
||||||
# all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet
|
# all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet
|
||||||
|
|
||||||
${similarity_weekly_data}/comment_terms.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv ${tfidf_weekly_data}/comment_terms.parquet
|
|
||||||
${srun_singularity} python3 weekly_cosine_similarities.py terms --topN=10000 --outfile=${similarity_weekly_data}/comment_terms.parquet
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_terms_10k.feather: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
|
# /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
|
||||||
${srun_singularity} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_10k.feather --topN=10000
|
# start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.feather
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_terms_10k_LSI: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
|
/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
|
||||||
${srun_singularity} python3 lsi_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=200
|
start_spark_and_run.sh 1 tfidf.py terms --topN=10000
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_terms_30k_LSI: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
|
/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
|
||||||
${srun_singularity} python3 lsi_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=200
|
start_spark_and_run.sh 1 tfidf.py authors --topN=10000
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_terms_30k.feather: ${tfidf_data}/comment_terms_30k.parquet similarities_helper.py
|
/gscratch/comdata/output/reddit_similarity/comment_authors_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
|
||||||
${srun_singularity} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_30k.feather --topN=30000
|
start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors_30k.feather: ${tfidf_data}/comment_authors_30k.parquet similarities_helper.py
|
/gscratch/comdata/output/reddit_similarity/comment_terms.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet
|
||||||
${srun_singularity} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_30k.feather --topN=30000
|
start_spark_and_run.sh 1 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors_10k.feather: ${tfidf_data}/comment_authors_10k.parquet similarities_helper.py
|
# /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet
|
||||||
${srun_singularity} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_10k.feather --topN=10000
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors_10k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 lsi_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=2
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors_30k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 lsi_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=2
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors-tf_30k.feather: ${tfidf_data}/comment_authors_30k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_30k.feather --topN=30000
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors-tf_10k.feather: ${tfidf_data}/comment_authors_10k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_10k.feather --topN=10000
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors-tf_10k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 lsi_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=2
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors-tf_30k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 lsi_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=2
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_terms_100k.feather: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_100k.feather --topN=100000
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors_100k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_100k.feather --topN=100000
|
|
||||||
|
|
||||||
${similarity_data}/subreddit_comment_authors-tf_100k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
|
|
||||||
${srun_singularity} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_100k.feather --topN=100000
|
|
||||||
|
|
||||||
${tfidf_data}/comment_terms_100k.feather/: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py terms --topN=100000 --outpath=${tfidf_data}/comment_terms_100k.feather
|
|
||||||
|
|
||||||
${tfidf_data}/comment_terms_30k.feather: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py terms --topN=30000 --outpath=${tfidf_data}/comment_terms_30k.feather
|
|
||||||
|
|
||||||
${tfidf_data}/comment_terms_10k.feather: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py terms --topN=10000 --outpath=${tfidf_data}/comment_terms_10k.feather
|
|
||||||
|
|
||||||
${tfidf_data}/comment_authors_100k.feather: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py authors --topN=100000 --outpath=${tfidf_data}/comment_authors_100k.feather
|
|
||||||
|
|
||||||
${tfidf_data}/comment_authors_10k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py authors --topN=10000 --outpath=${tfidf_data}/comment_authors_10k.parquet
|
|
||||||
|
|
||||||
${tfidf_data}/comment_authors_30k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
mkdir -p ${tfidf_data}/
|
|
||||||
start_spark_and_run.sh 4 tfidf.py authors --topN=30000 --outpath=${tfidf_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
${tfidf_data}/tfidf_weekly/comment_terms_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
start_spark_and_run.sh 4 tfidf.py terms_weekly --topN=100000 --outpath=${similarity_data}/tfidf_weekly/comment_authors_100k.parquet
|
|
||||||
|
|
||||||
${tfidf_data}/tfidf_weekly/comment_authors_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_ppnum_comments.csv
|
|
||||||
start_spark_and_run.sh 4 tfidf.py authors_weekly --topN=100000 --outpath=${tfidf_weekly_data}/comment_authors_100k.parquet
|
|
||||||
|
|
||||||
${tfidf_weekly_data}/comment_terms_30k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
start_spark_and_run.sh 4 tfidf.py terms_weekly --topN=30000 --outpath=${tfidf_weekly_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
${tfidf_weekly_data}/comment_authors_30k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv
|
|
||||||
start_spark_and_run.sh 4 tfidf.py authors_weekly --topN=30000 --outpath=${tfidf_weekly_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
${similarity_weekly_data}/comment_terms_100k.parquet: weekly_cosine_similarities.py similarities_helper.py ${tfidf_weekly_data}/comment_terms_100k.parquet
|
|
||||||
${srun_singularity} python3 weekly_cosine_similarities.py terms --topN=100000 --outfile=${similarity_weekly_data}/comment_authors_100k.parquet
|
|
||||||
|
|
||||||
${similarity_weekly_data}/comment_authors_100k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv ${tfidf_weekly_data}/comment_authors_100k.parquet
|
|
||||||
${srun_singularity} python3 weekly_cosine_similarities.py authors --topN=100000 --outfile=${similarity_weekly_data}/comment_authors_100k.parquet
|
|
||||||
|
|
||||||
${similarity_weekly_data}/comment_terms_30k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv ${tfidf_weekly_data}/comment_terms_30k.parquet
|
|
||||||
${srun_singularity} python3 weekly_cosine_similarities.py terms --topN=30000 --outfile=${similarity_weekly_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
${similarity_weekly_data}/comment_authors_30k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments.csv ${tfidf_weekly_data}/comment_authors_30k.parquet
|
|
||||||
${srun_singularity} python3 weekly_cosine_similarities.py authors --topN=30000 --outfile=${similarity_weekly_data}/comment_authors_30k.parquet
|
|
||||||
|
|
||||||
# ${tfidf_weekly_data}/comment_authors_130k.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
|
|
||||||
# start_spark_and_run.sh 1 tfidf.py authors_weekly --topN=130000
|
|
||||||
|
|
||||||
# /gscratch/comdata/output/reddit_similarity/comment_authors_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
|
|
||||||
# start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
|
|
||||||
|
|
||||||
# /gscratch/comdata/output/reddit_similarity/comment_terms.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet
|
|
||||||
# start_spark_and_run.sh 1 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
|
|
||||||
|
|
||||||
# /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet: cosine_similarities.py ${tfidf_weekly_data}/comment_authors.parquet
|
|
||||||
# start_spark_and_run.sh 1 weekly_cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10000_weely.parquet
|
# start_spark_and_run.sh 1 weekly_cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10000_weely.parquet
|
||||||
|
|
||||||
# /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
|
/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
|
||||||
# start_spark_and_run.sh 1 cosine_similarities.py author-tf --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
|
start_spark_and_run.sh 1 cosine_similarities.py author-tf --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
|
||||||
|
|||||||
175
similarities/README.md
Normal file
@@ -0,0 +1,175 @@
|
|||||||
|
# Subreddit similarity
|
||||||
|
|
||||||
|
This directory holds the code that computes pairwise similarities between
|
||||||
|
subreddits — both term-based (from TF-IDF over comment text) and
|
||||||
|
author-based (from overlapping commenter sets). Similarity matrices
|
||||||
|
produced here feed downstream clustering (`../clustering/`) and density
|
||||||
|
analysis (`../density/`).
|
||||||
|
|
||||||
|
## Datasets
|
||||||
|
|
||||||
|
Subreddit similarity datasets based on comment terms and comment authors
|
||||||
|
are available on Hyak in `/gscratch/comdata/output/reddit_similarity`.
|
||||||
|
The overall approach to subreddit similarity seems to work reasonably
|
||||||
|
well and the code is stabilizing. If you want help using these
|
||||||
|
similarities in a project, just reach out to
|
||||||
|
[Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29).
|
||||||
|
|
||||||
|
By default, the scripts here take a `topN` parameter which selects the
|
||||||
|
subreddits to include in the similarity dataset according to how many
|
||||||
|
total comments they have. You can alternatively pass a value to the
|
||||||
|
`included_subreddits` parameter for a file with the names of the
|
||||||
|
subreddits you would like to include, one per line.
|
||||||
|
|
||||||
|
## Scripts
|
||||||
|
|
||||||
|
| Script | What it does |
|
||||||
|
|---|---|
|
||||||
|
| `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. |
|
||||||
|
| `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. |
|
||||||
|
| `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. |
|
||||||
|
| `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. |
|
||||||
|
| `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. |
|
||||||
|
| `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
|
||||||
|
| `Makefile` | Wires everything together with the canonical Hyak output paths. |
|
||||||
|
|
||||||
|
## Methods
|
||||||
|
|
||||||
|
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and
|
||||||
|
simple information-retrieval technique that we can use to quantify the
|
||||||
|
topic of a subreddit. The goal of TF-IDF is to build a vector for each
|
||||||
|
subreddit that scores every term (or phrase) according to how
|
||||||
|
characteristic it is of the overall lexicon used in that subreddit. For
|
||||||
|
example, the most characteristic terms in the subreddit `/r/christianity`
|
||||||
|
in the current version of the TF-IDF model are:
|
||||||
|
|
||||||
|
| Term | tf_idf |
|
||||||
|
|:------------:|:------:|
|
||||||
|
| christians | 0.581 |
|
||||||
|
| christianity | 0.569 |
|
||||||
|
| kjv | 0.568 |
|
||||||
|
| bible | 0.557 |
|
||||||
|
| scripture | 0.55 |
|
||||||
|
|
||||||
|
TF-IDF stands for "term frequency — inverse document frequency" because
|
||||||
|
it is the product of two terms: "term frequency" and "inverse document
|
||||||
|
frequency." Term frequency quantifies the amount that a term appears in
|
||||||
|
a subreddit (document). Inverse document frequency quantifies how rarely
|
||||||
|
that term appears in other subreddits (documents). As you can see on
|
||||||
|
the Wikipedia page, there are many possible ways of constructing and
|
||||||
|
combining these terms.
|
||||||
|
|
||||||
|
I chose to normalize term frequency by the maximum (raw) term frequency
|
||||||
|
for each subreddit:
|
||||||
|
|
||||||
|
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$
|
||||||
|
|
||||||
|
I use the log inverse document frequency:
|
||||||
|
|
||||||
|
$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
|
||||||
|
|
||||||
|
I then combine them using some smoothing to get:
|
||||||
|
|
||||||
|
$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
|
||||||
|
|
||||||
|
(Other normalization strategies are worth trying — see the note in
|
||||||
|
`TODO`.)
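As a rough illustration of the three formulas above, here is a minimal NumPy sketch on a toy count matrix. The names `tfidf_max_normalized`, `counts`, and `scores` are made up for this example, the natural log is assumed, and terms that do not occur in a document are given a score of 0 (as in a sparse representation). The real pipeline in `tfidf.py` runs on Spark and may differ in these details.

```python
import numpy as np

def tfidf_max_normalized(counts):
    """Score a (n_docs, n_terms) matrix of raw term counts f_{t,d}.

    Assumes every document contains at least one term and every term
    occurs in at least one document (otherwise we would divide by zero).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]

    # tf: raw count divided by the largest raw count in the same document
    tf = counts / counts.max(axis=1, keepdims=True)

    # idf: log of (number of documents / number of documents with the term)
    doc_freq = (counts > 0).sum(axis=0)
    idf = np.log(n_docs / doc_freq)

    # smoothed product; absent terms are set to 0 (an assumption, see above)
    scores = (0.5 + 0.5 * tf) * idf
    return np.where(counts > 0, scores, 0.0)

# two toy "subreddits" (rows) and three toy terms (columns)
counts = [[4, 2, 0],
          [1, 0, 1]]
scores = tfidf_max_normalized(counts)
print(scores)
```

Note that the first term appears in both documents, so its idf is $\log(2/2) = 0$ and it scores 0 everywhere, no matter how frequent it is.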
|
||||||
|
|
||||||
|
### Building TF-IDF vectors
|
||||||
|
|
||||||
|
The process for building TF-IDF vectors has four steps:
|
||||||
|
|
||||||
|
1. Extracting terms using `../ngrams/tf_comments.py`
|
||||||
|
2. Detecting common phrases using `../ngrams/top_comment_phrases.py`
|
||||||
|
3. Extracting terms and common phrases using
|
||||||
|
`../ngrams/tf_comments.py --mwe-pass='second'`
|
||||||
|
4. Building IDF and TF-IDF scores in `tfidf.py`
|
||||||
|
|
||||||
|
#### Running `tf_comments.py` on the backfill queue
|
||||||
|
|
||||||
|
The main reason that I did it in four steps instead of one is to take
|
||||||
|
advantage of the backfill queue for running `tf_comments.py`. This step
|
||||||
|
requires reading all of the text in every comment and converting it to
|
||||||
|
a bag of words at the subreddit level. This is a lot of computation
|
||||||
|
that is easily parallelizable. The script `../ngrams/run_tf_jobs.sh`
|
||||||
|
partially automates running step 1 (or 3) on the backfill queue.
|
||||||
|
|
||||||
|
#### Phrase detection using pointwise mutual information
|
||||||
|
|
||||||
|
TF-IDF is simple, but only uses single words (unigrams). Sequences of
|
||||||
|
multiple words can be important to account for how words have different
|
||||||
|
meanings in different contexts or how sequences of words refer to
|
||||||
|
distinct things like names. Dealing with context or longer sequences of
|
||||||
|
words is a common challenge in natural language processing since the
|
||||||
|
number of possible n-grams grows exponentially as n gets bigger. Phrase
|
||||||
|
detection helps this problem by limiting the set of n-grams to those
|
||||||
|
most informative.
|
||||||
|
|
||||||
|
But how do we detect phrases? I implemented [pointwise mutual
|
||||||
|
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information),
|
||||||
|
which is a fairly simple approach that seems to work well.
|
||||||
|
|
||||||
|
PMI is a quantity derived from information theory. The intuition is
|
||||||
|
that if two words occur together quite frequently compared to how often
|
||||||
|
they appear separately, then the co-occurrence is likely to be
|
||||||
|
informative.
|
||||||
|
|
||||||
|
$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
|
||||||
|
|
||||||
|
When `../ngrams/tf_comments.py` is run with `--mwe-pass=first`, a 10% sample
|
||||||
|
of 1-4-grams (sequences of terms up to length 4) will be written to a
|
||||||
|
file to be consumed by `../ngrams/top_comment_phrases.py`.
|
||||||
|
`top_comment_phrases.py` computes the PMI for these possible phrases
|
||||||
|
and writes those that occur at least 3500 times in the sample of
|
||||||
|
n-grams and have a PMI of at least 3 (about 65000 expressions).
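To make the PMI filter concrete, here is a small pure-Python sketch. It is not the code in `top_comment_phrases.py`: the function `bigram_pmi` and its parameters `min_count` and `min_pmi` are hypothetical, it only handles bigrams, it uses the natural log, and it estimates probabilities from raw counts in a single token list.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=1, min_pmi=0.0):
    """Return {(x, y): pmi} for adjacent word pairs passing both thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue  # too rare to trust the estimate
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            phrases[(x, y)] = pmi
    return phrases

tokens = "new york is big and new york is old".split()
phrases = bigram_pmi(tokens, min_count=2)
print(phrases)
```

On this toy input only the pairs that occur at least twice survive the count filter. The thresholds described above (3500 occurrences and a PMI of 3) play the same role at a much larger scale.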
|
||||||
|
|
||||||
|
`tf_comments.py --mwe-pass=second` then uses the detected phrases and
|
||||||
|
adds them to the term frequency data.
|
||||||
|
|
||||||
|
## Cosine similarity
|
||||||
|
|
||||||
|
Once the TF-IDF vectors are built, computing a similarity score between
|
||||||
|
two subreddits is straightforward using cosine similarity.
|
||||||
|
|
||||||
|
$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
|
||||||
|
|
||||||
|
Intuitively, we represent two subreddits as lines in a high-dimensional
|
||||||
|
space (TF-IDF vectors). In linear algebra, the dot product ($\cdot$)
|
||||||
|
between two vectors takes their weighted sum (e.g. linear regression is
|
||||||
|
a dot product of a vector of covariates and a vector of weights). The
|
||||||
|
vectors might have different lengths — if one subreddit has more words
|
||||||
|
in comments than the other — so in cosine similarity the dot product
|
||||||
|
is normalized by the magnitude (length) of the vectors. It turns out
|
||||||
|
that this is equivalent to taking the cosine of the angle between the two vectors. So
|
||||||
|
cosine similarity in essence quantifies the angle between the two lines
|
||||||
|
in high-dimensional space. If the cosine similarity between two
|
||||||
|
subreddits is higher, then their TF-IDF vectors are more correlated.
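The formula can be sketched in a few lines of NumPy. This is only an illustration on dense toy vectors; the helper name `cosine_similarity_matrix` is made up here, and the actual scripts work on much larger sparse matrices.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`.

    Assumes no row is all zeros (its norm would be 0).
    """
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms      # scale every row to length 1
    return unit @ unit.T        # dot products of unit vectors are cosines

# three toy "subreddits" described by three toy terms
vectors = [[1, 0, 1],
           [2, 0, 2],   # same direction as row 0, twice the length
           [0, 3, 0]]   # shares no terms with rows 0 and 1
sims = cosine_similarity_matrix(vectors)
print(sims)
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though their lengths differ; row 2 is orthogonal to both, so its similarity to them is 0.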
|
||||||
|
|
||||||
|
Cosine similarity with TF-IDF is popular (indeed it has been applied to
|
||||||
|
Reddit in research several times before) because it quantifies the
|
||||||
|
correlation between the most characteristic terms for two communities.
|
||||||
|
|
||||||
|
Compared to other approaches to similarity like those using word
|
||||||
|
embeddings or topic models it may struggle to handle polysemy, synonymy,
|
||||||
|
or correlations between different terms. Using phrase detection helps
|
||||||
|
with this a little bit. The advantages of this approach are simplicity
|
||||||
|
and scalability. I'm thinking about using [latent semantic
|
||||||
|
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
|
||||||
|
intermediate step to improve upon similarities based on raw TF-IDFs.
|
||||||
|
|
||||||
|
Even still, computing similarities between a large number of subreddits
|
||||||
|
is computationally expensive and requires $n(n-1)/2$ dot-product
|
||||||
|
evaluations. This can be sped up by passing
|
||||||
|
`similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I
|
||||||
|
used a cosine similarity function that's built into the Spark matrix
|
||||||
|
library which supports the `DIMSUM` algorithm for approximating
|
||||||
|
matrix-matrix products. This algorithm is commonly used in industry
|
||||||
|
(e.g. at Twitter and Google) for large-scale similarity scoring.
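To get a feel for the $n(n-1)/2$ cost, the following snippet counts the subreddit pairs for the `topN` values used in the Makefile. The helper `n_pairs` is just for illustration.

```python
def n_pairs(n):
    """Number of unordered pairs among n subreddits: n(n-1)/2."""
    return n * (n - 1) // 2

pair_counts = {n: n_pairs(n) for n in (10_000, 30_000, 100_000)}
for n, pairs in pair_counts.items():
    print(f"topN={n}: {pairs} pairs")
```

Going from 10,000 to 100,000 subreddits multiplies the number of pairs by roughly 100, which is why thresholding or approximation becomes attractive at the larger sizes.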
|
||||||
|
|
||||||
|
## See also
|
||||||
|
|
||||||
|
The CDSC wiki page
|
||||||
|
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
|
||||||
|
is the landing page for this project on the wiki. The methods writeup
|
||||||
|
above used to live there; it now lives here so that doc and code stay
|
||||||
|
in sync.
|
||||||
1
similarities/TODO
Normal file
@@ -0,0 +1 @@
|
|||||||
|
Try normalizing tf by the mean or std instead of the max to avoid penalizing subreddits with very active users.
|
||||||
@@ -2,13 +2,12 @@ import pandas as pd
|
|||||||
import fire
|
import fire
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from similarities_helper import similarities, column_similarities
|
from similarities_helper import similarities, column_similarities
|
||||||
from functools import partial
|
|
||||||
|
|
||||||
def cosine_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
|
def cosine_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
|
||||||
|
|
||||||
return similarities(infile=infile, simfunc=column_similarities, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases,from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
|
return similarities(infile=infile, simfunc=column_similarities, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases,from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
|
||||||
|
|
||||||
# change so that these take in an input as an optional argument (for speed, but also for idf).
|
|
||||||
def term_cosine_similarities(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet', min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
|
def term_cosine_similarities(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet', min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
|
||||||
|
|
||||||
return cosine_similarities(infile,
|
return cosine_similarities(infile,
|
||||||
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
#!/usr/bin/bash
|
#!/usr/bin/bash
|
||||||
start_spark_cluster.sh
|
start_spark_cluster.sh
|
||||||
singularity exec /gscratch/comdata/users/nathante/cdsc_base.sif spark-submit --master spark://$(hostname).hyak.local:7077 lsi_similarities.py author --outfile=/gscratch/comdata/output//reddit_similarity/subreddit_comment_authors_10k_LSI.feather --topN=10000
|
spark-submit --master spark://$(hostname):18899 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
|
||||||
singularity exec /gscratch/comdata/users/nathante/cdsc_base.sif stop-all.sh
|
stop-all.sh
|
||||||
|
|||||||
@@ -1,61 +0,0 @@
|
|||||||
import pandas as pd
|
|
||||||
import fire
|
|
||||||
from pathlib import Path
|
|
||||||
from similarities_helper import similarities, lsi_column_similarities
|
|
||||||
from functools import partial
|
|
||||||
|
|
||||||
def lsi_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, from_date=None, to_date=None, tfidf_colname='tf_idf',n_components=100,n_iter=5,random_state=1968,algorithm='arpack'):
|
|
||||||
print(n_components,flush=True)
|
|
||||||
|
|
||||||
simfunc = partial(lsi_column_similarities,n_components=n_components,n_iter=n_iter,random_state=random_state,algorithm=algorithm)
|
|
||||||
|
|
||||||
return similarities(infile=infile, simfunc=simfunc, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
|
|
||||||
|
|
||||||
# change so that these take in an input as an optional argument (for speed, but also for idf).
|
|
||||||
def term_lsi_similarities(outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, from_date=None, to_date=None, n_components=300,n_iter=5,random_state=1968,algorithm='arpack'):
|
|
||||||
|
|
||||||
return lsi_similarities('/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet',
|
|
||||||
'term',
|
|
||||||
outfile,
|
|
||||||
min_df,
|
|
||||||
max_df,
|
|
||||||
included_subreddits,
|
|
||||||
topN,
|
|
||||||
from_date,
|
|
||||||
to_date,
|
|
||||||
n_components=n_components
|
|
||||||
)
|
|
||||||
|
|
||||||
def author_lsi_similarities(outfile, min_df=2, max_df=None, included_subreddits=None, topN=10000, from_date=None, to_date=None,n_components=300,n_iter=5,random_state=1968,algorithm='arpack'):
|
|
||||||
return lsi_similarities('/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',
|
|
||||||
'author',
|
|
||||||
outfile,
|
|
||||||
min_df,
|
|
||||||
max_df,
|
|
||||||
included_subreddits,
|
|
||||||
topN,
|
|
||||||
from_date=from_date,
|
|
||||||
to_date=to_date,
|
|
||||||
n_components=n_components
|
|
||||||
)
|
|
||||||
|
|
||||||
def author_tf_similarities(outfile, min_df=2, max_df=None, included_subreddits=None, topN=10000, from_date=None, to_date=None,n_components=300,n_iter=5,random_state=1968,algorithm='arpack'):
|
|
||||||
return lsi_similarities('/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',
|
|
||||||
'author',
|
|
||||||
outfile,
|
|
||||||
min_df,
|
|
||||||
max_df,
|
|
||||||
included_subreddits,
|
|
||||||
topN,
|
|
||||||
from_date=from_date,
|
|
||||||
to_date=to_date,
|
|
||||||
tfidf_colname='relative_tf',
|
|
||||||
n_components=n_components
|
|
||||||
)
|
|
||||||
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
|
||||||
fire.Fire({'term':term_lsi_similarities,
|
|
||||||
'author':author_lsi_similarities,
|
|
||||||
'author-tf':author_tf_similarities})
|
|
||||||
|
|
||||||
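The removed `lsi_similarities` wrapper above binds the LSI settings onto `lsi_column_similarities` with `functools.partial`, so that `similarities` only has to call `simfunc(mat)`. A minimal sketch of that pattern follows. The names `column_similarities_stub` and `run_similarities` are hypothetical stand-ins for illustration and are not functions from this repository.

```python
from functools import partial


def column_similarities_stub(mat, n_components=300, n_iter=5):
    """Hypothetical similarity function with tuning parameters."""
    # Only report what it was called with; no real computation happens here.
    return {
        "shape": (len(mat), len(mat[0])),
        "n_components": n_components,
        "n_iter": n_iter,
    }


def run_similarities(mat, simfunc):
    """Hypothetical driver that only knows simfunc takes a single matrix."""
    return simfunc(mat)


# Bind the tuning parameters up front, leaving the matrix as the only argument.
simfunc = partial(column_similarities_stub, n_components=100, n_iter=5)
result = run_similarities([[1.0, 0.0], [0.0, 1.0]], simfunc)
print(result)
```

Binding the parameters this way keeps the driver independent of any particular similarity function, which is presumably why the helper accepts `simfunc` as an argument.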
@@ -2,14 +2,11 @@ from pyspark.sql import SparkSession
|
|||||||
from pyspark.sql import Window
|
from pyspark.sql import Window
|
||||||
from pyspark.sql import functions as f
|
from pyspark.sql import functions as f
|
||||||
from enum import Enum
|
from enum import Enum
|
||||||
from multiprocessing import cpu_count, Pool
|
|
||||||
from pyspark.mllib.linalg.distributed import CoordinateMatrix
|
from pyspark.mllib.linalg.distributed import CoordinateMatrix
|
||||||
from tempfile import TemporaryDirectory
|
from tempfile import TemporaryDirectory
|
||||||
import pyarrow
|
import pyarrow
|
||||||
import pyarrow.dataset as ds
|
import pyarrow.dataset as ds
|
||||||
from sklearn.metrics import pairwise_distances
|
|
||||||
from scipy.sparse import csr_matrix, issparse
|
from scipy.sparse import csr_matrix, issparse
|
||||||
from sklearn.decomposition import TruncatedSVD
|
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import pathlib
|
import pathlib
|
||||||
@@ -20,150 +17,128 @@ class tf_weight(Enum):
|
|||||||
MaxTF = 1
|
MaxTF = 1
|
||||||
Norm05 = 2
|
Norm05 = 2
|
||||||
|
|
||||||
infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet"
|
infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet"
|
||||||
cache_file = "/gscratch/comdata/users/nathante/cdsc_reddit/similarities/term_tfidf_entries_bak.parquet"
|
|
||||||
|
|
||||||
def termauthor_tfidf(term_tfidf_callable, author_tfidf_callable):
|
def reindex_tfidf_time_interval(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
|
||||||
|
term = term_colname
|
||||||
|
term_id = term + '_id'
|
||||||
|
term_id_new = term + '_id_new'
|
||||||
|
|
||||||
# subreddits missing after this step don't have any terms that have a high enough idf
|
spark = SparkSession.builder.getOrCreate()
|
||||||
# try rewriting without merges
|
conf = spark.sparkContext.getConf()
|
||||||
def reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, week=None, from_date=None, to_date=None, rescale_idf=True, tf_family=tf_weight.MaxTF):
|
print(exclude_phrases)
|
||||||
print("loading tfidf", flush=True)
|
tfidf_weekly = spark.read.parquet(infile)
|
||||||
tfidf_ds = ds.dataset(infile)
|
|
||||||
|
# create the time interval
|
||||||
|
if from_date is not None:
|
||||||
|
if type(from_date) is str:
|
||||||
|
from_date = datetime.fromisoformat(from_date)
|
||||||
|
|
||||||
|
tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week >= from_date)
|
||||||
|
|
||||||
|
if to_date is not None:
|
||||||
|
if type(to_date) is str:
|
||||||
|
to_date = datetime.fromisoformat(to_date)
|
||||||
|
tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week < to_date)
|
||||||
|
|
||||||
|
tfidf = tfidf_weekly.groupBy(["subreddit","week", term_id, term]).agg(f.sum("tf").alias("tf"))
|
||||||
|
tfidf = _calc_tfidf(tfidf, term_colname, tf_weight.Norm05)
|
||||||
|
tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
|
||||||
|
tfidf = spark.read.parquet(tempdir.name)
|
||||||
|
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
|
||||||
|
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
||||||
|
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
||||||
|
return(tempdir, subreddit_names)
|
||||||
|
|
||||||
|
def reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False):
|
||||||
|
spark = SparkSession.builder.getOrCreate()
|
||||||
|
conf = spark.sparkContext.getConf()
|
||||||
|
print(exclude_phrases)
|
||||||
|
|
||||||
|
tfidf = spark.read.parquet(infile)
|
||||||
|
|
||||||
if included_subreddits is None:
|
if included_subreddits is None:
|
||||||
included_subreddits = select_topN_subreddits(topN)
|
included_subreddits = select_topN_subreddits(topN)
|
||||||
else:
|
else:
|
||||||
included_subreddits = set(map(str.strip,map(str.lower,open(included_subreddits))))
|
included_subreddits = set(map(str.strip,map(str.lower,open(included_subreddits))))
|
||||||
|
|
||||||
ds_filter = ds.field("subreddit").isin(included_subreddits)
|
if exclude_phrases == True:
|
||||||
|
tfidf = tfidf.filter(~f.col(term_colname).contains("_"))
|
||||||
|
|
||||||
if min_df is not None:
|
print("creating temporary parquet with matrix indices")
|
||||||
ds_filter &= ds.field("count") >= min_df
|
tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
|
||||||
|
|
||||||
if max_df is not None:
|
tfidf = spark.read.parquet(tempdir.name)
|
||||||
ds_filter &= ds.field("count") <= max_df
|
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
|
||||||
|
|
||||||
if week is not None:
|
|
||||||
ds_filter &= ds.field("week") == week
|
|
||||||
|
|
||||||
if from_date is not None:
|
|
||||||
ds_filter &= ds.field("week") >= from_date
|
|
||||||
|
|
||||||
if to_date is not None:
|
|
||||||
ds_filter &= ds.field("week") <= to_date
|
|
||||||
|
|
||||||
term = term_colname
|
|
||||||
term_id = term + '_id'
|
|
||||||
term_id_new = term + '_id_new'
|
|
||||||
|
|
||||||
projection = {
|
|
||||||
'subreddit_id':ds.field('subreddit_id'),
|
|
||||||
term_id:ds.field(term_id),
|
|
||||||
'relative_tf':ds.field("relative_tf").cast('float32')
|
|
||||||
}
|
|
||||||
|
|
||||||
if not rescale_idf:
|
|
||||||
projection = {
|
|
||||||
'subreddit_id':ds.field('subreddit_id'),
|
|
||||||
term_id:ds.field(term_id),
|
|
||||||
'relative_tf':ds.field('relative_tf').cast('float32'),
|
|
||||||
'tf_idf':ds.field('tf_idf').cast('float32')}
|
|
||||||
|
|
||||||
tfidf_ds = ds.dataset(infile)
|
|
||||||
|
|
||||||
df = tfidf_ds.to_table(filter=ds_filter,columns=projection)
|
|
||||||
|
|
||||||
df = df.to_pandas(split_blocks=True,self_destruct=True)
|
|
||||||
print("assigning indexes",flush=True)
|
|
||||||
df['subreddit_id_new'] = df.groupby("subreddit_id").ngroup()
|
|
||||||
grouped = df.groupby(term_id)
|
|
||||||
df[term_id_new] = grouped.ngroup()
|
|
||||||
|
|
||||||
if rescale_idf:
|
|
||||||
print("computing idf", flush=True)
|
|
||||||
df['new_count'] = grouped[term_id].transform('count')
|
|
||||||
N_docs = df.subreddit_id_new.max() + 1
|
|
||||||
df['idf'] = np.log(N_docs/(1+df.new_count),dtype='float32') + 1
|
|
||||||
if tf_family == tf_weight.MaxTF:
|
|
||||||
df["tf_idf"] = df.relative_tf * df.idf
|
|
||||||
else: # tf_fam = tf_weight.Norm05
|
|
||||||
df["tf_idf"] = (0.5 + 0.5 * df.relative_tf) * df.idf
|
|
||||||
|
|
||||||
print("assigning names")
|
|
||||||
subreddit_names = tfidf_ds.to_table(filter=ds_filter,columns=['subreddit','subreddit_id'])
|
|
||||||
batches = subreddit_names.to_batches()
|
|
||||||
|
|
||||||
with Pool(cpu_count()) as pool:
|
|
||||||
chunks = pool.imap_unordered(pull_names,batches)
|
|
||||||
subreddit_names = pd.concat(chunks,copy=False).drop_duplicates()
|
|
||||||
|
|
||||||
subreddit_names = subreddit_names.set_index("subreddit_id")
|
|
||||||
new_ids = df.loc[:,['subreddit_id','subreddit_id_new']].drop_duplicates()
|
|
||||||
new_ids = new_ids.set_index('subreddit_id')
|
|
||||||
subreddit_names = subreddit_names.join(new_ids,on='subreddit_id').reset_index()
|
|
||||||
subreddit_names = subreddit_names.drop("subreddit_id",1)
|
|
||||||
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
||||||
return(df, subreddit_names)
|
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
||||||
|
spark.stop()
|
||||||
|
return (tempdir, subreddit_names)
|
||||||
|
|
||||||
def pull_names(batch):
|
|
||||||
return(batch.to_pandas().drop_duplicates())
|
|
||||||
|
|
||||||
def similarities(infile, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, from_date=None, to_date=None, tfidf_colname='tf_idf'):
|
def similarities(infile, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
|
||||||
'''
|
'''
|
||||||
tfidf_colname: set to 'relative_tf' to use normalized term frequency instead of tf-idf, which can be useful for author-based similarities.
|
tfidf_colname: set to 'relative_tf' to use normalized term frequency instead of tf-idf, which can be useful for author-based similarities.
|
||||||
'''
|
'''
|
||||||
|
if from_date is not None or to_date is not None:
|
||||||
|
tempdir, subreddit_names = reindex_tfidf_time_interval(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases, from_date=from_date, to_date=to_date)
|
||||||
|
|
||||||
|
else:
|
||||||
|
tempdir, subreddit_names = reindex_tfidf(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases)
|
||||||
|
|
||||||
def proc_sims(sims, outfile):
|
print("loading matrix")
|
||||||
if issparse(sims):
|
# mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
|
||||||
sims = sims.todense()
|
mat = read_tfidf_matrix(tempdir.name, term_colname, tfidf_colname)
|
||||||
|
print(f'computing similarities on mat. mat.shape:{mat.shape}')
|
||||||
|
print(f"size of mat is:{mat.data.nbytes}")
|
||||||
|
sims = simfunc(mat)
|
||||||
|
del mat
|
||||||
|
|
||||||
print(f"shape of sims:{sims.shape}")
|
if issparse(sims):
|
||||||
print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}",flush=True)
|
sims = sims.todense()
|
||||||
sims = pd.DataFrame(sims)
|
|
||||||
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
|
|
||||||
sims['_subreddit'] = subreddit_names.subreddit.values
|
|
||||||
|
|
||||||
p = Path(outfile)
|
print(f"shape of sims:{sims.shape}")
|
||||||
|
print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}")
|
||||||
|
sims = pd.DataFrame(sims)
|
||||||
|
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
|
||||||
|
sims['subreddit'] = subreddit_names.subreddit.values
|
||||||
|
|
||||||
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
|
p = Path(outfile)
|
||||||
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
|
|
||||||
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
|
|
||||||
p.parent.mkdir(exist_ok=True, parents=True)
|
|
||||||
|
|
||||||
sims.to_feather(outfile)
|
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
|
||||||
|
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
|
||||||
|
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
|
||||||
|
|
||||||
|
sims.to_feather(outfile)
|
||||||
|
tempdir.cleanup()
|
||||||
|
|
||||||
|
def read_tfidf_matrix_weekly(path, term_colname, week, tfidf_colname='tf_idf'):
|
||||||
term = term_colname
|
term = term_colname
|
||||||
term_id = term + '_id'
|
term_id = term + '_id'
|
||||||
term_id_new = term + '_id_new'
|
term_id_new = term + '_id_new'
|
||||||
|
|
||||||
entries, subreddit_names = reindex_tfidf(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN,from_date=from_date,to_date=to_date)
|
dataset = ds.dataset(path,format='parquet')
|
||||||
mat = csr_matrix((entries[tfidf_colname],(entries[term_id_new], entries.subreddit_id_new)))
|
entries = dataset.to_table(columns=[tfidf_colname,'subreddit_id_new', term_id_new],filter=ds.field('week')==week).to_pandas()
|
||||||
|
return(csr_matrix((entries[tfidf_colname], (entries[term_id_new]-1, entries.subreddit_id_new-1))))
|
||||||
|
|
||||||
print("loading matrix")
|
def read_tfidf_matrix(path, term_colname, tfidf_colname='tf_idf'):
|
||||||
|
term = term_colname
|
||||||
# mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
|
term_id = term + '_id'
|
||||||
|
term_id_new = term + '_id_new'
|
||||||
print(f'computing similarities on mat. mat.shape:{mat.shape}')
|
dataset = ds.dataset(path,format='parquet')
|
||||||
print(f"size of mat is:{mat.data.nbytes}",flush=True)
|
print(f"tfidf_colname:{tfidf_colname}")
|
||||||
sims = simfunc(mat)
|
entries = dataset.to_table(columns=[tfidf_colname, 'subreddit_id_new',term_id_new]).to_pandas()
|
||||||
del mat
|
return(csr_matrix((entries[tfidf_colname],(entries[term_id_new]-1, entries.subreddit_id_new-1))))
|
||||||
|
|
||||||
if hasattr(sims,'__next__'):
|
|
||||||
for simmat, name in sims:
|
|
||||||
proc_sims(simmat, Path(outfile)/(str(name) + ".feather"))
|
|
||||||
else:
|
|
||||||
proc_sims(sims, outfile)
|
|
||||||
|
|
||||||
def write_weekly_similarities(path, sims, week, names):
|
def write_weekly_similarities(path, sims, week, names):
|
||||||
sims['week'] = week
|
sims['week'] = week
|
||||||
p = pathlib.Path(path)
|
p = pathlib.Path(path)
|
||||||
if not p.is_dir():
|
if not p.is_dir():
|
||||||
p.mkdir(exist_ok=True,parents=True)
|
p.mkdir()
|
||||||
|
|
||||||
# reformat as a pairwise list
|
# reformat as a pairwise list
|
||||||
sims = sims.melt(id_vars=['_subreddit','week'],value_vars=names.subreddit.values)
|
sims = sims.melt(id_vars=['subreddit','week'],value_vars=names.subreddit.values)
|
||||||
sims.to_parquet(p / week.isoformat())
|
sims.to_parquet(p / week.isoformat())
|
||||||
|
|
||||||
def column_overlaps(mat):
|
def column_overlaps(mat):
|
||||||
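The removed `reindex_tfidf` code in this hunk recomputes idf after filtering and then combines it with `relative_tf` under one of two weighting families (`tf_weight.MaxTF` or `tf_weight.Norm05`). The formulas can be sketched in plain Python as below. The function names `smoothed_idf` and `tf_idf` and the string `family` argument are assumptions for illustration; the original operates on pandas columns and the `tf_weight` enum.

```python
import math


def smoothed_idf(n_docs, doc_count):
    """idf as in the diff: log(N_docs / (1 + new_count)) + 1."""
    return math.log(n_docs / (1 + doc_count)) + 1


def tf_idf(relative_tf, idf, family="MaxTF"):
    """Combine a relative term frequency (tf / max tf in the subreddit) with idf.

    family is a hypothetical string stand-in for the tf_weight enum.
    """
    if family == "MaxTF":
        return relative_tf * idf
    # Norm05: dampen the term frequency so that rare terms keep some weight.
    return (0.5 + 0.5 * relative_tf) * idf


idf = smoothed_idf(n_docs=10, doc_count=4)
print(tf_idf(0.5, idf, "MaxTF"), tf_idf(0.5, idf, "Norm05"))
```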
@@ -175,62 +150,136 @@ def column_overlaps(mat):
|
|||||||
|
|
||||||
return intersection / den
|
return intersection / den
|
||||||
|
|
||||||
def test_lsi_sims():
|
def column_similarities(mat):
|
||||||
term = "term"
|
norm = np.matrix(np.power(mat.power(2).sum(axis=0),0.5,dtype=np.float32))
|
||||||
|
mat = mat.multiply(1/norm)
|
||||||
|
sims = mat.T @ mat
|
||||||
|
return(sims)
|
||||||
|
|
||||||
|
|
||||||
|
def prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df, included_subreddits):
|
||||||
|
term = term_colname
|
||||||
term_id = term + '_id'
|
term_id = term + '_id'
|
||||||
term_id_new = term + '_id_new'
|
term_id_new = term + '_id_new'
|
||||||
|
|
||||||
t1 = time.perf_counter()
|
if min_df is None:
|
||||||
entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet",
|
min_df = 0.1 * len(included_subreddits)
|
||||||
term_colname='term',
|
tfidf = tfidf.filter(f.col('count') >= min_df)
|
||||||
min_df=2000,
|
if max_df is not None:
|
||||||
topN=10000
|
tfidf = tfidf.filter(f.col('count') <= max_df)
|
||||||
)
|
|
||||||
t2 = time.perf_counter()
|
|
||||||
print(f"first load took:{t2 - t1}s")
|
|
||||||
|
|
||||||
entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet",
|
tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
|
||||||
term_colname='term',
|
|
||||||
min_df=2000,
|
|
||||||
topN=10000
|
|
||||||
)
|
|
||||||
t3=time.perf_counter()
|
|
||||||
|
|
||||||
print(f"second load took:{t3 - t2}s")
|
# we might not have the same terms or subreddits each week, so we need to make unique ids for each week.
|
||||||
|
sub_ids = tfidf.select(['subreddit_id','week']).distinct()
|
||||||
|
sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.partitionBy('week').orderBy("subreddit_id")))
|
||||||
|
tfidf = tfidf.join(sub_ids,['subreddit_id','week'])
|
||||||
|
|
||||||
mat = csr_matrix((entries['tf_idf'],(entries[term_id_new], entries.subreddit_id_new)))
|
# only use terms in at least min_df included subreddits in a given week
|
||||||
sims = list(lsi_column_similarities(mat, [10,50]))
|
new_count = tfidf.groupBy([term_id,'week']).agg(f.count(term_id).alias('new_count'))
|
||||||
sims_og = sims
|
tfidf = tfidf.join(new_count,[term_id,'week'],how='inner')
|
||||||
sims_test = list(lsi_column_similarities(mat,[10,50],algorithm='randomized',n_iter=10))
|
|
||||||
|
|
||||||
# n_components is the latent dimensionality. sklearn recommends 100. More might be better
|
# reset the term ids
|
||||||
# if n_components is a list we'll return a list of similarities with different latent dimensionalities
|
term_ids = tfidf.select([term_id,'week']).distinct()
|
||||||
# if algorithm is 'randomized' instead of 'arpack' then n_iter gives the number of iterations.
|
term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.partitionBy('week').orderBy(term_id)))
|
||||||
# this function takes the svd and then the column similarities of it
|
tfidf = tfidf.join(term_ids,[term_id,'week'])
|
||||||
def lsi_column_similarities(tfidfmat,n_components=300,n_iter=10,random_state=1968,algorithm='randomized'):
|
|
||||||
# first compute the lsi of the matrix
|
|
||||||
# then take the column similarities
|
|
||||||
print("running LSI",flush=True)
|
|
||||||
|
|
||||||
if type(n_components) is int:
|
tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
|
||||||
n_components = [n_components]
|
tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
|
||||||
|
|
||||||
n_components = sorted(n_components,reverse=True)
|
tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
|
||||||
|
|
||||||
svd_components = n_components[0]
|
tfidf = tfidf.repartition('week')
|
||||||
svd = TruncatedSVD(n_components=svd_components,random_state=random_state,algorithm=algorithm,n_iter=n_iter)
|
|
||||||
mod = svd.fit(tfidfmat.T)
|
tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
|
||||||
lsimat = mod.transform(tfidfmat.T)
|
return(tempdir)
|
||||||
for n_dims in n_components:
|
|
||||||
sims = column_similarities(lsimat[:,np.arange(n_dims)])
|
|
||||||
if len(n_components) > 1:
|
|
||||||
yield (sims, n_dims)
|
|
||||||
else:
|
|
||||||
return sims
|
|
||||||
|
|
||||||
|
|
||||||
def column_similarities(mat):
|
def prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits):
|
||||||
return 1 - pairwise_distances(mat,metric='cosine')
|
term = term_colname
|
||||||
|
term_id = term + '_id'
|
||||||
|
term_id_new = term + '_id_new'
|
||||||
|
|
||||||
|
if min_df is None:
|
||||||
|
min_df = 0.1 * len(included_subreddits)
|
||||||
|
tfidf = tfidf.filter(f.col('count') >= min_df)
|
||||||
|
if max_df is not None:
|
||||||
|
tfidf = tfidf.filter(f.col('count') <= max_df)
|
||||||
|
|
||||||
|
tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
|
||||||
|
|
||||||
|
# reset the subreddit ids
|
||||||
|
sub_ids = tfidf.select('subreddit_id').distinct()
|
||||||
|
sub_ids = sub_ids.withColumn("subreddit_id_new", f.row_number().over(Window.orderBy("subreddit_id")))
|
||||||
|
tfidf = tfidf.join(sub_ids,'subreddit_id')
|
||||||
|
|
||||||
|
# only use terms in at least min_df included subreddits
|
||||||
|
new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
|
||||||
|
tfidf = tfidf.join(new_count,term_id,how='inner')
|
||||||
|
|
||||||
|
# reset the term ids
|
||||||
|
term_ids = tfidf.select([term_id]).distinct()
|
||||||
|
term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
|
||||||
|
tfidf = tfidf.join(term_ids,term_id)
|
||||||
|
|
||||||
|
tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
|
||||||
|
tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
|
||||||
|
|
||||||
|
tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
|
||||||
|
|
||||||
|
tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
|
||||||
|
return tempdir
|
||||||
|
|
||||||
|
|
||||||
|
# try computing cosine similarities using spark
|
||||||
|
def spark_cosine_similarities(tfidf, term_colname, min_df, included_subreddits, similarity_threshold):
|
||||||
|
term = term_colname
|
||||||
|
term_id = term + '_id'
|
||||||
|
term_id_new = term + '_id_new'
|
||||||
|
|
||||||
|
if min_df is None:
|
||||||
|
min_df = 0.1 * len(included_subreddits)
|
||||||
|
|
||||||
|
tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
|
||||||
|
tfidf = tfidf.cache()
|
||||||
|
|
||||||
|
# reset the subreddit ids
|
||||||
|
sub_ids = tfidf.select('subreddit_id').distinct()
|
||||||
|
sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.orderBy("subreddit_id")))
|
||||||
|
tfidf = tfidf.join(sub_ids,'subreddit_id')
|
||||||
|
|
||||||
|
# only use terms in at least min_df included subreddits
|
||||||
|
new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
|
||||||
|
tfidf = tfidf.join(new_count,term_id,how='inner')
|
||||||
|
|
||||||
|
# reset the term ids
|
||||||
|
term_ids = tfidf.select([term_id]).distinct()
|
||||||
|
term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
|
||||||
|
tfidf = tfidf.join(term_ids,term_id)
|
||||||
|
|
||||||
|
tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
|
||||||
|
tfidf = tfidf.withColumn("tf_idf", tfidf.relative_tf * tfidf.idf)
|
||||||
|
|
||||||
|
# step 1 make an rdd of entries
|
||||||
|
# sorted by (dense) spark subreddit id
|
||||||
|
n_partitions = int(len(included_subreddits)*2 / 5)
|
||||||
|
|
||||||
|
entries = tfidf.select(f.col(term_id_new)-1,f.col("subreddit_id_new")-1,"tf_idf").rdd.repartition(n_partitions)
|
||||||
|
|
||||||
|
# put roughly 10 subreddits in each partition
|
||||||
|
|
||||||
|
# step 2 make it into a distributed.RowMatrix
|
||||||
|
coordMat = CoordinateMatrix(entries)
|
||||||
|
|
||||||
|
coordMat = CoordinateMatrix(coordMat.entries.repartition(n_partitions))
|
||||||
|
|
||||||
|
# this needs to be an IndexedRowMatrix()
|
||||||
|
mat = coordMat.toRowMatrix()
|
||||||
|
|
||||||
|
# goal: build a matrix with subreddit columns and tf-idf rows
|
||||||
|
sim_dist = mat.columnSimilarities(threshold=similarity_threshold)
|
||||||
|
|
||||||
|
return (sim_dist, tfidf)
|
||||||
|
|
||||||
|
|
||||||
def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
|
def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
|
||||||
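The `column_similarities` function added in this hunk scales each column of the term-by-subreddit matrix to unit length and then computes `mat.T @ mat`, which gives the cosine similarity between every pair of columns. A dependency-free sketch of the same idea on a small dense matrix follows. It is a plain-Python stand-in for the sparse implementation, and the zero-norm guard is an added assumption that the original does not include.

```python
import math


def column_similarities(rows):
    """Cosine similarity between all pairs of columns of a dense row-major matrix."""
    n_cols = len(rows[0])
    # Euclidean norm of each column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n_cols)]
    # Scale columns to unit length; all-zero columns stay zero (assumed guard).
    unit = [
        [row[j] / norms[j] if norms[j] else 0.0 for j in range(n_cols)]
        for row in rows
    ]
    # Equivalent of unit.T @ unit: dot product of every pair of columns.
    return [
        [sum(r[i] * r[j] for r in unit) for j in range(n_cols)]
        for i in range(n_cols)
    ]


# Two terms (rows) by three subreddits (columns).
sims = column_similarities([[1.0, 0.0, 1.0], [0.0, 2.0, 1.0]])
print(sims)
```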
@@ -282,9 +331,7 @@ def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weig
|
|||||||
else: # tf_fam = tf_weight.Norm05
|
else: # tf_fam = tf_weight.Norm05
|
||||||
df = df.withColumn("tf_idf", (0.5 + 0.5 * df.relative_tf) * df.idf)
|
df = df.withColumn("tf_idf", (0.5 + 0.5 * df.relative_tf) * df.idf)
|
||||||
|
|
||||||
df = df.repartition(400,'subreddit','week')
|
return df
|
||||||
dfwriter = df.write.partitionBy("week").sortBy("subreddit")
|
|
||||||
return dfwriter
|
|
||||||
|
|
||||||
def _calc_tfidf(df, term_colname, tf_family):
|
def _calc_tfidf(df, term_colname, tf_family):
|
||||||
term = term_colname
|
term = term_colname
|
||||||
@@ -295,7 +342,7 @@ def _calc_tfidf(df, term_colname, tf_family):
|
|||||||
|
|
||||||
df = df.join(max_subreddit_terms, on='subreddit')
|
df = df.join(max_subreddit_terms, on='subreddit')
|
||||||
|
|
||||||
df = df.withColumn("relative_tf", (df.tf / df.sr_max_tf))
|
df = df.withColumn("relative_tf", df.tf / df.sr_max_tf)
|
||||||
|
|
||||||
# group by term. term is unique
|
# group by term. term is unique
|
||||||
idf = df.groupby([term]).count()
|
idf = df.groupby([term]).count()
|
||||||
@@ -338,28 +385,10 @@ def build_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm
|
|||||||
df = df.groupBy(['subreddit',term]).agg(f.sum('tf').alias('tf'))
|
df = df.groupBy(['subreddit',term]).agg(f.sum('tf').alias('tf'))
|
||||||
|
|
||||||
df = _calc_tfidf(df, term_colname, tf_family)
|
df = _calc_tfidf(df, term_colname, tf_family)
|
||||||
df = df.repartition('subreddit')
|
|
||||||
dfwriter = df.write.sortBy("subreddit","tf")
|
return df
|
||||||
return dfwriter
|
|
||||||
|
|
||||||
def select_topN_subreddits(topN, path="/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments_nonsfw.csv"):
|
def select_topN_subreddits(topN, path="/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments_nonsfw.csv"):
|
||||||
rankdf = pd.read_csv(path)
|
rankdf = pd.read_csv(path)
|
||||||
included_subreddits = set(rankdf.loc[rankdf.comments_rank <= topN,'subreddit'].values)
|
included_subreddits = set(rankdf.loc[rankdf.comments_rank <= topN,'subreddit'].values)
|
||||||
return included_subreddits
|
return included_subreddits
|
||||||
|
|
||||||
|
|
||||||
def repartition_tfidf(inpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet",
|
|
||||||
outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet"):
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
df = spark.read.parquet(inpath)
|
|
||||||
df = df.repartition(400,'subreddit')
|
|
||||||
df.write.parquet(outpath,mode='overwrite')
|
|
||||||
|
|
||||||
|
|
||||||
def repartition_tfidf_weekly(inpath="/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet",
|
|
||||||
outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_repartitioned.parquet"):
|
|
||||||
spark = SparkSession.builder.getOrCreate()
|
|
||||||
df = spark.read.parquet(inpath)
|
|
||||||
df = df.repartition(400,'subreddit','week')
|
|
||||||
dfwriter = df.write.partitionBy("week")
|
|
||||||
dfwriter.parquet(outpath,mode='overwrite')
|
|
||||||
|
|||||||
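`prep_tfidf_entries` above assigns new dense ids with `row_number()` over an ordered window, which counts from 1, and `read_tfidf_matrix` later subtracts 1 to get 0-based sparse matrix coordinates. The sketch below mirrors that reindexing in plain Python. The helper `dense_ids` and the sample `entries` are hypothetical and only illustrate the id mapping, not the Spark code.

```python
def dense_ids(raw_ids):
    """Map arbitrary ids to 1..K in sorted order, like row_number() over an ordered window."""
    return {raw: rank for rank, raw in enumerate(sorted(set(raw_ids)), start=1)}


# Hypothetical (subreddit_id, term_id, tf_idf) entries left after filtering.
entries = [(40, 7, 0.5), (40, 9, 0.25), (12, 7, 1.0)]

sub_new = dense_ids(s for s, _, _ in entries)
term_new = dense_ids(t for _, t, _ in entries)

# Shift to 0-based (row=term, col=subreddit) coordinates for a sparse matrix.
coords = [(term_new[t] - 1, sub_new[s] - 1, v) for s, t, v in entries]
print(coords)
```

Dense ids matter because the sparse matrix shape is inferred from the largest index, so gaps left by filtered-out subreddits or terms would otherwise become empty rows and columns.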
@@ -15,9 +15,10 @@ def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_
|
|||||||
else:
|
else:
|
||||||
include_subs = select_topN_subreddits(topN)
|
include_subs = select_topN_subreddits(topN)
|
||||||
|
|
||||||
dfwriter = func(df, include_subs, term_colname)
|
df = func(df, include_subs, term_colname)
|
||||||
|
|
||||||
|
df.write.parquet(outpath,mode='overwrite',compression='snappy')
|
||||||
|
|
||||||
dfwriter.parquet(outpath,mode='overwrite',compression='snappy')
|
|
||||||
spark.stop()
|
spark.stop()
|
||||||
|
|
||||||
def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits):
|
def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits):
|
||||||
|
|||||||
@@ -3,78 +3,78 @@ from pyspark.sql import SparkSession
|
|||||||
from pyspark.sql import Window
|
from pyspark.sql import Window
|
||||||
import numpy as np
|
import numpy as np
|
||||||
import pyarrow
|
import pyarrow
|
||||||
import pyarrow.dataset as ds
|
|
||||||
import pandas as pd
|
import pandas as pd
|
||||||
import fire
|
import fire
|
||||||
from itertools import islice, chain
|
from itertools import islice
|
||||||
from pathlib import Path
|
from pathlib import Path
|
||||||
from similarities_helper import *
|
from similarities_helper import *
|
||||||
from multiprocessing import Pool, cpu_count
|
from multiprocessing import Pool, cpu_count
|
||||||
from functools import partial
|
|
||||||
|
|
||||||
|
def _week_similarities(tempdir, term_colname, week):
|
||||||
|
print(f"loading matrix: {week}")
|
||||||
|
mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
|
||||||
|
print('computing similarities')
|
||||||
|
sims = column_similarities(mat)
|
||||||
|
del mat
|
||||||
|
|
||||||
def _week_similarities(week, simfunc, tfidf_path, term_colname, min_df, max_df, included_subreddits, topN, outdir:Path):
|
names = subreddit_names.loc[subreddit_names.week == week]
|
||||||
term = term_colname
|
sims = pd.DataFrame(sims.todense())
|
||||||
term_id = term + '_id'
|
|
||||||
term_id_new = term + '_id_new'
|
|
||||||
print(f"loading matrix: {week}")
|
|
||||||
entries, subreddit_names = reindex_tfidf(infile = tfidf_path,
|
|
||||||
term_colname=term_colname,
|
|
||||||
min_df=min_df,
|
|
||||||
max_df=max_df,
|
|
||||||
included_subreddits=included_subreddits,
|
|
||||||
topN=topN,
|
|
||||||
week=week)
|
|
||||||
mat = csr_matrix((entries[tfidf_colname],(entries[term_id_new], entries.subreddit_id_new)))
|
|
||||||
print('computing similarities')
|
|
||||||
sims = column_similarities(mat)
|
|
||||||
del mat
|
|
||||||
sims = pd.DataFrame(sims.todense())
|
|
||||||
sims = sims.rename({i: sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
|
|
||||||
sims['_subreddit'] = names.subreddit.values
|
|
||||||
outfile = str(Path(outdir) / str(week))
|
|
||||||
write_weekly_similarities(outfile, sims, week, names)
|
|
||||||
|
|
||||||
def pull_weeks(batch):
|
sims = sims.rename({i: sr for i, sr in enumerate(names.subreddit.values)}, axis=1)
|
||||||
return set(batch.to_pandas()['week'])
|
sims['_subreddit'] = names.subreddit.values
|
||||||
|
|
||||||
|
write_weekly_similarities(outfile, sims, week, names)
|
||||||
|
|
||||||
#tfidf = spark.read.parquet('/gscratch/comdata/users/nathante/subreddit_tfidf_weekly.parquet')
|
#tfidf = spark.read.parquet('/gscratch/comdata/users/nathante/subreddit_tfidf_weekly.parquet')
|
||||||
def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, max_df=None, included_subreddits = None, topN = 500):
|
def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
|
||||||
|
spark = SparkSession.builder.getOrCreate()
|
||||||
|
conf = spark.sparkContext.getConf()
|
||||||
print(outfile)
|
print(outfile)
|
||||||
tfidf_ds = ds.dataset(tfidf_path)
|
tfidf = spark.read.parquet(tfidf_path)
|
||||||
tfidf_ds = tfidf_ds.to_table(columns=["week"])
|
|
||||||
batches = tfidf_ds.to_batches()
|
if included_subreddits is None:
|
||||||
|
included_subreddits = select_topN_subreddits(topN)
|
||||||
|
else:
|
||||||
|
included_subreddits = set(map(str.strip,map(str.lower,open(included_subreddits))))
|
||||||
|
|
||||||
with Pool(cpu_count()) as pool:
|
print(f"computing weekly similarities for {len(included_subreddits)} subreddits")
|
||||||
weeks = set(chain( * pool.imap_unordered(pull_weeks,batches)))
|
|
||||||
|
|
||||||
weeks = sorted(weeks)
|
print("creating temporary parquet with matrix indices")
|
||||||
|
tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df=None, included_subreddits=included_subreddits)
|
||||||
|
|
||||||
|
tfidf = spark.read.parquet(tempdir.name)
|
||||||
|
|
||||||
|
# the ids can change each week.
|
||||||
|
subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
|
||||||
|
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
|
||||||
|
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
|
||||||
|
spark.stop()
|
||||||
|
|
||||||
|
weeks = sorted(list(subreddit_names.week.drop_duplicates()))
|
||||||
# do this step in parallel if we have the memory for it.
|
# do this step in parallel if we have the memory for it.
|
||||||
# should be doable with pool.map
|
# should be doable with pool.map
|
||||||
|
|
||||||
print(f"computing weekly similarities")
|
def week_similarities_helper(week):
|
||||||
week_similarities_helper = partial(_week_similarities,simfunc=column_similarities, tfidf_path=tfidf_path, term_colname=term_colname, outdir=outfile, min_df=min_df,max_df=max_df,included_subreddits=included_subreddits,topN=topN)
|
_week_similarities(tempdir, term_colname, week)
|
||||||
|
|
||||||
with Pool(cpu_count()) as pool: # maybe it can be done with 40 cores on the huge machine?
|
with Pool(cpu_count()) as pool: # maybe it can be done with 40 cores on the huge machine?
|
||||||
list(pool.map(week_similarities_helper,weeks))
|
list(pool.map(week_similarities_helper,weeks))
|
||||||
|
|
||||||
def author_cosine_similarities_weekly(outfile, min_df=2, max_df=None, included_subreddits=None, topN=500):
|
def author_cosine_similarities_weekly(outfile, min_df=2 , included_subreddits=None, topN=500):
|
||||||
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
|
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
|
||||||
outfile,
|
outfile,
|
||||||
'author',
|
'author',
|
||||||
min_df,
|
min_df,
|
||||||
max_df,
|
|
||||||
included_subreddits,
|
included_subreddits,
|
||||||
topN)
|
topN)
|
||||||
|
|
||||||
def term_cosine_similarities_weekly(outfile, min_df=None, max_df=None, included_subreddits=None, topN=500):
|
def term_cosine_similarities_weekly(outfile, min_df=None, included_subreddits=None, topN=500):
|
||||||
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
|
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
|
||||||
outfile,
|
outfile,
|
||||||
'term',
|
'term',
|
||||||
min_df,
|
min_df,
|
||||||
max_df,
|
included_subreddits,
|
||||||
included_subreddits,
|
topN)
|
||||||
topN)
|
|
||||||
|
|
||||||
if __name__ == "__main__":
|
if __name__ == "__main__":
|
||||||
fire.Fire({'authors':author_cosine_similarities_weekly,
|
fire.Fire({'authors':author_cosine_similarities_weekly,
|
||||||
|
|||||||
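The hunk above shows `cosine_similarities_weekly` collecting the distinct weeks, binding the shared arguments into `week_similarities_helper`, and mapping that helper over the weeks. One side does the binding with `functools.partial`, which keeps the callable picklable for `pool.map`; a function defined inside another function generally is not. The per-week work itself (`_week_similarities`, `column_similarities`) is not visible in this hunk, so the following is only a minimal sketch of the same shape under stated assumptions: `ROWS` and `week_similarities` are hypothetical stand-ins, the toy tf-idf values are made up, and the built-in `map` replaces `pool.map` so the example runs anywhere.

```python
from collections import defaultdict
from functools import partial
from math import sqrt

# Hypothetical toy rows: (week, subreddit, term, tf_idf). Not real data.
ROWS = [
    ("2020-01-06", "a", "x", 1.0),
    ("2020-01-06", "a", "y", 1.0),
    ("2020-01-06", "b", "x", 1.0),
    ("2020-01-13", "a", "x", 2.0),
    ("2020-01-13", "b", "x", 3.0),
]


def week_similarities(week, rows):
    """Return {(sub_i, sub_j): cosine similarity} for one week of rows.

    Hypothetical stand-in for the per-week step; assumes every subreddit
    has at least one non-zero weight, so no norm is zero.
    """
    # Build one sparse term vector per subreddit for the requested week.
    vectors = defaultdict(dict)
    for row_week, subreddit, term, weight in rows:
        if row_week == week:
            vectors[subreddit][term] = weight

    norms = {s: sqrt(sum(w * w for w in v.values())) for s, v in vectors.items()}

    sims = {}
    names = sorted(vectors)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dot = sum(w * vectors[b].get(t, 0.0) for t, w in vectors[a].items())
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims


# Distinct weeks, sorted, as in the diff.
weeks = sorted({row[0] for row in ROWS})

# Bind the shared argument once; only `week` varies per call.
helper = partial(week_similarities, rows=ROWS)

# The diff uses pool.map(helper, weeks); the built-in map is swapped in here.
weekly = dict(zip(weeks, map(helper, weeks)))
```

Because `helper` is a `partial` over a module-level function, replacing `map` with `multiprocessing.Pool.map` should only require moving the last three statements under an `if __name__ == "__main__":` guard.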