14 Commits

Author SHA1 Message Date
0ea57b2377 datasets/add_months.sh: fail on leftover files, add --clean to wipe them
Without --clean, the script now exits with a clear error if temp or
staging directories from a previous run exist. Pass --clean to remove
them automatically before starting. README example updated to include
the flag.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:10:16 -07:00
6c6e05c360 datasets/README.md: document srun workflow, PYTHON var, container notes
Update the add_months and Step 6 sections with lessons learned from the
first run attempt:
- Replace salloc with srun (releases node automatically on completion)
- Document the PYTHON variable override needed for parallel/venv
- Note that .zst decompression uses the zstandard library due to
  Singularity container restrictions on the system zstd binary
- Add full srun invocation with bash -l, tee logging, and tmux guidance
- Update Step 6 walkthrough to use srun instead of salloc

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:05:45 -07:00
4854d4f537 datasets/add_months.sh: run comments and submissions Part 1 together
Combine task lists and run a single parallel call so all 32 files
(16 comments + 16 submissions) parse simultaneously.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:51:56 -07:00
bf6ccbc84a datasets/helper.py: use zstandard library for .zst decompression
The Python environment runs inside a Singularity container that cannot
exec the host's /usr/bin/zstd via subprocess. Replace the subprocess
call with the zstandard Python library, which was already a dependency.
Other formats (bz2, xz, gz) still use subprocess as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:48:04 -07:00
18925dfe5b datasets/: add PYTHON variable to add_months scripts
GNU parallel spawns fresh shells that don't inherit the active venv.
Using an explicit PYTHON path ensures the right interpreter is used in
parallel tasks. Defaults to python3 but can be overridden:

  PYTHON=/path/to/venv/bin/python3 ./add_months.sh ...

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:42:05 -07:00
926e9bc364 datasets/: fat-node add_months.sh; multinode variant as separate script
add_months.sh now targets a single fat node directly: starts a local
Spark cluster via start_spark_cluster.sh, submits jobs, stops the
cluster. No salloc needed.

add_months_multinode.sh is a new script for the multi-node case using
start_spark_and_run.sh from a login node. Usage takes NODES as first arg.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:35:12 -07:00
526dc03732 datasets/add_months.sh: stop before copy step to force manual verification
The script now exits after Part 2 so the copy and cleanup commands must
be run manually. This prevents the live datasets from being touched
without a deliberate verification step in between.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:22:03 -07:00
6b18840604 datasets/: stage new layer before touching live datasets in add_months
Replace mode='append'-direct-to-live approach with a safer staging
workflow: Part 2 writes the new sorted layer to temp staging directories,
the user verifies, then a separate copy step adds the files to the live
datasets. Live datasets are never touched until the copy step, and the
copy only adds files — nothing is deleted or overwritten.

- sort_and_write gains out_by_subreddit/out_by_author params (replaces
  mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
  and find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:17:38 -07:00
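
The add-only copy step described in this commit can be illustrated in Python. This is a minimal sketch and not the script's `find`-based implementation: the helper name `copy_new_files` and the directory names in the usage note are hypothetical. It checks for collisions before copying anything, so a failure leaves the live dataset untouched.

```python
import shutil
from pathlib import Path


def copy_new_files(staging_dir, live_dir):
    """Copy every file under staging_dir into live_dir, never overwriting.

    Hypothetical helper: it refuses to run if any destination file already
    exists, so the live dataset only ever gains files.
    """
    staging = Path(staging_dir)
    live = Path(live_dir)
    sources = sorted(p for p in staging.rglob("*") if p.is_file())

    # Look for collisions first so nothing is copied if the check fails.
    collisions = [p for p in sources if (live / p.relative_to(staging)).exists()]
    if collisions:
        raise FileExistsError(f"{len(collisions)} file(s) already exist in {live}")

    copied = []
    for src in sources:
        dest = live / src.relative_to(staging)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Calling `copy_new_files("staging/by_subreddit", "live/by_subreddit")` (illustrative paths) would add the staged partition files next to the existing ones and raise `FileExistsError` on a second run.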
2d1d760142 datasets/: replace add_new_month with layered append workflow
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
  build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:59:36 -07:00
1851132a06 move dataset + similarity docs from wiki into repo READMEs
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:

- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
  prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
  adapted to the new script names and dropping obsolete notes about
  pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
  section — TF-IDF math, PMI phrase detection, cosine similarity — with
  MediaWiki math converted to markdown LaTeX and script references
  updated to current paths.

The wiki page has been trimmed to a landing page that points at these
README files in gitea.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
33150243cd datasets/: split parquet scripts; share logic in dumps_helper.py
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00
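
The field-handler pattern and the `time_edited` fix described in this commit can be sketched as follows. This is an illustration under stated assumptions, not the code in `dumps_helper.py`: it uses the standard-library `json` module instead of simdjson, the `FIELDS` list is a hypothetical shortened one, and it assumes the raw `edited` value is either `false` or a Unix timestamp. The point is that every row, including an error row, has exactly one element per schema column.

```python
import json
from datetime import datetime, timezone

# Hypothetical, shortened field list; the real FIELDS lists are longer.
# It names every schema column, including time_edited.
FIELDS = ["id", "author", "edited", "time_edited"]


def _edited(value):
    """Expand the raw `edited` value into an (edited, time_edited) pair.

    Assumption: `edited` is either false/missing or a Unix timestamp.
    """
    if value is None or isinstance(value, bool):
        return (bool(value), None)
    return (True, datetime.fromtimestamp(float(value), tz=timezone.utc))


# Handlers keyed by field name; fields without a handler are copied as-is.
HANDLERS = {"edited": _edited}


def parse_record(line):
    """Parse one JSON line into a row with exactly len(FIELDS) elements."""
    try:
        data = json.loads(line)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        # Error rows are padded to the full schema width.
        return [None] * len(FIELDS)

    row = []
    for field in FIELDS:
        if field == "time_edited":
            continue  # supplied by the tuple returned from the `edited` handler
        value = data.get(field)
        handler = HANDLERS.get(field)
        if handler is None:
            row.append(value)
        else:
            result = handler(value)
            # A tuple result expands into several columns.
            row.extend(result if isinstance(result, tuple) else [result])
    return row
```

With this layout, adding a field means adding its name to `FIELDS` and, if it needs conversion, one entry in `HANDLERS`.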
8965a251b6 refactor datasets/ pipeline; add build/add-month workflows
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source part
rolling; one parquet per input zst, since Spark handles big parquet
files via internal row groups.

Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.

Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:30:54 -07:00
d201930951 rewrite README, remove dead pushshift scripts and old/
Pushshift's files.pushshift.io archive is gone since Reddit cut off
third-party API access in 2023, so the dumps/ pull and SHA-check scripts
no longer work. The old/ directory of pre-refactor scripts was likewise
superseded by current versions in similarities/.

README rewritten to credit Nate as original developer, name current
maintainers, document the directory layout, point at the CDSC wiki for
the ArcticShift/torrent-based workflow, fix several stale script paths,
and correct an incorrect tf-normalization formula (max, not sum).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 15:53:33 -07:00
53f5b8c03c add note to try other tf normalization strategies. 2022-03-31 12:17:16 -07:00
96 changed files with 2777 additions and 2598 deletions

246
README.md

@@ -2,51 +2,111 @@
title: Utilities for Reddit Data Science title: Utilities for Reddit Data Science
--- ---
`cdsc_reddit` is a collection of tools for working with Reddit data on the
Hyak super computing system at the University of Washington. It is built
around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
pipelines scale to the full Pushshift archive.
The reddit_cdsc project contains tools for working with Reddit data. The project is designed for the hyak super computing system at The University of Washington. It consists of a set of python and bash scripts and uses the [Pyspark](https://spark.apache.org/docs/latest/api/python/index.html "Pyspark documentation") and [pyarrow](https://arrow.apache.org/docs/python/ "documentation of python arrow bindings") to process large datasets. As of November 1st 2020, the project is under active development by [Nate TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Washington.29 "Nate's profile on the Community Data Science Collective Wiki") and provides scripts for: The project was originally developed by [Nate
TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
and is now maintained by a rotating set of researchers in the Community
Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
others.
- Pulling and updating dumps from [Pushshift](https://pushshift.io "Pushshift.io") in `pull_pushshift_comments.sh` and `pull_pushshift_submissions.sh`. At a high level, the codebase covers four kinds of work:
- Uncompressing and parsing the dumps into [Parquet](https://parquet.apache.org/ "apahce parquet website") [datasets](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
- Running text analysis based on [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") including
- Extracting terms from Reddit comments in `tf_comments.py`
- Detecting common phrases based on [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) "Wikipedia article on pointwise mutual information")
- Building TF-IDF vectors for each subreddit `idf_comments.py` and (more experimentally) at the subreddit-week level `idf_comments_weekly.py`
- Computing cosine similarities between subreddits based on TF-IDF `term_cosine_similarity.py`.
Right now, two steps are still in earlier stages of progress: - **Ingest.** Turning Pushshift comment and submission dumps into
partitioned Parquet datasets that are fast to query by subreddit or by
author.
- **Text features.** Building per-subreddit TF-IDF vectors over comment
text, including a phrase-detection pass based on pointwise mutual
information.
- **Similarity, clustering, and density.** Computing cosine similarities
between subreddits (by terms or by overlapping authors), clustering the
resulting similarity matrices, and summarizing how dense each
neighborhood is.
- **Time series and visualization.** Pulling activity time series per
subreddit and producing t-SNE plots of the clustering output.
- Approach comparable to tf-idf for similarity between subreddits in terms of comment authors. Several pieces are still rough — the user interfaces for many of the
- Clustering subreddits based on cosine-similarities using [power iteration clustering (PIC)](http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf "Paper on power iteration clustering") scripts assume familiarity with the project, and the TF-IDF pipeline does
not yet strip hyperlinks or bot comments, so subreddits with similar
automod messages can look misleadingly similar.
The TF-IDF for comments still has some kinks to iron out to remove hyper links and bot comments. Right now subreddits that have similar automoderation messages appear very similar. ## Repository layout
The user interfaces for most of the scripts are pretty crappy and need to be refined for re-use by others. | Directory | What's in it |
|---|---|
| `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
| `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
| `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
| `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
| `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
| `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
| `visualization/` | Altair-based interactive plots of subreddit clusters. |
| `bots/` | Heuristics for flagging likely bot accounts. |
| `examples/` | Small standalone examples using pyarrow. |
## Pulling data from [Pushshift](https://pushshift.io "Pushshift.io") ## ## Sourcing the dumps
- `pull_pushshift_comments.sh` uses wget to download comment dumps to `/gscratch/comdata/raw_data/reddit_dumps/comments`. It doesn't download files that already exists and runs `check_comments_shas.sh` to verify the files downloaded correctly. Pushshift was effectively wound down after Reddit cut off third-party API
access in 2023, and the original `files.pushshift.io` archive is gone.
Collection of new Reddit comment and submission data has since been
picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
which publishes both the historical Pushshift archive and the new data
it continues to collect, with monthly updates redistributed as academic
torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`. Fetching the
dumps from a torrent client is a manual prerequisite to running the rest
of this pipeline; step-by-step instructions for the current CDSC
workflow — including which torrents to pull and how to stage the `.zst`
files on Hyak — live on the CDSC wiki at
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
scripts has been removed since the URLs they pointed at no longer
resolve.
- `pull_pushshift_submissions.sh` does the same for submissions and puts them in `/gscratch/comdata/raw_data/reddit_dumps/comments`. ## Building Parquet datasets
## Building Parquet Datasets ## The raw dumps are huge compressed JSON files with a lot of metadata that
we usually don't need. They aren't indexed, so it's expensive to pull data
for just a handful of subreddits, and they are awkward to read directly
into Spark. Extracting the useful fields and rewriting the data as
Parquet makes everything downstream cheaper. The conversion happens in
two steps:
Pushshift dumps are huge compressed json files with a lot of metadata that we may not need. It isn't indexed so it's expensive to pull data from just a handful of subreddits. It also turns out that it's a pain to read these compressed files straight into spark. Extracting useful variables from the dumps and building parquet datasets will make them easier to work with. This happens in two steps: 1. Extracting JSON into temporary, unpartitioned Parquet files using
pyarrow (`comments_2_parquet_part1.py`,
`submissions_2_parquet_part1.py`).
2. Repartitioning and sorting the data using PySpark
(`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).
1. Extracting json into (temporary, unpartitioned) parquet files using pyarrow. The final datasets live in `/gscratch/comdata/output/`:
2. Repartitioning and sorting the data using pyspark.
The final datasets are in `/gscratch/comdata/output.` - `reddit_comments_by_author.parquet` — comments partitioned and sorted by
author (lowercase).
- `reddit_comments_by_subreddit.parquet` — comments partitioned and sorted
by subreddit (lowercase).
- `reddit_submissions_by_author.parquet` — submissions partitioned and
sorted by author (lowercase).
- `reddit_submissions_by_subreddit.parquet` — submissions partitioned and
sorted by subreddit (lowercase).
- `reddit_comments_by_author.parquet` has comments partitioned and sorted by username (lowercase). Splitting the work this way lets us decompress and parse the dumps in the
- `reddit_comments_by_subreddit.parquet` has comments partitioned and sorted by subreddit name (lowercase). Hyak backfill queue and then sort them in Spark. Partitioning makes it
- `reddit_submissions_by_author.parquet` has submissions partitioned and sorted by username (lowercase). possible to read data for specific subreddits or authors efficiently, and
- `reddit_submissions_by_subreddit.parquet` has submissions partitioned and sorted by subreddit name (lowercase). sorting makes per-subreddit or per-user aggregations cheap. More
documentation on using these files lives on the [CDSC
wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).
Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors. Sorting it means that you can efficiently compute agreggations at the subreddit or user level. More documentation on using these files is available [here](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets"). ## TF-IDF subreddit similarity
## TF-IDF Subreddit Similarity ## [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
information-retrieval technique we use to quantify the topic of a
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") is common and simple information retrieval technique that we can use to quantify the topic of a subreddit. The goal of TF-IDF is to build a vector for each subreddit that scores every term (or phrase) according to how characteristic it is of the overall lexicon used in that subreddit. For example, the most characteristic terms in the subreddit /r/christianity in the current version of the TF-IDF model are: subreddit. The goal is to build a vector for each subreddit that scores
every term (or phrase) according to how characteristic it is of the
lexicon used there. For example, the most characteristic terms in
`/r/christianity` in the current model are:
| Term | tf_idf | | Term | tf_idf |
|:------------:|:------:| |:------------:|:------:|
@@ -56,61 +116,121 @@ Breaking this down into two steps is useful because it allows us to decompress a
| bible | 0.557 | | bible | 0.557 |
| scripture | 0.55 | | scripture | 0.55 |
TF-IDF stands for "term frequency - inverse document frequency" because it is the product of two terms "term frequency" and "inverse document frequency." Term frequency quantifies the amount that a term appears in a subreddit (document). Inverse document frequency quantifies how much that term appears in other subreddits (documents). As you can see on the Wikipedia page, there are many possible ways of constructing and combining these terms. TF-IDF is the product of two pieces: *term frequency* (how often a term
appears in a subreddit) and *inverse document frequency* (how rare the
term is across other subreddits). There are many ways to construct and
combine these; the [Wikipedia
page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
variants.
$x + y = z_{1,d}$ We normalize term frequency by the maximum raw term frequency for each
subreddit:
I chose to normalize term frequency by the maximum (raw) term frequency for each subreddit: $$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$$
$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\sum_{t^{'} \in d}{f_{t^{'},d}}}$
I use the log inverse document frequency: and use the log inverse document frequency:
$\mathrm{idf}_{t} = log\frac{N}{| {d \in D : t \in d} |}$
I then combine them using some smoothing to get: $$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$ combined with a smoothing term:
### Building TF-IDF vectors ### $$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
The process for building TF-IDF vectors has four steps: (Other normalization strategies are worth trying — see the note in
`similarities/TODO`.)
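
The three formulas above (max-normalized term frequency, log inverse document frequency, and the 0.5 smoothing) can be worked through on a toy corpus. This is a minimal pure-Python sketch, not the Spark implementation used in the pipeline; the function name `tfidf` and the toy documents are made up for illustration, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Compute smoothed TF-IDF scores.

    docs maps a document name (here, a subreddit) to its list of terms.
    Returns {doc: {term: score}}.
    """
    counts = {doc: Counter(terms) for doc, terms in docs.items()}
    n_docs = len(counts)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    scores = {}
    for doc, term_counts in counts.items():
        max_f = max(term_counts.values(), default=0)
        scores[doc] = {
            # (0.5 + 0.5 * tf) * idf, with tf = f / max f in this document
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in term_counts.items()
        }
    return scores


scores = tfidf({"a": ["x", "x", "y"], "b": ["y", "z"]})
```

In this toy corpus the term `y` appears in both documents, so its inverse document frequency is `log(2/2) = 0` and it scores zero everywhere, while `x` and `z` each score `log(2)` in the one document that uses them.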
1. Extracting terms using `tf_comments.py` ### Building TF-IDF vectors
2. Detecting common phrases using `top_comment_phrases.py`
3. Extracting terms and common phrases using `tf_comments.py --mwe-pass='second'`
4. Building idf and tf-idf scores in `idf_comments.py`
#### Running `tf_comments.py` on the backfill queue #### The pipeline has four steps:
The main reason that I did it in 4 steps instead of one is to take advantage of the backfill queue for running `tf_comments.py`. This step requires reading all of the text in every comment and converting it to a bag of words at the subreddit-level. This is a lot of computation that is easily parallelizable. The script `run_tf_jobs.sh` partially automates running steps 1 (or 3) on the backfill queue. 1. Extract terms with `ngrams/tf_comments.py`.
2. Detect common phrases with `ngrams/top_comment_phrases.py`.
3. Re-extract terms together with detected phrases via
`ngrams/tf_comments.py --mwe-pass=second`.
4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.
#### Phrase detection using Pointwise Mutual Information #### #### Running `tf_comments.py` on the backfill queue
TF-IDF is simple, but only uses single words (unigrams). Sequences of multiple words can be important to account for how words have different meanings in different contexts or how sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing since the number of possible n-grams grows like crazy as n gets bigger. Phrase detection helps this problem by limiting the set of n-grams to those most informative. The main reason for the four-step layout is that `tf_comments.py` is
trivially parallel — it reads every comment and rewrites each subreddit
as a bag of words — so it benefits from being farmed out to the Hyak
backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.
But how do we detect phrases? I implemented [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) "Wikipedia article on pointwise mutual information"), which is a pretty simple way, but seems to work pretty well. #### Phrase detection using pointwise mutual information
PMI is an quantity derived from information theory. The intuition is that if two words occur together quite frequently compared to how often they appear separately then the cooccurrance is likely to be informative. TF-IDF over unigrams misses the fact that sequences of words often carry
distinct meaning (names, fixed expressions, in-jokes). Considering every
possible n-gram is prohibitive because the candidate set explodes with
`n`, so we use phrase detection to limit ourselves to informative
n-grams.
$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.$ We use [pointwise mutual
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
(PMI), which is simple and works well in practice. The intuition is that
if two words co-occur much more often than their marginal frequencies
would predict, the pair is probably meaningful:
In `tf_comments.py` if `--mwe-pass=first` then a 10\% sample of 1-4-grams (sequences of terms up to length 4) will be written to a file to be consumed by `top_comment_phrases.py`. `top_comment_phrases.py` computes the PMI for these possible phrases and writes those that occur at least 3500 times in the sample of n-grams and have a PWMI of at least 3 (about 65000 expressions). $$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
`tf_comments.py --mwe-pass=second` then uses the detected phrases and adds them to the term frequency data. When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
computes PMI over that sample and keeps phrases that occur at least
3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
A second pass of `tf_comments.py --mwe-pass=second` folds those phrases
back into the term-frequency data.
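
The PMI filter described above can be sketched in a few lines. This is an illustration only: it handles bigrams rather than the 1- to 4-grams the pipeline samples, it assumes the natural logarithm, the function name `detect_phrases` is hypothetical, and the thresholds in the usage line are toy values rather than the 3,500 and 3 quoted above.

```python
import math
from collections import Counter


def detect_phrases(tokens, min_count, min_pmi):
    """Return {(x, y): pmi} for adjacent word pairs that pass both thresholds.

    Probabilities are plain relative frequencies from this one token list.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (x, y), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            phrases[(x, y)] = pmi
    return phrases


tokens = "new york is big new york is old".split()
phrases = detect_phrases(tokens, min_count=2, min_pmi=1.0)
```

Here only the pairs `("new", "york")` and `("york", "is")` occur twice, so they are the only candidates that survive the count threshold.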
### Cosine Similarity ### ### Cosine similarity
Once the tf-idf vectors are built, making a similarity score between two subreddits is straightforward using cosine similarity. Once the TF-IDF vectors are built, computing a similarity score between
two subreddits is straightforward with cosine similarity:
$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}} \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$ $$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). Each subreddit is a vector in a high-dimensional term space. The dot
In linear algebra, the dot product ($\cdot$) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights). product gives a weighted sum of shared terms, and dividing by the
The vectors might have different lengths like if one subreddit has words in comments than the other, so in cosine similarity the dot product is normalized by the magnitude (lengths) of the vectors. vector magnitudes removes the effect of differing vocabulary size — what
It turns out that this is equivalent to taking the cosine of the two vectors. So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. If the cosine similarity between two subreddits is greater then their tf-idf vectors are more correlated. remains is the cosine of the angle between the two vectors. Cosine
similarity with TF-IDF is popular (and has been used on Reddit several
times in prior research) because it captures correlation between the
*most characteristic* terms of two communities.
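
The cosine formula above can be checked on small sparse vectors. This is a minimal sketch with a hypothetical function name, storing each vector as a `{term: weight}` dict; the pipeline itself computes similarities in Spark. Returning 0.0 for an all-zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)
```

Two vectors with no shared terms score 0, and a vector compared with itself scores 1, regardless of how many terms each subreddit uses.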
Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities. Compared to approaches based on word embeddings or topic models, this
method can struggle with polysemy, synonymy, and correlations between
related terms. Phrase detection helps a little. The trade-off is
simplicity and scalability. Adding [latent semantic
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
intermediate step is on the wish-list for improving on raw TF-IDF
similarities.
Compared to other approach to similarity like those using word embeddings or topic models it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little bit. The advantages of this approach are simplicity and scalability. I'm thinking about using [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis "Wikipedia article on Latent semantic analysis") as an intermediate step to improve upon similarities based on raw tf-idfs. Even with these simplifications, similarity between a large number of
subreddits is expensive — naively $n^2$ dot-products. Passing
`--similarity-threshold=X` (with `X>0`) to the similarity scripts lets
Spark's built-in matrix library use the DIMSUM approximation, which is
the same algorithm Twitter and Google have used for large-scale
similarity scoring.
Even still, computing similarities between a large number of subreddits is computationally expensive and requires $n^2$ dot-product evaluations. ## Clustering, density, and time series
This can be sped up by passing `similarity-threshold=X` where $X>0$ into `term_comment_similarity.py`. I used a cosine similarity function that's built into the spark matrix library which supports the `DIMSUM` algorithm for approximating matrix-matrix products. This algorithm is commonly used in industry (i.e. at Twitter, Google) for large-scale similarity scoring.
The similarity matrices feed three follow-on analyses:
- `clustering/clustering.py` clusters a similarity matrix using
affinity propagation; `clustering/selection.py` and
`clustering/fit_tsne.py` are supporting scripts for hyperparameter
selection and 2-D embeddings.
- `density/overlap_density.py` computes a per-subreddit overlap density
measure from the similarity matrix.
- `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
pull subreddit-level activity time series and join them against
clustering output.
`visualization/tsne_vis.py` renders interactive Altair plots of the
clustering output — see the prebuilt HTML files in `visualization/` for
examples.
## Bot detection
`bots/good_bad_bot.py` computes user-level features (compression rate
of comment text, frequency of self-identification as a bot, etc.) that
are useful for filtering bot accounts out of downstream analyses. This
is preliminary work; nothing in the pipeline currently consumes it
automatically.


@@ -1,2 +0,0 @@
from .timeseries import load_clusters, load_densities, build_cluster_timeseries

74
bots/good_bad_bot.py Normal file

@@ -0,0 +1,74 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.sql.types import FloatType
import zlib
def zlib_entropy_rate(s):
    # Ratio of compressed size to raw size; lower values mean more repetitive text.
    sb = s.encode()
    if len(sb) == 0:
        return None
    else:
        return len(zlib.compress(sb, level=6)) / len(sb)
zlib_entropy_rate_udf = f.udf(zlib_entropy_rate,FloatType())
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_comments_by_author.parquet",compression='snappy')
df = df.withColumn("saidbot",f.lower(f.col("body")).like("%bot%"))
# df = df.filter(df.subreddit=='seattle')
# df = df.cache()
botreplies = df.filter(f.lower(df.body).rlike(".*(good|bad) bot.*"))
botreplies = botreplies.select([f.col("parent_id").substr(4,100).alias("bot_comment_id"),f.lower(f.col("body")).alias("good_bad_bot"),f.col("link_id").alias("gbbb_link_id")])
botreplies = botreplies.groupby(['bot_comment_id']).agg(f.count('good_bad_bot').alias("N_goodbad_votes"),
f.sum((f.lower(f.col('good_bad_bot')).like('%good bot%').astype("double"))).alias("n_good_votes"),
f.sum((f.lower(f.col('good_bad_bot')).like('%bad bot%').astype("double"))).alias("n_bad_votes"))
comments_by_author = df.select(['author','id','saidbot']).groupBy('author').agg(f.count('id').alias("N_comments"),
f.mean(f.col('saidbot').astype("double")).alias("prop_saidbot"),
f.sum(f.col('saidbot').astype("double")).alias("n_saidbot"))
# pd_comments_by_author = comments_by_author.toPandas()
# pd_comments_by_author['frac'] = 500 / pd_comments_by_author['N_comments']
# pd_comments_by_author.loc[pd_comments_by_author.frac > 1, 'frac'] = 1
# fractions = pd_comments_by_author.loc[:,['author','frac']]
# fractions = fractions.set_index('author').to_dict()['frac']
# sampled_author_comments = df.sampleBy("author",fractions).groupBy('author').agg(f.concat_ws(" ", f.collect_list('body')).alias('comments'))
df = df.withColumn("randn",f.randn(seed=1968))
win = Window.partitionBy("author").orderBy("randn")
df = df.withColumn("randRank",f.rank().over(win))
sampled_author_comments = df.filter(f.col("randRank") <= 1000)
sampled_author_comments = sampled_author_comments.groupBy('author').agg(f.concat_ws(" ", f.collect_list('body')).alias('comments'))
author_entropy_rates = sampled_author_comments.select(['author',zlib_entropy_rate_udf(f.col('comments')).alias("entropy_rate")])
parents = df.join(botreplies, on=df.id==botreplies.bot_comment_id,how='right_outer')
win1 = Window.partitionBy("author")
parents = parents.withColumn("first_bot_reply",f.min(f.col("CreatedAt")).over(win1))
first_bot_reply = parents.filter(f.col("first_bot_reply")==f.col("CreatedAt"))
first_bot_reply = first_bot_reply.withColumnRenamed("CreatedAt","FB_CreatedAt")
first_bot_reply = first_bot_reply.withColumnRenamed("id","FB_id")
comments_since_first_bot_reply = df.join(first_bot_reply,on = 'author',how='right_outer').filter(f.col("CreatedAt")>=f.col("first_bot_reply"))
comments_since_first_bot_reply = comments_since_first_bot_reply.groupBy("author").agg(f.count("id").alias("N_comments_since_firstbot"))
bots = parents.groupby(['author']).agg(f.sum('N_goodbad_votes').alias("N_goodbad_votes"),
f.sum(f.col('n_good_votes')).alias("n_good_votes"),
f.sum(f.col('n_bad_votes')).alias("n_bad_votes"),
f.count(f.col('author')).alias("N_bot_posts"))
bots = bots.join(comments_by_author,on="author",how='left_outer')
bots = bots.join(comments_since_first_bot_reply,on="author",how='left_outer')
bots = bots.join(author_entropy_rates,on='author',how='left_outer')
bots = bots.orderBy("N_goodbad_votes",ascending=False)
bots = bots.repartition(1)
bots.write.parquet("/gscratch/comdata/output/reddit_good_bad_bot.parquet",mode='overwrite')


@@ -1,36 +1,55 @@
srun_singularity=srun -p compute-bigmem -A comdata --time=48:00:00 --mem=362G -c 40 /bin/bash -c #srun_cdsc='srun -p comdata-int -A comdata --time=300:00:00 --time-min=00:15:00 --mem=100G --ntasks=1 --cpus-per-task=28'
similarity_data=../../data/reddit_similarity srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
clustering_data=../../data/reddit_clustering similarity_data=/gscratch/comdata/output/reddit_similarity
kmeans_selection_grid=--max_iters=[3000] --n_inits=[10] --n_clusters=[100,500,1000,1250,1500,1750,2000] clustering_data=/gscratch/comdata/output/reddit_clustering
hdbscan_selection_grid=--min_cluster_sizes=[2,3,4,5] --min_samples=[2,3,4,5] --cluster_selection_epsilons=[0,0.01,0.05,0.1,0.15,0.2] --cluster_selection_methods=[eom,leaf] selection_grid="--max_iter=3000 --convergence_iter=15,30,100 --damping=0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99, --preference_quantile=0.1,0.3,0.5,0.7,0.9"
affinity_selection_grid=--dampings=[0.5,0.6,0.7,0.8,0.95,0.97,0.99] --preference_quantiles=[0.1,0.3,0.5,0.7,0.9] --convergence_iters=[15] #selection_grid="--max_iter=3000 --convergence_iter=[15] --preference_quantile=[0.5] --damping=[0.99]"
all:$(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv
# $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS
# $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS
authors_tf_10k_input_lsi=$(similarity_data)/subreddit_comment_authors-tf_10k_LSI $(clustering_data)/subreddit_comment_authors_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_authors_10k.feather clustering.py
authors_tf_10k_output_lsi=$(clustering_data)/subreddit_comment_authors-tf_10k_LSI $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_10k.feather $(clustering_data)/subreddit_comment_authors_10k $(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(selection_grid) -J 20
all:authors_tf_10k_lsi $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_terms_10k.feather clustering.py
$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_terms_10k.feather $(clustering_data)/subreddit_comment_terms_10k $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv $(selection_grid) -J 20
authors_tf_10k_lsi:${authors_tf_10k_output_lsi}/kmeans/selection_data.csv ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv ${authors_tf_10k_output_lsi}/affinity/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv:clustering.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather
$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather $(clustering_data)/subreddit_comment_authors-tf_10k $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(selection_grid) -J 20
## LSI Models # $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS:selection.py $(similarity_data)/subreddit_comment_authors_30k.feather clustering.py
${authors_tf_10k_output_lsi}/kmeans/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py kmeans_clustering.py # $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_30k.feather $(clustering_data)/subreddit_comment_authors_30k $(selection_grid) -J 10 && touch $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS
$(srun_singularity) -c "source ~/.bashrc; python3 kmeans_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/kmeans --savefile=${authors_tf_10k_output_lsi}/kmeans/selection_data.csv $(kmeans_selection_grid)"
${authors_tf_10k_output_lsi}/affinity/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py affinity_clustering.py # $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS:selection.py $(similarity_data)/subreddit_comment_terms_30k.feather clustering.py
$(srun_singularity) -c "source ~/.bashrc; python3 affinity_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/affinity --savefile=${authors_tf_10k_output_lsi}/affinity/selection_data.csv $(affinity_selection_grid)" # $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_terms_30k.feather $(clustering_data)/subreddit_comment_terms_30k $(selection_grid) -J 10 && touch $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS
${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py hdbscan_clustering.py # $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS:clustering.py $(similarity_data)/subreddit_comment_authors-tf_30k.feather
$(srun_singularity) -c "source ~/.bashrc; python3 hdbscan_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/hdbscan --savefile=${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv $(hdbscan_selection_grid)" # $(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors-tf_30k.feather $(clustering_data)/subreddit_comment_authors-tf_30k $(selection_grid) -J 8 && touch $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS
${authors_tf_10k_output_lsi}/best_hdbscan.feather:${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv pick_best_clustering.py
$(srun_singularity) -c "source ~/.bashrc; python3 pick_best_clustering.py $< $@ --min_clusters=50 --max_isolates=5000 --min_cluster_size=2"
${authors_tf_10k_input_lsi}: # $(clustering_data)/subreddit_comment_authors_100k.feather:clustering.py $(similarity_data)/subreddit_comment_authors_100k.feather
$(MAKE) -C ../similarities # $(srun_singularity) python3 clustering.py $(similarity_data)/subreddit_comment_authors_100k.feather $(clustering_data)/subreddit_comment_authors_100k.feather ---max_iter=400 --convergence_iter=15 --preference_quantile=0.85 --damping=0.85
clean: # $(clustering_data)/comment_terms_100k.feather:clustering.py $(similarity_data)/subreddit_comment_terms_100k.feather
rm -f ${authors_tf_10k_output_lsi}/affinity/selection_data.csv # $(srun_singularity) python3 clustering.py $(similarity_data)/comment_terms_10000.feather $(clustering_data)/comment_terms_10000.feather ---max_iter=1000 --convergence_iter=15 --preference_quantile=0.9 --damping=0.5
rm -f ${authors_tf_10k_output_lsi}/kmeans/selection_data.csv
rm -f ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv
PHONY: clean # $(clustering_data)/subreddit_comment_author-tf_100k.feather:clustering.py $(similarity_data)/subreddit_comment_author-tf_100k.feather
# $(srun_singularity) python3 clustering.py $(similarity_data)/subreddit_comment_author-tf_100k.parquet $(clustering_data)/subreddit_comment_author-tf_100k.feather ---max_iter=400 --convergence_iter=15 --preference_quantile=0.5 --damping=0.85
# it's pretty difficult to get a result that isn't one huge megacluster. A sign that it's bullcrap
# /gscratch/comdata/output/reddit_clustering/wang_similarity_10000.feather:clustering.py /gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather
# ./clustering.py /gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather /gscratch/comdata/output/reddit_clustering/wang_similarity_10000.feather ---max_iter=400 --convergence_iter=15 --preference_quantile=0.9 --damping=0.85
# /gscratch/comdata/output/reddit_tsne/subreddit_author_tf_similarities_10000.feather:fit_tsne.py /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
# start_spark_and_run.sh 1 fit_tsne.py --similarities=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet --output=/gscratch/comdata/output/reddit_tsne/subreddit_author_tf_similarities_10000.feather
# /gscratch/comdata/output/reddit_tsne/wang_similarity_10000.feather:fit_tsne.py /gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather
# python3 fit_tsne.py --similarities=/gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather --output=/gscratch/comdata/output/reddit_tsne/wang_similarity_10000.feather
# /gscratch/comdata/output/reddit_tsne/comment_authors_10000.feather:clustering.py /gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
# # $srun_cdsc python3
# start_spark_and_run.sh 1 fit_tsne.py --similarities=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather --output=/gscratch/comdata/output/reddit_tsne/comment_authors_10000.feather


@@ -1,129 +0,0 @@
from sklearn.cluster import AffinityPropagation
from dataclasses import dataclass
from clustering_base import clustering_result, clustering_job
from grid_sweep import grid_sweep
from pathlib import Path
from itertools import product, starmap
import fire
import sys
import numpy as np
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.
@dataclass
class affinity_clustering_result(clustering_result):
damping:float
convergence_iter:int
preference_quantile:float
preference:float
max_iter:int
class affinity_job(clustering_job):
def __init__(self, infile, outpath, name, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
super().__init__(infile,
outpath,
name,
call=self._affinity_clustering,
preference_quantile=preference_quantile,
damping=damping,
max_iter=max_iter,
convergence_iter=convergence_iter,
random_state=random_state,
verbose=verbose)
self.damping=damping
self.max_iter=max_iter
self.convergence_iter=convergence_iter
self.preference_quantile=preference_quantile
def _affinity_clustering(self, mat, preference_quantile, *args, **kwargs):
mat = 1-mat
preference = np.quantile(mat, preference_quantile)
self.preference = preference
print(f"preference is {preference}")
print("data loaded")
sys.stdout.flush()
clustering = AffinityPropagation(*args,
preference=preference,
affinity='precomputed',
copy=False,
**kwargs).fit(mat)
return clustering
def get_info(self):
result = super().get_info()
self.result=affinity_clustering_result(**result.__dict__,
damping=self.damping,
max_iter=self.max_iter,
convergence_iter=self.convergence_iter,
preference_quantile=self.preference_quantile,
preference=self.preference)
return self.result
class affinity_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
*args,
**kwargs):
super().__init__(affinity_job,
inpath,
outpath,
self.namer,
*args,
**kwargs)
def namer(self,
damping,
max_iter,
convergence_iter,
preference_quantile):
return f"damp-{damping}_maxit-{max_iter}_convit-{convergence_iter}_prefq-{preference_quantile}"
def run_affinity_grid_sweep(savefile, inpath, outpath, dampings=[0.8], max_iters=[3000], convergence_iters=[30], preference_quantiles=[0.5],n_cores=10):
"""Run affinity clustering once or more with different parameters.
Usage:
affinity_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --max_iters=<csv> --dampings=<csv> --preference_quantiles=<csv>
Keyword arguments:
savefile: path to save the metadata and diagnostics.
inpath: path to feather data containing a labeled matrix of subreddit similarities.
outpath: path to output fit affinity propagation clusterings.
dampings: one or more numbers in [0.5, 1). Damping parameter in affinity propagation clustering.
preference_quantiles: one or more numbers in (0,1) for selecting the 'preference' parameter.
convergence_iters: one or more integers giving the number of iterations without improvement before stopping.
max_iters: one or more integers giving the maximum number of iterations.
"""
obj = affinity_grid_sweep(inpath,
outpath,
map(float,dampings),
map(int,max_iters),
map(int,convergence_iters),
map(float,preference_quantiles))
obj.run(n_cores)
obj.save(savefile)
def test_select_affinity_clustering():
# select_hdbscan_clustering("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI",
# "test_hdbscan_author30k",
# min_cluster_sizes=[2],
# min_samples=[1,2],
# cluster_selection_epsilons=[0,0.05,0.1,0.15],
# cluster_selection_methods=['eom','leaf'],
# lsi_dimensions='all')
inpath = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/"
outpath = "test_affinity";
dampings=[0.8,0.9]
max_iters=[100000]
convergence_iters=[15]
preference_quantiles=[0.5,0.7]
gs = affinity_lsi_grid_sweep(inpath, 'all', outpath, dampings, max_iters, convergence_iters, preference_quantiles)
gs.run(20)
gs.save("test_affinity/lsi_sweep.csv")
if __name__ == "__main__":
fire.Fire(run_affinity_grid_sweep)


@@ -1,99 +0,0 @@
import fire
from affinity_clustering import affinity_clustering_result, affinity_job, affinity_grid_sweep
from grid_sweep import grid_sweep
from lsi_base import lsi_result_mixin, lsi_grid_sweep, lsi_mixin
from dataclasses import dataclass
@dataclass
class affinity_clustering_result_lsi(affinity_clustering_result, lsi_result_mixin):
pass
class affinity_lsi_job(affinity_job, lsi_mixin):
def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
super().__init__(infile,
outpath,
name,
*args,
**kwargs)
super().set_lsi_dims(lsi_dims)
def get_info(self):
result = super().get_info()
self.result = affinity_clustering_result_lsi(**result.__dict__,
lsi_dimensions=self.lsi_dims)
return self.result
class affinity_lsi_grid_sweep(lsi_grid_sweep):
def __init__(self,
inpath,
lsi_dims,
outpath,
dampings=[0.9],
max_iters=[10000],
convergence_iters=[30],
preference_quantiles=[0.5]):
super().__init__(affinity_lsi_job,
_affinity_lsi_grid_sweep,
inpath,
lsi_dims,
outpath,
dampings,
max_iters,
convergence_iters,
preference_quantiles)
class _affinity_lsi_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
lsi_dim,
*args,
**kwargs):
self.lsi_dim = lsi_dim
self.jobtype = affinity_lsi_job
super().__init__(self.jobtype,
inpath,
outpath,
self.namer,
[self.lsi_dim],
*args,
**kwargs)
def namer(self, *args, **kwargs):
s = affinity_grid_sweep.namer(self, *args[1:], **kwargs)
s += f"_lsi-{self.lsi_dim}"
return s
def run_affinity_lsi_grid_sweep(savefile, inpath, outpath, dampings=[0.8], max_iters=[3000], convergence_iters=[30], preference_quantiles=[0.5], lsi_dimensions='all',n_cores=30):
"""Run affinity clustering once or more with different parameters.
Usage:
affinity_clustering_lsi.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --max_iters=<csv> --dampings=<csv> --preference_quantiles=<csv> --lsi_dimensions=<"all"|csv>
Keyword arguments:
savefile: path to save the metadata and diagnostics.
inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
outpath: path to output fit affinity propagation clusterings.
dampings: one or more numbers in [0.5, 1). Damping parameter in affinity propagation clustering.
preference_quantiles: one or more numbers in (0,1) for selecting the 'preference' parameter.
convergence_iters: one or more integers giving the number of iterations without improvement before stopping.
max_iters: one or more integers giving the maximum number of iterations.
lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
"""
obj = affinity_lsi_grid_sweep(inpath,
lsi_dimensions,
outpath,
map(float,dampings),
map(int,max_iters),
map(int,convergence_iters),
map(float,preference_quantiles))
obj.run(n_cores)
obj.save(savefile)
if __name__ == "__main__":
fire.Fire(run_affinity_lsi_grid_sweep)


@@ -6,20 +6,21 @@ import numpy as np
from sklearn.cluster import AffinityPropagation from sklearn.cluster import AffinityPropagation
import fire import fire
from pathlib import Path from pathlib import Path
from multiprocessing import cpu_count
from dataclasses import dataclass
from clustering_base import sim_to_dist, process_clustering_result, clustering_result, read_similarity_mat
def affinity_clustering(similarities, output, *args, **kwargs): def read_similarity_mat(similarities, use_threads=True):
df = pd.read_feather(similarities, use_threads=use_threads)
mat = np.array(df.drop('_subreddit',1))
n = mat.shape[0]
mat[range(n),range(n)] = 1
return (df._subreddit,mat)
def affinity_clustering(similarities, *args, **kwargs):
subreddits, mat = read_similarity_mat(similarities) subreddits, mat = read_similarity_mat(similarities)
clustering = _affinity_clustering(mat, *args, **kwargs) return _affinity_clustering(mat, subreddits, *args, **kwargs)
cluster_data = process_clustering_result(clustering, subreddits)
cluster_data['algorithm'] = 'affinity'
return(cluster_data)
def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True): def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
''' '''
similarities: matrix of similarity scores similarities: feather file with a dataframe of similarity scores
preference_quantile: parameter controlling how many clusters to make. higher values = more clusters. 0.85 is a good value with 3000 subreddits. preference_quantile: parameter controlling how many clusters to make. higher values = more clusters. 0.85 is a good value with 3000 subreddits.
damping: parameter controlling how iterations are merged. Higher values make convergence faster and more dependable. 0.85 is a good value for the 10000 subreddits by author. damping: parameter controlling how iterations are merged. Higher values make convergence faster and more dependable. 0.85 is a good value for the 10000 subreddits by author.
''' '''
@@ -39,14 +40,25 @@ def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000,
verbose=verbose, verbose=verbose,
random_state=random_state).fit(mat) random_state=random_state).fit(mat)
cluster_data = process_clustering_result(clustering, subreddits)
output = Path(output) print(f"clustering took {clustering.n_iter_} iterations")
output.parent.mkdir(parents=True,exist_ok=True) clusters = clustering.labels_
print(f"found {len(set(clusters))} clusters")
cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
cluster_sizes = cluster_data.groupby("cluster").count()
print(f"the largest cluster has {cluster_sizes.subreddit.max()} members")
print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
print(f"{(cluster_sizes.subreddit==1).sum()} clusters have 1 member")
sys.stdout.flush()
cluster_data.to_feather(output) cluster_data.to_feather(output)
print(f"saved {output}") print(f"saved {output}")
return clustering return clustering
if __name__ == "__main__": if __name__ == "__main__":
fire.Fire(affinity_clustering) fire.Fire(affinity_clustering)


@@ -1,151 +0,0 @@
import pickle
from pathlib import Path
import numpy as np
import pandas as pd
from dataclasses import dataclass
from sklearn.metrics import silhouette_score, silhouette_samples
from collections import Counter
# this is meant to be an interface, not created directly
class clustering_job:
def __init__(self, infile, outpath, name, call, *args, **kwargs):
self.outpath = Path(outpath)
self.call = call
self.args = args
self.kwargs = kwargs
self.infile = Path(infile)
self.name = name
self.hasrun = False
def run(self):
self.subreddits, self.mat = self.read_distance_mat(self.infile)
self.clustering = self.call(self.mat, *self.args, **self.kwargs)
self.cluster_data = self.process_clustering(self.clustering, self.subreddits)
self.outpath.mkdir(parents=True, exist_ok=True)
self.cluster_data.to_feather(self.outpath/(self.name + ".feather"))
self.hasrun = True
self.cleanup()
def cleanup(self):
self.cluster_data = None
self.mat = None
self.clustering=None
self.subreddits=None
def get_info(self):
if not self.hasrun:
self.run()
self.result = clustering_result(outpath=str(self.outpath.resolve()),
silhouette_score=self.score,
name=self.name,
n_clusters=self.n_clusters,
n_isolates=self.n_isolates,
silhouette_samples = self.silsampout
)
return self.result
def silhouette(self):
counts = Counter(self.clustering.labels_)
singletons = [key for key, value in counts.items() if value == 1]
isolates = (self.clustering.labels_ == -1) | (np.isin(self.clustering.labels_,np.array(singletons)))
scoremat = self.mat[~isolates][:,~isolates]
if self.n_clusters > 1:
score = silhouette_score(scoremat, self.clustering.labels_[~isolates], metric='precomputed')
silhouette_samp = silhouette_samples(self.mat, self.clustering.labels_, metric='precomputed')
silhouette_samp = pd.DataFrame({'subreddit':self.subreddits,'score':silhouette_samp})
self.outpath.mkdir(parents=True, exist_ok=True)
silsampout = self.outpath / ("silhouette_samples-" + self.name + ".feather")
self.silsampout = silsampout.resolve()
silhouette_samp.to_feather(self.silsampout)
else:
score = None
self.silsampout = None
return score
def read_distance_mat(self, similarities, use_threads=True):
print(similarities)
df = pd.read_feather(similarities, use_threads=use_threads)
mat = np.array(df.drop('_subreddit',axis=1))
n = mat.shape[0]
mat[range(n),range(n)] = 1
return (df._subreddit,1-mat)
def process_clustering(self, clustering, subreddits):
if hasattr(clustering,'n_iter_'):
print(f"clustering took {clustering.n_iter_} iterations")
clusters = clustering.labels_
self.n_clusters = len(set(clusters))
print(f"found {self.n_clusters} clusters")
cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
self.score = self.silhouette()
print(f"silhouette_score:{self.score}")
cluster_sizes = cluster_data.groupby("cluster").count().reset_index()
print(f"the largest cluster has {cluster_sizes.loc[cluster_sizes.cluster!=-1].subreddit.max()} members")
print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
n_isolates1 = (cluster_sizes.subreddit==1).sum()
print(f"{n_isolates1} clusters have 1 member")
n_isolates2 = cluster_sizes.loc[cluster_sizes.cluster==-1,:]['subreddit'].to_list()
if len(n_isolates2) > 0:
n_isolates2 = n_isolates2[0]
print(f"{n_isolates2} subreddits are in cluster -1",flush=True)
if n_isolates1 == 0:
self.n_isolates = n_isolates2
else:
self.n_isolates = n_isolates1
return cluster_data
class twoway_clustering_job(clustering_job):
def __init__(self, infile, outpath, name, call1, call2, args1, args2):
self.outpath = Path(outpath)
self.call1 = call1
self.args1 = args1
self.call2 = call2
self.args2 = args2
self.infile = Path(infile)
self.name = name
self.hasrun = False
self.args = args1|args2
def run(self):
self.subreddits, self.mat = self.read_distance_mat(self.infile)
self.step1 = self.call1(self.mat, **self.args1)
self.clustering = self.call2(self.mat, self.step1, **self.args2)
self.cluster_data = self.process_clustering(self.clustering, self.subreddits)
self.hasrun = True
self.after_run()
self.cleanup()
def after_run(self):
self.score = self.silhouette()
self.outpath.mkdir(parents=True, exist_ok=True)
print(self.outpath/(self.name+".feather"))
self.cluster_data.to_feather(self.outpath/(self.name + ".feather"))
def cleanup(self):
super().cleanup()
self.step1 = None
@dataclass
class clustering_result:
outpath:Path
silhouette_score:float
name:str
n_clusters:int
n_isolates:int
silhouette_samples:str

34
clustering/fit_tsne.py Normal file

@@ -0,0 +1,34 @@
import fire
import pyarrow
import pandas as pd
from numpy import random
import numpy as np
from sklearn.manifold import TSNE
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet"
def fit_tsne(similarities, output, learning_rate=750, perplexity=50, n_iter=10000, early_exaggeration=20):
    '''
    similarities: feather file with a dataframe of similarity scores
    output: path of the feather file to write the 2-D coordinates to
    learning_rate: parameter controlling how fast the model converges. Too low and you get outliers. Too high and you get a ball.
    perplexity: number of neighbors to use. the default of 50 is often good.
    '''
    df = pd.read_feather(similarities)
    n = df.shape[0]
    mat = np.array(df.drop('subreddit',axis=1),dtype=np.float64)
    # force a unit diagonal and cap values above 1 so arccos stays defined
    mat[range(n),range(n)] = 1
    mat[mat > 1] = 1
    # convert cosine similarity to angular distance
    dist = 2*np.arccos(mat)/np.pi
    # pass the function arguments through instead of hard-coding them;
    # init='random' because PCA initialization cannot be used with a precomputed metric
    tsne_model = TSNE(2,learning_rate=learning_rate,perplexity=perplexity,n_iter=n_iter,metric='precomputed',early_exaggeration=early_exaggeration,init='random',n_jobs=-1)
    # fit once: fit_transform fits the model and returns the embedding
    tsne_fit_whole = tsne_model.fit_transform(dist)
    plot_data = pd.DataFrame({'x':tsne_fit_whole[:,0],'y':tsne_fit_whole[:,1], 'subreddit':df.subreddit})
    plot_data.to_feather(output)
if __name__ == "__main__":
fire.Fire(fit_tsne)


@@ -1,49 +0,0 @@
from pathlib import Path
from multiprocessing import Pool, cpu_count
from itertools import product, chain
import pandas as pd
class grid_sweep:
def __init__(self, jobtype, inpath, outpath, namer, *args):
self.jobtype = jobtype
self.namer = namer
print(*args)
grid = list(product(*args))
inpath = Path(inpath)
outpath = Path(outpath)
self.hasrun = False
self.grid = [(inpath,outpath,namer(*g)) + g for g in grid]
self.jobs = [jobtype(*g) for g in self.grid]
def run(self, cores=20):
if cores is not None and cores > 1:
with Pool(cores) as pool:
infos = pool.map(self.jobtype.get_info, self.jobs)
else:
infos = map(self.jobtype.get_info, self.jobs)
self.infos = pd.DataFrame(infos)
self.hasrun = True
def save(self, outcsv):
if not self.hasrun:
self.run()
outcsv = Path(outcsv)
outcsv.parent.mkdir(parents=True, exist_ok=True)
self.infos.to_csv(outcsv)
class twoway_grid_sweep(grid_sweep):
def __init__(self, jobtype, inpath, outpath, namer, args1, args2, *args, **kwargs):
self.jobtype = jobtype
self.namer = namer
prod1 = product(* args1.values())
prod2 = product(* args2.values())
grid1 = [dict(zip(args1.keys(), pargs)) for pargs in prod1]
grid2 = [dict(zip(args2.keys(), pargs)) for pargs in prod2]
grid = product(grid1, grid2)
inpath = Path(inpath)
outpath = Path(outpath)
self.hasrun = False
self.grid = [(inpath,outpath,namer(**(g[0] | g[1])), g[0], g[1], *args) for g in grid]
self.jobs = [jobtype(*g) for g in self.grid]


@@ -1,159 +0,0 @@
from clustering_base import clustering_result, clustering_job
from grid_sweep import grid_sweep
from dataclasses import dataclass
import hdbscan
from sklearn.neighbors import NearestNeighbors
import plotnine as pn
import numpy as np
from itertools import product, starmap, chain
import pandas as pd
from multiprocessing import cpu_count
import fire
def test_select_hdbscan_clustering():
# select_hdbscan_clustering("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI",
# "test_hdbscan_author30k",
# min_cluster_sizes=[2],
# min_samples=[1,2],
# cluster_selection_epsilons=[0,0.05,0.1,0.15],
# cluster_selection_methods=['eom','leaf'],
# lsi_dimensions='all')
inpath = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_authors_compex_LSI"
outpath = "test_hdbscan";
min_cluster_sizes=[2,3,4];
min_samples=[1,2,3];
cluster_selection_epsilons=[0,0.1,0.3,0.5];
cluster_selection_methods=[1];
lsi_dimensions='all'
gs = hdbscan_lsi_grid_sweep(inpath, "all", outpath, min_cluster_sizes, min_samples, cluster_selection_epsilons, cluster_selection_methods)
gs.run(20)
gs.save("test_hdbscan/lsi_sweep.csv")
# job1 = hdbscan_lsi_job(infile=inpath, outpath=outpath, name="test", lsi_dims=500, min_cluster_size=2, min_samples=1,cluster_selection_epsilon=0,cluster_selection_method='eom')
# job1.run()
# print(job1.get_info())
# df = pd.read_csv("test_hdbscan/selection_data.csv")
# test_select_hdbscan_clustering()
# check_clusters = pd.read_feather("test_hdbscan/500_2_2_0.1_eom.feather")
# silscores = pd.read_feather("test_hdbscan/silhouette_samples500_2_2_0.1_eom.feather")
# c = check_clusters.merge(silscores,on='subreddit')# fire.Fire(select_hdbscan_clustering)
class hdbscan_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
*args,
**kwargs):
super().__init__(hdbscan_job, inpath, outpath, self.namer, *args, **kwargs)
def namer(self,
min_cluster_size,
min_samples,
cluster_selection_epsilon,
cluster_selection_method):
return f"mcs-{min_cluster_size}_ms-{min_samples}_cse-{cluster_selection_epsilon}_csm-{cluster_selection_method}"
@dataclass
class hdbscan_clustering_result(clustering_result):
min_cluster_size:int
min_samples:int
cluster_selection_epsilon:float
cluster_selection_method:str
class hdbscan_job(clustering_job):
def __init__(self, infile, outpath, name, min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0, cluster_selection_method='eom'):
super().__init__(infile,
outpath,
name,
call=hdbscan_job._hdbscan_clustering,
min_cluster_size=min_cluster_size,
min_samples=min_samples,
cluster_selection_epsilon=cluster_selection_epsilon,
cluster_selection_method=cluster_selection_method
)
self.min_cluster_size = min_cluster_size
self.min_samples = min_samples
self.cluster_selection_epsilon = cluster_selection_epsilon
self.cluster_selection_method = cluster_selection_method
# self.mat = 1 - self.mat
def _hdbscan_clustering(mat, *args, **kwargs):
print(f"running hdbscan clustering. args:{args}. kwargs:{kwargs}")
print(mat)
clusterer = hdbscan.HDBSCAN(metric='precomputed',
core_dist_n_jobs=cpu_count(),
*args,
**kwargs,
)
clustering = clusterer.fit(mat.astype('double'))
return(clustering)
def get_info(self):
result = super().get_info()
self.result = hdbscan_clustering_result(**result.__dict__,
min_cluster_size=self.min_cluster_size,
min_samples=self.min_samples,
cluster_selection_epsilon=self.cluster_selection_epsilon,
cluster_selection_method=self.cluster_selection_method)
return self.result
def run_hdbscan_grid_sweep(savefile, inpath, outpath, min_cluster_sizes=[2], min_samples=[1], cluster_selection_epsilons=[0], cluster_selection_methods=['eom']):
    """Run hdbscan clustering once or more with different parameters.
    Usage:
    hdbscan_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --min_cluster_sizes=<csv> --min_samples=<csv> --cluster_selection_epsilons=<csv> --cluster_selection_methods=<csv "eom"|"leaf">
    Keyword arguments:
    savefile: path to save the metadata and diagnostics
    inpath: path to feather data containing a labeled matrix of subreddit similarities.
    outpath: path to output fit hdbscan clusterings.
    min_cluster_sizes: one or more integers indicating the minimum cluster size
    min_samples: one or more integers indicating the minimum number of samples used in the algorithm
    cluster_selection_epsilons: one or more similarity thresholds for transition from dbscan to hdbscan
    cluster_selection_methods: one or more of "eom" or "leaf"; eom gives larger clusters.
    """
    obj = hdbscan_grid_sweep(inpath,
                             outpath,
                             map(int,min_cluster_sizes),
                             map(int,min_samples),
                             map(float,cluster_selection_epsilons),
                             cluster_selection_methods)
    obj.run()
    obj.save(savefile)
def KNN_distances_plot(mat,outname,k=2):
nbrs = NearestNeighbors(n_neighbors=k,algorithm='auto',metric='precomputed').fit(mat)
distances, indices = nbrs.kneighbors(mat)
d2 = distances[:,-1]
df = pd.DataFrame({'dist':d2})
df = df.sort_values("dist",ascending=False)
df['idx'] = np.arange(0,d2.shape[0]) + 1
p = pn.qplot(x='idx',y='dist',data=df,geom='line') + pn.scales.scale_y_continuous(minor_breaks = np.arange(0,50)/50,
breaks = np.arange(0,10)/10)
p.save(outname,width=16,height=10)
def make_KNN_plots():
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10k.feather"
subreddits, mat = read_similarity_mat(similarities)
mat = sim_to_dist(mat)
KNN_distances_plot(mat,k=2,outname='terms_knn_dist2.png')
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10k.feather"
subreddits, mat = read_similarity_mat(similarities)
mat = sim_to_dist(mat)
KNN_distances_plot(mat,k=2,outname='authors_knn_dist2.png')
similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k.feather"
subreddits, mat = read_similarity_mat(similarities)
mat = sim_to_dist(mat)
KNN_distances_plot(mat,k=2,outname='authors-tf_knn_dist2.png')
if __name__ == "__main__":
fire.Fire(run_hdbscan_grid_sweep)
# test_select_hdbscan_clustering()
#fire.Fire(select_hdbscan_clustering)


@@ -1,101 +0,0 @@
from hdbscan_clustering import hdbscan_job, hdbscan_grid_sweep, hdbscan_clustering_result
from lsi_base import lsi_grid_sweep, lsi_mixin, lsi_result_mixin
from grid_sweep import grid_sweep
import fire
from dataclasses import dataclass
@dataclass
class hdbscan_clustering_result_lsi(hdbscan_clustering_result, lsi_result_mixin):
pass
class hdbscan_lsi_job(hdbscan_job, lsi_mixin):
def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
super().__init__(
infile,
outpath,
name,
*args,
**kwargs)
super().set_lsi_dims(lsi_dims)
def get_info(self):
partial_result = super().get_info()
self.result = hdbscan_clustering_result_lsi(**partial_result.__dict__,
lsi_dimensions=self.lsi_dims)
return self.result
class hdbscan_lsi_grid_sweep(lsi_grid_sweep):
def __init__(self,
inpath,
lsi_dims,
outpath,
min_cluster_sizes,
min_samples,
cluster_selection_epsilons,
cluster_selection_methods
):
super().__init__(hdbscan_lsi_job,
_hdbscan_lsi_grid_sweep,
inpath,
lsi_dims,
outpath,
min_cluster_sizes,
min_samples,
cluster_selection_epsilons,
cluster_selection_methods)
class _hdbscan_lsi_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
lsi_dim,
*args,
**kwargs):
print(args)
print(kwargs)
self.lsi_dim = lsi_dim
self.jobtype = hdbscan_lsi_job
super().__init__(self.jobtype, inpath, outpath, self.namer, [self.lsi_dim], *args, **kwargs)
def namer(self, *args, **kwargs):
s = hdbscan_grid_sweep.namer(self, *args[1:], **kwargs)
s += f"_lsi-{self.lsi_dim}"
return s
def run_hdbscan_lsi_grid_sweep(savefile, inpath, outpath, min_cluster_sizes=[2], min_samples=[1], cluster_selection_epsilons=[0], cluster_selection_methods=['eom'],lsi_dimensions='all'):
    """Run hdbscan clustering once or more with different parameters.
    Usage:
    hdbscan_clustering_lsi --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --min_cluster_sizes=<csv> --min_samples=<csv> --cluster_selection_epsilons=<csv> --cluster_selection_methods=<csv "eom"|"leaf"> --lsi_dimensions=<"all"|csv number of LSI dimensions to use>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics
    inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
    outpath: path to output fit clusterings.
    min_cluster_sizes: one or more integers indicating the minimum cluster size
    min_samples: one or more integers indicating the minimum number of samples used in the algorithm
    cluster_selection_epsilons: one or more similarity thresholds for transition from dbscan to hdbscan
    cluster_selection_methods: one or more of "eom" or "leaf"; eom gives larger clusters.
    lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
    """
    obj = hdbscan_lsi_grid_sweep(inpath,
                                 lsi_dimensions,
                                 outpath,
                                 list(map(int,min_cluster_sizes)),
                                 list(map(int,min_samples)),
                                 list(map(float,cluster_selection_epsilons)),
                                 cluster_selection_methods)
    obj.run(10)
    obj.save(savefile)
if __name__ == "__main__":
fire.Fire(run_hdbscan_lsi_grid_sweep)


@@ -1,105 +0,0 @@
from sklearn.cluster import KMeans
import fire
from pathlib import Path
from dataclasses import dataclass
from clustering_base import clustering_result, clustering_job
from grid_sweep import grid_sweep
@dataclass
class kmeans_clustering_result(clustering_result):
n_clusters:int
n_init:int
max_iter:int
class kmeans_job(clustering_job):
def __init__(self, infile, outpath, name, n_clusters, n_init=10, max_iter=100000, random_state=1968, verbose=True):
super().__init__(infile,
outpath,
name,
call=kmeans_job._kmeans_clustering,
n_clusters=n_clusters,
n_init=n_init,
max_iter=max_iter,
random_state=random_state,
verbose=verbose)
self.n_clusters=n_clusters
self.n_init=n_init
self.max_iter=max_iter
def _kmeans_clustering(mat, *args, **kwargs):
clustering = KMeans(*args,
**kwargs,
).fit(mat)
return clustering
def get_info(self):
result = super().get_info()
self.result = kmeans_clustering_result(**result.__dict__,
n_init=self.n_init,
max_iter=self.max_iter)
return self.result
class kmeans_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
*args,
**kwargs):
super().__init__(kmeans_job, inpath, outpath, self.namer, *args, **kwargs)
def namer(self,
n_clusters,
n_init,
max_iter):
return f"nclusters-{n_clusters}_nit-{n_init}_maxit-{max_iter}"
def test_select_kmeans_clustering():
    # kmeans_lsi_grid_sweep is defined in the LSI variant of this module
    # (assumed to be kmeans_clustering_lsi); import it locally to avoid a
    # circular import at module load time.
    from kmeans_clustering_lsi import kmeans_lsi_grid_sweep
    inpath = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/"
    outpath = "test_kmeans"
    n_clusters = [200,300,400]
    n_init = [1,2,3]
    max_iter = [100000]
    gs = kmeans_lsi_grid_sweep(inpath, 'all', outpath, n_clusters, n_init, max_iter)
    gs.run(1)
    # the hdbscan sweep lines that followed were copy-paste residue using undefined names
    gs.save("test_kmeans/lsi_sweep.csv")
def run_kmeans_grid_sweep(savefile, inpath, outpath, n_clusters=[500], n_inits=[1], max_iters=[3000]):
    """Run kmeans clustering once or more with different parameters.
    Usage:
    kmeans_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --n_clusters=<csv number of clusters> --n_inits=<csv> --max_iters=<csv>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics
    inpath: path to feather data containing a labeled matrix of subreddit similarities.
    outpath: path to output fit kmeans clusterings.
    n_clusters: one or more numbers of kmeans clusters to select.
    n_inits: one or more numbers of different initializations to use for each clustering.
    max_iters: one or more numbers of different maximum iterations.
    """
    obj = kmeans_grid_sweep(inpath,
                            outpath,
                            map(int,n_clusters),
                            map(int,n_inits),
                            map(int,max_iters))
    obj.run(1)
    obj.save(savefile)
if __name__ == "__main__":
fire.Fire(run_kmeans_grid_sweep)


@@ -1,93 +0,0 @@
import fire
from dataclasses import dataclass
from kmeans_clustering import kmeans_job, kmeans_clustering_result, kmeans_grid_sweep
from lsi_base import lsi_mixin, lsi_result_mixin, lsi_grid_sweep
from grid_sweep import grid_sweep
@dataclass
class kmeans_clustering_result_lsi(kmeans_clustering_result, lsi_result_mixin):
pass
class kmeans_lsi_job(kmeans_job, lsi_mixin):
def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
super().__init__(infile,
outpath,
name,
*args,
**kwargs)
super().set_lsi_dims(lsi_dims)
def get_info(self):
result = super().get_info()
self.result = kmeans_clustering_result_lsi(**result.__dict__,
lsi_dimensions=self.lsi_dims)
return self.result
class _kmeans_lsi_grid_sweep(grid_sweep):
def __init__(self,
inpath,
outpath,
lsi_dim,
*args,
**kwargs):
print(args)
print(kwargs)
self.lsi_dim = lsi_dim
self.jobtype = kmeans_lsi_job
super().__init__(self.jobtype, inpath, outpath, self.namer, [self.lsi_dim], *args, **kwargs)
def namer(self, *args, **kwargs):
s = kmeans_grid_sweep.namer(self, *args[1:], **kwargs)
s += f"_lsi-{self.lsi_dim}"
return s
class kmeans_lsi_grid_sweep(lsi_grid_sweep):
def __init__(self,
inpath,
lsi_dims,
outpath,
n_clusters,
n_inits,
max_iters
):
super().__init__(kmeans_lsi_job,
_kmeans_lsi_grid_sweep,
inpath,
lsi_dims,
outpath,
n_clusters,
n_inits,
max_iters)
def run_kmeans_lsi_grid_sweep(savefile, inpath, outpath, n_clusters=[500], n_inits=[1], max_iters=[3000], lsi_dimensions="all"):
    """Run kmeans clustering once or more with different parameters.
    Usage:
    kmeans_clustering_lsi.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --lsi_dimensions=<"all"|csv number of LSI dimensions to use> --n_clusters=<csv number of clusters> --n_inits=<csv> --max_iters=<csv>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics
    inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
    outpath: path to output fit kmeans clusterings.
    lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
    n_clusters: one or more numbers of kmeans clusters to select.
    n_inits: one or more numbers of different initializations to use for each clustering.
    max_iters: one or more numbers of different maximum iterations.
    """
    obj = kmeans_lsi_grid_sweep(inpath,
                                lsi_dimensions,
                                outpath,
                                list(map(int,n_clusters)),
                                list(map(int,n_inits)),
                                list(map(int,max_iters))
                                )
    obj.run(1)
    obj.save(savefile)
if __name__ == "__main__":
fire.Fire(run_kmeans_lsi_grid_sweep)


@@ -1,44 +0,0 @@
from clustering_base import clustering_job, clustering_result
from grid_sweep import grid_sweep, twoway_grid_sweep
from dataclasses import dataclass
from itertools import chain
from pathlib import Path
class lsi_mixin():
def set_lsi_dims(self, lsi_dims):
self.lsi_dims = lsi_dims
@dataclass
class lsi_result_mixin:
lsi_dimensions:int
class lsi_grid_sweep(grid_sweep):
def __init__(self, jobtype, subsweep, inpath, lsi_dimensions, outpath, *args, **kwargs):
self.jobtype = jobtype
self.subsweep = subsweep
inpath = Path(inpath)
if lsi_dimensions == 'all':
lsi_paths = list(inpath.glob("*.feather"))
else:
lsi_paths = [inpath / (str(dim) + '.feather') for dim in lsi_dimensions]
print(lsi_paths)
lsi_nums = [int(p.stem) for p in lsi_paths]
self.hasrun = False
self.subgrids = [self.subsweep(lsi_path, outpath, lsi_dim, *args, **kwargs) for lsi_dim, lsi_path in zip(lsi_nums, lsi_paths)]
self.jobs = list(chain(*map(lambda gs: gs.jobs, self.subgrids)))
class twoway_lsi_grid_sweep(twoway_grid_sweep):
def __init__(self, jobtype, subsweep, inpath, lsi_dimensions, outpath, args1, args2):
self.jobtype = jobtype
self.subsweep = subsweep
inpath = Path(inpath)
if lsi_dimensions == 'all':
lsi_paths = list(inpath.glob("*.feather"))
else:
lsi_paths = [inpath / (str(dim) + '.feather') for dim in lsi_dimensions]
lsi_nums = [int(p.stem) for p in lsi_paths]
self.hasrun = False
self.subgrids = [self.subsweep(lsi_path, outpath, lsi_dim, args1, args2) for lsi_dim, lsi_path in zip(lsi_nums, lsi_paths)]
self.jobs = list(chain(*map(lambda gs: gs.jobs, self.subgrids)))


@@ -1,33 +0,0 @@
#!/usr/bin/env python3
import fire
import pandas as pd
from pathlib import Path
import shutil
selection_data="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/clustering/comment_authors_compex_LSI/selection_data.csv"
outpath = 'test_best.feather'
min_clusters=50; max_isolates=7500; min_cluster_size=2
# pick the best clustering according to silhouette score subject to constraints
def pick_best_clustering(selection_data, output, min_clusters, max_isolates, min_cluster_size):
df = pd.read_csv(selection_data,index_col=0)
df = df.sort_values("silhouette_score",ascending=False)
# not sure I fixed the bug underlying this fully or not.
df['n_isolates_str'] = df.n_isolates.str.strip("[]")
df['n_isolates_0'] = df['n_isolates_str'].apply(lambda l: len(l) == 0)
df.loc[df.n_isolates_0,'n_isolates'] = 0
df.loc[~df.n_isolates_0,'n_isolates'] = df.loc[~df.n_isolates_0].n_isolates_str.apply(lambda l: int(l))
best_cluster = df[(df.n_isolates <= max_isolates)&(df.n_clusters >= min_clusters)&(df.min_cluster_size==min_cluster_size)]
best_cluster = best_cluster.iloc[0]
best_lsi_dimensions = best_cluster.lsi_dimensions
print(best_cluster.to_dict())
best_path = Path(best_cluster.outpath) / (str(best_cluster['name']) + ".feather")
shutil.copy(best_path,output)
print(f"lsi dimensions:{best_lsi_dimensions}")
if __name__ == "__main__":
fire.Fire(pick_best_clustering)


@@ -1,38 +1,101 @@
# --- old version (left column of the side-by-side diff) ---
import pandas as pd
import plotnine as pn
from pathlib import Path
from clustering.fit_tsne import fit_tsne
from visualization.tsne_vis import build_visualization
df = pd.read_csv("/gscratch/comdata/output/reddit_clustering/subreddit_comment_authors-tf_10k_LSI/hdbscan/selection_data.csv",index_col=0)
# plot silhouette_score as a function of isolates
df = df.sort_values("silhouette_score")
df['n_isolates'] = df.n_isolates.str.split("\n0").apply(lambda rg: int(rg[1]))
p = pn.ggplot(df,pn.aes(x='n_isolates',y='silhouette_score')) + pn.geom_point()
p.save("isolates_x_score.png")
p = pn.ggplot(df,pn.aes(y='n_clusters',x='n_isolates',color='silhouette_score')) + pn.geom_point()
p.save("clusters_x_isolates.png")
# the best result for hdbscan seems like this one: it has a decent number of
# i think I prefer the 'eom' clustering style because larger clusters are less likely to suffer from omitted variables
best_eom = df[(df.n_isolates <5000)&(df.silhouette_score>0.4)&(df.cluster_selection_method=='eom')&(df.min_cluster_size==2)].iloc[df.shape[1]]
best_leaf = df[(df.n_isolates <5000)&(df.silhouette_score>0.4)&(df.cluster_selection_method=='leaf')&(df.min_cluster_size==2)].iloc[df.shape[1]]
tsne_data = Path("./clustering/authors-tf_lsi850_tsne.feather")
if not tsne_data.exists():
    fit_tsne("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/850.feather",
             tsne_data)
build_visualization("./clustering/authors-tf_lsi850_tsne.feather",
                    Path(best_eom.outpath)/(best_eom['name']+'.feather'),
                    "./authors-tf_lsi850_best_eom.html")
build_visualization("./clustering/authors-tf_lsi850_tsne.feather",
                    Path(best_leaf.outpath)/(best_leaf['name']+'.feather'),
                    "./authors-tf_lsi850_best_leaf.html")

# --- new version (right column of the side-by-side diff) ---
from sklearn.metrics import silhouette_score
from sklearn.cluster import AffinityPropagation
from functools import partial
from clustering import _affinity_clustering, read_similarity_mat
from dataclasses import dataclass
from multiprocessing import Pool, cpu_count, Array, Process
from pathlib import Path
from itertools import product, starmap
import numpy as np
import pandas as pd
import fire
import sys
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.
@dataclass
class clustering_result:
    outpath:Path
    damping:float
    max_iter:int
    convergence_iter:int
    preference_quantile:float
    silhouette_score:float
    alt_silhouette_score:float
    name:str
def sim_to_dist(mat):
    dist = 1-mat
    dist[dist < 0] = 0
    np.fill_diagonal(dist,0)
    return dist
def do_clustering(damping, convergence_iter, preference_quantile, name, mat, subreddits, max_iter, outdir:Path, random_state, verbose, alt_mat, overwrite=False):
    if name is None:
        name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
    print(name)
    sys.stdout.flush()
    outpath = outdir / (str(name) + ".feather")
    print(outpath)
    clustering = _affinity_clustering(mat, subreddits, outpath, damping, max_iter, convergence_iter, preference_quantile, random_state, verbose)
    mat = sim_to_dist(clustering.affinity_matrix_)
    score = silhouette_score(mat, clustering.labels_, metric='precomputed')
    # stays None when no alternative similarity matrix is supplied
    alt_score = None
    if alt_mat is not None:
        alt_distances = sim_to_dist(alt_mat)
        # score the alternative *distances*, not the raw similarities
        alt_score = silhouette_score(alt_distances, clustering.labels_, metric='precomputed')
    res = clustering_result(outpath=outpath,
                            damping=damping,
                            max_iter=max_iter,
                            convergence_iter=convergence_iter,
                            preference_quantile=preference_quantile,
                            silhouette_score=score,
                            alt_silhouette_score=alt_score,
                            name=str(name))
    return res
# alt similarities is for checking the silhouette coefficient of an alternative measure of similarity (e.g., topic similarities for user clustering).
def select_affinity_clustering(similarities, outdir, outinfo, damping=[0.9], max_iter=100000, convergence_iter=[30], preference_quantile=[0.5], random_state=1968, verbose=True, alt_similarities=None, J=None):
    damping = list(map(float,damping))
    convergence_iter = list(map(int,convergence_iter))
    preference_quantile = list(map(float,preference_quantile))
    if type(outdir) is str:
        outdir = Path(outdir)
    outdir.mkdir(parents=True,exist_ok=True)
    subreddits, mat = read_similarity_mat(similarities,use_threads=True)
    if alt_similarities is not None:
        # read_similarity_mat returns (subreddits, matrix); keep only the matrix
        _, alt_mat = read_similarity_mat(alt_similarities,use_threads=True)
    else:
        alt_mat = None
    if J is None:
        J = cpu_count()
    pool = Pool(J)
    # get list of tuples: the combinations of hyperparameters
    hyper_grid = product(damping, convergence_iter, preference_quantile)
    hyper_grid = (t + (str(i),) for i, t in enumerate(hyper_grid))
    _do_clustering = partial(do_clustering, mat=mat, subreddits=subreddits, outdir=outdir, max_iter=max_iter, random_state=random_state, verbose=verbose, alt_mat=alt_mat)
    # similarities = Array('d', mat)
    # call pool.starmap
    print("running clustering selection")
    clustering_data = pool.starmap(_do_clustering, hyper_grid)
    clustering_data = pd.DataFrame(list(clustering_data))
    clustering_data.to_csv(outinfo)
    return clustering_data
if __name__ == "__main__":
    x = fire.Fire(select_affinity_clustering)


@@ -1,4 +0,0 @@
from sklearn import metrics
from sklearn.cluster import AffinityPropagation
from functools import partial
# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying.


@@ -1,28 +0,0 @@
all: ../../data/reddit_comments_by_subreddit.parquet ../../data/reddit_submissions_by_subreddit.parquet
../../data/reddit_comments_by_subreddit.parquet: ../../data/temp/reddit_comments.parquet
	../start_spark_and_run.sh 4 comments_2_parquet_part2.py
../../data/temp/reddit_comments.parquet: comments_task_list.sh run_comments_jobs.sbatch
	mkdir -p comments_jobs
	mkdir -p ../../data/temp/
	sbatch --wait --array=1-$(shell cat comments_task_list.sh | wc -l) run_comments_jobs.sbatch 0
temp_reddit_comments.parquet: ../../data/temp/reddit_comments.parquet
comments_task_list.sh: comments_2_parquet_part1.py
	srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 bash -c "source ~/.bashrc && python3 comments_2_parquet_part1.py gen_task_list --overwrite=False"
submissions_task_list.sh: submissions_2_parquet_part1.py
	srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 python3 submissions_2_parquet_part1.py gen_task_list
../../data/reddit_submissions_by_subreddit.parquet: ../../data/temp/reddit_submissions.parquet
	../start_spark_and_run.sh 4 submissions_2_parquet_part2.py
../../data/temp/reddit_submissions.parquet: submissions_task_list.sh run_submissions_jobs.sbatch
	mkdir -p submissions_jobs
	rm -rf ../../data/temp/reddit_submissions.parquet
	mkdir -p ../../data/temp/
	sbatch --wait --array=1-$(shell cat submissions_task_list.sh | wc -l) run_submissions_jobs.sbatch 0
temp_reddit_submissions.parquet: ../../data/temp/reddit_submissions.parquet

380
datasets/README.md Normal file

@@ -0,0 +1,380 @@
# Reddit dumps → sorted parquet datasets
This directory holds the pipeline that turns compressed Reddit dump files
(`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the
sorted, repartitioned parquet datasets that the rest of the project
consumes.
## Pipeline overview
The raw dumps are huge compressed json files with a lot of metadata that
we may not need. They aren't indexed so it's expensive to pull data from
just a handful of subreddits. It also turns out that it's a pain to read
these compressed files straight into spark. Extracting useful variables
from the dumps and building parquet datasets makes them easier to work
with. This happens in two steps:
1. Extracting json into (temporary, unpartitioned) parquet files using
pyarrow.
2. Repartitioning and sorting the data using pyspark.
Breaking this down into two steps is useful because it allows us to
decompress and parse the dumps in the backfill queue and then sort them
in spark. Partitioning the data makes it possible to efficiently read
data for specific subreddits or authors. Sorting it means that you can
efficiently compute aggregations at the subreddit or user level. More
documentation on using these files is available on the [CDSC wiki][hyak-datasets].
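To illustrate why sorting matters for aggregation, here is a minimal sketch in plain Python. It uses a handful of made-up records, not the real parquet data, and the function name `comments_per_subreddit` is hypothetical. Because the rows are already sorted by subreddit, each group can be consumed in a single pass without holding the whole dataset in memory.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records standing in for rows of a dataset that is already
# sorted by lowercase subreddit name.
rows = [
    {"subreddit": "askscience", "author": "a", "score": 3},
    {"subreddit": "askscience", "author": "b", "score": 5},
    {"subreddit": "python", "author": "a", "score": 1},
    {"subreddit": "python", "author": "c", "score": 7},
    {"subreddit": "python", "author": "d", "score": 2},
]


def comments_per_subreddit(sorted_rows):
    """Count rows per subreddit in one pass, one group at a time."""
    counts = {}
    # groupby only merges *adjacent* equal keys, which is exactly what a
    # sorted input guarantees.
    for name, group in groupby(sorted_rows, key=itemgetter("subreddit")):
        counts[name] = sum(1 for _ in group)
    return counts


counts = comments_per_subreddit(rows)
print(counts)
```

On unsorted input the same loop would split a subreddit into several groups, which is the cost the sort step in Part 2 avoids.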
The final datasets are in `/gscratch/comdata/output`:
- `reddit_comments_by_author.parquet` has comments partitioned and sorted
by username (lowercase).
- `reddit_comments_by_subreddit.parquet` has comments partitioned and
sorted by subreddit name (lowercase).
- `reddit_submissions_by_author.parquet` has submissions partitioned and
sorted by username (lowercase).
- `reddit_submissions_by_subreddit.parquet` has submissions partitioned
and sorted by subreddit name (lowercase).
[hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets
## Scripts
| Script | Role |
|---|---|
| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file and exposes `parse_dump <file>` and `gen_task_list` subcommands via fire. |
| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads a directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Accepts `--indir` and `--mode` to support layered appends; defaults match the build-from-scratch workflow. |
| `comments_merge.py`, `submissions_merge.py` | Merge entry points. Each is a Spark job that collapses all accumulated layers in the final datasets into a single clean layer. Launched via `start_spark_and_run.sh`. |
| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` / `merge_layers` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |
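As a rough illustration of what a helper for opening compressed dumps can look like, the sketch below dispatches on the file suffix. It is not the code in `helper.py`: the function name `open_dump` is hypothetical, the `.gz`, `.bz2`, and `.xz` branches use Python's standard-library modules instead of a subprocess, and the `max_window_size` value is an assumption that the dumps were compressed with a large zstd window.

```python
import bz2
import gzip
import io
import lzma
from pathlib import Path


def open_dump(path):
    """Return a text-mode file object for a compressed dump file."""
    path = Path(path)
    suffix = path.suffix
    if suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8")
    if suffix == ".bz2":
        return bz2.open(path, "rt", encoding="utf-8")
    if suffix == ".xz":
        return lzma.open(path, "rt", encoding="utf-8")
    if suffix == ".zst":
        # Third-party dependency, imported lazily so the other formats
        # work without it.
        import zstandard

        raw = open(path, "rb")
        # Assumption: the archive may use a large window, so raise the limit.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        return io.TextIOWrapper(dctx.stream_reader(raw), encoding="utf-8")
    raise ValueError(f"unsupported dump format: {path.name}")
```

A caller would then iterate over the returned object line by line, for example `for line in open_dump("RC_2025-03.zst"): ...`, and parse each line as one JSON record.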
## The three workflows
### Build from scratch — `build_from_scratch.sh`
Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every `RC_*` / `RS_*` dump in the
raw dumps directory through Part 1 (in parallel via GNU parallel), then
runs the Part 2 Spark sort.
### Add new months — `add_months.sh YYYY-MM [YYYY-MM ...]`
> **NOTE: written but not yet tested. Remove this notice after a
> successful end-to-end run.**
Use this for routine incremental updates. Runs Part 1 on only the
specified months, then appends the sorted output as a new layer of
partition files alongside the existing ones. No existing data is
rewritten.
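The month arguments map directly onto the dump naming scheme described at the top of this README (`RC_YYYY-MM.zst` and `RS_YYYY-MM.zst`). The sketch below shows that mapping with basic validation. The function name `dump_names` is hypothetical, and the actual script may validate its arguments differently.

```python
import re

# Four-digit year, a dash, and a month from 01 to 12.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_names(months):
    """Map YYYY-MM strings to (comments, submissions) dump file names."""
    names = []
    for month in months:
        if not MONTH_RE.fullmatch(month):
            raise ValueError(f"expected YYYY-MM, got {month!r}")
        names.append((f"RC_{month}.zst", f"RS_{month}.zst"))
    return names


print(dump_names(["2025-01", "2025-02"]))
```

Rejecting malformed months up front means a typo fails immediately rather than after a long Part 1 run has started.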
Each run adds one layer to each final dataset directory. Spark and DuckDB
read all layers together correctly. At a yearly update cadence the number
of layers stays small; use `merge_layers.sh` to collapse them when
needed.
#### Environment setup
The Python environment runs inside a Singularity container. Set `PYTHON`
to the full path of the venv interpreter so that `parallel` jobs use the
right Python (fresh shells spawned by `parallel` don't inherit the active
venv):
```sh
PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3
```
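The same idea can be expressed in Python: resolve the interpreter once, from `PYTHON` if it is set, and pass its absolute path to every child process instead of relying on whatever `python3` is first on `PATH`. This is only a sketch; both function names are hypothetical and nothing here is taken from the scripts themselves.

```python
import os
import subprocess
import sys


def resolve_python():
    """Prefer the interpreter named by $PYTHON, else the current one."""
    return os.environ.get("PYTHON") or sys.executable


def run_with_explicit_python(args):
    """Run a child Python process by explicit interpreter path."""
    cmd = [resolve_python(), *args]
    return subprocess.run(cmd, check=True, capture_output=True, text=True)


# The child reports which interpreter it actually ran under.
result = run_with_explicit_python(["-c", "import sys; print(sys.executable)"])
print(result.stdout.strip())
```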
The `.zst` decompression uses the `zstandard` Python library rather than
the system `zstd` binary, which is inaccessible from inside the container.
#### Dump directory
The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
`SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
via environment variables if the files are not in the standard locations:
```sh
COMMENTS_DUMPDIR=/path/to/new/comments \
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./datasets/add_months.sh YYYY-MM [YYYY-MM ...]
```
#### Running as a Slurm job
The recommended way to run `add_months.sh` is via `srun` on a fat
`cpu-g2` node. Using `srun` (rather than `salloc`) means the node is
released automatically as soon as the script finishes, even if the
requested walltime has not run out. Run from a login node inside a `tmux`
session so the job survives a dropped connection:
```sh
tmux new -s add_months
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
bash -l -c "
cd /mmfs1/gscratch/comdata/users/makohill/cdsc_reddit && \
PYTHON=/gscratch/comdata/users/makohill/cdsc_reddit/venv/bin/python3 \
COMMENTS_DUMPDIR=/path/to/new/comments \
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./datasets/add_months.sh --clean 2025-01 2025-02 ... YYYY-MM
" 2>&1 | tee /gscratch/comdata/users/makohill/add_months_run.log
```
The `-l` flag makes `bash` a login shell on the compute node, so your
login startup files are read (on a typical setup these source `.bashrc`)
and the Spark environment is available. The `tee` command writes output to
both the terminal and a log file so you can review it later.
Detach from tmux with `Ctrl-b d` and reattach with `tmux attach -t add_months`.
For a multi-node Spark cluster instead, use `add_months_multinode.sh`
from a login node — it takes the number of nodes as its first argument.
### Merge layers — `merge_layers.sh`
> **NOTE: written but not yet tested. Remove this notice after a
> successful end-to-end run.**
Use this to collapse accumulated layers from incremental adds into a
single clean layer. Reads the existing final datasets, re-sorts
everything, writes to `.merging` temp paths, then atomically replaces the
originals via rename.
Run this when query performance has degraded due to many layers, or any
time you want a clean single-file-per-partition layout. The existing
datasets are safe until the rename step completes; see `merge_layers.sh`
for recovery notes if interrupted. As with `add_months.sh`, Part 2 can
run on a single fat node or via `start_spark_and_run.sh`.
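The write-then-rename pattern described above can be sketched for a single file as follows. This is an illustration under simplifying assumptions, not the logic of `merge_layers.sh`: the function name `replace_atomically` is hypothetical, and it handles one file, whereas a directory-based dataset cannot be renamed over a non-empty directory and would need additional handling that is not shown here.

```python
import os
import tempfile
from pathlib import Path


def replace_atomically(final_path, write_fn):
    """Write via a '.merging' sibling, then rename it over final_path."""
    final_path = Path(final_path)
    merging_path = final_path.with_name(final_path.name + ".merging")
    # If write_fn raises, final_path has not been touched.
    write_fn(merging_path)
    # os.replace is atomic on POSIX when both paths are on one filesystem.
    os.replace(merging_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "layer.txt"
    target.write_text("old\n")
    replace_atomically(target, lambda p: p.write_text("new\n"))
    merged_text = target.read_text()

print(merged_text)
```

Readers see either the old contents or the new ones, never a half-written file, which is why the original stays safe until the rename.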
## Running steps individually
Both `.sh` runners are written so that every meaningful step is a
separate, self-contained command. If something fails partway through, or
you want to inspect intermediate state, you can copy any single line out
of the runner and execute it standalone. For example:
```sh
# parse one specific file (skipping the rest of the workflow)
python3 comments_part1.py parse_dump RC_2025-03.zst
# override default dump/output paths from the CLI
python3 comments_part1.py parse_dump RC_2025-03.zst \
--dumpdir=/tmp/test --outdir=/tmp/out
# regenerate just the task list
python3 submissions_part1.py gen_task_list
```
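A task list in this sense is simply a file with one standalone `parse_dump` command per dump. The sketch below builds such lines for every matching file in a directory. The function name `gen_task_lines` and its default arguments are hypothetical, and the real `gen_task_list` subcommand may differ in details such as output location or overwrite handling.

```python
from pathlib import Path


def gen_task_lines(dumpdir, script="submissions_part1.py", pattern="RS_*.zst"):
    """Return one 'parse_dump' command per dump file, sorted by name."""
    files = sorted(p.name for p in Path(dumpdir).glob(pattern))
    return [f"python3 {script} parse_dump {name}" for name in files]
```

For comments, the same helper would be called as `gen_task_lines(dumpdir, script="comments_part1.py", pattern="RC_*.zst")`. Since every line is self-contained, any one of them can be copied out and run alone.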
The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
Hyak-provided wrapper not included in this repo); see the wiki for the
launch convention.
## Detailed walkthrough: refreshing the data on Hyak
This walkthrough describes the process we went through updating Reddit
data from the PushShift cutoff up to the end of 2024. Adapting it for
newer data should just involve using different academic torrent files
that start from 2025 onwards. For a single-month update, the
`add_months.sh` workflow above is much shorter; this walkthrough is
for the bulk-refresh case.
### Prerequisites
- [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config
and `.bashrc`)
- [Go through the Hyak Getting Started tutorial][hyak-syllabus]
Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`):
- [Watchful1's reddit explanation][watchful1-explainer] (separated by
subreddit), the [dataset not divided by subreddits][watchful1-bulk],
and the [GitHub repo with scripts for analyzing data][watchful1-repo]
- [RaiderBDev monthly dumps][raiderbdev-monthly] and
[RaiderBDev's ArcticShift API][arctic-shift]
- The [2005-06 to 2024-12 academic torrent][academic-torrent] used for
the 2005-2024 refresh
CDSC and Hyak docs:
- [Hyak docs — how to work with modules][hyak-modules]
- [CDSC — how to download Python or R packages][cdsc-pkgs]
- [CDSC — Hyak datasets information][hyak-datasets]
- [CDSC — Hyak Spark information][hyak-spark]
[hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak
[hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/
[watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/
[watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/
[watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master
[raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/
[arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift
[academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
[hyak-modules]: https://hyak.uw.edu/docs/tools/modules
[cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages
[hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark
### Step 1: data download on Nada and Hyak
We downloaded the [2005-2024 academic torrent][academic-torrent] and put
it on Nada (~2 days of downloading). We copied the raw data over to
Hyak's scrubbed directory in a new directory,
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw
data sorted into `/comments` or `/submissions`. The `/submissions`
directory contains `RS_20*.zst` files and the `/comments` directory
contains `RC_20*.zst` files. (There are no files in older compression
formats, such as `.bz2` or `.xz`, to deal with.)
### Step 2: clone the repo on Hyak
On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the
working directory next to the raw data, e.g.
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant
code lives entirely in `datasets/`:
- `dumps_helper.py`: shared parsing and Spark logic
- `helper.py`: file-open helpers
- `comments_part1.py`, `submissions_part1.py`: Part 1 entry points
- `comments_part2.py`, `submissions_part2.py`: Part 2 entry points
- `build_from_scratch.sh`, `add_months.sh`: the two runner scripts (`add_months_multinode.sh` is the multi-node variant of the latter)
The Spark wrapper scripts (`start_spark_and_run.sh`,
`start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
they are part of the CDSC Hyak environment and should already be on
PATH.
### Step 3: smoke-test Part 1 on a single file
Check out a compute node with `any_machine`. We'll test submissions
Part 1 with just one file:
```sh
python3 submissions_part1.py parse_dump RS_2005-06.zst
```
To verify, go to your output directory and examine the start of the
file:
```sh
python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())"
```
You should see columns like `id`, `author`, `subreddit`, and `title`
printed out. Repeat the process with `comments_part1.py`; you should see
columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out.
**Note**: you may have to install relevant libraries before successfully
running the file:
```sh
pip install --user pyarrow simdjson zstandard fire
```
### Step 4: Part 1 — converting `.zst` to `.parquet` files
Now we'll convert all of our `.zst` compressed Reddit data to `.parquet`
files. First, to generate our task list, we'll run
```sh
python3 submissions_part1.py gen_task_list
```
There should now be a task list, `parse_submissions_task_list`, in the
working directory. Check it (`less parse_submissions_task_list`); it
should have many lines that look like our earlier test command,
`python3 submissions_part1.py parse_dump RS_2005-06.zst`, one for each of
our `.zst` files. Run `python3 comments_part1.py gen_task_list` the same
way to generate `parse_comments_task_list`.
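For orientation, generating the task list amounts to writing one `parse_dump` command per dump file. The sketch below only illustrates that idea; the real logic lives in `dumps_helper.gen_task_list`, and the function name `write_task_list` and its parameters are hypothetical:

```python
from pathlib import Path


def write_task_list(dumpdir, script, tasklist, pattern="RS_20*.zst"):
    """Write one `python3 <script> parse_dump <file>` line per matching dump file.

    Returns the list of dump filenames that were written, in sorted order.
    """
    files = sorted(p.name for p in Path(dumpdir).glob(pattern))
    with open(tasklist, "w") as out:
        for name in files:
            out.write(f"python3 {script} parse_dump {name}\n")
    return files
```

Because each line is an independent command, the resulting file can be fed straight to GNU `parallel`, and any single line can be copied out and run by hand when debugging.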
From a login node, run `tmux` so the job survives a dropped connection,
and then `any_machine` to check out a node for the computational work.
We'll run the tasks from the task list with GNU `parallel` so that the
node's cores stay busy. Start with submissions:
```sh
parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list
```
The `--joblog` flag creates a text file that records the exit value of
each task, so you can see which ones completed successfully. The
`--results` flag creates a directory in which each task gets its own
saved stdout and stderr, so you can read the specific error for a failed
task.
Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`).
We'll ssh into our computational node (`ssh n1234` — you can get the
node name by running `ourjobs`) and run `htop`
([more details on htop][htop-explainer]). You should see that the
machine's CPUs are getting close to 100% usage. If all looks good,
create a new window and repeat the process for comments.
[htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/
Once the job has successfully completed, you'll see that your CPUs are
closer to 0% usage in `htop` and your `submissions_joblog.txt` file
should show an `exitval` of 0 for all commands. Kill your node by
running `scancel 12345678` (the job ID can be found from `ourjobs`).
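Scanning a long joblog by eye is error-prone, so a small script can list the failures instead. This is a minimal sketch that assumes the standard tab-separated GNU `parallel` joblog layout, with a header row containing `Exitval`, `Signal`, and `Command` columns; the function name `failed_jobs` is hypothetical:

```python
import csv


def failed_jobs(joblog_path):
    """Return (exitval, command) pairs for every task with a non-zero exit value or signal."""
    failures = []
    with open(joblog_path, newline="") as f:
        # QUOTE_NONE keeps any double quotes inside the Command column literal.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["Exitval"] != "0" or row["Signal"] != "0":
                failures.append((int(row["Exitval"]), row["Command"]))
    return failures
```

An empty list from `failed_jobs("submissions_joblog.txt")` corresponds to the "`exitval` of 0 for all commands" check described above; any command it returns can be re-run on its own.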
### Step 5: verify the per-source parquet files
We'll want to verify our `.parquet` files at this point. We compared the
new files' number of columns and rows to the old data: from the
`/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet`
directory, run
```sh
diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet)
```
and confirm there are no differences (same process with submissions).
If future updates come from the same academic torrent, there may be no
older copy to compare against. In that case, still check that the new
data's column and row counts are roughly continuous with the most recent
data we already have.
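The same comparison can be sketched in Python. This is an illustration only and is not how `report_parquet_filesizes.py` is implemented: `parquet_shapes` and `shape_differences` are hypothetical names, and the sketch assumes `pyarrow` is installed for reading the parquet footers.

```python
from pathlib import Path


def parquet_shapes(directory):
    """Map each parquet filename in `directory` to (num_rows, num_columns)."""
    import pyarrow.parquet as pq  # imported lazily; only needed when reading real files

    shapes = {}
    for path in sorted(Path(directory).glob("*.parquet")):
        meta = pq.read_metadata(str(path))  # reads only the file footer
        shapes[path.name] = (meta.num_rows, meta.num_columns)
    return shapes


def shape_differences(new, old):
    """Return {filename: (new_shape, old_shape)} for files whose shapes differ.

    A file present on only one side is reported with None for the other side.
    """
    return {
        name: (new.get(name), old.get(name))
        for name in sorted(set(new) | set(old))
        if new.get(name) != old.get(name)
    }
```

An empty dictionary from `shape_differences(parquet_shapes(new_dir), parquet_shapes(old_dir))` means every file has the same row and column counts on both sides.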
### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
If the `.parquet` files reasonably appear to be complete, we can now
sort them by author and subreddit. The most efficient way to do so is via
`srun` on a `cpu-g2` node (128 CPUs, ~1 TB RAM). Using `srun` releases
the node automatically when the job finishes. Run from a login node
inside `tmux`:
```sh
srun -p cpu-g2 -A comdata --nodes=1 --time=72:00:00 -c 112 --mem=400G \
bash -l -c "
cd /path/to/cdsc_reddit/datasets && \
source \$SPARK_CONF_DIR/spark-env.sh && \
start_spark_cluster.sh && \
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT submissions_part2.py && \
spark-submit --master spark://\$(hostname):\$SPARK_MASTER_PORT comments_part2.py && \
stop-all.sh
"
```
[hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
Monitor via `htop` (as described in Step 4); the CPUs may not always
show high usage, but you should see that memory is being used. The
command above sorts submissions first and then comments, so there is
nothing to repeat by hand. Successful jobs will result in
`/gscratch/comdata/output` having four new directories:
`reddit_submissions_by_author.parquet`,
`reddit_submissions_by_subreddit.parquet`,
`reddit_comments_by_author.parquet`, and
`reddit_comments_by_subreddit.parquet`. Each should contain many
`snappy.parquet` files (e.g.
`part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`)
and `_SUCCESS`.
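A quick completeness check of the four output directories can be scripted. This is a minimal sketch based only on the layout described above (a `_SUCCESS` marker plus `*.snappy.parquet` part files); the names `EXPECTED_DATASETS` and `incomplete_datasets` are hypothetical:

```python
from pathlib import Path

EXPECTED_DATASETS = [
    "reddit_submissions_by_author.parquet",
    "reddit_submissions_by_subreddit.parquet",
    "reddit_comments_by_author.parquet",
    "reddit_comments_by_subreddit.parquet",
]


def incomplete_datasets(output_root, names=EXPECTED_DATASETS):
    """Return {dataset_name: problem} for datasets that look missing or unfinished."""
    problems = {}
    for name in names:
        dataset = Path(output_root) / name
        if not dataset.is_dir():
            problems[name] = "missing directory"
        elif not (dataset / "_SUCCESS").exists():
            problems[name] = "no _SUCCESS marker"
        elif not any(dataset.glob("*.snappy.parquet")):
            problems[name] = "no snappy.parquet files"
    return problems
```

An empty result from `incomplete_datasets("/gscratch/comdata/output")` only shows that the jobs finished writing; it says nothing about whether the contents are correct, which is what Step 7 is for.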
### Step 7: data verification
Make sure the new data is reasonably complete before deleting any of the
old data. Plot a simple time series of posts per day and check that the
counts don't drop off unexpectedly. It is also useful to have lab
members re-run whatever they're currently working on against the new
parquet files.
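The posts-per-day check can be sketched with the standard library alone. This is an illustration under stated assumptions: `posts_per_day` and `suspicious_days` are hypothetical names, the input is any iterable of `datetime` values (for example a `created_utc` column), and the 0.5 threshold is an arbitrary starting point rather than a tested value.

```python
from collections import Counter
from datetime import timedelta
from statistics import median


def posts_per_day(timestamps):
    """Count posts per calendar day from an iterable of datetime objects."""
    return Counter(ts.date() for ts in timestamps)


def suspicious_days(counts, threshold=0.5):
    """Return days that are missing or whose count is below `threshold` times the median.

    Every day between the first and last observed day is checked, so a day
    with no posts at all is flagged too.
    """
    if not counts:
        return []
    typical = median(counts.values())
    day, last = min(counts), max(counts)
    flagged = []
    while day <= last:
        if counts.get(day, 0) < threshold * typical:
            flagged.append(day)
        day += timedelta(days=1)
    return flagged
```

Because Reddit's volume changes a great deal over the years, a single median over the whole archive is crude; running the check on one month or one year at a time is likely to give more useful flags.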
## See also
The CDSC wiki page
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
is the landing page for this project on the wiki and provides
cross-links to related CDSC and Hyak documentation. The walkthrough
above used to live there; it now lives here so that doc and code stay
in sync.

162
datasets/add_months.sh Executable file

@@ -0,0 +1,162 @@
#!/usr/bin/env bash
#
# Add one or more new months to the existing parquet datasets using a
# layered append. Designed to run on a single fat node (e.g. cpu-g2 with
# 128 cores / ~1TB RAM). For a multi-node Spark cluster instead, see
# add_months_multinode.sh.
#
# Usage:
# add_months.sh [--clean] YYYY-MM [YYYY-MM ...]
#
# Example:
# add_months.sh 2025-01 2025-02 2025-03
#
# If temp or staging directories from a previous run exist, the script
# will exit with an error. Pass --clean to wipe them before starting:
#
# add_months.sh --clean 2025-01 2025-02 2025-03
#
# The new .zst dump files must live at:
# $COMMENTS_DUMPDIR/RC_YYYY-MM.zst
# $SUBMISSIONS_DUMPDIR/RS_YYYY-MM.zst
#
# Override the dump directories via environment variables if the new files
# are not in the standard locations:
#
# COMMENTS_DUMPDIR=/path/to/new/comments \
# SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
# ./add_months.sh 2025-01 2025-02
#
# Workflow:
# Part 1 — parse new .zst files into per-month parquets (parallel)
# Part 2 — sort into staging directories, not the live datasets (Spark)
# [script exits here — verify staging before continuing]
# Copy — move staging files into live datasets (run manually after verify)
# Cleanup — remove temp and staging dirs (run manually after copy)
#
# NOTE: This script and its workflow are written but not yet tested.
# Remove this notice after a successful end-to-end run.
#
# Every command below is independently runnable for debugging.
set -e
cd "$(dirname "$0")"
CLEAN=0
if [ "${1:-}" = "--clean" ]; then
    CLEAN=1
    shift
fi
if [ $# -eq 0 ]; then
    echo "Usage: $0 [--clean] YYYY-MM [YYYY-MM ...]" >&2
    exit 1
fi
COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/comments}"
SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/submissions}"
PYTHON="${PYTHON:-python3}"
# Part 1 temp dirs (per-month parquets, parsed from .zst)
TEMP_COMMENTS="/gscratch/comdata/output/temp/add_months_comments.parquet"
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/add_months_submissions.parquet"
# Staging dirs (sorted new layer; inspected before copying to live)
STAGING_COMMENTS_SUB="/gscratch/comdata/output/temp/new_layer_comments_by_subreddit.parquet"
STAGING_COMMENTS_AUTH="/gscratch/comdata/output/temp/new_layer_comments_by_author.parquet"
STAGING_SUBMISSIONS_SUB="/gscratch/comdata/output/temp/new_layer_submissions_by_subreddit.parquet"
STAGING_SUBMISSIONS_AUTH="/gscratch/comdata/output/temp/new_layer_submissions_by_author.parquet"
# Live dataset dirs
LIVE_COMMENTS_SUB="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"
LIVE_COMMENTS_AUTH="/gscratch/comdata/output/reddit_comments_by_author.parquet"
LIVE_SUBMISSIONS_SUB="/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet"
LIVE_SUBMISSIONS_AUTH="/gscratch/comdata/output/reddit_submissions_by_author.parquet"
# --- Check for leftover output from a previous run --------------------------
EXISTING=()
for d in "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS" \
         "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH" \
         "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"; do
    [ -e "$d" ] && EXISTING+=("$d")
done
if [ ${#EXISTING[@]} -gt 0 ]; then
    if [ $CLEAN -eq 1 ]; then
        echo "Removing leftover files from previous run..."
        rm -rf "${EXISTING[@]}"
        rm -f add_months_tasks.txt add_months_joblog.txt
        rm -rf add_months_logs/
    else
        echo "Error: leftover files from a previous run exist:" >&2
        printf ' %s\n' "${EXISTING[@]}" >&2
        echo "Re-run with --clean to remove them before starting." >&2
        exit 1
    fi
fi
# --- Part 1: parse new months in parallel (comments and submissions together) -
printf "$PYTHON comments_part1.py parse_dump RC_%s.zst --dumpdir=\"$COMMENTS_DUMPDIR\" --outdir=\"$TEMP_COMMENTS\"\n" "$@" \
> add_months_tasks.txt
printf "$PYTHON submissions_part1.py parse_dump RS_%s.zst --dumpdir=\"$SUBMISSIONS_DUMPDIR\" --outdir=\"$TEMP_SUBMISSIONS\"\n" "$@" \
>> add_months_tasks.txt
parallel --joblog add_months_joblog.txt --results add_months_logs \
< add_months_tasks.txt
# --- Part 2: sort new months into staging (Spark, single fat node) ----------
source "$SPARK_CONF_DIR/spark-env.sh"
start_spark_cluster.sh
spark-submit --master "spark://$(hostname):$SPARK_MASTER_PORT" \
comments_part2.py \
--indir="$TEMP_COMMENTS" \
--out_by_subreddit="$STAGING_COMMENTS_SUB" \
--out_by_author="$STAGING_COMMENTS_AUTH"
spark-submit --master "spark://$(hostname):$SPARK_MASTER_PORT" \
submissions_part2.py \
--indir="$TEMP_SUBMISSIONS" \
--out_by_subreddit="$STAGING_SUBMISSIONS_SUB" \
--out_by_author="$STAGING_SUBMISSIONS_AUTH"
stop-all.sh
# --- Verify: inspect staging before copying to live -------------------------
#
# The script stops here. Check the staging output looks right before running
# the copy step manually. The live datasets are untouched at this point.
# Example checks:
#
# ls -lah "$STAGING_COMMENTS_SUB" | head
# python3 -c "
# import pyarrow.parquet as pq, os
# f = sorted(os.listdir('$STAGING_COMMENTS_SUB'))[0]
# t = pq.read_table('$STAGING_COMMENTS_SUB/' + f, columns=['created_utc'])
# print(t.column('created_utc')[0].as_py(), t.column('created_utc')[-1].as_py())
# "
exit 0
# --- Copy: add staging files into live datasets -----------------------------
#
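# Run these lines manually after verifying staging. This is the only step
# that touches the live datasets. It only adds files and never deletes
# any. Note that plain `cp` replaces a live file that has the same name
# as a staging file (e.g. the `_SUCCESS` marker); use `cp -n` if existing
# files must never be overwritten.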
find "$STAGING_COMMENTS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_SUB"/ \;
find "$STAGING_COMMENTS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_AUTH"/ \;
find "$STAGING_SUBMISSIONS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_SUB"/ \;
find "$STAGING_SUBMISSIONS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_AUTH"/ \;
# --- Cleanup: remove temp and staging dirs ----------------------------------
#
# Run after confirming the copy succeeded and the live datasets look right.
rm -f add_months_tasks.txt add_months_joblog.txt
rm -rf add_months_logs/
rm -rf "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS"
rm -rf "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH"
rm -rf "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"


@@ -0,0 +1,89 @@
#!/usr/bin/env bash
#
# Multi-node variant of add_months.sh. Uses start_spark_and_run.sh to
# allocate a Spark cluster across multiple nodes via salloc. Run this
# from a login node.
#
# For the common single-fat-node case, use add_months.sh instead.
#
# Usage:
# add_months_multinode.sh NODES YYYY-MM [YYYY-MM ...]
#
# Example (2 nodes, 3 months):
# add_months_multinode.sh 2 2025-01 2025-02 2025-03
#
# Override dump directories via environment variables if needed:
#
# COMMENTS_DUMPDIR=/path/to/new/comments \
# SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
# ./add_months_multinode.sh 2 2025-01 2025-02
#
# NOTE: This script and its workflow are written but not yet tested.
# Remove this notice after a successful end-to-end run.
set -e
cd "$(dirname "$0")"
NODES="${1:-}"
if [ -z "$NODES" ] || [ $# -lt 2 ]; then
    echo "Usage: $0 NODES YYYY-MM [YYYY-MM ...]" >&2
    exit 1
fi
shift
MONTHS=("$@")
COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/comments}"
SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/submissions}"
PYTHON="${PYTHON:-python3}"
TEMP_COMMENTS="/gscratch/comdata/output/temp/add_months_comments.parquet"
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/add_months_submissions.parquet"
STAGING_COMMENTS_SUB="/gscratch/comdata/output/temp/new_layer_comments_by_subreddit.parquet"
STAGING_COMMENTS_AUTH="/gscratch/comdata/output/temp/new_layer_comments_by_author.parquet"
STAGING_SUBMISSIONS_SUB="/gscratch/comdata/output/temp/new_layer_submissions_by_subreddit.parquet"
STAGING_SUBMISSIONS_AUTH="/gscratch/comdata/output/temp/new_layer_submissions_by_author.parquet"
LIVE_COMMENTS_SUB="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"
LIVE_COMMENTS_AUTH="/gscratch/comdata/output/reddit_comments_by_author.parquet"
LIVE_SUBMISSIONS_SUB="/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet"
LIVE_SUBMISSIONS_AUTH="/gscratch/comdata/output/reddit_submissions_by_author.parquet"
# --- Part 1: parse new months in parallel -----------------------------------
printf "$PYTHON comments_part1.py parse_dump RC_%s.zst --dumpdir=\"$COMMENTS_DUMPDIR\" --outdir=\"$TEMP_COMMENTS\"\n" "${MONTHS[@]}" \
> add_months_comments_tasks.txt
printf "$PYTHON submissions_part1.py parse_dump RS_%s.zst --dumpdir=\"$SUBMISSIONS_DUMPDIR\" --outdir=\"$TEMP_SUBMISSIONS\"\n" "${MONTHS[@]}" \
> add_months_submissions_tasks.txt
parallel --joblog add_months_comments_joblog.txt --results add_months_comments_logs \
< add_months_comments_tasks.txt
parallel --joblog add_months_submissions_joblog.txt --results add_months_submissions_logs \
< add_months_submissions_tasks.txt
# --- Part 2: sort new months into staging (multi-node Spark cluster) --------
start_spark_and_run.sh "$NODES" comments_part2.py \
--indir="$TEMP_COMMENTS" \
--out_by_subreddit="$STAGING_COMMENTS_SUB" \
--out_by_author="$STAGING_COMMENTS_AUTH"
start_spark_and_run.sh "$NODES" submissions_part2.py \
--indir="$TEMP_SUBMISSIONS" \
--out_by_subreddit="$STAGING_SUBMISSIONS_SUB" \
--out_by_author="$STAGING_SUBMISSIONS_AUTH"
# --- Verify staging, then copy and cleanup manually -------------------------
#
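# See add_months.sh for the verify/copy/cleanup commands. The copy and the
# directory cleanup below are the same; only the task-list, joblog, and
# log names differ here (add_months_comments_* and
# add_months_submissions_*).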
exit 0
find "$STAGING_COMMENTS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_SUB"/ \;
find "$STAGING_COMMENTS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_COMMENTS_AUTH"/ \;
find "$STAGING_SUBMISSIONS_SUB" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_SUB"/ \;
find "$STAGING_SUBMISSIONS_AUTH" -maxdepth 1 -type f -exec cp {} "$LIVE_SUBMISSIONS_AUTH"/ \;
rm -rf "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS"
rm -rf "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH"
rm -rf "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"

56
datasets/build_from_scratch.sh Executable file

@@ -0,0 +1,56 @@
#!/usr/bin/env bash
#
# Build the sorted, partitioned Reddit parquet datasets from scratch.
#
# Wipes the per-source temp directories, processes every RC_* and RS_* dump
# in the raw_data dumps directory through Part 1 (per-file, parallel), then
# runs the Part 2 Spark sort + repartition for both comments and submissions.
#
# Every command below is independently runnable — to debug a single stage,
# copy the line out and run it directly. Run the whole script end-to-end
# only when you trust each step.
#
# Prerequisites:
# - raw .zst dumps already staged in the dumpdir locations (see the
# defaults in dumps_helper.py, or override via --dumpdir)
# - GNU parallel installed
# - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
#
# To add new months to an existing build without rebuilding from scratch,
# use add_months.sh.
set -e
cd "$(dirname "$0")"
TEMP_COMMENTS="/gscratch/comdata/output/temp/reddit_comments.parquet"
TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/reddit_submissions.parquet"
# --- Part 1a: comments ------------------------------------------------------
# wipe any existing comments temp output
rm -rf "$TEMP_COMMENTS"
# generate the per-file parse task list
python3 comments_part1.py gen_task_list
# run all comments parse tasks in parallel
parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_task_list
# --- Part 1b: submissions ---------------------------------------------------
# wipe any existing submissions temp output
rm -rf "$TEMP_SUBMISSIONS"
# generate the per-file parse task list
python3 submissions_part1.py gen_task_list
# run all submissions parse tasks in parallel
parallel --joblog submissions_joblog.txt --results submissions_logs < parse_submissions_task_list
# --- Part 2: spark sort + repartition --------------------------------------
# sort comments and write reddit_comments_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 comments_part2.py
# sort submissions and write reddit_submissions_by_{subreddit,author}.parquet
start_spark_and_run.sh 1 submissions_part2.py


@@ -0,0 +1,26 @@
#!/bin/bash
## parallel_sql_job.sh
#SBATCH --job-name=tf_subreddit_comments
## Allocation Definition
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
## Resources
## Nodes. This should always be 1 for parallel-sql.
#SBATCH --nodes=1
## Walltime (12 hours)
#SBATCH --time=12:00:00
## Memory per node
#SBATCH --mem=32G
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
#SBATCH -D /gscratch/comdata/users/nathante/cdsc-reddit
source ./bin/activate
module load parallel_sql
echo $(which perl)
conda list pyarrow
which python3
#Put here commands to load other modules (e.g. matlab etc.)
#Below command means that parallel_sql will get tasks from the database
#and run them on the node (in parallel). So a 16 core node will have
#16 tasks running at one time.
parallel-sql --sql -a parallel --exit-on-term --jobs 4


@@ -1,10 +0,0 @@
#!/usr/bin/env bash
## needs to be run by hand since i don't have a nice way of waiting on a parallel-sql job to complete
echo "#!/usr/bin/bash" > job_script.sh
#echo "source $(pwd)/../bin/activate" >> job_script.sh
echo "python3 $(pwd)/comments_2_parquet_part1.py" >> job_script.sh
srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 --pty job_script.sh
start_spark_and_run.sh 1 $(pwd)/comments_2_parquet_part2.py


@@ -1,111 +0,0 @@
#!/usr/bin/env python3
import os
import json
from datetime import datetime
from multiprocessing import Pool
from itertools import islice
from helper import open_input_file, find_dumps
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path
import fire
def parse_comment(comment, names= None):
if names is None:
names = ["id","subreddit","link_id","parent_id","created_utc","author","ups","downs","score","edited","subreddit_type","subreddit_id","stickied","is_submitter","body","error"]
try:
comment = json.loads(comment)
except json.decoder.JSONDecodeError as e:
print(e)
print(comment)
row = [None for _ in names]
row[-1] = "json.decoder.JSONDecodeError|{0}|{1}".format(e,comment)
return tuple(row)
row = []
for name in names:
if name == 'created_utc':
row.append(datetime.fromtimestamp(int(comment['created_utc']),tz=None))
elif name == 'edited':
val = comment[name]
if type(val) == bool:
row.append(val)
row.append(None)
else:
row.append(True)
row.append(datetime.fromtimestamp(int(val),tz=None))
elif name == "time_edited":
continue
elif name not in comment:
row.append(None)
else:
row.append(comment[name])
return tuple(row)
# conf = sc._conf.setAll([('spark.executor.memory', '20g'), ('spark.app.name', 'extract_reddit_timeline'), ('spark.executor.cores', '26'), ('spark.cores.max', '26'), ('spark.driver.memory','84g'),('spark.driver.maxResultSize','0'),('spark.local.dir','../../data/spark_tmp')])
def parse_dump(partition):
dumpdir = f"../../data/reddit_dumps/comments/{partition}"
stream = open_input_file(dumpdir)
rows = map(parse_comment, stream)
schema = pa.schema([
pa.field('id', pa.string(), nullable=True),
pa.field('subreddit', pa.string(), nullable=True),
pa.field('link_id', pa.string(), nullable=True),
pa.field('parent_id', pa.string(), nullable=True),
pa.field('created_utc', pa.timestamp('ms'), nullable=True),
pa.field('author', pa.string(), nullable=True),
pa.field('ups', pa.int64(), nullable=True),
pa.field('downs', pa.int64(), nullable=True),
pa.field('score', pa.int64(), nullable=True),
pa.field('edited', pa.bool_(), nullable=True),
pa.field('time_edited', pa.timestamp('ms'), nullable=True),
pa.field('subreddit_type', pa.string(), nullable=True),
pa.field('subreddit_id', pa.string(), nullable=True),
pa.field('stickied', pa.bool_(), nullable=True),
pa.field('is_submitter', pa.bool_(), nullable=True),
pa.field('body', pa.string(), nullable=True),
pa.field('error', pa.string(), nullable=True),
])
p = Path("../../data/temp/reddit_comments.parquet")
p.mkdir(exist_ok=True,parents=True)
N=10000
with pq.ParquetWriter(f"../../data/temp/reddit_comments.parquet/{partition}.parquet",
schema=schema,
compression='snappy',
flavor='spark') as writer:
while True:
chunk = islice(rows,N)
pddf = pd.DataFrame(chunk, columns=schema.names)
table = pa.Table.from_pandas(pddf,schema=schema)
if table.shape[0] == 0:
break
writer.write_table(table)
writer.close()
def gen_task_list(dumpdir="../../data/raw_data/reddit_dumps/comments", overwrite=True):
files = list(find_dumps(dumpdir,base_pattern="RC_20*.*"))
with open("comments_task_list.sh",'w') as of:
for fpath in files:
partition = os.path.split(fpath)[1]
if (not Path(f"../../data/temp/reddit_comments.parquet/{partition}.parquet").exists()) or (overwrite is True):
of.write(f'python3 comments_2_parquet_part1.py parse_dump {partition}\n')
if __name__ == '__main__':
fire.Fire({'parse_dump':parse_dump,
'gen_task_list':gen_task_list})


@@ -1,37 +0,0 @@
#!/usr/bin/env python3
# spark script to make sorted, and partitioned parquet files
import pyspark
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
conf = pyspark.SparkConf().setAppName("Reddit submissions to parquet")
conf = conf.set("spark.sql.shuffle.partitions",2400)
conf = conf.set('spark.sql.crossJoin.enabled',"true")
conf = conf.set('spark.debug.maxToStringFields',200)
sc = spark.sparkContext
df = spark.read.parquet("/gscratch/comdata/output/temp/reddit_comments.parquet",compression='snappy')
df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
df = df.drop('subreddit')
df = df.withColumnRenamed('subreddit_2','subreddit')
df = df.withColumnRenamed("created_utc","CreatedAt")
df = df.withColumn("Month",f.month(f.col("CreatedAt")))
df = df.withColumn("Year",f.year(f.col("CreatedAt")))
df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
# df = df.repartition(1200,'subreddit')
# df2 = df.sort(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
# df2 = df2.sortWithinPartitions(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
# df2.write.parquet("/gscratch/scrubbed/comdata/reddit_comments_by_subreddit.parquet", mode='overwrite', compression='snappy')
#df = spark.read.parquet("/gscratch/scrubbed/comdata/reddit_comments_by_subreddit.parquet")
df = df.repartition(2400,'author','subreddit',"Year","Month","Day")
df3 = df.sort(["author","subreddit","Year","Month","Day","CreatedAt","link_id","parent_id"],ascending=True)
df3 = df3.sortWithinPartitions(["author","subreddit","Year","Month","Day","CreatedAt","link_id","parent_id"],ascending=True)
df3.write.parquet("/gscratch/scrubbed/comdata/reddit_comments_by_author.parquet", mode='overwrite',compression='snappy')


@@ -0,0 +1,14 @@
#!/usr/bin/env python3
"""Collapse all layers in the comments final datasets into a single clean layer.
Must be launched from a login node via the Hyak-provided wrapper:
start_spark_and_run.sh 1 comments_merge.py
See merge_layers.sh and dumps_helper.merge_layers for details.
"""
from dumps_helper import COMMENTS, merge_layers
if __name__ == "__main__":
    merge_layers(COMMENTS)

24
datasets/comments_part1.py Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""Part 1 for comments: parse one RC_*.zst dump into a parquet file.
CLI:
comments_part1.py parse_dump RC_2018-08.zst
comments_part1.py gen_task_list
comments_part1.py parse_dump RC_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
"""
import fire
from dumps_helper import COMMENTS, parse_dump, gen_task_list
def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(COMMENTS, partition, dumpdir=dumpdir, outdir=outdir)


def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(COMMENTS, 'comments_part1.py', dumpdir=dumpdir, tasklist=tasklist)


if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})

21
datasets/comments_part2.py Executable file

@@ -0,0 +1,21 @@
#!/usr/bin/env python3
"""Part 2 for comments: Spark sort + repartition into the final datasets.
Usually launched from a login node via the Hyak-provided wrapper:
    start_spark_and_run.sh 1 comments_part2.py
    start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets
It can also be submitted with spark-submit to an already-running cluster
(see add_months.sh).
--indir defaults to the temp comments dir in dumps_helper.py.
--out_by_subreddit and --out_by_author default to the live dataset paths;
override them to write to staging directories first (see add_months.sh).
"""
import fire
from dumps_helper import COMMENTS, sort_and_write
if __name__ == "__main__":
    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
              sort_and_write(COMMENTS, indir=indir,
                             out_by_subreddit=out_by_subreddit,
                             out_by_author=out_by_author))

343
datasets/dumps_helper.py Normal file

@@ -0,0 +1,343 @@
"""Shared logic for the comments and submissions dump-to-parquet pipeline.
Used by comments_part1.py / submissions_part1.py (Part 1: one compressed
dump file → one parquet file) and comments_part2.py / submissions_part2.py
(Part 2: Spark sort + repartition of the per-source parquets).
The two dump types only differ in their schemas and a handful of
field-specific extractors. The parse loop, the file I/O wrapping, the
task-list generator, and the Spark sort are all shared here.
"""
import os
import shutil
from datetime import datetime
from itertools import islice
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import simdjson
from helper import find_dumps, open_fileset
_json = simdjson.Parser()
# --- field-level extractors ------------------------------------------------
def _ts(name):
    """Extractor for a unix-timestamp field (or None if missing)."""
    def handler(record):
        val = record.get(name)
        if val is None:
            return None
        return datetime.fromtimestamp(int(val), tz=None)
    return handler


def _edited(record):
    """Returns (edited, time_edited). The dump packs both into one `edited`
    field that is either a bool (never edited / unknown timestamp) or a
    unix timestamp."""
    val = record.get('edited')
    if isinstance(val, bool):
        return (val, None)
    if val is None:
        return (None, None)
    return (True, datetime.fromtimestamp(int(val), tz=None))


def _has_media(record):
    """Submissions don't have a `has_media` field directly; derive it."""
    return record.get('media') is not None
# --- generic parse loop ----------------------------------------------------
def parse_record(line, fields, handlers):
    """Parse one JSON line into a tuple aligned with `fields`.

    `handlers` maps field name → callable(record) returning either a single
    value (one column) or a tuple of values (multiple consecutive columns,
    consuming the next len(tuple)-1 entries in `fields`).

    Fields without a handler are pulled from the record by name, with
    missing keys yielding None.

    The last field in `fields` is reserved for an error message string
    and is set to None on success.
    """
    try:
        record = _json.parse(line)
    except (ValueError, KeyError) as e:
        row = [None] * len(fields)
        row[-1] = f"parse error|{e}|{line}"
        return tuple(row)
    row = []
    skip_next = 0
    for name in fields:
        if skip_next > 0:
            skip_next -= 1
            continue
        handler = handlers.get(name)
        if handler is None:
            try:
                row.append(record[name])
            except KeyError:
                row.append(None)
        else:
            result = handler(record)
            if isinstance(result, tuple):
                row.extend(result)
                skip_next = len(result) - 1
            else:
                row.append(result)
    return tuple(row)
# --- comments schema -------------------------------------------------------
COMMENT_FIELDS = [
'id', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'author',
'ups', 'downs', 'score', 'edited', 'time_edited', 'subreddit_type',
'subreddit_id', 'stickied', 'is_submitter', 'body', 'error',
]
COMMENT_SCHEMA = pa.schema([
pa.field('id', pa.string(), nullable=True),
pa.field('subreddit', pa.string(), nullable=True),
pa.field('link_id', pa.string(), nullable=True),
pa.field('parent_id', pa.string(), nullable=True),
pa.field('created_utc', pa.timestamp('ms'), nullable=True),
pa.field('author', pa.string(), nullable=True),
pa.field('ups', pa.int64(), nullable=True),
pa.field('downs', pa.int64(), nullable=True),
pa.field('score', pa.int64(), nullable=True),
pa.field('edited', pa.bool_(), nullable=True),
pa.field('time_edited', pa.timestamp('ms'), nullable=True),
pa.field('subreddit_type', pa.string(), nullable=True),
pa.field('subreddit_id', pa.string(), nullable=True),
pa.field('stickied', pa.bool_(), nullable=True),
pa.field('is_submitter', pa.bool_(), nullable=True),
pa.field('body', pa.string(), nullable=True),
pa.field('error', pa.string(), nullable=True),
])
COMMENT_HANDLERS = {
'created_utc': _ts('created_utc'),
'edited': _edited,
}
# --- submissions schema ----------------------------------------------------
SUBMISSION_FIELDS = [
'id', 'author', 'subreddit', 'title', 'created_utc', 'permalink', 'url',
'domain', 'score', 'ups', 'downs', 'over_18', 'has_media', 'selftext',
'retrieved_on', 'num_comments', 'gilded', 'edited', 'time_edited',
'subreddit_type', 'subreddit_id', 'subreddit_subscribers', 'name',
'is_self', 'stickied', 'quarantine', 'error',
]
SUBMISSION_SCHEMA = pa.schema([
pa.field('id', pa.string(), nullable=True),
pa.field('author', pa.string(), nullable=True),
pa.field('subreddit', pa.string(), nullable=True),
pa.field('title', pa.string(), nullable=True),
pa.field('created_utc', pa.timestamp('ms'), nullable=True),
pa.field('permalink', pa.string(), nullable=True),
pa.field('url', pa.string(), nullable=True),
pa.field('domain', pa.string(), nullable=True),
pa.field('score', pa.int64(), nullable=True),
pa.field('ups', pa.int64(), nullable=True),
pa.field('downs', pa.int64(), nullable=True),
pa.field('over_18', pa.bool_(), nullable=True),
pa.field('has_media', pa.bool_(), nullable=True),
pa.field('selftext', pa.string(), nullable=True),
pa.field('retrieved_on', pa.timestamp('ms'), nullable=True),
pa.field('num_comments', pa.int64(), nullable=True),
pa.field('gilded', pa.int64(), nullable=True),
pa.field('edited', pa.bool_(), nullable=True),
pa.field('time_edited', pa.timestamp('ms'), nullable=True),
pa.field('subreddit_type', pa.string(), nullable=True),
pa.field('subreddit_id', pa.string(), nullable=True),
pa.field('subreddit_subscribers', pa.int64(), nullable=True),
pa.field('name', pa.string(), nullable=True),
pa.field('is_self', pa.bool_(), nullable=True),
pa.field('stickied', pa.bool_(), nullable=True),
pa.field('quarantine', pa.bool_(), nullable=True),
pa.field('error', pa.string(), nullable=True),
])
SUBMISSION_HANDLERS = {
'created_utc': _ts('created_utc'),
'retrieved_on': _ts('retrieved_on'),
'edited': _edited,
'has_media': _has_media,
}
# --- per-type configuration ------------------------------------------------
# Defaults that the entry-point scripts pass through, exposed here so the
# field/schema/handler triplet, the canonical paths, and the dump filename
# pattern all live in one place.
COMMENTS = {
'fields': COMMENT_FIELDS,
'schema': COMMENT_SCHEMA,
'handlers': COMMENT_HANDLERS,
'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/comments",
'outdir': "/gscratch/comdata/output/temp/reddit_comments.parquet",
'file_pattern': 'RC_20*.*',
'task_list': 'parse_comments_task_list',
'output_by_subreddit': "/gscratch/comdata/output/reddit_comments_by_subreddit.parquet",
'output_by_author': "/gscratch/comdata/output/reddit_comments_by_author.parquet",
'subreddit_sort_keys': ["subreddit", "CreatedAt", "link_id", "parent_id", "Year", "Month", "Day"],
'author_sort_keys': ["author", "CreatedAt", "subreddit", "link_id", "parent_id", "Year", "Month", "Day"],
'app_name': "Reddit comments to parquet",
}
SUBMISSIONS = {
'fields': SUBMISSION_FIELDS,
'schema': SUBMISSION_SCHEMA,
'handlers': SUBMISSION_HANDLERS,
'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/submissions",
'outdir': "/gscratch/comdata/output/temp/reddit_submissions.parquet",
'file_pattern': 'RS_20*.*',
'task_list': 'parse_submissions_task_list',
'output_by_subreddit': "/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet",
'output_by_author': "/gscratch/comdata/output/reddit_submissions_by_author.parquet",
'subreddit_sort_keys': ["subreddit", "CreatedAt", "id"],
'author_sort_keys': ["author", "CreatedAt", "id"],
'app_name': "Reddit submissions to parquet",
}
# --- Part 1: parse one dump file -> one parquet ----------------------------
def parse_dump(config, partition, dumpdir=None, outdir=None, chunk_size=10000):
    """Read one compressed dump from `dumpdir/partition` and write a parquet
    file to `outdir/<basename>.parquet`. Streams chunks of `chunk_size`
    rows so memory stays bounded."""
    dumpdir = dumpdir or config['dumpdir']
    outdir = outdir or config['outdir']
    schema = config['schema']
    fields = config['fields']
    handlers = config['handlers']
    stream = open_fileset([os.path.join(dumpdir, partition)])
    rows = (parse_record(line, fields, handlers) for line in stream)
    os.makedirs(outdir, exist_ok=True)
    outfile = os.path.join(outdir, os.path.splitext(partition)[0] + ".parquet")
    with pq.ParquetWriter(outfile, schema=schema, compression='snappy', flavor='spark') as writer:
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            pddf = pd.DataFrame(chunk, columns=schema.names)
            table = pa.Table.from_pandas(pddf, schema=schema)
            writer.write_table(table)
def gen_task_list(config, script_name, dumpdir=None, tasklist=None):
    """Write a parallel-friendly task list of `script_name parse_dump <file>`
    lines, one per dump file found under `dumpdir`."""
    dumpdir = dumpdir or config['dumpdir']
    tasklist = tasklist or config['task_list']
    files = list(find_dumps(dumpdir, base_pattern=config['file_pattern']))
    with open(tasklist, 'w') as of:
        for fpath in files:
            partition = os.path.split(fpath)[1]
            of.write(f'python3 {script_name} parse_dump {partition}\n')
# --- Part 2: spark sort + repartition --------------------------------------
def sort_and_write(config, indir=None, out_by_subreddit=None, out_by_author=None):
    """Read a directory of per-source parquets, sort and repartition twice
    (once by subreddit, once by author), and write the two output datasets.

    indir defaults to config['outdir'].

    out_by_subreddit and out_by_author default to config['output_by_subreddit']
    and config['output_by_author']. Override them to write to staging directories
    instead of the live datasets (see add_months.sh).

    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession, functions as f
    indir = indir or config['outdir']
    out_by_subreddit = out_by_subreddit or config['output_by_subreddit']
    out_by_author = out_by_author or config['output_by_author']
    spark = SparkSession.builder.appName(config['app_name']).getOrCreate()
    df = spark.read.parquet(indir, compression='snappy')
    df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
    df = df.drop('subreddit')
    df = df.withColumnRenamed('subreddit_2', 'subreddit')
    df = df.withColumnRenamed("created_utc", "CreatedAt")
    df = df.withColumn("Month", f.month(f.col("CreatedAt")))
    df = df.withColumn("Year", f.year(f.col("CreatedAt")))
    df = df.withColumn("Day", f.dayofmonth(f.col("CreatedAt")))
    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(out_by_subreddit, mode='overwrite', compression='snappy')
    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(out_by_author, mode='overwrite', compression='snappy')
def merge_layers(config):
    """Collapse all accumulated layers in the final datasets into a single
    clean layer. Reads the existing by_subreddit dataset (which contains all
    layers), re-sorts twice, writes to temp paths, then replaces the
    originals by renaming.

    Safe to interrupt before the renames begin: the originals are untouched
    until then, and a re-run overwrites the .merging directories. The swap
    is a sequence of renames, not a single atomic step, so an interruption
    between renames can leave a dataset only under its .old name. The .old
    directories are also left behind if the process is interrupted after
    renaming; delete them manually once satisfied.

    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(config['app_name'] + ' merge layers').getOrCreate()
    # Both final datasets have identical rows; read from by_subreddit.
    df = spark.read.parquet(config['output_by_subreddit'])
    tmp_sub = config['output_by_subreddit'] + '.merging'
    tmp_auth = config['output_by_author'] + '.merging'
    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(tmp_sub, mode='overwrite', compression='snappy')
    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(tmp_auth, mode='overwrite', compression='snappy')
    # Swap by renaming: old → .old, then .merging → final, then delete .old.
    old_sub = config['output_by_subreddit'] + '.old'
    old_auth = config['output_by_author'] + '.old'
    os.rename(config['output_by_subreddit'], old_sub)
    os.rename(tmp_sub, config['output_by_subreddit'])
    os.rename(config['output_by_author'], old_auth)
    os.rename(tmp_auth, config['output_by_author'])
    shutil.rmtree(old_sub)
    shutil.rmtree(old_auth)


@@ -3,6 +3,9 @@ import re
from collections import defaultdict from collections import defaultdict
from os import path from os import path
import glob import glob
import io
import zstandard
def find_dumps(dumpdir, base_pattern): def find_dumps(dumpdir, base_pattern):
@@ -24,27 +27,32 @@ def open_fileset(files):
for fh in files: for fh in files:
print(fh) print(fh)
lines = open_input_file(fh) lines = open_input_file(fh)
yield from lines for line in lines:
yield line
def open_input_file(input_filename): def open_input_file(input_filename):
# .zst handled via the zstandard library to avoid subprocess/container issues
if re.match(r'.*\.zst$', input_filename):
fh = open(input_filename, 'rb')
dctx = zstandard.ZstdDecompressor()
return io.TextIOWrapper(dctx.stream_reader(fh), encoding='utf-8')
if re.match(r'.*\.7z$', input_filename): if re.match(r'.*\.7z$', input_filename):
cmd = ["7za", "x", "-so", input_filename, '*'] cmd = ["7za", "x", "-so", input_filename, '*']
elif re.match(r'.*\.gz$', input_filename):
cmd = ["zcat", input_filename]
elif re.match(r'.*\.bz2$', input_filename): elif re.match(r'.*\.bz2$', input_filename):
cmd = ["bzcat", "-dk", input_filename] cmd = ["bzcat", "-dk", input_filename]
elif re.match(r'.*\.bz', input_filename): elif re.match(r'.*\.bz', input_filename):
cmd = ["bzcat", "-dk", input_filename] cmd = ["bzcat", "-dk", input_filename]
elif re.match(r'.*\.xz', input_filename): elif re.match(r'.*\.xz', input_filename):
cmd = ["xzcat",'-dk', '-T 20',input_filename] cmd = ["xzcat", '-dk', '-T 20', input_filename]
elif re.match(r'.*\.zst',input_filename): elif re.match(r'.*\.gz', input_filename):
cmd = ['/kloneusr/bin/zstd','-dck', input_filename, '--memory=2048MB --stdout'] cmd = ["zcat", input_filename]
elif re.match(r'.*\.gz',input_filename): else:
cmd = ['gzip','-dc', input_filename] return open(input_filename, 'r')
try: try:
input_file = Popen(cmd, stdout=PIPE).stdout return Popen(cmd, stdout=PIPE).stdout
except NameError as e: except NameError as e:
print(e) print(e)
input_file = open(input_filename, 'r') return open(input_filename, 'r')
return input_file

32
datasets/merge_layers.sh Executable file

@@ -0,0 +1,32 @@
#!/usr/bin/env bash
#
# Collapse all accumulated layers in the final parquet datasets into a
# single clean layer. Use this after several incremental adds via
# add_months.sh when you want to reduce the number of partition files.
#
# Reads the existing by_subreddit / by_author datasets, re-sorts everything,
# writes to temp paths, then atomically replaces the originals via rename.
# The old directories are removed once the new ones are in place.
#
# If the process is interrupted after writing the .merging directories but
# before the renames complete, re-run — the .merging directories will be
# overwritten and the originals are still intact. If interrupted after the
# renames, the .old directories are left behind; delete them manually once
# satisfied with the output.
#
# To add new months without merging, use add_months.sh.
# To rebuild everything from raw dumps, use build_from_scratch.sh.
#
# NOTE: This script and its workflow are written but not yet tested.
# Remove this notice after a successful end-to-end run.
#
# Every command below is independently runnable for debugging.
set -e
cd "$(dirname "$0")"
# merge and collapse comments layers
start_spark_and_run.sh 1 comments_merge.py
# merge and collapse submissions layers
start_spark_and_run.sh 1 submissions_merge.py


@@ -1,24 +0,0 @@
#!/bin/bash
## tf reddit comments
#SBATCH --job-name="cdsc_reddit; parse comment dumps"
## Allocation Definition
#SBATCH --account=comdata
#SBATCH --partition=compute-bigmem
## Resources
## Nodes. This should always be 1 for parallel-sql.
#SBATCH --nodes=1
## Walltime (24 hours)
#SBATCH --time=24:00:00
## Memory per node
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH
#SBATCH --chdir /gscratch/comdata/users/nathante/partitioning_reddit/dataverse/cdsc_reddit/datasets
#SBATCH --output=comments_jobs/%A_%a.out
#SBATCH --error=comments_jobs/%A_%a.out
. /opt/ohpc/admin/lmod/lmod/init/profile
source ~/.bashrc
TASK_NUM=$(( SLURM_ARRAY_TASK_ID + $1))
TASK_CALL=$(sed -n ${TASK_NUM}p ./comments_task_list.sh)
${TASK_CALL}


@@ -1,23 +0,0 @@
#!/bin/bash
## tf reddit comments
#SBATCH --job-name="cdsc_reddit; parse submission dumps"
## Allocation Definition
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
## Resources
## Nodes. This should always be 1 for parallel-sql.
#SBATCH --nodes=1
## Walltime (12 hours)
#SBATCH --time=24:00:00
## Memory per node
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH
#SBATCH --chdir /gscratch/comdata/users/nathante/cdsc_reddit/datasets
#SBATCH --output=submissions_jobs/%A_%a.out
#SBATCH --error=submissions_jobs/%A_%a.out
TASK_NUM=$(( SLURM_ARRAY_TASK_ID + $1))
TASK_CALL=$(sed -n ${TASK_NUM}p ./submissions_task_list.sh)
${TASK_CALL}


@@ -1,9 +0,0 @@
#!/usr/bin/env bash
## this should be run manually since we don't have a nice way to wait on parallel_sql jobs
srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 python3 $(pwd)/submissions_2_parquet_part1.py gen_task_list
start_spark_and_run.sh 1 $(pwd)/submissions_2_parquet_part2.py


@@ -1,114 +0,0 @@
#!/usr/bin/env python3
# two stages:
# 1. from gz to arrow parquet (this script)
# 2. from arrow parquet to spark parquet (submissions_2_parquet_part2.py)
from datetime import datetime
from pathlib import Path
from itertools import islice
from helper import find_dumps, open_fileset
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fire
import os
import json
def parse_submission(post, names = None):
if names is None:
names = ['id','author','subreddit','title','created_utc','permalink','url','domain','score','ups','downs','over_18','has_media','selftext','retrieved_on','num_comments','gilded','edited','time_edited','subreddit_type','subreddit_id','subreddit_subscribers','name','is_self','stickied','quarantine','error']
try:
post = json.loads(post)
except (ValueError) as e:
# print(e)
# print(post)
row = [None for _ in names]
row[-1] = "Error parsing json|{0}|{1}".format(e,post)
return tuple(row)
row = []
for name in names:
if name == 'created_utc' or name == 'retrieved_on':
val = post.get(name,None)
if val is not None:
row.append(datetime.fromtimestamp(int(post[name]),tz=None))
else:
row.append(None)
elif name == 'edited':
val = post[name]
if type(val) == bool:
row.append(val)
row.append(None)
else:
row.append(True)
row.append(datetime.fromtimestamp(int(val),tz=None))
elif name == "time_edited":
continue
elif name == 'has_media':
row.append(post.get('media',None) is not None)
elif name not in post:
row.append(None)
else:
row.append(post[name])
return tuple(row)
def parse_dump(partition):
N=10000
stream = open_fileset([f"/gscratch/comdata/raw_data/submissions/{partition}"])
rows = map(parse_submission,stream)
schema = pa.schema([
pa.field('id', pa.string(),nullable=True),
pa.field('author', pa.string(),nullable=True),
pa.field('subreddit', pa.string(),nullable=True),
pa.field('title', pa.string(),nullable=True),
pa.field('created_utc', pa.timestamp('ms'),nullable=True),
pa.field('permalink', pa.string(),nullable=True),
pa.field('url', pa.string(),nullable=True),
pa.field('domain', pa.string(),nullable=True),
pa.field('score', pa.int64(),nullable=True),
pa.field('ups', pa.int64(),nullable=True),
pa.field('downs', pa.int64(),nullable=True),
pa.field('over_18', pa.bool_(),nullable=True),
pa.field('has_media',pa.bool_(),nullable=True),
pa.field('selftext',pa.string(),nullable=True),
pa.field('retrieved_on', pa.timestamp('ms'),nullable=True),
pa.field('num_comments', pa.int64(),nullable=True),
pa.field('gilded',pa.int64(),nullable=True),
pa.field('edited',pa.bool_(),nullable=True),
pa.field('time_edited',pa.timestamp('ms'),nullable=True),
pa.field('subreddit_type',pa.string(),nullable=True),
pa.field('subreddit_id',pa.string(),nullable=True),
pa.field('subreddit_subscribers',pa.int64(),nullable=True),
pa.field('name',pa.string(),nullable=True),
pa.field('is_self',pa.bool_(),nullable=True),
pa.field('stickied',pa.bool_(),nullable=True),
pa.field('quarantine',pa.bool_(),nullable=True),
pa.field('error',pa.string(),nullable=True)])
Path("/gscratch/comdata/output/temp/reddit_submissions.parquet/").mkdir(exist_ok=True,parents=True)
with pq.ParquetWriter(f"/gscratch/comdata/output/temp/reddit_submissions.parquet/{partition}",schema=schema,compression='snappy',flavor='spark') as writer:
while True:
chunk = islice(rows,N)
pddf = pd.DataFrame(chunk, columns=schema.names)
table = pa.Table.from_pandas(pddf,schema=schema)
if table.shape[0] == 0:
break
writer.write_table(table)
writer.close()
def gen_task_list(dumpdir="/gscratch/comdata/raw_data/submissions"):
files = list(find_dumps(dumpdir,base_pattern="RS_20*.*"))
with open("submissions_task_list.sh",'w') as of:
for fpath in files:
partition = os.path.split(fpath)[1]
of.write(f'python3 submissions_2_parquet_part1.py parse_dump {partition}\n')
if __name__ == "__main__":
fire.Fire({'parse_dump':parse_dump,
'gen_task_list':gen_task_list})
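
The `edited` branch of `parse_submission` above treats the raw field as either a boolean or an epoch timestamp and splits it into two columns, `edited` and `time_edited`. A minimal sketch of that split follows. `split_edited` is a hypothetical name, and the sketch uses UTC so the result does not depend on the machine's timezone, whereas the removed code passed `tz=None` (local time).

```python
from datetime import datetime, timezone


def split_edited(val):
    """Return (edited, time_edited) for a raw `edited` value.

    A bool is passed through with no timestamp; anything else is treated
    as epoch seconds, which implies the post was edited.
    """
    if isinstance(val, bool):
        return val, None
    return True, datetime.fromtimestamp(int(val), tz=timezone.utc)
```

Checking for `bool` first matters because `bool` is a subclass of `int` in Python, so `int(False)` would otherwise be read as timestamp 0.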


@@ -1,42 +0,0 @@
#!/usr/bin/env python3
# spark script to make sorted, and partitioned parquet files
import pyspark
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
import os
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
conf = pyspark.SparkConf().setAppName("Reddit submissions to parquet")
conf = conf.set("spark.sql.shuffle.partitions",2000)
conf = conf.set('spark.sql.crossJoin.enabled',"true")
conf = conf.set('spark.debug.maxToStringFields',200)
sqlContext = pyspark.SQLContext(sc)
df = spark.read.parquet("/gscratch/comdata/output/temp/reddit_submissions.parquet/")
df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
df = df.drop('subreddit')
df = df.withColumnRenamed('subreddit_2','subreddit')
df = df.withColumnRenamed("created_utc","CreatedAt")
df = df.withColumn("Month",f.month(f.col("CreatedAt")))
df = df.withColumn("Year",f.year(f.col("CreatedAt")))
df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
df = df.withColumn("subreddit_hash",f.sha2(f.col("subreddit"), 256)[0:3])
# next we gotta resort it all.
df = df.repartition(800,"subreddit","Year","Month")
df2 = df.sort(["subreddit","Year","Month","CreatedAt","id"],ascending=True)
df2 = df.sortWithinPartitions(["subreddit","CreatedAt","id"],ascending=True)
df2.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_subreddit.parquet2", mode='overwrite',compression='snappy')
# # we also want to have parquet files sorted by author then reddit.
df = df.repartition(800,"author","subreddit","Year","Month")
df3 = df.sort(["author","Year","Month","CreatedAt","id"],ascending=True)
df3 = df.sortWithinPartitions(["author","CreatedAt","id"],ascending=True)
df3.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_author.parquet2", mode='overwrite',compression='snappy')


@@ -0,0 +1,14 @@
#!/usr/bin/env python3
"""Collapse all layers in the submissions final datasets into a single clean layer.
Must be launched from a login node via the Hyak-provided wrapper:
start_spark_and_run.sh 1 submissions_merge.py
See merge_layers.sh and dumps_helper.merge_layers for details.
"""
from dumps_helper import SUBMISSIONS, merge_layers
if __name__ == "__main__":
    merge_layers(SUBMISSIONS)

24
datasets/submissions_part1.py Executable file

@@ -0,0 +1,24 @@
#!/usr/bin/env python3
"""Part 1 for submissions: parse one RS_*.zst dump into a parquet file.
CLI:
submissions_part1.py parse_dump RS_2018-08.zst
submissions_part1.py gen_task_list
submissions_part1.py parse_dump RS_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
"""
import fire
from dumps_helper import SUBMISSIONS, parse_dump, gen_task_list
def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(SUBMISSIONS, partition, dumpdir=dumpdir, outdir=outdir)
def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(SUBMISSIONS, 'submissions_part1.py', dumpdir=dumpdir, tasklist=tasklist)
if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})
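
`parse_dump` in `dumps_helper.py` keeps memory bounded by pulling `chunk_size` rows at a time from a generator with `islice` and writing each chunk before fetching the next. The batching pattern on its own, without pyarrow, looks roughly like this; `chunked` is a hypothetical name used only for illustration.

```python
from itertools import islice


def chunked(rows, chunk_size):
    """Yield lists of up to `chunk_size` items from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    for chunk in chunked(range(7), 3):
        print(chunk)
```

Each chunk is materialized with `list(...)` before the emptiness check because an `islice` object is always truthy, so testing it directly would never end the loop.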

21
datasets/submissions_part2.py Executable file

@@ -0,0 +1,21 @@
#!/usr/bin/env python3
"""Part 2 for submissions: Spark sort + repartition into the final datasets.
Must be launched from a login node via the Hyak-provided wrapper:
start_spark_and_run.sh 1 submissions_part2.py
start_spark_and_run.sh 1 submissions_part2.py --indir=/path/to/parquets
--indir defaults to the temp submissions dir in dumps_helper.py.
--out_by_subreddit and --out_by_author default to the live dataset paths;
override them to write to staging directories first (see add_months.sh).
"""
import fire
from dumps_helper import SUBMISSIONS, sort_and_write
if __name__ == "__main__":
    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
              sort_and_write(SUBMISSIONS, indir=indir,
                             out_by_subreddit=out_by_subreddit,
                             out_by_author=out_by_author))


@@ -1,7 +1,10 @@
all: ../../data/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather all: /gscratch/comdata/output/reddit_density/comment_terms_10000.feather /gscratch/comdata/output/reddit_density/comment_authors_10000.feather /gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10000.feather
../../data/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather: overlap_density.py ../../data/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather /gscratch/comdata/output/reddit_density/comment_terms_10000.feather:overlap_density.py /gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather /gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
../start_spark_and_run.sh 1 overlap_density.py authors --inpath="../../data/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather" --outpath="../../data/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather" --agg=pd.DataFrame.sum start_spark_and_run.sh 1 overlap_density.py terms --inpath="/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather" --outpath="/gscratch/comdata/output/reddit_density/comment_terms_10000.feather" --agg=pd.DataFrame.sum
../../data/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather: /gscratch/comdata/output/reddit_density/comment_authors_10000.feather:overlap_density.py /gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
$(MAKE) -C ../similarities start_spark_and_run.sh 1 overlap_density.py authors --inpath="/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather" --outpath="/gscratch/comdata/output/reddit_density/comment_authors_10000.feather" --agg=pd.DataFrame.sum
/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10000.feather: overlap_density.py /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
start_spark_and_run.sh 1 overlap_density.py authors --inpath="/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet" --outpath="/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10000.feather" --agg=pd.DataFrame.sum


@@ -1,6 +1,4 @@
#!/usr/bin/bash #!/usr/bin/bash
source ~/.bashrc
echo $(hostname)
start_spark_cluster.sh start_spark_cluster.sh
spark-submit --verbose --master spark://$(hostname):43015 overlap_density.py authors --inpath=../../data/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather --outpath=../../data/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather --agg=pd.DataFrame.sum spark-submit --master spark://$(hostname):18899 overlap_density.py authors --inpath=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather --outpath=/gscratch/comdata/output/reddit_density/comment_authors_10000.feather --agg=pd.DataFrame.sum
stop-all.sh stop-all.sh


@@ -1,12 +1,11 @@
import pandas as pd import pandas as pd
from pandas.core.groupby import DataFrameGroupBy as GroupBy from pandas.core.groupby import DataFrameGroupBy as GroupBy
from pathlib import Path
import fire import fire
import numpy as np import numpy as np
import sys import sys
# sys.path.append("..") sys.path.append("..")
# sys.path.append("../similarities") sys.path.append("../similarities")
# from similarities.similarities_helper import pull_tfidf from similarities.similarities_helper import reindex_tfidf, reindex_tfidf_time_interval
# this is the mean of the ratio of the overlap to the focal size. # this is the mean of the ratio of the overlap to the focal size.
# mean shared membership per focal community member # mean shared membership per focal community member
@@ -14,12 +13,10 @@ import sys
def overlap_density(inpath, outpath, agg = pd.DataFrame.sum): def overlap_density(inpath, outpath, agg = pd.DataFrame.sum):
df = pd.read_feather(inpath) df = pd.read_feather(inpath)
df = df.drop('_subreddit',1) df = df.drop('subreddit',1)
np.fill_diagonal(df.values,0) np.fill_diagonal(df.values,0)
df = agg(df, 0).reset_index() df = agg(df, 0).reset_index()
df = df.rename({0:'overlap_density'},axis='columns') df = df.rename({0:'overlap_density'},axis='columns')
outpath = Path(outpath)
outpath.parent.mkdir(parents=True, exist_ok = True)
df.to_feather(outpath) df.to_feather(outpath)
return df return df
@@ -28,8 +25,6 @@ def overlap_density_weekly(inpath, outpath, agg = GroupBy.sum):
# exclude the diagonal # exclude the diagonal
df = df.loc[df.subreddit != df.variable] df = df.loc[df.subreddit != df.variable]
res = agg(df.groupby(['subreddit','week'])).reset_index() res = agg(df.groupby(['subreddit','week'])).reset_index()
outpath = Path(outpath)
outpath.parent.mkdir(parents=True, exist_ok = True)
res.to_feather(outpath) res.to_feather(outpath)
return res return res
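
`overlap_density` zeroes the diagonal of a square subreddit-by-subreddit similarity matrix and then aggregates each column, so a community's similarity with itself does not count toward its density. A dependency-free sketch of the same arithmetic with the default sum aggregation follows. The function name is reused for readability, but this is not the repository code, and the 3×3 matrix and subreddit names are made up for illustration.

```python
def overlap_density(names, sim):
    """Sum each column of square matrix `sim`, skipping the diagonal."""
    n = len(names)
    return {
        names[j]: sum(sim[i][j] for i in range(n) if i != j)
        for j in range(n)
    }


# Illustrative values only; a real matrix comes from the similarity step.
names = ['seattle', 'seattlewa', 'python']
sim = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.0],
    [0.25, 0.0, 1.0],
]
density = overlap_density(names, sim)
```

With these values `seattle` gets 0.5 + 0.25, `seattlewa` gets 0.5 + 0.0, and `python` gets 0.25 + 0.0; the 1.0 entries on the diagonal are ignored.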


@@ -1,33 +0,0 @@
#!/usr/bin/env python3
# run from a build_machine
import requests
from os import path
import hashlib
shasums1 = requests.get("https://files.pushshift.io/reddit/comments/sha256sum.txt").text
#shasums2 = requests.get("https://files.pushshift.io/reddit/comments/daily/sha256sum.txt").text
shasums = shasums1
dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/comments"
for l in shasums.strip().split('\n'):
sha256_hash = hashlib.sha256()
parts = l.split(' ')
correct_sha256 = parts[0]
filename = parts[-1]
print(f"checking {filename}")
fpath = path.join(dumpdir,filename)
if path.isfile(fpath):
with open(fpath,'rb') as f:
for byte_block in iter(lambda: f.read(4096),b""):
sha256_hash.update(byte_block)
if sha256_hash.hexdigest() == correct_sha256:
print(f"{filename} checks out")
else:
print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
else:
print(f"Skipping {filename} as it doesn't exist")


@@ -1,31 +0,0 @@
#!/usr/bin/env python3
# run from a build_machine
import requests
from os import path
import hashlib
file1 = requests.get("https://files.pushshift.io/reddit/submissions/sha256sums.txt").text
file2 = requests.get("https://files.pushshift.io/reddit/submissions/old_v1_data/sha256sums.txt").text
dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/submissions"
for l in file1.strip().split('\n') + file2.strip().split('\n'):
sha256_hash = hashlib.sha256()
parts = l.split(' ')
correct_sha256 = parts[0]
filename = parts[-1]
print(f"checking {filename}")
fpath = path.join(dumpdir,filename)
if path.isfile(fpath):
with open(fpath,'rb') as f:
for byte_block in iter(lambda: f.read(4096),b""):
sha256_hash.update(byte_block)
if sha256_hash.hexdigest() == correct_sha256:
print(f"{filename} checks out")
else:
print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
else:
print(f"Skipping {filename} as it doesn't exist")


@@ -1,12 +0,0 @@
#!/bin/bash
user_agent='"nathante teblunthuis <nathante@uw.edu>"'
output_dir='/gscratch/comdata/raw_data/reddit_dumps/comments'
base_url='https://files.pushshift.io/reddit/comments/'
wget -r --no-parent -A 'RC_20*.bz2' -U $user_agent -P $output_dir -nd -nc $base_url
wget -r --no-parent -A 'RC_20*.xz' -U $user_agent -P $output_dir -nd -nc $base_url
wget -r --no-parent -A 'RC_20*.zst' -U $user_agent -P $output_dir -nd -nc $base_url
./check_comments_shas.py


@@ -1,14 +0,0 @@
#!/bin/bash
user_agent='"nathante teblunthuis <nathante@uw.edu>"'
output_dir='/gscratch/comdata/raw_data/reddit_dumps/submissions'
base_url='https://files.pushshift.io/reddit/submissions/'
wget -r --no-parent -A 'RS_20*.bz2' --user-agent=$user_agent -P $output_dir -nd -nc $base_url
wget -r --no-parent -A 'RS_20*.xz' --user-agent=$user_agent -P $output_dir -nd -nc $base_url
wget -r --no-parent -A 'RS_20*.zst' --user-agent=$user_agent -P $output_dir -nd -nc $base_url
wget -r --no-parent -A 'RS_20*.bz2' --user-agent=$user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
wget -r --no-parent -A 'RS_20*.xz' --user-agent=$user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
wget -r --no-parent -A 'RS_20*.zst' --user-agent=$user_agent -P $output_dir -nd -nc $base_url/old_v1_data/
./check_submission_shas.py


@@ -1,34 +0,0 @@
from pathlib import Path
from itertools import chain, groupby
dumpdir = Path("/gscratch/comdata/raw_data/reddit_dumps/comments")
zst_files = dumpdir.glob("*.zst")
bz2_files = dumpdir.glob("*.bz2")
xz_files = dumpdir.glob("*.xz")
all_files = sorted(list(chain(zst_files, bz2_files, xz_files)))
groups = groupby(all_files, key = lambda p: p.stem)
kept_paths = []
removed_paths = []
priority = ['.zst','.xz','.bz2']
for stem, files in groups:
keep_file = None
remove_files = []
for f in files:
if keep_file is None:
keep_file = f
elif priority.index(keep_file.suffix) > priority.index(f.suffix):
remove_files.append(keep_file)
keep_file = f
else:
remove_files.append(f)
kept_paths.append(keep_file)
removed_paths.extend(remove_files)
(dumpdir / "to_remove").mkdir()
for f in removed_paths:
f.rename(f.parent / "to_remove" / f.name)


@@ -1,34 +0,0 @@
from pathlib import Path
from itertools import chain, groupby
dumpdir = Path("/gscratch/comdata/raw_data/reddit_dumps/submissions")
zst_files = dumpdir.glob("*.zst")
bz2_files = dumpdir.glob("*.bz2")
xz_files = dumpdir.glob("*.xz")
all_files = sorted(list(chain(zst_files, bz2_files, xz_files)))
groups = groupby(all_files, key = lambda p: p.stem)
kept_paths = []
removed_paths = []
priority = ['.zst','.xz','.bz2']
for stem, files in groups:
keep_file = None
remove_files = []
for f in files:
if keep_file is None:
keep_file = f
elif priority.index(keep_file.suffix) > priority.index(f.suffix):
remove_files.append(keep_file)
keep_file = f
else:
remove_files.append(f)
kept_paths.append(keep_file)
removed_paths.extend(remove_files)
(dumpdir / "to_remove").mkdir()
for f in removed_paths:
f.rename(f.parent / "to_remove" / f.name)


@@ -0,0 +1,17 @@
import pyarrow.dataset as ds
# A pyarrow dataset abstracts reading, writing, or filtering a parquet file. It does not read data into memory.
#dataset = ds.dataset(pathlib.Path('/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet/'), format='parquet', partitioning='hive')
dataset = ds.dataset('/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/', format='parquet')
# let's get all the comments to two subreddits:
subreddits_to_pull = ['seattle','seattlewa']
# a table is a low-level structured data format. This line pulls data into memory. Setting metadata_n_threads > 1 gives a little speed boost.
table = dataset.to_table(filter = ds.field('subreddit').isin(subreddits_to_pull), columns=['id','subreddit','CreatedAt','author','ups','downs','score','subreddit_id','stickied','title','url','is_self','selftext'])
# Since data from just these 2 subreddits fits in memory we can just turn our table into a pandas dataframe.
df = table.to_pandas()
# We should save this smaller dataset so we don't have to wait 15 min to pull from parquet next time.
df.to_csv("mydataset.csv")


@@ -0,0 +1,38 @@
import pyarrow.dataset as ds
from itertools import groupby
# A pyarrow dataset abstracts reading, writing, or filtering a parquet file. It does not read data into memory.
dataset = ds.dataset('/gscratch/comdata/output/reddit_submissions_by_author.parquet', format='parquet')
# let's get all the submissions to two subreddits:
subreddits_to_pull = ['seattlewa','seattle']
# instead of loading the data into a pandas dataframe all at once we can stream it.
scan_tasks = dataset.scan(filter = ds.field('subreddit').isin(subreddits_to_pull), columns=['id','subreddit','CreatedAt','author','ups','downs','score','subreddit_id','stickied','title','url','is_self','selftext'])
# simple function to execute scantasks and generate rows
def iterate_rows(scan_tasks):
    for st in scan_tasks:
        for rb in st.execute():
            df = rb.to_pandas()
            for t in df.itertuples():
                yield t
row_iter = iterate_rows(scan_tasks)
# now we can use python's groupby function to read one author at a time
# note that the same author can appear more than once since the record batches may not be in the correct order.
author_submissions = groupby(row_iter, lambda row: row.author)
count_dict = {}
for auth, posts in author_submissions:
    if auth in count_dict:
        count_dict[auth] = count_dict[auth] + 1
    else:
        count_dict[auth] = 1
# since it's partitioned and sorted by author, we get one group for each author
any([ v != 1 for k,v in count_dict.items()])

View File

@@ -1,25 +0,0 @@
outputdir=../../data/reddit_ngrams/
inputdir=../../data/reddit_comments_by_subreddit.parquet
authors_tfdir=${outputdir}/comment_authors.parquet
srun=sbatch --wait --verbose run_job.sbatch
all: ${outputdir}/comment_authors_sorted.parquet/_SUCCESS
tf_task_list_1: tf_comments.py
${srun} bash -c "python3 tf_comments.py gen_task_list --mwe_pass='first' --outputdir=${outputdir} --tf_task_list=$@ --inputdir=${inputdir}"
${outputdir}/comment_terms.parquet:tf_task_list_1
mkdir -p sbatch_log
sbatch --wait --verbose --array=1-$(shell cat $< | wc -l) run_array.sbatch 0 $<
${outputdir}/comment_authors.parquet:${outputdir}/comment_terms.parquet
-
${outputdir}/comment_authors_sorted.parquet:${outputdir}/comment_authors.parquet sort_tf_comments.py
../start_spark_and_run.sh 3 sort_tf_comments.py --inparquet=$< --outparquet=$@ --colname=author
${outputdir}/comment_authors_sorted.parquet/_SUCCESS:${outputdir}/comment_authors_sorted.parquet
${inputdir}:
$(MAKE) -C ../datasets

View File

@@ -1,19 +0,0 @@
#!/bin/bash
#SBATCH --job-name=reddit_comment_term_frequencies
#SBATCH --account=comdata
#SBATCH --partition=compute-bigmem
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=9g
#SBATCH --ntasks=1
#SBATCH --export=ALL
#SBATCH --time=48:00:00
#SBATCH --chdir=/gscratch/comdata/users/nathante/partitioning_reddit/dataverse/cdsc_reddit/ngrams
#SBATCH --error="sbatch_log/%A_%a.out"
#SBATCH --output="sbatch_log/%A_%a.out"
TASK_NUM=$(($SLURM_ARRAY_TASK_ID + $1))
TASK_CALL=$(sed -n ${TASK_NUM}p $2)
${TASK_CALL}

View File

@@ -1,18 +0,0 @@
#!/bin/bash
#SBATCH --job-name="simulate measurement error models"
## Allocation Definition
#SBATCH --account=comdata
#SBATCH --partition=compute-bigmem
## Resources
#SBATCH --nodes=1
## Walltime (4 hours)
#SBATCH --time=4:00:00
## Memory per node
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks-per-node=1
#SBATCH --chdir /gscratch/comdata/users/nathante/partitioning_reddit/dataverse/cdsc_reddit/ngrams/
#SBATCH --output=sbatch_log/%A_%a.out
#SBATCH --error=sbatch_log/%A_%a.err
echo "$@"
"$@"

View File

@@ -1,6 +1,8 @@
 #!/usr/bin/env bash
+module load parallel_sql
 source ./bin/activate
 python3 tf_comments.py gen_task_list
+psu --del --Y
+cat tf_task_list | psu --load
 for job in $(seq 1 50); do sbatch checkpoint_parallelsql.sbatch; done;

View File

@@ -2,17 +2,12 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
-import fire
-def main(inparquet, outparquet, colname):
-spark = SparkSession.builder.getOrCreate()
-df = spark.read.parquet(inparquet)
-df = df.repartition(2000,colname)
-df = df.sort([colname,'week','subreddit'])
-df = df.sortWithinPartitions([colname,'week','subreddit'])
-df.write.parquet(outparquet,mode='overwrite',compression='snappy')
-if __name__ == '__main__':
-fire.Fire(main)
+spark = SparkSession.builder.getOrCreate()
+df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_tfidf_test.parquet_temp/")
+df = df.repartition(2000,'term')
+df = df.sort(['term','week','subreddit'])
+df = df.sortWithinPartitions(['term','week','subreddit'])
+df.write.parquet("/gscratch/comdata/users/nathante/reddit_tfidf_test_sorted_tf.parquet_temp",mode='overwrite',compression='snappy')

View File

@@ -3,7 +3,6 @@ import pandas as pd
 import pyarrow as pa
 import pyarrow.dataset as ds
 import pyarrow.parquet as pq
-import pyarrow.compute as pc
 from itertools import groupby, islice, chain
 import fire
 from collections import Counter
@@ -14,33 +13,26 @@ from nltk.corpus import stopwords
 from nltk.util import ngrams
 import string
 from random import random
-from redditcleaner import clean
-from pathlib import Path
-from datetime import datetime
+# remove urls
+# taken from https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
+urlregex = re.compile(r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)")
 # compute term frequencies for comments in each subreddit by week
-def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/', inputdir="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/", mwe_pass = 'first', excluded_users=None):
+def weekly_tf(partition, mwe_pass = 'first'):
+dataset = ds.dataset(f'/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/{partition}', format='parquet')
+if not os.path.exists("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/"):
+os.mkdir("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/")
-dataset = ds.dataset(Path(inputdir)/partition, format='parquet')
-outputdir = Path(outputdir)
-samppath = outputdir / "reddit_comment_ngrams_10p_sample"
-if not samppath.exists():
-samppath.mkdir(parents=True, exist_ok=True)
+if not os.path.exists("/gscratch/comdata/users/nathante/reddit_tfidf_test_authors.parquet_temp/"):
+os.mkdir("/gscratch/comdata/users/nathante/reddit_tfidf_test_authors.parquet_temp/")
 ngram_output = partition.replace("parquet","txt")
-if excluded_users is not None:
-excluded_users = set(map(str.strip,open(excluded_users)))
-df = df.filter(~ (f.col("author").isin(excluded_users)))
-ngram_path = samppath / ngram_output
 if mwe_pass == 'first':
-if ngram_path.exists():
-ngram_path.unlink()
-dataset = dataset.filter(pc.field("CreatedAt") <= pa.scalar(datetime(2020,4,13)))
+if os.path.exists(f"/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}"):
+os.remove(f"/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}")
 batches = dataset.to_batches(columns=['CreatedAt','subreddit','body','author'])
@@ -73,10 +65,8 @@ def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/',
 subreddit_weeks = groupby(rows, lambda r: (r.subreddit, r.week))
-mwe_path = outputdir / "multiword_expressions.feather"
 if mwe_pass != 'first':
-mwe_dataset = pd.read_feather(mwe_path)
+mwe_dataset = pd.read_feather(f'/gscratch/comdata/output/reddit_ngrams/multiword_expressions.feather')
 mwe_dataset = mwe_dataset.sort_values(['phrasePWMI'],ascending=False)
 mwe_phrases = list(mwe_dataset.phrase)
 mwe_phrases = [tuple(s.split(' ')) for s in mwe_phrases]
@@ -105,8 +95,8 @@ def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/',
 # lowercase
 text = text.lower()
-# redditcleaner removes reddit markdown(newlines, quotes, bullet points, links, strikethrough, spoiler, code, superscript, table, headings)
-text = clean(text)
+# remove urls
+text = urlregex.sub("", text)
 # sentence tokenize
 sentences = sent_tokenize(text)
@@ -117,18 +107,19 @@ def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/',
 # remove punctuation
 sentences = map(remove_punct, sentences)
+# remove sentences with less than 2 words
+sentences = filter(lambda sentence: len(sentence) > 2, sentences)
 # datta et al. select relatively common phrases from the reddit corpus, but they don't really explain how. We'll try that in a second phase.
 # they say that the extract 1-4 grams from 10% of the sentences and then find phrases that appear often relative to the original terms
 # here we take a 10 percent sample of sentences
 if mwe_pass == 'first':
-# remove sentences with less than 2 words
-sentences = filter(lambda sentence: len(sentence) > 2, sentences)
 sentences = list(sentences)
 for sentence in sentences:
 if random() <= 0.1:
 grams = list(chain(*map(lambda i : ngrams(sentence,i),range(4))))
-with open(ngram_path,'a') as gram_file:
+with open(f'/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}','a') as gram_file:
 for ng in grams:
 gram_file.write(' '.join(ng) + '\n')
 for token in sentence:
@@ -163,14 +154,7 @@ def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/',
 outchunksize = 10000
-termtf_outputdir = (outputdir / "comment_terms.parquet")
-termtf_outputdir.mkdir(parents=True, exist_ok=True)
-authortf_outputdir = (outputdir / "comment_authors.parquet")
-authortf_outputdir.mkdir(parents=True, exist_ok=True)
-termtf_path = termtf_outputdir / partition
-authortf_path = authortf_outputdir / partition
-with pq.ParquetWriter(termtf_path, schema=schema, compression='snappy', flavor='spark') as writer, \
-pq.ParquetWriter(authortf_path, schema=author_schema, compression='snappy', flavor='spark') as author_writer:
+with pq.ParquetWriter(f"/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet/{partition}",schema=schema,compression='snappy',flavor='spark') as writer, pq.ParquetWriter(f"/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet/{partition}",schema=author_schema,compression='snappy',flavor='spark') as author_writer:
 while True:
@@ -199,12 +183,12 @@ def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/',
 author_writer.close()
-def gen_task_list(mwe_pass='first', inputdir="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/", outputdir='/gscratch/comdata/output/reddit_ngrams/', tf_task_list='tf_task_list', excluded_users_file=None):
-files = os.listdir(inputdir)
-with open(tf_task_list,'w') as outfile:
+def gen_task_list(mwe_pass='first'):
+files = os.listdir("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/")
+with open("tf_task_list",'w') as outfile:
 for f in files:
 if f.endswith(".parquet"):
-outfile.write(f"./tf_comments.py weekly_tf --mwe-pass {mwe_pass} --inputdir {inputdir} --outputdir {outputdir} --excluded_users {excluded_users_file} {f}\n")
+outfile.write(f"./tf_comments.py weekly_tf --mwe-pass {mwe_pass} {f}\n")
 if __name__ == "__main__":
 fire.Fire({"gen_task_list":gen_task_list,

View File

@@ -0,0 +1,58 @@
from pyspark.sql import functions as f
from pyspark.sql import Window
from pyspark.sql import SparkSession
import numpy as np
spark = SparkSession.builder.getOrCreate()
df = spark.read.text("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/")
df = df.withColumnRenamed("value","phrase")
# count phrase occurrences
phrases = df.groupby('phrase').count()
phrases = phrases.withColumnRenamed('count','phraseCount')
phrases = phrases.filter(phrases.phraseCount > 10)
# count overall
N = phrases.select(f.sum(phrases.phraseCount).alias("phraseCount")).collect()[0].phraseCount
print(f'analyzing PMI on a sample of {N} phrases')
logN = np.log(N)
phrases = phrases.withColumn("phraseLogProb", f.log(f.col("phraseCount")) - logN)
# count term occurrences
phrases = phrases.withColumn('terms',f.split(f.col('phrase'),' '))
terms = phrases.select(['phrase','phraseCount','phraseLogProb',f.explode(phrases.terms).alias('term')])
win = Window.partitionBy('term')
terms = terms.withColumn('termCount',f.sum('phraseCount').over(win))
terms = terms.withColumnRenamed('count','termCount')
terms = terms.withColumn('termLogProb',f.log(f.col('termCount')) - logN)
terms = terms.groupBy(terms.phrase, terms.phraseLogProb, terms.phraseCount).sum('termLogProb')
terms = terms.withColumnRenamed('sum(termLogProb)','termsLogProb')
terms = terms.withColumn("phrasePWMI", f.col('phraseLogProb') - f.col('termsLogProb'))
# join phrases to term counts
df = terms.select(['phrase','phraseCount','phraseLogProb','phrasePWMI'])
df = df.sort(['phrasePWMI'],descending=True)
df = df.sortWithinPartitions(['phrasePWMI'],descending=True)
df.write.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet/",mode='overwrite',compression='snappy')
df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet/")
df.write.csv("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.csv/",mode='overwrite',compression='none')
df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet")
df = df.select('phrase','phraseCount','phraseLogProb','phrasePWMI')
# choosing phrases occurring at least 3500 times in the 10% sample (35000 times) and then with a PWMI of at least 3 yields about 65000 expressions.
#
df = df.filter(f.col('phraseCount') > 3500).filter(f.col("phrasePWMI")>3)
df = df.toPandas()
df.to_feather("/gscratch/comdata/users/nathante/reddit_multiword_expressions.feather")
df.to_csv("/gscratch/comdata/users/nathante/reddit_multiword_expressions.csv")

View File

@@ -1,21 +0,0 @@
from pyspark.sql import SparkSession
from similarities_helper import build_tfidf_dataset
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
# remove [deleted] and AutoModerator (TODO remove other bots)
df = df.filter(df.author != '[deleted]')
df = df.filter(df.author != 'AutoModerator')
df = build_tfidf_dataset(df, include_subs, 'author')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
spark.stop()

View File

@@ -1,27 +0,0 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
from similarities_helper import build_weekly_tfidf_dataset
import pandas as pd
## TODO:need to exclude automoderator / bot posts.
## TODO:need to exclude better handle hyperlinks.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
# remove [deleted] and AutoModerator (TODO remove other bots)
# df = df.filter(df.author != '[deleted]')
# df = df.filter(df.author != 'AutoModerator')
df = build_weekly_tfidf_dataset(df, include_subs, 'term')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
spark.stop()

View File

@@ -1,106 +0,0 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
import numpy as np
import pyarrow
import pandas as pd
import fire
from itertools import islice
from pathlib import Path
from similarities_helper import *
#tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/subreddit_terms.parquet')
def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
print(outfile)
tfidf = spark.read.parquet(tfidf_path)
if included_subreddits is None:
included_subreddits = select_topN_subreddits(topN)
else:
included_subreddits = set(open(included_subreddits))
print("creating temporary parquet with matrix indicies")
tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, included_subreddits)
tfidf = spark.read.parquet(tempdir.name)
# the ids can change each week.
subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
spark.stop()
weeks = list(subreddit_names.week.drop_duplicates())
for week in weeks:
print("loading matrix")
mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
print('computing similarities')
sims = column_similarities(mat)
del mat
names = subreddit_names.loc[subreddit_names.week==week]
sims = sims.rename({i:sr for i, sr in enumerate(names.subreddit.values)},axis=1)
sims['subreddit'] = names.subreddit.values
write_weekly_similarities(outfile, sims, week)
def cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500):
'''
Compute similarities between subreddits based on tfi-idf vectors of author comments
included_subreddits : string
Text file containing a list of subreddits to include (one per line) if included_subreddits is None then do the top 500 subreddits
min_df : int (default = 0.1 * (number of included_subreddits)
exclude terms that appear in fewer than this number of documents.
outfile: string
where to output csv and feather outputs
'''
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
print(outfile)
tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet')
if included_subreddits is None:
included_subreddits = select_topN_subreddits(topN)
else:
included_subreddits = set(open(included_subreddits))
print("creating temporary parquet with matrix indicies")
tempdir = prep_tfidf_entries(tfidf, 'author', min_df, included_subreddits)
tfidf = spark.read.parquet(tempdir.name)
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
spark.stop()
print("loading matrix")
mat = read_tfidf_matrix(tempdir.name,'author')
print('computing similarities')
sims = column_similarities(mat)
del mat
sims = pd.DataFrame(sims.todense())
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
sims['subreddit'] = subreddit_names.subreddit.values
p = Path(outfile)
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
sims.to_feather(outfile)
tempdir.cleanup()
if __name__ == '__main__':
fire.Fire(author_cosine_similarities)

View File

@@ -1,61 +0,0 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix
import numpy as np
import pyarrow
import pandas as pd
import fire
from itertools import islice
from pathlib import Path
from similarities_helper import prep_tfidf_entries, read_tfidf_matrix, column_similarities, select_topN
import scipy
# outfile='test_similarities_500.feather';
# min_df = None;
# included_subreddits=None; topN=100; exclude_phrases=True;
def term_cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500, exclude_phrases=False):
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
print(outfile)
print(exclude_phrases)
tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_terms.parquet')
if included_subreddits is None:
included_subreddits = select_topN_subreddits(topN)
else:
included_subreddits = set(open(included_subreddits))
if exclude_phrases == True:
tfidf = tfidf.filter(~f.col(term).contains("_"))
print("creating temporary parquet with matrix indicies")
tempdir = prep_tfidf_entries(tfidf, 'term', min_df, included_subreddits)
tfidf = spark.read.parquet(tempdir.name)
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
spark.stop()
print("loading matrix")
mat = read_tfidf_matrix(tempdir.name,'term')
print('computing similarities')
sims = column_similarities(mat)
del mat
sims = pd.DataFrame(sims.todense())
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
sims['subreddit'] = subreddit_names.subreddit.values
p = Path(outfile)
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
sims.to_feather(outfile)
tempdir.cleanup()
if __name__ == '__main__':
fire.Fire(term_cosine_similarities)

View File

@@ -1,21 +0,0 @@
from pyspark.sql import SparkSession
from similarities_helper import build_tfidf_dataset
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
# remove [deleted] and AutoModerator (TODO remove other bots)
df = df.filter(df.author != '[deleted]')
df = df.filter(df.author != 'AutoModerator')
df = build_tfidf_dataset(df, include_subs, 'author')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
spark.stop()

View File

@@ -1,21 +0,0 @@
from pyspark.sql import SparkSession
from similarities_helper import build_weekly_tfidf_dataset
import pandas as pd
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
# remove [deleted] and AutoModerator (TODO remove other bots)
df = df.filter(df.author != '[deleted]')
df = df.filter(df.author != 'AutoModerator')
df = build_weekly_tfidf_dataset(df, include_subs, 'author')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet', mode='overwrite', compression='snappy')
spark.stop()

View File

@@ -1,18 +0,0 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
from similarities_helper import build_tfidf_dataset
## TODO:need to exclude automoderator / bot posts.
## TODO:need to exclude better handle hyperlinks.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
df = build_tfidf_dataset(df, include_subs, 'term')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/reddit_similarity/subreddit_terms.parquet',mode='overwrite',compression='snappy')
spark.stop()

View File

@@ -1,27 +0,0 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
from similarities_helper import build_weekly_tfidf_dataset
import pandas as pd
## TODO:need to exclude automoderator / bot posts.
## TODO:need to exclude better handle hyperlinks.
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
# remove [deleted] and AutoModerator (TODO remove other bots)
# df = df.filter(df.author != '[deleted]')
# df = df.filter(df.author != 'AutoModerator')
df = build_weekly_tfidf_dataset(df, include_subs, 'term')
df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
spark.stop()

View File

@@ -1,22 +0,0 @@
#!/bin/bash
## tf reddit comments
#SBATCH --job-name="wikia ecology; fit var models"
## Allocation Definition
#SBATCH --account=comdata-ckpt
#SBATCH --partition=ckpt
## Resources
## Nodes. This should always be 1 for parallel-sql.
#SBATCH --nodes=1
## Walltime (12 hours)
#SBATCH --time=24:00:00
## Memory per node
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH
#SBATCH --chdir /gscratch/comdata/users/nathante/wikia_ecology
#SBATCH --output=var_jobs/%A_%a.out
#SBATCH --error=var_jobs/%A_%a.out
TASK_NUM=$(( SLURM_ARRAY_TASK_ID + $1))
TASK_CALL=$(sed -n ${TASK_NUM}p ./var_jobs.sh)
${TASK_CALL}

View File

@@ -1,28 +1,25 @@
-srun=srun -p compute-bigmem -A comdata --mem-per-cpu=9g --time=200:00:00 -c 40
-srun_huge=srun -p compute-hugemem -A comdata --mem=724g --time=200:00:00 -c 40
-similarity_data=../../data/reddit_similarity
-tfidf_data=${similarity_data}/tfidf
-lsi_components=[10,50,100,200,300,400,500,600,700,850]
-lsi_similarities: ${similarity_data}/subreddit_comment_authors-tf_10k_LSI
-all: ${similarity_data}/subreddit_comment_authors-tf_10k.feather
-${similarity_data}/subreddit_comment_authors-tf_10k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
-${srun_huge} /bin/bash -c "source ~/.bashrc; python3 lsi_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=10 --inpath=$<"
-${similarity_data}/subreddits_by_num_comments_nonsfw.csv: ../../data/reddit_submissions_by_subreddit.parquet ../../data/reddit_comments_by_subreddit.parquet
-../start_spark_and_run.sh 3 top_subreddits_by_comments.py
-${tfidf_data}/comment_authors_100k.parquet: ../../data/reddit_ngrams/comment_authors_sorted.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
-../start_spark_and_run.sh 3 tfidf.py authors --topN=100000 --inpath=$< --outpath=${tfidf_data}/comment_authors_100k.parquet
-../../data/reddit_ngrams/comment_authors_sorted.parquet:
-$(MAKE) -C ../ngrams
-../../data/reddit_submissions_by_subreddit.parquet:
-$(MAKE) -C ../datasets
-../../data/reddit_comments_by_subreddit.parquet:
-$(MAKE) -C ../datasets
+all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms.parquet
+# all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet
+# /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
+# start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.feather
+/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
+start_spark_and_run.sh 1 tfidf.py terms --topN=10000
+/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
+start_spark_and_run.sh 1 tfidf.py authors --topN=10000
+/gscratch/comdata/output/reddit_similarity/comment_authors_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
+start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
+/gscratch/comdata/output/reddit_similarity/comment_terms.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet
+start_spark_and_run.sh 1 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
+# /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet
+# start_spark_and_run.sh 1 weekly_cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10000_weely.parquet
+/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
+start_spark_and_run.sh 1 cosine_similarities.py author-tf --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet

175
similarities/README.md Normal file
View File

@@ -0,0 +1,175 @@
# Subreddit similarity
This directory holds the code that computes pairwise similarities between
subreddits — both term-based (from TF-IDF over comment text) and
author-based (from overlapping commenter sets). Similarity matrices
produced here feed downstream clustering (`../clustering/`) and density
analysis (`../density/`).
## Datasets
Subreddit similarity datasets based on comment terms and comment authors
are available on Hyak in `/gscratch/comdata/output/reddit_similarity`.
The overall approach to subreddit similarity seems to work reasonably
well and the code is stabilizing. If you want help using these
similarities in a project, just reach out to
[Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29).
By default, the scripts here take a `topN` parameter, which selects the
subreddits to include in the similarity dataset according to how many
total comments they have. Alternatively, you can pass the
`included_subreddits` parameter the path to a file that lists the
subreddits you would like to include, one name per line.
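
As a rough illustration of the `included_subreddits` file format, the sketch below reads such a file into a set of names. The function name `read_included_subreddits` and the example file contents are made up for this example. The lowercasing and whitespace stripping mirror what `similarities_helper.py` does in the version shown in this diff, while skipping blank lines is an extra assumption added here.

```python
import os
import tempfile


def read_included_subreddits(path):
    """Return the set of subreddit names listed one per line in `path`."""
    with open(path) as fh:
        # Lowercase and strip each line; skip blank lines (an assumption of this sketch).
        return {line.strip().lower() for line in fh if line.strip()}


# Write a small example list to a temporary file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("AskReddit\npython \n\n")
    example_path = tmp.name

included = read_included_subreddits(example_path)
os.remove(example_path)
print(sorted(included))
```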
## Scripts
| Script | What it does |
|---|---|
| `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. |
| `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. |
| `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. |
| `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. |
| `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. |
| `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
| `Makefile` | Wires everything together with the canonical Hyak output paths. |
## Methods
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and
simple information-retrieval technique that we can use to quantify the
topic of a subreddit. The goal of TF-IDF is to build a vector for each
subreddit that scores every term (or phrase) according to how
characteristic it is of the overall lexicon used in that subreddit. For
example, the most characteristic terms in the subreddit `/r/christianity`
in the current version of the TF-IDF model are:
| Term | tf_idf |
|:------------:|:------:|
| christians | 0.581 |
| christianity | 0.569 |
| kjv | 0.568 |
| bible | 0.557 |
| scripture | 0.55 |
TF-IDF stands for "term frequency-inverse document frequency" because
it is the product of two quantities: term frequency and inverse document
frequency. Term frequency quantifies how often a term appears in
a subreddit (document). Inverse document frequency quantifies how rare
that term is across all subreddits (documents), so that terms appearing
almost everywhere are down-weighted. As you can see on
the Wikipedia page, there are many possible ways of constructing and
combining these terms.
I chose to normalize term frequency by the maximum (raw) term frequency
for each subreddit:
$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$
I use the log inverse document frequency:
$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
I then combine them using some smoothing to get:
$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
(Other normalization strategies are worth trying — see the note in
`TODO`.)
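
To make the three formulas above concrete, here is a minimal pure-Python sketch that applies them to a toy corpus. It follows the equations as written in this README, which is not necessarily line-for-line what `tfidf.py` does on Spark. The function name `tfidf_vectors` and the toy data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps a subreddit name to a non-empty list of terms (toy input)."""
    counts = {name: Counter(terms) for name, terms in docs.items()}
    n_docs = len(counts)

    # Document frequency: number of subreddits in which each term appears.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    vectors = {}
    for name, term_counts in counts.items():
        max_f = max(term_counts.values())  # maximum raw term frequency in this subreddit
        vectors[name] = {
            # (0.5 + 0.5 * tf) * idf, with tf = f / max_f and idf = log(N / df)
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / df[term])
            for term, f in term_counts.items()
        }
    return vectors


toy = {
    "christianity": ["bible", "bible", "scripture", "the"],
    "python": ["code", "code", "the"],
}
vecs = tfidf_vectors(toy)
print(vecs["christianity"])
```

In this toy corpus "the" appears in every subreddit, so its IDF is `log(2/2) = 0` and it drops out of both vectors, while "bible" gets the highest score in the first subreddit.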
### Building TF-IDF vectors
The process for building TF-IDF vectors has four steps:
1. Extracting terms using `../ngrams/tf_comments.py`
2. Detecting common phrases using `../ngrams/top_comment_phrases.py`
3. Extracting terms and common phrases using
`../ngrams/tf_comments.py --mwe-pass='second'`
4. Building IDF and TF-IDF scores in `tfidf.py`
#### Running `tf_comments.py` on the backfill queue
The main reason that I did it in four steps instead of one is to take
advantage of the backfill queue for running `tf_comments.py`. This step
requires reading all of the text in every comment and converting it to
a bag of words at the subreddit level. This is a lot of computation
that is easily parallelizable. The script `../ngrams/run_tf_jobs.sh`
partially automates running step 1 (or step 3) on the backfill queue.
#### Phrase detection using pointwise mutual information
TF-IDF is simple, but only uses single words (unigrams). Sequences of
multiple words can be important to account for how words have different
meanings in different contexts or how sequences of words refer to
distinct things like names. Dealing with context or longer sequences of
words is a common challenge in natural language processing since the
number of possible n-grams grows exponentially as n gets bigger. Phrase
detection helps with this problem by limiting the set of n-grams to those
that are most informative.
But how do we detect phrases? I implemented [pointwise mutual
information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
(PMI), which is a simple approach that seems to work well.
PMI is a quantity derived from information theory. The intuition is
that if two words occur together much more often than their separate
frequencies would predict, then the co-occurrence is likely to be
informative.
$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
In `../ngrams/tf_comments.py` if `--mwe-pass=first` then a 10% sample
of 1-4-grams (sequences of terms up to length 4) will be written to a
file to be consumed by `../ngrams/top_comment_phrases.py`.
`top_comment_phrases.py` computes the PMI for these possible phrases
and writes those that occur at least 3500 times in the sample of
n-grams and have a PMI of at least 3 (about 65000 expressions).
`tf_comments.py --mwe-pass=second` then uses the detected phrases and
adds them to the term frequency data.
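
The PMI filter described above can be sketched for two-word phrases as follows. This is only an illustration under simplifying assumptions: it handles bigrams rather than 1-4-grams, uses the natural logarithm, and estimates probabilities from raw counts in a single token list. The function name `detect_phrases` and the toy thresholds are hypothetical, and `top_comment_phrases.py` may differ in all of these details.

```python
import math
from collections import Counter


def detect_phrases(tokens, min_count, min_pmi):
    """Return {bigram: pmi} for bigrams passing both the count and PMI thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    phrases = {}
    for (x, y), n_xy in bigrams.items():
        if n_xy < min_count:
            continue  # too rare for the PMI estimate to be trusted
        p_xy = n_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        if pmi >= min_pmi:
            phrases[(x, y)] = pmi
    return phrases


tokens = "new york is big and new york is old".split()
phrases = detect_phrases(tokens, min_count=2, min_pmi=1.0)
print(phrases)
```

The count threshold matters because PMI on its own rewards rare pairs: a bigram seen once between two words that each occur once gets a very high score. Requiring a minimum count, as the README's 3500-occurrence cutoff does, removes those unreliable candidates.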
## Cosine similarity
Once the TF-IDF vectors are built, making a similarity score between
two subreddits is straightforward using cosine similarity.
$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
Intuitively, we represent two subreddits as vectors in a high-dimensional
space (their TF-IDF vectors). In linear algebra, the dot product ($\cdot$)
between two vectors is the sum of the products of their corresponding
entries (e.g. a linear regression prediction is the dot product of a
vector of covariates and a vector of weights). The
vectors might have different magnitudes, for example if one subreddit has
more words in its comments than the other, so in cosine similarity the
dot product is normalized by the magnitudes (lengths) of the vectors. It
turns out that this is equivalent to taking the cosine of the angle
between the two vectors. So cosine similarity in essence quantifies the
angle between the two vectors in high-dimensional space. The greater the
cosine similarity between two subreddits, the more closely their TF-IDF
vectors are aligned.
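
The formula above can be checked on small vectors with a few lines of plain Python. This is a sketch for intuition only. The helper name `cosine_similarity` and the toy vectors are hypothetical, and returning 0 for an all-zero vector is a convention chosen here. The real pipeline works on sparse matrices instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen in this sketch for all-zero vectors
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the magnitude
c = [0.0, 0.0, 3.0]  # shares no non-zero terms with a

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

Vectors `a` and `b` point in the same direction, so their similarity is 1 (up to floating-point rounding) even though `b` is twice as long. Vectors `a` and `c` share no terms, so their similarity is 0.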
Cosine similarity with TF-IDF is popular (indeed it has been applied to
Reddit in research several times before) because it quantifies the
correlation between the most characteristic terms for two communities.
Compared to other approaches to similarity like those using word
embeddings or topic models it may struggle to handle polysemy, synonymy,
or correlations between different terms. Using phrase detection helps
with this a little bit. The advantages of this approach are simplicity
and scalability. I'm thinking about using [latent semantic
analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
intermediate step to improve upon similarities based on raw TF-IDFs.
Even so, computing similarities between a large number of subreddits
is computationally expensive: $n$ subreddits require $n(n-1)/2$
dot-product evaluations. This can be sped up by passing
`similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I
used a cosine similarity function that's built into the Spark matrix
library, which supports the `DIMSUM` algorithm for approximating
matrix-matrix products. This algorithm is commonly used in industry
(e.g. at Twitter and Google) for large-scale similarity scoring.
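
To get a feel for the $n(n-1)/2$ cost and for what a similarity threshold discards, consider the small sketch below. The names `n_pairs` and `keep_above` and the toy similarities are hypothetical. Note that the exact filter shown here only illustrates the effect of a threshold on the output. DIMSUM itself is a sampling-based approximation and does not compute every pair first.

```python
def n_pairs(n):
    """Number of unordered subreddit pairs among n subreddits."""
    return n * (n - 1) // 2


def keep_above(similarities, threshold):
    """Keep only the pairs whose similarity exceeds the threshold."""
    return {pair: s for pair, s in similarities.items() if s > threshold}


print(n_pairs(10_000))  # pairs needed for a 10,000-subreddit similarity matrix

toy = {("a", "b"): 0.82, ("a", "c"): 0.03, ("b", "c"): 0.41}
strong = keep_above(toy, 0.1)
print(strong)
```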
## See also
The CDSC wiki page
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
is the landing page for this project on the wiki. The methods writeup
above used to live there; it now lives here so that doc and code stay
in sync.

1
similarities/TODO Normal file

@@ -0,0 +1 @@
Try normalizing tf by the mean or std instead of the max to avoid penalizing subreddits with very active users.


@@ -2,14 +2,11 @@ import pandas as pd
import fire import fire
from pathlib import Path from pathlib import Path
from similarities_helper import similarities, column_similarities from similarities_helper import similarities, column_similarities
from functools import partial
def cosine_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'): def cosine_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
return similarities(infile=infile, simfunc=column_similarities, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases,from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname) return similarities(infile=infile, simfunc=column_similarities, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases,from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
# change so that these take in an input as an optional argument (for speed, but also for idf).
def term_cosine_similarities(outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
def term_cosine_similarities(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet', min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None): def term_cosine_similarities(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet', min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):


@@ -1,6 +1,4 @@
#!/usr/bin/bash #!/usr/bin/bash
source ~/.bashrc
echo $(hostname)
start_spark_cluster.sh start_spark_cluster.sh
spark-submit --verbose --master spark://$(hostname):43015 tfidf.py authors --topN=100000 --inpath=../../data/reddit_ngrams/comment_authors_sorted.parquet --outpath=../../data/reddit_similarity/tfidf/comment_authors_100k.parquet spark-submit --master spark://$(hostname):18899 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
stop-all.sh stop-all.sh


@@ -1,86 +0,0 @@
import pandas as pd
import fire
from pathlib import Path
from similarities_helper import *
#from similarities_helper import similarities, lsi_column_similarities
from functools import partial
# inpath = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf/comment_authors_compex.parquet"
# term_colname='authors'
# outfile='/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_test_compex_LSI'
# n_components=[10,50,100]
# included_subreddits="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/included_subreddits.txt"
# n_iter=5
# random_state=1968
# algorithm='randomized'
# topN = None
# from_date=None
# to_date=None
# min_df=None
# max_df=None
def lsi_similarities(inpath, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None, tfidf_colname='tf_idf',n_components=100,n_iter=5,random_state=1968,algorithm='arpack',lsi_model=None):
print(n_components,flush=True)
if lsi_model is None:
if type(n_components) == list:
lsi_model = Path(outfile) / f'{max(n_components)}_{term_colname}_LSIMOD.pkl'
else:
lsi_model = Path(outfile) / f'{n_components}_{term_colname}_LSIMOD.pkl'
simfunc = partial(lsi_column_similarities,n_components=n_components,n_iter=n_iter,random_state=random_state,algorithm=algorithm,lsi_model_save=lsi_model)
return similarities(inpath=inpath, simfunc=simfunc, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
# change so that these take in an input as an optional argument (for speed, but also for idf).
def term_lsi_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet',outfile=None, min_df=None, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None, algorithm='arpack', n_components=300,n_iter=5,random_state=1968):
res = lsi_similarities(inpath,
'term',
outfile,
min_df,
max_df,
included_subreddits,
topN,
from_date,
to_date,
n_components=n_components,
algorithm = algorithm
)
return res
def author_lsi_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',outfile=None, min_df=2, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None,algorithm='arpack',n_components=300,n_iter=5,random_state=1968):
return lsi_similarities(inpath,
'author',
outfile,
min_df,
max_df,
included_subreddits,
topN,
from_date=from_date,
to_date=to_date,
n_components=n_components
)
def author_tf_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',outfile=None, min_df=2, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None,algorithm='arpack',n_components=300,n_iter=5,random_state=1968):
return lsi_similarities(inpath,
'author',
outfile,
min_df,
max_df,
included_subreddits,
topN,
from_date=from_date,
to_date=to_date,
tfidf_colname='relative_tf',
n_components=n_components,
algorithm=algorithm
)
if __name__ == "__main__":
fire.Fire({'term':term_lsi_similarities,
'author':author_lsi_similarities,
'author-tf':author_tf_similarities})


@@ -2,190 +2,143 @@ from pyspark.sql import SparkSession
from pyspark.sql import Window from pyspark.sql import Window
from pyspark.sql import functions as f from pyspark.sql import functions as f
from enum import Enum from enum import Enum
from multiprocessing import cpu_count, Pool
from pyspark.mllib.linalg.distributed import CoordinateMatrix from pyspark.mllib.linalg.distributed import CoordinateMatrix
from tempfile import TemporaryDirectory from tempfile import TemporaryDirectory
import pyarrow import pyarrow
import pyarrow.dataset as ds import pyarrow.dataset as ds
from sklearn.metrics import pairwise_distances
from scipy.sparse import csr_matrix, issparse from scipy.sparse import csr_matrix, issparse
from sklearn.decomposition import TruncatedSVD
import pandas as pd import pandas as pd
import numpy as np import numpy as np
import pathlib import pathlib
from datetime import datetime from datetime import datetime
from pathlib import Path from pathlib import Path
import pickle
class tf_weight(Enum): class tf_weight(Enum):
MaxTF = 1 MaxTF = 1
Norm05 = 2 Norm05 = 2
# infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet" infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet"
# cache_file = "/gscratch/comdata/users/nathante/cdsc_reddit/similarities/term_tfidf_entries_bak.parquet"
# subreddits missing after this step don't have any terms that have a high enough idf def reindex_tfidf_time_interval(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
# try rewriting without merges term = term_colname
term_id = term + '_id'
term_id_new = term + '_id_new'
# does reindex_tfidf, but without reindexing. spark = SparkSession.builder.getOrCreate()
def reindex_tfidf(*args, **kwargs): conf = spark.sparkContext.getConf()
df, tfidf_ds, ds_filter = _pull_or_reindex_tfidf(*args, **kwargs, reindex=True) print(exclude_phrases)
tfidf_weekly = spark.read.parquet(infile)
print("assigning names") # create the time interval
subreddit_names = tfidf_ds.to_table(filter=ds_filter,columns=['subreddit','subreddit_id']) if from_date is not None:
batches = subreddit_names.to_batches() if type(from_date) is str:
from_date = datetime.fromisoformat(from_date)
with Pool(cpu_count()) as pool: tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week >= from_date)
chunks = pool.imap_unordered(pull_names,batches)
subreddit_names = pd.concat(chunks,copy=False).drop_duplicates()
subreddit_names = subreddit_names.set_index("subreddit_id")
new_ids = df.loc[:,['subreddit_id','subreddit_id_new']].drop_duplicates() if to_date is not None:
new_ids = new_ids.set_index('subreddit_id') if type(to_date) is str:
subreddit_names = subreddit_names.join(new_ids,on='subreddit_id').reset_index() to_date = datetime.fromisoformat(to_date)
subreddit_names = subreddit_names.drop("subreddit_id",axis=1) tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week < to_date)
tfidf = tfidf_weekly.groupBy(["subreddit","week", term_id, term]).agg(f.sum("tf").alias("tf"))
tfidf = _calc_tfidf(tfidf, term_colname, tf_weight.Norm05)
tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
tfidf = spark.read_parquet(tempdir.name)
subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new") subreddit_names = subreddit_names.sort_values("subreddit_id_new")
return(df, subreddit_names) subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
return(tempdir, subreddit_names)
def pull_tfidf(*args, **kwargs): def reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False):
df, _, _ = _pull_or_reindex_tfidf(*args, **kwargs, reindex=False) spark = SparkSession.builder.getOrCreate()
return df conf = spark.sparkContext.getConf()
print(exclude_phrases)
def _pull_or_reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=None, week=None, from_date=None, to_date=None, rescale_idf=True, tf_family=tf_weight.MaxTF, reindex=True): tfidf = spark.read.parquet(infile)
print(f"loading tfidf {infile}, week {week}, min_df {min_df}, max_df {max_df}", flush=True)
if week is not None:
tfidf_ds = ds.dataset(infile, partitioning='hive')
else:
tfidf_ds = ds.dataset(infile)
if included_subreddits is None: if included_subreddits is None:
included_subreddits = select_topN_subreddits(topN) included_subreddits = select_topN_subreddits(topN)
else: else:
included_subreddits = set(map(str.strip,open(included_subreddits))) included_subreddits = set(map(str.strip,map(str.lower,open(included_subreddits))))
ds_filter = ds.field("subreddit").isin(included_subreddits) if exclude_phrases == True:
tfidf = tfidf.filter(~f.col(term_colname).contains("_"))
if min_df is not None: print("creating temporary parquet with matrix indicies")
ds_filter &= ds.field("count") >= min_df tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
if max_df is not None: tfidf = spark.read.parquet(tempdir.name)
ds_filter &= ds.field("count") <= max_df subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
if week is not None: subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
ds_filter &= ds.field("week") == week spark.stop()
return (tempdir, subreddit_names)
if from_date is not None:
ds_filter &= ds.field("week") >= from_date
if to_date is not None:
ds_filter &= ds.field("week") <= to_date
term = term_colname
term_id = term + '_id'
term_id_new = term + '_id_new'
projection = {
'subreddit_id':ds.field('subreddit_id'),
term_id:ds.field(term_id),
'relative_tf':ds.field("relative_tf").cast('float32')
}
if not rescale_idf:
projection = {
'subreddit_id':ds.field('subreddit_id'),
term_id:ds.field(term_id),
'relative_tf':ds.field('relative_tf').cast('float32'),
'tf_idf':ds.field('tf_idf').cast('float32')}
print(projection, flush=True)
print(ds_filter, flush=True)
df = tfidf_ds.to_table(filter=ds_filter,columns=projection)
df = df.to_pandas(split_blocks=True,self_destruct=True)
print("assigning indexes",flush=True)
if reindex:
print("assigning indexes",flush=True)
df['subreddit_id_new'] = df.groupby("subreddit_id").ngroup() + 1
else:
df['subreddit_id_new'] = df['subreddit_id']
if reindex:
grouped = df.groupby(term_id)
df[term_id_new] = grouped.ngroup() + 1
else:
df[term_id_new] = df[term_id]
if rescale_idf:
print("computing idf", flush=True)
df['new_count'] = grouped[term_id].transform('count')
N_docs = df.subreddit_id_new.max() + 1
df['idf'] = np.log(N_docs/(1+df.new_count),dtype='float32') + 1
if tf_family == tf_weight.MaxTF:
df["tf_idf"] = df.relative_tf * df.idf
else: # tf_fam = tf_weight.Norm05
df["tf_idf"] = (0.5 + 0.5 * df.relative_tf) * df.idf
return (df, tfidf_ds, ds_filter)
def pull_names(batch): def similarities(infile, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
return(batch.to_pandas().drop_duplicates())
def similarities(inpath, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, from_date=None, to_date=None, tfidf_colname='tf_idf'):
''' '''
tfidf_colname: set to 'relative_tf' to use normalized term frequency instead of tf-idf, which can be useful for author-based similarities. tfidf_colname: set to 'relative_tf' to use normalized term frequency instead of tf-idf, which can be useful for author-based similarities.
''' '''
if from_date is not None or to_date is not None:
tempdir, subreddit_names = reindex_tfidf_time_interval(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=False, from_date=from_date, to_date=to_date)
def proc_sims(sims, outfile): else:
if issparse(sims): tempdir, subreddit_names = reindex_tfidf(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=False)
sims = sims.todense()
print(f"shape of sims:{sims.shape}") print("loading matrix")
print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}",flush=True) # mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
sims = pd.DataFrame(sims) mat = read_tfidf_matrix(tempdir.name, term_colname, tfidf_colname)
sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1) print(f'computing similarities on mat. mat.shape:{mat.shape}')
sims['_subreddit'] = subreddit_names.subreddit.values print(f"size of mat is:{mat.data.nbytes}")
sims = simfunc(mat)
del mat
p = Path(outfile) if issparse(sims):
sims = sims.todense()
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather")) print(f"shape of sims:{sims.shape}")
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv")) print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}")
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet")) sims = pd.DataFrame(sims)
p.parent.mkdir(exist_ok=True, parents=True) sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
sims['subreddit'] = subreddit_names.subreddit.values
sims.to_feather(outfile) p = Path(outfile)
output_feather = Path(str(p).replace("".join(p.suffixes), ".feather"))
output_csv = Path(str(p).replace("".join(p.suffixes), ".csv"))
output_parquet = Path(str(p).replace("".join(p.suffixes), ".parquet"))
sims.to_feather(outfile)
tempdir.cleanup()
def read_tfidf_matrix_weekly(path, term_colname, week, tfidf_colname='tf_idf'):
term = term_colname term = term_colname
term_id = term + '_id' term_id = term + '_id'
term_id_new = term + '_id_new' term_id_new = term + '_id_new'
entries, subreddit_names = reindex_tfidf(inpath, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN,from_date=from_date,to_date=to_date) dataset = ds.dataset(path,format='parquet')
mat = csr_matrix((entries[tfidf_colname],(entries[term_id_new]-1, entries.subreddit_id_new-1))) entries = dataset.to_table(columns=[tfidf_colname,'subreddit_id_new', term_id_new],filter=ds.field('week')==week).to_pandas()
return(csr_matrix((entries[tfidf_colname], (entries[term_id_new]-1, entries.subreddit_id_new-1))))
print("loading matrix") def read_tfidf_matrix(path, term_colname, tfidf_colname='tf_idf'):
term = term_colname
term_id = term + '_id'
term_id_new = term + '_id_new'
dataset = ds.dataset(path,format='parquet')
print(f"tfidf_colname:{tfidf_colname}")
entries = dataset.to_table(columns=[tfidf_colname, 'subreddit_id_new',term_id_new]).to_pandas()
return(csr_matrix((entries[tfidf_colname],(entries[term_id_new]-1, entries.subreddit_id_new-1))))
# mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
print(f'computing similarities on mat. mat.shape:{mat.shape}')
print(f"size of mat is:{mat.data.nbytes}",flush=True)
sims = simfunc(mat)
del mat
if hasattr(sims,'__next__'):
for simmat, name in sims:
proc_sims(simmat, Path(outfile)/(str(name) + ".feather"))
else:
proc_sims(sims, outfile)
def write_weekly_similarities(path, sims, week, names): def write_weekly_similarities(path, sims, week, names):
sims['week'] = week sims['week'] = week
p = pathlib.Path(path) p = pathlib.Path(path)
if not p.is_dir(): if not p.is_dir():
p.mkdir(exist_ok=True,parents=True) p.mkdir()
# reformat as a pairwise list # reformat as a pairwise list
sims = sims.melt(id_vars=['_subreddit','week'],value_vars=names.subreddit.values) sims = sims.melt(id_vars=['subreddit','week'],value_vars=names.subreddit.values)
sims.to_parquet(p / week.isoformat()) sims.to_parquet(p / week.isoformat())
def column_overlaps(mat): def column_overlaps(mat):
@@ -197,74 +150,136 @@ def column_overlaps(mat):
return intersection / den return intersection / den
def test_lsi_sims(): def column_similarities(mat):
term = "term" norm = np.matrix(np.power(mat.power(2).sum(axis=0),0.5,dtype=np.float32))
mat = mat.multiply(1/norm)
sims = mat.T @ mat
return(sims)
def prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df, included_subreddits):
term = term_colname
term_id = term + '_id' term_id = term + '_id'
term_id_new = term + '_id_new' term_id_new = term + '_id_new'
t1 = time.perf_counter() if min_df is None:
entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet", min_df = 0.1 * len(included_subreddits)
term_colname='term', tfidf = tfidf.filter(f.col('count') >= min_df)
min_df=2000, if max_df is not None:
topN=10000 tfidf = tfidf.filter(f.col('count') <= max_df)
)
t2 = time.perf_counter()
print(f"first load took:{t2 - t1}s")
entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet", tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
term_colname='term',
min_df=2000,
topN=10000
)
t3=time.perf_counter()
print(f"second load took:{t3 - t2}s") # we might not have the same terms or subreddits each week, so we need to make unique ids for each week.
sub_ids = tfidf.select(['subreddit_id','week']).distinct()
sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.partitionBy('week').orderBy("subreddit_id")))
tfidf = tfidf.join(sub_ids,['subreddit_id','week'])
mat = csr_matrix((entries['tf_idf'],(entries[term_id_new], entries.subreddit_id_new))) # only use terms in at least min_df included subreddits in a given week
sims = list(lsi_column_similarities(mat, [10,50])) new_count = tfidf.groupBy([term_id,'week']).agg(f.count(term_id).alias('new_count'))
sims_og = sims tfidf = tfidf.join(new_count,[term_id,'week'],how='inner')
sims_test = list(lsi_column_similarities(mat,[10,50],algorithm='randomized',n_iter=10))
# n_components is the latent dimensionality. sklearn recommends 100. More might be better # reset the term ids
# if n_components is a list we'll return a list of similarities with different latent dimensionalities term_ids = tfidf.select([term_id,'week']).distinct()
# if algorithm is 'randomized' instead of 'arpack' then n_iter gives the number of iterations. term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.partitionBy('week').orderBy(term_id)))
# this function takes the svd and then the column similarities of it tfidf = tfidf.join(term_ids,[term_id,'week'])
def lsi_column_similarities(tfidfmat,n_components=300,n_iter=10,random_state=1968,algorithm='randomized',lsi_model_save=None,lsi_model_load=None):
# first compute the lsi of the matrix
# then take the column similarities
if type(n_components) is int: tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
n_components = [n_components] tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
n_components = sorted(n_components,reverse=True) tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
svd_components = n_components[0] tfidf = tfidf.repartition('week')
if lsi_model_load is not None and Path(lsi_model_load).exists(): tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
print("loading LSI") return(tempdir)
mod = pickle.load(open(lsi_model_load ,'rb'))
lsi_model_save = lsi_model_load
else:
print("running LSI",flush=True)
svd = TruncatedSVD(n_components=svd_components,random_state=random_state,algorithm=algorithm,n_iter=n_iter)
mod = svd.fit(tfidfmat.T)
if lsi_model_save is not None:
Path(lsi_model_save).parent.mkdir(exist_ok=True, parents=True)
pickle.dump(mod, open(lsi_model_save,'wb'))
print(n_components, flush=True)
lsimat = mod.transform(tfidfmat.T)
for n_dims in n_components:
print("computing similarities", flush=True)
sims = column_similarities(lsimat[:,np.arange(n_dims)])
yield (sims, n_dims)
def prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits):
term = term_colname
term_id = term + '_id'
term_id_new = term + '_id_new'
def column_similarities(mat): if min_df is None:
return 1 - pairwise_distances(mat,metric='cosine') min_df = 0.1 * len(included_subreddits)
tfidf = tfidf.filter(f.col('count') >= min_df)
if max_df is not None:
tfidf = tfidf.filter(f.col('count') <= max_df)
tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
# reset the subreddit ids
sub_ids = tfidf.select('subreddit_id').distinct()
sub_ids = sub_ids.withColumn("subreddit_id_new", f.row_number().over(Window.orderBy("subreddit_id")))
tfidf = tfidf.join(sub_ids,'subreddit_id')
# only use terms in at least min_df included subreddits
new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
tfidf = tfidf.join(new_count,term_id,how='inner')
# reset the term ids
term_ids = tfidf.select([term_id]).distinct()
term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
tfidf = tfidf.join(term_ids,term_id)
tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
return tempdir
# try computing cosine similarities using spark
def spark_cosine_similarities(tfidf, term_colname, min_df, included_subreddits, similarity_threshold):
term = term_colname
term_id = term + '_id'
term_id_new = term + '_id_new'
if min_df is None:
min_df = 0.1 * len(included_subreddits)
tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
tfidf = tfidf.cache()
# reset the subreddit ids
sub_ids = tfidf.select('subreddit_id').distinct()
sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.orderBy("subreddit_id")))
tfidf = tfidf.join(sub_ids,'subreddit_id')
# only use terms in at least min_df included subreddits
new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
tfidf = tfidf.join(new_count,term_id,how='inner')
# reset the term ids
term_ids = tfidf.select([term_id]).distinct()
term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
tfidf = tfidf.join(term_ids,term_id)
tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
tfidf = tfidf.withColumn("tf_idf", tfidf.relative_tf * tfidf.idf)
# step 1 make an rdd of entries
# sorted by (dense) spark subreddit id
n_partitions = int(len(included_subreddits)*2 / 5)
entries = tfidf.select(f.col(term_id_new)-1,f.col("subreddit_id_new")-1,"tf_idf").rdd.repartition(n_partitions)
# put like 10 subreddits in each partition
# step 2 make it into a distributed.RowMatrix
coordMat = CoordinateMatrix(entries)
coordMat = CoordinateMatrix(coordMat.entries.repartition(n_partitions))
# this needs to be an IndexedRowMatrix()
mat = coordMat.toRowMatrix()
#goal: build a matrix of subreddit columns and tf-idfs rows
sim_dist = mat.columnSimilarities(threshold=similarity_threshold)
return (sim_dist, tfidf)
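Both similarity paths in this file aim at the cosine similarity between subreddit vectors. As an illustration only, here is a dependency-free sketch of column-wise cosine similarity on a tiny dense matrix; the function name and the data are assumptions, and it is not how Spark or scikit-learn compute the result internally.

```python
import math


def column_cosine_similarities(mat):
    """Cosine similarity between every pair of columns of a dense matrix
    given as a list of equal-length rows."""
    n_cols = len(mat[0])
    cols = [[row[j] for row in mat] for j in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    sims = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(n_cols):
            if norms[i] == 0 or norms[j] == 0:
                continue  # leave the similarity at 0 for all-zero columns
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            sims[i][j] = dot / (norms[i] * norms[j])
    return sims


# Rows are terms, columns are subreddits (hypothetical tf-idf weights).
mat = [
    [1.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
    [2.0, 4.0, 0.0],
]
sims = column_cosine_similarities(mat)
```

Columns 0 and 1 point in the same direction, so their similarity is 1 up to rounding, while column 2 shares no nonzero rows with either and scores 0.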
 def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
@@ -291,20 +306,20 @@ def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weig
 idf = idf.withColumn('idf',f.log(idf.subreddits_in_week) / (1+f.col('count'))+1)
 # collect the dictionary to make a pydict of terms to indexes
-terms = idf.select([term]).distinct() # terms are distinct
-terms = terms.withColumn(term_id,f.row_number().over(Window.orderBy(term))) # term ids are distinct
+terms = idf.select([term,'week']).distinct() # terms are distinct
+terms = terms.withColumn(term_id,f.row_number().over(Window.partitionBy('week').orderBy(term))) # term ids are distinct
 # make subreddit ids
-subreddits = df.select(['subreddit']).distinct()
-subreddits = subreddits.withColumn('subreddit_id',f.row_number().over(Window.orderBy("subreddit")))
-df = df.join(subreddits,on=['subreddit'])
+subreddits = df.select(['subreddit','week']).distinct()
+subreddits = subreddits.withColumn('subreddit_id',f.row_number().over(Window.partitionBy("week").orderBy("subreddit")))
+df = df.join(subreddits,on=['subreddit','week'])
 # map terms to indexes in the tfs and the idfs
-df = df.join(terms,on=[term]) # subreddit-term-id is unique
-idf = idf.join(terms,on=[term])
+df = df.join(terms,on=[term,'week']) # subreddit-term-id is unique
+idf = idf.join(terms,on=[term,'week'])
 # join on subreddit/term to create tf/dfs indexed by term
 df = df.join(idf, on=[term_id, term,'week'])
@@ -316,11 +331,9 @@ def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weig
 else: # tf_fam = tf_weight.Norm05
 df = df.withColumn("tf_idf", (0.5 + 0.5 * df.relative_tf) * df.idf)
-df = df.repartition('week')
-dfwriter = df.write.partitionBy("week")
-return dfwriter
+return df
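To make the Norm05 weighting concrete, the sketch below works one term through the arithmetic: relative term frequency (`tf / sr_max_tf`), the smoothed `log(N_docs/(1+count)) + 1` idf form that appears in `_calc_tfidf`, and the `(0.5 + 0.5 * relative_tf) * idf` weight. The function names and the numbers are illustrative assumptions, not part of the repository.

```python
import math


def relative_tf(tf, sr_max_tf):
    """Term frequency scaled by the largest term frequency in the subreddit."""
    return tf / sr_max_tf


def smoothed_idf(n_docs, doc_count):
    """idf = log(N / (1 + df)) + 1."""
    return math.log(n_docs / (1 + doc_count)) + 1


def norm05_weight(rel_tf, idf):
    """Norm05 weighting: (0.5 + 0.5 * relative_tf) * idf."""
    return (0.5 + 0.5 * rel_tf) * idf


# Hypothetical term: used 3 times in a subreddit whose most frequent term
# occurs 6 times, and present in 9 of 100 subreddits.
rel = relative_tf(3, 6)
idf = smoothed_idf(100, 9)
weight = norm05_weight(rel, idf)
```

With Norm05, even a term with a very small relative frequency keeps at least half of its idf, which dampens the influence of raw counts.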
-def _calc_tfidf(df, term_colname, tf_family, min_df=None, max_df=None):
+def _calc_tfidf(df, term_colname, tf_family):
 term = term_colname
 term_id = term + '_id'
@@ -329,7 +342,7 @@ def _calc_tfidf(df, term_colname, tf_family, min_df=None, max_df=None):
 df = df.join(max_subreddit_terms, on='subreddit')
-df = df.withColumn("relative_tf", (df.tf / df.sr_max_tf))
+df = df.withColumn("relative_tf", df.tf / df.sr_max_tf)
 # group by term. term is unique
 idf = df.groupby([term]).count()
@@ -338,13 +351,7 @@ def _calc_tfidf(df, term_colname, tf_family, min_df=None, max_df=None):
 idf = idf.withColumn('idf',f.log(N_docs/(1+f.col('count')))+1)
 # collect the dictionary to make a pydict of terms to indexes
-terms = idf
-if min_df is not None:
-terms = terms.filter(f.col('count')>=min_df)
-if max_df is not None:
-terms = terms.filter(f.col('count')<=max_df)
-terms = terms.select(term).distinct() # terms are distinct
+terms = idf.select(term).distinct() # terms are distinct
 terms = terms.withColumn(term_id,f.row_number().over(Window.orderBy(term))) # term ids are distinct
 # make subreddit ids
@@ -354,12 +361,12 @@ def _calc_tfidf(df, term_colname, tf_family, min_df=None, max_df=None):
 df = df.join(subreddits,on='subreddit')
 # map terms to indexes in the tfs and the idfs
-df = df.join(terms,on=term,how='inner') # subreddit-term-id is unique
-idf = idf.join(terms,on=term,how='inner')
+df = df.join(terms,on=term) # subreddit-term-id is unique
+idf = idf.join(terms,on=term)
 # join on subreddit/term to create tf/dfs indexed by term
-df = df.join(idf, on=[term_id, term],how='inner')
+df = df.join(idf, on=[term_id, term])
 # agg terms by subreddit to make sparse tf/df vectors
 if tf_family == tf_weight.MaxTF:
@@ -370,36 +377,18 @@ def _calc_tfidf(df, term_colname, tf_family, min_df=None, max_df=None):
 return df
-def tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05, min_df=None, max_df=None):
+def build_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
 term = term_colname
 term_id = term + '_id'
 # aggregate counts by week. now subreddit-term is distinct
 df = df.filter(df.subreddit.isin(include_subs))
 df = df.groupBy(['subreddit',term]).agg(f.sum('tf').alias('tf'))
-df = _calc_tfidf(df, term_colname, tf_family, min_df, max_df)
-df = df.repartition('subreddit')
-dfwriter = df.write
-return dfwriter
+df = _calc_tfidf(df, term_colname, tf_family)
+return df
-def select_topN_subreddits(topN, path="../../data/reddit_similarity/subreddits_by_num_comments_nonsfw.csv"):
+def select_topN_subreddits(topN, path="/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments_nonsfw.csv"):
 rankdf = pd.read_csv(path)
 included_subreddits = set(rankdf.loc[rankdf.comments_rank <= topN,'subreddit'].values)
 return included_subreddits
-def repartition_tfidf(inpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet",
-outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet"):
-spark = SparkSession.builder.getOrCreate()
-df = spark.read.parquet(inpath)
-df = df.repartition(400,'subreddit')
-df.write.parquet(outpath,mode='overwrite')
-def repartition_tfidf_weekly(inpath="/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet",
-outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_repartitioned.parquet"):
-spark = SparkSession.builder.getOrCreate()
-df = spark.read.parquet(inpath)
-df = df.repartition(400,'subreddit','week')
-dfwriter = df.write.partitionBy("week")
-dfwriter.parquet(outpath,mode='overwrite')
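`select_topN_subreddits` keeps every subreddit whose `comments_rank` is at most `topN`. The same selection can be sketched with only the standard library; the CSV contents below are invented, and only the two column names are taken from the function in this diff.

```python
import csv
import io


def select_top_n(csv_text, top_n):
    """Return the set of subreddits whose comments_rank is at most top_n."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["subreddit"] for row in reader if int(row["comments_rank"]) <= top_n}


# Hypothetical ranking data.
ranking = """subreddit,comments_rank
askreddit,1
politics,2
nfl,3
"""
top2 = select_top_n(ranking, 2)
```

Returning a set keeps membership checks cheap and drops any duplicate subreddit names.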


@@ -1,12 +1,9 @@
 import fire
 from pyspark.sql import SparkSession
 from pyspark.sql import functions as f
-from similarities_helper import tfidf_dataset, build_weekly_tfidf_dataset, select_topN_subreddits
-from functools import partial
-inpath = '/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf/comment_authors_compex.parquet'
-# include_terms is a path to a parquet file that contains a column of term_colname + '_id' to include.
-def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_subreddits, included_terms=None, min_df=None, max_df=None):
+from similarities_helper import build_tfidf_dataset, build_weekly_tfidf_dataset, select_topN_subreddits
+def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_subreddits):
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet(inpath)
@@ -14,91 +11,65 @@ def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_
 df = df.filter(~ f.col(term_colname).isin(exclude))
 if included_subreddits is not None:
-include_subs = set(map(str.strip,open(included_subreddits)))
+include_subs = set(map(str.strip,map(str.lower, open(included_subreddits))))
 else:
 include_subs = select_topN_subreddits(topN)
-include_subs = spark.sparkContext.broadcast(include_subs)
-# term_id = term_colname + "_id"
-if included_terms is not None:
-terms_df = spark.read.parquet(included_terms)
-terms_df = terms_df.select(term_colname).distinct()
-df = df.join(terms_df, on=term_colname, how='left_semi')
-dfwriter = func(df, include_subs.value, term_colname)
-dfwriter.parquet(outpath,mode='overwrite',compression='snappy')
+df = func(df, include_subs, term_colname)
+df.write.parquet(outpath,mode='overwrite',compression='snappy')
 spark.stop()
-def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits, min_df, max_df):
-tfidf_func = partial(tfidf_dataset, max_df=max_df, min_df=min_df)
-return _tfidf_wrapper(tfidf_func, inpath, outpath, topN, term_colname, exclude, included_subreddits)
+def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits):
+return _tfidf_wrapper(build_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits)
-def tfidf_weekly(inpath, outpath, static_tfidf_path, topN, term_colname, exclude, included_subreddits):
-return _tfidf_wrapper(build_weekly_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits, included_terms=static_tfidf_path)
+def tfidf_weekly(inpath, outpath, topN, term_colname, exclude, included_subreddits):
+return _tfidf_wrapper(build_weekly_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits)
-def tfidf_authors(inpath="/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
-outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet',
-topN=None,
-included_subreddits=None,
-min_df=None,
-max_df=None):
-return tfidf(inpath,
+def tfidf_authors(outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet',
+topN=25000,
+included_subreddits=None):
+return tfidf("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
 outpath,
 topN,
 'author',
 ['[deleted]','AutoModerator'],
-included_subreddits=included_subreddits,
-min_df=min_df,
-max_df=max_df
+included_subreddits=included_subreddits
 )
-def tfidf_terms(inpath="/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
-outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet',
-topN=None,
-included_subreddits=None,
-min_df=None,
-max_df=None):
-return tfidf(inpath,
+def tfidf_terms(outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet',
+topN=25000,
+included_subreddits=None):
+return tfidf("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
 outpath,
 topN,
 'term',
 [],
-included_subreddits=included_subreddits,
-min_df=min_df,
-max_df=max_df
+included_subreddits=included_subreddits
 )
-def tfidf_authors_weekly(inpath="/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
-static_tfidf_path="/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet",
-outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
-topN=None,
+def tfidf_authors_weekly(outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
+topN=25000,
 included_subreddits=None):
-return tfidf_weekly(inpath,
+return tfidf_weekly("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
 outpath,
-static_tfidf_path,
 topN,
 'author',
 ['[deleted]','AutoModerator'],
 included_subreddits=included_subreddits
 )
-def tfidf_terms_weekly(inpath="/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
-static_tfidf_path="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet",
-outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
-topN=None,
+def tfidf_terms_weekly(outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
+topN=25000,
 included_subreddits=None):
-return tfidf_weekly(inpath,
+return tfidf_weekly("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
 outpath,
-static_tfidf_path,
 topN,
 'term',
 [],


@@ -1,20 +1,16 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
-from datetime import datetime
-from pathlib import Path
 spark = SparkSession.builder.getOrCreate()
 conf = spark.sparkContext.getConf()
-submissions = spark.read.parquet("../../data/reddit_submissions_by_subreddit.parquet")
-submissions = submissions.filter(f.col("CreatedAt") <= datetime(2020,4,13))
+submissions = spark.read.parquet("/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet")
 prop_nsfw = submissions.select(['subreddit','over_18']).groupby('subreddit').agg(f.mean(f.col('over_18').astype('double')).alias('prop_nsfw'))
-df = spark.read.parquet("../../data/reddit_comments_by_subreddit.parquet")
-df = df.filter(f.col("CreatedAt") <= datetime(2020,4,13))
+df = spark.read.parquet("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet")
 # remove /u/ pages
 df = df.filter(~df.subreddit.like("u_%"))
@@ -30,6 +26,4 @@ df = df.toPandas()
 df = df.sort_values("n_comments")
-outpath = Path("../../data/reddit_similarity/subreddits_by_num_comments_nonsfw.csv")
-outpath.parent.mkdir(exist_ok=True, parents=True)
-df.to_csv(str(outpath), index=False)
+df.to_csv('/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv', index=False)


@@ -0,0 +1,18 @@
from similarities_helper import similarities
import numpy as np
import fire
def wang_similarity(mat):
non_zeros = (mat != 0).astype(np.float32)
intersection = non_zeros.T @ non_zeros
return intersection
infile="/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet"; outfile="/gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather"; min_df=1; included_subreddits=None; topN=10000; exclude_phrases=False; from_date=None; to_date=None
def wang_overlaps(infile, outfile="/gscratch/comdata/output/reddit_similarity/wang_similarity_10000.feather", min_df=1, max_df=None, included_subreddits=None, topN=10000, exclude_phrases=False, from_date=None, to_date=None):
return similarities(infile=infile, simfunc=wang_similarity, term_colname='author', outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases, from_date=from_date, to_date=to_date)
if __name__ == "__main__":
fire.Fire(wang_overlaps)
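`wang_similarity` turns the matrix into a 0/1 presence mask and multiplies its transpose by itself, so each output cell counts the rows in which both columns are nonzero. A small plain-Python sketch of that count follows; it assumes rows are authors and columns are subreddits, and the function name and data are made up.

```python
def overlap_counts(mat):
    """For each pair of columns, count the rows where both entries are
    nonzero (the result of non_zeros.T @ non_zeros on a 0/1 mask)."""
    n_cols = len(mat[0])
    present = [[1 if row[j] != 0 else 0 for row in mat] for j in range(n_cols)]
    return [
        [sum(a * b for a, b in zip(present[i], present[j])) for j in range(n_cols)]
        for i in range(n_cols)
    ]


# Rows are authors, columns are subreddits (hypothetical weights).
mat = [
    [0.3, 0.0, 0.1],
    [0.2, 0.5, 0.0],
    [0.0, 0.4, 0.9],
    [0.7, 0.1, 0.2],
]
overlaps = overlap_counts(mat)
```

The diagonal holds each subreddit's own author count, and the matrix is symmetric because overlap does not depend on column order.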


@@ -0,0 +1,81 @@
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from pyspark.sql import Window
import numpy as np
import pyarrow
import pandas as pd
import fire
from itertools import islice
from pathlib import Path
from similarities_helper import *
from multiprocessing import Pool, cpu_count
def _week_similarities(tempdir, term_colname, week):
print(f"loading matrix: {week}")
mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
print('computing similarities')
sims = column_similarities(mat)
del mat
names = subreddit_names.loc[subreddit_names.week == week]
sims = pd.DataFrame(sims.todense())
sims = sims.rename({i: sr for i, sr in enumerate(names.subreddit.values)}, axis=1)
sims['_subreddit'] = names.subreddit.values
write_weekly_similarities(outfile, sims, week, names)
#tfidf = spark.read.parquet('/gscratch/comdata/users/nathante/subreddit_tfidf_weekly.parquet')
def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()
print(outfile)
tfidf = spark.read.parquet(tfidf_path)
if included_subreddits is None:
included_subreddits = select_topN_subreddits(topN)
else:
included_subreddits = set(map(str.strip, open(included_subreddits)))
print(f"computing weekly similarities for {len(included_subreddits)} subreddits")
print("creating temporary parquet with matrix indices")
tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df=None, included_subreddits=included_subreddits)
tfidf = spark.read.parquet(tempdir.name)
# the ids can change each week.
subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
subreddit_names = subreddit_names.sort_values("subreddit_id_new")
subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
spark.stop()
weeks = sorted(list(subreddit_names.week.drop_duplicates()))
# do this step in parallel if we have the memory for it.
# should be doable with pool.map
def week_similarities_helper(week):
_week_similarities(tempdir, term_colname, week)
with Pool(cpu_count()) as pool: # maybe it can be done with 40 cores on the huge machine?
list(pool.map(week_similarities_helper,weeks))
def author_cosine_similarities_weekly(outfile, min_df=2 , included_subreddits=None, topN=500):
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
outfile,
'author',
min_df,
included_subreddits,
topN)
def term_cosine_similarities_weekly(outfile, min_df=None, included_subreddits=None, topN=500):
return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
outfile,
'term',
min_df,
included_subreddits,
topN)
if __name__ == "__main__":
fire.Fire({'authors':author_cosine_similarities_weekly,
'terms':term_cosine_similarities_weekly})
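`Pool.map` has to pickle the callable it sends to worker processes, and the standard `pickle` module cannot serialize a function defined inside another function, so a nested helper such as `week_similarities_helper` may fail at this step. One common workaround, sketched here under that assumption, is a module-level worker with the shared arguments bound through `functools.partial`. The names `week_task` and `run_weeks` and the returned path string are hypothetical placeholders for the real per-week work.

```python
from functools import partial
from multiprocessing import Pool


def week_task(outdir, term_colname, week):
    """Module-level worker, so it can be pickled and sent to child processes.
    The return value stands in for the real per-week computation."""
    return f"{outdir}/{term_colname}_{week}.parquet"


def run_weeks(outdir, term_colname, weeks, processes=2):
    """Run week_task for every week in a process pool."""
    # partial binds the arguments shared by every week; only `week` varies.
    task = partial(week_task, outdir, term_colname)
    with Pool(processes) as pool:
        return pool.map(task, weeks)
```

A script would call `run_weeks("sims", "author", weeks)` from under an `if __name__ == "__main__":` guard, which matters on platforms that start workers by re-importing the main module.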


@@ -1,21 +0,0 @@
#!/usr/bin/env bash
# Script to start a spark cluster and run a script on klone
source $SPARK_CONF_DIR/spark-env.sh
echo "#!/usr/bin/bash" > job_script.sh
echo "source ~/.bashrc" >> job_script.sh
echo "export PYSPARK_PYTHON=python3" >> job.script.sh
echo "export JAVA_HOME=/gscratch/comdata/local/open-jdk" >> job.script.sh
echo "export SPARK_CONF_DIR=/gscratch/comdata/local/spark_config" >> job.script.sh
echo "echo \$(hostname)" >> job_script.sh
echo "source $SPARK_CONF_DIR/spark-env.sh" >> job.script.sh
echo "start_spark_cluster.sh" >> job_script.sh
echo "spark-submit --verbose --master spark://\$(hostname):$SPARK_MASTER_PORT $2 ${@:3}" >> job_script.sh
echo "stop-all.sh" >> job_script.sh
#echo "singularity instance stop --all" >> job_script.sh
chmod +x job_script.sh
let "cpus = $1 * 40"
salloc -p compute-bigmem -A comdata --nodes=$1 --time=48:00:00 -c 40 --mem=362G --exclusive srun -n1 job_script.sh


@@ -1,26 +0,0 @@
#!/usr/bin/env bash
nodes="$(scontrol show hostnames)"
export SPARK_MASTER_HOST=$(hostname)
echo $SPARK_MASTER_HOST
# singularity instance stop spark-boss
# rm -r $HOME/.singularity/instances/sing/$(hostname)/nathante/spark-boss
# for node in $nodes
# dol
# echo $node
# ssh $node "singularity instance stop --all -F"
# done
# singularity instance start /gscratch/comdata/users/nathante/cdsc_base.sif spark-boss
#apptainer exec /gscratch/comdata/users/nathante/containers/nathante.sif
start-master.sh
for node in $nodes
do
# if [ "$node" != "$SPARK_BOSS" ]
# then
echo $node
ssh -t $node start_spark_worker.sh $SPARK_MASTER_HOST
# fi
done


@@ -1,18 +0,0 @@
#!/usr/bin/env bash
# runs on worker node
# instance_name=spark-worker-$(hostname)
# echo $hostname
# instance_url="instance://$instance_name"
# singularity instance list
# singularity instance stop -F "$instance_name"
# singularity instance list
# sleep 5
# ls $HOME/.singularity/instances/sing/$(hostname)/nathante/$instance_name
# rm -r $HOME/.singularity/instances/sing/$(hostname)/nathante/$instance_name
# singularity instance start /gscratch/comdata/users/nathante/cdsc_base.sif $instance_name
source /gscratch/comdata/env/cdsc_klone_bashrc
source $SPARK_CONF_DIR/spark-env.sh
echo $(which python3)
echo $PYSPARK_PYTHON
echo "start-worker.sh spark://$1:$SPARK_MASTER_PORT"
start-worker.sh spark://$1:$SPARK_MASTER_PORT


@@ -0,0 +1,96 @@
from pyarrow import dataset as ds
import numpy as np
import pandas as pd
import plotnine as pn
from matplotlib import pyplot
import seaborn as sns
random = np.random.RandomState(1968)
def load_densities(term_density_file="/gscratch/comdata/output/reddit_density/comment_terms_10000.feather",
author_density_file="/gscratch/comdata/output/reddit_density/comment_authors_10000.feather"):
term_density = pd.read_feather(term_density_file)
author_density = pd.read_feather(author_density_file)
term_density.rename({'overlap_density':'term_density','index':'subreddit'},axis='columns',inplace=True)
author_density.rename({'overlap_density':'author_density','index':'subreddit'},axis='columns',inplace=True)
density = term_density.merge(author_density,on='subreddit',how='inner')
return density
def load_clusters(term_clusters_file="/gscratch/comdata/output/reddit_clustering/comment_terms_10000.feather",
author_clusters_file="/gscratch/comdata/output/reddit_clustering/comment_authors_10000.feather"):
term_clusters = pd.read_feather(term_clusters_file)
author_clusters = pd.read_feather(author_clusters_file)
# rename, join and return
term_clusters.rename({'cluster':'term_cluster'},axis='columns',inplace=True)
author_clusters.rename({'cluster':'author_cluster'},axis='columns',inplace=True)
clusters = term_clusters.merge(author_clusters,on='subreddit',how='inner')
return clusters
if __name__ == '__main__':
df = load_densities()
cl = load_clusters()
df['td_rank'] = df.term_density.rank()
df['ad_rank'] = df.author_density.rank()
df['td_percentile'] = df.td_rank / df.shape[0]
df['ad_percentile'] = df.ad_rank / df.shape[0]
df = df.merge(cl, on='subreddit',how='inner')
term_cluster_density = df.groupby('term_cluster').agg({'td_rank':['mean','min','max'],
'ad_rank':['mean','min','max'],
'td_percentile':['mean','min','max'],
'ad_percentile':['mean','min','max'],
'subreddit':['count']})
author_cluster_density = df.groupby('author_cluster').agg({'td_rank':['mean','min','max'],
'ad_rank':['mean','min','max'],
'td_percentile':['mean','min','max'],
'ad_percentile':['mean','min','max'],
'subreddit':['count']})
# which clusters have the most term_density?
term_cluster_density.iloc[term_cluster_density.td_rank['mean'].sort_values().index]
# which clusters have the most author_density?
term_cluster_density.iloc[term_cluster_density.ad_rank['mean'].sort_values(ascending=False).index].loc[term_cluster_density.subreddit['count'] >= 5][0:20]
high_density_term_clusters = term_cluster_density.loc[(term_cluster_density.td_percentile['mean'] > 0.75) & (term_cluster_density.subreddit['count'] > 5)]
# let's just use term density instead of author density for now. We can do a second batch with author density next.
chosen_clusters = high_density_term_clusters.sample(3,random_state=random)
cluster_info = df.loc[df.term_cluster.isin(chosen_clusters.index.values)]
chosen_subreddits = cluster_info.subreddit.values
dataset = ds.dataset("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet",format='parquet')
comments = dataset.to_table(filter=ds.field("subreddit").isin(chosen_subreddits),columns=['id','subreddit','author','CreatedAt'])
comments = comments.to_pandas()
comments['week'] = comments.CreatedAt.dt.date - pd.to_timedelta(comments['CreatedAt'].dt.dayofweek, unit='d')
author_timeseries = comments.loc[:,['subreddit','author','week']].drop_duplicates().groupby(['subreddit','week']).count().reset_index()
for clid in chosen_clusters.index.values:
ts = pd.read_feather(f"data/ts_term_cluster_{clid}.feather")
pn.options.figure_size = (11.7,8.27)
p = pn.ggplot(ts)
p = p + pn.geom_line(pn.aes('week','value',group='subreddit'))
p = p + pn.facet_wrap('~ subreddit')
p.save(f"plots/ts_term_cluster_{clid}.png")
fig, ax = pyplot.subplots(figsize=(11.7,8.27))
g = sns.FacetGrid(ts,row='subreddit')
g.map_dataframe(sns.scatterplot,'week','value',data=ts,ax=ax)
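The `week` column in the script above is built by subtracting each timestamp's day-of-week offset from its date, which gives the Monday that starts the week (pandas `dayofweek` and Python's `weekday()` both count Monday as 0). A standard-library sketch of the same calculation, with a hypothetical helper name:

```python
from datetime import date, datetime, timedelta


def week_start(ts):
    """Return the Monday (as a date) of the week containing `ts`."""
    d = ts.date() if isinstance(ts, datetime) else ts
    return d - timedelta(days=d.weekday())


# A Wednesday afternoon and the following Sunday night fall in the same week.
wednesday = week_start(datetime(2020, 4, 15, 13, 30))
sunday = week_start(datetime(2020, 4, 19, 23, 59))
```

Both calls return `date(2020, 4, 13)`, so comments from either day are grouped into the same weekly bucket.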


@@ -0,0 +1,37 @@
import pandas as pd
import numpy as np
from pyspark.sql import functions as f
from pyspark.sql import SparkSession
from choose_clusters import load_clusters, load_densities
import fire
from pathlib import Path
def main(term_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_terms_10000.feather",
author_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_authors_10000.feather",
term_densities_path="/gscratch/comdata/output/reddit_density/comment_terms_10000.feather",
author_densities_path="/gscratch/comdata/output/reddit_density/comment_authors_10000.feather",
output="data/subreddit_timeseries.parquet"):
clusters = load_clusters(term_clusters_path, author_clusters_path)
densities = load_densities(term_densities_path, author_densities_path)
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet")
df = df.withColumn('week', f.date_trunc('week', f.col("CreatedAt")))
# time series of unique authors by subreddit and week
ts = df.select(['subreddit','week','author']).distinct().groupby(['subreddit','week']).count()
ts = ts.repartition('subreddit')
spk_clusters = spark.createDataFrame(clusters)
ts = ts.join(spk_clusters, on='subreddit', how='inner')
spk_densities = spark.createDataFrame(densities)
ts = ts.join(spk_densities, on='subreddit', how='inner')
ts.write.parquet(output, mode='overwrite')
if __name__ == "__main__":
fire.Fire(main)
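The time series built in `main` is a count of distinct authors per subreddit and week: drop duplicate (subreddit, week, author) rows, then count rows per (subreddit, week). A plain-Python sketch of that two-step aggregation, with invented rows and a hypothetical function name:

```python
from collections import Counter


def unique_author_counts(comments):
    """Count distinct authors per (subreddit, week).

    `comments` is an iterable of (subreddit, week, author) tuples.
    """
    distinct = set(comments)  # plays the role of .distinct()
    # Counting the remaining rows per key plays the role of groupby().count().
    return Counter((sub, week) for sub, week, _author in distinct)


# Hypothetical comment records; author "x" posts twice in the same week.
comments = [
    ("a", "w1", "x"),
    ("a", "w1", "x"),
    ("a", "w1", "y"),
    ("b", "w1", "x"),
    ("a", "w2", "x"),
]
counts = unique_author_counts(comments)
```

Deduplicating before counting is what makes this a count of authors rather than a count of comments.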


@@ -0,0 +1 @@
/annex/objects/SHA256E-s60874--d536adb0ec637fca262c4e1ec908dd8b4a5d1464047b583cd1a99cc6dba87191

11
visualization/Makefile Normal file

@@ -0,0 +1,11 @@
all: subreddit_author_tf_similarities_10000.html #comment_authors_10000.html
# wang_tsne_10000.html
# wang_tsne_10000.html:/gscratch/comdata/output/reddit_tsne/wang_similarity_10000.feather /gscratch/comdata/output/reddit_clustering/wang_similarity_10000.feather tsne_vis.py
# python3 tsne_vis.py --tsne_data=/gscratch/comdata/output/reddit_tsne/wang_similarity_10000.feather --clusters=/gscratch/comdata/output/reddit_clustering/wang_similarity_10000.feather --output=wang_tsne_10000.html
# comment_authors_10000.html:/gscratch/comdata/output/reddit_tsne/comment_authors_10000.feather /gscratch/comdata/output/reddit_clustering/comment_authors_10000.feather tsne_vis.py
# python3 tsne_vis.py --tsne_data=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather --clusters=/gscratch/comdata/output/reddit_clustering/comment_authors_10000.feather --output=comment_authors_10000.html
subreddit_author_tf_similarities_10000.html:/gscratch/comdata/output/reddit_tsne/subreddit_author_tf_similarities_10000.feather /gscratch/comdata/output/reddit_clustering/subreddit_author_tf_similarities_10000.feather tsne_vis.py
start_spark_and_run.sh 1 tsne_vis.py --tsne_data=/gscratch/comdata/output/reddit_tsne/subreddit_author_tf_similarities_10000.feather --clusters=/gscratch/comdata/output/reddit_clustering/subreddit_author_tf_similarities_10000.feather --output=subreddit_author_tf_similarities_10000.html


@@ -0,0 +1 @@
../../.git/annex/objects/Qk/wG/SHA256E-s145210--14a2ad6660d1e4015437eff556ec349dd10a115a4f96594152a29e83d00aa784/SHA256E-s145210--14a2ad6660d1e4015437eff556ec349dd10a115a4f96594152a29e83d00aa784


@@ -0,0 +1 @@
../../.git/annex/objects/w7/2f/SHA256E-s44458--f1c5247775ecf06514a0ff9e523e944bc8fcd9d0fdb6f214cc1329b759d4354e/SHA256E-s44458--f1c5247775ecf06514a0ff9e523e944bc8fcd9d0fdb6f214cc1329b759d4354e


@@ -0,0 +1 @@
../../.git/annex/objects/WX/v3/SHA256E-s190874--c2aea719f989dde297ca5f13371e156693c574e44acd9a0e313e5e3a3ad4b543/SHA256E-s190874--c2aea719f989dde297ca5f13371e156693c574e44acd9a0e313e5e3a3ad4b543


@@ -0,0 +1 @@
../../.git/annex/objects/mq/2z/SHA256E-s58834--2e7b3ee11f47011fd9b34bddf8f1e788d35ab9c9e0bb6a1301b0b916135400cf/SHA256E-s58834--2e7b3ee11f47011fd9b34bddf8f1e788d35ab9c9e0bb6a1301b0b916135400cf

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

175
visualization/tsne_vis.py Normal file

@@ -0,0 +1,175 @@
import pyarrow
import altair as alt
alt.data_transformers.disable_max_rows()
alt.data_transformers.enable('default')
from sklearn.neighbors import NearestNeighbors
import pandas as pd
from numpy import random
import fire
import numpy as np
def base_plot(plot_data):
# base = base.encode(alt.Color(field='color',type='nominal',scale=alt.Scale(scheme='category10')))
cluster_dropdown = alt.binding_select(options=[str(c) for c in sorted(set(plot_data.cluster))])
# subreddit_dropdown = alt.binding_select(options=sorted(plot_data.subreddit))
cluster_click_select = alt.selection_single(on='click',fields=['cluster'], bind=cluster_dropdown, name=' ')
# cluster_select = alt.selection_single(fields=['cluster'], bind=cluster_dropdown, name='cluster')
# cluster_select_and = cluster_click_select & cluster_select
#
# subreddit_select = alt.selection_single(on='click',fields=['subreddit'],bind=subreddit_dropdown,name='subreddit_click')
color = alt.condition(cluster_click_select ,
alt.Color(field='color',type='nominal',scale=alt.Scale(scheme='category10')),
alt.value("lightgray"))
base = alt.Chart(plot_data).mark_text().encode(
alt.X('x',axis=alt.Axis(grid=False),scale=alt.Scale(domain=(-65,65))),
alt.Y('y',axis=alt.Axis(grid=False),scale=alt.Scale(domain=(-65,65))),
color=color,
text='subreddit')
base = base.add_selection(cluster_click_select)
return base
def zoom_plot(plot_data):
chart = base_plot(plot_data)
chart = chart.interactive()
chart = chart.properties(width=1275,height=800)
return chart


def viewport_plot(plot_data):
    selector1 = alt.selection_interval(encodings=['x','y'],init={'x':(-65,65),'y':(-65,65)})
    selectorx2 = alt.selection_interval(encodings=['x'],init={'x':(30,40)})
    selectory2 = alt.selection_interval(encodings=['y'],init={'y':(-20,0)})

    base = base_plot(plot_data)

    viewport = base.mark_point(fillOpacity=0.2,opacity=0.2).encode(
        alt.X('x',axis=alt.Axis(grid=False)),
        alt.Y('y',axis=alt.Axis(grid=False)),
    )
    viewport = viewport.properties(width=600,height=400)

    # First overview panel: brush a region in x and y.
    viewport1 = viewport.add_selection(selector1)

    # Second panel shows the region brushed in the first and carries
    # separate x and y brushes for the detail view.
    viewport2 = viewport.encode(
        alt.X('x',axis=alt.Axis(grid=False),scale=alt.Scale(domain=selector1)),
        alt.Y('y',axis=alt.Axis(grid=False),scale=alt.Scale(domain=selector1))
    )
    viewport2 = viewport2.add_selection(selectorx2)
    viewport2 = viewport2.add_selection(selectory2)

    # Detail view of subreddit labels, scaled to the second panel's brushes.
    sr = base.encode(alt.X('x',axis=alt.Axis(grid=False),scale=alt.Scale(domain=selectorx2)),
                     alt.Y('y',axis=alt.Axis(grid=False),scale=alt.Scale(domain=selectory2))
    )
    sr = sr.properties(width=1275,height=600)

    chart = (viewport1 | viewport2) & sr
    return chart


def assign_cluster_colors(tsne_data, clusters, n_colors, n_neighbors = 4):
    tsne_data = tsne_data.merge(clusters,on='subreddit')
    centroids = tsne_data.groupby('cluster').agg({'x':np.mean,'y':np.mean})
    color_ids = np.arange(n_colors)
    distances = np.empty(shape=(centroids.shape[0],centroids.shape[0]))
    groups = tsne_data.groupby('cluster')

    points = np.array(tsne_data.loc[:,['x','y']])
    centers = np.array(centroids.loc[:,['x','y']])

    # point x centroid
    point_center_distances = np.linalg.norm((points[:,None,:] - centers[None,:,:]),axis=-1)

    # distances is cluster x centroid: row c holds, for each centroid, the
    # distance from that centroid to the closest point of cluster c.
    # Assumes cluster ids are the integers 0..n_clusters-1.
    for gid, group in groups:
        c_dists = point_center_distances[group.index.values,:].min(axis=0)
        distances[group.cluster.values[0],] = c_dists

    # nbrs = NearestNeighbors(n_neighbors=n_neighbors).fit(centroids)
    # distances, indices = nbrs.kneighbors()

    # For each centroid (column), take the n_neighbors clusters with the
    # smallest distance; indices is centroid x neighbor.
    nearest = distances.argpartition(n_neighbors,0)
    indices = nearest[:n_neighbors,:].T

    # neighbor_distances = np.copy(distances)
    # neighbor_distances.sort(0)
    # neighbor_distances = neighbor_distances[0:n_neighbors,:]
    # nbrs = NearestNeighbors(n_neighbors=n_neighbors,metric='precomputed').fit(distances)
    # distances, indices = nbrs.kneighbors()

    # Greedy coloring: give each cluster the first color that none of its
    # neighbors has been assigned yet (-1 means "not yet colored").
    color_assignments = np.repeat(-1,len(centroids))
    for i in range(len(centroids)):
        knn = indices[i]
        knn_colors = color_assignments[knn]
        available_colors = color_ids[list(set(color_ids) - set(knn_colors))]
        if len(available_colors) > 0:
            color_assignments[i] = available_colors[0]
        else:
            raise Exception("Can't color this many neighbors with this many colors")

    centroids = centroids.reset_index()
    colors = centroids.loc[:,['cluster']]
    colors['color'] = color_assignments
    tsne_data = tsne_data.merge(colors,on='cluster')
    return tsne_data


def build_visualization(tsne_data, clusters, output):
    # tsne_data = "/gscratch/comdata/output/reddit_tsne/subreddit_author_tf_similarities_10000.feather"
    # clusters = "/gscratch/comdata/output/reddit_clustering/subreddit_author_tf_similarities_10000.feather"
    tsne_data = pd.read_feather(tsne_data)
    clusters = pd.read_feather(clusters)

    tsne_data = assign_cluster_colors(tsne_data,clusters,10,8)

    # Count subreddits per cluster; the merge below needs sr_per_cluster,
    # which was previously commented out and raised a NameError.
    sr_per_cluster = tsne_data.groupby('cluster').subreddit.count().reset_index()
    sr_per_cluster = sr_per_cluster.rename(columns={'subreddit':'cluster_size'})
    tsne_data = tsne_data.merge(sr_per_cluster,on='cluster')

    term_zoom_plot = zoom_plot(tsne_data)
    term_zoom_plot.save(output)

    # Write the viewport variant next to the zoom plot.
    term_viewport_plot = viewport_plot(tsne_data)
    term_viewport_plot.save(output.replace(".html","_viewport.html"))


if __name__ == "__main__":
    fire.Fire(build_visualization)

# commenter_data = pd.read_feather("tsne_author_fit.feather")
# clusters = pd.read_feather('author_3000_clusters.feather')
# commenter_data = assign_cluster_colors(commenter_data,clusters,10,8)
# commenter_zoom_plot = zoom_plot(commenter_data)
# commenter_viewport_plot = viewport_plot(commenter_data)
# commenter_zoom_plot.save("subreddit_commenters_tsne_3000.html")
# commenter_viewport_plot.save("subreddit_commenters_tsne_3000_viewport.html")
# chart = chart.properties(width=10000,height=10000)
# chart.save("test_tsne_whole.svg")