Merge remote-tracking branch 'refs/remotes/origin/excise_reindex' into excise_reindex

git-annex in
Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex
2022-04-06 11:14:13 -07:00 · 2022-04-06 11:11:11 -07:00 · 2022-01-19 14:01:44 -08:00 · 2022-01-19 13:57:02 -08:00 · 2021-12-10 21:23:32 -08:00 · 2021-08-11 22:48:33 -07:00
68 changed files with 2660 additions and 1959 deletions
--- a/README.md
+++ b/README.md
@@ -2,111 +2,51 @@
 title: Utilities for Reddit Data Science
 ---
 `cdsc_reddit` is a collection of tools for working with Reddit data on the
 Hyak super computing system at the University of Washington. It is built
 around [PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
 and [pyarrow](https://arrow.apache.org/docs/python/) so that the underlying
 pipelines scale to the full Pushshift archive.
-The project was originally developed by [Nate
+The reddit_cdsc project contains tools for working with Reddit data. The project is designed for the Hyak super computing system at the University of Washington. It consists of a set of Python and bash scripts and uses [PySpark](https://spark.apache.org/docs/latest/api/python/index.html "PySpark documentation") and [pyarrow](https://arrow.apache.org/docs/python/ "Documentation of the Python Arrow bindings") to process large datasets. As of November 1st 2020, the project is under active development by [Nate TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Washington.29 "Nate's profile on the Community Data Science Collective Wiki") and provides scripts for:
 TeBlunthuis](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29)
 and is now maintained by a rotating set of researchers in the Community
 Data Science Collective, including Benjamin Mako Hill, Madelyn Douglas, and
 others.
-At a high level, the codebase covers four kinds of work:
+- Pulling and updating dumps from [Pushshift](https://pushshift.io "Pushshift.io") in `pull_pushshift_comments.sh` and `pull_pushshift_submissions.sh`.
 - Uncompressing and parsing the dumps into [Parquet](https://parquet.apache.org/ "Apache Parquet website") [datasets](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
 - Running text analysis based on [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") including 
  - Extracting terms from Reddit comments in `tf_comments.py`
  - Detecting common phrases based on [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information "Wikipedia article on pointwise mutual information")
  - Building TF-IDF vectors for each subreddit `idf_comments.py` and (more experimentally) at the subreddit-week level `idf_comments_weekly.py` 
  - Computing cosine similarities between subreddits based on TF-IDF `term_cosine_similarity.py`. 
- **Ingest.** Turning Pushshift comment and submission dumps into
+Right now, two steps are still in earlier stages of progress:
  partitioned Parquet datasets that are fast to query by subreddit or by
  author.
 - **Text features.** Building per-subreddit TF-IDF vectors over comment
  text, including a phrase-detection pass based on pointwise mutual
  information.
 - **Similarity, clustering, and density.** Computing cosine similarities
  between subreddits (by terms or by overlapping authors), clustering the
  resulting similarity matrices, and summarizing how dense each
  neighborhood is.
 - **Time series and visualization.** Pulling activity time series per
  subreddit and producing t-SNE plots of the clustering output.
-Several pieces are still rough — the user interfaces for many of the
+- Approach comparable to tf-idf for similarity between subreddits in terms of comment authors. 
-scripts assume familiarity with the project, and the TF-IDF pipeline does
+- Clustering subreddits based on cosine-similarities using [power iteration clustering (PIC)](http://www.cs.cmu.edu/~wcohen/postscript/icml2010-pic-final.pdf "Paper on power iteration clustering")
 not yet strip hyperlinks or bot comments, so subreddits with similar
 automod messages can look misleadingly similar.
-## Repository layout
+The TF-IDF for comments still has some kinks to iron out to remove hyper links and bot comments. Right now subreddits that have similar automoderation messages appear very similar.
-| Directory | What's in it |
+The user interfaces for most of the scripts are rough and need to be refined for re-use by others. 
 |---|---|
 | `datasets/` | Scripts that convert the raw dumps into partitioned, sorted Parquet datasets. |
 | `ngrams/` | Term extraction from comments, phrase detection via PMI, and supporting batch scripts. |
 | `similarities/` | TF-IDF construction and cosine-similarity computation, for both terms and authors, including a weekly variant. |
 | `clustering/` | Affinity-propagation clustering of the similarity matrices and t-SNE fits for visualization. |
 | `density/` | Per-subreddit overlap density measures derived from the similarity matrices. |
 | `timeseries/` | Per-subreddit activity time series, plus tooling for choosing among clustering runs. |
 | `visualization/` | Altair-based interactive plots of subreddit clusters. |
 | `bots/` | Heuristics for flagging likely bot accounts. |
 | `examples/` | Small standalone examples using pyarrow. |
-## Sourcing the dumps
+## Pulling data from [Pushshift](https://pushshift.io "Pushshift.io") ##
-Pushshift was effectively wound down after Reddit cut off third-party API
+- `pull_pushshift_comments.sh` uses wget to download comment dumps to `/gscratch/comdata/raw_data/reddit_dumps/comments`. It doesn't download files that already exist and runs `check_comments_shas.sh` to verify that the files downloaded correctly. 
 access in 2023, and the original `files.pushshift.io` archive is gone.
 Collection of new Reddit comment and submission data has since been
 picked up by [ArcticShift](https://github.com/ArthurHeitmann/arctic_shift),
 which publishes both the historical Pushshift archive and the new data
 it continues to collect, with monthly updates redistributed as academic
 torrents by Reddit users `u/Watchful1` and `u/RaiderBDev`. Fetching the
 dumps from a torrent client is a manual prerequisite to running the rest
 of this pipeline; step-by-step instructions for the current CDSC
 workflow — including which torrents to pull and how to stage the `.zst`
 files on Hyak — live on the CDSC wiki at
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit).
 The earlier `dumps/` directory of `pull_pushshift_*.sh` and SHA-check
 scripts has been removed since the URLs they pointed at no longer
 resolve.
-## Building Parquet datasets
+- `pull_pushshift_submissions.sh` does the same for submissions and puts them in `/gscratch/comdata/raw_data/reddit_dumps/comments`.
-The raw dumps are huge compressed JSON files with a lot of metadata that
+## Building Parquet Datasets ##
 we usually don't need. They aren't indexed, so it's expensive to pull data
 for just a handful of subreddits, and they are awkward to read directly
 into Spark. Extracting the useful fields and rewriting the data as
 Parquet makes everything downstream cheaper. The conversion happens in
 two steps:
-1. Extracting JSON into temporary, unpartitioned Parquet files using
+Pushshift dumps are huge compressed JSON files with a lot of metadata that we may not need. They aren't indexed, so it's expensive to pull data from just a handful of subreddits. It also turns out that it's a pain to read these compressed files straight into Spark. Extracting useful variables from the dumps and building Parquet datasets will make them easier to work with. This happens in two steps:
   pyarrow (`comments_2_parquet_part1.py`,
   `submissions_2_parquet_part1.py`).
 2. Repartitioning and sorting the data using PySpark
   (`comments_2_parquet_part2.py`, `submissions_2_parquet_part2.py`).
-The final datasets live in `/gscratch/comdata/output/`:
+1. Extracting json into (temporary, unpartitioned) parquet files using pyarrow.
 2. Repartitioning and sorting the data using pyspark.
- `reddit_comments_by_author.parquet` — comments partitioned and sorted by
+The final datasets are in `/gscratch/comdata/output`:
  author (lowercase).
 - `reddit_comments_by_subreddit.parquet` — comments partitioned and sorted
  by subreddit (lowercase).
 - `reddit_submissions_by_author.parquet` — submissions partitioned and
  sorted by author (lowercase).
 - `reddit_submissions_by_subreddit.parquet` — submissions partitioned and
  sorted by subreddit (lowercase).
-Splitting the work this way lets us decompress and parse the dumps in the
+- `reddit_comments_by_author.parquet` has comments partitioned and sorted by username (lowercase).
-Hyak backfill queue and then sort them in Spark. Partitioning makes it
+- `reddit_comments_by_subreddit.parquet` has comments partitioned and sorted by subreddit name (lowercase).
-possible to read data for specific subreddits or authors efficiently, and
+- `reddit_submissions_by_author.parquet` has submissions partitioned and sorted by username (lowercase).
-sorting makes per-subreddit or per-user aggregations cheap. More
+- `reddit_submissions_by_subreddit.parquet` has submissions partitioned and sorted by subreddit name (lowercase).
 documentation on using these files lives on the [CDSC
 wiki](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets).
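To make the benefit of sorting concrete, here is a minimal pure-Python sketch, not taken from the repository. It assumes rows arrive already sorted by subreddit, as they would when read from the sorted datasets; the function name `comments_per_subreddit` and the toy rows are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def comments_per_subreddit(sorted_rows):
    """Yield (subreddit, comment_count) from rows already sorted by subreddit.

    Because the input is sorted, each subreddit's rows are contiguous, so the
    aggregation is a single streaming pass with no hash table of all groups.
    """
    for subreddit, group in groupby(sorted_rows, key=itemgetter("subreddit")):
        yield subreddit, sum(1 for _ in group)


# Hypothetical toy rows standing in for records read from the sorted dataset.
rows = [
    {"subreddit": "askscience", "author": "a"},
    {"subreddit": "askscience", "author": "b"},
    {"subreddit": "christianity", "author": "c"},
]
counts = dict(comments_per_subreddit(rows))
```

The same idea is why per-subreddit or per-author aggregations stay cheap at scale: a reader never has to hold more than one group in memory at a time.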
-## TF-IDF subreddit similarity
+Breaking this down into two steps is useful because it allows us to decompress and parse the dumps in the backfill queue and then sort them in Spark. Partitioning the data makes it possible to efficiently read data for specific subreddits or authors. Sorting it means that you can efficiently compute aggregations at the subreddit or user level. More documentation on using these files is available [here](https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets "Wikilink to documentation on the Reddit parquet datasets").
-[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a simple
+## TF-IDF Subreddit Similarity ##
-information-retrieval technique we use to quantify the topic of a
+
-subreddit. The goal is to build a vector for each subreddit that scores
+[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf "Wikipedia article on tf-idf") is a common and simple information retrieval technique that we can use to quantify the topic of a subreddit. The goal of TF-IDF is to build a vector for each subreddit that scores every term (or phrase) according to how characteristic it is of the overall lexicon used in that subreddit. For example, the most characteristic terms in the subreddit /r/christianity in the current version of the TF-IDF model are:
 every term (or phrase) according to how characteristic it is of the
 lexicon used there. For example, the most characteristic terms in
 `/r/christianity` in the current model are:
 | Term         | tf_idf |
 |:------------:|:------:|
@@ -116,121 +56,61 @@ lexicon used there. For example, the most characteristic terms in
 | bible        | 0.557  |
 | scripture    | 0.55   |
-TF-IDF is the product of two pieces: *term frequency* (how often a term
+TF-IDF stands for "term frequency - inverse document frequency" because it is the product of two terms "term frequency" and "inverse document frequency." Term frequency quantifies the amount that a term appears in a subreddit (document). Inverse document frequency quantifies how much that term appears in other subreddits (documents). As you can see on the Wikipedia page, there are many possible ways of constructing and combining these terms. 
 appears in a subreddit) and *inverse document frequency* (how rare the
 term is across other subreddits). There are many ways to construct and
 combine these; the [Wikipedia
 page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) catalogs the common
 variants.
-We normalize term frequency by the maximum raw term frequency for each
+$x + y = z_{1,d}$ 
 subreddit:
-$$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$$
+I chose to normalize term frequency by the maximum (raw) term frequency for each subreddit:
 $\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t^{'} \in d}{f_{t^{'},d}}}$ 
-and use the log inverse document frequency:
+I use the log inverse document frequency:
 $\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$
-$$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
+I then combine them using some smoothing to get:
-combined with a smoothing term:
+$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$ 
-$$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
+### Building TF-IDF vectors ###
-(Other normalization strategies are worth trying — see the note in
+The process for building TF-IDF vectors has four steps:
 `similarities/TODO`.)
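As a rough illustration of the formulas above, the following sketch computes max-normalized term frequency, log inverse document frequency, and the smoothed product on a toy corpus. The function name `tfidf_vectors` and the toy data are invented for this example; the project's own implementation runs in Spark and may differ in detail.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each document name to a {term: tfidf} dict.

    docs: dict of document (subreddit) name -> list of terms.
    tf is normalized by the maximum raw count in the document, idf is
    log(N / document frequency), and the two are combined with 0.5 smoothing.
    """
    counts = {name: Counter(terms) for name, terms in docs.items() if terms}
    n_docs = len(counts)

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())

    vectors = {}
    for name, term_counts in counts.items():
        max_f = max(term_counts.values())
        vectors[name] = {
            term: (0.5 + 0.5 * f / max_f) * math.log(n_docs / doc_freq[term])
            for term, f in term_counts.items()
        }
    return vectors


# Hypothetical toy corpus: two "subreddits" with a handful of terms each.
toy = {
    "christianity": ["bible", "bible", "scripture", "the"],
    "cooking": ["recipe", "the", "the"],
}
vectors = tfidf_vectors(toy)
```

In this toy corpus a term that appears in every document, such as "the", gets an idf of zero and so drops out of every vector, while "bible" scores highest in the first document.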
-### Building TF-IDF vectors
+1. Extracting terms using `tf_comments.py`
 2. Detecting common phrases using `top_comment_phrases.py`
 3. Extracting terms and common phrases using `tf_comments.py --mwe-pass='second'`
 4. Building idf and tf-idf scores in `idf_comments.py`
-The pipeline has four steps:
+#### Running `tf_comments.py` on the backfill queue ####
-1. Extract terms with `ngrams/tf_comments.py`.
+The main reason that I did it in 4 steps instead of one is to take advantage of the backfill queue for running `tf_comments.py`. This step requires reading all of the text in every comment and converting it to a bag of words at the subreddit level. This is a lot of computation, but it is easily parallelizable. The script `run_tf_jobs.sh` partially automates running step 1 (or 3) on the backfill queue. 
 2. Detect common phrases with `ngrams/top_comment_phrases.py`.
 3. Re-extract terms together with detected phrases via
   `ngrams/tf_comments.py --mwe-pass=second`.
 4. Compute IDF and TF-IDF scores in `similarities/tfidf.py`.
-#### Running `tf_comments.py` on the backfill queue
+#### Phrase detection using Pointwise Mutual Information ####
-The main reason for the four-step layout is that `tf_comments.py` is
+TF-IDF is simple, but only uses single words (unigrams). Sequences of multiple words can be important to account for how words have different meanings in different contexts or how sequences of words refer to distinct things like names. Dealing with context or longer sequences of words is a common challenge in natural language processing since the number of possible n-grams grows very quickly as n gets bigger. Phrase detection helps with this problem by limiting the set of n-grams to the most informative ones. 
 trivially parallel — it reads every comment and rewrites each subreddit
 as a bag of words — so it benefits from being farmed out to the Hyak
 backfill queue. `ngrams/run_tf_jobs.sh` partially automates the dispatch.
-#### Phrase detection using pointwise mutual information
+But how do we detect phrases?  I implemented [Pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information "Wikipedia article on pointwise mutual information"), which is a pretty simple way, but seems to work pretty well. 
-TF-IDF over unigrams misses the fact that sequences of words often carry
+PMI is a quantity derived from information theory. The intuition is that if two words occur together quite frequently compared to how often they appear separately, then the co-occurrence is likely to be informative. 
 distinct meaning (names, fixed expressions, in-jokes). Considering every
 possible n-gram is prohibitive because the candidate set explodes with
 `n`, so we use phrase detection to limit ourselves to informative
 n-grams.
-We use [pointwise mutual
+$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}.$
 information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
 (PMI), which is simple and works well in practice. The intuition is that
 if two words co-occur much more often than their marginal frequencies
 would predict, the pair is probably meaningful:
-$$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
+In `tf_comments.py`, if `--mwe-pass=first` is set, then a 10\% sample of 1-4-grams (sequences of terms up to length 4) will be written to a file to be consumed by `top_comment_phrases.py`.  `top_comment_phrases.py` computes the PMI for these possible phrases and writes those that occur at least 3500 times in the sample of n-grams and have a PMI of at least 3 (about 65000 expressions). 
-When `tf_comments.py` is run with `--mwe-pass=first`, it writes a 10%
+`tf_comments.py --mwe-pass=second` then uses the detected phrases and adds them to the term frequency data. 
 sample of 1- to 4-grams to a file. `top_comment_phrases.py` then
 computes PMI over that sample and keeps phrases that occur at least
 3,500 times and have PMI of at least 3 — roughly 65,000 expressions.
 A second pass of `tf_comments.py --mwe-pass=second` folds those phrases
 back into the term-frequency data.
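The PMI filter described above can be sketched for the bigram case as follows. This is a simplified illustration rather than the code in `top_comment_phrases.py`: it handles only pairs of adjacent tokens, and the names `bigram_pmi` and `detect_phrases`, the toy sentence, and the tiny thresholds are all assumptions made for the example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return ({(x, y): pmi}, {(x, y): count}) for adjacent token pairs.

    pmi(x; y) = log(p(x, y) / (p(x) * p(y))), with probabilities estimated
    from raw unigram and bigram counts in `tokens`.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores, bigrams


def detect_phrases(tokens, min_count, min_pmi):
    """Keep bigrams that are both frequent enough and have high enough PMI."""
    scores, bigrams = bigram_pmi(tokens)
    return {
        pair: pmi
        for pair, pmi in scores.items()
        if bigrams[pair] >= min_count and pmi >= min_pmi
    }


# Hypothetical toy text; real thresholds in the text above are far larger.
tokens = ("new york pizza and new york bagels and "
          "good pizza and good bagels").split()
scores, _ = bigram_pmi(tokens)
phrases = detect_phrases(tokens, min_count=2, min_pmi=1.9)
```

On this toy text the pair ("new", "york") has the highest PMI and is the only one that passes both thresholds, which mirrors the intuition that a pair occurring together more often than its parts would suggest is worth treating as a phrase.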
-### Cosine similarity
+### Cosine Similarity ###
-Once the TF-IDF vectors are built, computing a similarity score between
+Once the tf-idf vectors are built, making a similarity score between two subreddits is straightforward using cosine similarity. 
 two subreddits is straightforward with cosine similarity:
-$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
+$\text{similarity} = \cos(\theta) = {\mathbf{A} \cdot \mathbf{B} \over \|\mathbf{A}\| \|\mathbf{B}\|} = \frac{ \sum\limits_{i=1}^{n}{A_i  B_i} }{ \sqrt{\sum\limits_{i=1}^{n}{A_i^2}}  \sqrt{\sum\limits_{i=1}^{n}{B_i^2}} }$
-Each subreddit is a vector in a high-dimensional term space. The dot
+Intuitively, we represent two subreddits as lines in a high-dimensional space (tf-idf vectors). 
-product gives a weighted sum of shared terms, and dividing by the
+In linear algebra, the dot product ($\cdot$) between two vectors takes their weighted sum (e.g. linear regression is a dot product of a vector of covariates and a vector of weights).  
-vector magnitudes removes the effect of differing vocabulary size — what
+The vectors might have different lengths, for example if one subreddit has more words in its comments than the other, so in cosine similarity the dot product is normalized by the magnitudes (lengths) of the vectors. 
-remains is the cosine of the angle between the two vectors. Cosine
+It turns out that this is equivalent to taking the cosine of the angle between the two vectors. So cosine similarity in essence quantifies the angle between the two lines in high-dimensional space. The greater the cosine similarity between two subreddits, the more correlated their tf-idf vectors are. 
 similarity with TF-IDF is popular (and has been used on Reddit several
 times in prior research) because it captures correlation between the
 *most characteristic* terms of two communities.
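The cosine similarity formula above translates directly into a few lines of Python. This is a minimal sketch over sparse vectors stored as `{term: weight}` dicts; the function name `cosine_similarity` and the example weights are hypothetical, and the project itself computes similarities with Spark's matrix library rather than code like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as dicts."""
    # Only terms present in both vectors contribute to the dot product.
    shared = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared)

    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention for this sketch: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical tf-idf weights for two subreddits.
sub_a = {"bible": 3.0, "scripture": 4.0}
sub_b = {"bible": 3.0}
similarity = cosine_similarity(sub_a, sub_b)
```

Because both vectors are divided by their magnitudes, scaling every weight in one subreddit by a constant leaves the similarity unchanged, which is the sense in which vocabulary size is factored out.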
-Compared to approaches based on word embeddings or topic models, this
+Cosine similarity with tf-idf is popular (indeed it has been applied to Reddit in research several times before) because it quantifies the correlation between the most characteristic terms for two communities.
 method can struggle with polysemy, synonymy, and correlations between
 related terms. Phrase detection helps a little. The trade-off is
 simplicity and scalability. Adding [latent semantic
 analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
 intermediate step is on the wish-list for improving on raw TF-IDF
 similarities.
-Even with these simplifications, similarity between a large number of
+Compared to other approaches to similarity, such as those using word embeddings or topic models, it may struggle to handle polysemy, synonymy, or correlations between different terms. Using phrase detection helps with this a little bit. The advantages of this approach are simplicity and scalability. I'm thinking about using [Latent Semantic Analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis "Wikipedia article on Latent semantic analysis") as an intermediate step to improve upon similarities based on raw tf-idfs. 
 subreddits is expensive — naively $n^2$ dot-products. Passing
 `--similarity-threshold=X` (with `X>0`) to the similarity scripts lets
 Spark's built-in matrix library use the DIMSUM approximation, which is
 the same algorithm Twitter and Google have used for large-scale
 similarity scoring.
-## Clustering, density, and time series
+Even so, computing similarities between a large number of subreddits is computationally expensive and requires $n^2$ dot-product evaluations. 
-
+This can be sped up by passing `similarity-threshold=X` where $X>0$ into `term_comment_similarity.py`. I used a cosine similarity function that's built into the Spark matrix library, which supports the `DIMSUM` algorithm for approximating matrix-matrix products. This algorithm is commonly used in industry (e.g. at Twitter and Google) for large-scale similarity scoring.
 The similarity matrices feed three follow-on analyses:
 - `clustering/clustering.py` clusters a similarity matrix using
  affinity propagation; `clustering/selection.py` and
  `clustering/fit_tsne.py` are supporting scripts for hyperparameter
  selection and 2-D embeddings.
 - `density/overlap_density.py` computes a per-subreddit overlap density
  measure from the similarity matrix.
 - `timeseries/cluster_timeseries.py` and `timeseries/choose_clusters.py`
  pull subreddit-level activity time series and join them against
  clustering output.
 `visualization/tsne_vis.py` renders interactive Altair plots of the
 clustering output — see the prebuilt HTML files in `visualization/` for
 examples.
 ## Bot detection
 `bots/good_bad_bot.py` computes user-level features (compression rate
 of comment text, frequency of self-identification as a bot, etc.) that
 are useful for filtering bot accounts out of downstream analyses. This
 is preliminary work; nothing in the pipeline currently consumes it
 automatically.
--- a/init.py
+++ b/init.py
@@ -0,0 +1,2 @@
 from .timeseries import load_clusters, load_densities, build_cluster_timeseries
--- a/clustering/Makefile
+++ b/clustering/Makefile
@@ -2,20 +2,164 @@
 srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
 similarity_data=/gscratch/comdata/output/reddit_similarity
 clustering_data=/gscratch/comdata/output/reddit_clustering
-selection_grid="--max_iter=3000 --convergence_iter=15,30,100 --damping=0.5,0.6,0.7,0.8,0.85,0.9,0.95,0.97,0.99, --preference_quantile=0.1,0.3,0.5,0.7,0.9"
+kmeans_selection_grid=--max_iters=[3000] --n_inits=[10] --n_clusters=[100,500,1000,1250,1500,1750,2000]
-#selection_grid="--max_iter=3000 --convergence_iter=[15] --preference_quantile=[0.5] --damping=[0.99]"
+hdbscan_selection_grid=--min_cluster_sizes=[2,3,4,5] --min_samples=[2,3,4,5] --cluster_selection_epsilons=[0,0.01,0.05,0.1,0.15,0.2] --cluster_selection_methods=[eom,leaf]
-all:$(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv
+affinity_selection_grid=--dampings=[0.5,0.6,0.7,0.8,0.95,0.97,0.99] --preference_quantiles=[0.1,0.3,0.5,0.7,0.9] --convergence_iters=[15]
 # $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS $(clustering_data)/subreddit_authors-tf_similarities_30k.feather/SUCCESS
 # $(clustering_data)/subreddit_comment_terms_30k.feather/SUCCESS
-$(clustering_data)/subreddit_comment_authors_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_authors_10k.feather clustering.py
+authors_10k_input=$(similarity_data)/subreddit_comment_authors_10k.feather
-	$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_10k.feather $(clustering_data)/subreddit_comment_authors_10k $(clustering_data)/subreddit_comment_authors_10k/selection_data.csv $(selection_grid) -J 20
+authors_10k_input_lsi=$(similarity_data)/subreddit_comment_authors_10k_LSI
 authors_10k_output=$(clustering_data)/subreddit_comment_authors_10k
 authors_10k_output_lsi=$(clustering_data)/subreddit_comment_authors_10k_LSI
-$(clustering_data)/subreddit_comment_terms_10k/selection_data.csv:selection.py $(similarity_data)/subreddit_comment_terms_10k.feather clustering.py
+authors_tf_10k_input=$(similarity_data)/subreddit_comment_authors-tf_10k.feather
-	$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_terms_10k.feather $(clustering_data)/subreddit_comment_terms_10k  $(clustering_data)/subreddit_comment_terms_10k/selection_data.csv $(selection_grid) -J 20 
+authors_tf_10k_input_lsi=$(similarity_data)/subreddit_comment_authors-tf_10k_LSI
 authors_tf_10k_output=$(clustering_data)/subreddit_comment_authors-tf_10k
 authors_tf_10k_output_lsi=$(clustering_data)/subreddit_comment_authors-tf_10k_LSI
-$(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv:clustering.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather
+terms_10k_input=$(similarity_data)/subreddit_comment_terms_10k.feather
-	$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors-tf_10k.feather $(clustering_data)/subreddit_comment_authors-tf_10k  $(clustering_data)/subreddit_comment_authors-tf_10k/selection_data.csv $(selection_grid) -J 20
+terms_10k_input_lsi=$(similarity_data)/subreddit_comment_terms_10k_LSI
 terms_10k_output=$(clustering_data)/subreddit_comment_terms_10k
 terms_10k_output_lsi=$(clustering_data)/subreddit_comment_terms_10k_LSI
 all:terms_10k authors_10k authors_tf_10k terms_10k_lsi authors_10k_lsi authors_tf_10k_lsi
 terms_10k:${terms_10k_output}/kmeans/selection_data.csv ${terms_10k_output}/affinity/selection_data.csv ${terms_10k_output}/hdbscan/selection_data.csv
 authors_10k:${authors_10k_output}/kmeans/selection_data.csv ${authors_10k_output}/hdbscan/selection_data.csv ${authors_10k_output}/affinity/selection_data.csv
 authors_tf_10k:${authors_tf_10k_output}/kmeans/selection_data.csv ${authors_tf_10k_output}/hdbscan/selection_data.csv ${authors_tf_10k_output}/affinity/selection_data.csv
 terms_10k_lsi:${terms_10k_output_lsi}/kmeans/selection_data.csv ${terms_10k_output_lsi}/affinity/selection_data.csv ${terms_10k_output_lsi}/hdbscan/selection_data.csv
 authors_10k_lsi:${authors_10k_output_lsi}/kmeans/selection_data.csv ${authors_10k_output_lsi}/hdbscan/selection_data.csv ${authors_10k_output_lsi}/affinity/selection_data.csv
 authors_tf_10k_lsi:${authors_tf_10k_output_lsi}/kmeans/selection_data.csv ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv ${authors_tf_10k_output_lsi}/affinity/selection_data.csv
 ${authors_10k_output}/kmeans/selection_data.csv:selection.py ${authors_10k_input} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering.py --inpath=${authors_10k_input} --outpath=${authors_10k_output}/kmeans --savefile=${authors_10k_output}/kmeans/selection_data.csv $(kmeans_selection_grid) 
 ${terms_10k_output}/kmeans/selection_data.csv:selection.py ${terms_10k_input} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering.py --inpath=${terms_10k_input} --outpath=${terms_10k_output}/kmeans  --savefile=${terms_10k_output}/kmeans/selection_data.csv $(kmeans_selection_grid) 
 ${authors_tf_10k_output}/kmeans/selection_data.csv:clustering.py ${authors_tf_10k_input} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering.py --inpath=${authors_tf_10k_input} --outpath=${authors_tf_10k_output}/kmeans --savefile=${authors_tf_10k_output}/kmeans/selection_data.csv $(kmeans_selection_grid) 
 ${authors_10k_output}/affinity/selection_data.csv:selection.py ${authors_10k_input} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering.py --inpath=${authors_10k_input} --outpath=${authors_10k_output}/affinity --savefile=${authors_10k_output}/affinity/selection_data.csv $(affinity_selection_grid) 
 ${terms_10k_output}/affinity/selection_data.csv:selection.py ${terms_10k_input} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering.py --inpath=${terms_10k_input} --outpath=${terms_10k_output}/affinity  --savefile=${terms_10k_output}/affinity/selection_data.csv $(affinity_selection_grid) 
 ${authors_tf_10k_output}/affinity/selection_data.csv:clustering.py ${authors_tf_10k_input} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering.py --inpath=${authors_tf_10k_input} --outpath=${authors_tf_10k_output}/affinity --savefile=${authors_tf_10k_output}/affinity/selection_data.csv $(affinity_selection_grid) 
 ${authors_10k_output}/hdbscan/selection_data.csv:selection.py ${authors_10k_input} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering.py --inpath=${authors_10k_input} --outpath=${authors_10k_output}/hdbscan --savefile=${authors_10k_output}/hdbscan/selection_data.csv $(hdbscan_selection_grid) 
 ${terms_10k_output}/hdbscan/selection_data.csv:selection.py ${terms_10k_input} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering.py --inpath=${terms_10k_input} --outpath=${terms_10k_output}/hdbscan  --savefile=${terms_10k_output}/hdbscan/selection_data.csv $(hdbscan_selection_grid) 
 ${authors_tf_10k_output}/hdbscan/selection_data.csv:clustering.py ${authors_tf_10k_input} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering.py --inpath=${authors_tf_10k_input} --outpath=${authors_tf_10k_output}/hdbscan --savefile=${authors_tf_10k_output}/hdbscan/selection_data.csv $(hdbscan_selection_grid) 
 ## LSI Models
 ${authors_10k_output_lsi}/kmeans/selection_data.csv:selection.py ${authors_10k_input_lsi} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering_lsi.py --inpath=${authors_10k_input_lsi} --outpath=${authors_10k_output_lsi}/kmeans --savefile=${authors_10k_output_lsi}/kmeans/selection_data.csv $(kmeans_selection_grid)
 ${terms_10k_output_lsi}/kmeans/selection_data.csv:selection.py ${terms_10k_input_lsi} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering_lsi.py --inpath=${terms_10k_input_lsi} --outpath=${terms_10k_output_lsi}/kmeans  --savefile=${terms_10k_output_lsi}/kmeans/selection_data.csv $(kmeans_selection_grid)
 ${authors_tf_10k_output_lsi}/kmeans/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py kmeans_clustering.py
 	$(srun_singularity) python3 kmeans_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/kmeans --savefile=${authors_tf_10k_output_lsi}/kmeans/selection_data.csv $(kmeans_selection_grid)
 ${authors_10k_output_lsi}/affinity/selection_data.csv:selection.py ${authors_10k_input_lsi} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering_lsi.py --inpath=${authors_10k_input_lsi} --outpath=${authors_10k_output_lsi}/affinity --savefile=${authors_10k_output_lsi}/affinity/selection_data.csv $(affinity_selection_grid)
 ${terms_10k_output_lsi}/affinity/selection_data.csv:selection.py ${terms_10k_input_lsi} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering_lsi.py --inpath=${terms_10k_input_lsi} --outpath=${terms_10k_output_lsi}/affinity  --savefile=${terms_10k_output_lsi}/affinity/selection_data.csv $(affinity_selection_grid)
 ${authors_tf_10k_output_lsi}/affinity/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py affinity_clustering.py
 	$(srun_singularity) python3 affinity_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/affinity --savefile=${authors_tf_10k_output_lsi}/affinity/selection_data.csv $(affinity_selection_grid)
 ${authors_10k_output_lsi}/hdbscan/selection_data.csv:selection.py ${authors_10k_input_lsi} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering_lsi.py --inpath=${authors_10k_input_lsi} --outpath=${authors_10k_output_lsi}/hdbscan --savefile=${authors_10k_output_lsi}/hdbscan/selection_data.csv $(hdbscan_selection_grid)
 ${terms_10k_output_lsi}/hdbscan/selection_data.csv:selection.py ${terms_10k_input_lsi} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering_lsi.py --inpath=${terms_10k_input_lsi} --outpath=${terms_10k_output_lsi}/hdbscan  --savefile=${terms_10k_output_lsi}/hdbscan/selection_data.csv $(hdbscan_selection_grid)
 ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv:clustering.py ${authors_tf_10k_input_lsi} clustering_base.py hdbscan_clustering.py
 	$(srun_singularity) python3 hdbscan_clustering_lsi.py --inpath=${authors_tf_10k_input_lsi} --outpath=${authors_tf_10k_output_lsi}/hdbscan --savefile=${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv $(hdbscan_selection_grid)
 ${terms_10k_output_lsi}/best_hdbscan.feather:${terms_10k_output_lsi}/hdbscan/selection_data.csv pick_best_clustering.py
 	$(srun_singularity) python3 pick_best_clustering.py $< $@ --min_clusters=50 --max_isolates=5000 --min_cluster_size=2
 ${authors_tf_10k_output_lsi}/best_hdbscan.feather:${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv pick_best_clustering.py
 	$(srun_singularity) python3 pick_best_clustering.py $< $@ --min_clusters=50 --max_isolates=5000 --min_cluster_size=2
 clean_affinity:
 	rm -f ${authors_10k_output}/affinity/selection_data.csv
 	rm -f ${authors_tf_10k_output}/affinity/selection_data.csv
 	rm -f ${terms_10k_output}/affinity/selection_data.csv
 clean_kmeans:
 	rm -f ${authors_10k_output}/kmeans/selection_data.csv
 	rm -f ${authors_tf_10k_output}/kmeans/selection_data.csv
 	rm -f ${terms_10k_output}/kmeans/selection_data.csv
 clean_hdbscan:
 	rm -f ${authors_10k_output}/hdbscan/selection_data.csv
 	rm -f ${authors_tf_10k_output}/hdbscan/selection_data.csv
 	rm -f ${terms_10k_output}/hdbscan/selection_data.csv
 clean_authors:
 	rm -f ${authors_10k_output}/affinity/selection_data.csv
 	rm -f ${authors_10k_output}/kmeans/selection_data.csv
 	rm -f ${authors_10k_output}/hdbscan/selection_data.csv
 clean_authors_tf:
 	rm -f ${authors_tf_10k_output}/affinity/selection_data.csv
 	rm -f ${authors_tf_10k_output}/kmeans/selection_data.csv
 	rm -f ${authors_tf_10k_output}/hdbscan/selection_data.csv
 clean_terms:
 	rm -f ${terms_10k_output}/affinity/selection_data.csv
 	rm -f ${terms_10k_output}/kmeans/selection_data.csv
 	rm -f ${terms_10k_output}/hdbscan/selection_data.csv
 clean_lsi_affinity:
 	rm -f ${authors_10k_output_lsi}/affinity/selection_data.csv
 	rm -f ${authors_tf_10k_output_lsi}/affinity/selection_data.csv
 	rm -f ${terms_10k_output_lsi}/affinity/selection_data.csv
 clean_lsi_kmeans:
 	rm -f ${authors_10k_output_lsi}/kmeans/selection_data.csv
 	rm -f ${authors_tf_10k_output_lsi}/kmeans/selection_data.csv
 	rm -f ${terms_10k_output_lsi}/kmeans/selection_data.csv
 clean_lsi_hdbscan:
 	rm -f ${authors_10k_output_lsi}/hdbscan/selection_data.csv
 	rm -f ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv
 	rm -f ${terms_10k_output_lsi}/hdbscan/selection_data.csv
 clean_lsi_authors:
 	rm -f ${authors_10k_output_lsi}/affinity/selection_data.csv
 	rm -f ${authors_10k_output_lsi}/kmeans/selection_data.csv
 	rm -f ${authors_10k_output_lsi}/hdbscan/selection_data.csv
 clean_lsi_authors_tf:
 	rm -f ${authors_tf_10k_output_lsi}/affinity/selection_data.csv
 	rm -f ${authors_tf_10k_output_lsi}/kmeans/selection_data.csv
 	rm -f ${authors_tf_10k_output_lsi}/hdbscan/selection_data.csv
 clean_lsi_terms:
 	rm -f ${terms_10k_output_lsi}/affinity/selection_data.csv
 	rm -f ${terms_10k_output_lsi}/kmeans/selection_data.csv
 	rm -f ${terms_10k_output_lsi}/hdbscan/selection_data.csv
 clean: clean_affinity clean_kmeans clean_hdbscan
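 .PHONY: clean clean_affinity clean_kmeans clean_hdbscan clean_authors clean_authors_tf clean_terms clean_lsi_affinity clean_lsi_kmeans clean_lsi_hdbscan clean_lsi_authors clean_lsi_authors_tf clean_lsi_terms terms_10k authors_10k authors_tf_10k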
 # $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS:selection.py $(similarity_data)/subreddit_comment_authors_30k.feather clustering.py
 # 	$(srun_singularity) python3 selection.py $(similarity_data)/subreddit_comment_authors_30k.feather $(clustering_data)/subreddit_comment_authors_30k $(selection_grid) -J 10 && touch $(clustering_data)/subreddit_comment_authors_30k.feather/SUCCESS
--- a/clustering/affinity_clustering.py
+++ b/clustering/affinity_clustering.py
@@ -0,0 +1,129 @@
 from sklearn.cluster import AffinityPropagation
 from dataclasses import dataclass
 from clustering_base import clustering_result, clustering_job
 from grid_sweep import grid_sweep
 from pathlib import Path
 from itertools import product, starmap
 import fire
 import sys
 import numpy as np
 # silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying. 
@dataclass
 class affinity_clustering_result(clustering_result):
    damping:float
    convergence_iter:int
    preference_quantile:float
    preference:float
    max_iter:int
 class affinity_job(clustering_job):
    def __init__(self, infile, outpath, name, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
        super().__init__(infile,
                         outpath,
                         name,
                         call=self._affinity_clustering,
                         preference_quantile=preference_quantile,
                         damping=damping,
                         max_iter=max_iter,
                         convergence_iter=convergence_iter,
                          random_state=random_state,
                         verbose=verbose)
        self.damping=damping
        self.max_iter=max_iter
        self.convergence_iter=convergence_iter
        self.preference_quantile=preference_quantile
    def _affinity_clustering(self, mat, preference_quantile, *args, **kwargs):
        mat = 1-mat
        preference = np.quantile(mat, preference_quantile)
        self.preference = preference
        print(f"preference is {preference}")
        print("data loaded")
        sys.stdout.flush()
        clustering = AffinityPropagation(*args,
                                         preference=preference,
                                         affinity='precomputed',
                                         copy=False,
                                         **kwargs).fit(mat)
        return clustering
    def get_info(self):
        result = super().get_info()
        self.result=affinity_clustering_result(**result.__dict__,
                                               damping=self.damping,
                                               max_iter=self.max_iter,
                                               convergence_iter=self.convergence_iter,
                                               preference_quantile=self.preference_quantile,
                                               preference=self.preference)
        return self.result
 class affinity_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 *args,
                 **kwargs):
         super().__init__(affinity_job,
                         inpath,
                         outpath,
                         self.namer,
                         *args,
                         **kwargs)
    def namer(self,
              damping,
              max_iter,
              convergence_iter,
              preference_quantile):
        return f"damp-{damping}_maxit-{max_iter}_convit-{convergence_iter}_prefq-{preference_quantile}"
 def run_affinity_grid_sweep(savefile, inpath, outpath, dampings=[0.8], max_iters=[3000], convergence_iters=[30], preference_quantiles=[0.5],n_cores=10):
    """Run affinity clustering once or more with different parameters.
    Usage:
    affinity_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --max_iters=<csv> --dampings=<csv> --preference_quantiles=<csv>
     Keyword arguments:
     savefile: path to save the metadata and diagnostics.
     inpath: path to feather data containing a labeled matrix of subreddit similarities.
     outpath: path to output fit affinity propagation clusterings.
     dampings: one or more numbers in [0.5, 1). Damping parameter in affinity propagation clustering.
     preference_quantiles: one or more numbers in (0,1) for selecting the 'preference' parameter.
     convergence_iters: one or more integers giving the number of iterations without improvement before stopping.
     max_iters: one or more integers giving the maximum number of iterations.
    """
    obj = affinity_grid_sweep(inpath,
                         outpath,
                         map(float,dampings),
                         map(int,max_iters),
                         map(int,convergence_iters),
                         map(float,preference_quantiles))
    obj.run(n_cores)
    obj.save(savefile)
 def test_select_affinity_clustering():
    # select_hdbscan_clustering("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI",
    #                           "test_hdbscan_author30k",
    #                           min_cluster_sizes=[2],
    #                           min_samples=[1,2],
    #                           cluster_selection_epsilons=[0,0.05,0.1,0.15],
    #                           cluster_selection_methods=['eom','leaf'],
    #                           lsi_dimensions='all')
    inpath = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/"
    outpath = "test_affinity";
    dampings=[0.8,0.9]
    max_iters=[100000]
    convergence_iters=[15]
    preference_quantiles=[0.5,0.7]
    gs = affinity_lsi_grid_sweep(inpath, 'all', outpath, dampings, max_iters, convergence_iters, preference_quantiles)
    gs.run(20)
    gs.save("test_affinity/lsi_sweep.csv")
 if __name__ == "__main__":
    fire.Fire(run_affinity_grid_sweep)
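
Review note on `preference_quantile` in `_affinity_clustering` above: the job converts the distance matrix back to similarities and takes a quantile over all entries as the `preference`. The sketch below illustrates that mapping on a toy matrix using only the standard library. `linear_quantile` is a hypothetical helper written for this note; it is assumed to mirror the default linear interpolation of `np.quantile`, and the toy distances are made up.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted values
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])

# Toy symmetric distance matrix (illustrative values only).
distances = [[0.0, 0.2, 0.9],
             [0.2, 0.0, 0.6],
             [0.9, 0.6, 0.0]]

# Same conversion as `mat = 1 - mat`: distances back to similarities, flattened.
similarities = [1 - d for row in distances for d in row]

for q in (0.1, 0.5, 0.9):
    # A higher quantile gives a higher preference, which tends to yield more clusters.
    print(q, linear_quantile(similarities, q))
```

Because the diagonal (self-similarity of 1) is included in the flattened values, high quantiles are pulled toward 1 on small matrices; on a 10k by 10k matrix the effect is negligible.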
--- a/clustering/affinity_clustering_lsi.py
+++ b/clustering/affinity_clustering_lsi.py
@@ -0,0 +1,99 @@
 import fire
 from affinity_clustering import affinity_clustering_result, affinity_job, affinity_grid_sweep
 from grid_sweep import grid_sweep
 from lsi_base import lsi_result_mixin, lsi_grid_sweep, lsi_mixin
 from dataclasses import dataclass
@dataclass
 class affinity_clustering_result_lsi(affinity_clustering_result, lsi_result_mixin):
    pass
 class affinity_lsi_job(affinity_job, lsi_mixin):
    def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
        super().__init__(infile,
                         outpath,
                         name,
                         *args,
                         **kwargs)
        super().set_lsi_dims(lsi_dims)
    def get_info(self):
        result = super().get_info()
        self.result = affinity_clustering_result_lsi(**result.__dict__,
                                                     lsi_dimensions=self.lsi_dims)
        return self.result
 class affinity_lsi_grid_sweep(lsi_grid_sweep):
    def __init__(self,
                 inpath,
                 lsi_dims,
                 outpath,
                 dampings=[0.9],
                 max_iters=[10000],
                 convergence_iters=[30],
                 preference_quantiles=[0.5]):
        super().__init__(affinity_lsi_job,
                         _affinity_lsi_grid_sweep,
                         inpath,
                         lsi_dims,
                         outpath,
                         dampings,
                         max_iters,
                         convergence_iters,
                         preference_quantiles)
 class _affinity_lsi_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 lsi_dim,
                 *args,
                 **kwargs):
        self.lsi_dim = lsi_dim
        self.jobtype = affinity_lsi_job
        super().__init__(self.jobtype,
                         inpath,
                         outpath,
                         self.namer,
                         [self.lsi_dim],
                         *args,
                         **kwargs)
    def namer(self, *args, **kwargs):
        s = affinity_grid_sweep.namer(self, *args[1:], **kwargs)
        s += f"_lsi-{self.lsi_dim}"
        return s
 def run_affinity_lsi_grid_sweep(savefile, inpath, outpath, dampings=[0.8], max_iters=[3000], convergence_iters=[30], preference_quantiles=[0.5], lsi_dimensions='all',n_cores=30):
    """Run affinity clustering once or more with different parameters.
    Usage:
     affinity_clustering_lsi.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --max_iters=<csv> --dampings=<csv> --preference_quantiles=<csv> --lsi_dimensions=<"all"|csv>
     Keyword arguments:
     savefile: path to save the metadata and diagnostics.
     inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
     outpath: path to output fit affinity propagation clusterings.
     dampings: one or more numbers in [0.5, 1). Damping parameter in affinity propagation clustering.
     preference_quantiles: one or more numbers in (0,1) for selecting the 'preference' parameter.
     convergence_iters: one or more integers giving the number of iterations without improvement before stopping.
     max_iters: one or more integers giving the maximum number of iterations.
    lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
    """
    obj = affinity_lsi_grid_sweep(inpath,
                            lsi_dimensions,
                            outpath,
                            map(float,dampings),
                            map(int,max_iters),
                            map(int,convergence_iters),
                            map(float,preference_quantiles))
    obj.run(n_cores)
    obj.save(savefile)
 if __name__ == "__main__":
    fire.Fire(run_affinity_lsi_grid_sweep)
--- a/clustering/clustering.py
+++ b/clustering/clustering.py
@@ -6,21 +6,20 @@ import numpy as np
 from sklearn.cluster import AffinityPropagation
 import fire
 from pathlib import Path
 from multiprocessing import cpu_count
 from dataclasses import dataclass
 from clustering_base import sim_to_dist, process_clustering_result, clustering_result, read_similarity_mat
-def read_similarity_mat(similarities, use_threads=True):
+def affinity_clustering(similarities, output, *args, **kwargs):
    df = pd.read_feather(similarities, use_threads=use_threads)
    mat = np.array(df.drop('_subreddit',1))
    n = mat.shape[0]
    mat[range(n),range(n)] = 1
    return (df._subreddit,mat)
 def affinity_clustering(similarities, *args, **kwargs):
    subreddits, mat = read_similarity_mat(similarities)
-    return _affinity_clustering(mat, subreddits, *args, **kwargs)
+    clustering = _affinity_clustering(mat, *args, **kwargs)
    cluster_data = process_clustering_result(clustering, subreddits)
    cluster_data['algorithm'] = 'affinity'
    return(cluster_data)
 def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000, convergence_iter=30, preference_quantile=0.5, random_state=1968, verbose=True):
    '''
-    similarities: feather file with a dataframe of similarity scores
+    similarities: matrix of similarity scores
    preference_quantile: parameter controlling how many clusters to make. higher values = more clusters. 0.85 is a good value with 3000 subreddits.
    damping: parameter controlling how iterations are merged. Higher values make convergence faster and more dependable. 0.85 is a good value for the 10000 subreddits by author. 
    '''
@@ -40,25 +39,14 @@ def _affinity_clustering(mat, subreddits, output, damping=0.9, max_iter=100000,
                                     verbose=verbose,
                                     random_state=random_state).fit(mat)
-
+    cluster_data = process_clustering_result(clustering, subreddits)
-    print(f"clustering took {clustering.n_iter_} iterations")
+    output = Path(output)
-    clusters = clustering.labels_
+    output.parent.mkdir(parents=True,exist_ok=True)
    print(f"found {len(set(clusters))} clusters")
    cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
    cluster_sizes = cluster_data.groupby("cluster").count()
    print(f"the largest cluster has {cluster_sizes.subreddit.max()} members")
    print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
    print(f"{(cluster_sizes.subreddit==1).sum()} clusters have 1 member")
    sys.stdout.flush()
    cluster_data.to_feather(output)
    print(f"saved {output}")
    return clustering
 if __name__ == "__main__":
    fire.Fire(affinity_clustering)
--- a/clustering/clustering_base.py
+++ b/clustering/clustering_base.py
@@ -0,0 +1,105 @@
 from pathlib import Path
 import numpy as np
 import pandas as pd
 from dataclasses import dataclass
 from sklearn.metrics import silhouette_score, silhouette_samples
 from collections import Counter
 # this is meant to be an interface, not created directly
 class clustering_job:
    def __init__(self, infile, outpath, name, call, *args, **kwargs):
        self.outpath = Path(outpath)
        self.call = call
        self.args = args
        self.kwargs = kwargs
        self.infile = Path(infile)
        self.name = name
        self.hasrun = False
    def run(self):
        self.subreddits, self.mat = self.read_distance_mat(self.infile)
        self.clustering = self.call(self.mat, *self.args, **self.kwargs)
        self.cluster_data = self.process_clustering(self.clustering, self.subreddits)
        self.score = self.silhouette()
        self.outpath.mkdir(parents=True, exist_ok=True)
        self.cluster_data.to_feather(self.outpath/(self.name + ".feather"))
        self.hasrun = True
    def get_info(self):
        if not self.hasrun:
            self.run()
        self.result = clustering_result(outpath=str(self.outpath.resolve()),
                                        silhouette_score=self.score,
                                        name=self.name,
                                        n_clusters=self.n_clusters,
                                        n_isolates=self.n_isolates,
                                        silhouette_samples = self.silsampout
                                        )
        return self.result
    def silhouette(self):
        counts = Counter(self.clustering.labels_)
        singletons = [key for key, value in counts.items() if value == 1]
        isolates = (self.clustering.labels_ == -1) | (np.isin(self.clustering.labels_,np.array(singletons)))
        scoremat = self.mat[~isolates][:,~isolates]
        if self.n_clusters > 1:
            score = silhouette_score(scoremat, self.clustering.labels_[~isolates], metric='precomputed')
            silhouette_samp = silhouette_samples(self.mat, self.clustering.labels_, metric='precomputed')
            silhouette_samp = pd.DataFrame({'subreddit':self.subreddits,'score':silhouette_samp})
            self.outpath.mkdir(parents=True, exist_ok=True)
            silsampout = self.outpath / ("silhouette_samples-" + self.name +  ".feather")
            self.silsampout = silsampout.resolve()
            silhouette_samp.to_feather(self.silsampout)
        else:
            score = None
            self.silsampout = None
        return score
    def read_distance_mat(self, similarities, use_threads=True):
        df = pd.read_feather(similarities, use_threads=use_threads)
         mat = np.array(df.drop(columns='_subreddit'))
        n = mat.shape[0]
        mat[range(n),range(n)] = 1
        return (df._subreddit,1-mat)
    def process_clustering(self, clustering, subreddits):
        if hasattr(clustering,'n_iter_'):
            print(f"clustering took {clustering.n_iter_} iterations")
        clusters = clustering.labels_
        self.n_clusters = len(set(clusters))
        print(f"found {self.n_clusters} clusters")
        cluster_data = pd.DataFrame({'subreddit': subreddits,'cluster':clustering.labels_})
        cluster_sizes = cluster_data.groupby("cluster").count().reset_index()
        print(f"the largest cluster has {cluster_sizes.loc[cluster_sizes.cluster!=-1].subreddit.max()} members")
        print(f"the median cluster has {cluster_sizes.subreddit.median()} members")
        n_isolates1 = (cluster_sizes.subreddit==1).sum()
        print(f"{n_isolates1} clusters have 1 member")
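         n_isolates2 = cluster_sizes.loc[cluster_sizes.cluster==-1,:]['subreddit'].to_list()
         # to_list() yields at most one count for cluster -1; unwrap it so n_isolates is always an int
         n_isolates2 = n_isolates2[0] if len(n_isolates2) > 0 else 0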
        print(f"{n_isolates2} subreddits are in cluster -1",flush=True)
        if n_isolates1 == 0:
            self.n_isolates = n_isolates2
        else:
            self.n_isolates = n_isolates1
        return cluster_data
@dataclass
 class clustering_result:
    outpath:Path
    silhouette_score:float
    name:str
    n_clusters:int
    n_isolates:int
    silhouette_samples:str
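
Review note on `clustering_job.silhouette` above: before scoring, the method drops "isolates", meaning points labeled `-1` (noise) and points that are the only member of their cluster. A minimal standard-library sketch of that masking step follows. `isolate_mask` is a hypothetical name used only here, and the labels are invented for illustration.

```python
from collections import Counter

def isolate_mask(labels):
    """Return True for points that are noise (-1) or the sole member of their cluster."""
    counts = Counter(labels)
    singletons = {label for label, size in counts.items() if size == 1}
    return [label == -1 or label in singletons for label in labels]

# Invented cluster labels: cluster 1 is a singleton, and two points are noise.
labels = [0, 0, 1, -1, 2, 2, -1]
mask = isolate_mask(labels)

# Only the non-isolates would be passed on to silhouette scoring.
kept = [label for label, is_isolate in zip(labels, mask) if not is_isolate]
print(mask)
print(kept)
```

Excluding isolates keeps singleton clusters from distorting the average silhouette, at the cost of scoring only a subset of the data, so the score is best read alongside `n_isolates`.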
--- a/clustering/fit_tsne.py
+++ b/clustering/fit_tsne.py
@@ -17,7 +17,7 @@ def fit_tsne(similarities, output, learning_rate=750, perplexity=50, n_iter=1000
    df = pd.read_feather(similarities)
    n = df.shape[0]
-    mat = np.array(df.drop('subreddit',1),dtype=np.float64)
+    mat = np.array(df.drop('_subreddit',1),dtype=np.float64)
    mat[range(n),range(n)] = 1
    mat[mat > 1] = 1
    dist = 2*np.arccos(mat)/np.pi
@@ -26,7 +26,7 @@ def fit_tsne(similarities, output, learning_rate=750, perplexity=50, n_iter=1000
    tsne_fit_whole = tsne_fit_model.fit_transform(dist)
-    plot_data = pd.DataFrame({'x':tsne_fit_whole[:,0],'y':tsne_fit_whole[:,1], 'subreddit':df.subreddit})
+    plot_data = pd.DataFrame({'x':tsne_fit_whole[:,0],'y':tsne_fit_whole[:,1], '_subreddit':df['_subreddit']})
    plot_data.to_feather(output)
--- a/clustering/grid_sweep.py
+++ b/clustering/grid_sweep.py
@@ -0,0 +1,33 @@
 from pathlib import Path
 from multiprocessing import Pool, cpu_count
 from itertools import product, chain
 import pandas as pd
 class grid_sweep:
    def __init__(self, jobtype, inpath, outpath, namer, *args):
        self.jobtype = jobtype
        self.namer = namer
        print(*args)
        grid = list(product(*args))
        inpath = Path(inpath)
        outpath = Path(outpath)
        self.hasrun = False
        self.grid = [(inpath,outpath,namer(*g)) + g for g in grid]
        self.jobs = [jobtype(*g) for g in self.grid]
    def run(self, cores=20):
        if cores is not None and cores > 1:
            with Pool(cores) as pool:
                infos = pool.map(self.jobtype.get_info, self.jobs)
        else:
            infos = map(self.jobtype.get_info, self.jobs)
        self.infos = pd.DataFrame(infos)
        self.hasrun = True
    def save(self, outcsv):
        if not self.hasrun:
            self.run()
        outcsv = Path(outcsv)
        outcsv.parent.mkdir(parents=True, exist_ok=True)
        self.infos.to_csv(outcsv)
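
Review note on `grid_sweep.__init__` above: the parameter grid is the Cartesian product of the argument lists, and each combination is passed to `namer` to build a job name. The sketch below shows that expansion in isolation. `expand_grid` and `hdbscan_style_namer` are hypothetical names for this note, and the namer is shortened to two parameters.

```python
from itertools import product

def expand_grid(namer, *param_lists):
    """Return one (name, params) pair per combination of the parameter lists."""
    # Materialise each argument first: map() objects can only be iterated once.
    param_lists = [list(p) for p in param_lists]
    return [(namer(*combo), combo) for combo in product(*param_lists)]

def hdbscan_style_namer(min_cluster_size, min_samples):
    # Hypothetical two-parameter namer, modeled on the four-parameter one in the diff.
    return f"mcs-{min_cluster_size}_ms-{min_samples}"

# The run_* entry points pass map(int, ...) objects, so the sketch does the same.
jobs = expand_grid(hdbscan_style_namer, map(int, ["2", "3"]), map(int, ["1", "2"]))
for name, params in jobs:
    print(name, params)
```

The number of jobs grows multiplicatively with each added parameter list, which is worth keeping in mind when sizing `n_cores` for a sweep.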
--- a/clustering/hdbscan_clustering.py
+++ b/clustering/hdbscan_clustering.py
@@ -0,0 +1,159 @@
 from clustering_base import clustering_result, clustering_job
 from grid_sweep import grid_sweep
 from dataclasses import dataclass
 import hdbscan
 from sklearn.neighbors import NearestNeighbors
 import plotnine as pn
 import numpy as np
 from itertools import product, starmap, chain
 import pandas as pd
 from multiprocessing import cpu_count
 import fire
 def test_select_hdbscan_clustering():
    # select_hdbscan_clustering("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_30k_LSI",
    #                           "test_hdbscan_author30k",
    #                           min_cluster_sizes=[2],
    #                           min_samples=[1,2],
    #                           cluster_selection_epsilons=[0,0.05,0.1,0.15],
    #                           cluster_selection_methods=['eom','leaf'],
    #                           lsi_dimensions='all')
    inpath = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_authors_compex_LSI"
    outpath = "test_hdbscan";
    min_cluster_sizes=[2,3,4];
    min_samples=[1,2,3];
    cluster_selection_epsilons=[0,0.1,0.3,0.5];
    cluster_selection_methods=[1];
    lsi_dimensions='all'
    gs = hdbscan_lsi_grid_sweep(inpath, "all", outpath, min_cluster_sizes, min_samples, cluster_selection_epsilons, cluster_selection_methods)
    gs.run(20)
    gs.save("test_hdbscan/lsi_sweep.csv")
    # job1 = hdbscan_lsi_job(infile=inpath, outpath=outpath, name="test", lsi_dims=500, min_cluster_size=2, min_samples=1,cluster_selection_epsilon=0,cluster_selection_method='eom')
    # job1.run()
    # print(job1.get_info())
    # df = pd.read_csv("test_hdbscan/selection_data.csv")
    # test_select_hdbscan_clustering()
    # check_clusters = pd.read_feather("test_hdbscan/500_2_2_0.1_eom.feather")
    # silscores = pd.read_feather("test_hdbscan/silhouette_samples500_2_2_0.1_eom.feather")
    # c = check_clusters.merge(silscores,on='subreddit')#    fire.Fire(select_hdbscan_clustering)
 class hdbscan_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 *args,
                 **kwargs):
        super().__init__(hdbscan_job, inpath, outpath, self.namer, *args, **kwargs)
    def namer(self,
              min_cluster_size,
              min_samples,
              cluster_selection_epsilon,
              cluster_selection_method):
        return f"mcs-{min_cluster_size}_ms-{min_samples}_cse-{cluster_selection_epsilon}_csm-{cluster_selection_method}"
@dataclass
 class hdbscan_clustering_result(clustering_result):
    min_cluster_size:int
    min_samples:int
    cluster_selection_epsilon:float
    cluster_selection_method:str
 class hdbscan_job(clustering_job):
    def __init__(self, infile, outpath, name, min_cluster_size=2, min_samples=1, cluster_selection_epsilon=0, cluster_selection_method='eom'):
        super().__init__(infile,
                         outpath,
                         name,
                         call=hdbscan_job._hdbscan_clustering,
                         min_cluster_size=min_cluster_size,
                         min_samples=min_samples,
                         cluster_selection_epsilon=cluster_selection_epsilon,
                         cluster_selection_method=cluster_selection_method
                         )
        self.min_cluster_size = min_cluster_size
        self.min_samples = min_samples
        self.cluster_selection_epsilon = cluster_selection_epsilon
        self.cluster_selection_method = cluster_selection_method
 #        self.mat = 1 - self.mat
    def _hdbscan_clustering(mat, *args, **kwargs):
        print(f"running hdbscan clustering. args:{args}. kwargs:{kwargs}")
        print(mat)
        clusterer = hdbscan.HDBSCAN(metric='precomputed',
                                    core_dist_n_jobs=cpu_count(),
                                    *args,
                                    **kwargs,
                                    )
        clustering = clusterer.fit(mat.astype('double'))
        return(clustering)
    def get_info(self):
        result = super().get_info()
        self.result = hdbscan_clustering_result(**result.__dict__,
                                                min_cluster_size=self.min_cluster_size,
                                                min_samples=self.min_samples,
                                                cluster_selection_epsilon=self.cluster_selection_epsilon,
                                                cluster_selection_method=self.cluster_selection_method)
        return self.result
 def run_hdbscan_grid_sweep(savefile, inpath, outpath,  min_cluster_sizes=[2], min_samples=[1], cluster_selection_epsilons=[0], cluster_selection_methods=['eom']):
    """Run hdbscan clustering once or more with different parameters.
    Usage:
    hdbscan_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --min_cluster_sizes=<csv> --min_samples=<csv> --cluster_selection_epsilons=<csv> --cluster_selection_methods=<csv "eom"|"leaf">
     Keyword arguments:
     savefile: path to save the metadata and diagnostics.
     inpath: path to feather data containing a labeled matrix of subreddit similarities.
     outpath: path to output fit hdbscan clusterings.
     min_cluster_sizes: one or more integers indicating the minimum cluster size.
     min_samples: one or more integers indicating the minimum number of samples used in the algorithm.
     cluster_selection_epsilons: one or more similarity thresholds for transition from dbscan to hdbscan.
     cluster_selection_methods: "eom" or "leaf"; eom gives larger clusters.
    """    
    obj = hdbscan_grid_sweep(inpath,
                             outpath,
                             map(int,min_cluster_sizes),
                             map(int,min_samples),
                             map(float,cluster_selection_epsilons),
                             cluster_selection_methods)
    obj.run()
    obj.save(savefile)
 def KNN_distances_plot(mat,outname,k=2):
    nbrs = NearestNeighbors(n_neighbors=k,algorithm='auto',metric='precomputed').fit(mat)
    distances, indices = nbrs.kneighbors(mat)
    d2 = distances[:,-1]
    df = pd.DataFrame({'dist':d2})
    df = df.sort_values("dist",ascending=False)
    df['idx'] = np.arange(0,d2.shape[0]) + 1
    p = pn.qplot(x='idx',y='dist',data=df,geom='line') + pn.scales.scale_y_continuous(minor_breaks = np.arange(0,50)/50,
                                                                                      breaks = np.arange(0,10)/10)
    p.save(outname,width=16,height=10)
 def make_KNN_plots():
    similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10k.feather"
    subreddits, mat = read_similarity_mat(similarities)
    mat = sim_to_dist(mat)
    KNN_distances_plot(mat,k=2,outname='terms_knn_dist2.png')
    similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10k.feather"
    subreddits, mat = read_similarity_mat(similarities)
    mat = sim_to_dist(mat)
    KNN_distances_plot(mat,k=2,outname='authors_knn_dist2.png')
    similarities = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k.feather"
    subreddits, mat = read_similarity_mat(similarities)
    mat = sim_to_dist(mat)
    KNN_distances_plot(mat,k=2,outname='authors-tf_knn_dist2.png')
 if __name__ == "__main__":
    fire.Fire(run_hdbscan_grid_sweep)
 #    test_select_hdbscan_clustering()
    #fire.Fire(select_hdbscan_clustering)  
--- a/clustering/hdbscan_clustering_lsi.py
+++ b/clustering/hdbscan_clustering_lsi.py
@@ -0,0 +1,101 @@
 from hdbscan_clustering import hdbscan_job, hdbscan_grid_sweep, hdbscan_clustering_result
 from lsi_base import lsi_grid_sweep, lsi_mixin, lsi_result_mixin
 from grid_sweep import grid_sweep
 import fire
 from dataclasses import dataclass
@dataclass
 class hdbscan_clustering_result_lsi(hdbscan_clustering_result, lsi_result_mixin):
    pass 
 class hdbscan_lsi_job(hdbscan_job, lsi_mixin):
    def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
        super().__init__(
                         infile,
                         outpath,
                         name,
                         *args,
                         **kwargs)
        super().set_lsi_dims(lsi_dims)
    def get_info(self):
        partial_result = super().get_info()
        self.result = hdbscan_clustering_result_lsi(**partial_result.__dict__,
                                                    lsi_dimensions=self.lsi_dims)
        return self.result
 class hdbscan_lsi_grid_sweep(lsi_grid_sweep):
    def __init__(self,
                 inpath,
                 lsi_dims,
                 outpath,
                 min_cluster_sizes,
                 min_samples,
                 cluster_selection_epsilons,
                 cluster_selection_methods
                 ):
        super().__init__(hdbscan_lsi_job,
                         _hdbscan_lsi_grid_sweep,
                         inpath,
                         lsi_dims,
                         outpath,
                         min_cluster_sizes,
                         min_samples,
                         cluster_selection_epsilons,
                         cluster_selection_methods)
 class _hdbscan_lsi_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 lsi_dim,
                 *args,
                 **kwargs):
        print(args)
        print(kwargs)
        self.lsi_dim = lsi_dim
        self.jobtype = hdbscan_lsi_job
        super().__init__(self.jobtype, inpath, outpath, self.namer, [self.lsi_dim], *args, **kwargs)
    def namer(self, *args, **kwargs):
        s = hdbscan_grid_sweep.namer(self, *args[1:], **kwargs)
        s += f"_lsi-{self.lsi_dim}"
        return s
 def run_hdbscan_lsi_grid_sweep(savefile, inpath, outpath,  min_cluster_sizes=[2], min_samples=[1], cluster_selection_epsilons=[0], cluster_selection_methods=['eom'],lsi_dimensions='all'):
    """Run hdbscan clustering once or more with different parameters.
    Usage:
    hdbscan_clustering_lsi.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --min_cluster_sizes=<csv> --min_samples=<csv> --cluster_selection_epsilons=<csv> --cluster_selection_methods=<"eom"|"leaf"> --lsi_dimensions=<"all"|csv number of LSI dimensions to use>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics 
    inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
    outpath: path to output fit clusterings.
    min_cluster_sizes: one or more integers indicating the minimum cluster size
    min_samples: one or more integers indicating the minimum number of samples used in the algorithm
    cluster_selection_epsilons: one or more similarity thresholds for transition from dbscan to hdbscan
    cluster_selection_methods: one or more of "eom" or "leaf"; "eom" gives larger clusters.
    lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
    """    
    obj = hdbscan_lsi_grid_sweep(inpath,
                                 lsi_dimensions,
                                 outpath,
                                 list(map(int,min_cluster_sizes)),
                                 list(map(int,min_samples)),
                                 list(map(float,cluster_selection_epsilons)),
                                 cluster_selection_methods)
    obj.run(10)
    obj.save(savefile)
 if __name__ == "__main__":
    fire.Fire(run_hdbscan_lsi_grid_sweep)
--- a/clustering/kmeans_clustering.py
+++ b/clustering/kmeans_clustering.py
@@ -0,0 +1,105 @@
 from sklearn.cluster import KMeans
 import fire
 from pathlib import Path
 from dataclasses import dataclass
 from clustering_base import clustering_result, clustering_job
 from grid_sweep import grid_sweep
@dataclass
 class kmeans_clustering_result(clustering_result):
    n_clusters:int
    n_init:int
    max_iter:int
 class kmeans_job(clustering_job):
    def __init__(self, infile, outpath, name, n_clusters, n_init=10, max_iter=100000, random_state=1968, verbose=True):
        super().__init__(infile,
                         outpath,
                         name,
                         call=kmeans_job._kmeans_clustering,
                         n_clusters=n_clusters,
                         n_init=n_init,
                         max_iter=max_iter,
                         random_state=random_state,
                         verbose=verbose)
        self.n_clusters=n_clusters
        self.n_init=n_init
        self.max_iter=max_iter
    def _kmeans_clustering(mat, *args, **kwargs):
        clustering = KMeans(*args,
                            **kwargs,
                            ).fit(mat)
        return clustering
    def get_info(self):
        result = super().get_info()
        self.result = kmeans_clustering_result(**result.__dict__,
                                               n_init=self.n_init,
                                               max_iter=self.max_iter)
        return self.result
 class kmeans_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 *args,
                 **kwargs):
        super().__init__(kmeans_job, inpath, outpath, self.namer, *args, **kwargs)
    def namer(self,
             n_clusters,
             n_init,
             max_iter):
        return f"nclusters-{n_clusters}_nit-{n_init}_maxit-{max_iter}"
 def test_select_kmeans_clustering():
    # imported here because kmeans_clustering_lsi itself imports from this module
    from kmeans_clustering_lsi import kmeans_lsi_grid_sweep
    inpath = "/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/"
    outpath = "test_kmeans"
    n_clusters = [200,300,400]
    n_init = [1,2,3]
    max_iter = [100000]
    gs = kmeans_lsi_grid_sweep(inpath, 'all', outpath, n_clusters, n_init, max_iter)
    gs.run(1)
    # the hdbscan sweep that followed was a copy-paste remnant using undefined names
    gs.save("test_kmeans/lsi_sweep.csv")
 def run_kmeans_grid_sweep(savefile, inpath, outpath,  n_clusters=[500], n_inits=[1], max_iters=[3000]):
    """Run kmeans clustering once or more with different parameters.
    Usage:
    kmeans_clustering.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --n_clusters=<csv number of clusters> --n_inits=<csv> --max_iters=<csv>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics 
    inpath: path to feather data containing a labeled matrix of subreddit similarities.
    outpath: path to output fit kmeans clusterings.
    n_clusters: one or more numbers of kmeans clusters to select.
    n_inits: one or more numbers of different initializations to use for each clustering.
    max_iters: one or more numbers of different maximum iterations.
    """    
    obj = kmeans_grid_sweep(inpath,
                            outpath,
                            map(int,n_clusters),
                            map(int,n_inits),
                            map(int,max_iters))
    obj.run(1)
    obj.save(savefile)
 if __name__ == "__main__":
    fire.Fire(run_kmeans_grid_sweep)
--- a/clustering/kmeans_clustering_lsi.py
+++ b/clustering/kmeans_clustering_lsi.py
@@ -0,0 +1,93 @@
 import fire
 from dataclasses import dataclass
 from kmeans_clustering import kmeans_job, kmeans_clustering_result, kmeans_grid_sweep
 from lsi_base import lsi_mixin, lsi_result_mixin, lsi_grid_sweep
 from grid_sweep import grid_sweep
@dataclass
 class kmeans_clustering_result_lsi(kmeans_clustering_result, lsi_result_mixin):
    pass
 class kmeans_lsi_job(kmeans_job, lsi_mixin):
    def __init__(self, infile, outpath, name, lsi_dims, *args, **kwargs):
        super().__init__(infile,
                         outpath,
                         name,
                         *args,
                         **kwargs)
        super().set_lsi_dims(lsi_dims)
    def get_info(self):
        result = super().get_info()
        self.result = kmeans_clustering_result_lsi(**result.__dict__,
                                                   lsi_dimensions=self.lsi_dims)
        return self.result
 class _kmeans_lsi_grid_sweep(grid_sweep):
    def __init__(self,
                 inpath,
                 outpath,
                 lsi_dim,
                 *args,
                 **kwargs):
        print(args)
        print(kwargs)
        self.lsi_dim = lsi_dim
        self.jobtype = kmeans_lsi_job
        super().__init__(self.jobtype, inpath, outpath, self.namer, [self.lsi_dim], *args, **kwargs)
    def namer(self, *args, **kwargs):
        s = kmeans_grid_sweep.namer(self, *args[1:], **kwargs)
        s += f"_lsi-{self.lsi_dim}"
        return s
 class kmeans_lsi_grid_sweep(lsi_grid_sweep):
    def __init__(self,
                 inpath,
                 lsi_dims,
                 outpath,
                 n_clusters,
                 n_inits,
                 max_iters
                 ):
        super().__init__(kmeans_lsi_job,
                         _kmeans_lsi_grid_sweep,
                         inpath,
                         lsi_dims,
                         outpath,
                         n_clusters,
                         n_inits,
                         max_iters)
 def run_kmeans_lsi_grid_sweep(savefile, inpath, outpath,  n_clusters=[500], n_inits=[1], max_iters=[3000], lsi_dimensions="all"):
    """Run kmeans clustering once or more with different parameters.
    Usage:
    kmeans_clustering_lsi.py --savefile=SAVEFILE --inpath=INPATH --outpath=OUTPATH --lsi_dimensions=<"all"|csv number of LSI dimensions to use> --n_clusters=<csv number of clusters> --n_inits=<csv> --max_iters=<csv>
    Keyword arguments:
    savefile: path to save the metadata and diagnostics 
    inpath: path to folder containing feather files with LSI similarity labeled matrices of subreddit similarities.
    outpath: path to output fit kmeans clusterings.
    lsi_dimensions: either "all" or one or more available lsi similarity dimensions at INPATH.
    n_clusters: one or more numbers of kmeans clusters to select.
    n_inits: one or more numbers of different initializations to use for each clustering.
    max_iters: one or more numbers of different maximum iterations.
    """    
    obj = kmeans_lsi_grid_sweep(inpath,
                                lsi_dimensions,
                                outpath,
                                list(map(int,n_clusters)),
                                list(map(int,n_inits)),
                                list(map(int,max_iters))
                                )
    obj.run(1)
    obj.save(savefile)
 if __name__ == "__main__":
    fire.Fire(run_kmeans_lsi_grid_sweep)
--- a/clustering/lsi_base.py
+++ b/clustering/lsi_base.py
@@ -0,0 +1,29 @@
 from clustering_base import clustering_job, clustering_result
 from grid_sweep import grid_sweep
 from dataclasses import dataclass
 from itertools import chain
 from pathlib import Path
 class lsi_mixin():
    def set_lsi_dims(self, lsi_dims):
        self.lsi_dims = lsi_dims
@dataclass
 class lsi_result_mixin:
    lsi_dimensions:int
 class lsi_grid_sweep(grid_sweep):
    def __init__(self, jobtype, subsweep, inpath, lsi_dimensions, outpath, *args, **kwargs):
        self.jobtype = jobtype
        self.subsweep = subsweep
        inpath = Path(inpath)
        if lsi_dimensions == 'all':
            lsi_paths = list(inpath.glob("*.feather"))
        else:
            lsi_paths = [inpath / (str(dim) + '.feather') for dim in lsi_dimensions]
        print(lsi_paths)
        lsi_nums = [int(p.stem) for p in lsi_paths]
        self.hasrun = False
        self.subgrids = [self.subsweep(lsi_path, outpath,  lsi_dim, *args, **kwargs) for lsi_dim, lsi_path in zip(lsi_nums, lsi_paths)]
        self.jobs = list(chain(*map(lambda gs: gs.jobs, self.subgrids)))
--- a/clustering/pick_best_clustering.py
+++ b/clustering/pick_best_clustering.py
@@ -0,0 +1,33 @@
 #!/usr/bin/env python3
 import fire
 import pandas as pd
 from pathlib import Path
 import shutil
 selection_data="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/clustering/comment_authors_compex_LSI/selection_data.csv"
 outpath = 'test_best.feather'
 min_clusters=50; max_isolates=7500; min_cluster_size=2
 # pick the best clustering according to silhouette score subject to constraints
 def pick_best_clustering(selection_data, output, min_clusters, max_isolates, min_cluster_size):
    df = pd.read_csv(selection_data,index_col=0)
    df = df.sort_values("silhouette_score",ascending=False)
    # not sure I fixed the bug underlying this fully or not.
    df['n_isolates_str'] = df.n_isolates.str.strip("[]")
    df['n_isolates_0'] = df['n_isolates_str'].apply(lambda l: len(l) == 0)
    df.loc[df.n_isolates_0,'n_isolates'] = 0
    df.loc[~df.n_isolates_0,'n_isolates'] = df.loc[~df.n_isolates_0].n_isolates_str.apply(lambda l: int(l))
    best_cluster = df[(df.n_isolates <= max_isolates)&(df.n_clusters >= min_clusters)&(df.min_cluster_size==min_cluster_size)]
    best_cluster = best_cluster.iloc[0]
    best_lsi_dimensions = best_cluster.lsi_dimensions
    print(best_cluster.to_dict())
    best_path = Path(best_cluster.outpath) / (str(best_cluster['name']) + ".feather")
    shutil.copy(best_path,output)
    print(f"lsi dimensions:{best_lsi_dimensions}")
 if __name__ == "__main__":
    fire.Fire(pick_best_clustering)
--- a/clustering/selection.py
+++ b/clustering/selection.py
@@ -1,101 +1,38 @@
 from sklearn.metrics import silhouette_score
 from sklearn.cluster import AffinityPropagation
 from functools import partial
 from clustering import _affinity_clustering, read_similarity_mat
 from dataclasses import dataclass
 from multiprocessing  import Pool, cpu_count, Array, Process
 from pathlib import Path
 from itertools import product, starmap
 import numpy as np
 import pandas as pd
-import fire
+import plotnine as pn
-import sys
+from pathlib import Path
 from clustering.fit_tsne import fit_tsne
 from visualization.tsne_vis import build_visualization
-# silhouette is the only one that doesn't need the feature matrix. So it's probably the only one that's worth trying. 
+df = pd.read_csv("/gscratch/comdata/output/reddit_clustering/subreddit_comment_authors-tf_10k_LSI/hdbscan/selection_data.csv",index_col=0)
-@dataclass
+# plot silhouette_score as a function of isolates
-class clustering_result:
+df = df.sort_values("silhouette_score")
    outpath:Path
    damping:float
    max_iter:int
    convergence_iter:int
    preference_quantile:float
    silhouette_score:float
    alt_silhouette_score:float
    name:str
 df['n_isolates'] = df.n_isolates.str.split("\n0").apply(lambda rg: int(rg[1]))
 p = pn.ggplot(df,pn.aes(x='n_isolates',y='silhouette_score')) + pn.geom_point()
 p.save("isolates_x_score.png")
-def sim_to_dist(mat):
+p = pn.ggplot(df,pn.aes(y='n_clusters',x='n_isolates',color='silhouette_score')) + pn.geom_point()
-    dist = 1-mat
+p.save("clusters_x_isolates.png")
    dist[dist < 0] = 0
    np.fill_diagonal(dist,0)
    return dist
-def do_clustering(damping, convergence_iter, preference_quantile, name, mat, subreddits,  max_iter,  outdir:Path, random_state, verbose, alt_mat, overwrite=False):
+# the best result for hdbscan seems like this one: it has a decent number of 
-    if name is None:
+# I think I prefer the 'eom' clustering style because larger clusters are less likely to suffer from omitted variables
-        name = f"damping-{damping}_convergenceIter-{convergence_iter}_preferenceQuantile-{preference_quantile}"
+best_eom = df[(df.n_isolates <5000)&(df.silhouette_score>0.4)&(df.cluster_selection_method=='eom')&(df.min_cluster_size==2)].iloc[df.shape[1]]
    print(name)
    sys.stdout.flush()
    outpath = outdir / (str(name) + ".feather")
    print(outpath)
    clustering = _affinity_clustering(mat, subreddits, outpath, damping, max_iter, convergence_iter, preference_quantile, random_state, verbose)
    mat = sim_to_dist(clustering.affinity_matrix_)
-    score = silhouette_score(mat, clustering.labels_, metric='precomputed')
+best_leaf = df[(df.n_isolates <5000)&(df.silhouette_score>0.4)&(df.cluster_selection_method=='leaf')&(df.min_cluster_size==2)].iloc[df.shape[1]]
-    if alt_mat is not None:
+tsne_data = Path("./clustering/authors-tf_lsi850_tsne.feather")
        alt_distances = sim_to_dist(alt_mat)
        alt_score = silhouette_score(alt_mat, clustering.labels_, metric='precomputed')
-    res = clustering_result(outpath=outpath,
+if not tsne_data.exists():
-                            damping=damping,
+    fit_tsne("/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/850.feather",
-                            max_iter=max_iter,
+             tsne_data)
                            convergence_iter=convergence_iter,
                            preference_quantile=preference_quantile,
                            silhouette_score=score,
                            alt_silhouette_score=score,
                            name=str(name))
-    return res
+build_visualization("./clustering/authors-tf_lsi850_tsne.feather",
                    Path(best_eom.outpath)/(best_eom['name']+'.feather'),
                    "./authors-tf_lsi850_best_eom.html")
-# alt similiarities is for checking the silhouette coefficient of an alternative measure of similarity (e.g., topic similarities for user clustering).
+build_visualization("./clustering/authors-tf_lsi850_tsne.feather",
                    Path(best_leaf.outpath)/(best_leaf['name']+'.feather'),
                    "./authors-tf_lsi850_best_leaf.html")
 def select_affinity_clustering(similarities, outdir, outinfo, damping=[0.9], max_iter=100000, convergence_iter=[30], preference_quantile=[0.5], random_state=1968, verbose=True, alt_similarities=None, J=None):
    damping = list(map(float,damping))
    convergence_iter = convergence_iter = list(map(int,convergence_iter))
    preference_quantile = list(map(float,preference_quantile))
    if type(outdir) is str:
        outdir = Path(outdir)
    outdir.mkdir(parents=True,exist_ok=True)
    subreddits, mat = read_similarity_mat(similarities,use_threads=True)
    if alt_similarities is not None:
        alt_mat = read_similarity_mat(alt_similarities,use_threads=True)
    else:
        alt_mat = None
    if J is None:
        J = cpu_count()
    pool = Pool(J)
    # get list of tuples: the combinations of hyperparameters
    hyper_grid = product(damping, convergence_iter, preference_quantile)
    hyper_grid = (t + (str(i),) for i, t in enumerate(hyper_grid))
    _do_clustering = partial(do_clustering,  mat=mat, subreddits=subreddits, outdir=outdir, max_iter=max_iter, random_state=random_state, verbose=verbose, alt_mat=alt_mat)
    #    similarities = Array('d', mat)
    # call pool.starmap
    print("running clustering selection")
    clustering_data = pool.starmap(_do_clustering, hyper_grid)
    clustering_data = pd.DataFrame(list(clustering_data))
    clustering_data.to_csv(outinfo)
    return clustering_data
 if __name__ == "__main__":
    x = fire.Fire(select_affinity_clustering)
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -1,345 +0,0 @@
 # Reddit dumps → sorted parquet datasets
 This directory holds the pipeline that turns compressed Reddit dump files
 (`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the
 sorted, repartitioned parquet datasets that the rest of the project
 consumes.
 ## Pipeline overview
 The raw dumps are huge compressed json files with a lot of metadata that
 we may not need. They aren't indexed so it's expensive to pull data from
 just a handful of subreddits. It also turns out that it's a pain to read
 these compressed files straight into spark. Extracting useful variables
 from the dumps and building parquet datasets makes them easier to work
 with. This happens in two steps:
 1. Extracting json into (temporary, unpartitioned) parquet files using
   pyarrow.
 2. Repartitioning and sorting the data using pyspark.
 Breaking this down into two steps is useful because it allows us to
 decompress and parse the dumps in the backfill queue and then sort them
 in spark. Partitioning the data makes it possible to efficiently read
 data for specific subreddits or authors. Sorting it means that you can
 efficiently compute aggregations at the subreddit or user level. More
 documentation on using these files is available on the [CDSC wiki][hyak-datasets].
 The final datasets are in `/gscratch/comdata/output`:
 - `reddit_comments_by_author.parquet` has comments partitioned and sorted
  by username (lowercase).
 - `reddit_comments_by_subreddit.parquet` has comments partitioned and
  sorted by subreddit name (lowercase).
 - `reddit_submissions_by_author.parquet` has submissions partitioned and
  sorted by username (lowercase).
 - `reddit_submissions_by_subreddit.parquet` has submissions partitioned
  and sorted by subreddit name (lowercase).
 [hyak-datasets]: https://wiki.communitydata.science/CommunityData:Hyak_Datasets#Reading_Reddit_parquet_datasets
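
The claim that sorting makes per-subreddit or per-author aggregation cheap can be illustrated without Spark. The sketch below is not part of the pipeline; the rows and the `count_by_key` helper are made up for illustration. It only shows that, once rows are sorted by a key, each group can be aggregated in a single pass without holding other groups in memory.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows standing in for a dataset that is already sorted by subreddit.
rows = [
    {"subreddit": "askscience", "author": "a1"},
    {"subreddit": "askscience", "author": "a2"},
    {"subreddit": "python", "author": "a1"},
]


def count_by_key(sorted_rows, key):
    """Count rows per key in one pass; assumes the rows are sorted by `key`."""
    return {
        k: sum(1 for _ in group)
        for k, group in groupby(sorted_rows, key=itemgetter(key))
    }


counts = count_by_key(rows, "subreddit")
print(counts)
```

If the input were not sorted, `groupby` would emit the same key more than once, which is the reason the sort step matters.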
 ## Scripts
 | Script | Role |
 |---|---|
 | `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
 | `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads a directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Accepts `--indir` and `--mode` to support layered appends; defaults match the build-from-scratch workflow. |
 | `comments_merge.py`, `submissions_merge.py` | Merge entry points. Each is a Spark job that collapses all accumulated layers in the final datasets into a single clean layer. Launched via `start_spark_and_run.sh`. |
 | `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` / `merge_layers` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
 | `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |
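
To make "per-field handler dispatch" concrete, here is a minimal sketch of the idea. It is not the code in `dumps_helper.py`: the field list, the handler names, and the use of the standard `json` module in place of simdjson are all assumptions made for illustration.

```python
import json


# Hypothetical handlers: each maps a raw JSON value to the value that gets stored.
def lowercase(value):
    return value.lower() if isinstance(value, str) else value


def to_int(value):
    return int(value) if value is not None else None


# Hypothetical schema: the output columns, and the handlers that apply to some of them.
FIELDS = ["id", "subreddit", "author", "created_utc"]
FIELD_HANDLERS = {"subreddit": lowercase, "author": lowercase, "created_utc": to_int}


def parse_line(line):
    """Turn one JSON line into a tuple ordered like FIELDS."""
    record = json.loads(line)
    row = []
    for field in FIELDS:
        value = record.get(field)           # missing fields become None
        handler = FIELD_HANDLERS.get(field)  # fields without a handler pass through
        row.append(handler(value) if handler else value)
    return tuple(row)


example = parse_line(
    '{"id": "abc", "subreddit": "AskScience", "author": "Alice", "created_utc": "1118030400"}'
)
print(example)
```

With this shape, supporting a new field means adding its name to the field list and, if it needs conversion, one entry in the handler table.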
 ## The three workflows
 ### Build from scratch — `build_from_scratch.sh`
 Use this when there is no existing parquet output, or when the upstream
 data has changed in a way that requires reparsing everything. Wipes the
 per-source temp directories, processes every `RC_*` / `RS_*` dump in the
 raw dumps directory through Part 1 (in parallel via GNU parallel), then
 runs the Part 2 Spark sort.
 ### Add new months — `add_months.sh YYYY-MM [YYYY-MM ...]`
 > **NOTE: written but not yet tested. Remove this notice after a
 > successful end-to-end run.**
 Use this for routine incremental updates. Runs Part 1 on only the
 specified months, then appends the sorted output as a new layer of
 partition files alongside the existing ones. No existing data is
 rewritten.
 Each run adds one layer to each final dataset directory. Spark and DuckDB
 read all layers together correctly. At a yearly update cadence the number
 of layers stays small; use `merge_layers.sh` to collapse them when
 needed.
 The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
 `SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
 via environment variables if the files are not in the standard locations:
 ```sh
 COMMENTS_DUMPDIR=/path/to/new/comments \
 SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
 ./add_months.sh 2025-01 2025-02 2025-03
 ```
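
As a rough sketch of how the `YYYY-MM` arguments relate to dump files, the snippet below validates each month and builds the `RC_`/`RS_` file names described at the top of this README. The `dump_paths` helper is hypothetical; how `add_months.sh` actually resolves its arguments is not shown here.

```python
import re
from pathlib import Path

# Four-digit year, a dash, and a month from 01 to 12.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_paths(months, comments_dir, submissions_dir):
    """Return the comment and submission dump paths expected for each month."""
    paths = []
    for month in months:
        if not MONTH_RE.fullmatch(month):
            raise ValueError(f"expected YYYY-MM, got {month!r}")
        paths.append(Path(comments_dir) / f"RC_{month}.zst")
        paths.append(Path(submissions_dir) / f"RS_{month}.zst")
    return paths


paths = dump_paths(
    ["2025-01", "2025-02"], "/path/to/new/comments", "/path/to/new/submissions"
)
print([p.name for p in paths])
```

Checking that every returned path exists before starting Part 1 is a cheap way to catch a typo in a month argument.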
 Part 1 runs directly on a compute node. For Part 2 there are two options:
 - **Single fat node** (simpler, often faster for smaller sorts): `salloc`
  a `cpu-g2` node (128 cores, ~1 TB RAM) and run the Part 2 script
  directly with `spark-submit` or `python3`. See Step 6 of the walkthrough
  below for the `salloc` invocation.
 - **Multi-node Spark cluster**: use `start_spark_and_run.sh` from a login
  node. It allocates nodes via `salloc` and handles cluster coordination.
  Pass the number of nodes as the first argument.
 ### Merge layers — `merge_layers.sh`
 > **NOTE: written but not yet tested. Remove this notice after a
 > successful end-to-end run.**
 Use this to collapse accumulated layers from incremental adds into a
 single clean layer. Reads the existing final datasets, re-sorts
 everything, writes to `.merging` temp paths, then atomically replaces the
 originals via rename.
 Run this when query performance has degraded due to many layers, or any
 time you want a clean single-file-per-partition layout. The existing
 datasets are safe until the rename step completes; see `merge_layers.sh`
 for recovery notes if interrupted. As with `add_months.sh`, Part 2 can
 run on a single fat node or via `start_spark_and_run.sh`.
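
The write-then-rename pattern can be sketched as follows. This is an illustration under assumptions, not a description of `merge_layers.sh`: the `swap_in` helper and the `.old` backup suffix are invented here, and the two renames are individually atomic on a single filesystem but the pair is not.

```python
import shutil
import tempfile
from pathlib import Path


def swap_in(final: Path, merging: Path) -> None:
    """Replace `final` with `merging` via renames (hypothetical helper)."""
    backup = final.with_name(final.name + ".old")
    final.rename(backup)    # the original data stays intact under a new name
    merging.rename(final)   # the merged data takes over the final path
    shutil.rmtree(backup)   # old layers are dropped only after the swap


# Tiny demonstration inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    final = Path(tmp) / "reddit_comments_by_author.parquet"
    merging = Path(tmp) / "reddit_comments_by_author.parquet.merging"
    final.mkdir()
    (final / "layer1").write_text("old")
    merging.mkdir()
    (merging / "merged").write_text("new")
    swap_in(final, merging)
    remaining = sorted(p.name for p in final.iterdir())

print(remaining)
```

If the process is interrupted between the two renames, the original data is still present under the backup name, which is the property the recovery notes rely on.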
 ## Running steps individually
 The `.sh` runners are written so that every meaningful step is a
 separate, self-contained command. If something fails partway through, or
 you want to inspect intermediate state, you can copy any single line out
 of the runner and execute it standalone. For example:
 ```sh
 # parse one specific file (skipping the rest of the workflow)
 python3 comments_part1.py parse_dump RC_2025-03.zst
 # override default dump/output paths from the CLI
 python3 comments_part1.py parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out
 # regenerate just the task list
 python3 submissions_part1.py gen_task_list
 ```
 The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
 Hyak-provided wrapper not included in this repo); see the wiki for the
 launch convention.
 ## Detailed walkthrough: refreshing the data on Hyak
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
 that start from 2025 onwards. For a single-month update, the
 `add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 ### Prerequisites
 - [Set up Hyak with CDSC lab][hyak-setup] (make sure to update config
  and `.bashrc`)
 - [Go through the Hyak Getting Started tutorial][hyak-syllabus]
 Reddit dumps info (handled by `u/Watchful1` and `u/RaiderBDev`):
 - [Watchful1's reddit explanation][watchful1-explainer] (separated by
  subreddit), the [dataset not divided by subreddits][watchful1-bulk],
  and the [GitHub repo with scripts for analyzing data][watchful1-repo]
 - [RaiderBDev monthly dumps][raiderbdev-monthly] and
  [RaiderBDev's ArcticShift API][arctic-shift]
 - The [2005-06 to 2024-12 academic torrent][academic-torrent] used for
  the 2005-2024 refresh
 CDSC and Hyak docs:
 - [Hyak docs — how to work with modules][hyak-modules]
 - [CDSC — how to download Python or R packages][cdsc-pkgs]
 - [CDSC — Hyak datasets information][hyak-datasets]
 - [CDSC — Hyak Spark information][hyak-spark]
 [hyak-setup]: https://wiki.communitydata.science/CommunityData:Hyak#General_Introduction_to_Hyak
 [hyak-syllabus]: https://hyak.uw.edu/docs/hyak101/basics/syllabus/
 [watchful1-explainer]: https://www.reddit.com/r/pushshift/comments/1itme1k/separate_dump_files_for_the_top_40k_subreddits/
 [watchful1-bulk]: https://www.reddit.com/r/pushshift/comments/1i4mlqu/dump_files_from_200506_to_202412/
 [watchful1-repo]: https://github.com/Watchful1/PushshiftDumps/tree/master
 [raiderbdev-monthly]: https://www.reddit.com/r/pushshift/comments/1ithjd3/subreddits_metadata_rules_and_wikis_202501/
 [arctic-shift]: https://github.com/ArthurHeitmann/arctic_shift
 [academic-torrent]: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4
 [hyak-modules]: https://hyak.uw.edu/docs/tools/modules
 [cdsc-pkgs]: https://wiki.communitydata.science/CommunityData:Hyak_software_installation#Python_packages
 [hyak-spark]: https://wiki.communitydata.science/CommunityData:Hyak_Spark
 ### Step 1: data download on Nada and Hyak
 We downloaded the [2005-2024 academic torrent][academic-torrent] and put
 it on Nada (~2 days of downloading). We copied the raw data over to
 Hyak's scrubbed directory in a new directory,
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/reddit`, with raw
 data sorted into `/comments` or `/submissions`. The `/submissions`
 directory contains `RS_20*.zst` files and `/comments` contains
 `RC_20*.zst` files. (There are no files in older compression formats,
 such as `.bz2` or `.xz`, to deal with.)
 ### Step 2: clone the repo on Hyak
 On Hyak, clone this repo (or `scp` the contents of `datasets/`) into the
 working directory next to the raw data, e.g.
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/`. The relevant
 code lives entirely in `datasets/`:
 - `dumps_helper.py` — shared parsing and Spark logic
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
 - `build_from_scratch.sh`, `add_months.sh` — the two runner scripts
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
 they are part of the CDSC Hyak environment and should already be on
 PATH.
 ### Step 3: smoke-test Part 1 on a single file
 Check out a compute node with `any_machine`. We'll test submissions Part 1 with just one
 file:
 ```sh
 python3 submissions_part1.py parse_dump RS_2005-06.zst
 ```
 To verify, go to your output directory and examine the start of the
 file:
 ```sh
 python3 -c "import pandas as pd; df = pd.read_parquet('reddit_submissions.parquet'); print(df.head())"
 ```
 You should see columns like `id`, `author`, `subreddit`, and `title`
 printed out. Repeat the process with `comments_part1.py`; you should see
 columns like `id`, `subreddit`, `link_id`, and `parent_id` printed out.
 **Note**: you may have to install relevant libraries before successfully
 running the file:
 ```sh
 pip install --user pyarrow simdjson zstandard fire
 ```
 ### Step 4: Part 1 — converting `.zst` to `.parquet` files
 Now we'll convert all of our `.zst` compressed Reddit data to `.parquet`
 files. First, to generate our task list, we'll run
 ```sh
 python3 submissions_part1.py gen_task_list
 ```
 There should be a script, `parse_submissions_task_list`, in the working
 directory. Check the script (`less parse_submissions_task_list`); it
 should have many lines that look like our earlier test command,
 `python3 submissions_part1.py parse_dump RS_2005-06.zst`, but for all of
 our `.zst` files. Do the same process with comments to generate
 `parse_comments_task_list`.
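
For orientation, generating such a task list amounts to something like the sketch below. It is an assumption about the general shape, not the real `gen_task_list` in `dumps_helper.py`; the function signature and the temporary files are illustrative only.

```python
import tempfile
from pathlib import Path


def gen_task_list(dumpdir, script, prefix, outfile):
    """Write one `parse_dump` command per dump file and return the commands."""
    dumps = sorted(Path(dumpdir).glob(f"{prefix}_*.zst"))
    lines = [f"python3 {script} parse_dump {p.name}" for p in dumps]
    Path(outfile).write_text("\n".join(lines) + "\n")
    return lines


# Demonstration with empty placeholder files instead of real dumps.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["RS_2005-07.zst", "RS_2005-06.zst", "RC_2005-12.zst"]:
        (Path(tmp) / name).touch()
    tasks = gen_task_list(
        tmp, "submissions_part1.py", "RS", Path(tmp) / "parse_submissions_task_list"
    )

print("\n".join(tasks))
```

Each line is independent of the others, which is what lets GNU parallel run them concurrently in the next step.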
 From a login node, run `tmux` to keep our job running and then
 `any_machine` to check out a node to do computational work. We'll run
 our tasks (from the task list) in parallel to speed things up. Start with
 submissions:
 ```sh
 parallel --joblog submissions_joblog.txt --results submissions/logs < parse_submissions_task_list
 ```
 The `--joblog` flag creates a text file where you can see which tasks
 completed successfully, and the `--results` flag creates a directory
 where each task has its own stderr output to see the specific error
 (this is best practice for debugging).
 Now we'll monitor the job. Create a new window in tmux (`CTRL+b c`).
 We'll ssh into our computational node (`ssh n1234` — you can get the
 node name by running `ourjobs`) and run `htop`
 ([more details on htop][htop-explainer]). You should see that the
 machine's CPUs are getting close to 100% usage. If all looks good,
 create a new window and repeat the process for comments.
 [htop-explainer]: https://codeahoy.com/2017/01/20/hhtop-explained-visually/
 Once the job has successfully completed, you'll see that your CPUs are
 closer to 0% usage in `htop` and your `submissions_joblog.txt` file
 should show an `exitval` of 0 for all commands. Kill your node by
 running `scancel 12345678` (the job ID can be found from `ourjobs`).
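
Rather than scanning the joblog by eye, a short script can list the commands that failed. The sketch below relies on the tab-separated layout that GNU parallel writes with `--joblog` (a header row that includes `Exitval` and `Command`); the sample rows are invented for the demonstration.

```python
import csv
import io


def failed_commands(joblog_text):
    """Return the commands whose Exitval column is not 0."""
    reader = csv.DictReader(
        io.StringIO(joblog_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [row["Command"] for row in reader if row["Exitval"] != "0"]


# Invented sample in the joblog layout; real logs come from `--joblog`.
sample = (
    "Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n"
    "1\t:\t1700000000.000\t12.345\t0\t0\t0\t0\tpython3 submissions_part1.py parse_dump RS_2005-06.zst\n"
    "2\t:\t1700000001.000\t3.210\t0\t0\t1\t0\tpython3 submissions_part1.py parse_dump RS_2005-07.zst\n"
)
failures = failed_commands(sample)
print(failures)
```

In practice you would pass the contents of `submissions_joblog.txt` to `failed_commands` and rerun whatever it returns.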
 ### Step 5: verify the per-source parquet files
 We'll want to verify our `.parquet` files at this point. We compared the
 new files' number of columns and rows to the old data: from the
 `/gscratch/scrubbed/comdata/reddit_download_2005-2024/output/temp/reddit_comments.parquet`
 directory, run
 ```sh
 diff <(../../../report_parquet_filesizes.py *.parquet) <(../../../report_parquet_filesizes.py /gscratch/comdata/output/temp/reddit_comments.parquet/*.parquet)
 ```
 and confirm there are no differences (repeat the same process for
 submissions). If we keep updating from the same academic torrent, there
 may be no older copy of the newest months to compare against; in that
 case, check that the new files' row and column counts are roughly
 continuous with the most recent data we already have.
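The same comparison can be sketched directly in Python. This is not the `report_parquet_filesizes.py` script, whose output format is not shown here; it is a minimal stand-in that reads row and column counts from each file's parquet metadata with pyarrow. The function names and the counts in the demo are hypothetical.

```python
import os


def read_shapes(directory):
    """Map each .parquet file name in `directory` to (rows, columns)."""
    import pyarrow.parquet as pq  # imported lazily; only needed for real files

    shapes = {}
    for name in sorted(os.listdir(directory)):
        if name.endswith(".parquet"):
            meta = pq.ParquetFile(os.path.join(directory, name)).metadata
            shapes[name] = (meta.num_rows, meta.num_columns)
    return shapes


def compare_shapes(old, new):
    """Return human-readable differences between two shape maps."""
    problems = []
    for name in sorted(set(old) | set(new)):
        if name not in new:
            problems.append(f"{name}: missing from new data")
        elif name not in old:
            problems.append(f"{name}: not in old data")
        elif old[name] != new[name]:
            problems.append(f"{name}: old {old[name]} vs new {new[name]}")
    return problems


# Made-up counts, for illustration only
old = {"RC_2005-12.parquet": (1075, 17), "RC_2006-01.parquet": (3666, 17)}
new = {
    "RC_2005-12.parquet": (1075, 17),
    "RC_2006-01.parquet": (3660, 17),
    "RC_2006-02.parquet": (9095, 17),
}
for line in compare_shapes(old, new):
    print(line)
```

With real data, build the two maps with `read_shapes(old_dir)` and `read_shapes(new_dir)`. Files that exist only in the new data are reported too, which is expected when new months have been added.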
 ### Step 6: Part 2 — sorting the `.parquet` files by author and subreddit via Spark
 If the `.parquet` files reasonably appear to be complete, we can now
 sort them by author and subreddit. The most efficient way to do so is by
 using one node on `cpu-g2` with 128 CPUs and 994G memory. This one node
 splits into up to six slices (four in our current case) so the tasks
 will still be parallelized (`hyakalloc` or [this Hyak blog][hyak-blog]
 are good resources for further information). Run `tmux` on a login
 node, then grab the whole node for up to a week with:
 ```sh
 salloc -p cpu-g2 -A comdata --nodes=1 --time=168:00:00 -c 128 --mem=994G
 ```
 [hyak-blog]: https://hyak.uw.edu/blog/g1-vs-g2/
 Once Slurm drops you onto the compute node, run
 ```sh
 ./start_spark_and_run.sh submissions_part2.py
 ```
 Monitor via `htop` (as described in Step 4); the CPUs may not always
 show high usage but you should see that memory is being used. Repeat
 for the comments. Successful jobs will result in
 `/gscratch/comdata/output` having four new directories:
 `reddit_submissions_by_author.parquet`,
 `reddit_submissions_by_subreddit.parquet`,
 `reddit_comments_by_author.parquet`, and
 `reddit_comments_by_subreddit.parquet`. Each should contain many
 `snappy.parquet` files (e.g.
 `part-00799-c8ec5f61-5158-43c7-ae2a-189169e9a86b-c000.snappy.parquet`)
 and `_SUCCESS`.
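A quick way to confirm that outcome is to check each expected directory for a `_SUCCESS` marker and at least one part file. The sketch below uses only the standard library; `check_outputs` is a hypothetical helper, and the demo runs against a throwaway directory rather than `/gscratch/comdata/output`.

```python
import os
import tempfile

EXPECTED = [
    "reddit_submissions_by_author.parquet",
    "reddit_submissions_by_subreddit.parquet",
    "reddit_comments_by_author.parquet",
    "reddit_comments_by_subreddit.parquet",
]


def check_outputs(output_root, expected=EXPECTED):
    """Return {dataset name: problem} for every dataset that looks incomplete."""
    problems = {}
    for name in expected:
        path = os.path.join(output_root, name)
        if not os.path.isdir(path):
            problems[name] = "directory missing"
            continue
        files = os.listdir(path)
        if "_SUCCESS" not in files:
            problems[name] = "no _SUCCESS marker"
        elif not any(f.endswith(".snappy.parquet") for f in files):
            problems[name] = "no .snappy.parquet part files"
    return problems


# Demo: only the first dataset exists and is complete
with tempfile.TemporaryDirectory() as root:
    good = os.path.join(root, EXPECTED[0])
    os.makedirs(good)
    open(os.path.join(good, "_SUCCESS"), "w").close()
    open(os.path.join(good, "part-00000-demo-c000.snappy.parquet"), "w").close()
    demo_result = check_outputs(root)
print(demo_result)
```

An empty dictionary from `check_outputs("/gscratch/comdata/output")` would mean all four datasets have the expected layout; it says nothing about whether the rows themselves are complete.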
 ### Step 7: data verification
 Make sure the new data is reasonably complete before deleting any of
 the old data. Plot a simple time series of posts per day and check that
 the counts don't drop off unexpectedly. It is also useful to have lab
 members rerun whatever they're working on against the new parquet
 files.
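The posts-per-day check can be sketched as follows. It assumes you have already loaded post timestamps (for example, the `CreatedAt` column of one of the sorted datasets) as Python `datetime` objects. The helper names and the `drop_ratio` threshold of 0.5 are arbitrary choices for illustration, not part of the pipeline.

```python
from collections import Counter
from datetime import datetime, timedelta


def posts_per_day(timestamps):
    """Count posts per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def suspicious_days(counts, drop_ratio=0.5):
    """Flag days with no posts, or with fewer than drop_ratio times the
    previous day's count."""
    if not counts:
        return []
    flagged = []
    day, last = min(counts), max(counts)
    previous = None
    while day <= last:
        n = counts.get(day, 0)
        if n == 0:
            flagged.append((day, "no posts"))
        elif previous and n < drop_ratio * previous:
            flagged.append((day, f"dropped from {previous} to {n}"))
        previous = n
        day += timedelta(days=1)
    return flagged


# Made-up timestamps: nothing on Jan 3, and a sharp drop on Jan 5
demo = (
    [datetime(2024, 1, 1, h) for h in range(4)]
    + [datetime(2024, 1, 2, h) for h in range(4)]
    + [datetime(2024, 1, 4, h) for h in range(4)]
    + [datetime(2024, 1, 5, 0)]
)
flags = suspicious_days(posts_per_day(demo))
for day, reason in flags:
    print(day, reason)
```

A day-over-day ratio is a crude signal, since real activity varies by weekday, so treat flagged days as prompts for a closer look rather than proof of missing data.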
 ## See also
 The CDSC wiki page
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
 is the landing page for this project on the wiki and provides
 cross-links to related CDSC and Hyak documentation. The walkthrough
 above used to live there; it now lives here so that doc and code stay
 in sync.
--- a/datasets/add_months.sh
+++ b/datasets/add_months.sh
@@ -1,124 +0,0 @@
 #!/usr/bin/env bash
 #
 # Add one or more new months to the existing parquet datasets using a
 # layered append. The live datasets are never touched until the final
 # copy step, so they remain safe and queryable throughout.
 #
 # Usage:
 #   add_months.sh YYYY-MM [YYYY-MM ...]
 #
 # Example:
 #   add_months.sh 2025-01 2025-02 2025-03
 #
 # The new .zst dump files must live at:
 #   $COMMENTS_DUMPDIR/RC_YYYY-MM.zst
 #   $SUBMISSIONS_DUMPDIR/RS_YYYY-MM.zst
 #
 # Override the dump directories via environment variables if the new files
 # are not in the standard locations:
 #
 #   COMMENTS_DUMPDIR=/path/to/new/comments \
 #   SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
 #   ./add_months.sh 2025-01 2025-02
 #
 # Workflow:
 #   Part 1  — parse new .zst files into per-month parquets (compute node)
 #   Part 2  — sort into staging directories, not the live datasets (fat node)
 #   Verify  — inspect staging before committing (manual step, not scripted)
 #   Copy    — move staging files into live datasets (run manually after verify)
 #   Cleanup — remove temp and staging dirs (run manually after copy)
 #
 # Every command below is independently runnable for debugging. The copy
 # and cleanup steps are intentionally left as separate commands so you can
 # verify the staging output before touching the live datasets.
 #
 # NOTE: This script and its workflow are written but not yet tested.
 # Remove this notice after a successful end-to-end run.
 set -e
 cd "$(dirname "$0")"
 if [ $# -eq 0 ]; then
    echo "Usage: $0 YYYY-MM [YYYY-MM ...]" >&2
    exit 1
 fi
 COMMENTS_DUMPDIR="${COMMENTS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/comments}"
 SUBMISSIONS_DUMPDIR="${SUBMISSIONS_DUMPDIR:-/gscratch/comdata/raw_data/reddit_dumps/submissions}"
 # Part 1 temp dirs (per-month parquets, parsed from .zst)
 TEMP_COMMENTS="/gscratch/comdata/output/temp/add_months_comments.parquet"
 TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/add_months_submissions.parquet"
 # Staging dirs (sorted new layer; inspected before copying to live)
 STAGING_COMMENTS_SUB="/gscratch/comdata/output/temp/new_layer_comments_by_subreddit.parquet"
 STAGING_COMMENTS_AUTH="/gscratch/comdata/output/temp/new_layer_comments_by_author.parquet"
 STAGING_SUBMISSIONS_SUB="/gscratch/comdata/output/temp/new_layer_submissions_by_subreddit.parquet"
 STAGING_SUBMISSIONS_AUTH="/gscratch/comdata/output/temp/new_layer_submissions_by_author.parquet"
 # Live dataset dirs
 LIVE_COMMENTS_SUB="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet"
 LIVE_COMMENTS_AUTH="/gscratch/comdata/output/reddit_comments_by_author.parquet"
 LIVE_SUBMISSIONS_SUB="/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet"
 LIVE_SUBMISSIONS_AUTH="/gscratch/comdata/output/reddit_submissions_by_author.parquet"
 # --- Part 1: parse new months in parallel (run on a compute node) -----------
 printf "python3 comments_part1.py parse_dump RC_%s.zst --dumpdir=\"$COMMENTS_DUMPDIR\" --outdir=\"$TEMP_COMMENTS\"\n" "$@" \
    > add_months_comments_tasks.txt
 printf "python3 submissions_part1.py parse_dump RS_%s.zst --dumpdir=\"$SUBMISSIONS_DUMPDIR\" --outdir=\"$TEMP_SUBMISSIONS\"\n" "$@" \
    > add_months_submissions_tasks.txt
 parallel --joblog add_months_comments_joblog.txt --results add_months_comments_logs \
    < add_months_comments_tasks.txt
 parallel --joblog add_months_submissions_joblog.txt --results add_months_submissions_logs \
    < add_months_submissions_tasks.txt
 # --- Part 2: sort new months into staging (not the live datasets) -----------
 #
 # start_spark_and_run.sh calls salloc — run from a login node, or replace
 # with start_spark_cluster.sh + spark-submit if already on a suitable node.
 start_spark_and_run.sh 1 comments_part2.py \
    --indir="$TEMP_COMMENTS" \
    --out_by_subreddit="$STAGING_COMMENTS_SUB" \
    --out_by_author="$STAGING_COMMENTS_AUTH"
 start_spark_and_run.sh 1 submissions_part2.py \
    --indir="$TEMP_SUBMISSIONS" \
    --out_by_subreddit="$STAGING_SUBMISSIONS_SUB" \
    --out_by_author="$STAGING_SUBMISSIONS_AUTH"
 # --- Verify: inspect staging before copying to live -------------------------
 #
 # Stop here and check that the staging output looks right before running
 # the copy step. The live datasets are untouched at this point. Example:
 #
 #   ls -lah "$STAGING_COMMENTS_SUB" | head
 #   python3 -c "
 #   import pyarrow.parquet as pq, os
 #   f = sorted(os.listdir('$STAGING_COMMENTS_SUB'))[0]
 #   t = pq.read_table('$STAGING_COMMENTS_SUB/' + f, columns=['created_utc'])
 #   print(t.column('created_utc')[0].as_py(), t.column('created_utc')[-1].as_py())
 #   "
 # Stop here when the script is run end-to-end, so the copy and cleanup
 # commands below only ever run when pasted by hand after verification.
 exit 0
 # --- Copy: add staging files into live datasets -----------------------------
 #
 # Run these lines manually after verifying staging. This is the only step
 # that touches the live datasets. Only the *.parquet part files are
 # copied, so the _SUCCESS marker already in each live directory is not
 # overwritten. Existing files are never deleted.
 find "$STAGING_COMMENTS_SUB"  -maxdepth 1 -type f -name '*.parquet' -exec cp {} "$LIVE_COMMENTS_SUB"/  \;
 find "$STAGING_COMMENTS_AUTH" -maxdepth 1 -type f -name '*.parquet' -exec cp {} "$LIVE_COMMENTS_AUTH"/ \;
 find "$STAGING_SUBMISSIONS_SUB"  -maxdepth 1 -type f -name '*.parquet' -exec cp {} "$LIVE_SUBMISSIONS_SUB"/  \;
 find "$STAGING_SUBMISSIONS_AUTH" -maxdepth 1 -type f -name '*.parquet' -exec cp {} "$LIVE_SUBMISSIONS_AUTH"/ \;
 # --- Cleanup: remove temp and staging dirs ----------------------------------
 #
 # Run after confirming the copy succeeded and the live datasets look right.
 rm -rf "$TEMP_COMMENTS" "$TEMP_SUBMISSIONS"
 rm -rf "$STAGING_COMMENTS_SUB" "$STAGING_COMMENTS_AUTH"
 rm -rf "$STAGING_SUBMISSIONS_SUB" "$STAGING_SUBMISSIONS_AUTH"
--- a/datasets/build_from_scratch.sh
+++ b/datasets/build_from_scratch.sh
@@ -1,56 +0,0 @@
 #!/usr/bin/env bash
 #
 # Build the sorted, partitioned Reddit parquet datasets from scratch.
 #
 # Wipes the per-source temp directories, processes every RC_* and RS_* dump
 # in the raw_data dumps directory through Part 1 (per-file, parallel), then
 # runs the Part 2 Spark sort + repartition for both comments and submissions.
 #
 # Every command below is independently runnable — to debug a single stage,
 # copy the line out and run it directly. Run the whole script end-to-end
 # only when you trust each step.
 #
 # Prerequisites:
 # - raw .zst dumps already staged in the dumpdir locations (see the
 #   defaults in dumps_helper.py, or override via --dumpdir)
 # - GNU parallel installed
 # - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
 #
 # To add new months to an existing build without rebuilding from scratch,
 # use add_months.sh.
 set -e
 cd "$(dirname "$0")"
 TEMP_COMMENTS="/gscratch/comdata/output/temp/reddit_comments.parquet"
 TEMP_SUBMISSIONS="/gscratch/comdata/output/temp/reddit_submissions.parquet"
 # --- Part 1a: comments ------------------------------------------------------
 # wipe any existing comments temp output
 rm -rf "$TEMP_COMMENTS"
 # generate the per-file parse task list
 python3 comments_part1.py gen_task_list
 # run all comments parse tasks in parallel
 parallel --joblog comments_joblog.txt --results comments_logs < parse_comments_task_list
 # --- Part 1b: submissions ---------------------------------------------------
 # wipe any existing submissions temp output
 rm -rf "$TEMP_SUBMISSIONS"
 # generate the per-file parse task list
 python3 submissions_part1.py gen_task_list
 # run all submissions parse tasks in parallel
 parallel --joblog submissions_joblog.txt --results submissions_logs < parse_submissions_task_list
 # --- Part 2: spark sort + repartition --------------------------------------
 # sort comments and write reddit_comments_by_{subreddit,author}.parquet
 start_spark_and_run.sh 1 comments_part2.py
 # sort submissions and write reddit_submissions_by_{subreddit,author}.parquet
 start_spark_and_run.sh 1 submissions_part2.py
--- a/datasets/checkpoint_parallelsql.sbatch
+++ b/datasets/checkpoint_parallelsql.sbatch
@@ -1,26 +0,0 @@
 #!/bin/bash
 ## parallel_sql_job.sh
 #SBATCH --job-name=tf_subreddit_comments
 ## Allocation Definition
 #SBATCH --account=comdata-ckpt
 #SBATCH --partition=ckpt
 ## Resources
 ## Nodes. This should always be 1 for parallel-sql.
 #SBATCH --nodes=1    
 ## Walltime (12 hours)
 #SBATCH --time=12:00:00
 ## Memory per node
 #SBATCH --mem=32G
 #SBATCH --cpus-per-task=4
 #SBATCH --ntasks=1
 #SBATCH -D /gscratch/comdata/users/nathante/cdsc-reddit
 source ./bin/activate
 module load parallel_sql
 echo $(which perl)
 conda list pyarrow
 which python3
 #Put here commands to load other modules (e.g. matlab etc.)
 #The command below makes parallel_sql pull tasks from the database
 #and run them on the node in parallel. With --jobs 4, up to 4 tasks
 #run at one time, matching --cpus-per-task=4 above.
 parallel-sql --sql -a parallel --exit-on-term --jobs 4
--- a/datasets/comments_2_parquet.sh
+++ b/datasets/comments_2_parquet.sh
@@ -0,0 +1,10 @@
 #!/usr/bin/env bash
 ## needs to be run by hand since i don't have a nice way of waiting on a parallel-sql job to complete 
 echo "#!/usr/bin/bash" > job_script.sh
 #echo "source $(pwd)/../bin/activate" >> job_script.sh
 echo "python3 $(pwd)/comments_2_parquet_part1.py" >> job_script.sh
 # job_script.sh is never marked executable, so run it through bash
 srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 --pty bash job_script.sh
 start_spark_and_run.sh 1 $(pwd)/comments_2_parquet_part2.py
--- a/datasets/comments_2_parquet_part1.py
+++ b/datasets/comments_2_parquet_part1.py
@@ -0,0 +1,111 @@
 #!/usr/bin/env python3
 import os
 import json
 from datetime import datetime
 from multiprocessing import Pool
 from itertools import islice
 from helper import open_input_file, find_dumps
 import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq
 from pathlib import Path
 import fire
 def parse_comment(comment, names= None):
    if names is None:
        names = ["id","subreddit","link_id","parent_id","created_utc","author","ups","downs","score","edited","subreddit_type","subreddit_id","stickied","is_submitter","body","error"]
    try:
        comment = json.loads(comment)
    except json.decoder.JSONDecodeError as e:
        print(e)
        print(comment)
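        # 'edited' fills two columns (edited, time_edited); pad the error row
        # to the same width so the message lands in the 'error' column
        width = len(names) + (1 if 'edited' in names and 'time_edited' not in names else 0)
        row = [None] * width
        row[-1] = "json.decoder.JSONDecodeError|{0}|{1}".format(e,comment)
        return tuple(row)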
    row = []
    for name in names:
        if name == 'created_utc':
            row.append(datetime.fromtimestamp(int(comment['created_utc']),tz=None))
        elif name == 'edited':
            val = comment[name]
            if type(val) == bool:
                row.append(val)
                row.append(None)
            else:
                row.append(True)
                row.append(datetime.fromtimestamp(int(val),tz=None))
        elif name == "time_edited":
            continue
        elif name not in comment:
            row.append(None)
        else:
            row.append(comment[name])
    return tuple(row)
 #    conf = sc._conf.setAll([('spark.executor.memory', '20g'), ('spark.app.name', 'extract_reddit_timeline'), ('spark.executor.cores', '26'), ('spark.cores.max', '26'), ('spark.driver.memory','84g'),('spark.driver.maxResultSize','0'),('spark.local.dir','/gscratch/comdata/spark_tmp')])
 def parse_dump(partition):
    dumpdir = f"/gscratch/comdata/raw_data/reddit_dumps/comments/{partition}"
    stream = open_input_file(dumpdir)
    rows = map(parse_comment, stream)
    schema = pa.schema([
        pa.field('id', pa.string(), nullable=True),
        pa.field('subreddit', pa.string(), nullable=True),
        pa.field('link_id', pa.string(), nullable=True),
        pa.field('parent_id', pa.string(), nullable=True),
        pa.field('created_utc', pa.timestamp('ms'), nullable=True),
        pa.field('author', pa.string(), nullable=True),
        pa.field('ups', pa.int64(), nullable=True),
        pa.field('downs', pa.int64(), nullable=True),
        pa.field('score', pa.int64(), nullable=True),
        pa.field('edited', pa.bool_(), nullable=True),
        pa.field('time_edited', pa.timestamp('ms'), nullable=True),
        pa.field('subreddit_type', pa.string(), nullable=True),
        pa.field('subreddit_id', pa.string(), nullable=True),
        pa.field('stickied', pa.bool_(), nullable=True),
        pa.field('is_submitter', pa.bool_(), nullable=True),
        pa.field('body', pa.string(), nullable=True),
        pa.field('error', pa.string(), nullable=True),
    ])
    p = Path("/gscratch/comdata/output/temp/reddit_comments.parquet")
    p.mkdir(exist_ok=True,parents=True)
    N=10000
    with pq.ParquetWriter(f"/gscratch/comdata/output/temp/reddit_comments.parquet/{partition}.parquet",
                          schema=schema,
                          compression='snappy',
                          flavor='spark') as writer:
        while True:
            chunk = islice(rows,N)
            pddf = pd.DataFrame(chunk, columns=schema.names)
            table = pa.Table.from_pandas(pddf,schema=schema)
            if table.shape[0] == 0:
                break
            writer.write_table(table)
 def gen_task_list(dumpdir="/gscratch/comdata/raw_data/reddit_dumps/comments", overwrite=True):
    files = list(find_dumps(dumpdir,base_pattern="RC_20*.*"))
    with open("comments_task_list.sh",'w') as of:
        for fpath in files:
            partition = os.path.split(fpath)[1]
            if (not Path(f"/gscratch/comdata/output/temp/reddit_comments.parquet/{partition}.parquet").exists()) or (overwrite is True):
                of.write(f'python3 comments_2_parquet_part1.py parse_dump {partition}\n')
 if __name__ == '__main__':
    fire.Fire({'parse_dump':parse_dump,
              'gen_task_list':gen_task_list})
--- a/datasets/comments_2_parquet_part2.py
+++ b/datasets/comments_2_parquet_part2.py
@@ -0,0 +1,36 @@
 #!/usr/bin/env python3
 # spark script to make sorted, and partitioned parquet files 
 import pyspark
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 conf = pyspark.SparkConf().setAppName("Reddit comments to parquet")
 conf = conf.set("spark.sql.shuffle.partitions",2000)
 conf = conf.set('spark.sql.crossJoin.enabled',"true")
 conf = conf.set('spark.debug.maxToStringFields',200)
 # pass conf to the builder; a conf built after getOrCreate() is never applied
 spark = SparkSession.builder.config(conf=conf).getOrCreate()
 sc = spark.sparkContext
 df = spark.read.parquet("/gscratch/comdata/output/temp/reddit_comments.parquet",compression='snappy')
 df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
 df = df.drop('subreddit')
 df = df.withColumnRenamed('subreddit_2','subreddit')
 df = df.withColumnRenamed("created_utc","CreatedAt")
 df = df.withColumn("Month",f.month(f.col("CreatedAt")))
 df = df.withColumn("Year",f.year(f.col("CreatedAt")))
 df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
 df = df.repartition('subreddit')
 df2 = df.sort(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
 df2 = df2.sortWithinPartitions(["subreddit","CreatedAt","link_id","parent_id","Year","Month","Day"],ascending=True)
 df2.write.parquet("/gscratch/scrubbed/comdata/output/reddit_comments_by_subreddit.parquet", mode='overwrite', compression='snappy')
 df = df.repartition('author')
 df3 = df.sort(["author","CreatedAt","subreddit","link_id","parent_id","Year","Month","Day"],ascending=True)
 df3 = df3.sortWithinPartitions(["author","CreatedAt","subreddit","link_id","parent_id","Year","Month","Day"],ascending=True)
 df3.write.parquet("/gscratch/scrubbed/comdata/output/reddit_comments_by_author.parquet", mode='overwrite',compression='snappy')
--- a/datasets/comments_merge.py
+++ b/datasets/comments_merge.py
@@ -1,14 +0,0 @@
 #!/usr/bin/env python3
 """Collapse all layers in the comments final datasets into a single clean layer.
 Must be launched from a login node via the Hyak-provided wrapper:
  start_spark_and_run.sh 1 comments_merge.py
 See merge_layers.sh and dumps_helper.merge_layers for details.
 """
 from dumps_helper import COMMENTS, merge_layers
 if __name__ == "__main__":
    merge_layers(COMMENTS)
--- a/datasets/comments_part1.py
+++ b/datasets/comments_part1.py
@@ -1,24 +0,0 @@
 #!/usr/bin/env python3
 """Part 1 for comments: parse one RC_*.zst dump into a parquet file.
 CLI:
  comments_part1.py parse_dump RC_2018-08.zst
  comments_part1.py gen_task_list
  comments_part1.py parse_dump RC_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
 """
 import fire
 from dumps_helper import COMMENTS, parse_dump, gen_task_list
 def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(COMMENTS, partition, dumpdir=dumpdir, outdir=outdir)
 def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(COMMENTS, 'comments_part1.py', dumpdir=dumpdir, tasklist=tasklist)
 if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})
--- a/datasets/comments_part2.py
+++ b/datasets/comments_part2.py
@@ -1,21 +0,0 @@
 #!/usr/bin/env python3
 """Part 2 for comments: Spark sort + repartition into the final datasets.
 Must be launched from a login node via the Hyak-provided wrapper:
  start_spark_and_run.sh 1 comments_part2.py
  start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets
 --indir defaults to the temp comments dir in dumps_helper.py.
 --out_by_subreddit and --out_by_author default to the live dataset paths;
 override them to write to staging directories first (see add_months.sh).
 """
 import fire
 from dumps_helper import COMMENTS, sort_and_write
 if __name__ == "__main__":
    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
              sort_and_write(COMMENTS, indir=indir,
                             out_by_subreddit=out_by_subreddit,
                             out_by_author=out_by_author))
--- a/datasets/dumps_helper.py
+++ b/datasets/dumps_helper.py
@@ -1,343 +0,0 @@
 """Shared logic for the comments and submissions dump-to-parquet pipeline.
 Used by comments_part1.py / submissions_part1.py (Part 1: one compressed
 dump file → one parquet file) and comments_part2.py / submissions_part2.py
 (Part 2: Spark sort + repartition of the per-source parquets).
 The two dump types only differ in their schemas and a handful of
 field-specific extractors. The parse loop, the file I/O wrapping, the
 task-list generator, and the Spark sort are all shared here.
 """
 import os
 import shutil
 from datetime import datetime
 from itertools import islice
 import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq
 import simdjson
 from helper import find_dumps, open_fileset
 _json = simdjson.Parser()
 # --- field-level extractors ------------------------------------------------
 def _ts(name):
    """Extractor for a unix-timestamp field (or None if missing)."""
    def handler(record):
        val = record.get(name)
        if val is None:
            return None
        return datetime.fromtimestamp(int(val), tz=None)
    return handler
 def _edited(record):
    """Returns (edited, time_edited). The dump packs both into one `edited`
    field that is either a bool (never edited / unknown timestamp) or a
    unix timestamp."""
    val = record.get('edited')
    if isinstance(val, bool):
        return (val, None)
    if val is None:
        return (None, None)
    return (True, datetime.fromtimestamp(int(val), tz=None))
 def _has_media(record):
    """Submissions don't have a `has_media` field directly — derive it."""
    return record.get('media') is not None
 # --- generic parse loop ----------------------------------------------------
 def parse_record(line, fields, handlers):
    """Parse one JSON line into a tuple aligned with `fields`.
    `handlers` maps field name → callable(record) returning either a single
    value (one column) or a tuple of values (multiple consecutive columns,
    consuming the next len(tuple)-1 entries in `fields`).
    Fields without a handler are pulled from the record by name, with
    missing keys yielding None.
    The last field in `fields` is reserved for an error message string
    and is set to None on success.
    """
    try:
        record = _json.parse(line)
    except (ValueError, KeyError) as e:
        row = [None] * len(fields)
        row[-1] = f"parse error|{e}|{line}"
        return tuple(row)
    row = []
    skip_next = 0
    for name in fields:
        if skip_next > 0:
            skip_next -= 1
            continue
        handler = handlers.get(name)
        if handler is None:
            try:
                row.append(record[name])
            except KeyError:
                row.append(None)
        else:
            result = handler(record)
            if isinstance(result, tuple):
                row.extend(result)
                skip_next = len(result) - 1
            else:
                row.append(result)
    return tuple(row)
 # --- comments schema -------------------------------------------------------
 COMMENT_FIELDS = [
    'id', 'subreddit', 'link_id', 'parent_id', 'created_utc', 'author',
    'ups', 'downs', 'score', 'edited', 'time_edited', 'subreddit_type',
    'subreddit_id', 'stickied', 'is_submitter', 'body', 'error',
 ]
 COMMENT_SCHEMA = pa.schema([
    pa.field('id', pa.string(), nullable=True),
    pa.field('subreddit', pa.string(), nullable=True),
    pa.field('link_id', pa.string(), nullable=True),
    pa.field('parent_id', pa.string(), nullable=True),
    pa.field('created_utc', pa.timestamp('ms'), nullable=True),
    pa.field('author', pa.string(), nullable=True),
    pa.field('ups', pa.int64(), nullable=True),
    pa.field('downs', pa.int64(), nullable=True),
    pa.field('score', pa.int64(), nullable=True),
    pa.field('edited', pa.bool_(), nullable=True),
    pa.field('time_edited', pa.timestamp('ms'), nullable=True),
    pa.field('subreddit_type', pa.string(), nullable=True),
    pa.field('subreddit_id', pa.string(), nullable=True),
    pa.field('stickied', pa.bool_(), nullable=True),
    pa.field('is_submitter', pa.bool_(), nullable=True),
    pa.field('body', pa.string(), nullable=True),
    pa.field('error', pa.string(), nullable=True),
 ])
 COMMENT_HANDLERS = {
    'created_utc': _ts('created_utc'),
    'edited': _edited,
 }
 # --- submissions schema ----------------------------------------------------
 SUBMISSION_FIELDS = [
    'id', 'author', 'subreddit', 'title', 'created_utc', 'permalink', 'url',
    'domain', 'score', 'ups', 'downs', 'over_18', 'has_media', 'selftext',
    'retrieved_on', 'num_comments', 'gilded', 'edited', 'time_edited',
    'subreddit_type', 'subreddit_id', 'subreddit_subscribers', 'name',
    'is_self', 'stickied', 'quarantine', 'error',
 ]
 SUBMISSION_SCHEMA = pa.schema([
    pa.field('id', pa.string(), nullable=True),
    pa.field('author', pa.string(), nullable=True),
    pa.field('subreddit', pa.string(), nullable=True),
    pa.field('title', pa.string(), nullable=True),
    pa.field('created_utc', pa.timestamp('ms'), nullable=True),
    pa.field('permalink', pa.string(), nullable=True),
    pa.field('url', pa.string(), nullable=True),
    pa.field('domain', pa.string(), nullable=True),
    pa.field('score', pa.int64(), nullable=True),
    pa.field('ups', pa.int64(), nullable=True),
    pa.field('downs', pa.int64(), nullable=True),
    pa.field('over_18', pa.bool_(), nullable=True),
    pa.field('has_media', pa.bool_(), nullable=True),
    pa.field('selftext', pa.string(), nullable=True),
    pa.field('retrieved_on', pa.timestamp('ms'), nullable=True),
    pa.field('num_comments', pa.int64(), nullable=True),
    pa.field('gilded', pa.int64(), nullable=True),
    pa.field('edited', pa.bool_(), nullable=True),
    pa.field('time_edited', pa.timestamp('ms'), nullable=True),
    pa.field('subreddit_type', pa.string(), nullable=True),
    pa.field('subreddit_id', pa.string(), nullable=True),
    pa.field('subreddit_subscribers', pa.int64(), nullable=True),
    pa.field('name', pa.string(), nullable=True),
    pa.field('is_self', pa.bool_(), nullable=True),
    pa.field('stickied', pa.bool_(), nullable=True),
    pa.field('quarantine', pa.bool_(), nullable=True),
    pa.field('error', pa.string(), nullable=True),
 ])
 SUBMISSION_HANDLERS = {
    'created_utc': _ts('created_utc'),
    'retrieved_on': _ts('retrieved_on'),
    'edited': _edited,
    'has_media': _has_media,
 }
 # --- per-type configuration ------------------------------------------------
 # Defaults that the entry-point scripts pass through, exposed here so the
 # field/schema/handler triplet, the canonical paths, and the dump filename
 # pattern all live in one place.
 COMMENTS = {
    'fields': COMMENT_FIELDS,
    'schema': COMMENT_SCHEMA,
    'handlers': COMMENT_HANDLERS,
    'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/comments",
    'outdir': "/gscratch/comdata/output/temp/reddit_comments.parquet",
    'file_pattern': 'RC_20*.*',
    'task_list': 'parse_comments_task_list',
    'output_by_subreddit': "/gscratch/comdata/output/reddit_comments_by_subreddit.parquet",
    'output_by_author': "/gscratch/comdata/output/reddit_comments_by_author.parquet",
    'subreddit_sort_keys': ["subreddit", "CreatedAt", "link_id", "parent_id", "Year", "Month", "Day"],
    'author_sort_keys': ["author", "CreatedAt", "subreddit", "link_id", "parent_id", "Year", "Month", "Day"],
    'app_name': "Reddit comments to parquet",
 }
 SUBMISSIONS = {
    'fields': SUBMISSION_FIELDS,
    'schema': SUBMISSION_SCHEMA,
    'handlers': SUBMISSION_HANDLERS,
    'dumpdir': "/gscratch/comdata/raw_data/reddit_dumps/submissions",
    'outdir': "/gscratch/comdata/output/temp/reddit_submissions.parquet",
    'file_pattern': 'RS_20*.*',
    'task_list': 'parse_submissions_task_list',
    'output_by_subreddit': "/gscratch/comdata/output/reddit_submissions_by_subreddit.parquet",
    'output_by_author': "/gscratch/comdata/output/reddit_submissions_by_author.parquet",
    'subreddit_sort_keys': ["subreddit", "CreatedAt", "id"],
    'author_sort_keys': ["author", "CreatedAt", "id"],
    'app_name': "Reddit submissions to parquet",
 }
 # --- Part 1: parse one dump file -> one parquet ----------------------------
 def parse_dump(config, partition, dumpdir=None, outdir=None, chunk_size=10000):
    """Read one compressed dump from `dumpdir/partition` and write a parquet
    file to `outdir/<basename>.parquet`. Streams chunks of `chunk_size`
    rows so memory stays bounded."""
    dumpdir = dumpdir or config['dumpdir']
    outdir = outdir or config['outdir']
    schema = config['schema']
    fields = config['fields']
    handlers = config['handlers']
    stream = open_fileset([os.path.join(dumpdir, partition)])
    rows = (parse_record(line, fields, handlers) for line in stream)
    os.makedirs(outdir, exist_ok=True)
    outfile = os.path.join(outdir, os.path.splitext(partition)[0] + ".parquet")
    with pq.ParquetWriter(outfile, schema=schema, compression='snappy', flavor='spark') as writer:
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            pddf = pd.DataFrame(chunk, columns=schema.names)
            table = pa.Table.from_pandas(pddf, schema=schema)
            writer.write_table(table)
 def gen_task_list(config, script_name, dumpdir=None, tasklist=None):
    """Write a parallel-friendly task list of `script_name parse_dump <file>`
    lines, one per dump file found under `dumpdir`."""
    dumpdir = dumpdir or config['dumpdir']
    tasklist = tasklist or config['task_list']
    files = list(find_dumps(dumpdir, base_pattern=config['file_pattern']))
    with open(tasklist, 'w') as of:
        for fpath in files:
            partition = os.path.split(fpath)[1]
            of.write(f'python3 {script_name} parse_dump {partition}\n')
 # --- Part 2: spark sort + repartition --------------------------------------
 def sort_and_write(config, indir=None, out_by_subreddit=None, out_by_author=None):
    """Read a directory of per-source parquets, sort and repartition twice
    (once by subreddit, once by author), and write the two output datasets.
    indir defaults to config['outdir'].
    out_by_subreddit and out_by_author default to config['output_by_subreddit']
    and config['output_by_author']. Override them to write to staging directories
    instead of the live datasets (see add_months.sh).
    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession, functions as f
    indir = indir or config['outdir']
    out_by_subreddit = out_by_subreddit or config['output_by_subreddit']
    out_by_author = out_by_author or config['output_by_author']
    spark = SparkSession.builder.appName(config['app_name']).getOrCreate()
    df = spark.read.parquet(indir, compression='snappy')
    df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
    df = df.drop('subreddit')
    df = df.withColumnRenamed('subreddit_2', 'subreddit')
    df = df.withColumnRenamed("created_utc", "CreatedAt")
    df = df.withColumn("Month", f.month(f.col("CreatedAt")))
    df = df.withColumn("Year", f.year(f.col("CreatedAt")))
    df = df.withColumn("Day", f.dayofmonth(f.col("CreatedAt")))
    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(out_by_subreddit, mode='overwrite', compression='snappy')
    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(out_by_author, mode='overwrite', compression='snappy')
 def merge_layers(config):
    """Collapse all accumulated layers in the final datasets into a single
    clean layer. Reads the existing by_subreddit dataset (which contains all
    layers), re-sorts twice, writes to temp paths, then atomically replaces
    the originals by renaming.
    Safe to interrupt after the writes complete but before the renames: the
    originals are untouched until the first rename. If the process is
    interrupted after renaming, the .old directories are left behind;
    delete them manually once satisfied.
    Pyspark is imported lazily so Part 1 callers don't pay the Spark startup
    cost.
    """
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName(config['app_name'] + ' merge layers').getOrCreate()
    # Both final datasets have identical rows; read from by_subreddit.
    df = spark.read.parquet(config['output_by_subreddit'])
    tmp_sub = config['output_by_subreddit'] + '.merging'
    tmp_auth = config['output_by_author'] + '.merging'
    sub_keys = config['subreddit_sort_keys']
    df_sub = df.repartition('subreddit').sort(sub_keys, ascending=True)
    df_sub = df_sub.sortWithinPartitions(sub_keys, ascending=True)
    df_sub.write.parquet(tmp_sub, mode='overwrite', compression='snappy')
    auth_keys = config['author_sort_keys']
    df_auth = df.repartition('author').sort(auth_keys, ascending=True)
    df_auth = df_auth.sortWithinPartitions(auth_keys, ascending=True)
    df_auth.write.parquet(tmp_auth, mode='overwrite', compression='snappy')
    # Atomic swap: rename old → .old, then .merging → final, then delete .old.
    old_sub = config['output_by_subreddit'] + '.old'
    old_auth = config['output_by_author'] + '.old'
    os.rename(config['output_by_subreddit'], old_sub)
    os.rename(tmp_sub, config['output_by_subreddit'])
    os.rename(config['output_by_author'], old_auth)
    os.rename(tmp_auth, config['output_by_author'])
    shutil.rmtree(old_sub)
    shutil.rmtree(old_auth)
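The write-then-rename swap at the end of `merge_layers` can be exercised without Spark. The sketch below is illustrative only: `swap_into_place` is a hypothetical helper, not part of `dumps_helper.py`, and it assumes both directories live on the same filesystem and that no stale `.old` directory is present.

```python
import os
import shutil
import tempfile
from pathlib import Path


def swap_into_place(final_dir, staged_dir):
    """Replace final_dir with staged_dir using renames (hypothetical helper).

    Follows the same order as merge_layers: final -> .old, staged -> final,
    then remove .old.
    """
    old_dir = final_dir + '.old'
    os.rename(final_dir, old_dir)      # the original data is kept under .old
    os.rename(staged_dir, final_dir)   # the merged data becomes visible
    shutil.rmtree(old_dir)             # only now is the old data deleted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        final = os.path.join(root, 'by_subreddit.parquet')
        staged = final + '.merging'
        os.mkdir(final)
        os.mkdir(staged)
        Path(final, 'part-old').write_text('old layer')
        Path(staged, 'part-new').write_text('merged layer')

        swap_into_place(final, staged)
        print(sorted(os.listdir(final)))
```

Between the two renames the final path briefly does not exist, so readers of the live dataset should not run concurrently with the swap.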
--- a/datasets/helper.py
+++ b/datasets/helper.py
@@ -24,8 +24,7 @@ def open_fileset(files):
    for fh in files:
        print(fh)
        lines = open_input_file(fh)
-        for line in lines:
-            yield line
+        yield from lines
 def open_input_file(input_filename):
    if re.match(r'.*\.7z$', input_filename):
@@ -39,7 +38,7 @@ def open_input_file(input_filename):
    elif re.match(r'.*\.xz', input_filename):
        cmd = ["xzcat",'-dk', '-T 20',input_filename]
    elif re.match(r'.*\.zst',input_filename):
-        cmd = ['zstd','-dck', input_filename]
+        cmd = ['/kloneusr/bin/zstd','-dck', input_filename, '--memory=2048MB', '--stdout']
    elif re.match(r'.*\.gz',input_filename):
        cmd = ['gzip','-dc', input_filename]
    try:
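Assuming `cmd` is handed to `subprocess` as a list without `shell=True` (the call itself is outside this hunk), each list element reaches the child process as exactly one argument, so a string containing a space is not split into two flags. A minimal sketch of that behavior, using the Python interpreter as a stand-in child process; `show_argv` is a hypothetical helper and not part of `helper.py`:

```python
import json
import subprocess
import sys


def show_argv(args):
    """Run a child Python process and return the arguments it received."""
    child = "import sys, json; print(json.dumps(sys.argv[1:]))"
    out = subprocess.run([sys.executable, "-c", child] + list(args),
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


if __name__ == "__main__":
    # One list element containing a space stays one argument ...
    print(show_argv(["--memory=2048MB --stdout"]))
    # ... while two list elements become two arguments.
    print(show_argv(["--memory=2048MB", "--stdout"]))
```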
--- a/datasets/job_script.sh
+++ b/datasets/job_script.sh
@@ -0,0 +1,4 @@
 #!/usr/bin/bash
 start_spark_cluster.sh
 singularity exec  /gscratch/comdata/users/nathante/containers/nathante.sif spark-submit --master spark://$(hostname):7077 comments_2_parquet_part2.py 
 singularity exec /gscratch/comdata/users/nathante/containers/nathante.sif stop-all.sh
--- a/datasets/merge_layers.sh
+++ b/datasets/merge_layers.sh
@@ -1,32 +0,0 @@
 #!/usr/bin/env bash
 #
 # Collapse all accumulated layers in the final parquet datasets into a
 # single clean layer. Use this after several incremental adds via
 # add_months.sh when you want to reduce the number of partition files.
 #
 # Reads the existing by_subreddit / by_author datasets, re-sorts everything,
 # writes to temp paths, then atomically replaces the originals via rename.
 # The old directories are removed once the new ones are in place.
 #
 # If the process is interrupted after writing the .merging directories but
 # before the renames complete, re-run — the .merging directories will be
 # overwritten and the originals are still intact. If interrupted after the
 # renames, the .old directories are left behind; delete them manually once
 # satisfied with the output.
 #
 # To add new months without merging, use add_months.sh.
 # To rebuild everything from raw dumps, use build_from_scratch.sh.
 #
 # NOTE: This script and its workflow are written but not yet tested.
 # Remove this notice after a successful end-to-end run.
 #
 # Every command below is independently runnable for debugging.
 set -e
 cd "$(dirname "$0")"
 # merge and collapse comments layers
 start_spark_and_run.sh 1 comments_merge.py
 # merge and collapse submissions layers
 start_spark_and_run.sh 1 submissions_merge.py
--- a/datasets/submissions_2_parquet.sh
+++ b/datasets/submissions_2_parquet.sh
@@ -0,0 +1,9 @@
 #!/usr/bin/env bash
 ## this should be run manually since we don't have a nice way to wait on parallel_sql jobs
 srun -p compute-bigmem -A comdata --nodes=1 --mem-per-cpu=9g -c 40 --time=120:00:00 python3 $(pwd)/submissions_2_parquet_part1.py gen_task_list
 start_spark_and_run.sh 1 $(pwd)/submissions_2_parquet_part2.py
--- a/datasets/submissions_2_parquet_part1.py
+++ b/datasets/submissions_2_parquet_part1.py
@@ -0,0 +1,114 @@
 #!/usr/bin/env python3
 # two stages:
 # 1. from gz to arrow parquet (this script) 
 # 2. from arrow parquet to spark parquet (submissions_2_parquet_part2.py)
 from datetime import datetime
 from pathlib import Path
 from itertools import islice
 from helper import find_dumps, open_fileset
 import pandas as pd
 import pyarrow as pa
 import pyarrow.parquet as pq
 import fire
 import os
 import json
 def parse_submission(post, names = None):
    if names is None:
        names = ['id','author','subreddit','title','created_utc','permalink','url','domain','score','ups','downs','over_18','has_media','selftext','retrieved_on','num_comments','gilded','edited','time_edited','subreddit_type','subreddit_id','subreddit_subscribers','name','is_self','stickied','quarantine','error']
    try:
        post = json.loads(post)
    except (ValueError) as e:
        #        print(e)
        #        print(post)
        row = [None for _ in names]
        row[-1] = "Error parsing json|{0}|{1}".format(e,post)
        return tuple(row)
    row = []
    for name in names:
        if name == 'created_utc' or name == 'retrieved_on':
            val = post.get(name,None)
            if val is not None:
                row.append(datetime.fromtimestamp(int(post[name]),tz=None))
            else:
                row.append(None)
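        elif name == 'edited':
            # 'edited' is either a bool or the edit timestamp; a missing key counts as unedited.
            val = post.get(name, False)
            if isinstance(val, bool):
                row.append(val)
                row.append(None)
            else:
                row.append(True)
                row.append(datetime.fromtimestamp(int(val),tz=None))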
        elif name == "time_edited":
            continue
        elif name == 'has_media':
            row.append(post.get('media',None) is not None)
        elif name not in post:
            row.append(None)
        else:
            row.append(post[name])
    return tuple(row)
 def parse_dump(partition):
    N=10000
    stream = open_fileset([f"/gscratch/comdata/raw_data/reddit_dumps/submissions/{partition}"])
    rows = map(parse_submission,stream)
    schema = pa.schema([
        pa.field('id', pa.string(),nullable=True),
        pa.field('author', pa.string(),nullable=True),
        pa.field('subreddit', pa.string(),nullable=True),
        pa.field('title', pa.string(),nullable=True),
        pa.field('created_utc', pa.timestamp('ms'),nullable=True),
        pa.field('permalink', pa.string(),nullable=True),
        pa.field('url', pa.string(),nullable=True),
        pa.field('domain', pa.string(),nullable=True),
        pa.field('score', pa.int64(),nullable=True),
        pa.field('ups', pa.int64(),nullable=True),
        pa.field('downs', pa.int64(),nullable=True),
        pa.field('over_18', pa.bool_(),nullable=True),
        pa.field('has_media',pa.bool_(),nullable=True),
        pa.field('selftext',pa.string(),nullable=True),
        pa.field('retrieved_on', pa.timestamp('ms'),nullable=True),
        pa.field('num_comments', pa.int64(),nullable=True),
        pa.field('gilded',pa.int64(),nullable=True),
        pa.field('edited',pa.bool_(),nullable=True),
        pa.field('time_edited',pa.timestamp('ms'),nullable=True),
        pa.field('subreddit_type',pa.string(),nullable=True),
        pa.field('subreddit_id',pa.string(),nullable=True),
        pa.field('subreddit_subscribers',pa.int64(),nullable=True),
        pa.field('name',pa.string(),nullable=True),
        pa.field('is_self',pa.bool_(),nullable=True),
        pa.field('stickied',pa.bool_(),nullable=True),
        pa.field('quarantine',pa.bool_(),nullable=True),
        pa.field('error',pa.string(),nullable=True)])
    Path("/gscratch/comdata/output/temp/reddit_submissions.parquet/").mkdir(exist_ok=True,parents=True)
    with pq.ParquetWriter(f"/gscratch/comdata/output/temp/reddit_submissions.parquet/{partition}",schema=schema,compression='snappy',flavor='spark') as writer:
        while True:
            chunk = islice(rows,N)
            pddf = pd.DataFrame(chunk, columns=schema.names)
            table = pa.Table.from_pandas(pddf,schema=schema)
            if table.shape[0] == 0:
                break
            writer.write_table(table)
        writer.close()
 def gen_task_list(dumpdir="/gscratch/comdata/raw_data/reddit_dumps/submissions"):
    files = list(find_dumps(dumpdir,base_pattern="RS_20*.*"))
    with open("submissions_task_list.sh",'w') as of:
        for fpath in files:
            partition = os.path.split(fpath)[1]
            of.write(f'python3 submissions_2_parquet_part1.py parse_dump {partition}\n')
 if __name__ == "__main__":
    fire.Fire({'parse_dump':parse_dump,
              'gen_task_list':gen_task_list})
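`parse_dump` streams rows to the writer in chunks of `N` instead of materializing a whole dump in memory. The chunking idiom on its own looks roughly like this; `iter_chunks` is a hypothetical name, and plain lists stand in for the pandas and pyarrow tables used above:

```python
from itertools import islice


def iter_chunks(rows, n):
    """Yield successive lists of up to n items from any iterable."""
    rows = iter(rows)          # islice must consume one shared iterator
    while True:
        chunk = list(islice(rows, n))
        if not chunk:          # an empty chunk means the stream is exhausted
            break
        yield chunk


if __name__ == "__main__":
    for chunk in iter_chunks(range(7), 3):
        print(chunk)
```

Stopping on the first empty chunk plays the same role as the `table.shape[0] == 0` check in the writer loop.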
--- a/datasets/submissions_2_parquet_part2.py
+++ b/datasets/submissions_2_parquet_part2.py
@@ -0,0 +1,42 @@
 #!/usr/bin/env python3
 # Spark script to make sorted and partitioned parquet files
 import pyspark
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 import os
 conf = pyspark.SparkConf().setAppName("Reddit submissions to parquet")
 conf = conf.set("spark.sql.shuffle.partitions",2000)
 conf = conf.set('spark.sql.crossJoin.enabled',"true")
 conf = conf.set('spark.debug.maxToStringFields',200)
 # Build the session from conf; a SparkConf created after getOrCreate() is never applied.
 spark = SparkSession.builder.config(conf=conf).getOrCreate()
 sc = spark.sparkContext
 sqlContext = pyspark.SQLContext(sc)
 df = spark.read.parquet("/gscratch/comdata/output/temp/reddit_submissions.parquet/")
 df = df.withColumn("subreddit_2", f.lower(f.col('subreddit')))
 df = df.drop('subreddit')
 df = df.withColumnRenamed('subreddit_2','subreddit')
 df = df.withColumnRenamed("created_utc","CreatedAt")
 df = df.withColumn("Month",f.month(f.col("CreatedAt")))
 df = df.withColumn("Year",f.year(f.col("CreatedAt")))
 df = df.withColumn("Day",f.dayofmonth(f.col("CreatedAt")))
 df = df.withColumn("subreddit_hash",f.sha2(f.col("subreddit"), 256)[0:3])
 # next we gotta resort it all.
 df = df.repartition("subreddit")
 df2 = df.sort(["subreddit","CreatedAt","id"],ascending=True)
 df2 = df2.sortWithinPartitions(["subreddit","CreatedAt","id"],ascending=True)
 df2.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_subreddit.parquet2", mode='overwrite',compression='snappy')
 # we also want parquet files sorted by author, then by creation time.
 df = df.repartition("author")
 df3 = df.sort(["author","CreatedAt","id"],ascending=True)
 df3 = df3.sortWithinPartitions(["author","CreatedAt","id"],ascending=True)
 df3.write.parquet("/gscratch/comdata/output/temp/reddit_submissions_by_author.parquet2", mode='overwrite',compression='snappy')
--- a/datasets/submissions_merge.py
+++ b/datasets/submissions_merge.py
@@ -1,14 +0,0 @@
 #!/usr/bin/env python3
 """Collapse all layers in the submissions final datasets into a single clean layer.
 Must be launched from a login node via the Hyak-provided wrapper:
  start_spark_and_run.sh 1 submissions_merge.py
 See merge_layers.sh and dumps_helper.merge_layers for details.
 """
 from dumps_helper import SUBMISSIONS, merge_layers
 if __name__ == "__main__":
    merge_layers(SUBMISSIONS)
--- a/datasets/submissions_part1.py
+++ b/datasets/submissions_part1.py
@@ -1,24 +0,0 @@
 #!/usr/bin/env python3
 """Part 1 for submissions: parse one RS_*.zst dump into a parquet file.
 CLI:
  submissions_part1.py parse_dump RS_2018-08.zst
  submissions_part1.py gen_task_list
  submissions_part1.py parse_dump RS_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
 """
 import fire
 from dumps_helper import SUBMISSIONS, parse_dump, gen_task_list
 def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(SUBMISSIONS, partition, dumpdir=dumpdir, outdir=outdir)
 def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(SUBMISSIONS, 'submissions_part1.py', dumpdir=dumpdir, tasklist=tasklist)
 if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})
--- a/datasets/submissions_part2.py
+++ b/datasets/submissions_part2.py
@@ -1,21 +0,0 @@
 #!/usr/bin/env python3
 """Part 2 for submissions: Spark sort + repartition into the final datasets.
 Must be launched from a login node via the Hyak-provided wrapper:
  start_spark_and_run.sh 1 submissions_part2.py
  start_spark_and_run.sh 1 submissions_part2.py --indir=/path/to/parquets --mode=append
 --indir defaults to the temp submissions dir in dumps_helper.py.
 --out_by_subreddit and --out_by_author default to the live dataset paths;
 override them to write to staging directories first (see add_months.sh).
 """
 import fire
 from dumps_helper import SUBMISSIONS, sort_and_write
 if __name__ == "__main__":
    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
              sort_and_write(SUBMISSIONS, indir=indir,
                             out_by_subreddit=out_by_subreddit,
                             out_by_author=out_by_author))
--- a/density/Makefile
+++ b/density/Makefile
@@ -8,3 +8,9 @@ all: /gscratch/comdata/output/reddit_density/comment_terms_10000.feather /gscrat
 /gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10000.feather: overlap_density.py /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
 	start_spark_and_run.sh 1 overlap_density.py authors --inpath="/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet" --outpath="/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10000.feather" --agg=pd.DataFrame.sum
 /gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10K_LSI/850.feather: overlap_density.py /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/850.feather
 	start_spark_and_run.sh 1 overlap_density.py authors --inpath="/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/850.feather" --outpath="/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10K_LSI/850.feather" --agg=pd.DataFrame.sum
 /gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather: overlap_density.py /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather
 	start_spark_and_run.sh 1 overlap_density.py authors --inpath="/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather" --outpath="/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather" --agg=pd.DataFrame.sum
--- a/density/job_script.sh
+++ b/density/job_script.sh
@@ -1,4 +1,4 @@
 #!/usr/bin/bash
 start_spark_cluster.sh
-spark-submit --master spark://$(hostname):18899 overlap_density.py authors --inpath=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather --outpath=/gscratch/comdata/output/reddit_density/comment_authors_10000.feather --agg=pd.DataFrame.sum
+singularity exec  /gscratch/comdata/users/nathante/cdsc_base.sif spark-submit --master spark://$(hostname).hyak.local:7077 overlap_density.py authors --inpath=/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors-tf_10k_LSI/600.feather --outpath=/gscratch/comdata/output/reddit_density/subreddit_author_tf_similarities_10K_LSI/600.feather --agg=pd.DataFrame.sum
-stop-all.sh
+singularity exec /gscratch/comdata/users/nathante/cdsc_base.sif stop-all.sh
--- a/density/overlap_density.py
+++ b/density/overlap_density.py
@@ -1,11 +1,12 @@
 import pandas as pd
 from pandas.core.groupby import DataFrameGroupBy as GroupBy
 from pathlib import Path
 import fire
 import numpy as np
 import sys
-sys.path.append("..")
+# sys.path.append("..")
-sys.path.append("../similarities")
+# sys.path.append("../similarities")
-from similarities.similarities_helper import reindex_tfidf, reindex_tfidf_time_interval
+# from similarities.similarities_helper import pull_tfidf
 # this is the mean of the ratio of the overlap to the focal size.
 # mean shared membership per focal community member
@@ -13,10 +14,12 @@ from similarities.similarities_helper import reindex_tfidf, reindex_tfidf_time_i
 def overlap_density(inpath, outpath, agg = pd.DataFrame.sum):
    df = pd.read_feather(inpath)
-    df = df.drop('subreddit',1)
+    df = df.drop('_subreddit', axis=1)
    np.fill_diagonal(df.values,0)
    df = agg(df, 0).reset_index()
    df = df.rename({0:'overlap_density'},axis='columns')
    outpath = Path(outpath)
    outpath.parent.mkdir(parents=True, exist_ok = True)
    df.to_feather(outpath)
    return df
@@ -25,6 +28,8 @@ def overlap_density_weekly(inpath, outpath, agg = GroupBy.sum):
    # exclude the diagonal
    df = df.loc[df.subreddit != df.variable]
    res = agg(df.groupby(['subreddit','week'])).reset_index()
    outpath = Path(outpath)
    outpath.parent.mkdir(parents=True, exist_ok = True)
    res.to_feather(outpath)
    return res
--- a/dumps/check_comments_shas.py
+++ b/dumps/check_comments_shas.py
@@ -0,0 +1,33 @@
 #!/usr/bin/env python3
 # run from a build_machine
 import requests
 from os import path
 import hashlib
 shasums1 = requests.get("https://files.pushshift.io/reddit/comments/sha256sum.txt").text
 #shasums2 = requests.get("https://files.pushshift.io/reddit/comments/daily/sha256sum.txt").text
 shasums = shasums1 
 dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/comments"
 for l in shasums.strip().split('\n'):
    sha256_hash = hashlib.sha256()
    parts = l.split(' ')
    correct_sha256 = parts[0]
    filename = parts[-1]
    print(f"checking {filename}")
    fpath = path.join(dumpdir,filename)
    if path.isfile(fpath):
        with open(fpath,'rb') as f:
            for byte_block in iter(lambda: f.read(4096),b""):
                sha256_hash.update(byte_block)
        if sha256_hash.hexdigest() == correct_sha256:
            print(f"{filename} checks out")
        else:
            print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
    else:
        print(f"Skipping {filename} as it doesn't exist")
--- a/dumps/check_submission_shas.py
+++ b/dumps/check_submission_shas.py
@@ -0,0 +1,31 @@
 #!/usr/bin/env python3
 # run from a build_machine
 import requests
 from os import path
 import hashlib
 file1 = requests.get("https://files.pushshift.io/reddit/submissions/sha256sums.txt").text
 file2 = requests.get("https://files.pushshift.io/reddit/submissions/old_v1_data/sha256sums.txt").text
 dumpdir = "/gscratch/comdata/raw_data/reddit_dumps/submissions"
 for l in file1.strip().split('\n') + file2.strip().split('\n'):
    sha256_hash = hashlib.sha256()
    parts = l.split(' ')
    correct_sha256 = parts[0]
    filename = parts[-1]
    print(f"checking {filename}")
    fpath = path.join(dumpdir,filename)
    if path.isfile(fpath):
        with open(fpath,'rb') as f:
            for byte_block in iter(lambda: f.read(4096),b""):
                sha256_hash.update(byte_block)
        if sha256_hash.hexdigest() == correct_sha256:
            print(f"{filename} checks out")
        else:
            print(f"ERROR! {filename} has the wrong hash. Redownload and recheck!")
    else:
        print(f"Skipping {filename} as it doesn't exist")
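Both checksum scripts rely on the same two steps: splitting a checksum line into digest and filename, and hashing the file in fixed-size blocks. A self-contained sketch, assuming each line has the `<digest>  <filename>` layout that the `split(' ')` calls above imply; `file_sha256`, `parse_sha_line`, and the example filename are hypothetical:

```python
import hashlib
import os
import tempfile


def file_sha256(fpath, block_size=4096):
    """Hash a file in blocks so a large dump is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(fpath, 'rb') as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def parse_sha_line(line):
    """Split a checksum line into (digest, filename)."""
    parts = line.split()
    return parts[0], parts[-1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fpath = os.path.join(tmp, 'RS_example.zst')   # made-up filename
        with open(fpath, 'wb') as fh:
            fh.write(b'hello')
        line = f"{hashlib.sha256(b'hello').hexdigest()}  RS_example.zst"
        expected, name = parse_sha_line(line)
        print(name, file_sha256(fpath) == expected)
```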
--- a/dumps/pull_pushshift_comments.sh
+++ b/dumps/pull_pushshift_comments.sh
@@ -0,0 +1,12 @@
 #!/bin/bash
 user_agent='nathante teblunthuis <nathante@uw.edu>'
 output_dir='/gscratch/comdata/raw_data/reddit_dumps/comments'
 base_url='https://files.pushshift.io/reddit/comments/'
 wget -r --no-parent -A 'RC_20*.bz2' -U "$user_agent" -P "$output_dir" -nd -nc "$base_url"
 wget -r --no-parent -A 'RC_20*.xz' -U "$user_agent" -P "$output_dir" -nd -nc "$base_url"
 wget -r --no-parent -A 'RC_20*.zst' -U "$user_agent" -P "$output_dir" -nd -nc "$base_url"
 ./check_comments_shas.py
--- a/dumps/pull_pushshift_submissions.sh
+++ b/dumps/pull_pushshift_submissions.sh
@@ -0,0 +1,14 @@
 #!/bin/bash
 user_agent='nathante teblunthuis <nathante@uw.edu>'
 output_dir='/gscratch/comdata/raw_data/reddit_dumps/submissions'
 base_url='https://files.pushshift.io/reddit/submissions/'
 wget -r --no-parent -A 'RS_20*.bz2' --user-agent="$user_agent" -P "$output_dir" -nd -nc "$base_url"
 wget -r --no-parent -A 'RS_20*.xz' --user-agent="$user_agent" -P "$output_dir" -nd -nc "$base_url"
 wget -r --no-parent -A 'RS_20*.zst' --user-agent="$user_agent" -P "$output_dir" -nd -nc "$base_url"
 wget -r --no-parent -A 'RS_20*.bz2' --user-agent="$user_agent" -P "$output_dir" -nd -nc "${base_url}old_v1_data/"
 wget -r --no-parent -A 'RS_20*.xz' --user-agent="$user_agent" -P "$output_dir" -nd -nc "${base_url}old_v1_data/"
 wget -r --no-parent -A 'RS_20*.zst' --user-agent="$user_agent" -P "$output_dir" -nd -nc "${base_url}old_v1_data/"
 ./check_submission_shas.py
--- a/ngrams/run_tf_jobs.sh
+++ b/ngrams/run_tf_jobs.sh
@@ -1,8 +1,6 @@
 #!/usr/bin/env bash
-module load parallel_sql
+
 source ./bin/activate
 python3 tf_comments.py gen_task_list
 psu --del --Y
 cat tf_task_list | psu --load
 for job in $(seq 1 50); do sbatch checkpoint_parallelsql.sbatch; done;
--- a/ngrams/sort_tf_comments.py
+++ b/ngrams/sort_tf_comments.py
@@ -2,12 +2,17 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 import fire
 def main(inparquet, outparquet, colname):
    spark = SparkSession.builder.getOrCreate()
-df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_tfidf_test.parquet_temp/")
+    df = spark.read.parquet(inparquet)
-df = df.repartition(2000,'term')
+    df = df.repartition(2000,colname)
-df = df.sort(['term','week','subreddit'])
+    df = df.sort([colname,'week','subreddit'])
-df = df.sortWithinPartitions(['term','week','subreddit'])
+    df = df.sortWithinPartitions([colname,'week','subreddit'])
-df.write.parquet("/gscratch/comdata/users/nathante/reddit_tfidf_test_sorted_tf.parquet_temp",mode='overwrite',compression='snappy')
+    df.write.parquet(outparquet,mode='overwrite',compression='snappy')
 if __name__ == '__main__':
    fire.Fire(main)
--- a/ngrams/tf_comments.py
+++ b/ngrams/tf_comments.py
@@ -13,25 +13,30 @@ from nltk.corpus import stopwords
 from nltk.util import ngrams
 import string
 from random import random
-
+from redditcleaner import clean
-# remove urls
+from pathlib import Path
 # taken from https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url
 urlregex = re.compile(r"[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)")
 # compute term frequencies for comments in each subreddit by week
-def weekly_tf(partition, mwe_pass = 'first'):
+def weekly_tf(partition, outputdir = '/gscratch/comdata/output/reddit_ngrams/', input_dir="/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/", mwe_pass = 'first', excluded_users=None):
    dataset = ds.dataset(f'/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/{partition}', format='parquet')
    if not os.path.exists("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/"):
        os.mkdir("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/")
-    if not os.path.exists("/gscratch/comdata/users/nathante/reddit_tfidf_test_authors.parquet_temp/"):
+    dataset = ds.dataset(Path(input_dir)/partition, format='parquet')
-        os.mkdir("/gscratch/comdata/users/nathante/reddit_tfidf_test_authors.parquet_temp/")
+    outputdir = Path(outputdir)
    samppath = outputdir / "reddit_comment_ngrams_10p_sample"
    if not samppath.exists():
        samppath.mkdir(parents=True, exist_ok=True)
    ngram_output = partition.replace("parquet","txt")
    if excluded_users is not None:
        excluded_users = set(map(str.strip,open(excluded_users)))
        df = df.filter(~ (f.col("author").isin(excluded_users)))
    ngram_path = samppath / ngram_output
    if mwe_pass == 'first':
-        if os.path.exists(f"/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}"):
+        if ngram_path.exists():
-            os.remove(f"/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}")
+            ngram_path.unlink()
    batches = dataset.to_batches(columns=['CreatedAt','subreddit','body','author'])
@@ -65,8 +70,10 @@ def weekly_tf(partition, mwe_pass = 'first'):
    subreddit_weeks = groupby(rows, lambda r: (r.subreddit, r.week))
    mwe_path = outputdir / "multiword_expressions.feather"
    if mwe_pass != 'first':
-        mwe_dataset = pd.read_feather(f'/gscratch/comdata/output/reddit_ngrams/multiword_expressions.feather')
+        mwe_dataset = pd.read_feather(mwe_path)
        mwe_dataset = mwe_dataset.sort_values(['phrasePWMI'],ascending=False)
        mwe_phrases = list(mwe_dataset.phrase)
        mwe_phrases = [tuple(s.split(' ')) for s in mwe_phrases]
@@ -95,8 +102,8 @@ def weekly_tf(partition, mwe_pass = 'first'):
        # lowercase        
        text = text.lower()
-        # remove urls
+        # redditcleaner removes reddit markdown(newlines, quotes, bullet points, links, strikethrough, spoiler, code, superscript, table, headings)
-        text = urlregex.sub("", text)
+        text = clean(text)
        # sentence tokenize
        sentences = sent_tokenize(text)
@@ -107,19 +114,18 @@ def weekly_tf(partition, mwe_pass = 'first'):
        # remove punctuation
        sentences = map(remove_punct, sentences)
        # remove sentences with less than 2 words
        sentences = filter(lambda sentence: len(sentence) > 2, sentences)
        # Datta et al. select relatively common phrases from the reddit corpus, but they don't really explain how. We'll try that in a second phase.
        # they say that they extract 1-4 grams from 10% of the sentences and then find phrases that appear often relative to the original terms
        # here we take a 10 percent sample of sentences 
        if mwe_pass == 'first':
            # remove sentences with less than 2 words
            sentences = filter(lambda sentence: len(sentence) > 2, sentences)
            sentences = list(sentences)
            for sentence in sentences:
                if random() <= 0.1:
                    grams = list(chain(*map(lambda i : ngrams(sentence,i),range(4))))
-                    with open(f'/gscratch/comdata/output/reddit_ngrams/comment_ngrams_10p_sample/{ngram_output}','a') as gram_file:
+                    with open(ngram_path,'a') as gram_file:
                        for ng in grams:
                            gram_file.write(' '.join(ng) + '\n')
                for token in sentence:
@@ -154,7 +160,14 @@ def weekly_tf(partition, mwe_pass = 'first'):
    outchunksize = 10000
-    with pq.ParquetWriter(f"/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet/{partition}",schema=schema,compression='snappy',flavor='spark') as writer, pq.ParquetWriter(f"/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet/{partition}",schema=author_schema,compression='snappy',flavor='spark') as author_writer:
+    termtf_outputdir = (outputdir / "comment_terms")
    termtf_outputdir.mkdir(parents=True, exist_ok=True)
    authortf_outputdir = (outputdir / "comment_authors")
    authortf_outputdir.mkdir(parents=True, exist_ok=True)    
    termtf_path = termtf_outputdir / partition
    authortf_path = authortf_outputdir / partition
    with pq.ParquetWriter(termtf_path, schema=schema, compression='snappy', flavor='spark') as writer, \
         pq.ParquetWriter(authortf_path, schema=author_schema, compression='snappy', flavor='spark') as author_writer:
        while True:
@@ -183,12 +196,12 @@ def weekly_tf(partition, mwe_pass = 'first'):
        author_writer.close()
-def gen_task_list(mwe_pass='first'):
+def gen_task_list(mwe_pass='first', outputdir='/gscratch/comdata/output/reddit_ngrams/', tf_task_list='tf_task_list', excluded_users_file=None):
    files = os.listdir("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet/")
-    with open("tf_task_list",'w') as outfile:
+    with open(tf_task_list,'w') as outfile:
        for f in files:
            if f.endswith(".parquet"):
-                outfile.write(f"./tf_comments.py weekly_tf --mwe-pass {mwe_pass} {f}\n")
+                outfile.write(f"./tf_comments.py weekly_tf --mwe-pass {mwe_pass} --outputdir {outputdir} --excluded_users {excluded_users_file} {f}\n")
 if __name__ == "__main__":
    fire.Fire({"gen_task_list":gen_task_list,
--- a/ngrams/top_comment_phrases.py
+++ b/ngrams/top_comment_phrases.py
@@ -1,10 +1,17 @@
 #!/usr/bin/env python3
 from pyspark.sql import functions as f
 from pyspark.sql import Window
 from pyspark.sql import SparkSession
 import numpy as np
 import fire
 from pathlib import Path
 def main(ngram_dir="/gscratch/comdata/output/reddit_ngrams"):
    spark = SparkSession.builder.getOrCreate()
-df = spark.read.text("/gscratch/comdata/users/nathante/reddit_comment_ngrams_10p_sample/")
+    ngram_dir = Path(ngram_dir)
    ngram_sample = ngram_dir / "reddit_comment_ngrams_10p_sample"
    df = spark.read.text(str(ngram_sample))
    df = df.withColumnRenamed("value","phrase")
@@ -13,7 +20,6 @@ phrases = df.groupby('phrase').count()
    phrases = phrases.withColumnRenamed('count','phraseCount')
    phrases = phrases.filter(phrases.phraseCount > 10)
    # count overall
    N = phrases.select(f.sum(phrases.phraseCount).alias("phraseCount")).collect()[0].phraseCount
@@ -41,18 +47,23 @@ df = terms.select(['phrase','phraseCount','phraseLogProb','phrasePWMI'])
    df = df.sort(['phrasePWMI'],ascending=False)
    df = df.sortWithinPartitions(['phrasePWMI'],ascending=False)
 df.write.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet/",mode='overwrite',compression='snappy')
-df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet/")
+    pwmi_dir = ngram_dir / "reddit_comment_ngrams_pwmi.parquet/"
    df.write.parquet(str(pwmi_dir), mode='overwrite', compression='snappy')
-df.write.csv("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.csv/",mode='overwrite',compression='none')
+    df = spark.read.parquet(str(pwmi_dir))
-df = spark.read.parquet("/gscratch/comdata/users/nathante/reddit_comment_ngrams_pwmi.parquet")
+    df.write.csv(str(ngram_dir / "reddit_comment_ngrams_pwmi.csv/"),mode='overwrite',compression='none')
    df = spark.read.parquet(str(pwmi_dir))
    df = df.select('phrase','phraseCount','phraseLogProb','phrasePWMI')
    # choosing phrases occurring at least 3500 times in the 10% sample (35000 times) and then with a PWMI of at least 3 yields about 65000 expressions.
    #
    df = df.filter(f.col('phraseCount') > 3500).filter(f.col("phrasePWMI")>3)
    df = df.toPandas()
-df.to_feather("/gscratch/comdata/users/nathante/reddit_multiword_expressions.feather")
+    df.to_feather(ngram_dir / "multiword_expressions.feather")
-df.to_csv("/gscratch/comdata/users/nathante/reddit_multiword_expressions.csv")
+    df.to_csv(ngram_dir / "multiword_expressions.csv")
 if __name__ == '__main__':
    fire.Fire(main)
--- a/old/#tfidf_authors.py#
+++ b/old/#tfidf_authors.py#
@@ -0,0 +1,21 @@
 from pyspark.sql import SparkSession
 from similarities_helper import build_tfidf_dataset
 import pandas as pd
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 # remove [deleted] and AutoModerator (TODO remove other bots)
 df = df.filter(df.author != '[deleted]')
 df = df.filter(df.author != 'AutoModerator')
 df = build_tfidf_dataset(df, include_subs, 'author')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
 spark.stop()
--- a/old/#tfidf_comments_weekly.py#
+++ b/old/#tfidf_comments_weekly.py#
@@ -0,0 +1,27 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 from similarities_helper import build_weekly_tfidf_dataset
 import pandas as pd
 ## TODO: need to exclude automoderator / bot posts.
 ## TODO: need to exclude or better handle hyperlinks.
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 # remove [deleted] and AutoModerator (TODO remove other bots)
 # df = df.filter(df.author != '[deleted]')
 # df = df.filter(df.author != 'AutoModerator')
 df = build_weekly_tfidf_dataset(df, include_subs, 'term')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
 spark.stop()
--- a/old/author_cosine_similarity.py
+++ b/old/author_cosine_similarity.py
@@ -0,0 +1,106 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 import numpy as np
 import pyarrow
 import pandas as pd
 import fire
 from itertools import islice
 from pathlib import Path
 from similarities_helper import *
 #tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/subreddit_terms.parquet')
 def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext.getConf()
    print(outfile)
    tfidf = spark.read.parquet(tfidf_path)
    if included_subreddits is None:
        included_subreddits = select_topN_subreddits(topN)
    else:
        included_subreddits = set(open(included_subreddits))
    print("creating temporary parquet with matrix indicies")
    tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, included_subreddits)
    tfidf = spark.read.parquet(tempdir.name)
    # the ids can change each week.
    subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
    spark.stop()
    weeks = list(subreddit_names.week.drop_duplicates())
    for week in weeks:
        print("loading matrix")
        mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
        print('computing similarities')
        sims = column_similarities(mat)
        del mat
        names = subreddit_names.loc[subreddit_names.week==week]
        sims = sims.rename({i:sr for i, sr in enumerate(names.subreddit.values)},axis=1)
        sims['subreddit'] = names.subreddit.values
        write_weekly_similarities(outfile, sims, week)
 def cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500):
    '''
    Compute similarities between subreddits based on TF-IDF vectors of author comments
    included_subreddits : string
        Text file containing a list of subreddits to include (one per line). If included_subreddits is None, the top 500 subreddits are used.
    min_df : int (default = 0.1 * (number of included_subreddits))
         exclude terms that appear in fewer than this number of documents.
    outfile: string
         where to output csv and feather outputs
 '''
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext.getConf()
    print(outfile)
    tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet')
    if included_subreddits is None:
        included_subreddits = select_topN_subreddits(topN)
    else:
        included_subreddits = set(open(included_subreddits))
    print("creating temporary parquet with matrix indicies")
    tempdir = prep_tfidf_entries(tfidf, 'author', min_df, included_subreddits)
    tfidf = spark.read.parquet(tempdir.name)
    subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
    spark.stop()
    print("loading matrix")
    mat = read_tfidf_matrix(tempdir.name,'author')
    print('computing similarities')
    sims = column_similarities(mat)
    del mat
    sims = pd.DataFrame(sims.todense())
    sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
    sims['subreddit'] = subreddit_names.subreddit.values
    p = Path(outfile)
    output_feather =  Path(str(p).replace("".join(p.suffixes), ".feather"))
    output_csv =  Path(str(p).replace("".join(p.suffixes), ".csv"))
    output_parquet =  Path(str(p).replace("".join(p.suffixes), ".parquet"))
    sims.to_feather(outfile)
    tempdir.cleanup()
 if __name__ == '__main__':
    fire.Fire(cosine_similarities)
--- a/old/term_cosine_similarity.py
+++ b/old/term_cosine_similarity.py
@@ -0,0 +1,61 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 from pyspark.mllib.linalg.distributed import RowMatrix, CoordinateMatrix
 import numpy as np
 import pyarrow
 import pandas as pd
 import fire
 from itertools import islice
 from pathlib import Path
 from similarities_helper import prep_tfidf_entries, read_tfidf_matrix, column_similarities, select_topN
 import scipy
 # outfile='test_similarities_500.feather';
 # min_df = None;
 # included_subreddits=None; topN=100; exclude_phrases=True;
 def term_cosine_similarities(outfile, min_df = None, included_subreddits=None, topN=500, exclude_phrases=False):
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext.getConf()
    print(outfile)
    print(exclude_phrases)
    tfidf = spark.read.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_terms.parquet')
    if included_subreddits is None:
        included_subreddits = select_topN_subreddits(topN)
    else:
        included_subreddits = set(open(included_subreddits))
    if exclude_phrases == True:
        tfidf = tfidf.filter(~f.col('term').contains("_"))
    print("creating temporary parquet with matrix indicies")
    tempdir = prep_tfidf_entries(tfidf, 'term', min_df, included_subreddits)
    tfidf = spark.read.parquet(tempdir.name)
    subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
    spark.stop()
    print("loading matrix")
    mat = read_tfidf_matrix(tempdir.name,'term')
    print('computing similarities')
    sims = column_similarities(mat)
    del mat
    sims = pd.DataFrame(sims.todense())
    sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)},axis=1)
    sims['subreddit'] = subreddit_names.subreddit.values
    p = Path(outfile)
    output_feather =  Path(str(p).replace("".join(p.suffixes), ".feather"))
    output_csv =  Path(str(p).replace("".join(p.suffixes), ".csv"))
    output_parquet =  Path(str(p).replace("".join(p.suffixes), ".parquet"))
    sims.to_feather(outfile)
    tempdir.cleanup()
 if __name__ == '__main__':
    fire.Fire(term_cosine_similarities)
--- a/old/tfidf_authors.py
+++ b/old/tfidf_authors.py
@@ -0,0 +1,21 @@
 from pyspark.sql import SparkSession
 from similarities_helper import build_tfidf_dataset
 import pandas as pd
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 # remove [deleted] and AutoModerator (TODO remove other bots)
 df = df.filter(df.author != '[deleted]')
 df = df.filter(df.author != 'AutoModerator')
 df = build_tfidf_dataset(df, include_subs, 'author')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf/subreddit_comment_authors.parquet',mode='overwrite',compression='snappy')
 spark.stop()
--- a/old/tfidf_authors_weekly.py
+++ b/old/tfidf_authors_weekly.py
@@ -0,0 +1,21 @@
 from pyspark.sql import SparkSession
 from similarities_helper import build_weekly_tfidf_dataset
 import pandas as pd
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 # remove [deleted] and AutoModerator (TODO remove other bots)
 df = df.filter(df.author != '[deleted]')
 df = df.filter(df.author != 'AutoModerator')
 df = build_weekly_tfidf_dataset(df, include_subs, 'author')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet', mode='overwrite', compression='snappy')
 spark.stop()
--- a/old/tfidf_comments.py
+++ b/old/tfidf_comments.py
@@ -0,0 +1,18 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 from similarities_helper import build_tfidf_dataset
 import pandas as pd
 ## TODO: need to exclude automoderator / bot posts.
 ## TODO: need to exclude or better handle hyperlinks.
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 df = build_tfidf_dataset(df, include_subs, 'term')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/reddit_similarity/subreddit_terms.parquet',mode='overwrite',compression='snappy')
 spark.stop()
--- a/old/tfidf_comments_weekly.py
+++ b/old/tfidf_comments_weekly.py
@@ -0,0 +1,27 @@
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 from similarities_helper import build_weekly_tfidf_dataset
 import pandas as pd
 ## TODO: need to exclude automoderator / bot posts.
 ## TODO: need to exclude or better handle hyperlinks.
 spark = SparkSession.builder.getOrCreate()
 df = spark.read.parquet("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet")
 include_subs = pd.read_csv("/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv")
 include_subs = set(include_subs.loc[include_subs.comments_rank <= 25000]['subreddit'])
 # remove [deleted] and AutoModerator (TODO remove other bots)
 # df = df.filter(df.author != '[deleted]')
 # df = df.filter(df.author != 'AutoModerator')
 df = build_weekly_tfidf_dataset(df, include_subs, 'term')
 df.write.parquet('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', mode='overwrite', compression='snappy')
 spark.stop()
--- a/similarities/Makefile
+++ b/similarities/Makefile
@@ -1,25 +1,138 @@
-all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms.parquet
+
 #all: /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors_130k.parquet
 # srun_singularity=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity.sh
 # srun_singularity_huge=source /gscratch/comdata/users/nathante/cdsc_reddit/bin/activate && srun_singularity_huge.sh
 srun=srun -p compute-bigmem -A comdata --mem-per-cpu=9g --time=200:00:00 -c 40 
 srun_huge=srun -p compute-hugemem -A comdata --mem-per-cpu=9g --time=200:00:00 -c 40 
 similarity_data=/gscratch/scrubbed/comdata/reddit_similarity
 tfidf_data=${similarity_data}/tfidf
 tfidf_weekly_data=${similarity_data}/tfidf_weekly
 similarity_weekly_data=${similarity_data}/weekly
 lsi_components=[10,50,100,200,300,400,500,600,700,850,1000,1500]
 lsi_similarities: ${similarity_data}/subreddit_comment_terms_10k_LSI ${similarity_data}/subreddit_comment_authors-tf_10k_LSI ${similarity_data}/subreddit_comment_authors_10k_LSI ${similarity_data}/subreddit_comment_terms_30k_LSI ${similarity_data}/subreddit_comment_authors-tf_30k_LSI ${similarity_data}/subreddit_comment_authors_30k_LSI
 all: ${tfidf_data}/comment_terms_30k.parquet ${tfidf_data}/comment_terms_10k.parquet ${tfidf_data}/comment_authors_30k.parquet ${tfidf_data}/comment_authors_10k.parquet ${similarity_data}/subreddit_comment_authors_30k.feather ${similarity_data}/subreddit_comment_authors_10k.feather  ${similarity_data}/subreddit_comment_terms_10k.feather ${similarity_data}/subreddit_comment_terms_30k.feather ${similarity_data}/subreddit_comment_authors-tf_30k.feather ${similarity_data}/subreddit_comment_authors-tf_10k.feather
 #all: ${tfidf_data}/comment_terms_100k.parquet ${tfidf_data}/comment_terms_30k.parquet ${tfidf_data}/comment_terms_10k.parquet ${tfidf_data}/comment_authors_100k.parquet ${tfidf_data}/comment_authors_30k.parquet ${tfidf_data}/comment_authors_10k.parquet ${similarity_data}/subreddit_comment_authors_30k.feather ${similarity_data}/subreddit_comment_authors_10k.feather  ${similarity_data}/subreddit_comment_terms_10k.feather ${similarity_data}/subreddit_comment_terms_30k.feather ${similarity_data}/subreddit_comment_authors-tf_30k.feather ${similarity_data}/subreddit_comment_authors-tf_10k.feather ${similarity_data}/subreddit_comment_terms_100k.feather ${similarity_data}/subreddit_comment_authors_100k.feather ${similarity_data}/subreddit_comment_authors-tf_100k.feather ${similarity_weekly_data}/comment_terms.parquet
 #${tfidf_weekly_data}/comment_terms_100k.parquet ${tfidf_weekly_data}/comment_authors_100k.parquet ${tfidf_weekly_data}/comment_terms_30k.parquet ${tfidf_weekly_data}/comment_authors_30k.parquet ${similarity_weekly_data}/comment_terms_100k.parquet ${similarity_weekly_data}/comment_authors_100k.parquet  ${similarity_weekly_data}/comment_terms_30k.parquet ${similarity_weekly_data}/comment_authors_30k.parquet
 # /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_130k.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_130k.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_weekly_130k.parquet
 # all: /gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_10000.parquet /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet
 ${similarity_weekly_data}/comment_terms.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv ${tfidf_weekly_data}/comment_terms.parquet
 	 ${srun} python3 weekly_cosine_similarities.py terms --topN=10000 --outfile=${similarity_weekly_data}/comment_terms.parquet
-# /gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
+${similarity_data}/subreddit_comment_terms_10k.feather: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
-# 	start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_authors_25000.feather
+	 ${srun} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_10k.feather --topN=10000
-/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
+${similarity_data}/subreddit_comment_terms_10k_LSI: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
-	start_spark_and_run.sh 1 tfidf.py terms --topN=10000
+	 ${srun_huge} python3 lsi_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=200
-/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv
+${similarity_data}/subreddit_comment_terms_30k_LSI: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
-	start_spark_and_run.sh 1 tfidf.py authors --topN=10000
+	 ${srun_huge} python3 lsi_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=200 --inpath=$<
-/gscratch/comdata/output/reddit_similarity/comment_authors_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet 
+${similarity_data}/subreddit_comment_terms_30k.feather: ${tfidf_data}/comment_terms_30k.parquet similarities_helper.py
-	start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
+	 ${srun_huge} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_30k.feather --topN=30000 --inpath=$<
-/gscratch/comdata/output/reddit_similarity/comment_terms.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet
+${similarity_data}/subreddit_comment_authors_30k.feather: ${tfidf_data}/comment_authors_30k.parquet similarities_helper.py
-	start_spark_and_run.sh 1 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
+	 ${srun_huge} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_30k.feather --topN=30000 --inpath=$<
-# /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet: cosine_similarities.py /gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet
+${similarity_data}/subreddit_comment_authors_10k.feather: ${tfidf_data}/comment_authors_10k.parquet similarities_helper.py
 	 ${srun_huge} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_10k.feather --topN=10000 --inpath=$<
 ${similarity_data}/subreddit_comment_authors_10k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun_huge} python3 lsi_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=10 --inpath=$<
 ${similarity_data}/subreddit_comment_authors_30k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun_huge} python3 lsi_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=10 --inpath=$<
 ${similarity_data}/subreddit_comment_authors-tf_30k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_30k.feather --topN=30000 --inpath=$<
 ${similarity_data}/subreddit_comment_authors-tf_10k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_10k.feather --topN=10000
 ${similarity_data}/subreddit_comment_authors-tf_10k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun_huge} python3 lsi_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_10k_LSI --topN=10000 --n_components=${lsi_components} --min_df=10 --inpath=$<
 ${similarity_data}/subreddit_comment_authors-tf_30k_LSI: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun_huge} python3 lsi_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_30k_LSI --topN=30000 --n_components=${lsi_components} --min_df=10 --inpath=$<
 ${similarity_data}/subreddit_comment_terms_100k.feather: ${tfidf_data}/comment_terms_100k.parquet similarities_helper.py
 	 ${srun} python3 cosine_similarities.py term --outfile=${similarity_data}/subreddit_comment_terms_100k.feather --topN=100000
 ${similarity_data}/subreddit_comment_authors_100k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun} python3 cosine_similarities.py author --outfile=${similarity_data}/subreddit_comment_authors_100k.feather --topN=100000
 ${similarity_data}/subreddit_comment_authors-tf_100k.feather: ${tfidf_data}/comment_authors_100k.parquet similarities_helper.py
 	 ${srun} python3 cosine_similarities.py author-tf --outfile=${similarity_data}/subreddit_comment_authors-tf_100k.feather --topN=100000
 ${similarity_data}/subreddits_by_num_comments_nonsfw.csv:
 	start_spark_and_run.sh 3 top_subreddits_by_comments.py
 ${tfidf_data}/comment_terms_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py terms --topN=100000 --inpath=$< --outpath=${tfidf_data}/comment_terms_100k.parquet
 ${tfidf_data}/comment_terms_30k.feather: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py terms --topN=30000 --inpath=$< --outpath=${tfidf_data}/comment_terms_30k.feather
 ${tfidf_data}/comment_terms_10k.feather: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py terms --topN=10000 --inpath=$< --outpath=${tfidf_data}/comment_terms_10k.feather
 ${tfidf_data}/comment_authors_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py authors --topN=100000 --inpath=$< --outpath=${tfidf_data}/comment_authors_100k.parquet
 ${tfidf_data}/comment_authors_10k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py authors --topN=10000 --inpath=$< --outpath=${tfidf_data}/comment_authors_10k.parquet
 ${tfidf_data}/comment_authors_30k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 #	mkdir -p ${tfidf_data}/
 	start_spark_and_run.sh 3 tfidf.py authors --topN=30000 --inpath=$< --outpath=${tfidf_data}/comment_authors_30k.parquet
 ${tfidf_data}/tfidf_weekly/comment_terms_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 	start_spark_and_run.sh 3 tfidf.py terms_weekly --topN=100000 --outpath=${similarity_data}/tfidf_weekly/comment_authors_100k.parquet
 ${tfidf_data}/tfidf_weekly/comment_authors_100k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_ppnum_comments.csv
 	start_spark_and_run.sh 3 tfidf.py authors_weekly --topN=100000 --inpath=$< --outpath=${tfidf_weekly_data}/comment_authors_100k.parquet
 ${tfidf_weekly_data}/comment_terms_30k.parquet:  /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 	start_spark_and_run.sh 2 tfidf.py terms_weekly --topN=30000 --inpath=$< --outpath=${tfidf_weekly_data}/comment_authors_30k.parquet
 ${tfidf_weekly_data}/comment_authors_30k.parquet: /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv
 	start_spark_and_run.sh 3 tfidf.py authors_weekly --topN=30000 --inpath=$< --outpath=${tfidf_weekly_data}/comment_authors_30k.parquet
 ${similarity_weekly_data}/comment_terms_100k.parquet: weekly_cosine_similarities.py similarities_helper.py ${tfidf_weekly_data}/comment_terms_100k.parquet
 	 ${srun} python3 weekly_cosine_similarities.py terms --topN=100000 --outfile=${similarity_weekly_data}/comment_terms_100k.parquet
 ${similarity_weekly_data}/comment_authors_100k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv ${tfidf_weekly_data}/comment_authors_100k.parquet
 	 ${srun} python3 weekly_cosine_similarities.py authors --topN=100000 --outfile=${similarity_weekly_data}/comment_authors_100k.parquet
 ${similarity_weekly_data}/comment_terms_30k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv ${tfidf_weekly_data}/comment_terms_30k.parquet
 	 ${srun} python3 weekly_cosine_similarities.py terms --topN=30000 --outfile=${similarity_weekly_data}/comment_authors_30k.parquet
 ${similarity_weekly_data}/comment_authors_30k.parquet: weekly_cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_terms.parquet ${similarity_data}/subreddits_by_num_comments_nonsfw.csv ${tfidf_weekly_data}/comment_authors_30k.parquet
 	 ${srun} python3 weekly_cosine_similarities.py authors --topN=30000 --outfile=${similarity_weekly_data}/comment_authors_30k.parquet
 # ${tfidf_weekly_data}/comment_authors_130k.parquet: tfidf.py similarities_helper.py /gscratch/comdata/output/reddit_ngrams/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments_nonsfw.csv
 # 	start_spark_and_run.sh 1 tfidf.py authors_weekly --topN=130000
 # /gscratch/comdata/output/reddit_similarity/comment_authors_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet 
 # 	start_spark_and_run.sh 1 cosine_similarities.py author --outfile=/gscratch/comdata/output/reddit_similarity/comment_authors_10000.feather
 # /gscratch/comdata/output/reddit_similarity/comment_terms.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet
 # 	start_spark_and_run.sh 1 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
 # /gscratch/comdata/output/reddit_similarity/comment_terms_10000_weekly.parquet: cosine_similarities.py ${tfidf_weekly_data}/comment_authors.parquet
 # 	start_spark_and_run.sh 1 weekly_cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_comment_terms_10000_weely.parquet
-/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
+# /gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet: cosine_similarities.py similarities_helper.py /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet /gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet
-	start_spark_and_run.sh 1 cosine_similarities.py author-tf --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
+# 	start_spark_and_run.sh 1 cosine_similarities.py author-tf --outfile=/gscratch/comdata/output/reddit_similarity/subreddit_author_tf_similarities_10000.parquet
--- a/similarities/README.md
+++ b/similarities/README.md
@@ -1,175 +0,0 @@
 # Subreddit similarity
 This directory holds the code that computes pairwise similarities between
 subreddits — both term-based (from TF-IDF over comment text) and
 author-based (from overlapping commenter sets). Similarity matrices
 produced here feed downstream clustering (`../clustering/`) and density
 analysis (`../density/`).
 ## Datasets
 Subreddit similarity datasets based on comment terms and comment authors
 are available on Hyak in `/gscratch/comdata/output/reddit_similarity`.
 The overall approach to subreddit similarity seems to work reasonably
 well and the code is stabilizing. If you want help using these
 similarities in a project, just reach out to
 [Nate](https://wiki.communitydata.science/People#Nathan_TeBlunthuis_.28University_of_Texas_at_Austin.29).
 By default, the scripts here take a `topN` parameter, which selects the
 subreddits to include in the similarity dataset according to how many
 total comments they have. Alternatively, you can pass the
 `included_subreddits` parameter the path to a text file that lists the
 subreddits you would like to include, one per line.
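The selection rule described above can be sketched in a few lines. This is an illustrative sketch, not the code in `similarities_helper.py`: the function name `select_subreddits` is hypothetical, and it assumes the ranking file has `subreddit` and `comments_rank` columns, as the older scripts in this repository suggest.

```python
import csv
import io


def select_subreddits(ranking_csv, top_n=500, included_subreddits=None):
    """Return the set of subreddit names to include (hypothetical helper).

    ranking_csv: open file-like object with `subreddit` and `comments_rank` columns.
    included_subreddits: optional file-like object with one subreddit per line.
    """
    if included_subreddits is not None:
        # An explicit list takes precedence; strip newlines and skip blank lines.
        return {line.strip() for line in included_subreddits if line.strip()}
    reader = csv.DictReader(ranking_csv)
    # Keep subreddits whose comment-count rank is within the top N.
    return {row["subreddit"] for row in reader if int(row["comments_rank"]) <= top_n}


# Toy ranking data standing in for subreddits_by_num_comments.csv.
ranking = io.StringIO(
    "subreddit,comments_rank\nAskReddit,1\npolitics,2\nchristianity,3\n"
)
top2 = select_subreddits(ranking, top_n=2)
explicit = select_subreddits(
    None, included_subreddits=io.StringIO("christianity\n\ncooking\n")
)
```

Stripping each line matters when reading an explicit list, because names that keep their trailing newline will not match the subreddit names in the TF-IDF data.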
 ## Scripts
 | Script | What it does |
 |---|---|
 | `tfidf.py` | Builds TF-IDF vectors for subreddits. Fire CLI subcommands for `authors`, `terms`, `authors_weekly`, `terms_weekly`. |
 | `cosine_similarities.py` | Computes cosine similarities between subreddit TF-IDF vectors. Fire CLI subcommands `author`, `term`, `author-tf`. |
 | `weekly_cosine_similarities.py` | Same idea but operating on the weekly TF-IDF vectors. |
 | `wang_similarity.py` | A variant similarity computation based on user overlaps in the style of Wang et al. |
 | `top_subreddits_by_comments.py` | Produces the `subreddits_by_num_comments.csv` ranking used to pick the top-N subreddits for the similarity matrices. |
 | `similarities_helper.py` | Shared helpers for building TF-IDF datasets, reindexing, and selecting the top-N subreddits. |
 | `Makefile` | Wires everything together with the canonical Hyak output paths. |
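For readers unfamiliar with the similarity measure named in the table, here is a minimal sketch of cosine similarity between two sparse vectors. It is an illustration only: the scripts in this directory operate on whole sparse matrices through `column_similarities`, whereas the hypothetical `cosine_similarity` function below compares one pair of dictionaries.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as {key: weight} dicts."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Define the similarity with an all-zero vector as 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy TF-IDF-like vectors for two subreddits.
a = {"bible": 1.0, "scripture": 1.0}
b = {"bible": 1.0}
sim_ab = cosine_similarity(a, b)
sim_disjoint = cosine_similarity(a, {"recipe": 2.0})
```

The same function applies whether the keys are terms or authors, which is why the term-based and author-based similarities can share one code path.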
 ## Methods
 [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common and
 simple information-retrieval technique that we can use to quantify the
 topic of a subreddit. The goal of TF-IDF is to build a vector for each
 subreddit that scores every term (or phrase) according to how
 characteristic it is of the overall lexicon used in that subreddit. For
 example, the most characteristic terms in the subreddit `/r/christianity`
 in the current version of the TF-IDF model are:
 | Term         | tf_idf |
 |:------------:|:------:|
 | christians   | 0.581  |
 | christianity | 0.569  |
 | kjv          | 0.568  |
 | bible        | 0.557  |
 | scripture    | 0.55   |
 TF-IDF stands for "term frequency-inverse document frequency" because
 it is the product of two quantities: "term frequency" and "inverse
 document frequency." Term frequency quantifies how often a term appears
 in a subreddit (document). Inverse document frequency quantifies how
 rare that term is across all subreddits (documents): it is large when
 few subreddits use the term and small when many do. As you can see on
 the Wikipedia page, there are many possible ways of constructing and
 combining these quantities.
 I chose to normalize term frequency by the maximum (raw) term frequency
 for each subreddit:
 $$\mathrm{tf}_{t,d} = \frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}$$
 I use the log inverse document frequency:
 $$\mathrm{idf}_{t} = \log\frac{N}{|\{d \in D : t \in d\}|}$$
 I then combine them using some smoothing to get:
 $$\mathrm{tfidf}_{t,d} = (0.5 + 0.5 \cdot \mathrm{tf}_{t,d}) \cdot \mathrm{idf}_{t}$$
 (Other normalization strategies are worth trying — see the note in
 `TODO`.)
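
 To make the three formulas above concrete, here is a minimal sketch
 that applies them to a made-up toy corpus. It is not the Spark code in
 `tfidf.py`; the function name `tfidf_scores`, the toy data, and the use
 of the natural logarithm are assumptions for illustration.

```python
import math
from collections import Counter

# Toy corpus (made-up data): each "document" is one subreddit's bag of words.
docs = {
    "sub_a": ["bible", "bible", "scripture", "the"],
    "sub_b": ["goal", "match", "the", "the"],
    "sub_c": ["goal", "bible", "the"],
}


def tfidf_scores(docs):
    """Max-normalized tf, log idf, and the 0.5-smoothed product."""
    n_docs = len(docs)
    counts = {name: Counter(words) for name, words in docs.items()}
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for name, c in counts.items():
        max_f = max(c.values())  # largest raw term frequency in this document
        scores[name] = {}
        for term, f_td in c.items():
            tf = f_td / max_f
            idf = math.log(n_docs / doc_freq[term])
            scores[name][term] = (0.5 + 0.5 * tf) * idf
    return scores


scores = tfidf_scores(docs)
```

 In this toy corpus "the" appears in every document, so its idf and
 therefore its score is zero everywhere, while "scripture" outscores
 "bible" in `sub_a` because it appears in fewer documents.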
 ### Building TF-IDF vectors
 The process for building TF-IDF vectors has four steps:
 1. Extracting terms using `../ngrams/tf_comments.py`
 2. Detecting common phrases using `../ngrams/top_comment_phrases.py`
 3. Extracting terms and common phrases using
   `../ngrams/tf_comments.py --mwe-pass='second'`
 4. Building IDF and TF-IDF scores in `tfidf.py`
 #### Running `tf_comments.py` on the backfill queue
 The main reason that I did it in four steps instead of one is to take
 advantage of the backfill queue for running `tf_comments.py`. This step
 requires reading all of the text in every comment and converting it to
 a bag of words at the subreddit level. This is a lot of computation
 that is easily parallelizable. The script `../ngrams/run_tf_jobs.sh`
 partially automates running step 1 (or step 3) on the backfill queue.
 #### Phrase detection using pointwise mutual information
 TF-IDF is simple, but only uses single words (unigrams). Sequences of
 multiple words can be important to account for how words have different
 meanings in different contexts or how sequences of words refer to
 distinct things like names. Dealing with context or longer sequences of
 words is a common challenge in natural language processing since the
 number of possible n-grams grows exponentially as n gets bigger. Phrase
 detection helps with this problem by limiting the set of n-grams to
 those that are most informative.
 But how do we detect phrases? I implemented [pointwise mutual
 information](https://en.wikipedia.org/wiki/Pointwise_mutual_information)
 (PMI), which is a pretty simple approach but seems to work well.
 PMI is a quantity derived from information theory. The intuition is
 that if two words occur together quite frequently compared to how often
 they appear separately then the co-occurrence is likely to be
 informative.
 $$\operatorname{pmi}(x;y) \equiv \log\frac{p(x,y)}{p(x)\,p(y)} = \log\frac{p(x|y)}{p(x)} = \log\frac{p(y|x)}{p(y)}$$
 In `../ngrams/tf_comments.py` if `--mwe-pass=first` then a 10% sample
 of 1-4-grams (sequences of terms up to length 4) will be written to a
 file to be consumed by `../ngrams/top_comment_phrases.py`.
 `top_comment_phrases.py` computes the PMI for these possible phrases
 and writes those that occur at least 3500 times in the sample of
 n-grams and have a PMI of at least 3 (about 65000 expressions).
 `tf_comments.py --mwe-pass=second` then uses the detected phrases and
 adds them to the term frequency data.
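
 The selection rule described above can be sketched for two-word
 candidates as follows. This is not the code in
 `top_comment_phrases.py`: the function names `pmi` and
 `select_phrases`, the counts, and the use of the natural logarithm are
 assumptions for illustration, and the sketch does not cover how PMI is
 extended to 3- and 4-grams. Only the thresholds (3500 occurrences and a
 PMI of 3) come from the text.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information of a word pair, estimated from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log(p_xy / (p_x * p_y))


def select_phrases(candidates, total, min_count=3500, min_pmi=3.0):
    """Keep candidates that are both frequent enough and have high enough PMI."""
    kept = {}
    for phrase, (count_xy, count_x, count_y) in candidates.items():
        if count_xy < min_count:
            continue  # too rare in the sample to trust the estimate
        score = pmi(count_xy, count_x, count_y, total)
        if score >= min_pmi:
            kept[phrase] = score
    return kept


# Made-up counts: (pair count, first-word count, second-word count).
candidates = {
    "new york": (5000, 20000, 8000),
    "of the": (90000, 900000, 1200000),
    "rare pair": (10, 50, 40),
}
phrases = select_phrases(candidates, total=10_000_000)
```

 With these made-up counts only "new york" survives: "of the" is
 frequent but its words co-occur less often than independence would
 predict, and "rare pair" falls below the count threshold.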
 ## Cosine similarity
 Once the TF-IDF vectors are built, making a similarity score between
 two subreddits is straightforward using cosine similarity.
 $$\text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|} = \frac{\sum_{i=1}^{n}{A_i\,B_i}}{\sqrt{\sum_{i=1}^{n}{A_i^2}}\,\sqrt{\sum_{i=1}^{n}{B_i^2}}}$$
 Intuitively, we represent two subreddits as lines in a high-dimensional
 space (TF-IDF vectors). In linear algebra, the dot product ($\cdot$)
 between two vectors is the sum of the products of their corresponding
 entries (e.g. linear regression is a dot product of a vector of
 covariates and a vector of weights). The vectors might have different
 lengths (if one subreddit has more words in comments than the other),
 so in cosine similarity the dot product is normalized by the magnitudes
 (lengths) of the vectors. It turns out that this is equivalent to
 taking the cosine of the angle between the two vectors. So cosine
 similarity in essence quantifies the angle between the two lines in
 high-dimensional space. The greater the cosine similarity between two
 subreddits, the more correlated their TF-IDF vectors are.
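
 As a small worked example of the formula above, the sketch below
 computes cosine similarity for sparse vectors stored as Python
 dictionaries. The function name `cosine_similarity` and the toy weights
 are assumptions for illustration; the project code works on sparse
 matrices instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


# Made-up TF-IDF weights for three subreddits.
sub_a = {"bible": 1.0, "scripture": 2.0}
sub_b = {"bible": 2.0, "scripture": 4.0}  # same direction as sub_a, twice as long
sub_c = {"goal": 3.0, "scripture": 1.0}

sim_ab = cosine_similarity(sub_a, sub_b)
sim_ac = cosine_similarity(sub_a, sub_c)
```

 Here `sub_b` is just `sub_a` scaled by two, so the two point in the same
 direction and their similarity is 1 (up to rounding) despite the
 different lengths, while `sub_a` and `sub_c` share only one term and
 score much lower.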
 Cosine similarity with TF-IDF is popular (indeed it has been applied to
 Reddit in research several times before) because it quantifies the
 correlation between the most characteristic terms for two communities.
 Compared to other approaches to similarity like those using word
 embeddings or topic models it may struggle to handle polysemy, synonymy,
 or correlations between different terms. Using phrase detection helps
 with this a little bit. The advantages of this approach are simplicity
 and scalability. I'm thinking about using [latent semantic
 analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis) as an
 intermediate step to improve upon similarities based on raw TF-IDFs.
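
 A minimal sketch of what such an intermediate step could look like:
 reduce the subreddit vectors with a truncated SVD, then take cosine
 similarities in the reduced space. This sketch uses a dense
 `numpy.linalg.svd` on random toy data purely for illustration, which is
 a swapped-in technique and would not scale to the real matrices; the
 function name `lsi_cosine_similarities` and all values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1968)
# Toy term-by-subreddit matrix: 50 terms (rows), 6 subreddits (columns).
tfidf = rng.random((50, 6))


def lsi_cosine_similarities(term_by_doc, n_components):
    """Cosine similarities between documents after a rank-k SVD projection."""
    docs = term_by_doc.T  # one row per subreddit
    u, s, _vt = np.linalg.svd(docs, full_matrices=False)
    # Coordinates of each subreddit along the top n_components directions.
    reduced = u[:, :n_components] * s[:n_components]
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    normalized = reduced / norms
    return normalized @ normalized.T


sims = lsi_cosine_similarities(tfidf, n_components=3)
```

 The result is a square subreddit-by-subreddit matrix with ones on the
 diagonal, just like the raw TF-IDF similarities, but computed from a
 much lower-dimensional representation.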
 Even still, computing similarities between a large number of subreddits
 is computationally expensive and requires $n(n-1)/2$ dot-product
 evaluations. This can be sped up by passing
 `similarity-threshold=X` where $X>0$ into `cosine_similarities.py`. I
 used a cosine similarity function that's built into the Spark matrix
 library which supports the `DIMSUM` algorithm for approximating
 matrix-matrix products. This algorithm is commonly used in industry
 (e.g. at Twitter, Google) for large-scale similarity scoring.
 ## See also
 The CDSC wiki page
 [CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
 is the landing page for this project on the wiki. The methods writeup
 above used to live there; it now lives here so that doc and code stay
 in sync.
--- a/similarities/TODO
+++ b/similarities/TODO
@@ -1 +0,0 @@
 Try normalizing tf by the mean or std instead of the max to avoid penalizing subreddits with very active users.
--- a/similarities/cosine_similarities.py
+++ b/similarities/cosine_similarities.py
@@ -2,11 +2,14 @@ import pandas as pd
 import fire
 from pathlib import Path
 from similarities_helper import similarities, column_similarities
 from functools import partial
 def cosine_similarities(infile, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
    return similarities(infile=infile, simfunc=column_similarities, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=exclude_phrases,from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
 # change so that these take in an input as an optional argument (for speed, but also for idf).
 def term_cosine_similarities(outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
 def term_cosine_similarities(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet', min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
--- a/similarities/job_script.sh
+++ b/similarities/job_script.sh
@@ -1,4 +1,4 @@
 #!/usr/bin/bash
 start_spark_cluster.sh
-spark-submit --master spark://$(hostname):18899 cosine_similarities.py term --outfile=/gscratch/comdata/output/reddit_similarity/comment_terms_10000.feather
+singularity exec  /gscratch/comdata/users/nathante/containers/nathante.sif spark-submit --master spark://$(hostname):7077 tfidf.py authors --topN=100000 --inpath=/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet --outpath=/gscratch/scrubbed/comdata/reddit_similarity/tfidf/comment_authors_100k.parquet
-stop-all.sh
+singularity exec /gscratch/comdata/users/nathante/containers/nathante.sif stop-all.sh
--- a/similarities/lsi_similarities.py
+++ b/similarities/lsi_similarities.py
@@ -0,0 +1,86 @@
 import pandas as pd
 import fire
 from pathlib import Path
 from similarities_helper import *
 #from similarities_helper import similarities, lsi_column_similarities
 from functools import partial
 # inpath = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf/comment_authors_compex.parquet"
 # term_colname='authors'
 # outfile='/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_test_compex_LSI'
 # n_components=[10,50,100]
 # included_subreddits="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/included_subreddits.txt"
 # n_iter=5
 # random_state=1968
 # algorithm='randomized'
 # topN = None
 # from_date=None
 # to_date=None
 # min_df=None
 # max_df=None
 def lsi_similarities(inpath, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None, tfidf_colname='tf_idf',n_components=100,n_iter=5,random_state=1968,algorithm='arpack',lsi_model=None):
    print(n_components,flush=True)
    if lsi_model is None:
        if type(n_components) == list:
            lsi_model = Path(outfile) / f'{max(n_components)}_{term_colname}_LSIMOD.pkl'
        else:
            lsi_model = Path(outfile) / f'{n_components}_{term_colname}_LSIMOD.pkl'
    simfunc = partial(lsi_column_similarities,n_components=n_components,n_iter=n_iter,random_state=random_state,algorithm=algorithm,lsi_model_save=lsi_model)
    return similarities(inpath=inpath, simfunc=simfunc, term_colname=term_colname, outfile=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, from_date=from_date, to_date=to_date, tfidf_colname=tfidf_colname)
 # change so that these take in an input as an optional argument (for speed, but also for idf).
 def term_lsi_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet',outfile=None, min_df=None, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None, algorithm='arpack', n_components=300,n_iter=5,random_state=1968):
    res =  lsi_similarities(inpath,
                            'term',
                            outfile,
                            min_df,
                            max_df,
                            included_subreddits,
                            topN,
                            from_date,
                            to_date,
                            n_components=n_components,
                            algorithm = algorithm
                            )
    return res
 def author_lsi_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',outfile=None, min_df=2, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None,algorithm='arpack',n_components=300,n_iter=5,random_state=1968):
    return lsi_similarities(inpath,
                            'author',
                            outfile,
                            min_df,
                            max_df,
                            included_subreddits,
                            topN,
                            from_date=from_date,
                            to_date=to_date,
                            n_components=n_components
                               )
 def author_tf_similarities(inpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors_100k.parquet',outfile=None, min_df=2, max_df=None, included_subreddits=None, topN=None, from_date=None, to_date=None,algorithm='arpack',n_components=300,n_iter=5,random_state=1968):
    return lsi_similarities(inpath,
                            'author',
                            outfile,
                            min_df,
                            max_df,
                            included_subreddits,
                            topN,
                            from_date=from_date,
                            to_date=to_date,
                            tfidf_colname='relative_tf',
                            n_components=n_components,
                            algorithm=algorithm
                            )
 if __name__ == "__main__":
    fire.Fire({'term':term_lsi_similarities,
               'author':author_lsi_similarities,
               'author-tf':author_tf_similarities})
--- a/similarities/similarities_helper.py
+++ b/similarities/similarities_helper.py
@@ -2,143 +2,199 @@ from pyspark.sql import SparkSession
 from pyspark.sql import Window
 from pyspark.sql import functions as f
 from enum import Enum
 from multiprocessing import cpu_count, Pool
 from pyspark.mllib.linalg.distributed import CoordinateMatrix
 from tempfile import TemporaryDirectory
 import pyarrow
 import pyarrow.dataset as ds
 from sklearn.metrics import pairwise_distances
 from scipy.sparse import csr_matrix, issparse
 from sklearn.decomposition import TruncatedSVD
 import pandas as pd
 import numpy as np
 import pathlib
 from datetime import datetime
 from pathlib import Path
 import pickle
 class tf_weight(Enum):
    MaxTF = 1
    Norm05 = 2
-infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet"
+# infile = "/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet"
 # cache_file = "/gscratch/comdata/users/nathante/cdsc_reddit/similarities/term_tfidf_entries_bak.parquet"
-def reindex_tfidf_time_interval(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None):
+# subreddits missing after this step don't have any terms that have a high enough idf
-    term = term_colname
+# try rewriting without merges
    term_id = term + '_id'
    term_id_new = term + '_id_new'
-    spark = SparkSession.builder.getOrCreate()
+# does reindex_tfidf, but without reindexing.
-    conf = spark.sparkContext.getConf()
+def reindex_tfidf(*args, **kwargs):
-    print(exclude_phrases)
+    df, tfidf_ds, ds_filter = _pull_or_reindex_tfidf(*args, **kwargs, reindex=True)
    tfidf_weekly = spark.read.parquet(infile)
-    # create the time interval
+    print("assigning names")
-    if from_date is not None:
+    subreddit_names = tfidf_ds.to_table(filter=ds_filter,columns=['subreddit','subreddit_id'])
-        if type(from_date) is str:
+    batches = subreddit_names.to_batches()
            from_date = datetime.fromisoformat(from_date)
-        tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week >= from_date)
+    with Pool(cpu_count()) as pool:
        chunks = pool.imap_unordered(pull_names,batches) 
        subreddit_names = pd.concat(chunks,copy=False).drop_duplicates()
        subreddit_names = subreddit_names.set_index("subreddit_id")
-    if to_date is not None:
+    new_ids = df.loc[:,['subreddit_id','subreddit_id_new']].drop_duplicates()
-        if type(to_date) is str:
+    new_ids = new_ids.set_index('subreddit_id')
-            to_date = datetime.fromisoformat(to_date)
+    subreddit_names = subreddit_names.join(new_ids,on='subreddit_id').reset_index()
-        tfidf_weekly = tfidf_weekly.filter(tfidf_weekly.week < to_date)
+    subreddit_names = subreddit_names.drop("subreddit_id",1)
    tfidf = tfidf_weekly.groupBy(["subreddit","week", term_id, term]).agg(f.sum("tf").alias("tf"))
    tfidf = _calc_tfidf(tfidf, term_colname, tf_weight.Norm05)
    tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
    tfidf = spark.read_parquet(tempdir.name)
    subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
-    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
+    return(df, subreddit_names)
    return(tempdir, subreddit_names)
-def reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False):
+def pull_tfidf(*args, **kwargs):
-    spark = SparkSession.builder.getOrCreate()
+    df, _, _ =  _pull_or_reindex_tfidf(*args, **kwargs, reindex=False)
-    conf = spark.sparkContext.getConf()
+    return df
    print(exclude_phrases)
-    tfidf = spark.read.parquet(infile)
+def _pull_or_reindex_tfidf(infile, term_colname, min_df=None, max_df=None, included_subreddits=None, topN=500, week=None, from_date=None, to_date=None, rescale_idf=True, tf_family=tf_weight.MaxTF, reindex=True):
    print(f"loading tfidf {infile}", flush=True)
    if week is not None:
        tfidf_ds = ds.dataset(infile, partitioning='hive')
    else: 
        tfidf_ds = ds.dataset(infile)
    if included_subreddits is None:
        included_subreddits = select_topN_subreddits(topN)
    else:
-        included_subreddits = set(map(str.strip,map(str.lower,open(included_subreddits))))
+        included_subreddits = set(map(str.strip,open(included_subreddits)))
-    if exclude_phrases == True:
+    ds_filter = ds.field("subreddit").isin(included_subreddits)
        tfidf = tfidf.filter(~f.col(term_colname).contains("_"))
-    print("creating temporary parquet with matrix indicies")
+    if min_df is not None:
-    tempdir = prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits)
+        ds_filter &= ds.field("count") >= min_df
-    tfidf = spark.read.parquet(tempdir.name)
+    if max_df is not None:
-    subreddit_names = tfidf.select(['subreddit','subreddit_id_new']).distinct().toPandas()
+        ds_filter &= ds.field("count") <= max_df
    if week is not None:
        ds_filter &= ds.field("week") == week
    if from_date is not None:
        ds_filter &= ds.field("week") >= from_date
    if to_date is not None:
        ds_filter &= ds.field("week") <= to_date
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    projection = {
        'subreddit_id':ds.field('subreddit_id'),
        term_id:ds.field(term_id),
        'relative_tf':ds.field("relative_tf").cast('float32')
        }
    if not rescale_idf:
        projection = {
            'subreddit_id':ds.field('subreddit_id'),
            term_id:ds.field(term_id),
            'relative_tf':ds.field('relative_tf').cast('float32'),
            'tf_idf':ds.field('tf_idf').cast('float32')}
        print(projection)
    df = tfidf_ds.to_table(filter=ds_filter,columns=projection)
    df = df.to_pandas(split_blocks=True,self_destruct=True)
    print("assigning indexes",flush=True)
    if reindex:
        df['subreddit_id_new'] = df.groupby("subreddit_id").ngroup()
    else:
        df['subreddit_id_new'] = df['subreddit_id']
    if reindex:
        grouped = df.groupby(term_id)
        df[term_id_new] = grouped.ngroup()
    else:
        df[term_id_new] = df[term_id]
    if rescale_idf:
        print("computing idf", flush=True)
        df['new_count'] = grouped[term_id].transform('count')
        N_docs = df.subreddit_id_new.max() + 1
        df['idf'] = np.log(N_docs/(1+df.new_count),dtype='float32') + 1
        if tf_family == tf_weight.MaxTF:
            df["tf_idf"] = df.relative_tf * df.idf
        else: # tf_fam = tf_weight.Norm05
            df["tf_idf"] = (0.5 + 0.5 * df.relative_tf) * df.idf
    return (df, tfidf_ds, ds_filter)
    with Pool(cpu_count()) as pool:
        chunks = pool.imap_unordered(pull_names,batches) 
        subreddit_names = pd.concat(chunks,copy=False).drop_duplicates()
    subreddit_names = subreddit_names.set_index("subreddit_id")
    new_ids = df.loc[:,['subreddit_id','subreddit_id_new']].drop_duplicates()
    new_ids = new_ids.set_index('subreddit_id')
    subreddit_names = subreddit_names.join(new_ids,on='subreddit_id').reset_index()
    subreddit_names = subreddit_names.drop("subreddit_id",1)
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
-    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
+    return(df, subreddit_names)
    spark.stop()
    return (tempdir, subreddit_names)
 def pull_names(batch):
    return(batch.to_pandas().drop_duplicates())
-def similarities(infile, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, exclude_phrases=False, from_date=None, to_date=None, tfidf_colname='tf_idf'):
+def similarities(inpath, simfunc, term_colname, outfile, min_df=None, max_df=None, included_subreddits=None, topN=500, from_date=None, to_date=None, tfidf_colname='tf_idf'):
    '''
    tfidf_colname: set to 'relative_tf' to use normalized term frequency instead of tf-idf, which can be useful for author-based similarities.
    '''
    if from_date is not None or to_date is not None:
        tempdir, subreddit_names = reindex_tfidf_time_interval(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=False, from_date=from_date, to_date=to_date)
    else:
        tempdir, subreddit_names = reindex_tfidf(infile, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN, exclude_phrases=False)
    print("loading matrix")
    #    mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
    mat = read_tfidf_matrix(tempdir.name, term_colname, tfidf_colname)
    print(f'computing similarities on mat. mat.shape:{mat.shape}')
    print(f"size of mat is:{mat.data.nbytes}")
    sims = simfunc(mat)
    del mat
    def proc_sims(sims, outfile):
        if issparse(sims):
            sims = sims.todense()
        print(f"shape of sims:{sims.shape}")
-    print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}")
+        print(f"len(subreddit_names.subreddit.values):{len(subreddit_names.subreddit.values)}",flush=True)
        sims = pd.DataFrame(sims)
        sims = sims.rename({i:sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
-    sims['subreddit'] = subreddit_names.subreddit.values
+        sims['_subreddit'] = subreddit_names.subreddit.values
        p = Path(outfile)
        output_feather =  Path(str(p).replace("".join(p.suffixes), ".feather"))
        output_csv =  Path(str(p).replace("".join(p.suffixes), ".csv"))
        output_parquet =  Path(str(p).replace("".join(p.suffixes), ".parquet"))
        p.parent.mkdir(exist_ok=True, parents=True)
        sims.to_feather(outfile)
    tempdir.cleanup()
 def read_tfidf_matrix_weekly(path, term_colname, week, tfidf_colname='tf_idf'):
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
-    dataset = ds.dataset(path,format='parquet')
+    entries, subreddit_names = reindex_tfidf(inpath, term_colname=term_colname, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=topN,from_date=from_date,to_date=to_date)
-    entries = dataset.to_table(columns=[tfidf_colname,'subreddit_id_new', term_id_new],filter=ds.field('week')==week).to_pandas()
+    mat = csr_matrix((entries[tfidf_colname],(entries[term_id_new], entries.subreddit_id_new)))
    return(csr_matrix((entries[tfidf_colname], (entries[term_id_new]-1, entries.subreddit_id_new-1))))
-def read_tfidf_matrix(path, term_colname, tfidf_colname='tf_idf'):
+    print("loading matrix")        
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    dataset = ds.dataset(path,format='parquet')
    print(f"tfidf_colname:{tfidf_colname}")
    entries = dataset.to_table(columns=[tfidf_colname, 'subreddit_id_new',term_id_new]).to_pandas()
    return(csr_matrix((entries[tfidf_colname],(entries[term_id_new]-1, entries.subreddit_id_new-1))))
    #    mat = read_tfidf_matrix("term_tfidf_entries7ejhvnvl.parquet", term_colname)
    print(f'computing similarities on mat. mat.shape:{mat.shape}')
    print(f"size of mat is:{mat.data.nbytes}",flush=True)
    sims = simfunc(mat)
    del mat
    if hasattr(sims,'__next__'):
        for simmat, name in sims:
            proc_sims(simmat, Path(outfile)/(str(name) + ".feather"))
    else:
        proc_sims(sims, outfile)
 def write_weekly_similarities(path, sims, week, names):
    sims['week'] = week
    p = pathlib.Path(path)
    if not p.is_dir():
-        p.mkdir()
+        p.mkdir(exist_ok=True,parents=True)
    # reformat as a pairwise list
-    sims = sims.melt(id_vars=['subreddit','week'],value_vars=names.subreddit.values)
+    sims = sims.melt(id_vars=['_subreddit','week'],value_vars=names.subreddit.values)
    sims.to_parquet(p / week.isoformat())
 def column_overlaps(mat):
@@ -150,136 +206,76 @@ def column_overlaps(mat):
    return intersection / den
 def test_lsi_sims():
    term = "term"
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    t1 = time.perf_counter()
    entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet",
                                             term_colname='term',
                                             min_df=2000,
                                             topN=10000
                                             )
    t2 = time.perf_counter()
    print(f"first load took:{t2 - t1}s")
    entries, subreddit_names = reindex_tfidf("/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet",
                                             term_colname='term',
                                             min_df=2000,
                                             topN=10000
                                             )
    t3=time.perf_counter()
    print(f"second load took:{t3 - t2}s")
    mat = csr_matrix((entries['tf_idf'],(entries[term_id_new], entries.subreddit_id_new)))
    sims = list(lsi_column_similarities(mat, [10,50]))
    sims_og = sims
    sims_test = list(lsi_column_similarities(mat,[10,50],algorithm='randomized',n_iter=10))
 # n_components is the latent dimensionality. sklearn recommends 100. More might be better
 # if n_components is a list we'll return a list of similarities with different latent dimensionalities
 # if algorithm is 'randomized' instead of 'arpack' then n_iter gives the number of iterations.
 # this function takes the svd and then the column similarities of it
 def lsi_column_similarities(tfidfmat,n_components=300,n_iter=10,random_state=1968,algorithm='randomized',lsi_model_save=None,lsi_model_load=None):
    # first compute the lsi of the matrix
    # then take the column similarities
    if type(n_components) is int:
        n_components = [n_components]
    n_components = sorted(n_components,reverse=True)
    svd_components = n_components[0]
    if lsi_model_load is not None and Path(lsi_model_load).exists():
        print("loading LSI")
        mod = pickle.load(open(lsi_model_load ,'rb'))
        lsi_model_save = lsi_model_load
    else:
        print("running LSI",flush=True)
        svd = TruncatedSVD(n_components=svd_components,random_state=random_state,algorithm=algorithm,n_iter=n_iter)
        mod = svd.fit(tfidfmat.T)
    lsimat = mod.transform(tfidfmat.T)
    if lsi_model_save is not None:
        Path(lsi_model_save).parent.mkdir(exist_ok=True, parents=True)
        pickle.dump(mod, open(lsi_model_save,'wb'))
    sims_list = []
    for n_dims in n_components:
        sims = column_similarities(lsimat[:,np.arange(n_dims)])
        if len(n_components) > 1:
            yield (sims, n_dims)
        else:
            return sims
 def column_similarities(mat):
-    norm = np.matrix(np.power(mat.power(2).sum(axis=0),0.5,dtype=np.float32))
+    return 1 - pairwise_distances(mat,metric='cosine')
    mat = mat.multiply(1/norm)
    sims = mat.T @ mat
    return(sims)
 def prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df, included_subreddits):
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    if min_df is None:
        min_df = 0.1 * len(included_subreddits)
        tfidf = tfidf.filter(f.col('count') >= min_df)
    if max_df is not None:
        tfidf = tfidf.filter(f.col('count') <= max_df)
    tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
    # we might not have the same terms or subreddits each week, so we need to make unique ids for each week.
    sub_ids = tfidf.select(['subreddit_id','week']).distinct()
    sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.partitionBy('week').orderBy("subreddit_id")))
    tfidf = tfidf.join(sub_ids,['subreddit_id','week'])
    # only use terms in at least min_df included subreddits in a given week
    new_count = tfidf.groupBy([term_id,'week']).agg(f.count(term_id).alias('new_count'))
    tfidf = tfidf.join(new_count,[term_id,'week'],how='inner')
    # reset the term ids
    term_ids = tfidf.select([term_id,'week']).distinct()
    term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.partitionBy('week').orderBy(term_id)))
    tfidf = tfidf.join(term_ids,[term_id,'week'])
    tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
    tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
    tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
    tfidf = tfidf.repartition('week')
    tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
    return(tempdir)
 def prep_tfidf_entries(tfidf, term_colname, min_df, max_df, included_subreddits):
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    if min_df is None:
        min_df = 0.1 * len(included_subreddits)
        tfidf = tfidf.filter(f.col('count') >= min_df)
    if max_df is not None:
        tfidf = tfidf.filter(f.col('count') <= max_df)
    tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
    # reset the subreddit ids
    sub_ids = tfidf.select('subreddit_id').distinct()
    sub_ids = sub_ids.withColumn("subreddit_id_new", f.row_number().over(Window.orderBy("subreddit_id")))
    tfidf = tfidf.join(sub_ids,'subreddit_id')
    # only use terms in at least min_df included subreddits
    new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
    tfidf = tfidf.join(new_count,term_id,how='inner')
    # reset the term ids
    term_ids = tfidf.select([term_id]).distinct()
    term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
    tfidf = tfidf.join(term_ids,term_id)
    tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
    tfidf = tfidf.withColumn("tf_idf", (tfidf.relative_tf * tfidf.idf).cast('float'))
    tempdir =TemporaryDirectory(suffix='.parquet',prefix='term_tfidf_entries',dir='.')
    tfidf.write.parquet(tempdir.name,mode='overwrite',compression='snappy')
    return tempdir
 # try computing cosine similarities using spark
 def spark_cosine_similarities(tfidf, term_colname, min_df, included_subreddits, similarity_threshold):
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    if min_df is None:
        min_df = 0.1 * len(included_subreddits)
    tfidf = tfidf.filter(f.col("subreddit").isin(included_subreddits))
    tfidf = tfidf.cache()
    # reset the subreddit ids
    sub_ids = tfidf.select('subreddit_id').distinct()
    sub_ids = sub_ids.withColumn("subreddit_id_new",f.row_number().over(Window.orderBy("subreddit_id")))
    tfidf = tfidf.join(sub_ids,'subreddit_id')
    # only use terms in at least min_df included subreddits
    new_count = tfidf.groupBy(term_id).agg(f.count(term_id).alias('new_count'))
    tfidf = tfidf.join(new_count,term_id,how='inner')
    # reset the term ids
    term_ids = tfidf.select([term_id]).distinct()
    term_ids = term_ids.withColumn(term_id_new,f.row_number().over(Window.orderBy(term_id)))
    tfidf = tfidf.join(term_ids,term_id)
    tfidf = tfidf.withColumnRenamed("tf_idf","tf_idf_old")
    tfidf = tfidf.withColumn("tf_idf", tfidf.relative_tf * tfidf.idf)
    # step 1 make an rdd of entries
    # sorted by (dense) spark subreddit id
    n_partitions = int(len(included_subreddits)*2 / 5)
    entries = tfidf.select(f.col(term_id_new)-1,f.col("subreddit_id_new")-1,"tf_idf").rdd.repartition(n_partitions)
    # put like 10 subreddits in each partition
    # step 2 make it into a distributed.RowMatrix
    coordMat = CoordinateMatrix(entries)
    coordMat = CoordinateMatrix(coordMat.entries.repartition(n_partitions))
    # this needs to be an IndexedRowMatrix()
    mat = coordMat.toRowMatrix()
    #goal: build a matrix of subreddit columns and tf-idfs rows
    sim_dist = mat.columnSimilarities(threshold=similarity_threshold)
    return (sim_dist, tfidf)
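The "reset the subreddit ids" and "reset the term ids" steps above replace sparse ids with dense 1-based ranks (`row_number()` over an ordered window) and then subtract 1 to get zero-based matrix coordinates. A minimal pure-Python sketch of that idea follows. The helper name `dense_ids` and the sample rows are hypothetical and only illustrate the re-indexing, not the Spark execution.

```python
def dense_ids(values):
    """Map each distinct value to its 1-based rank in sorted order,
    mirroring row_number() over a window ordered by that value."""
    return {v: i for i, v in enumerate(sorted(set(values)), start=1)}


# Hypothetical rows that survived filtering; the original ids have gaps.
rows = [
    {"subreddit_id": 17, "term_id": 904, "tf_idf": 0.5},
    {"subreddit_id": 3, "term_id": 12, "tf_idf": 1.25},
    {"subreddit_id": 17, "term_id": 12, "tf_idf": 0.75},
]

sub_map = dense_ids(r["subreddit_id"] for r in rows)
term_map = dense_ids(r["term_id"] for r in rows)

# (row, column, value) triples with zero-based coordinates:
# terms index the rows and subreddits index the columns.
entries = [
    (term_map[r["term_id"]] - 1, sub_map[r["subreddit_id"]] - 1, r["tf_idf"])
    for r in rows
]
```

Dense ids keep the matrix shape equal to the number of surviving terms and subreddits instead of the largest original id.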
 def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
@@ -306,20 +302,20 @@ def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weig
    idf = idf.withColumn('idf',f.log(idf.subreddits_in_week) / (1+f.col('count'))+1)
    # collect the dictionary to make a pydict of terms to indexes
-    terms = idf.select([term,'week']).distinct() # terms are distinct
+    terms = idf.select([term]).distinct() # terms are distinct
-    terms = terms.withColumn(term_id,f.row_number().over(Window.partitionBy('week').orderBy(term))) # term ids are distinct
+    terms = terms.withColumn(term_id,f.row_number().over(Window.orderBy(term))) # term ids are distinct
    # make subreddit ids
-    subreddits = df.select(['subreddit','week']).distinct()
+    subreddits = df.select(['subreddit']).distinct()
-    subreddits = subreddits.withColumn('subreddit_id',f.row_number().over(Window.partitionBy("week").orderBy("subreddit")))
+    subreddits = subreddits.withColumn('subreddit_id',f.row_number().over(Window.orderBy("subreddit")))
-    df = df.join(subreddits,on=['subreddit','week'])
+    df = df.join(subreddits,on=['subreddit'])
    # map terms to indexes in the tfs and the idfs
-    df = df.join(terms,on=[term,'week']) # subreddit-term-id is unique
+    df = df.join(terms,on=[term]) # subreddit-term-id is unique
-    idf = idf.join(terms,on=[term,'week'])
+    idf = idf.join(terms,on=[term])
    # join on subreddit/term to create tf/dfs indexed by term
    df = df.join(idf, on=[term_id, term,'week'])
@@ -331,7 +327,9 @@ def build_weekly_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weig
    else: # tf_fam = tf_weight.Norm05
        df = df.withColumn("tf_idf",  (0.5 + 0.5 * df.relative_tf) * df.idf)
-    return df
+    df = df.repartition(400,'subreddit','week')
    dfwriter = df.write.partitionBy("week")
    return dfwriter
 def _calc_tfidf(df, term_colname, tf_family):
    term = term_colname
@@ -342,7 +340,7 @@ def _calc_tfidf(df, term_colname, tf_family):
    df = df.join(max_subreddit_terms, on='subreddit')
-    df = df.withColumn("relative_tf", df.tf / df.sr_max_tf)
+    df = df.withColumn("relative_tf", (df.tf / df.sr_max_tf))
    # group by term. term is unique
    idf = df.groupby([term]).count()
@@ -377,7 +375,7 @@ def _calc_tfidf(df, term_colname, tf_family):
    return df
-def build_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
+def tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm05):
    term = term_colname
    term_id = term + '_id'
    # aggregate counts by week. now subreddit-term is distinct
@@ -385,10 +383,28 @@ def build_tfidf_dataset(df, include_subs, term_colname, tf_family=tf_weight.Norm
    df = df.groupBy(['subreddit',term]).agg(f.sum('tf').alias('tf'))
    df = _calc_tfidf(df, term_colname, tf_family)
-
+    df = df.repartition('subreddit')
-    return df
+    dfwriter = df.write
    return dfwriter
 def select_topN_subreddits(topN, path="/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments_nonsfw.csv"):
    rankdf = pd.read_csv(path)
    included_subreddits = set(rankdf.loc[rankdf.comments_rank <= topN,'subreddit'].values)
    return included_subreddits
 def repartition_tfidf(inpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k.parquet",
                      outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_100k_repartitioned.parquet"):
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(inpath)
    df = df.repartition(400,'subreddit')
    df.write.parquet(outpath,mode='overwrite')
 def repartition_tfidf_weekly(inpath="/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet",
                      outpath="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms_repartitioned.parquet"):
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(inpath)
    df = df.repartition(400,'subreddit','week')
    dfwriter = df.write.partitionBy("week")
    dfwriter.parquet(outpath,mode='overwrite')
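The weighting visible in these hunks can be sketched without Spark. The sketch assumes three things taken from the diff as written: `relative_tf = tf / sr_max_tf`, the `Norm05` weight `0.5 + 0.5 * relative_tf`, and the idf expression from `build_weekly_tfidf_dataset`, `log(N) / (1 + count) + 1`. The function name `tfidf_weights` and the input layout are hypothetical, and the idf used by `_calc_tfidf` is not shown in the diff, so this is an illustration rather than the project's exact computation.

```python
import math
from collections import defaultdict


def tfidf_weights(counts, norm05=True):
    """counts maps (subreddit, term) -> raw term frequency.
    Returns a dict mapping (subreddit, term) -> tf-idf weight."""
    max_tf = defaultdict(int)    # largest tf within each subreddit
    doc_freq = defaultdict(int)  # number of subreddits using each term
    for (sub, term), tf in counts.items():
        max_tf[sub] = max(max_tf[sub], tf)
        doc_freq[term] += 1
    n_subs = len(max_tf)

    weights = {}
    for (sub, term), tf in counts.items():
        relative_tf = tf / max_tf[sub]
        # Same operator precedence as the expression in the diff.
        idf = math.log(n_subs) / (1 + doc_freq[term]) + 1
        tf_part = 0.5 + 0.5 * relative_tf if norm05 else relative_tf
        weights[(sub, term)] = tf_part * idf
    return weights


example = tfidf_weights({("a", "x"): 4, ("a", "y"): 2, ("b", "x"): 1})
```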
--- a/similarities/tfidf.py
+++ b/similarities/tfidf.py
@@ -1,9 +1,12 @@
 import fire
 from pyspark.sql import SparkSession
 from pyspark.sql import functions as f
-from similarities_helper import build_tfidf_dataset, build_weekly_tfidf_dataset, select_topN_subreddits
+from similarities_helper import tfidf_dataset, build_weekly_tfidf_dataset, select_topN_subreddits
 from functools import partial
-def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_subreddits):
+inpath = '/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf/comment_authors_compex.parquet'
 # include_terms is a path to a parquet file that contains a column of term_colname + '_id' to include.
 def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_subreddits, included_terms=None, min_df=None, max_df=None):
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet(inpath)
@@ -11,65 +14,91 @@ def _tfidf_wrapper(func, inpath, outpath, topN, term_colname, exclude, included_
    df = df.filter(~ f.col(term_colname).isin(exclude))
    if included_subreddits is not None:
-        include_subs = set(map(str.strip,map(str.lower, open(included_subreddits))))
+        include_subs = set(map(str.strip,open(included_subreddits)))
    else:
        include_subs = select_topN_subreddits(topN)
-    df = func(df, include_subs, term_colname)
+    include_subs = spark.sparkContext.broadcast(include_subs)
-    df.write.parquet(outpath,mode='overwrite',compression='snappy')
+    #    term_id = term_colname + "_id"
    if included_terms is not None:
        terms_df = spark.read.parquet(included_terms)
        terms_df = terms_df.select(term_colname).distinct()
        df = df.join(terms_df, on=term_colname, how='left_semi')
    dfwriter = func(df, include_subs.value, term_colname)
    dfwriter.parquet(outpath,mode='overwrite',compression='snappy')
    spark.stop()
-def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits):
+def tfidf(inpath, outpath, topN, term_colname, exclude, included_subreddits, min_df, max_df):
-    return _tfidf_wrapper(build_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits)
+    tfidf_func = partial(tfidf_dataset, max_df=max_df, min_df=min_df)
    return _tfidf_wrapper(tfidf_func, inpath, outpath, topN, term_colname, exclude, included_subreddits)
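`functools.partial` is used here to pre-bind `min_df` and `max_df` so that the wrapper can keep calling `func(df, include_subs, term_colname)`. A small sketch of that pattern follows. The functions `build_dataset`, `no_thresholds` and `wrapper` are hypothetical stand-ins, not the project's code. The point to check is that the wrapped builder must itself accept the pre-bound keywords, because otherwise the error only surfaces when the partial is finally called.

```python
from functools import partial


def build_dataset(df, include_subs, term_colname, min_df=None, max_df=None):
    # Hypothetical builder that accepts the thresholds as keywords.
    return {"term": term_colname, "min_df": min_df, "max_df": max_df}


def no_thresholds(df, include_subs, term_colname):
    # Hypothetical builder with no threshold parameters.
    return {"term": term_colname}


def wrapper(func, df, include_subs, term_colname):
    # The wrapper only knows the three positional arguments.
    return func(df, include_subs, term_colname)


result = wrapper(partial(build_dataset, min_df=2, max_df=100), [], {"a"}, "author")

# Creating the partial succeeds, but calling it fails with TypeError.
try:
    wrapper(partial(no_thresholds, min_df=2), [], {"a"}, "author")
    raised = False
except TypeError:
    raised = True
```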
-def tfidf_weekly(inpath, outpath, topN, term_colname, exclude, included_subreddits):
+def tfidf_weekly(inpath, outpath, static_tfidf_path, topN, term_colname, exclude, included_subreddits):
-    return _tfidf_wrapper(build_weekly_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits)
+    return _tfidf_wrapper(build_weekly_tfidf_dataset, inpath, outpath, topN, term_colname, exclude, included_subreddits, included_terms=static_tfidf_path)
 def tfidf_authors(outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet',
                  topN=25000,
                  included_subreddits=None):
-    return tfidf("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
+def tfidf_authors(inpath="/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
                  outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet',
                  topN=None,
                  included_subreddits=None,
                  min_df=None,
                  max_df=None):
    return tfidf(inpath,
                 outpath,
                 topN,
                 'author',
                 ['[deleted]','AutoModerator'],
-                 included_subreddits=included_subreddits
+                 included_subreddits=included_subreddits,
                 min_df=min_df,
                 max_df=max_df
                 )
-def tfidf_terms(outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet',
+def tfidf_terms(inpath="/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
-                topN=25000,
+                outpath='/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet',
-                included_subreddits=None):
+                topN=None,
                included_subreddits=None,
                min_df=None,
                max_df=None):
-    return tfidf("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
+    return tfidf(inpath,
                 outpath,
                 topN,
                 'term',
                 [],
-                 included_subreddits=included_subreddits
+                 included_subreddits=included_subreddits,
                 min_df=min_df,
                 max_df=max_df
                 )
-def tfidf_authors_weekly(outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
+def tfidf_authors_weekly(inpath="/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
-                         topN=25000,
+                         static_tfidf_path="/gscratch/comdata/output/reddit_similarity/tfidf/comment_authors.parquet",
                         outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
                         topN=None,
                         included_subreddits=None):
-    return tfidf_weekly("/gscratch/comdata/output/reddit_ngrams/comment_authors.parquet",
+    return tfidf_weekly(inpath,
                        outpath,
                        static_tfidf_path,
                        topN,
                        'author',
                        ['[deleted]','AutoModerator'],
                        included_subreddits=included_subreddits
                        )
-def tfidf_terms_weekly(outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
+def tfidf_terms_weekly(inpath="/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
-                       topN=25000,
+                       static_tfidf_path="/gscratch/comdata/output/reddit_similarity/tfidf/comment_terms.parquet",
                       outpath='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
                       topN=None,
                       included_subreddits=None):
-    return tfidf_weekly("/gscratch/comdata/output/reddit_ngrams/comment_terms.parquet",
+    return tfidf_weekly(inpath,
                        outpath,
                        static_tfidf_path,
                        topN,
                        'term',
                        [],
--- a/similarities/top_subreddits_by_comments.py
+++ b/similarities/top_subreddits_by_comments.py
@@ -26,4 +26,4 @@ df = df.toPandas()
 df = df.sort_values("n_comments")
-df.to_csv('/gscratch/comdata/output/reddit_similarity/subreddits_by_num_comments.csv', index=False)
+df.to_csv('/gscratch/scrubbed/comdata/reddit_similarity/subreddits_by_num_comments_nonsfw.csv', index=False)
--- a/similarities/weekly_cosine_similarities.py
+++ b/similarities/weekly_cosine_similarities.py
@@ -1,81 +1,149 @@
 #!/usr/bin/env python3
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
 from pyspark.sql import Window
 import numpy as np
 import pyarrow
 import pyarrow.dataset as ds
 import pandas as pd
 import fire
-from itertools import islice
+from itertools import islice, chain
 from pathlib import Path
-from similarities_helper import *
+from similarities_helper import pull_tfidf, column_similarities, write_weekly_similarities, lsi_column_similarities
 from scipy.sparse import csr_matrix
 from multiprocessing import Pool, cpu_count
 from functools import partial
 import pickle
-def _week_similarities(tempdir, term_colname, week):
+# tfidf_path = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity_weekly/comment_authors_tfidf.parquet"
 # #tfidf_path = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data//comment_authors_compex.parquet"
 # min_df=2
 # included_subreddits="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/included_subreddits.txt"
 # max_df = None
 # topN=100
 # term_colname='author'
 # # outfile = '/gscratch/comdata/output/reddit_similarity/weekly/comment_authors_test.parquet'
 # # included_subreddits=None
 outfile="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity_weekly/comment_authors.parquet"; infile="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf_weekly/comment_authors_tfidf.parquet"; included_subreddits="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/included_subreddits.txt"; lsi_model="/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_authors_compex_LSI/2000_authors_LSIMOD.pkl"; n_components=1500; algorithm="randomized"; term_colname='author'; tfidf_path=infile; random_state=1968;
 # static_tfidf = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/tfidf/comment_authors_compex.parquet"
 # dftest = spark.read.parquet(static_tfidf)
 def _week_similarities(week, simfunc, tfidf_path, term_colname, included_subreddits, outdir:Path, subreddit_names, nterms, topN=None, min_df=None, max_df=None):
    term = term_colname
    term_id = term + '_id'
    term_id_new = term + '_id_new'
    print(f"loading matrix: {week}")
-        mat = read_tfidf_matrix_weekly(tempdir.name, term_colname, week)
+
    entries = pull_tfidf(infile = tfidf_path,
                         term_colname=term_colname,
                         included_subreddits=included_subreddits,
                         topN=topN,
                         week=week.isoformat(),
                         rescale_idf=False)
    tfidf_colname='tf_idf'
    # if the max subreddit id we found is less than the number of subreddit names then we have to fill in 0s
    mat = csr_matrix((entries[tfidf_colname],(entries[term_id_new]-1, entries.subreddit_id_new-1)),shape=(nterms,subreddit_names.shape[0]))
    print('computing similarities')
-        sims = column_similarities(mat)
+    print(simfunc)
    sims = simfunc(mat)
    del mat
    sims = next(sims)[0]
    sims = pd.DataFrame(sims)
    sims = sims.rename({i: sr for i, sr in enumerate(subreddit_names.subreddit.values)}, axis=1)
    sims['_subreddit'] = subreddit_names.subreddit.values
    outfile = str(Path(outdir) / str(week))
    write_weekly_similarities(outfile, sims, week, subreddit_names)
-        names = subreddit_names.loc[subreddit_names.week == week]
+def pull_weeks(batch):
-        sims = pd.DataFrame(sims.todense())
+    return set(batch.to_pandas()['week'])
-        sims = sims.rename({i: sr for i, sr in enumerate(names.subreddit.values)}, axis=1)
+# This requires a prefit LSI model, since we shouldn't fit different LSI models for every week. 
-        sims['_subreddit'] = names.subreddit.values
+def cosine_similarities_weekly_lsi(*args, n_components=100, lsi_model=None, **kwargs):
    print(args)
    print(kwargs)
    term_colname= kwargs.get('term_colname')
    # lsi_model = "/gscratch/comdata/users/nathante/competitive_exclusion_reddit/data/similarity/comment_authors_compex_LSI/1000_author_LSIMOD.pkl"
-        write_weekly_similarities(outfile, sims, week, names)
+    lsi_model = pickle.load(open(lsi_model,'rb'))
    #simfunc = partial(lsi_column_similarities,n_components=n_components,random_state=random_state,algorithm='randomized',lsi_model=lsi_model)
    simfunc = partial(lsi_column_similarities,n_components=n_components,random_state=kwargs.get('random_state'),lsi_model=lsi_model)
    return cosine_similarities_weekly(*args, simfunc=simfunc, **kwargs)
 #tfidf = spark.read.parquet('/gscratch/comdata/users/nathante/subreddit_tfidf_weekly.parquet')
-def cosine_similarities_weekly(tfidf_path, outfile, term_colname, min_df = None, included_subreddits = None, topN = 500):
+def cosine_similarities_weekly(tfidf_path, outfile, term_colname, included_subreddits = None, topN = None, simfunc=column_similarities, min_df=None,max_df=None):
    spark = SparkSession.builder.getOrCreate()
    conf = spark.sparkContext.getConf()
    print(outfile)
    tfidf = spark.read.parquet(tfidf_path)
    if included_subreddits is None:
        included_subreddits = select_topN_subreddits(topN)
    else:
        included_subreddits = set(open(included_subreddits))
    print(f"computing weekly similarities for {len(included_subreddits)} subreddits")
    print("creating temporary parquet with matrix indices")
    tempdir = prep_tfidf_entries_weekly(tfidf, term_colname, min_df, max_df=None, included_subreddits=included_subreddits)
    tfidf = spark.read.parquet(tempdir.name)
    # the ids can change each week.
    subreddit_names = tfidf.select(['subreddit','subreddit_id_new','week']).distinct().toPandas()
    subreddit_names = subreddit_names.sort_values("subreddit_id_new")
    subreddit_names['subreddit_id_new'] = subreddit_names['subreddit_id_new'] - 1
    spark.stop()
    weeks = sorted(list(subreddit_names.week.drop_duplicates()))
    # do this step in parallel if we have the memory for it.
    # should be doable with pool.map
-    def week_similarities_helper(week):
+    spark = SparkSession.builder.getOrCreate()
-        _week_similarities(tempdir, term_colname, week)
+    df = spark.read.parquet(tfidf_path)
-    with Pool(cpu_count()) as pool: # maybe it can be done with 40 cores on the huge machine?
+    # load subreddits + topN
        list(pool.map(week_similarities_helper,weeks))
-def author_cosine_similarities_weekly(outfile, min_df=2 , included_subreddits=None, topN=500):
+    subreddit_names = df.select(['subreddit','subreddit_id']).distinct().toPandas()
-    return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors.parquet',
+    subreddit_names = subreddit_names.sort_values("subreddit_id")
    nterms = df.select(f.max(f.col(term_colname + "_id")).alias('max')).collect()[0].max
    weeks = df.select(f.col("week")).distinct().toPandas().week.values
    spark.stop()
    print(f"computing weekly similarities")
    week_similarities_helper = partial(_week_similarities,simfunc=simfunc, tfidf_path=tfidf_path, term_colname=term_colname, outdir=outfile, min_df=min_df, max_df=max_df, included_subreddits=included_subreddits, topN=None, subreddit_names=subreddit_names,nterms=nterms)
    for week in weeks:
        week_similarities_helper(week)
    # pool = Pool(cpu_count())
    # list(pool.imap(week_similarities_helper, weeks))
    # pool.close()
    #    with Pool(cpu_count()) as pool: # maybe it can be done with 40 cores on the huge machine?
 def author_cosine_similarities_weekly(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors_test.parquet', min_df=2, max_df=None, included_subreddits=None, topN=500):
    return cosine_similarities_weekly(infile,
                                      outfile,
                                      'author',
-                                      min_df,
+                                      max_df,
                                      included_subreddits,
-                                      topN)
+                                      topN,
                                      min_df=2
 )
-def term_cosine_similarities_weekly(outfile, min_df=None, included_subreddits=None, topN=500):
+def term_cosine_similarities_weekly(outfile, infile='/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', min_df=None, max_df=None, included_subreddits=None, topN=None):
-    return cosine_similarities_weekly('/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet',
+        return cosine_similarities_weekly(infile,
                                          outfile,
                                          'term',
                                          min_df,
                                          max_df,
                                          included_subreddits,
                                          topN)
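A note on argument binding: with the parameter order of `cosine_similarities_weekly` shown earlier in this diff, positional arguments after `term_colname` bind to `included_subreddits`, `topN`, `simfunc`, `min_df` and `max_df`, in that order. The sketch below uses a stand-in function with the same parameter order (the body and the `None` default for `simfunc` are placeholders) to show what a positional call binds and how keyword arguments avoid depending on the order.

```python
def cosine_similarities_weekly(tfidf_path, outfile, term_colname,
                               included_subreddits=None, topN=None,
                               simfunc=None, min_df=None, max_df=None):
    """Stand-in with the parameter order from the diff; returns its bindings."""
    return {
        "included_subreddits": included_subreddits,
        "topN": topN,
        "simfunc": simfunc,
        "min_df": min_df,
        "max_df": max_df,
    }


# Positional call in the order (min_df, max_df, included_subreddits, topN).
positional = cosine_similarities_weekly(
    "in.parquet", "out", "term", 2, None, "subs.txt", 500)

# Keyword call: each value reaches the parameter it was meant for.
keyword = cosine_similarities_weekly(
    "in.parquet", "out", "term",
    min_df=2, max_df=None, included_subreddits="subs.txt", topN=500)
```

In the positional call, `2` lands in `included_subreddits` and the path lands in `simfunc`, so passing everything after `term_colname` by keyword is the safer form.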
 def author_cosine_similarities_weekly_lsi(outfile, infile = '/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_authors_test.parquet', included_subreddits=None, n_components=100,lsi_model=None):
    return cosine_similarities_weekly_lsi(infile,
                                          outfile,
                                          'author',
                                          included_subreddits=included_subreddits,
                                          n_components=n_components,
                                          lsi_model=lsi_model
                                          )
 def term_cosine_similarities_weekly_lsi(outfile, infile = '/gscratch/comdata/output/reddit_similarity/tfidf_weekly/comment_terms.parquet', included_subreddits=None, n_components=100,lsi_model=None):
        return cosine_similarities_weekly_lsi(infile,
                                              outfile,
                                              'term',
                                              included_subreddits=included_subreddits,
                                              n_components=n_components,
                                              lsi_model=lsi_model,
                                              )
 if __name__ == "__main__":
    fire.Fire({'authors':author_cosine_similarities_weekly,
-               'terms':term_cosine_similarities_weekly})
+               'terms':term_cosine_similarities_weekly,
               'authors-lsi':author_cosine_similarities_weekly_lsi,
               'terms-lsi':term_cosine_similarities_weekly_lsi
               })
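For orientation, the per-week step builds a term-by-subreddit matrix from `(term, subreddit, tf_idf)` entries and compares its columns. The body of `column_similarities` is not part of this diff, so the following is only a pure-Python sketch of column-wise cosine similarity, with a hypothetical function name and toy data. A subreddit with no entries in a given week keeps an all-zero column and therefore zero similarity to every other subreddit, including itself.

```python
import math
from collections import defaultdict


def column_cosine_similarities(entries, n_cols):
    """entries: iterable of (row, col, value) with zero-based indices.
    Returns an n_cols x n_cols list of cosine similarities."""
    cols = defaultdict(dict)
    for row, col, value in entries:
        cols[col][row] = value

    norms = [math.sqrt(sum(v * v for v in cols[c].values()))
             for c in range(n_cols)]

    sims = [[0.0] * n_cols for _ in range(n_cols)]
    for i in range(n_cols):
        for j in range(i, n_cols):
            if norms[i] == 0 or norms[j] == 0:
                continue  # empty column: leave the similarity at 0
            dot = sum(v * cols[j].get(r, 0.0) for r, v in cols[i].items())
            sims[i][j] = sims[j][i] = dot / (norms[i] * norms[j])
    return sims


# Three terms (rows) and four subreddits (columns); column 3 is empty.
toy = [(0, 0, 1.0), (1, 0, 1.0), (0, 1, 1.0), (2, 2, 3.0)]
sims = column_cosine_similarities(toy, n_cols=4)
```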
--- a/timeseries/__init__.py
+++ b/timeseries/__init__.py
@@ -0,0 +1,2 @@
 from .choose_clusters import load_clusters, load_densities
 from .cluster_timeseries import build_cluster_timeseries
--- a/timeseries/cluster_timeseries.py
+++ b/timeseries/cluster_timeseries.py
@@ -2,20 +2,16 @@ import pandas as pd
 import numpy as np
 from pyspark.sql import functions as f
 from pyspark.sql import SparkSession
-from choose_clusters import load_clusters, load_densities
+from .choose_clusters import load_clusters, load_densities
 import fire
 from pathlib import Path
-def main(term_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_terms_10000.feather",
+def build_cluster_timeseries(term_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_terms_10000.feather",
         author_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_authors_10000.feather",
         term_densities_path="/gscratch/comdata/output/reddit_density/comment_terms_10000.feather",
         author_densities_path="/gscratch/comdata/output/reddit_density/comment_authors_10000.feather",
         output="data/subreddit_timeseries.parquet"):
    clusters = load_clusters(term_clusters_path, author_clusters_path)
    densities = load_densities(term_densities_path, author_densities_path)
    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/gscratch/comdata/output/reddit_comments_by_subreddit.parquet")
@@ -26,12 +22,16 @@ def main(term_clusters_path="/gscratch/comdata/output/reddit_clustering/comment_
    ts = df.select(['subreddit','week','author']).distinct().groupby(['subreddit','week']).count()
    ts = ts.repartition('subreddit')
    spk_clusters = spark.createDataFrame(clusters)
-    ts = ts.join(spk_clusters, on='subreddit', how='inner')
+    if term_densities_path is not None and author_densities_path is not None:
        densities = load_densities(term_densities_path, author_densities_path)
        spk_densities = spark.createDataFrame(densities)
        ts = ts.join(spk_densities, on='subreddit', how='inner')
    clusters = load_clusters(term_clusters_path, author_clusters_path)
    spk_clusters = spark.createDataFrame(clusters)
    ts = ts.join(spk_clusters, on='subreddit', how='inner')
    ts.write.parquet(output, mode='overwrite')
 if __name__ == "__main__":
-    fire.Fire(main)
+    fire.Fire(build_cluster_timeseries)
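The core of `build_cluster_timeseries` is a count of distinct authors per subreddit and week, followed by inner joins on `subreddit`. A minimal sketch of those two steps in plain Python is below. The function names and the sample records are hypothetical, and the Spark version operates on the full comments table rather than an in-memory list.

```python
from collections import Counter


def weekly_author_counts(comments):
    """Count distinct authors per (subreddit, week)."""
    distinct = {(c["subreddit"], c["week"], c["author"]) for c in comments}
    return Counter((sub, week) for sub, week, _ in distinct)


def join_clusters(counts, clusters):
    """Inner join on subreddit: rows without a cluster are dropped."""
    return [
        {"subreddit": sub, "week": week, "count": n, "cluster": clusters[sub]}
        for (sub, week), n in sorted(counts.items())
        if sub in clusters
    ]


comments = [
    {"subreddit": "a", "week": "2021-01-04", "author": "u1"},
    {"subreddit": "a", "week": "2021-01-04", "author": "u1"},  # repeat author
    {"subreddit": "a", "week": "2021-01-04", "author": "u2"},
    {"subreddit": "b", "week": "2021-01-04", "author": "u1"},
]
counts = weekly_author_counts(comments)
timeseries = join_clusters(counts, {"a": 7})
```

Because the joins are inner joins, a subreddit that is missing from the cluster (or density) table disappears from the output.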
--- a/visualization/tsne_vis.py
+++ b/visualization/tsne_vis.py
@@ -22,8 +22,12 @@ def base_plot(plot_data):
    #
    #    subreddit_select = alt.selection_single(on='click',fields=['subreddit'],bind=subreddit_dropdown,name='subreddit_click')
    base_scale = alt.Scale(scheme={"name":'category10',
                                   "extent":[0,100],
                                   "count":10})
    color = alt.condition(cluster_click_select ,
-                          alt.Color(field='color',type='nominal',scale=alt.Scale(scheme='category10')),
+                          alt.Color(field='color',type='nominal',scale=base_scale),
                          alt.value("lightgray"))
@@ -84,6 +88,11 @@ def viewport_plot(plot_data):
    return chart
 def assign_cluster_colors(tsne_data, clusters, n_colors, n_neighbors = 4):
    isolate_color = 101
    cluster_sizes = clusters.groupby('cluster').count()
    singletons = set(cluster_sizes.loc[cluster_sizes.subreddit == 1].reset_index().cluster)
    tsne_data = tsne_data.merge(clusters,on='subreddit')
    centroids = tsne_data.groupby('cluster').agg({'x':np.mean,'y':np.mean})
@@ -120,6 +129,9 @@ def assign_cluster_colors(tsne_data, clusters, n_colors, n_neighbors = 4):
    color_assignments = np.repeat(-1,len(centroids))
    for i in range(len(centroids)):
        if (centroids.iloc[i].name == -1) or (i in singletons):
            color_assignments[i] = isolate_color
        else:
            knn = indices[i]
            knn_colors = color_assignments[knn]
            available_colors = color_ids[list(set(color_ids) - set(knn_colors))]
@@ -129,7 +141,6 @@ def assign_cluster_colors(tsne_data, clusters, n_colors, n_neighbors = 4):
            else:
                raise Exception("Can't color this many neighbors with this many colors")
    centroids = centroids.reset_index()
    colors = centroids.loc[:,['cluster']]
    colors['color'] = color_assignments
@@ -143,12 +154,13 @@ def build_visualization(tsne_data, clusters, output):
    # clusters = "/gscratch/comdata/output/reddit_clustering/subreddit_author_tf_similarities_10000.feather"
    tsne_data = pd.read_feather(tsne_data)
    tsne_data = tsne_data.rename(columns={'_subreddit':'subreddit'})
    clusters = pd.read_feather(clusters)
    tsne_data = assign_cluster_colors(tsne_data,clusters,10,8)
-    # sr_per_cluster = tsne_data.groupby('cluster').subreddit.count().reset_index()
+    sr_per_cluster = tsne_data.groupby('cluster').subreddit.count().reset_index()
-    # sr_per_cluster = sr_per_cluster.rename(columns={'subreddit':'cluster_size'})
+    sr_per_cluster = sr_per_cluster.rename(columns={'subreddit':'cluster_size'})
    tsne_data = tsne_data.merge(sr_per_cluster,on='cluster')
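The loop in `assign_cluster_colors` gives isolates a reserved color and otherwise chooses a color not already used by a cluster's nearest neighbors. The selection step itself falls between hunks, so the sketch below assumes the simplest rule, taking the first available color. The function name `assign_colors` and the neighbor lists are hypothetical.

```python
def assign_colors(neighbors, n_colors, isolates=frozenset(), isolate_color=101):
    """neighbors[i] lists the indices of the clusters nearest to cluster i.
    Returns one color id per cluster; -1 means not yet assigned."""
    colors = [-1] * len(neighbors)
    for i, knn in enumerate(neighbors):
        if i in isolates:
            colors[i] = isolate_color
            continue
        taken = {colors[j] for j in knn}
        available = [c for c in range(n_colors) if c not in taken]
        if not available:
            raise ValueError("Can't color this many neighbors with this many colors")
        colors[i] = available[0]  # assumption: take the first free color
    return colors


colors = assign_colors([[1, 2], [0, 2], [0, 1], [0]], n_colors=3, isolates={3})
```

With only two colors the third cluster has no free color, which corresponds to the exception raised in the original loop.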
Author	SHA1	Message	Date
Nathan TeBlunthuis	55b75ea6fc	Merge remote-tracking branch 'refs/remotes/origin/excise_reindex' into excise_reindex	2022-04-06 11:14:13 -07:00
Nathan TeBlunthuis	197518a222	git-annex in	2022-04-06 11:11:11 -07:00
Nathan TeBlunthuis	65deba5e4e	Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex	2022-01-19 14:01:44 -08:00
Nathan TeBlunthuis	7b130a30af	commit changes from smap project.	2022-01-19 13:57:02 -08:00
Nathan TeBlunthuis	98c1317af5	update pushshift dumps.	2021-12-10 21:23:32 -08:00
Nathan TeBlunthuis	541e125b28	lsi support for weekly similarities	2021-08-11 22:48:33 -07:00
Nathan TeBlunthuis	b7c39a3494	Merge branch 'master' of code:cdsc_reddit into excise_reindex	2021-08-03 15:13:39 -07:00
Nathan TeBlunthuis	ce549c6c97	Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex	2021-08-03 15:13:21 -07:00
Nathan TeBlunthuis	6e43294a41	Updates to similarities code for smap project.	2021-08-03 15:06:48 -07:00
Nathan TeBlunthuis	2d21ff1137	Merge branch 'master' of code:cdsc_reddit into excise_reindex	2021-08-03 15:02:08 -07:00
Nate E TeBlunthuis	cf86c7492c	update clustering scripts	2021-08-03 14:55:02 -07:00
Nate E TeBlunthuis	87ffaa6858	script for picking the best clustering given constraints	2021-05-14 19:10:36 -07:00
Nate E TeBlunthuis	7b14db67de	Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex	2021-05-13 22:28:31 -07:00
Nate E TeBlunthuis	0b95bea30e	support isolates in visualization	2021-05-13 22:26:58 -07:00
Nate E TeBlunthuis	582cf263ea	bug fix in affinity clustering	2021-05-13 22:26:15 -07:00
Nate E TeBlunthuis	8a2248fae1	Merge remote-tracking branch 'origin/excise_reindex' into temp	2021-05-10 18:32:03 -07:00
Nate E TeBlunthuis	47ba04aa97	add script for pulling cluster timeseries	2021-05-10 18:24:22 -07:00
Nate E TeBlunthuis	4cb7eeec80	Refactor to make a decent api.	2021-05-10 13:46:49 -07:00
Nate E TeBlunthuis	f05cb962e0	refactor clustring in object oriented style	2021-05-07 22:33:26 -07:00
Nate E TeBlunthuis	8d1df5b26e	refactor clustering.py into method-specific files.	2021-05-03 11:28:48 -07:00
Nate E TeBlunthuis	e1c9d9af6f	Remove 'exclude phrases' parameter.	2021-05-03 10:37:09 -07:00
Nate E TeBlunthuis	7df8436067	Use Latent semantic indexing and hdbscan	2021-05-02 23:39:55 -07:00
Nate E TeBlunthuis	36b24ee933	reindex tfidf in memory instead of using spark	2021-04-30 12:48:19 -07:00
Nate E TeBlunthuis	a013f6718b	export timeseries functions	2021-03-24 17:18:30 -07:00
		`@@ -0,0 +1,2 @@`
							`from .timeseries import load_clusters, load_densities, build_cluster_timeseries`
		`@@ -1 +0,0 @@`
			`Try normalizing tf by the mean or std instead of the max to avoid penalizing subreddits with very active users.`