Commit Graph

78 Commits

Author SHA1 Message Date
Nate E TeBlunthuis
a013f6718b export timeseries functions 2021-03-24 17:18:30 -07:00
Nate E TeBlunthuis
36cb0a5546 add code for pulling activity time series from parquet. 2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis
06430903f0 add included_subreddits parameter to cosine similarities. 2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis
4dc949de5f Changes from hyak. 2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis
140d1bdd17 fix bug in viz. 2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis
554660275f add visualization for 10000 subreddits based on author-tf similarities. 2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis
b4dd9acbd8 Merge branch 'master' of code:cdsc_reddit 2021-01-27 20:09:23 -08:00
dbe4c87f8b add cluster selection to visualization 2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis
3155600514 remove nsfw subs from topN 2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis
4e20dce188 Updating to support wang-style user overlaps. 2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis
56269deee3 Some improvements to run affinity clustering on larger dataset and compute density. 2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis
e6294b5b90 Refactor and reorganze. 2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis
a60747292e Add code for running tf-idf at the weekly level. 2020-12-01 22:54:48 -08:00
db5879d6c9 refactor visualization code. 2020-11-17 16:46:49 -08:00
13eb95b3b0 Merge remote-tracking branch 'refs/remotes/origin/master' into master 2020-11-17 16:33:14 -08:00
2cc897543a git-annex in nathante@nate-x1:~/cdsc_reddit 2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis
1bf206d219 git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis
f8ff8b2d0f Update code for clustering + tsne. 2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis
82d184d9c6 Update code for building simlarity matrices. 2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis
e794214653 bugfix in completing tfidf similarity matrices. 2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis
220a540beb increase learning rate. 2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis
cd43a94865 increase iterations and perplectity and early_exaggeration 2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis
ca6a8f0896 increase learning rate 2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis
ed0e1a8235 Fix bug in tsne. 2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis
6baa08889b git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis
4447c60265 split fitting and plotting tsne. 2020-11-11 16:38:22 -08:00
db53c0138a Add file to plot related subreddits using tsne. 2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis
4c8bd14992 Bugfix (typo) 2020-11-10 13:38:11 -08:00
Nate E TeBlunthuis
39c581bee9 Reuse code for term and author cosine similarity. 2020-11-10 13:18:57 -08:00
Nate E TeBlunthuis
5632a971c6 Refactor tfidf code to for code resuse. 2020-11-10 13:18:19 -08:00
Nate E TeBlunthuis
772f3a8fbd rename 'idf' files to 'tfidf' 2020-11-10 13:16:55 -08:00
Nate E TeBlunthuis
6edd155749 Improvements to idf code 2020-11-10 13:12:11 -08:00
Nate E TeBlunthuis
8b8c45ee2d Merge branch 'master' of code:cdsc_reddit 2020-11-02 10:40:12 -08:00
Nate E TeBlunthuis
3dc17bd27c add term_cosine_similarity.py 2020-11-02 10:40:02 -08:00
0882878166 Add Cosine similarities to README.md 2020-11-02 09:48:10 -08:00
b50b08a3ea Update Readme. 2020-11-02 08:42:13 -08:00
9075a8153c Merge branch 'master' of code:cdsc_reddit into master 2020-11-01 21:50:44 -08:00
4c78f2c527 Create README.md 2020-11-01 21:50:27 -08:00
Nate E TeBlunthuis
4ced659d19 Update reddit comments data with daily dumps. 2020-10-03 16:42:22 -07:00
Nate E TeBlunthuis
2740f55915 Compute IDF for terms and authors. 2020-08-23 11:57:55 -07:00
Nate E TeBlunthuis
2d425600a8 Update submissions to parse using the backfill queue. 2020-08-11 22:37:36 -07:00
Nate E TeBlunthuis
c92b50e050 bugfix in checking submission shas 2020-08-11 14:21:54 -07:00
Nate E TeBlunthuis
c0da8f4dbf Use multiword expressions in tf. 2020-08-10 16:57:46 -07:00
Nate E TeBlunthuis
57951050c0 Finish generating multiword expressions. 2020-08-09 22:43:48 -07:00
Nate E TeBlunthuis
529b7f0511 Bugfix 2020-08-09 02:34:42 -07:00
Nate E TeBlunthuis
2d1c8013f2 Use groupby - joins instead of windows 2020-08-09 00:21:50 -07:00
Nate E TeBlunthuis
f28effe2c3 renamte tf_comments part 2. 2020-08-04 13:39:49 -07:00
Nate E TeBlunthuis
39fde9884e rename tf_reddit_comments.py step1. 2020-08-04 13:39:20 -07:00
Nate E TeBlunthuis
78ab514d6b Improve tokenization following data. Generate author counts. 2020-08-04 13:24:37 -07:00
Nate E TeBlunthuis
b3ffaaba1d improve tokenizer. 2020-08-03 22:55:10 -07:00