Commit Graph

112 Commits

Author SHA1 Message Date
Nate E TeBlunthuis
1bf206d219 git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis
f8ff8b2d0f Update code for clustering + tsne. 2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis
82d184d9c6 Update code for building similarity matrices. 2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis
e794214653 bugfix in completing tfidf similarity matrices. 2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis
220a540beb increase learning rate. 2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis
cd43a94865 increase iterations and perplexity and early_exaggeration 2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis
ca6a8f0896 increase learning rate 2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis
ed0e1a8235 Fix bug in tsne. 2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis
6baa08889b git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis
4447c60265 split fitting and plotting tsne. 2020-11-11 16:38:22 -08:00
db53c0138a Add file to plot related subreddits using tsne. 2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis
4c8bd14992 Bugfix (typo) 2020-11-10 13:38:11 -08:00
Nate E TeBlunthuis
39c581bee9 Reuse code for term and author cosine similarity. 2020-11-10 13:18:57 -08:00
Nate E TeBlunthuis
5632a971c6 Refactor tfidf code for code reuse. 2020-11-10 13:18:19 -08:00
Nate E TeBlunthuis
772f3a8fbd rename 'idf' files to 'tfidf' 2020-11-10 13:16:55 -08:00
Nate E TeBlunthuis
6edd155749 Improvements to idf code 2020-11-10 13:12:11 -08:00
Nate E TeBlunthuis
8b8c45ee2d Merge branch 'master' of code:cdsc_reddit 2020-11-02 10:40:12 -08:00
Nate E TeBlunthuis
3dc17bd27c add term_cosine_similarity.py 2020-11-02 10:40:02 -08:00
0882878166 Add Cosine similarities to README.md 2020-11-02 09:48:10 -08:00
b50b08a3ea Update Readme. 2020-11-02 08:42:13 -08:00
9075a8153c Merge branch 'master' of code:cdsc_reddit into master 2020-11-01 21:50:44 -08:00
4c78f2c527 Create README.md 2020-11-01 21:50:27 -08:00
Nate E TeBlunthuis
4ced659d19 Update reddit comments data with daily dumps. 2020-10-03 16:42:22 -07:00
Nate E TeBlunthuis
2740f55915 Compute IDF for terms and authors. 2020-08-23 11:57:55 -07:00
Nate E TeBlunthuis
2d425600a8 Update submissions to parse using the backfill queue. 2020-08-11 22:37:36 -07:00
Nate E TeBlunthuis
c92b50e050 bugfix in checking submission shas 2020-08-11 14:21:54 -07:00
Nate E TeBlunthuis
c0da8f4dbf Use multiword expressions in tf. 2020-08-10 16:57:46 -07:00
Nate E TeBlunthuis
57951050c0 Finish generating multiword expressions. 2020-08-09 22:43:48 -07:00
Nate E TeBlunthuis
529b7f0511 Bugfix 2020-08-09 02:34:42 -07:00
Nate E TeBlunthuis
2d1c8013f2 Use groupby - joins instead of windows 2020-08-09 00:21:50 -07:00
Nate E TeBlunthuis
f28effe2c3 rename tf_comments part 2. 2020-08-04 13:39:49 -07:00
Nate E TeBlunthuis
39fde9884e rename tf_reddit_comments.py step1. 2020-08-04 13:39:20 -07:00
Nate E TeBlunthuis
78ab514d6b Improve tokenization following data. Generate author counts. 2020-08-04 13:24:37 -07:00
Nate E TeBlunthuis
b3ffaaba1d improve tokenizer. 2020-08-03 22:55:10 -07:00
Nate E TeBlunthuis
ddf2adb8a6 TF reddit comments. 2020-08-03 22:43:57 -07:00
Nate E TeBlunthuis
40be7bedb6 code to sort tf 2020-08-03 17:56:36 -07:00
Nate E TeBlunthuis
c666302b4a remove is_submitter field from submissions which doesn't exist. 2020-07-09 17:12:14 -07:00
Nate E TeBlunthuis
aa84a7df03 Bugfixes in scripts. 2020-07-07 23:29:36 -07:00
Nate E TeBlunthuis
06fd99e7cd clean up comments in streaming example. 2020-07-07 12:28:57 -07:00
Nate E TeBlunthuis
7d0e020f9d update .gitignore 2020-07-07 12:28:44 -07:00
Nate E TeBlunthuis
e22ddf23da update examples with working streaming 2020-07-07 11:47:17 -07:00
Nate E TeBlunthuis
40d4563770 Build comments dataset similarly to submissions and improve partitioning scheme 2020-07-07 11:45:43 -07:00
Nate E TeBlunthuis
fc6575a287 update .gitignore 2020-07-07 00:58:26 -07:00
Nate E TeBlunthuis
4efd72a916 Script for example of streaming pyarrow. 2020-07-07 00:57:05 -07:00
Nate E TeBlunthuis
90fe976b26 Script to demonstrate reading parquet. 2020-07-07 00:51:40 -07:00
Nate E TeBlunthuis
9cd0954288 Check the shas when we download dumps 2020-07-06 23:31:52 -07:00
Nate E TeBlunthuis
33e088492c Script to run both parts of submissions_2_parquet.sh 2020-07-06 23:27:18 -07:00
Nate E TeBlunthuis
fd3b615544 Cache before sorting so we don't extract twice. 2020-07-06 22:30:04 -07:00
Nate E TeBlunthuis
4ec9c14247 Move the spark part of submissions_2_parquet to a separate script. 2020-07-06 22:27:34 -07:00
Nate E TeBlunthuis
4eb82d2740 Fix whitespace at top of file. 2020-07-05 23:32:00 -07:00