|
13eb95b3b0
|
Merge remote-tracking branch 'refs/remotes/origin/master' into master
|
2020-11-17 16:33:14 -08:00 |
|
|
2cc897543a
|
git-annex in nathante@nate-x1:~/cdsc_reddit
|
2020-11-17 16:33:13 -08:00 |
|
Nate E TeBlunthuis
|
1bf206d219
|
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
|
2020-11-17 16:31:48 -08:00 |
|
Nate E TeBlunthuis
|
f8ff8b2d0f
|
Update code for clustering + tsne.
|
2020-11-17 15:59:20 -08:00 |
|
Nate E TeBlunthuis
|
82d184d9c6
|
Update code for building similarity matrices.
|
2020-11-17 12:52:48 -08:00 |
|
Nate E TeBlunthuis
|
e794214653
|
bugfix in completing tfidf similarity matrices.
|
2020-11-12 11:47:53 -08:00 |
|
Nate E TeBlunthuis
|
220a540beb
|
increase learning rate.
|
2020-11-11 16:58:39 -08:00 |
|
Nate E TeBlunthuis
|
cd43a94865
|
increase iterations and perplexity and early_exaggeration
|
2020-11-11 16:55:39 -08:00 |
|
Nate E TeBlunthuis
|
ca6a8f0896
|
increase learning rate
|
2020-11-11 16:48:41 -08:00 |
|
Nate E TeBlunthuis
|
ed0e1a8235
|
Fix bug in tsne.
|
2020-11-11 16:43:41 -08:00 |
|
Nate E TeBlunthuis
|
6baa08889b
|
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
|
2020-11-11 16:39:44 -08:00 |
|
Nate E TeBlunthuis
|
4447c60265
|
split fitting and plotting tsne.
|
2020-11-11 16:38:22 -08:00 |
|
|
db53c0138a
|
Add file to plot related subreddits using tsne.
|
2020-11-11 16:05:36 -08:00 |
|
Nate E TeBlunthuis
|
4c8bd14992
|
Bugfix (typo)
|
2020-11-10 13:38:11 -08:00 |
|
Nate E TeBlunthuis
|
39c581bee9
|
Reuse code for term and author cosine similarity.
|
2020-11-10 13:18:57 -08:00 |
|
Nate E TeBlunthuis
|
5632a971c6
|
Refactor tfidf code for code reuse.
|
2020-11-10 13:18:19 -08:00 |
|
Nate E TeBlunthuis
|
772f3a8fbd
|
rename 'idf' files to 'tfidf'
|
2020-11-10 13:16:55 -08:00 |
|
Nate E TeBlunthuis
|
6edd155749
|
Improvements to idf code
|
2020-11-10 13:12:11 -08:00 |
|
Nate E TeBlunthuis
|
8b8c45ee2d
|
Merge branch 'master' of code:cdsc_reddit
|
2020-11-02 10:40:12 -08:00 |
|
Nate E TeBlunthuis
|
3dc17bd27c
|
add term_cosine_similarity.py
|
2020-11-02 10:40:02 -08:00 |
|
|
0882878166
|
Add Cosine similarities to README.md
|
2020-11-02 09:48:10 -08:00 |
|
|
b50b08a3ea
|
Update Readme.
|
2020-11-02 08:42:13 -08:00 |
|
|
9075a8153c
|
Merge branch 'master' of code:cdsc_reddit into master
|
2020-11-01 21:50:44 -08:00 |
|
|
4c78f2c527
|
Create README.md
|
2020-11-01 21:50:27 -08:00 |
|
Nate E TeBlunthuis
|
4ced659d19
|
Update reddit comments data with daily dumps.
|
2020-10-03 16:42:22 -07:00 |
|
Nate E TeBlunthuis
|
2740f55915
|
Compute IDF for terms and authors.
|
2020-08-23 11:57:55 -07:00 |
|
Nate E TeBlunthuis
|
2d425600a8
|
Update submissions to parse using the backfill queue.
|
2020-08-11 22:37:36 -07:00 |
|
Nate E TeBlunthuis
|
c92b50e050
|
bugfix in checking submission shas
|
2020-08-11 14:21:54 -07:00 |
|
Nate E TeBlunthuis
|
c0da8f4dbf
|
Use multiword expressions in tf.
|
2020-08-10 16:57:46 -07:00 |
|
Nate E TeBlunthuis
|
57951050c0
|
Finish generating multiword expressions.
|
2020-08-09 22:43:48 -07:00 |
|
Nate E TeBlunthuis
|
529b7f0511
|
Bugfix
|
2020-08-09 02:34:42 -07:00 |
|
Nate E TeBlunthuis
|
2d1c8013f2
|
Use groupby - joins instead of windows
|
2020-08-09 00:21:50 -07:00 |
|
Nate E TeBlunthuis
|
f28effe2c3
|
rename tf_comments part 2.
|
2020-08-04 13:39:49 -07:00 |
|
Nate E TeBlunthuis
|
39fde9884e
|
rename tf_reddit_comments.py step1.
|
2020-08-04 13:39:20 -07:00 |
|
Nate E TeBlunthuis
|
78ab514d6b
|
Improve tokenization following data. Generate author counts.
|
2020-08-04 13:24:37 -07:00 |
|
Nate E TeBlunthuis
|
b3ffaaba1d
|
improve tokenizer.
|
2020-08-03 22:55:10 -07:00 |
|
Nate E TeBlunthuis
|
ddf2adb8a6
|
TF reddit comments.
|
2020-08-03 22:43:57 -07:00 |
|
Nate E TeBlunthuis
|
40be7bedb6
|
code to sort tf
|
2020-08-03 17:56:36 -07:00 |
|
Nate E TeBlunthuis
|
c666302b4a
|
remove is_submitter field from submissions which doesn't exist.
|
2020-07-09 17:12:14 -07:00 |
|
Nate E TeBlunthuis
|
aa84a7df03
|
Bugfixes in scripts.
|
2020-07-07 23:29:36 -07:00 |
|
Nate E TeBlunthuis
|
06fd99e7cd
|
clean up comments in streaming example.
|
2020-07-07 12:28:57 -07:00 |
|
Nate E TeBlunthuis
|
7d0e020f9d
|
update .gitignore
|
2020-07-07 12:28:44 -07:00 |
|
Nate E TeBlunthuis
|
e22ddf23da
|
update examples with working streaming
|
2020-07-07 11:47:17 -07:00 |
|
Nate E TeBlunthuis
|
40d4563770
|
Build comments dataset similarly to submissions and improve partitioning scheme
|
2020-07-07 11:45:43 -07:00 |
|
Nate E TeBlunthuis
|
fc6575a287
|
update .gitignore
|
2020-07-07 00:58:26 -07:00 |
|
Nate E TeBlunthuis
|
4efd72a916
|
Script for example of streaming pyarrow.
|
2020-07-07 00:57:05 -07:00 |
|
Nate E TeBlunthuis
|
90fe976b26
|
Script to demonstrate reading parquet.
|
2020-07-07 00:51:40 -07:00 |
|
Nate E TeBlunthuis
|
9cd0954288
|
Check the shas when we download dumps
|
2020-07-06 23:31:52 -07:00 |
|
Nate E TeBlunthuis
|
33e088492c
|
Script to run both parts of submissions_2_parquet.sh
|
2020-07-06 23:27:18 -07:00 |
|
Nate E TeBlunthuis
|
fd3b615544
|
Cache before sorting so we don't extract twice.
|
2020-07-06 22:30:04 -07:00 |
|