Commit Graph

  • 8725d11394 merging code/git-annex into git-annex Nate E TeBlunthuis 2020-11-11 16:48:59 -0800
  • ca6a8f0896 increase learning rate Nate E TeBlunthuis 2020-11-11 16:48:41 -0800
  • c7cc86da57 update Nate E TeBlunthuis 2020-11-11 16:48:17 -0800
  • 12f5476a9e update Nathan TeBlunthuis 2020-11-11 16:44:53 -0800
  • 408ada46e8 merging code/git-annex into git-annex Nate E TeBlunthuis 2020-11-11 16:44:11 -0800
  • eafb0e91ce update Nate E TeBlunthuis 2020-11-11 16:44:02 -0800
  • 38d045ac34 update git repository hosting 2020-11-11 16:44:02 -0800
  • 78e09813f0 merging code/git-annex into git-annex Nate E TeBlunthuis 2020-11-11 16:44:01 -0800
  • ed0e1a8235 Fix bug in tsne. Nate E TeBlunthuis 2020-11-11 16:43:41 -0800
  • e3a93239bc update Nate E TeBlunthuis 2020-11-11 16:43:31 -0800
  • cd9abac453 update Nathan TeBlunthuis 2020-11-11 16:40:28 -0800
  • 6e4a4c0434 update Nathan TeBlunthuis 2020-11-11 16:40:26 -0800
  • 9434b6c6c2 merging origin/git-annex into git-annex Nathan TeBlunthuis 2020-11-11 16:40:26 -0800
  • 8c5b970d6b branch created Nathan TeBlunthuis 2020-11-11 16:40:26 -0800
  • 23174d52fd update git repository hosting 2020-11-11 16:39:46 -0800
  • 7eb64744ee update Nate E TeBlunthuis 2020-11-11 16:39:45 -0800
  • 6baa08889b git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit Nate E TeBlunthuis 2020-11-11 16:39:44 -0800
  • 4d2906734d merging code/git-annex into git-annex Nate E TeBlunthuis 2020-11-11 16:39:44 -0800
  • 47c1314f3b branch created git repository hosting 2020-11-11 16:39:43 -0800
  • 4447c60265 split fitting and plotting tsne. Nate E TeBlunthuis 2020-11-11 16:38:22 -0800
  • 4544f34101 update Nate E TeBlunthuis 2020-11-11 16:38:13 -0800
  • 79368b5d35 update Nate E TeBlunthuis 2020-11-11 16:38:08 -0800
  • 3317a47073 branch created Nate E TeBlunthuis 2020-11-11 16:38:03 -0800
  • db53c0138a Add file to plot related subreddits using tsne. Nathan TeBlunthuis 2020-11-11 16:05:36 -0800
  • 4c8bd14992 Bugfix (typo) Nate E TeBlunthuis 2020-11-10 13:38:11 -0800
  • 39c581bee9 Reuse code for term and author cosine similarity. Nate E TeBlunthuis 2020-11-10 13:18:57 -0800
  • 5632a971c6 Refactor tfidf code to for code resuse. Nate E TeBlunthuis 2020-11-10 13:18:19 -0800
  • 772f3a8fbd rename 'idf' files to 'tfidf' Nate E TeBlunthuis 2020-11-10 13:16:55 -0800
  • 6edd155749 Improvements to idf code Nate E TeBlunthuis 2020-11-10 13:12:11 -0800
  • 8b8c45ee2d Merge branch 'master' of code:cdsc_reddit Nate E TeBlunthuis 2020-11-02 10:40:12 -0800
  • 3dc17bd27c add term_cosine_similarity.py Nate E TeBlunthuis 2020-11-02 10:40:02 -0800
  • 0882878166 Add Cosine similarities to README.md Nathan TeBlunthuis 2020-11-02 09:48:10 -0800
  • b50b08a3ea Update Readme. Nathan TeBlunthuis 2020-11-02 08:42:13 -0800
  • 9075a8153c Merge branch 'master' of code:cdsc_reddit into master Nathan TeBlunthuis 2020-11-01 21:50:44 -0800
  • 4c78f2c527 Create README.md Nathan TeBlunthuis 2020-11-01 21:50:27 -0800
  • 4ced659d19 Update reddit comments data with daily dumps. Nate E TeBlunthuis 2020-10-03 16:42:22 -0700
  • 2740f55915 Compute IDF for terms and authors. Nate E TeBlunthuis 2020-08-23 11:57:55 -0700
  • 2d425600a8 Update submissions to parse using the backfill queue. Nate E TeBlunthuis 2020-08-11 22:37:36 -0700
  • c92b50e050 bugfix in checking submission shas Nate E TeBlunthuis 2020-08-11 14:21:54 -0700
  • c0da8f4dbf Use multiword expressions in tf. Nate E TeBlunthuis 2020-08-10 16:57:46 -0700
  • 57951050c0 Finish generating multiword expressions. Nate E TeBlunthuis 2020-08-09 22:42:23 -0700
  • 529b7f0511 Bugfix Nate E TeBlunthuis 2020-08-09 02:34:42 -0700
  • 2d1c8013f2 Use groupby - joins instead of windows Nate E TeBlunthuis 2020-08-09 00:21:50 -0700
  • f28effe2c3 renamte tf_comments part 2. Nate E TeBlunthuis 2020-08-04 13:39:49 -0700
  • 39fde9884e rename tf_reddit_comments.py step1. Nate E TeBlunthuis 2020-08-04 13:39:20 -0700
  • 78ab514d6b Improve tokenization following data. Generate author counts. Nate E TeBlunthuis 2020-08-04 13:24:37 -0700
  • b3ffaaba1d improve tokenizer. Nate E TeBlunthuis 2020-08-03 22:55:10 -0700
  • ddf2adb8a6 TF reddit comments. Nate E TeBlunthuis 2020-08-03 22:43:57 -0700
  • 40be7bedb6 code to sort tf Nate E TeBlunthuis 2020-08-03 17:56:36 -0700
  • c666302b4a remove is_submitter field from submissions which doesn't exist. Nate E TeBlunthuis 2020-07-09 17:12:14 -0700
  • aa84a7df03 Bugfixes in scripts. Nate E TeBlunthuis 2020-07-07 23:29:36 -0700
  • 06fd99e7cd clean up comments in streaming example. Nate E TeBlunthuis 2020-07-07 12:28:57 -0700
  • 7d0e020f9d update .gitignore Nate E TeBlunthuis 2020-07-07 12:28:44 -0700
  • e22ddf23da update examples with working streaming Nate E TeBlunthuis 2020-07-07 11:47:17 -0700
  • 40d4563770 Build comments dataset similarly to submissions and improve partitioning scheme Nate E TeBlunthuis 2020-07-07 11:45:43 -0700
  • fc6575a287 update .gitignore Nate E TeBlunthuis 2020-07-07 00:58:26 -0700
  • 4efd72a916 Script for example of streaming pyarrow. Nate E TeBlunthuis 2020-07-07 00:57:05 -0700
  • 90fe976b26 Script to demonstrate reading parquet. Nate E TeBlunthuis 2020-07-07 00:51:40 -0700
  • 9cd0954288 Check the shas when we download dumps Nate E TeBlunthuis 2020-07-06 23:31:52 -0700
  • 33e088492c Script to run both parts of submissions_2_parquet.sh Nate E TeBlunthuis 2020-07-06 23:27:18 -0700
  • fd3b615544 Cache before sorting so we don't extract twice. Nate E TeBlunthuis 2020-07-06 22:30:04 -0700
  • 4ec9c14247 Move the spark part of submissions_2_parquet to a separate script. Nate E TeBlunthuis 2020-07-06 22:26:29 -0700
  • 4eb82d2740 Fix whitespace at top of file. Nate E TeBlunthuis 2020-07-05 23:32:00 -0700
  • 34185337c9 Secondary sort for the by_author dataset should be CreatedAt. Nate E TeBlunthuis 2020-07-05 23:27:18 -0700
  • 67857a3b05 Create a second dataset sorted by author. Nate E TeBlunthuis 2020-07-05 23:24:40 -0700
  • 6d4344355b Create parquet datasets of reddit submissions from pushshift. Nate E TeBlunthuis 2020-07-05 23:20:17 -0700
  • 6dca79a41f Rename spark script to reflect that it is for comments. Nate E TeBlunthuis 2020-07-03 14:00:36 -0700
  • 94c7a74bd9 update .gitignore Nate E TeBlunthuis 2020-07-03 13:55:25 -0700
  • 4dd9a210e6 bugfix in retrieving old data and rename file. Nate E TeBlunthuis 2020-07-03 13:54:55 -0700
  • c972d828b3 Script for checking shas for submissions. Nate E TeBlunthuis 2020-07-03 13:35:46 -0700
  • 7da18e33ba Bugfix: use timestamp types Nate E TeBlunthuis 2020-07-03 11:38:43 -0700
  • db2d6248fc update the reddit comment dumps Nate E TeBlunthuis 2020-07-03 10:41:13 -0700
  • d05da6441f Don't clobber old dumps so that we can just download the new ones. Nate E TeBlunthuis 2020-07-03 10:40:43 -0700
  • 592d2c7dda script for getting submissions dumps from pushshift. Nate E TeBlunthuis 2020-07-02 17:40:17 -0700
  • 64e9408a65 Extract variables from pushshift comment to parquet Nate E TeBlunthuis 2020-07-02 14:06:36 -0700