37 Commits

Author SHA1 Message Date
Nate E TeBlunthuis
c92b50e050 bugfix in checking submission shas 2020-08-11 14:21:54 -07:00
Nate E TeBlunthuis
c0da8f4dbf Use multiword expressions in tf. 2020-08-10 16:57:46 -07:00
Nate E TeBlunthuis
57951050c0 Finish generating multiword expressions. 2020-08-09 22:43:48 -07:00
Nate E TeBlunthuis
529b7f0511 Bugfix 2020-08-09 02:34:42 -07:00
Nate E TeBlunthuis
2d1c8013f2 Use groupby - joins instead of windows 2020-08-09 00:21:50 -07:00
Nate E TeBlunthuis
f28effe2c3 rename tf_comments part 2. 2020-08-04 13:39:49 -07:00
Nate E TeBlunthuis
39fde9884e rename tf_reddit_comments.py step1. 2020-08-04 13:39:20 -07:00
Nate E TeBlunthuis
78ab514d6b Improve tokenization following data. Generate author counts. 2020-08-04 13:24:37 -07:00
Nate E TeBlunthuis
b3ffaaba1d improve tokenizer. 2020-08-03 22:55:10 -07:00
Nate E TeBlunthuis
ddf2adb8a6 TF reddit comments. 2020-08-03 22:43:57 -07:00
Nate E TeBlunthuis
40be7bedb6 code to sort tf 2020-08-03 17:56:36 -07:00
Nate E TeBlunthuis
c666302b4a remove is_submitter field from submissions which doesn't exist. 2020-07-09 17:12:14 -07:00
Nate E TeBlunthuis
aa84a7df03 Bugfixes in scripts. 2020-07-07 23:29:36 -07:00
Nate E TeBlunthuis
06fd99e7cd clean up comments in streaming example. 2020-07-07 12:28:57 -07:00
Nate E TeBlunthuis
7d0e020f9d update .gitignore 2020-07-07 12:28:44 -07:00
Nate E TeBlunthuis
e22ddf23da update examples with working streaming 2020-07-07 11:47:17 -07:00
Nate E TeBlunthuis
40d4563770 Build comments dataset similarly to submissions and improve partitioning scheme 2020-07-07 11:45:43 -07:00
Nate E TeBlunthuis
fc6575a287 update .gitignore 2020-07-07 00:58:26 -07:00
Nate E TeBlunthuis
4efd72a916 Script for example of streaming pyarrow. 2020-07-07 00:57:05 -07:00
Nate E TeBlunthuis
90fe976b26 Script to demonstrate reading parquet. 2020-07-07 00:51:40 -07:00
Nate E TeBlunthuis
9cd0954288 Check the shas when we download dumps 2020-07-06 23:31:52 -07:00
Nate E TeBlunthuis
33e088492c Script to run both parts of submissions_2_parquet.sh 2020-07-06 23:27:18 -07:00
Nate E TeBlunthuis
fd3b615544 Cache before sorting so we don't extract twice. 2020-07-06 22:30:04 -07:00
Nate E TeBlunthuis
4ec9c14247 Move the spark part of submissions_2_parquet to a separate script. 2020-07-06 22:27:34 -07:00
Nate E TeBlunthuis
4eb82d2740 Fix whitespace at top of file. 2020-07-05 23:32:00 -07:00
Nate E TeBlunthuis
34185337c9 Secondary sort for the by_author dataset should be CreatedAt. 2020-07-05 23:29:35 -07:00
Nate E TeBlunthuis
67857a3b05 Create a second dataset sorted by author. 2020-07-05 23:27:05 -07:00
Nate E TeBlunthuis
6d4344355b Create parquet datasets of reddit submissions from pushshift. 2020-07-05 23:20:17 -07:00
Nate E TeBlunthuis
6dca79a41f Rename spark script to reflect that it is for comments. 2020-07-03 14:00:36 -07:00
Nate E TeBlunthuis
94c7a74bd9 update .gitignore 2020-07-03 13:55:25 -07:00
Nate E TeBlunthuis
4dd9a210e6 bugfix in retrieving old data and rename file. 2020-07-03 13:54:55 -07:00
Nate E TeBlunthuis
c972d828b3 Script for checking shas for submissions. 2020-07-03 13:35:46 -07:00
Nate E TeBlunthuis
7da18e33ba Bugfix: use timestamp types. Also change the canonical file path. 2020-07-03 11:38:43 -07:00
Nate E TeBlunthuis
db2d6248fc update the reddit comment dumps 2020-07-03 10:41:13 -07:00
Nate E TeBlunthuis
d05da6441f Don't clobber old dumps so that we can just download the new ones. 2020-07-03 10:40:43 -07:00
Nate E TeBlunthuis
592d2c7dda script for getting submissions dumps from pushshift. 2020-07-02 17:40:17 -07:00
Nate E TeBlunthuis
64e9408a65 Extract variables from pushshift comment to parquet. A spark script. 2020-07-02 14:35:55 -07:00