Nate E TeBlunthuis
|
57951050c0
|
Finish generating multiword expressions.
|
2020-08-09 22:43:48 -07:00 |
|
Nate E TeBlunthuis
|
529b7f0511
|
Bugfix
|
2020-08-09 02:34:42 -07:00 |
|
Nate E TeBlunthuis
|
2d1c8013f2
|
Use groupby - joins instead of windows
|
2020-08-09 00:21:50 -07:00 |
|
Nate E TeBlunthuis
|
f28effe2c3
|
rename tf_comments part 2.
|
2020-08-04 13:39:49 -07:00 |
|
Nate E TeBlunthuis
|
39fde9884e
|
rename tf_reddit_comments.py step1.
|
2020-08-04 13:39:20 -07:00 |
|
Nate E TeBlunthuis
|
78ab514d6b
|
Improve tokenization following data. Generate author counts.
|
2020-08-04 13:24:37 -07:00 |
|
Nate E TeBlunthuis
|
b3ffaaba1d
|
improve tokenizer.
|
2020-08-03 22:55:10 -07:00 |
|
Nate E TeBlunthuis
|
ddf2adb8a6
|
TF reddit comments.
|
2020-08-03 22:43:57 -07:00 |
|
Nate E TeBlunthuis
|
40be7bedb6
|
code to sort tf
|
2020-08-03 17:56:36 -07:00 |
|
Nate E TeBlunthuis
|
c666302b4a
|
remove the is_submitter field, which doesn't exist in submissions.
|
2020-07-09 17:12:14 -07:00 |
|
Nate E TeBlunthuis
|
aa84a7df03
|
Bugfixes in scripts.
|
2020-07-07 23:29:36 -07:00 |
|
Nate E TeBlunthuis
|
06fd99e7cd
|
clean up comments in streaming example.
|
2020-07-07 12:28:57 -07:00 |
|
Nate E TeBlunthuis
|
7d0e020f9d
|
update .gitignore
|
2020-07-07 12:28:44 -07:00 |
|
Nate E TeBlunthuis
|
e22ddf23da
|
update examples with working streaming
|
2020-07-07 11:47:17 -07:00 |
|
Nate E TeBlunthuis
|
40d4563770
|
Build comments dataset similarly to submissions and improve partitioning scheme
|
2020-07-07 11:45:43 -07:00 |
|
Nate E TeBlunthuis
|
fc6575a287
|
update .gitignore
|
2020-07-07 00:58:26 -07:00 |
|
Nate E TeBlunthuis
|
4efd72a916
|
Script for example of streaming pyarrow.
|
2020-07-07 00:57:05 -07:00 |
|
Nate E TeBlunthuis
|
90fe976b26
|
Script to demonstrate reading parquet.
|
2020-07-07 00:51:40 -07:00 |
|
Nate E TeBlunthuis
|
9cd0954288
|
Check the shas when we download dumps
|
2020-07-06 23:31:52 -07:00 |
|
Nate E TeBlunthuis
|
33e088492c
|
Script to run both parts of submissions_2_parquet.sh
|
2020-07-06 23:27:18 -07:00 |
|
Nate E TeBlunthuis
|
fd3b615544
|
Cache before sorting so we don't extract twice.
|
2020-07-06 22:30:04 -07:00 |
|
Nate E TeBlunthuis
|
4ec9c14247
|
Move the spark part of submissions_2_parquet to a separate script.
|
2020-07-06 22:27:34 -07:00 |
|
Nate E TeBlunthuis
|
4eb82d2740
|
Fix whitespace at top of file.
|
2020-07-05 23:32:00 -07:00 |
|
Nate E TeBlunthuis
|
34185337c9
|
Secondary sort for the by_author dataset should be CreatedAt.
|
2020-07-05 23:29:35 -07:00 |
|
Nate E TeBlunthuis
|
67857a3b05
|
Create a second dataset sorted by author.
|
2020-07-05 23:27:05 -07:00 |
|
Nate E TeBlunthuis
|
6d4344355b
|
Create parquet datasets of reddit submissions from pushshift.
|
2020-07-05 23:20:17 -07:00 |
|
Nate E TeBlunthuis
|
6dca79a41f
|
Rename spark script to reflect that it is for comments.
|
2020-07-03 14:00:36 -07:00 |
|
Nate E TeBlunthuis
|
94c7a74bd9
|
update .gitignore
|
2020-07-03 13:55:25 -07:00 |
|
Nate E TeBlunthuis
|
4dd9a210e6
|
bugfix in retrieving old data and rename file.
|
2020-07-03 13:54:55 -07:00 |
|
Nate E TeBlunthuis
|
c972d828b3
|
Script for checking shas for submissions.
|
2020-07-03 13:35:46 -07:00 |
|
Nate E TeBlunthuis
|
7da18e33ba
|
Bugfix: use timestamp types
Also change the canonical file path.
|
2020-07-03 11:38:43 -07:00 |
|
Nate E TeBlunthuis
|
db2d6248fc
|
update the reddit comment dumps
|
2020-07-03 10:41:13 -07:00 |
|
Nate E TeBlunthuis
|
d05da6441f
|
Don't clobber old dumps so that we can just download the new ones.
|
2020-07-03 10:40:43 -07:00 |
|
Nate E TeBlunthuis
|
592d2c7dda
|
script for getting submissions dumps from pushshift.
|
2020-07-02 17:40:17 -07:00 |
|
Nate E TeBlunthuis
|
64e9408a65
|
Extract variables from pushshift comment to parquet
A spark script
|
2020-07-02 14:35:55 -07:00 |
|