Commit Graph

187 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
74ee86e443 add weekly_cosine_similarities script. 2024-12-25 21:15:38 -08:00
Nathan TeBlunthuis
a8a92d30df bugfix 2024-12-19 23:34:55 -08:00
Nathan TeBlunthuis
638ab78375 comment out config. 2024-12-19 23:32:16 -08:00
Nathan TeBlunthuis
8cb75c8354 typo fix. 2024-12-19 20:10:34 -08:00
Nathan TeBlunthuis
0bbdc6bd5e typo fix. 2024-12-19 20:09:00 -08:00
Nathan TeBlunthuis
8b69801c8d correct number of partitions. 2024-12-19 19:39:18 -08:00
Nathan TeBlunthuis
189330198c repartition for parallelism. 2024-12-19 17:53:27 -08:00
Nathan TeBlunthuis
c6c9ec173b add shebang 2024-12-15 18:47:07 -08:00
Nathan TeBlunthuis
52694e0498 typofix 2024-12-15 08:23:06 -08:00
Nathan TeBlunthuis
cb2f2c9717 make executable. 2024-12-15 08:18:42 -08:00
Nathan TeBlunthuis
9a852b9300 was renamed to 'term_frequencies' prior to merge. 2024-12-12 07:54:28 -08:00
Nathan TeBlunthuis
3d192ab82f Merge remote-tracking branch 'origin/icwsm_dataverse' 2024-12-12 07:45:06 -08:00
Nathan TeBlunthuis
e2b6c1b481 configure to use the g2-cpu node. 2024-12-12 07:17:10 -08:00
Nathan TeBlunthuis
f38ec6c129 smaller outchunk size. 2024-12-07 13:23:44 -08:00
Nathan TeBlunthuis
25bfc57baf change path 2024-12-06 08:18:20 -08:00
Nathan TeBlunthuis
c3d2834110 use pyarrow instead of spark to write data 2024-12-06 08:09:02 -08:00
Nathan TeBlunthuis
8224195432 bugfix. 2024-12-05 11:08:18 -08:00
Nathan TeBlunthuis
5d70d3eb6d improve spark configuration. 2024-12-04 10:43:13 -08:00
Nathan TeBlunthuis
89d03dd956 consistent naming and bugfix. 2024-12-04 09:24:45 -08:00
Nathan TeBlunthuis
472849ebd9 correct output path. 2024-12-04 09:07:10 -08:00
Nathan TeBlunthuis
85945eae90 correct paths. 2024-12-04 09:06:02 -08:00
Nathan TeBlunthuis
1cca01fb69 use Path to make directories not os. 2024-12-04 07:47:47 -08:00
Nathan TeBlunthuis
39c0fa7a29 bugfix. 2024-12-03 19:18:38 -08:00
Nathan TeBlunthuis
0436450ea8 typo fix 2024-12-03 19:16:49 -08:00
Nathan TeBlunthuis
4be8bb6bf5 bugfix 2024-12-03 19:15:07 -08:00
Nathan TeBlunthuis
ec5859c311 pass ngram_output through. 2024-12-03 19:05:44 -08:00
Nathan TeBlunthuis
a179d608eb bugfix. 2024-12-03 19:02:26 -08:00
Nathan TeBlunthuis
73dd2a96a6 it's selftext not body 2024-12-03 18:59:27 -08:00
Nathan TeBlunthuis
5045d6052e use post title and body in terms 2024-12-03 18:53:41 -08:00
Nathan TeBlunthuis
51234f1070 add inpath param for tfidf_authors_weekly. 2024-12-03 10:16:23 -08:00
Nathan TeBlunthuis
0a6ad65baf add shebang 2024-12-03 09:06:40 -08:00
Nathan TeBlunthuis
7096785cb9 make exe 2024-12-03 09:05:44 -08:00
Nathan TeBlunthuis
355d014d5f pass path into tfidf function. 2024-12-02 08:03:19 -08:00
Nathan TeBlunthuis
5a131053af spark config tweaks. 2024-12-01 15:41:47 -08:00
Nathan TeBlunthuis
224fb89317 bugfix. 2024-12-01 15:28:25 -08:00
Nathan TeBlunthuis
b25c332cea typo fix. 2024-12-01 15:27:16 -08:00
Nathan TeBlunthuis
613059737a set os environment for big machine 2024-12-01 15:25:18 -08:00
Nathan TeBlunthuis
abe217d2d5 fix configuration code 2024-12-01 15:21:51 -08:00
Nathan TeBlunthuis
9911f758f9 set memory usage. 2024-12-01 14:55:38 -08:00
Nathan TeBlunthuis
a31d8b26eb correct tf_name 2024-12-01 14:38:48 -08:00
Nathan TeBlunthuis
e40cc45d40 bugfix. 2024-12-01 14:10:47 -08:00
Nathan TeBlunthuis
d61746c9f7 make the output authors path. 2024-12-01 13:58:13 -08:00
Nathan TeBlunthuis
9df9a8b8ff rename function. 2024-12-01 13:44:19 -08:00
Nathan TeBlunthuis
3fea1f9388 sort and partition the term frequencies using spark. 2024-12-01 13:42:13 -08:00
Nathan TeBlunthuis
2b023fea8d bugfix 2024-12-01 09:58:09 -08:00
Nathan TeBlunthuis
88fca0f82b allow posts schemas to be nullable. 2024-12-01 09:55:12 -08:00
Nathan TeBlunthuis
271cbea7d9 add a 'limit' parameter for testing. 2024-12-01 09:51:49 -08:00
Nathan TeBlunthuis
4218bf864b debugging. 2024-12-01 09:39:50 -08:00
Nathan TeBlunthuis
22d6a6961c allow authors to be null in submissions. 2024-11-27 20:04:05 -08:00
Nathan TeBlunthuis
a5ca25dd6e bugfix. 2024-11-27 19:56:06 -08:00