Commit Graph

126 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
355d014d5f pass path into tfidf function. 2024-12-02 08:03:19 -08:00
Nathan TeBlunthuis
5a131053af spark config tweaks. 2024-12-01 15:41:47 -08:00
Nathan TeBlunthuis
224fb89317 bugfix. 2024-12-01 15:28:25 -08:00
Nathan TeBlunthuis
b25c332cea typo fix. 2024-12-01 15:27:16 -08:00
Nathan TeBlunthuis
613059737a set os environment for big machine 2024-12-01 15:25:18 -08:00
Nathan TeBlunthuis
abe217d2d5 fix configuration code 2024-12-01 15:21:51 -08:00
Nathan TeBlunthuis
9911f758f9 set memory usage. 2024-12-01 14:55:38 -08:00
Nathan TeBlunthuis
a31d8b26eb correct tf_name 2024-12-01 14:38:48 -08:00
Nathan TeBlunthuis
e40cc45d40 bugfix. 2024-12-01 14:10:47 -08:00
Nathan TeBlunthuis
d61746c9f7 make the output authors path. 2024-12-01 13:58:13 -08:00
Nathan TeBlunthuis
9df9a8b8ff rename function. 2024-12-01 13:44:19 -08:00
Nathan TeBlunthuis
3fea1f9388 sort and partition the term frequencies using spark. 2024-12-01 13:42:13 -08:00
Nathan TeBlunthuis
2b023fea8d bugfix 2024-12-01 09:58:09 -08:00
Nathan TeBlunthuis
88fca0f82b allow posts schemas to be nullable. 2024-12-01 09:55:12 -08:00
Nathan TeBlunthuis
271cbea7d9 add a 'limit' parameter for testing. 2024-12-01 09:51:49 -08:00
Nathan TeBlunthuis
4218bf864b debugging. 2024-12-01 09:39:50 -08:00
Nathan TeBlunthuis
22d6a6961c allow authors to be null in submissions. 2024-11-27 20:04:05 -08:00
Nathan TeBlunthuis
a5ca25dd6e bugfix. 2024-11-27 19:56:06 -08:00
Nathan TeBlunthuis
2e5181602b bugfix. 2024-11-27 19:53:04 -08:00
Nathan TeBlunthuis
0d7f4d3cec pass through stopWords. 2024-11-27 19:33:28 -08:00
Nathan TeBlunthuis
5d48c0eb55 pass through mwe_tokenize 2024-11-27 19:31:59 -08:00
Nathan TeBlunthuis
91cc1edf02 pass through mwe_pass 2024-11-27 19:20:49 -08:00
Nathan TeBlunthuis
2decdc9750 move function to outer scope. 2024-11-27 19:13:49 -08:00
Nathan TeBlunthuis
7da046735b move function to outer scope. 2024-11-27 19:10:34 -08:00
Nathan TeBlunthuis
0631256956 make the output directory. 2024-11-27 19:06:24 -08:00
Nathan TeBlunthuis
8cb9683bc2 bugfix 2024-11-27 19:03:52 -08:00
Nathan TeBlunthuis
587e1c0022 bugfix. 2024-11-27 18:56:22 -08:00
Nathan TeBlunthuis
78eb16f4d6 more path munging. 2024-11-27 18:53:16 -08:00
Nathan TeBlunthuis
a0a6a08bf2 handle case where we're in a parent directory. 2024-11-27 18:49:03 -08:00
Nathan TeBlunthuis
a84b633641 add absolute path to call. 2024-11-27 18:42:29 -08:00
Nathan TeBlunthuis
ce7b5f92eb bugfix. 2024-11-27 17:20:04 -08:00
Nathan TeBlunthuis
fbf905c740 rename file 2024-11-27 11:55:31 -08:00
Nathan TeBlunthuis
dd894ebf61 support posts in ngrams 2024-11-27 11:51:22 -08:00
53f5b8c03c add note to try other tf normalization strategies. 2022-03-31 12:17:16 -07:00
14ab979f59 Merge branch 'master' of code:cdsc_reddit 2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis
c6122bb429 Merge branch 'master' of code:cdsc_reddit 2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis
596e1ff339 no longer do we need to get daily dumps 2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis
6a3bfa26ee bugfix 2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis
3a758f1fc8 Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch 2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis
806cfc948f support passing in list of tfidf vectors. Also lowercases included subreddits. 2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis
0fe120e4ab support passing in list of tfidf vectors. Also lowercases included subreddits. 2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis
f20365c07e Merge branch 'master' of code:cdsc_reddit 2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis
34e0a0a30d version of weekly_cosine_similarities.py from klone 2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis
003a48aea5 bugfix in weekly similarities 2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis
37dd0ef55f bugfixes in clustering selection. 2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis
ac06a8757a calculate some user-level attributes to detect bots 2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis
01a4c35358 grid sweep selection for clustering hyperparameters 2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis
628a70734b Merge branch 'master' of code:cdsc_reddit 2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis
f0176d9f0d Changes for cosine similarities on klone. 2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis
36cb0a5546 add code for pulling activity time series from parquet. 2021-03-24 16:08:57 -07:00