Commit Graph

58 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
0613193e9d support passing in a model object. 2025-01-11 18:59:25 -08:00
Nathan TeBlunthuis
79d1826ba4 enforce min_df constraint in counting lsi features. 2024-12-30 16:17:31 -08:00
Nathan TeBlunthuis
3555542862 use min/max df constraints in counting nterms. 2024-12-30 16:10:50 -08:00
Nathan TeBlunthuis
a9b296dd73 bugfix 2024-12-28 20:18:53 -08:00
Nathan TeBlunthuis
d9db21686d remove unnecessary isoformat 2024-12-28 20:08:12 -08:00
Nathan TeBlunthuis
41fea31fce bugfix 2024-12-28 20:04:38 -08:00
Nathan TeBlunthuis
7aa22c7385 bugfix 2024-12-28 20:02:24 -08:00
Nathan TeBlunthuis
f11d4cfc72 use static tfidf (not weekly) to create tfidf matrix 2024-12-28 20:00:53 -08:00
Nathan TeBlunthuis
7b5ac73b2c use static tfidf (not weekly) to create tfidf matrix 2024-12-28 19:58:14 -08:00
Nathan TeBlunthuis
e2e7d7dbb1 more print debugging 2024-12-28 19:27:42 -08:00
Nathan TeBlunthuis
c317ef6475 debugging: print the shape 2024-12-28 19:21:24 -08:00
Nathan TeBlunthuis
c3cce0817e bugfix 2024-12-28 14:31:24 -08:00
Nathan TeBlunthuis
c9464f86f7 interface fix. 2024-12-28 14:27:56 -08:00
Nathan TeBlunthuis
f3db4efbb1 pass nterms as int. 2024-12-28 14:24:24 -08:00
Nathan TeBlunthuis
27f29e63fa typo fix. 2024-12-28 14:18:58 -08:00
Nathan TeBlunthuis
3f277ad99e pass weeks as strings. 2024-12-28 14:10:55 -08:00
Nathan TeBlunthuis
02ec11f726 no longer need to convert from spark dates into isoformat. 2024-12-28 13:55:54 -08:00
Nathan TeBlunthuis
104b708ff6 use duckdb not spark to prepare for weekly similarities. 2024-12-28 13:45:17 -08:00
Nathan TeBlunthuis
74ee86e443 add weekly_cosine_similarities script. 2024-12-25 21:15:38 -08:00
Nathan TeBlunthuis
a8a92d30df bugfix 2024-12-19 23:34:55 -08:00
Nathan TeBlunthuis
638ab78375 comment out config. 2024-12-19 23:32:16 -08:00
Nathan TeBlunthuis
8cb75c8354 typo fix. 2024-12-19 20:10:34 -08:00
Nathan TeBlunthuis
0bbdc6bd5e typo fix. 2024-12-19 20:09:00 -08:00
Nathan TeBlunthuis
8b69801c8d correct number of partitions. 2024-12-19 19:39:18 -08:00
Nathan TeBlunthuis
189330198c repartition for parallelism. 2024-12-19 17:53:27 -08:00
Nathan TeBlunthuis
c6c9ec173b add shebang 2024-12-15 18:47:07 -08:00
Nathan TeBlunthuis
52694e0498 typofix 2024-12-15 08:23:06 -08:00
Nathan TeBlunthuis
cb2f2c9717 make executable. 2024-12-15 08:18:42 -08:00
Nathan TeBlunthuis
3d192ab82f Merge remote-tracking branch 'origin/icwsm_dataverse' 2024-12-12 07:45:06 -08:00
Nathan TeBlunthuis
e2b6c1b481 configure to use the g2-cpu node. 2024-12-12 07:17:10 -08:00
Nathan TeBlunthuis
51234f1070 add inpath param for tfidf_authors_weekly. 2024-12-03 10:16:23 -08:00
Nathan TeBlunthuis
0a6ad65baf add shebang 2024-12-03 09:06:40 -08:00
Nathan TeBlunthuis
7096785cb9 make exe 2024-12-03 09:05:44 -08:00
Nathan TeBlunthuis
355d014d5f pass path into tfidf function. 2024-12-02 08:03:19 -08:00
07b0dff9bc changes for archiving. 2023-05-23 17:18:19 -07:00
55b75ea6fc Merge remote-tracking branch 'refs/remotes/origin/excise_reindex' into excise_reindex 2022-04-06 11:14:13 -07:00
197518a222 git-annex in 2022-04-06 11:11:11 -07:00
53f5b8c03c add note to try other tf normalization strategies. 2022-03-31 12:17:16 -07:00
7b130a30af commit changes from smap project. 2022-01-19 13:57:02 -08:00
541e125b28 lsi support for weekly similarities 2021-08-11 22:48:33 -07:00
ce549c6c97 Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex 2021-08-03 15:13:21 -07:00
6e43294a41 Updates to similarities code for smap project. 2021-08-03 15:06:48 -07:00
2d21ff1137 Merge branch 'master' of code:cdsc_reddit into excise_reindex 2021-08-03 15:02:08 -07:00
Nate E TeBlunthuis
cf86c7492c update clustering scripts 2021-08-03 14:55:02 -07:00
Nate E TeBlunthuis
0b95bea30e support isolates in visualization 2021-05-13 22:26:58 -07:00
Nate E TeBlunthuis
e1c9d9af6f Remove 'exclude phrases' parameter. 2021-05-03 10:37:09 -07:00
Nate E TeBlunthuis
7df8436067 Use Latent semantic indexing and hdbscan 2021-05-02 23:39:55 -07:00
Nate E TeBlunthuis
36b24ee933 reindex tfidf in memory instead of using spark 2021-04-30 12:48:19 -07:00
Nate E TeBlunthuis
6a3bfa26ee bugfix 2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis
0fe120e4ab support passing in list of tfidf vectors. Also lowercases included subreddits. 2021-04-26 11:44:56 -07:00