d201930951
rewrite README, remove dead pushshift scripts and old/
...
Pushshift's files.pushshift.io archive is gone since Reddit cut off
third-party API access in 2023, so the dumps/ pull and SHA-check scripts
no longer work. The old/ directory of pre-refactor scripts was likewise
superseded by current versions in similarities/.
README rewritten to credit Nate as original developer, name current
maintainers, document the directory layout, point at the CDSC wiki for
the ArcticShift/torrent-based workflow, fix several stale script paths,
and correct an incorrect tf-normalization formula (max, not sum).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com >
2026-05-25 15:53:33 -07:00
53f5b8c03c
add note to try other tf normalization strategies.
2022-03-31 12:17:16 -07:00
14ab979f59
Merge branch 'master' of code:cdsc_reddit
2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis
c6122bb429
Merge branch 'master' of code:cdsc_reddit
2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis
596e1ff339
no longer do we need to get daily dumps
2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis
6a3bfa26ee
bugfix
2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis
3a758f1fc8
Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch
2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis
806cfc948f
support passing in list of tfidf vectors.
...
Also lowercases included subreddits.
2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis
0fe120e4ab
support passing in list of tfidf vectors.
...
Also lowercases included subreddits.
2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis
f20365c07e
Merge branch 'master' of code:cdsc_reddit
2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis
34e0a0a30d
version of weekly_cosine_similarities.py from klone
2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis
003a48aea5
bugfix in weekly similarities
2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis
37dd0ef55f
bugfixes in clustering selection.
2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis
ac06a8757a
calculate some user-level attributes to detect bots
2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis
01a4c35358
grid sweep selection for clustering hyperparameters
2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis
628a70734b
Merge branch 'master' of code:cdsc_reddit
2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis
f0176d9f0d
Changes for cosine similarities on klone.
2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis
36cb0a5546
add code for pulling activity time series from parquet.
2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis
06430903f0
add included_subreddits parameter to cosine similarities.
2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis
4dc949de5f
Changes from hyak.
2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis
140d1bdd17
fix bug in viz.
2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis
554660275f
add visualization for 10000 subreddits based on author-tf similarities.
2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis
b4dd9acbd8
Merge branch 'master' of code:cdsc_reddit
2021-01-27 20:09:23 -08:00
dbe4c87f8b
add cluster selection to visualization
2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis
3155600514
remove nsfw subs from topN
2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis
4e20dce188
Updating to support wang-style user overlaps.
2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis
56269deee3
Some improvements to run affinity clustering on larger dataset and
...
compute density.
2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis
e6294b5b90
Refactor and reorganze.
2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis
a60747292e
Add code for running tf-idf at the weekly level.
2020-12-01 22:54:48 -08:00
db5879d6c9
refactor visualization code.
2020-11-17 16:46:49 -08:00
13eb95b3b0
Merge remote-tracking branch 'refs/remotes/origin/master' into master
2020-11-17 16:33:14 -08:00
2cc897543a
git-annex in nathante@nate-x1:~/cdsc_reddit
2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis
1bf206d219
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis
f8ff8b2d0f
Update code for clustering + tsne.
2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis
82d184d9c6
Update code for building simlarity matrices.
2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis
e794214653
bugfix in completing tfidf similarity matrices.
2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis
220a540beb
increase learning rate.
2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis
cd43a94865
increase iterations and perplectity and early_exaggeration
2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis
ca6a8f0896
increase learning rate
2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis
ed0e1a8235
Fix bug in tsne.
2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis
6baa08889b
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis
4447c60265
split fitting and plotting tsne.
2020-11-11 16:38:22 -08:00
db53c0138a
Add file to plot related subreddits using tsne.
2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis
4c8bd14992
Bugfix (typo)
2020-11-10 13:38:11 -08:00
Nate E TeBlunthuis
39c581bee9
Reuse code for term and author cosine similarity.
2020-11-10 13:18:57 -08:00
Nate E TeBlunthuis
5632a971c6
Refactor tfidf code to for code resuse.
2020-11-10 13:18:19 -08:00
Nate E TeBlunthuis
772f3a8fbd
rename 'idf' files to 'tfidf'
2020-11-10 13:16:55 -08:00
Nate E TeBlunthuis
6edd155749
Improvements to idf code
2020-11-10 13:12:11 -08:00
Nate E TeBlunthuis
8b8c45ee2d
Merge branch 'master' of code:cdsc_reddit
2020-11-02 10:40:12 -08:00
Nate E TeBlunthuis
3dc17bd27c
add term_cosine_similarity.py
2020-11-02 10:40:02 -08:00