18
0
Commit Graph

12 Commits

Author SHA1 Message Date
1851132a06 move dataset + similarity docs from wiki into repo READMEs
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:

- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
  prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
  adapted to the new script names and dropping obsolete notes about
  pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
  section — TF-IDF math, PMI phrase detection, cosine similarity — with
  MediaWiki math converted to markdown LaTeX and script references
  updated to current paths.

The wiki page has been trimmed to a landing page that points at these
README files in gitea.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
53f5b8c03c add note to try other tf normalization strategies. 2022-03-31 12:17:16 -07:00
Nate E TeBlunthuis
6a3bfa26ee bugfix 2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis
0fe120e4ab support passing in list of tfidf vectors.
Also lowercases included subreddits.
2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis
003a48aea5 bugfix in weekly similarities 2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis
f0176d9f0d Changes for cosine similarities on klone. 2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis
06430903f0 add included_subreddits parameter to cosine similarities. 2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis
4dc949de5f Changes from hyak. 2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis
3155600514 remove nsfw subs from topN 2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis
4e20dce188 Updating to support wang-style user overlaps. 2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis
56269deee3 Some improvements to run affinity clustering on larger dataset and
compute density.
2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis
e6294b5b90 Refactor and reorganze. 2020-12-08 17:32:20 -08:00