The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated the actual code or risked drifting from it.
Move both into the repo so they stay in sync with the scripts they
describe:
- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
prose and the Steps 1-7 Hyak walkthrough (ported verbatim where possible,
adapted to the new script names, with obsolete notes about
pull_pushshift_*.sh / check_*_shas.py dropped).
- similarities/README.md (new): port the wiki's Subreddit Similarity
section — TF-IDF math, PMI phrase detection, cosine similarity — with
MediaWiki math converted to markdown LaTeX and script references
updated to current paths.
The wiki page has been trimmed to a landing page that points at these
README files in gitea.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>