cdsc_reddit

Author	SHA1	Message	Date
Benjamin Mako Hill	526dc03732	datasets/add_months.sh: stop before copy step to force manual verification The script now exits after Part 2 so the copy and cleanup commands must be run manually. This prevents the live datasets from being touched without a deliberate verification step in between. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:22:03 -07:00
Benjamin Mako Hill	6b18840604	datasets/: stage new layer before touching live datasets in add_months Replace mode='append'-direct-to-live approach with a safer staging workflow: Part 2 writes the new sorted layer to temp staging directories, the user verifies, then a separate copy step adds the files to the live datasets. Live datasets are never touched until the copy step, and the copy only adds files — nothing is deleted or overwritten. - sort_and_write gains out_by_subreddit/out_by_author params (replaces mode param) so Part 2 can target staging paths - comments_part2.py, submissions_part2.py: expose new params via Fire - add_months.sh: rewritten with explicit staging dirs, verify checkpoint, and find-based copy step Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:17:38 -07:00
Benjamin Mako Hill	2d1d760142	datasets/: replace add_new_month with layered append workflow Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale. - dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename - comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire - add_months.sh: new layered append script (not yet tested) - merge_layers.sh: new layer collapse script (not yet tested) - comments_merge.py, submissions_merge.py: Spark entry points for merge - add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale) - README.md: document three workflows; flag untested sections Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 17:59:36 -07:00
Benjamin Mako Hill	1851132a06	move dataset + similarity docs from wiki into repo READMEs The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe: - datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_.sh / check__shas.py). - similarities/README.md (new): port the wiki's Subreddit Similarity section — TF-IDF math, PMI phrase detection, cosine similarity — with MediaWiki math converted to markdown LaTeX and script references updated to current paths. The wiki page has been trimmed to a landing page that points at these README files in gitea. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 17:20:21 -07:00
Benjamin Mako Hill	33150243cd	datasets/: split parquet scripts; share logic in dumps_helper.py Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with: - dumps_helper.py — schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top. - comments_part1.py, submissions_part1.py — thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list). - comments_part2.py, submissions_part2.py — thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost. Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit. Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands. Runners and README updated for the new CLIs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 16:51:41 -07:00
Benjamin Mako Hill	8965a251b6	refactor datasets/ pipeline; add build/add-month workflows Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing — only the schema and JSON parser differ between types. Drop the per-source part rolling; one parquet per input zst, since Spark handles big parquet files via internal row groups. Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything, add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging. Also fixes several lurking bugs in the original code: the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2; the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort; references to a missing parse_submissions.sh in the old .sh runners; and the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 16:30:54 -07:00
Benjamin Mako Hill	d201930951	rewrite README, remove dead pushshift scripts and old/ Pushshift's files.pushshift.io archive is gone since Reddit cut off third-party API access in 2023, so the dumps/ pull and SHA-check scripts no longer work. The old/ directory of pre-refactor scripts was likewise superseded by current versions in similarities/. README rewritten to credit Nate as original developer, name current maintainers, document the directory layout, point at the CDSC wiki for the ArcticShift/torrent-based workflow, fix several stale script paths, and correct an incorrect tf-normalization formula (max, not sum). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 15:53:33 -07:00
Nathan TeBlunthuis	53f5b8c03c	add note to try other tf normalization strategies.	2022-03-31 12:17:16 -07:00
Nathan TeBlunthuis	14ab979f59	Merge branch 'master' of code:cdsc_reddit	2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis	c6122bb429	Merge branch 'master' of code:cdsc_reddit	2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis	596e1ff339	no longer do we need to get daily dumps	2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis	6a3bfa26ee	bugfix	2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis	3a758f1fc8	Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch	2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis	806cfc948f	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis	0fe120e4ab	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis	f20365c07e	Merge branch 'master' of code:cdsc_reddit	2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis	34e0a0a30d	version of weekly_cosine_similarities.py from klone	2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis	003a48aea5	bugfix in weekly similarities	2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis	37dd0ef55f	bugfixes in clustering selection.	2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis	ac06a8757a	calculate some user-level attributes to detect bots	2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis	01a4c35358	grid sweep selection for clustering hyperparameters	2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis	628a70734b	Merge branch 'master' of code:cdsc_reddit	2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis	f0176d9f0d	Changes for cosine similarities on klone.	2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis	36cb0a5546	add code for pulling activity time series from parquet.	2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis	06430903f0	add included_subreddits parameter to cosine similarities.	2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis	4dc949de5f	Changes from hyak.	2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis	140d1bdd17	fix bug in viz.	2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis	554660275f	add visualization for 10000 subreddits based on author-tf similarities.	2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis	b4dd9acbd8	Merge branch 'master' of code:cdsc_reddit	2021-01-27 20:09:23 -08:00
Nathan TeBlunthuis	dbe4c87f8b	add cluster selection to visualization	2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis	3155600514	remove nsfw subs from topN	2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis	4e20dce188	Updating to support wang-style user overlaps.	2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis	56269deee3	Some improvements to run affinity clustering on larger dataset and compute density.	2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis	e6294b5b90	Refactor and reorganize.	2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis	a60747292e	Add code for running tf-idf at the weekly level.	2020-12-01 22:54:48 -08:00
Nathan TeBlunthuis	db5879d6c9	refactor visualization code.	2020-11-17 16:46:49 -08:00
Nathan TeBlunthuis	13eb95b3b0	Merge remote-tracking branch 'refs/remotes/origin/master' into master	2020-11-17 16:33:14 -08:00
Nathan TeBlunthuis	2cc897543a	git-annex in nathante@nate-x1:~/cdsc_reddit	2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis	1bf206d219	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis	f8ff8b2d0f	Update code for clustering + tsne.	2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis	82d184d9c6	Update code for building similarity matrices.	2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis	e794214653	bugfix in completing tfidf similarity matrices.	2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis	220a540beb	increase learning rate.	2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis	cd43a94865	increase iterations and perplexity and early_exaggeration	2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis	ca6a8f0896	increase learning rate	2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis	ed0e1a8235	Fix bug in tsne.	2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis	6baa08889b	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis	4447c60265	split fitting and plotting tsne.	2020-11-11 16:38:22 -08:00
Nathan TeBlunthuis	db53c0138a	Add file to plot related subreddits using tsne.	2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis	4c8bd14992	Bugfix (typo)	2020-11-10 13:38:11 -08:00

100 Commits