cdsc_reddit

Author	SHA1	Message	Date
Benjamin Mako Hill	18925dfe5b	datasets/: add PYTHON variable to add_months scripts GNU parallel spawns fresh shells that don't inherit the active venv. Using an explicit PYTHON path ensures the right interpreter is used in parallel tasks. Defaults to python3 but can be overridden: PYTHON=/path/to/venv/bin/python3 ./add_months.sh ... Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:42:05 -07:00
Benjamin Mako Hill	926e9bc364	datasets/: fat-node add_months.sh; multinode variant as separate script add_months.sh now targets a single fat node directly: starts a local Spark cluster via start_spark_cluster.sh, submits jobs, stops the cluster. No salloc needed. add_months_multinode.sh is a new script for the multi-node case using start_spark_and_run.sh from a login node. Usage takes NODES as first arg. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:35:12 -07:00
Benjamin Mako Hill	526dc03732	datasets/add_months.sh: stop before copy step to force manual verification The script now exits after Part 2 so the copy and cleanup commands must be run manually. This prevents the live datasets from being touched without a deliberate verification step in between. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:22:03 -07:00
Benjamin Mako Hill	6b18840604	datasets/: stage new layer before touching live datasets in add_months Replace mode='append'-direct-to-live approach with a safer staging workflow: Part 2 writes the new sorted layer to temp staging directories, the user verifies, then a separate copy step adds the files to the live datasets. Live datasets are never touched until the copy step, and the copy only adds files — nothing is deleted or overwritten. - sort_and_write gains out_by_subreddit/out_by_author params (replaces mode param) so Part 2 can target staging paths - comments_part2.py, submissions_part2.py: expose new params via Fire - add_months.sh: rewritten with explicit staging dirs, verify checkpoint, and find-based copy step Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:17:38 -07:00
Benjamin Mako Hill	2d1d760142	datasets/: replace add_new_month with layered append workflow Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale. - dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename - comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire - add_months.sh: new layered append script (not yet tested) - merge_layers.sh: new layer collapse script (not yet tested) - comments_merge.py, submissions_merge.py: Spark entry points for merge - add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale) - README.md: document three workflows; flag untested sections Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 17:59:36 -07:00
Benjamin Mako Hill	1851132a06	move dataset + similarity docs from wiki into repo READMEs The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough (Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods section, both of which duplicated or risked drifting from the actual code. Move both into the repo so they stay in sync with the scripts they describe: - datasets/README.md: expand with the wiki's "Building Parquet Datasets" prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible, adapted to the new script names and dropping obsolete notes about pull_pushshift_.sh / check__shas.py). - similarities/README.md (new): port the wiki's Subreddit Similarity section — TF-IDF math, PMI phrase detection, cosine similarity — with MediaWiki math converted to markdown LaTeX and script references updated to current paths. The wiki page has been trimmed to a landing page that points at these README files in gitea. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 17:20:21 -07:00
Benjamin Mako Hill	33150243cd	datasets/: split parquet scripts; share logic in dumps_helper.py Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with: - dumps_helper.py — schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top. - comments_part1.py, submissions_part1.py — thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list). - comments_part2.py, submissions_part2.py — thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost. Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit. Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands. Runners and README updated for the new CLIs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 16:51:41 -07:00
Benjamin Mako Hill	8965a251b6	refactor datasets/ pipeline; add build/add-month workflows Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing — only the schema and JSON parser differ between types. Drop the per-source part rolling; one parquet per input zst, since Spark handles big parquet files via internal row groups. Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything, add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging. Also fixes several lurking bugs in the original code: the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2; the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort; references to a missing parse_submissions.sh in the old .sh runners; and the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 16:30:54 -07:00
Benjamin Mako Hill	d201930951	rewrite README, remove dead pushshift scripts and old/ Pushshift's files.pushshift.io archive is gone since Reddit cut off third-party API access in 2023, so the dumps/ pull and SHA-check scripts no longer work. The old/ directory of pre-refactor scripts was likewise superseded by current versions in similarities/. README rewritten to credit Nate as original developer, name current maintainers, document the directory layout, point at the CDSC wiki for the ArcticShift/torrent-based workflow, fix several stale script paths, and correct an incorrect tf-normalization formula (max, not sum). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 15:53:33 -07:00
Nathan TeBlunthuis	53f5b8c03c	add note to try other tf normalization strategies.	2022-03-31 12:17:16 -07:00
Nathan TeBlunthuis	14ab979f59	Merge branch 'master' of code:cdsc_reddit	2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis	c6122bb429	Merge branch 'master' of code:cdsc_reddit	2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis	596e1ff339	no longer do we need to get daily dumps	2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis	6a3bfa26ee	bugfix	2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis	3a758f1fc8	Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch	2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis	806cfc948f	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis	0fe120e4ab	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis	f20365c07e	Merge branch 'master' of code:cdsc_reddit	2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis	34e0a0a30d	version of weekly_cosine_similarities.py from klone	2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis	003a48aea5	bugfix in weekly similarities	2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis	37dd0ef55f	bugfixes in clustering selection.	2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis	ac06a8757a	calculate some user-level attributes to detect bots	2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis	01a4c35358	grid sweep selection for clustering hyperparameters	2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis	628a70734b	Merge branch 'master' of code:cdsc_reddit	2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis	f0176d9f0d	Changes for cosine similarities on klone.	2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis	36cb0a5546	add code for pulling activity time series from parquet.	2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis	06430903f0	add included_subreddits parameter to cosine similarities.	2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis	4dc949de5f	Changes from hyak.	2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis	140d1bdd17	fix bug in viz.	2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis	554660275f	add visualization for 10000 subreddits based on author-tf similarities.	2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis	b4dd9acbd8	Merge branch 'master' of code:cdsc_reddit	2021-01-27 20:09:23 -08:00
Nathan TeBlunthuis	dbe4c87f8b	add cluster selection to visualization	2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis	3155600514	remove nsfw subs from topN	2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis	4e20dce188	Updating to support wang-style user overlaps.	2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis	56269deee3	Some improvements to run affinity clustering on larger dataset and compute density.	2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis	e6294b5b90	Refactor and reorganize.	2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis	a60747292e	Add code for running tf-idf at the weekly level.	2020-12-01 22:54:48 -08:00
Nathan TeBlunthuis	db5879d6c9	refactor visualization code.	2020-11-17 16:46:49 -08:00
Nathan TeBlunthuis	13eb95b3b0	Merge remote-tracking branch 'refs/remotes/origin/master' into master	2020-11-17 16:33:14 -08:00
Nathan TeBlunthuis	2cc897543a	git-annex in nathante@nate-x1:~/cdsc_reddit	2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis	1bf206d219	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis	f8ff8b2d0f	Update code for clustering + tsne.	2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis	82d184d9c6	Update code for building similarity matrices.	2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis	e794214653	bugfix in completing tfidf similarity matrices.	2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis	220a540beb	increase learning rate.	2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis	cd43a94865	increase iterations and perplexity and early_exaggeration	2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis	ca6a8f0896	increase learning rate	2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis	ed0e1a8235	Fix bug in tsne.	2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis	6baa08889b	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis	4447c60265	split fitting and plotting tsne.	2020-11-11 16:38:22 -08:00


102 Commits
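
Commit 33150243cd describes a generic parse_record loop with per-field handler dispatch, an _edited handler that returns an (edited, time_edited) tuple, and error-path rows that match the FIELDS list exactly. The following is a minimal sketch of that pattern, not the code in dumps_helper.py. The column list, the HANDLERS and DERIVED names, and the assumption that the raw `edited` value is either a boolean or an edit timestamp are all illustrative. The standard-library json module stands in for simdjson so the sketch runs without extra dependencies.

```python
import json

# Hypothetical output columns; the real FIELDS lists live in dumps_helper.py.
FIELDS = ["id", "author", "edited", "time_edited"]


def _edited(record):
    """Return an (edited, time_edited) pair from the raw 'edited' value.

    Assumption: 'edited' is missing, a boolean, or an edit timestamp.
    """
    raw = record.get("edited", False)
    if raw is None or isinstance(raw, bool):
        return (bool(raw), None)
    return (True, raw)


# Per-field handlers; fields without a handler are copied through unchanged.
HANDLERS = {"edited": _edited}
# Columns filled by another field's handler, so the loop skips them.
DERIVED = {"time_edited"}


def parse_record(line, fields=FIELDS, handlers=HANDLERS):
    """Parse one JSON line into a row with exactly len(fields) elements."""
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # Error path: pad to the full column count so the row stays aligned.
        return [None] * len(fields)

    row = []
    for name in fields:
        if name in DERIVED:
            continue
        handler = handlers.get(name)
        if handler is None:
            row.append(record.get(name))
            continue
        value = handler(record)
        # A handler may return a tuple that expands into several columns.
        if isinstance(value, tuple):
            row.extend(value)
        else:
            row.append(value)
    return row
```

In this sketch, adding a field means adding its name to FIELDS and, if it needs special treatment, one entry in HANDLERS. Building the error row from len(fields) keeps it the same length as a successful row, which is the alignment property the commit message says the original FIELDS lists broke.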