cdsc_reddit

Author	SHA1	Message	Date
Benjamin Mako Hill	8965a251b6	refactor datasets/ pipeline; add build/add-month workflows Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing — only the schema and JSON parser differ between types. Drop the per-source part rolling; one parquet per input zst, since Spark handles big parquet files via internal row groups. Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything, add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging. Also fixes several lurking bugs in the original code: the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2; the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort; references to a missing parse_submissions.sh in the old .sh runners; and the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 16:30:54 -07:00
Benjamin Mako Hill	d201930951	rewrite README, remove dead pushshift scripts and old/ Pushshift's files.pushshift.io archive is gone since Reddit cut off third-party API access in 2023, so the dumps/ pull and SHA-check scripts no longer work. The old/ directory of pre-refactor scripts was likewise superseded by current versions in similarities/. README rewritten to credit Nate as original developer, name current maintainers, document the directory layout, point at the CDSC wiki for the ArcticShift/torrent-based workflow, fix several stale script paths, and correct an incorrect tf-normalization formula (max, not sum). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 15:53:33 -07:00
Nathan TeBlunthuis	53f5b8c03c	add note to try other tf normalization strategies.	2022-03-31 12:17:16 -07:00
Nathan TeBlunthuis	14ab979f59	Merge branch 'master' of code:cdsc_reddit	2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis	c6122bb429	Merge branch 'master' of code:cdsc_reddit	2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis	596e1ff339	no longer do we need to get daily dumps	2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis	6a3bfa26ee	bugfix	2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis	3a758f1fc8	Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch	2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis	806cfc948f	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis	0fe120e4ab	support passing in list of tfidf vectors. Also lowercases included subreddits.	2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis	f20365c07e	Merge branch 'master' of code:cdsc_reddit	2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis	34e0a0a30d	version of weekly_cosine_similarities.py from klone	2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis	003a48aea5	bugfix in weekly similarities	2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis	37dd0ef55f	bugfixes in clustering selection.	2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis	ac06a8757a	calculate some user-level attributes to detect bots	2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis	01a4c35358	grid sweep selection for clustering hyperparameters	2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis	628a70734b	Merge branch 'master' of code:cdsc_reddit	2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis	f0176d9f0d	Changes for cosine similarities on klone.	2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis	36cb0a5546	add code for pulling activity time series from parquet.	2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis	06430903f0	add included_subreddits parameter to cosine similarities.	2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis	4dc949de5f	Changes from hyak.	2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis	140d1bdd17	fix bug in viz.	2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis	554660275f	add visualization for 10000 subreddits based on author-tf similarities.	2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis	b4dd9acbd8	Merge branch 'master' of code:cdsc_reddit	2021-01-27 20:09:23 -08:00
Nathan TeBlunthuis	dbe4c87f8b	add cluster selection to visualization	2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis	3155600514	remove nsfw subs from topN	2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis	4e20dce188	Updating to support wang-style user overlaps.	2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis	56269deee3	Some improvements to run affinity clustering on larger dataset and compute density.	2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis	e6294b5b90	Refactor and reorganize.	2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis	a60747292e	Add code for running tf-idf at the weekly level.	2020-12-01 22:54:48 -08:00
Nathan TeBlunthuis	db5879d6c9	refactor visualization code.	2020-11-17 16:46:49 -08:00
Nathan TeBlunthuis	13eb95b3b0	Merge remote-tracking branch 'refs/remotes/origin/master' into master	2020-11-17 16:33:14 -08:00
Nathan TeBlunthuis	2cc897543a	git-annex in nathante@nate-x1:~/cdsc_reddit	2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis	1bf206d219	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis	f8ff8b2d0f	Update code for clustering + tsne.	2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis	82d184d9c6	Update code for building similarity matrices.	2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis	e794214653	bugfix in completing tfidf similarity matrices.	2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis	220a540beb	increase learning rate.	2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis	cd43a94865	increase iterations and perplexity and early_exaggeration	2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis	ca6a8f0896	increase learning rate	2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis	ed0e1a8235	Fix bug in tsne.	2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis	6baa08889b	git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit	2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis	4447c60265	split fitting and plotting tsne.	2020-11-11 16:38:22 -08:00
Nathan TeBlunthuis	db53c0138a	Add file to plot related subreddits using tsne.	2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis	4c8bd14992	Bugfix (typo)	2020-11-10 13:38:11 -08:00
Nate E TeBlunthuis	39c581bee9	Reuse code for term and author cosine similarity.	2020-11-10 13:18:57 -08:00
Nate E TeBlunthuis	5632a971c6	Refactor tfidf code for code reuse.	2020-11-10 13:18:19 -08:00
Nate E TeBlunthuis	772f3a8fbd	rename 'idf' files to 'tfidf'	2020-11-10 13:16:55 -08:00
Nate E TeBlunthuis	6edd155749	Improvements to idf code	2020-11-10 13:12:11 -08:00
Nate E TeBlunthuis	8b8c45ee2d	Merge branch 'master' of code:cdsc_reddit	2020-11-02 10:40:12 -08:00

95 Commits