100 Commits

Author SHA1 Message Date
526dc03732 datasets/add_months.sh: stop before copy step to force manual verification
The script now exits after Part 2 so the copy and cleanup commands must
be run manually. This prevents the live datasets from being touched
without a deliberate verification step in between.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:22:03 -07:00
6b18840604 datasets/: stage new layer before touching live datasets in add_months
Replace mode='append'-direct-to-live approach with a safer staging
workflow: Part 2 writes the new sorted layer to temp staging directories,
the user verifies, then a separate copy step adds the files to the live
datasets. Live datasets are never touched until the copy step, and the
copy only adds files — nothing is deleted or overwritten.

- sort_and_write gains out_by_subreddit/out_by_author params (replaces
  mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
  and find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:17:38 -07:00
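
The copy step described in 6b18840604 (add files from staging, never delete or overwrite) could be sketched roughly as follows. This is a minimal illustration in Python, not the find-based shell step in add_months.sh; the function name `copy_new_files` and the directory layout are assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_new_files(staging_dir, live_dir):
    """Copy every file under staging_dir into live_dir, never overwriting.

    Hypothetical helper: returns the destination paths that were created and
    raises FileExistsError before copying anything if a destination exists.
    """
    staging = Path(staging_dir)
    live = Path(live_dir)
    # Pair each staged file with its destination under the live root.
    pairs = [
        (src, live / src.relative_to(staging))
        for src in sorted(staging.rglob("*"))
        if src.is_file()
    ]
    # Check for collisions up front so a failure leaves the live data untouched.
    clashes = [dst for _, dst in pairs if dst.exists()]
    if clashes:
        raise FileExistsError(f"refusing to overwrite: {clashes[0]}")
    created = []
    for src, dst in pairs:
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        created.append(dst)
    return created
```

Checking every destination before the first copy means a name collision aborts the whole step instead of leaving the live dataset half-updated.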
2d1d760142 datasets/: replace add_new_month with layered append workflow
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
  build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:59:36 -07:00
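
The idea behind collapsing layers in 2d1d760142 can be illustrated with a small stand-in. The real merge_layers function runs on Spark over parquet partitions; the sketch below instead assumes each layer is a sorted, newline-terminated text file, and the name `merge_sorted_layers` is hypothetical. It shows only the two ingredients the commit message names: merging already-sorted layers without a full re-sort, and publishing the result with an atomic rename.

```python
import heapq
import os
from pathlib import Path


def merge_sorted_layers(layer_paths, out_path):
    """Merge sorted text layers into out_path and publish it atomically.

    Assumes every input file is already sorted and ends with a newline.
    The input layers are left in place; removing them is a separate step.
    """
    out = Path(out_path)
    # Write next to the target so os.replace stays on one filesystem.
    tmp = out.with_name(out.name + ".tmp")
    files = [open(p, "r", encoding="utf-8") for p in layer_paths]
    try:
        with open(tmp, "w", encoding="utf-8") as dst:
            # heapq.merge streams the inputs in order without loading them fully.
            dst.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
    # Readers see either the old file or the complete new one, never a partial.
    os.replace(tmp, out)
    return out
```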
1851132a06 move dataset + similarity docs from wiki into repo READMEs
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:

- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
  prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
  adapted to the new script names and dropping obsolete notes about
  pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
  section — TF-IDF math, PMI phrase detection, cosine similarity — with
  MediaWiki math converted to markdown LaTeX and script references
  updated to current paths.

The wiki page has been trimmed to a landing page that points at these
README files in gitea.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
33150243cd datasets/: split parquet scripts; share logic in dumps_helper.py
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00
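
The handler-dispatch pattern and the time_edited fix described in 33150243cd might look something like the sketch below. It is an illustration only: it uses the standard-library json module rather than simdjson, and the field list, the `DERIVED` set, and the handler body are assumptions, not the contents of dumps_helper.py.

```python
import json

# Hypothetical field list; the real FIELDS lists live in dumps_helper.py.
FIELDS = ["id", "author", "edited", "time_edited"]


def _edited(value):
    # Assumed convention: `edited` is False (or missing) or an edit timestamp.
    if value is False or value is None:
        return (False, None)
    return (True, int(value))


# Handlers keyed by field name; fields without a handler are copied as-is.
# A handler that returns a tuple fills several consecutive output columns.
HANDLERS = {"edited": _edited}
# Columns filled by a tuple-returning handler are skipped in the main loop.
DERIVED = {"time_edited"}


def parse_record(line):
    try:
        record = json.loads(line)
    except ValueError:
        # Error-path rows get exactly one slot per schema field.
        return [None] * len(FIELDS)
    row = []
    for field in FIELDS:
        if field in DERIVED:
            continue
        value = record.get(field)
        handler = HANDLERS.get(field)
        if handler is None:
            row.append(value)
            continue
        result = handler(value)
        if isinstance(result, tuple):
            row.extend(result)
        else:
            row.append(result)
    return row
```

Because the error path builds its row from `len(FIELDS)`, a row that fails to parse has the same width as a good row, which is the alignment property the commit message says the original code lacked.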
8965a251b6 refactor datasets/ pipeline; add build/add-month workflows
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source part
rolling; one parquet per input zst, since Spark handles big parquet
files via internal row groups.

Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.

Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:30:54 -07:00
d201930951 rewrite README, remove dead pushshift scripts and old/
Pushshift's files.pushshift.io archive is gone since Reddit cut off
third-party API access in 2023, so the dumps/ pull and SHA-check scripts
no longer work. The old/ directory of pre-refactor scripts was likewise
superseded by current versions in similarities/.

README rewritten to credit Nate as original developer, name current
maintainers, document the directory layout, point at the CDSC wiki for
the ArcticShift/torrent-based workflow, fix several stale script paths,
and correct an incorrect tf-normalization formula (max, not sum).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 15:53:33 -07:00
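
For readers unfamiliar with the distinction in d201930951, max-normalized term frequency divides each term's count by the count of the most frequent term in the same document, rather than by the total number of terms. The commit message states only "max, not sum", so the sketch below is a generic illustration of that normalization, not the README's exact formula; the function name is hypothetical.

```python
from collections import Counter


def max_normalized_tf(terms):
    """Return {term: count / max_count} for one document's terms."""
    counts = Counter(terms)
    if not counts:
        return {}
    # Divide by the largest count in the document, not by the total count.
    peak = max(counts.values())
    return {term: n / peak for term, n in counts.items()}
```

Under this normalization the most frequent term always scores 1.0, and the scores of a document generally do not sum to 1.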
53f5b8c03c add note to try other tf normalization strategies. 2022-03-31 12:17:16 -07:00
14ab979f59 Merge branch 'master' of code:cdsc_reddit 2021-08-03 15:03:40 -07:00
Nate E TeBlunthuis
c6122bb429 Merge branch 'master' of code:cdsc_reddit 2021-07-28 15:32:21 -07:00
Nate E TeBlunthuis
596e1ff339 no longer do we need to get daily dumps 2021-07-28 15:32:04 -07:00
Nate E TeBlunthuis
6a3bfa26ee bugfix 2021-04-26 22:31:05 -07:00
Nate E TeBlunthuis
3a758f1fc8 Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch 2021-04-26 13:58:25 -07:00
Nate E TeBlunthuis
806cfc948f support passing in list of tfidf vectors.
Also lowercases included subreddits.
2021-04-26 13:20:43 -07:00
Nate E TeBlunthuis
0fe120e4ab support passing in list of tfidf vectors.
Also lowercases included subreddits.
2021-04-26 11:44:56 -07:00
Nate E TeBlunthuis
f20365c07e Merge branch 'master' of code:cdsc_reddit 2021-04-22 10:46:26 -07:00
Nate E TeBlunthuis
34e0a0a30d version of weekly_cosine_similarities.py from klone 2021-04-22 10:38:10 -07:00
Nate E TeBlunthuis
003a48aea5 bugfix in weekly similarities 2021-04-22 10:37:04 -07:00
Nate E TeBlunthuis
37dd0ef55f bugfixes in clustering selection. 2021-04-21 16:56:25 -07:00
Nate E TeBlunthuis
ac06a8757a calculate some user-level attributes to detect bots 2021-04-20 11:34:36 -07:00
Nate E TeBlunthuis
01a4c35358 grid sweep selection for clustering hyperparameters 2021-04-20 11:33:54 -07:00
Nate E TeBlunthuis
628a70734b Merge branch 'master' of code:cdsc_reddit 2021-04-05 23:21:35 -07:00
Nate E TeBlunthuis
f0176d9f0d Changes for cosine similarities on klone. 2021-04-05 23:21:06 -07:00
Nate E TeBlunthuis
36cb0a5546 add code for pulling activity time series from parquet. 2021-03-24 16:08:57 -07:00
Nate E TeBlunthuis
06430903f0 add included_subreddits parameter to cosine similarities. 2021-02-22 18:38:34 -08:00
Nate E TeBlunthuis
4dc949de5f Changes from hyak. 2021-02-22 16:03:48 -08:00
Nate E TeBlunthuis
140d1bdd17 fix bug in viz. 2021-01-27 20:26:15 -08:00
Nate E TeBlunthuis
554660275f add visualization for 10000 subreddits based on author-tf similarities. 2021-01-27 20:22:24 -08:00
Nate E TeBlunthuis
b4dd9acbd8 Merge branch 'master' of code:cdsc_reddit 2021-01-27 20:09:23 -08:00
dbe4c87f8b add cluster selection to visualization 2021-01-27 20:08:07 -08:00
Nate E TeBlunthuis
3155600514 remove nsfw subs from topN 2020-12-28 21:11:44 -08:00
Nate E TeBlunthuis
4e20dce188 Updating to support wang-style user overlaps. 2020-12-24 22:38:04 -08:00
Nate E TeBlunthuis
56269deee3 Some improvements to run affinity clustering on larger dataset and
compute density.
2020-12-12 20:42:47 -08:00
Nate E TeBlunthuis
e6294b5b90 Refactor and reorganize. 2020-12-08 17:32:20 -08:00
Nate E TeBlunthuis
a60747292e Add code for running tf-idf at the weekly level. 2020-12-01 22:54:48 -08:00
db5879d6c9 refactor visualization code. 2020-11-17 16:46:49 -08:00
13eb95b3b0 Merge remote-tracking branch 'refs/remotes/origin/master' into master 2020-11-17 16:33:14 -08:00
2cc897543a git-annex in nathante@nate-x1:~/cdsc_reddit 2020-11-17 16:33:13 -08:00
Nate E TeBlunthuis
1bf206d219 git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-17 16:31:48 -08:00
Nate E TeBlunthuis
f8ff8b2d0f Update code for clustering + tsne. 2020-11-17 15:59:20 -08:00
Nate E TeBlunthuis
82d184d9c6 Update code for building similarity matrices. 2020-11-17 12:52:48 -08:00
Nate E TeBlunthuis
e794214653 bugfix in completing tfidf similarity matrices. 2020-11-12 11:47:53 -08:00
Nate E TeBlunthuis
220a540beb increase learning rate. 2020-11-11 16:58:39 -08:00
Nate E TeBlunthuis
cd43a94865 increase iterations and perplexity and early_exaggeration 2020-11-11 16:55:39 -08:00
Nate E TeBlunthuis
ca6a8f0896 increase learning rate 2020-11-11 16:48:41 -08:00
Nate E TeBlunthuis
ed0e1a8235 Fix bug in tsne. 2020-11-11 16:43:41 -08:00
Nate E TeBlunthuis
6baa08889b git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit 2020-11-11 16:39:44 -08:00
Nate E TeBlunthuis
4447c60265 split fitting and plotting tsne. 2020-11-11 16:38:22 -08:00
db53c0138a Add file to plot related subreddits using tsne. 2020-11-11 16:05:36 -08:00
Nate E TeBlunthuis
4c8bd14992 Bugfix (typo) 2020-11-10 13:38:11 -08:00