cdsc_reddit

collective/cdsc_reddit

Fork 0

Commit Graph

Author	SHA1	Message	Date
Benjamin Mako Hill	6b18840604	datasets/: stage new layer before touching live datasets in add_months Replace mode='append'-direct-to-live approach with a safer staging workflow: Part 2 writes the new sorted layer to temp staging directories, the user verifies, then a separate copy step adds the files to the live datasets. Live datasets are never touched until the copy step, and the copy only adds files — nothing is deleted or overwritten. - sort_and_write gains out_by_subreddit/out_by_author params (replaces mode param) so Part 2 can target staging paths - comments_part2.py, submissions_part2.py: expose new params via Fire - add_months.sh: rewritten with explicit staging dirs, verify checkpoint, and find-based copy step Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 18:17:38 -07:00
Benjamin Mako Hill	2d1d760142	datasets/: replace add_new_month with layered append workflow Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale. - dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename - comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire - add_months.sh: new layered append script (not yet tested) - merge_layers.sh: new layer collapse script (not yet tested) - comments_merge.py, submissions_merge.py: Spark entry points for merge - add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale) - README.md: document three workflows; flag untested sections Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-25 17:59:36 -07:00

Author

SHA1

Message

Date

Benjamin Mako Hill

6b18840604

datasets/: stage new layer before touching live datasets in add_months

Replace mode='append'-direct-to-live approach with a safer staging
workflow: Part 2 writes the new sorted layer to temp staging directories,
the user verifies, then a separate copy step adds the files to the live
datasets. Live datasets are never touched until the copy step, and the
copy only adds files — nothing is deleted or overwritten.

- sort_and_write gains out_by_subreddit/out_by_author params (replaces
  mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
  and find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 18:17:38 -07:00

Benjamin Mako Hill

2d1d760142

datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
  build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 17:59:36 -07:00

2 Commits