cdsc_reddit/datasets/add_months.sh at 2d1d76014279e69ceb2c3d48a3c0bb3834a89fd2

Benjamin Mako Hill 2d1d760142 datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.
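The layered append idea described above can be sketched in a few lines. This is a minimal illustration using plain text files and the standard library, not the repository's Spark-based implementation; the `layer-NNNN.txt` naming and the `append_layer` and `read_sorted` helpers are hypothetical.

```python
import heapq
import tempfile
from pathlib import Path


def append_layer(partition_dir: Path, rows):
    """Sort only the new rows and write them as the next layer file.

    Existing layers are left untouched, so the cost depends on the size
    of the new data rather than the size of the whole corpus.
    """
    partition_dir.mkdir(parents=True, exist_ok=True)
    layer_no = len(list(partition_dir.glob("layer-*.txt")))
    path = partition_dir / f"layer-{layer_no:04d}.txt"  # hypothetical naming
    path.write_text("".join(f"{r}\n" for r in sorted(rows)))
    return path


def read_sorted(partition_dir: Path):
    """Return all rows in order; each layer is already sorted, so a
    k-way merge (heapq.merge) is enough and no full re-sort is needed."""
    layers = sorted(partition_dir.glob("layer-*.txt"))
    streams = [p.read_text().splitlines() for p in layers]
    return list(heapq.merge(*streams))


# Two incremental runs, each adding one "month" of rows.
demo_dir = Path(tempfile.mkdtemp()) / "partition-a"
append_layer(demo_dir, ["2024-01-03", "2024-01-01"])
append_layer(demo_dir, ["2024-02-02", "2024-02-01"])
merged = read_sorted(demo_dir)
```

The trade-off is that readers must merge several files per partition until the layers are collapsed.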

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (a full re-sort on every add duplicates the
  work of build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections
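Collapsing layers via atomic rename, as mentioned for merge_layers, might look roughly like the sketch below. It assumes plain text layer files in a single directory; the `collapse_layers` helper, the `layer-NNNN.txt` naming, and the `merged.tmp` temporary name are hypothetical and are not taken from dumps_helper.py.

```python
import heapq
import os
import tempfile
from pathlib import Path


def collapse_layers(partition_dir: Path) -> Path:
    """Merge every sorted layer into one file, then swap it into place."""
    layers = sorted(partition_dir.glob("layer-*.txt"))
    streams = [p.read_text().splitlines() for p in layers]

    # Write the merged result under a temporary name in the same directory
    # so the final rename stays on one filesystem.
    tmp = partition_dir / "merged.tmp"
    tmp.write_text("".join(f"{r}\n" for r in heapq.merge(*streams)))

    # os.replace is atomic when source and target share a filesystem:
    # readers see either the old file or the complete new one.
    final = partition_dir / "layer-0000.txt"
    os.replace(tmp, final)

    # Remove the layers that have been folded into the merged file.
    for p in layers:
        if p != final:
            p.unlink()
    return final


demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "layer-0000.txt").write_text("a\nc\n")
(demo_dir / "layer-0001.txt").write_text("b\nd\n")
final = collapse_layers(demo_dir)
remaining = sorted(p.name for p in demo_dir.iterdir())
```

Only the rename itself is atomic in this sketch. Between the rename and the cleanup loop, the old layers briefly coexist with the merged file, so a real implementation would need to account for that window.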

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-25 17:59:36 -07:00

3.0 KiB

Executable File
