datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:59:36 -07:00
parent 1851132a06
commit 2d1d760142
10 changed files with 273 additions and 85 deletions
--- a/datasets/build_from_scratch.sh
+++ b/datasets/build_from_scratch.sh
@@ -16,8 +16,8 @@
 # - GNU parallel installed
 # - start_spark_and_run.sh on PATH (Hyak-provided wrapper)
 #
-# To add one new month to an existing build instead of rebuilding from
-# scratch, use add_new_month.sh.
+# To add new months to an existing build without rebuilding from scratch,
+# use add_months.sh.

 set -e
 cd "$(dirname "$0")"