datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (it re-sorted the full corpus on every add,
  which at corpus scale duplicates build_from_scratch)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
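
The layered append and collapse steps described above can be sketched in plain Python. This is an illustration only, not the code in dumps_helper.py: the real workflow runs on Spark over parquet partitions, while this stand-in uses sorted text files. The names append_layer, collapse_layers, and the layer-NNNN.txt file pattern are hypothetical.

```python
import heapq
import os
import tempfile
from pathlib import Path
from typing import Iterable


def append_layer(partition_dir: Path, rows: Iterable[str]) -> Path:
    """Write one new sorted layer next to the existing ones (hypothetical layout)."""
    partition_dir.mkdir(parents=True, exist_ok=True)
    layer_id = len(list(partition_dir.glob("layer-*.txt")))
    path = partition_dir / f"layer-{layer_id:04d}.txt"
    # Only the new rows are sorted; existing layers are left untouched.
    path.write_text("".join(f"{row}\n" for row in sorted(rows)))
    return path


def collapse_layers(partition_dir: Path) -> Path:
    """Merge all sorted layers into one file, then swap it in with an atomic rename."""
    layers = sorted(partition_dir.glob("layer-*.txt"))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=partition_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as out:
        handles = [open(p) for p in layers]
        try:
            # k-way merge of already-sorted inputs; no full re-sort is needed.
            merged_lines = heapq.merge(*handles, key=lambda line: line.rstrip("\n"))
            out.writelines(merged_lines)
        finally:
            for handle in handles:
                handle.close()
    merged = partition_dir / "layer-0000.txt"
    os.replace(tmp_name, merged)  # atomic replacement of the first layer
    for p in layers:
        if p != merged:
            p.unlink()
    return merged
```

Readers see either the old first layer or the fully merged file, never a partially written one, because the merged output only becomes visible through the final rename. The stale extra layers are deleted afterwards, so a crash between the rename and the cleanup could leave duplicate rows behind; a real implementation would need to handle that case.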
2026-05-25 17:59:36 -07:00
parent 1851132a06
commit 2d1d760142
10 changed files with 273 additions and 85 deletions


@@ -1,14 +1,18 @@
 #!/usr/bin/env python3
-"""Part 2 for comments: Spark sort + repartition the per-source parquets
-produced by comments_part1.py into the final by_subreddit / by_author
-datasets.
+"""Part 2 for comments: Spark sort + repartition into the final datasets.

-Launched via the Hyak-provided start_spark_and_run.sh wrapper:
+Must be launched from a login node via the Hyak-provided wrapper:

     start_spark_and_run.sh 1 comments_part2.py
+    start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets --mode=append
+
+--indir defaults to the temp comments dir in dumps_helper.py.
+--mode defaults to 'overwrite'; use 'append' to add a new layer without
+touching existing partition files (see add_months.sh).
 """

+import fire
 from dumps_helper import COMMENTS, sort_and_write

 if __name__ == "__main__":
-    sort_and_write(COMMENTS)
+    fire.Fire(lambda indir=None, mode='overwrite': sort_and_write(COMMENTS, indir=indir, mode=mode))
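
To make the flag mapping explicit: Fire derives --indir and --mode from the lambda's keyword parameters and their defaults. A roughly equivalent sketch using the standard library's argparse is shown below. It is a swapped-in technique for illustration, not what the commit uses, and the sort_and_write stub and the main function are hypothetical stand-ins for the real helper in dumps_helper.py.

```python
import argparse


def sort_and_write(dataset, indir=None, mode="overwrite"):
    """Hypothetical stub standing in for dumps_helper.sort_and_write."""
    return {"dataset": dataset, "indir": indir, "mode": mode}


def main(argv=None):
    parser = argparse.ArgumentParser(description="Part 2: sort + repartition")
    # None lets the helper fall back to its own default input directory.
    parser.add_argument("--indir", default=None)
    # 'overwrite' rebuilds the output; 'append' adds a new layer.
    parser.add_argument("--mode", default="overwrite")
    args = parser.parse_args(argv)
    return sort_and_write("comments", indir=args.indir, mode=args.mode)
```

Calling main(["--indir=/path/to/parquets", "--mode=append"]) forwards both values to the helper, while main([]) reproduces the pre-commit behavior of calling it with defaults only.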