datasets/: stage new layer before touching live datasets in add_months

Replace the mode='append' approach, which wrote directly to the live
datasets, with a safer staging workflow: Part 2 writes the new sorted
layer to temporary staging directories, the user verifies the output, and
a separate copy step then adds the files to the live datasets. The live
datasets are never touched until the copy step, and the copy only adds
files; nothing is deleted or overwritten.
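
The add-only guarantee of the copy step can be illustrated with a short
Python sketch. This is not the find-based step in add_months.sh; the
function name copy_new_files and the rule for skipping Spark bookkeeping
files (_SUCCESS, .crc) are assumptions made for the example.

```python
import shutil
from pathlib import Path


def copy_new_files(staging_dir, live_dir):
    """Copy staged files into live_dir; refuse to overwrite anything."""
    staging_dir, live_dir = Path(staging_dir), Path(live_dir)

    # Assumption: skip Spark/Hadoop bookkeeping files such as _SUCCESS
    # and .*.crc, which would otherwise collide with the live copies.
    staged = sorted(
        p for p in staging_dir.rglob("*")
        if p.is_file() and not p.name.startswith(("_", "."))
    )

    # Check for collisions first so a conflict aborts before any copy.
    collisions = [
        p for p in staged
        if (live_dir / p.relative_to(staging_dir)).exists()
    ]
    if collisions:
        raise FileExistsError(
            f"{len(collisions)} staged file(s) already exist in {live_dir}"
        )

    copied = []
    for src in staged:
        dst = live_dir / src.relative_to(staging_dir)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(dst)
    return copied
```

Because the collision check runs before the first copy, a rerun after a
successful copy fails without modifying the live directory.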

- sort_and_write gains out_by_subreddit/out_by_author params (replaces
  mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
  and find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:17:38 -07:00
parent 2d1d760142
commit 6b18840604
4 changed files with 89 additions and 33 deletions

@@ -6,8 +6,8 @@ Must be launched from a login node via the Hyak-provided wrapper:
     start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets --mode=append
 --indir defaults to the temp comments dir in dumps_helper.py.
---mode defaults to 'overwrite'; use 'append' to add a new layer without
-touching existing partition files (see add_months.sh).
+--out_by_subreddit and --out_by_author default to the live dataset paths;
+override them to write to staging directories first (see add_months.sh).
 """
 import fire
@@ -15,4 +15,7 @@ from dumps_helper import COMMENTS, sort_and_write
 if __name__ == "__main__":
-    fire.Fire(lambda indir=None, mode='overwrite': sort_and_write(COMMENTS, indir=indir, mode=mode))
+    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
+              sort_and_write(COMMENTS, indir=indir,
+                             out_by_subreddit=out_by_subreddit,
+                             out_by_author=out_by_author))
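
The wrapper passes None for any output path the caller leaves unset, and
the docstring says both paths default to the live datasets. A minimal
sketch of that defaulting convention follows. The constants and the helper
name resolve_outputs are hypothetical; the real paths and the logic inside
sort_and_write live in dumps_helper.py and are not shown in this diff.

```python
# Hypothetical stand-ins for the live dataset paths in dumps_helper.py.
LIVE_BY_SUBREDDIT = "/live/comments_by_subreddit.parquet"
LIVE_BY_AUTHOR = "/live/comments_by_author.parquet"


def resolve_outputs(out_by_subreddit=None, out_by_author=None):
    """Return (by_subreddit, by_author) paths, defaulting to live."""
    return (
        out_by_subreddit if out_by_subreddit is not None else LIVE_BY_SUBREDDIT,
        out_by_author if out_by_author is not None else LIVE_BY_AUTHOR,
    )
```

Under this convention a staging run overrides both paths, while a call
with no overrides writes to the live locations.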