datasets/: stage new layer before touching live datasets in add_months
Replace the direct mode='append' write to the live datasets with a safer staging workflow: Part 2 writes the new sorted layer to temporary staging directories, the user verifies it, and a separate copy step then adds the files to the live datasets. Live datasets are never touched until the copy step, and the copy only adds files; nothing is deleted or overwritten.

- sort_and_write gains out_by_subreddit/out_by_author params (replacing the mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose the new params via Fire
- add_months.sh: rewritten with explicit staging dirs, a verify checkpoint, and a find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
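The add-only copy step described above can be sketched in Python. This is a minimal illustration and not the code in add_months.sh, which the message describes as find-based shell. The function name copy_new_files and the directory layout used in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def copy_new_files(staging_dir, live_dir):
    """Copy every file under staging_dir into live_dir without overwriting.

    Hypothetical helper: it mirrors the "only adds files" rule from the
    commit message, not the actual shell logic in add_months.sh.
    """
    staging = Path(staging_dir)
    live = Path(live_dir)
    sources = sorted(p for p in staging.rglob("*") if p.is_file())

    # First pass: abort before copying anything if a destination file
    # already exists, so the live dataset is left untouched on a collision.
    collisions = [p for p in sources if (live / p.relative_to(staging)).exists()]
    if collisions:
        raise FileExistsError(
            f"{len(collisions)} file(s) already exist under {live}"
        )

    # Second pass: copy each staged file to the same relative path in live.
    copied = []
    for src in sources:
        dest = live / src.relative_to(staging)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        copied.append(dest)
    return copied
```

Calling copy_new_files("/path/to/staging", "/path/to/live") a second time with the same staging directory raises FileExistsError instead of overwriting. Checking for collisions before the first copy keeps a failed run from leaving the live dataset half-updated.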
@@ -6,8 +6,8 @@ Must be launched from a login node via the Hyak-provided wrapper:
 start_spark_and_run.sh 1 comments_part2.py --indir=/path/to/parquets --mode=append
 
 --indir defaults to the temp comments dir in dumps_helper.py.
---mode defaults to 'overwrite'; use 'append' to add a new layer without
-touching existing partition files (see add_months.sh).
+--out_by_subreddit and --out_by_author default to the live dataset paths;
+override them to write to staging directories first (see add_months.sh).
 """
 
 import fire
@@ -15,4 +15,7 @@ from dumps_helper import COMMENTS, sort_and_write
 
 
 if __name__ == "__main__":
-    fire.Fire(lambda indir=None, mode='overwrite': sort_and_write(COMMENTS, indir=indir, mode=mode))
+    fire.Fire(lambda indir=None, out_by_subreddit=None, out_by_author=None:
+              sort_and_write(COMMENTS, indir=indir,
+                             out_by_subreddit=out_by_subreddit,
+                             out_by_author=out_by_author))
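The docstring says the two new parameters default to the live dataset paths when omitted. One way such a fallback could work is sketched below. The body of sort_and_write is not part of this diff, so resolve_output_paths and both path constants are hypothetical stand-ins, not names from dumps_helper.py.

```python
# Hypothetical stand-ins for the live dataset locations; the real paths are
# defined in dumps_helper.py and are not shown in this diff.
LIVE_BY_SUBREDDIT = "/hypothetical/live/comments_by_subreddit"
LIVE_BY_AUTHOR = "/hypothetical/live/comments_by_author"


def resolve_output_paths(out_by_subreddit=None, out_by_author=None):
    """Return (by_subreddit, by_author) output paths, defaulting to live.

    A None argument means "use the live dataset path", which matches the
    None defaults exposed through Fire in the hunk above.
    """
    by_subreddit = LIVE_BY_SUBREDDIT if out_by_subreddit is None else out_by_subreddit
    by_author = LIVE_BY_AUTHOR if out_by_author is None else out_by_author
    return by_subreddit, by_author
```

With this shape, passing only --out_by_subreddit would stage one output while the other still targets the live path, so a staging run would need to set both parameters.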