datasets/: replace add_new_month with layered append workflow
Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
14
datasets/comments_merge.py
Normal file
@@ -0,0 +1,14 @@
#!/usr/bin/env python3
"""Collapse all layers in the comments final datasets into a single clean layer.

Must be launched from a login node via the Hyak-provided wrapper:
    start_spark_and_run.sh 1 comments_merge.py

See merge_layers.sh and dumps_helper.merge_layers for details.
"""

from dumps_helper import COMMENTS, merge_layers


if __name__ == "__main__":
    merge_layers(COMMENTS)