Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates.

Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
33 lines
1.3 KiB
Bash
Executable File
#!/usr/bin/env bash
#
# Collapse all accumulated layers in the final parquet datasets into a
# single clean layer. Use this after several incremental adds via
# add_months.sh when you want to reduce the number of partition files.
#
# Reads the existing by_subreddit / by_author datasets, re-sorts everything,
# writes to temp paths, then atomically replaces the originals via rename.
# The old directories are removed once the new ones are in place.
#
# If the process is interrupted after writing the .merging directories but
# before the renames complete, re-run; the .merging directories will be
# overwritten and the originals are still intact. If interrupted after the
# renames, the .old directories are left behind; delete them manually once
# satisfied with the output.
#
# To add new months without merging, use add_months.sh.
# To rebuild everything from raw dumps, use build_from_scratch.sh.
#
# NOTE: This script and its workflow are written but not yet tested.
# Remove this notice after a successful end-to-end run.
#
# Every command below is independently runnable for debugging.

set -e
cd "$(dirname "$0")"

# merge and collapse comments layers
start_spark_and_run.sh 1 comments_merge.py

# merge and collapse submissions layers
start_spark_and_run.sh 1 submissions_merge.py
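The header comments describe a write-then-rename swap using .merging and .old directories. The following is a minimal sketch of that pattern, not the repository's actual merge_layers implementation in dumps_helper.py. It assumes the datasets live on a local POSIX filesystem and that the merged copy has already been written to `<dataset>.merging`. The function name `swap_in_merged` and the example path are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the swap described in the header comments.
# swap_in_merged is an illustrative name, not a function from this repo.
set -euo pipefail

swap_in_merged() {
    local final="$1"                 # dataset directory, e.g. a by_subreddit path
    local merging="${final}.merging" # fully written merged copy (assumed to exist)
    local old="${final}.old"         # temporary home for the previous layers

    # The merged copy must be complete before any rename happens.
    if [ ! -d "$merging" ]; then
        echo "missing $merging; nothing to swap" >&2
        return 1
    fi
    # Refuse to clobber leftovers from an earlier interrupted run.
    if [ -e "$old" ]; then
        echo "$old already exists; inspect and delete it first" >&2
        return 1
    fi

    mv "$final" "$old"       # originals are preserved under .old
    mv "$merging" "$final"   # merged layer takes over the original name
    rm -rf "$old"            # drop the old layers once the swap is done
}
```

A call such as `swap_in_merged /path/to/by_subreddit` would perform the swap for one dataset. Each `mv` is a single rename when source and target are on the same filesystem, but the two renames together are not one atomic step: there is a short window in which the original name does not exist, which is why an interruption at that point leaves a .old directory to clean up by hand.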