Without --clean, the script now exits with a clear error if temp or
staging directories from a previous run exist. Pass --clean to remove
them automatically before starting. README example updated to include
the flag.
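
A minimal sketch of the guard described above, not the script's actual code. The function name check_stale_dirs, the CLEAN variable, and the directory names are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: check_stale_dirs and the directory names are hypothetical.
set -eu

# check_stale_dirs CLEAN DIR...
#   CLEAN=1: remove any DIR that exists.
#   CLEAN=0: report each existing DIR and return non-zero.
check_stale_dirs() {
    clean="$1"
    shift
    stale=0
    for dir in "$@"; do
        [ -e "$dir" ] || continue
        if [ "$clean" -eq 1 ]; then
            echo "Removing leftover directory: $dir" >&2
            rm -rf -- "$dir"
        else
            echo "Error: $dir exists from a previous run; pass --clean to remove it." >&2
            stale=1
        fi
    done
    return "$stale"
}

# Parse the flag (only --clean is handled in this sketch).
CLEAN=0
for arg in "$@"; do
    case "$arg" in
        --clean) CLEAN=1 ;;
    esac
done

# Hypothetical temp and staging locations under a scratch directory.
WORK_DIR="${TMPDIR:-/tmp}/add_months_demo.$$"
check_stale_dirs "$CLEAN" "$WORK_DIR/temp" "$WORK_DIR/staging" || exit 1
```

Reporting every leftover directory before failing, rather than stopping at the first, lets the user see the full state in one run.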
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Combine task lists and run a single parallel call so all 32 files
(16 comments + 16 submissions) parse simultaneously.
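
One way the combined call could look, as a sketch under assumptions: emit_tasks, the task file, and the echo placeholder command are hypothetical, and the real per-file parse command is not shown here. It relies on GNU parallel running each line of stdin as a command when no command is given.

```shell
#!/bin/sh
# Sketch only: file names and the per-file command are hypothetical placeholders.
set -eu

TASK_FILE="$(mktemp)"
trap 'rm -f "$TASK_FILE"' EXIT

# emit_tasks KIND COUNT: print one command line per input file.
emit_tasks() {
    kind="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        # "echo parsing ..." stands in for the real per-file parse command.
        printf 'echo parsing %s file %d\n' "$kind" "$i"
        i=$((i + 1))
    done
}

# One combined list: 16 comment tasks followed by 16 submission tasks.
{
    emit_tasks comments 16
    emit_tasks submissions 16
} > "$TASK_FILE"

# Run every task in one invocation with 32 job slots, if GNU parallel is available.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    parallel --jobs 32 < "$TASK_FILE"
else
    echo "GNU parallel not found; skipping execution" >&2
fi
```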
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GNU parallel spawns fresh shells that don't inherit the active venv.
Using an explicit PYTHON path ensures the right interpreter is used in
parallel tasks. Defaults to python3 but can be overridden:
    PYTHON=/path/to/venv/bin/python3 ./add_months.sh ...
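
The default-with-override pattern can be sketched as follows. resolve_python and parse_dump.py are hypothetical names used only for illustration; the script's actual variable handling may differ.

```shell
#!/bin/sh
# Sketch only: resolve_python and parse_dump.py are hypothetical names.
set -eu

# Print the interpreter to use: $PYTHON if set and non-empty, else python3.
resolve_python() {
    printf '%s\n' "${PYTHON:-python3}"
}

PYTHON="$(resolve_python)"
export PYTHON
echo "Using interpreter: $PYTHON"

# Each task would then name the interpreter explicitly, for example:
#   parallel "$PYTHON" parse_dump.py ::: file1 file2
```

Because "$PYTHON" is expanded by the calling shell before parallel starts, the tasks do not depend on what the spawned shells find on their own PATH.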
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
add_months.sh now targets a single fat node directly: it starts a local
Spark cluster via start_spark_cluster.sh, submits the jobs, and stops the
cluster. No salloc is needed.
add_months_multinode.sh is a new script for the multi-node case that uses
start_spark_and_run.sh from a login node. It takes NODES as its first
argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The script now exits after Part 2, so the copy and cleanup commands must
be run manually. This prevents the live datasets from being touched
without a deliberate verification step in between.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the previous approach of appending directly to the live datasets
(mode='append') with a safer staging workflow: Part 2 writes the new
sorted layer to temp staging directories, the user verifies it, then a
separate copy step adds the files to the live datasets. Live datasets are
never touched until the copy step, and the copy only adds files; nothing
is deleted or overwritten.
- sort_and_write gains out_by_subreddit/out_by_author params (replaces
mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
and find-based copy step
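
A sketch of an add-only, find-based copy in the spirit of the step described above. copy_new_files and the paths in the usage line are hypothetical, and skipping files that already exist is an assumption about how "nothing is overwritten" might be enforced.

```shell
#!/bin/sh
# Sketch only: copy_new_files and the directory layout are hypothetical.
set -eu

# copy_new_files STAGING LIVE
# Copy every regular file under STAGING to the same relative path under LIVE.
# Existing files in LIVE are skipped, and nothing in either tree is deleted.
copy_new_files() {
    staging="$1"
    live="$(cd "$2" && pwd)"   # absolute path, since find runs from inside STAGING
    (
        cd "$staging"
        find . -type f -exec sh -c '
            live="$1"
            shift
            for rel in "$@"; do
                target="$live/${rel#./}"
                if [ -e "$target" ]; then
                    echo "Skipping existing file: $target" >&2
                    continue
                fi
                mkdir -p "$(dirname "$target")"
                cp "$rel" "$target"
            done
        ' sh "$live" {} +
    )
}
```

Usage would look like `copy_new_files /path/to/staging/by_subreddit /path/to/live/by_subreddit`, run once per dataset after the staged output has been verified.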
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.
- dumps_helper.py: sort_and_write gains indir/mode params; new
merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (a full re-sort on each add is redundant with
build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>