datasets/README: fix stale add_new_month references

After the rename to add_months.sh and addition of merge_layers.sh / *_merge.py, the Hyak walkthrough section still pointed at the old script names. Update the Step 2 inventory and the "for incremental updates" aside to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:24:38 -07:00
parent 0ea57b2377
commit 2390d2d10c
1 changed file with 4 additions and 3 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.

 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts

 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;