From 2390d2d10c91ece4c298c87ea0668539ca5708b3 Mon Sep 17 00:00:00 2001
From: Benjamin Mako Hill <mako@atdot.cc>
Date: Mon, 25 May 2026 19:24:38 -0700
Subject: [PATCH] datasets/README: fix stale add_new_month references

After add_new_month.sh was renamed to add_months.sh, and
merge_layers.sh and *_merge.py were added, the Hyak walkthrough section
still referenced the old script name and did not list the merge scripts.
Update the Step 2 inventory and the "for incremental updates" aside to
match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
---
 datasets/README.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/datasets/README.md b/datasets/README.md
index ac37c79..7868cb9 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;