From 2390d2d10c91ece4c298c87ea0668539ca5708b3 Mon Sep 17 00:00:00 2001
From: Benjamin Mako Hill
Date: Mon, 25 May 2026 19:24:38 -0700
Subject: [PATCH] datasets/README: fix stale add_new_month references

After the rename to add_months.sh and addition of merge_layers.sh /
*_merge.py, the Hyak walkthrough section still pointed at the old
script names. Update the Step 2 inventory and the "for incremental
updates" aside to match.

Co-Authored-By: Claude Sonnet 4.6
---
 datasets/README.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/datasets/README.md b/datasets/README.md
index ac37c79..7868cb9 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;