
datasets/README: fix stale add_new_month references

After the rename to add_months.sh and addition of merge_layers.sh /
*_merge.py, the Hyak walkthrough section still pointed at the old script
names. Update the Step 2 inventory and the "for incremental updates"
aside to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:24:38 -07:00
parent 0ea57b2377
commit 2390d2d10c


@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;