2390d2d10c datasets/README: fix stale add_new_month references
After the rename to add_months.sh and addition of merge_layers.sh /
*_merge.py, the Hyak walkthrough section still pointed at the old script
names. Update the Step 2 inventory and the "for incremental updates"
aside to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 19:24:38 -07:00


@@ -172,8 +172,8 @@ launch convention.
This walkthrough describes the process we went through updating Reddit
data from the PushShift cutoff up to the end of 2024. Adapting it for
newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
for the bulk-refresh case.
### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
- `helper.py` — file-open helpers
- `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
- `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
The Spark wrapper scripts (`start_spark_and_run.sh`,
`start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;