datasets/README: fix stale add_new_month references

After the rename to add_months.sh and addition of merge_layers.sh /
*_merge.py, the Hyak walkthrough section still pointed at the old
script names. Update the Step 2 inventory and the "for incremental
updates" aside to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;