datasets/README: fix stale add_new_month references
After the rename to add_months.sh and addition of merge_layers.sh / *_merge.py, the Hyak walkthrough section still pointed at the old script names. Update the Step 2 inventory and the "for incremental updates" aside to match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;
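A small check along the following lines could flag this kind of documentation drift after a script rename. It is only a sketch and not part of the repository: the function name `find_stale_references` and the `RENAMED` mapping are hypothetical, and the mapping holds just the one rename this commit describes.

```python
# Hypothetical helper, not part of the repo: scan documentation text for
# script names that have since been renamed.

# Old name -> current name. Only the rename mentioned in this commit.
RENAMED = {"add_new_month.sh": "add_months.sh"}


def find_stale_references(text, renamed=RENAMED):
    """Return (line_number, old_name, new_name) for every stale mention."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for old, new in renamed.items():
            if old in line:
                hits.append((lineno, old, new))
    return hits
```

Run over the README contents, a non-empty result would list each line that still mentions an old script name, together with the name it should probably use instead.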