diff --git a/datasets/README.md b/datasets/README.md
index ac37c79..7868cb9 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -172,8 +172,8 @@ launch convention.
 This walkthrough describes the process we went through updating Reddit
 data from the PushShift cutoff up to the end of 2024. Adapting it for
 newer data should just involve using different academic torrent files
-that start from 2025 onwards. For a single-month update, the
-`add_new_month.sh` workflow above is much shorter; this walkthrough is
+that start from 2025 onwards. For incremental updates, the
+`add_months.sh` workflow above is much shorter; this walkthrough is
 for the bulk-refresh case.
 
 ### Prerequisites
 
@@ -233,7 +233,8 @@ code lives entirely in `datasets/`:
 - `helper.py` — file-open helpers
 - `comments_part1.py`, `submissions_part1.py` — Part 1 entry points
 - `comments_part2.py`, `submissions_part2.py` — Part 2 entry points
-- `build_from_scratch.sh`, `add_new_month.sh` — the two runner scripts
+- `comments_merge.py`, `submissions_merge.py` — merge entry points
+- `build_from_scratch.sh`, `add_months.sh`, `merge_layers.sh` — the runner scripts
 
 The Spark wrapper scripts (`start_spark_and_run.sh`,
 `start_spark_cluster.sh`, `start_spark_worker.sh`) are not in this repo;