datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append strategy for incremental dataset updates. Each incremental run appends new sorted partition files alongside existing ones rather than re-sorting the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
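The message above describes `merge_layers` as collapsing layers "via atomic rename". A minimal sketch of that write-then-rename pattern follows, using only the Python standard library. It is not the code in `dumps_helper.py`: `swap_in_merged`, the `write_merged` callback, and the `.old` suffix are hypothetical names, and only the `.merging` suffix comes from the README text in this commit. The sketch assumes each dataset is a directory. POSIX `rename` cannot replace a non-empty directory, so it uses two renames, which leaves a brief window in which the final path is absent.

```python
import os
import shutil
from pathlib import Path


def swap_in_merged(final_dir, write_merged):
    """Build a collapsed copy next to final_dir, then swap it in by rename.

    write_merged(src, dst) is a hypothetical callback that reads every
    layer under src and writes a single clean layer into dst.
    """
    final_dir = Path(final_dir)
    merging = final_dir.with_name(final_dir.name + ".merging")
    retired = final_dir.with_name(final_dir.name + ".old")  # hypothetical suffix

    # Recover from a run that was interrupted between the two renames below.
    if retired.exists() and not final_dir.exists():
        os.rename(retired, final_dir)

    # Discard partial output left behind by an earlier attempt.
    for leftover in (merging, retired):
        if leftover.exists():
            shutil.rmtree(leftover)

    merging.mkdir()
    write_merged(final_dir, merging)  # the original dataset is untouched so far

    # Swap: move the original aside, move the merged copy into place.
    os.rename(final_dir, retired)
    os.rename(merging, final_dir)
    shutil.rmtree(retired)
```

Because all writing happens under the `.merging` path, a failure before the swap leaves the original dataset readable, which matches the safety property the README states for `merge_layers.sh`.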
@@ -43,34 +43,71 @@ The final datasets are in `/gscratch/comdata/output`:
 | Script | Role |
 |---|---|
 | `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
-| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. |
-| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
+| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads a directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Accepts `--indir` and `--mode` to support layered appends; defaults match the build-from-scratch workflow. |
+| `comments_merge.py`, `submissions_merge.py` | Merge entry points. Each is a Spark job that collapses all accumulated layers in the final datasets into a single clean layer. Launched via `start_spark_and_run.sh`. |
+| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` / `merge_layers` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
 | `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |

-## The two workflows
-
-There are two ways to run the pipeline; pick the one that matches your
-situation.
+## The three workflows

 ### Build from scratch — `build_from_scratch.sh`

 Use this when there is no existing parquet output, or when the upstream
 data has changed in a way that requires reparsing everything. Wipes the
 per-source temp directories, processes every `RC_*` / `RS_*` dump in the
-raw dumps directory through Part 1, then runs the Part 2 Spark sort.
+raw dumps directory through Part 1 (in parallel via GNU parallel), then
+runs the Part 2 Spark sort.

-### Add a new month — `add_new_month.sh YYYY-MM`
+### Add new months — `add_months.sh YYYY-MM [YYYY-MM ...]`

-Use this when one or more months of new dump files have arrived and you
-just want to bring the existing datasets up to date. Processes only the
-specified month's `RC_<MONTH>.zst` and `RS_<MONTH>.zst` files through
-Part 1 (the existing per-source parquet files are left in place), then
-re-runs the Part 2 Spark sort over the full temp directory so the final
-datasets pick up the new data.
+> **NOTE: written but not yet tested. Remove this notice after a
+> successful end-to-end run.**

-The Part 2 sort is global and not incremental, so each monthly add
-re-sorts the entire corpus. That's fine for a monthly cadence; it would
-need a rearchitecture if the cost became a problem.
+Use this for routine incremental updates. Runs Part 1 on only the
+specified months, then appends the sorted output as a new layer of
+partition files alongside the existing ones. No existing data is
+rewritten.
+
+Each run adds one layer to each final dataset directory. Spark and DuckDB
+read all layers together correctly. At a yearly update cadence the number
+of layers stays small; use `merge_layers.sh` to collapse them when
+needed.
+
+The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
+`SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
+via environment variables if the files are not in the standard locations:
+
+```sh
+COMMENTS_DUMPDIR=/path/to/new/comments \
+SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
+./add_months.sh 2025-01 2025-02 2025-03
+```
+
+Part 1 runs directly on a compute node. For Part 2 there are two options:
+
+- **Single fat node** (simpler, often faster for smaller sorts): `salloc`
+a `cpu-g2` node (128 cores, ~1 TB RAM) and run the Part 2 script
+directly with `spark-submit` or `python3`. See Step 6 of the walkthrough
+below for the `salloc` invocation.
+- **Multi-node Spark cluster**: use `start_spark_and_run.sh` from a login
+node. It allocates nodes via `salloc` and handles cluster coordination.
+Pass the number of nodes as the first argument.
+
+### Merge layers — `merge_layers.sh`
+
+> **NOTE: written but not yet tested. Remove this notice after a
+> successful end-to-end run.**
+
+Use this to collapse accumulated layers from incremental adds into a
+single clean layer. Reads the existing final datasets, re-sorts
+everything, writes to `.merging` temp paths, then atomically replaces the
+originals via rename.
+
+Run this when query performance has degraded due to many layers, or any
+time you want a clean single-file-per-partition layout. The existing
+datasets are safe until the rename step completes; see `merge_layers.sh`
+for recovery notes if interrupted. As with `add_months.sh`, Part 2 can
+run on a single fat node or via `start_spark_and_run.sh`.

 ## Running steps individually

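The README text in this diff says `add_months.sh` takes one or more `YYYY-MM` arguments and looks for the matching `.zst` dumps under `COMMENTS_DUMPDIR` and `SUBMISSIONS_DUMPDIR`, with environment variables overriding the defaults. The lookup could be sketched in Python roughly as below. This is an illustration, not the script's code: `dump_paths` and the two `DEFAULT_*` constants are hypothetical, the placeholder defaults stand in for the real ones in `dumps_helper.py` (not shown in this diff), and the `RC_<MONTH>.zst` / `RS_<MONTH>.zst` naming is taken from the README wording that this commit removes.

```python
import os
import re
from pathlib import Path

# Placeholder defaults for illustration only; the real defaults live in
# dumps_helper.py and are not shown in this diff.
DEFAULT_COMMENTS_DUMPDIR = "/path/to/comments"
DEFAULT_SUBMISSIONS_DUMPDIR = "/path/to/submissions"

MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_paths(months, env=None):
    """Return one (comments_dump, submissions_dump) path pair per month."""
    env = os.environ if env is None else env
    comments_dir = Path(env.get("COMMENTS_DUMPDIR", DEFAULT_COMMENTS_DUMPDIR))
    submissions_dir = Path(
        env.get("SUBMISSIONS_DUMPDIR", DEFAULT_SUBMISSIONS_DUMPDIR)
    )

    pairs = []
    for month in months:
        # Reject anything that is not a well-formed YYYY-MM argument.
        if not MONTH_RE.fullmatch(month):
            raise ValueError(f"expected YYYY-MM, got {month!r}")
        pairs.append(
            (comments_dir / f"RC_{month}.zst", submissions_dir / f"RS_{month}.zst")
        )
    return pairs
```

Calling `dump_paths(["2025-01", "2025-02"])` with both variables set mirrors the `./add_months.sh 2025-01 2025-02` invocation shown in the README; checking that every returned path exists before starting Part 1 would catch a mistyped month or directory early.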