
datasets/: replace add_new_month with layered append workflow

Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (a full re-sort on every add duplicates
  build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:59:36 -07:00
parent 1851132a06
commit 2d1d760142
10 changed files with 273 additions and 85 deletions


@@ -43,34 +43,71 @@ The final datasets are in `/gscratch/comdata/output`:
| Script | Role |
|---|---|
| `comments_part1.py`, `submissions_part1.py` | Part 1 entry points. Each parses one compressed dump into one parquet file. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads the directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Launched via `start_spark_and_run.sh`. |
| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
| `comments_part2.py`, `submissions_part2.py` | Part 2 entry points. Each is a Spark job that reads a directory of per-source parquets and writes the final `*_by_subreddit.parquet` and `*_by_author.parquet` datasets. Accepts `--indir` and `--mode` to support layered appends; defaults match the build-from-scratch workflow. |
| `comments_merge.py`, `submissions_merge.py` | Merge entry points. Each is a Spark job that collapses all accumulated layers in the final datasets into a single clean layer. Launched via `start_spark_and_run.sh`. |
| `dumps_helper.py` | Shared module. Schemas, the simdjson parser, a generic parse loop with per-field handler dispatch, and the `parse_dump` / `gen_task_list` / `sort_and_write` / `merge_layers` workers that the entry-point scripts wrap. Adding a new dump type or a new field is a one-place edit. |
| `helper.py` | Lower-level helpers for opening compressed dump files (`.zst`, `.xz`, `.bz2`, `.gz`). |
## The two workflows
There are two ways to run the pipeline; pick the one that matches your
situation.
## The three workflows
### Build from scratch — `build_from_scratch.sh`
Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every `RC_*` / `RS_*` dump in the
raw dumps directory through Part 1, then runs the Part 2 Spark sort.
raw dumps directory through Part 1 (in parallel via GNU parallel), then
runs the Part 2 Spark sort.
### Add a new month — `add_new_month.sh YYYY-MM`
### Add new months — `add_months.sh YYYY-MM [YYYY-MM ...]`
Use this when one or more months of new dump files have arrived and you
just want to bring the existing datasets up to date. Processes only the
specified month's `RC_<MONTH>.zst` and `RS_<MONTH>.zst` files through
Part 1 (the existing per-source parquet files are left in place), then
re-runs the Part 2 Spark sort over the full temp directory so the final
datasets pick up the new data.
> **NOTE: written but not yet tested. Remove this notice after a
> successful end-to-end run.**
The Part 2 sort is global and not incremental, so each monthly add
re-sorts the entire corpus. That's fine for a monthly cadence; it would
need a rearchitecture if the cost became a problem.
Use this for routine incremental updates. Runs Part 1 on only the
specified months, then appends the sorted output as a new layer of
partition files alongside the existing ones. No existing data is
rewritten.
Each run adds one layer to each final dataset directory. Spark and DuckDB
read all layers together correctly. At a yearly update cadence the number
of layers stays small; use `merge_layers.sh` to collapse them when
needed.
The new `.zst` dump files must be accessible at `COMMENTS_DUMPDIR` and
`SUBMISSIONS_DUMPDIR`. Override the defaults (which match `dumps_helper.py`)
via environment variables if the files are not in the standard locations:
```sh
COMMENTS_DUMPDIR=/path/to/new/comments \
SUBMISSIONS_DUMPDIR=/path/to/new/submissions \
./add_months.sh 2025-01 2025-02 2025-03
```
Part 1 runs directly on a compute node. For Part 2 there are two options:
- **Single fat node** (simpler, often faster for smaller sorts): `salloc`
a `cpu-g2` node (128 cores, ~1 TB RAM) and run the Part 2 script
directly with `spark-submit` or `python3`. See Step 6 of the walkthrough
below for the `salloc` invocation.
- **Multi-node Spark cluster**: use `start_spark_and_run.sh` from a login
node. It allocates nodes via `salloc` and handles cluster coordination.
Pass the number of nodes as the first argument.
### Merge layers — `merge_layers.sh`
> **NOTE: written but not yet tested. Remove this notice after a
> successful end-to-end run.**
Use this to collapse accumulated layers from incremental adds into a
single clean layer. Reads the existing final datasets, re-sorts
everything, writes to `.merging` temp paths, then atomically replaces the
originals via rename.
Run this when query performance has degraded due to many layers, or any
time you want a clean single-file-per-partition layout. The existing
datasets are safe until the rename step completes; see `merge_layers.sh`
for recovery notes if interrupted. As with `add_months.sh`, Part 2 can
run on a single fat node or via `start_spark_and_run.sh`.
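The write-then-rename swap described above can be sketched as follows. This is a minimal illustration, not the code in `dumps_helper.py`: the function name `swap_in_merged` and the `.old` backup suffix are hypothetical, and only the `.merging` suffix comes from this README. It assumes both paths are on the same filesystem, where each `os.rename` is atomic. POSIX will not rename a directory over a non-empty one, so the sketch moves the original aside first, which leaves a brief window between the two renames.

```python
import os
import shutil
import tempfile


def swap_in_merged(final_path: str, merging_path: str) -> None:
    """Replace the dataset directory at final_path with merging_path.

    Hypothetical helper: the real merge logic may differ.
    """
    backup_path = final_path + ".old"  # assumed suffix, not from the repo
    if not os.path.isdir(merging_path):
        raise FileNotFoundError(merging_path)
    if os.path.exists(backup_path):
        # A leftover backup suggests an earlier interrupted run; stop
        # rather than guess which copy is the good one.
        raise FileExistsError(backup_path)
    os.rename(final_path, backup_path)   # move the layered dataset aside
    os.rename(merging_path, final_path)  # promote the merged dataset
    shutil.rmtree(backup_path)           # discard old layers only after success


if __name__ == "__main__":
    # Demo on empty placeholder files in a throwaway directory.
    with tempfile.TemporaryDirectory() as root:
        final = os.path.join(root, "comments_by_author.parquet")
        merging = final + ".merging"
        os.makedirs(final)
        os.makedirs(merging)
        for name in ("layer0.parquet", "layer1.parquet"):
            open(os.path.join(final, name), "w").close()
        open(os.path.join(merging, "merged.parquet"), "w").close()
        swap_in_merged(final, merging)
        print(sorted(os.listdir(final)))
```

Deleting the backup last means an interruption at any point leaves a complete copy of the data on disk, either under the original name or under the backup name.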
## Running steps individually