Follows the helper-module pattern used in similarities/. Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Reddit dumps → sorted parquet datasets

This directory holds the pipeline that turns compressed Reddit dump files
(`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the
sorted, repartitioned parquet datasets that the rest of the project
consumes.

The pipeline has two stages:

| Stage | What it does |
|---|---|
| Part 1 | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| Part 2 | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |

Each stage has a thin entry-point script per dump type:

| Script | Notes |
|---|---|
| `comments_part1.py`, `submissions_part1.py` | Per-file parse. `parse_dump <file>` and `gen_task_list` subcommands via fire. |
| `comments_part2.py`, `submissions_part2.py` | Spark sort. Launched via `start_spark_and_run.sh`. |
| `dumps_helper.py` | Shared module: schemas, simdjson parser, generic parse loop, `parse_dump` / `gen_task_list` / `sort_and_write` workers. The only per-type code is the two field-handler dicts and the configuration dicts at the top. |
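
To illustrate the field-handler idea behind `dumps_helper.py`, here is a minimal sketch of a generic parse loop driven by a handler dict. It is not the module's actual code: the field names, the `_simple` helper, and the `None`-filled error row are assumptions made for the example, and it uses the standard-library `json` module in place of simdjson so that it runs without extra dependencies.

```python
import json

# Hypothetical column list; the real FIELDS lists live in dumps_helper.py.
FIELDS = ["id", "author", "edited", "time_edited"]


def _simple(name):
    """Build a handler that copies one key straight from the record."""
    return lambda record: record.get(name)


def _edited(record):
    """Return an (edited, time_edited) pair for the two edit columns.

    Assumes `edited` is either a boolean or an edit timestamp.
    """
    value = record.get("edited")
    if value is None:
        return (False, None)
    if isinstance(value, bool):
        return (value, None)
    return (True, int(value))


# One handler per logical field. A handler may return a tuple, which the
# loop below expands into several columns.
HANDLERS = {
    "id": _simple("id"),
    "author": _simple("author"),
    "edited": _edited,
}


def parse_record(line):
    """Parse one JSON line into a row with exactly len(FIELDS) values."""
    try:
        record = json.loads(line)
    except ValueError:
        # Error path: keep the row as wide as the schema (assumed None-filled).
        return [None] * len(FIELDS)
    row = []
    for handler in HANDLERS.values():
        value = handler(record)
        if isinstance(value, tuple):
            row.extend(value)
        else:
            row.append(value)
    return row
```

With this shape, supporting a new field means adding one entry to `FIELDS` and one to `HANDLERS`, and the error path stays aligned with the column list because it is sized from `FIELDS`.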

## The two workflows

There are two ways to run the pipeline; pick the one that matches your
situation.

### Build from scratch: `build_from_scratch.sh`

Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every `RC_*` / `RS_*` dump in the
raw dumps directory through Part 1, then runs the Part 2 Spark sort.
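
As a rough sketch of the "every dump through Part 1" step, the function below writes one `parse_dump` command per matching dump file, which is the kind of list a task-list generator could produce. The function name `write_task_list` and its parameters are hypothetical and chosen for the example; the real `gen_task_list` subcommand may take different arguments and write a different format.

```python
from pathlib import Path


def write_task_list(dumpdir, prefix, script, tasklist):
    """Write one `parse_dump` command per dump file and return the count.

    dumpdir  -- directory holding the raw .zst dumps
    prefix   -- "RC" for comments or "RS" for submissions
    script   -- Part 1 entry point, e.g. "comments_part1.py"
    tasklist -- path of the text file to write
    """
    # Sort so the task order is stable from run to run.
    dumps = sorted(Path(dumpdir).glob(f"{prefix}_*.zst"))
    lines = [f"python3 {script} parse_dump {dump.name}" for dump in dumps]
    Path(tasklist).write_text("".join(line + "\n" for line in lines))
    return len(lines)
```

Because each line is a self-contained command, the resulting file can be handed to whatever parallel runner is available, which matches the per-file, parallelizable nature of Part 1.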

### Add a new month: `add_new_month.sh YYYY-MM`

Use this when one or more months of new dump files have arrived and you
just want to bring the existing datasets up to date. Processes only the
specified month's `RC_<MONTH>.zst` and `RS_<MONTH>.zst` files through
Part 1 (the existing per-source parquet files are left in place), then
re-runs the Part 2 Spark sort over the full temp directory so the final
datasets pick up the new data.

The Part 2 sort is global and not incremental, so each monthly add
re-sorts the entire corpus. That's fine for a monthly cadence; it would
need a rearchitecture if the cost became a problem.
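
The mapping from a `YYYY-MM` argument to the two dump files can be sketched as follows. The helper `dump_names_for_month` is hypothetical and only illustrates the naming convention described above; `add_new_month.sh` may validate its argument differently or not at all.

```python
import re

# Four-digit year, a dash, and a month from 01 to 12.
MONTH_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def dump_names_for_month(month):
    """Return the comments and submissions dump names for one month."""
    if not MONTH_PATTERN.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return [f"RC_{month}.zst", f"RS_{month}.zst"]
```

Checking the argument up front is a cheap way to avoid starting the full Part 2 re-sort after a typo in the month.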

## Running steps individually

Both `.sh` runners are written so that every meaningful step is a separate,
self-contained command. If something fails partway through, or you want
to inspect intermediate state, you can copy any single line out of the
runner and execute it standalone. For example:

```sh
# parse one specific file (skipping the rest of the workflow)
python3 comments_part1.py parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 comments_part1.py parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out

# regenerate just the task list
python3 submissions_part1.py gen_task_list
```

The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
Hyak-provided wrapper not included in this repo); see the wiki for the
launch convention.

## See also

The CDSC wiki page
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
documents the surrounding workflow: where the raw dump files come from
(currently ArcticShift via academic torrents), how to stage them on
Hyak, and how to run Spark jobs on the cluster.