refactor datasets/ pipeline; add build/add-month workflows
Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing; only the schema and JSON parser differ between types. Drop the per-source part rolling: write one parquet per input zst, since Spark handles big parquet files via internal row groups.

Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything; add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging.

Also fixes several lurking bugs in the original code:

- the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2
- the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort
- references to a missing parse_submissions.sh in the old .sh runners
- the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
70
datasets/README.md
Normal file
@@ -0,0 +1,70 @@
# Reddit dumps → sorted parquet datasets

This directory holds the pipeline that turns compressed Reddit dump files
(`RC_YYYY-MM.zst` for comments, `RS_YYYY-MM.zst` for submissions) into the
sorted, repartitioned parquet datasets that the rest of the project
consumes.

The pipeline has two stages:

| Script | What it does |
|---|---|
| `parquet_part1.py` | Reads one compressed dump and writes one parquet file. Per-file, parallelizable. Runs without Spark. |
| `parquet_part2.py` | Reads the directory of per-file parquets in Spark, sorts and repartitions by subreddit, then by author, and writes the final `reddit_*_by_*.parquet` datasets. Always re-sorts the full corpus. |

Both scripts use a single fire CLI with `comments` and `submissions`
subcommands, so the comments and submissions paths share all of their
plumbing: only the schema and the JSON parser differ.
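
As a rough illustration of that split, the sketch below keeps the per-type pieces (a field list and a line parser) in one lookup table and sends both dump types through the same function. It uses only the standard library and is not taken from `parquet_part1.py`: the names `DUMP_TYPES`, `parse_comment`, `parse_submission`, and `parse_lines` are hypothetical, and the field lists are abbreviated placeholders rather than the real schemas.

```python
import json

# Hypothetical, abbreviated field lists; the real schemas are not shown here.
COMMENT_FIELDS = ["id", "subreddit", "author", "created_utc", "body"]
SUBMISSION_FIELDS = ["id", "subreddit", "author", "created_utc", "title"]


def parse_comment(line):
    """Decode one comment JSON line into a tuple ordered like COMMENT_FIELDS."""
    record = json.loads(line)
    # Missing keys become None so every row has the same shape.
    return tuple(record.get(field) for field in COMMENT_FIELDS)


def parse_submission(line):
    """Decode one submission JSON line into a tuple ordered like SUBMISSION_FIELDS."""
    record = json.loads(line)
    return tuple(record.get(field) for field in SUBMISSION_FIELDS)


# The only per-type pieces: a field list and a parser.
DUMP_TYPES = {
    "comments": (COMMENT_FIELDS, parse_comment),
    "submissions": (SUBMISSION_FIELDS, parse_submission),
}


def parse_lines(kind, lines):
    """Shared plumbing: look up the per-type pieces, then treat both types alike."""
    if kind not in DUMP_TYPES:
        raise ValueError(f"unknown dump type: {kind!r}")
    fields, parser = DUMP_TYPES[kind]
    rows = []
    for line in lines:
        if line.strip():  # skip blank lines
            rows.append(dict(zip(fields, parser(line))))
    return rows
```

In a layout like this, adding a third dump type would mean adding one entry to the lookup table rather than a new script.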

## The two workflows

There are two ways to run the pipeline; pick the one that matches your
situation.

### Build from scratch — `build_from_scratch.sh`

Use this when there is no existing parquet output, or when the upstream
data has changed in a way that requires reparsing everything. Wipes the
per-source temp directories, processes every `RC_*` / `RS_*` dump in the
raw dumps directory through Part 1, then runs the Part 2 Spark sort.
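
The "process every dump" step amounts to enumerating the two file families in the raw dumps directory. The runner itself is a shell script; the Python sketch below only illustrates the selection logic. The helper name `list_dumps` is hypothetical, and the `.zst` suffix is assumed from the naming convention given at the top of this README.

```python
from pathlib import Path


def list_dumps(dumpdir):
    """Return (comment_dumps, submission_dumps) found directly in dumpdir.

    Names embed YYYY-MM, so sorting by name also sorts by month.
    """
    root = Path(dumpdir)
    comments = sorted(root.glob("RC_*.zst"))
    submissions = sorted(root.glob("RS_*.zst"))
    return comments, submissions
```

Each returned path can then be handed to Part 1 independently, which is what makes that stage parallelizable per file.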

### Add a new month — `add_new_month.sh YYYY-MM`

Use this when one or more months of new dump files have arrived and you
just want to bring the existing datasets up to date. Processes only the
specified month's `RC_<MONTH>.zst` and `RS_<MONTH>.zst` files through
Part 1 (the existing per-source parquet files are left in place), then
re-runs the Part 2 Spark sort over the full temp directory so the final
datasets pick up the new data.

The Part 2 sort is global and not incremental, so each monthly add
re-sorts the entire corpus. That's fine for a monthly cadence; it would
need a rearchitecture if the cost became a problem.
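
The mapping from the `YYYY-MM` argument to the two dump file names can be sketched as follows. This is an illustration in Python, not the runner's actual shell code; `month_dumps` is a hypothetical helper, and the strict format check is an assumption about what a careful runner would do before touching any files.

```python
import re

# Four-digit year, a hyphen, and a month from 01 to 12.
MONTH_RE = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def month_dumps(month):
    """Return the (comments, submissions) dump file names for one YYYY-MM month."""
    if not MONTH_RE.fullmatch(month):
        raise ValueError(f"expected YYYY-MM, got {month!r}")
    return f"RC_{month}.zst", f"RS_{month}.zst"
```

Rejecting a malformed month up front avoids running the expensive full-corpus sort after a Part 1 step that matched no files.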

## Running steps individually

Both `.sh` runners are written so that every meaningful step is a separate,
self-contained command. If something fails partway through, or you want
to inspect intermediate state, you can copy any single line out of the
runner and execute it standalone. For example:

```sh
# parse one specific file (skipping the rest of the workflow)
python3 parquet_part1.py comments parse_dump RC_2025-03.zst

# override default dump/output paths from the CLI
python3 parquet_part1.py comments parse_dump RC_2025-03.zst \
    --dumpdir=/tmp/test --outdir=/tmp/out
```

The Spark Part 2 step is launched via `start_spark_and_run.sh` (a
Hyak-provided wrapper not included in this repo); see the wiki for the
launch convention.

## See also

The CDSC wiki page
[CommunityData:CDSC_Reddit](https://wiki.communitydata.science/CommunityData:CDSC_Reddit)
documents the surrounding workflow: where the raw dump files come from
(currently ArcticShift via Academic Torrents), how to stage them on
Hyak, and how to run Spark jobs on the cluster.