After the rename to add_months.sh and addition of merge_layers.sh /
*_merge.py, the Hyak walkthrough section still pointed at the old script
names. Update the Step 2 inventory and the "for incremental updates"
aside to match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Without --clean, the script now exits with a clear error if temp or
staging directories from a previous run exist. Pass --clean to remove
them automatically before starting. README example updated to include
the flag.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update the add_months and Step 6 sections with lessons learned from the
first run attempt:
- Replace salloc with srun (releases node automatically on completion)
- Document the PYTHON variable override needed for parallel/venv
- Note that .zst decompression uses the zstandard library due to
Singularity container restrictions on the system zstd binary
- Add full srun invocation with bash -l, tee logging, and tmux guidance
- Update Step 6 walkthrough to use srun instead of salloc
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.
- dumps_helper.py: sort_and_write gains indir/mode params; new
merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:
- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
adapted to the new script names and dropping obsolete notes about
pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
section — TF-IDF math, PMI phrase detection, cosine similarity — with
MediaWiki math converted to markdown LaTeX and script references
updated to current paths.
The wiki page has been trimmed to a landing page that points at these
README files in gitea.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:
- dumps_helper.py — schemas, simdjson parser, a generic parse_record
loop with per-field handler dispatch, and parse_dump / gen_task_list
/ sort_and_write workers. The only per-type code is the field-handler
dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
for the Spark sort. pyspark is imported lazily inside sort_and_write
so Part 1 callers don't pay the import cost.
Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.
Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.
Runners and README updated for the new CLIs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source part
rolling; one parquet per input zst, since Spark handles big parquet
files via internal row groups.
Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.
Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>