Combine task lists and run a single parallel call so all 32 files
(16 comments + 16 submissions) parse simultaneously.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Python environment runs inside a Singularity container that cannot
exec the host's /usr/bin/zstd via subprocess. Replace the subprocess
call with the zstandard Python library, which was already a dependency.
Other formats (bz2, xz, gz) still use subprocess as before.
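
The split between in-process zstd decoding and subprocess-based decoding can be
sketched roughly as below. This is an illustration, not the repository's code:
the function names, the command table, and the max_window_size value are
assumptions; only the use of the zstandard library for .zst files and
subprocess for the other formats comes from the change itself.

```python
import io
import subprocess
from pathlib import Path

# Hypothetical mapping from file suffix to an external decompressor command.
# The actual commands used by the repository are not shown in this log.
SUBPROCESS_COMMANDS = {
    ".bz2": ["bzip2", "-dc"],
    ".xz": ["xz", "-dc"],
    ".gz": ["gzip", "-dc"],
}


def decompress_command(path):
    """Build the external command for a non-zstd dump file."""
    suffix = Path(path).suffix
    if suffix not in SUBPROCESS_COMMANDS:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")
    return SUBPROCESS_COMMANDS[suffix] + [str(path)]


def open_input_file(path):
    """Return a text stream over the decompressed contents of `path`."""
    if Path(path).suffix == ".zst":
        # Imported lazily so the module loads even where zstandard is absent.
        import zstandard

        fh = open(path, "rb")
        # Assumption: a large window is allowed because some dumps are
        # compressed with long-distance matching.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        return io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")

    # Other formats still go through an external process.
    proc = subprocess.Popen(decompress_command(path), stdout=subprocess.PIPE)
    return io.TextIOWrapper(proc.stdout, encoding="utf-8")
```

Decoding .zst in-process avoids any exec of a host binary from inside the
container, which is the failure this change works around.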
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
GNU parallel spawns fresh shells that don't inherit the active venv.
Using an explicit PYTHON path ensures the right interpreter is used in
parallel tasks. PYTHON defaults to python3 but can be overridden:
PYTHON=/path/to/venv/bin/python3 ./add_months.sh ...
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
add_months.sh now targets a single fat node directly: it starts a local
Spark cluster via start_spark_cluster.sh, submits the jobs, and stops the
cluster. No salloc needed.
add_months_multinode.sh is a new script for the multi-node case; it uses
start_spark_and_run.sh from a login node and takes NODES as its first
argument.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The script now exits after Part 2, so the copy and cleanup commands must
be run manually. This prevents the live datasets from being touched
without a deliberate verification step in between.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the mode='append' approach, which wrote directly to the live
datasets, with a safer staging workflow: Part 2 writes the new sorted layer
to temp staging directories,
the user verifies, then a separate copy step adds the files to the live
datasets. Live datasets are never touched until the copy step, and the
copy only adds files — nothing is deleted or overwritten.
- sort_and_write gains out_by_subreddit/out_by_author params (replaces
mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
and find-based copy step
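
The add-only guarantee of the copy step can be illustrated with a short Python
sketch. The script itself uses a find-based copy in shell; the function below
is a hypothetical equivalent (its name and signature are not from the
repository) that refuses to run if any destination file already exists.

```python
import shutil
from pathlib import Path


def copy_new_files(staging_dir, live_dir):
    """Copy every file under staging_dir into live_dir without overwriting.

    Raises FileExistsError before copying anything if a destination exists.
    """
    staging = Path(staging_dir)
    live = Path(live_dir)

    # First pass: plan the copies and check for collisions up front, so a
    # collision leaves the live dataset completely untouched.
    plan = []
    for src in sorted(staging.rglob("*")):
        if not src.is_file():
            continue
        dest = live / src.relative_to(staging)
        if dest.exists():
            raise FileExistsError(f"refusing to overwrite {dest}")
        plan.append((src, dest))

    # Second pass: perform the copies; existing live files are never modified.
    for src, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
    return [dest for _, dest in plan]
```

Because the staging files are copied rather than moved, the staging
directories remain available for inspection until the cleanup step.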
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.
- dumps_helper.py: sort_and_write gains indir/mode params; new
merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections
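
The idea of collapsing sorted layers and swapping the result in with an atomic
rename can be sketched as follows. This is a simplified stand-in, not the
repository's merge_layers: it uses newline-terminated, pre-sorted text files
instead of parquet and Spark, and the file naming (layer-*.txt, merged.txt) is
hypothetical.

```python
import heapq
import os
from pathlib import Path


def collapse_layers(partition_dir, merged_name="merged.txt"):
    """Merge sorted layer files in one partition into a single sorted file.

    Assumes each input file is already sorted and every line ends with a
    newline. The merged file only becomes visible via an atomic rename.
    """
    partition = Path(partition_dir)
    merged = partition / merged_name
    layers = sorted(partition.glob("layer-*.txt"))

    # Include the result of a previous collapse so no rows are lost.
    inputs = layers + ([merged] if merged.exists() else [])

    tmp_path = partition / (merged_name + ".tmp")
    handles = [open(p, "r", encoding="utf-8") for p in inputs]
    try:
        with open(tmp_path, "w", encoding="utf-8") as out:
            # heapq.merge streams the sorted inputs without loading them fully.
            out.writelines(heapq.merge(*handles))
    finally:
        for handle in handles:
            handle.close()

    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, merged)

    # Only after the swap are the now-redundant layers removed.
    for layer in layers:
        layer.unlink()
    return merged
```

Readers of the partition see either the old merged file or the new one, never
a partially written file.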
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:
- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
adapted to the new script names and dropping obsolete notes about
pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
section — TF-IDF math, PMI phrase detection, cosine similarity — with
MediaWiki math converted to markdown LaTeX and script references
updated to current paths.
The wiki page has been trimmed to a landing page that points at these
README files in gitea.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:
- dumps_helper.py — schemas, simdjson parser, a generic parse_record
loop with per-field handler dispatch, and parse_dump / gen_task_list
/ sort_and_write workers. The only per-type code is the field-handler
dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
for the Spark sort. pyspark is imported lazily inside sort_and_write
so Part 1 callers don't pay the import cost.
Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.
Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.
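
The row-length fix can be illustrated with a minimal sketch of a generic parse
loop with handler dispatch. It is not the code in dumps_helper.py: it uses the
standard json module instead of simdjson, and SOURCE_KEYS, the reduced FIELDS
list, HANDLERS, and the handler's return values are hypothetical. What it shows
is a handler returning a two-element tuple that the loop expands, and an error
path whose row length is derived from FIELDS so it always matches the schema.

```python
import json

# Hypothetical reduced column list; the real schemas have many more fields.
FIELDS = ["id", "author", "edited", "time_edited"]

# Keys read from each JSON record. "edited" expands into two output columns.
SOURCE_KEYS = ["id", "author", "edited"]


def _edited(value):
    """Split the raw 'edited' value into (edited, time_edited).

    Assumption: the raw value is either a boolean/None or an epoch timestamp.
    """
    if value is None or isinstance(value, bool):
        return (bool(value), None)
    return (True, float(value))


HANDLERS = {"edited": _edited}


def parse_record(line):
    """Parse one JSON line into a row with exactly len(FIELDS) elements."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        record = None
    if not isinstance(record, dict):
        # Error path: length is tied to FIELDS, so it cannot drift from it.
        return [None] * len(FIELDS)

    row = []
    for key in SOURCE_KEYS:
        handler = HANDLERS.get(key)
        value = handler(record.get(key)) if handler else record.get(key)
        if isinstance(value, tuple):
            row.extend(value)  # multi-column handlers are expanded in place
        else:
            row.append(value)
    return row
```

Deriving the error row from FIELDS rather than writing it out by hand is what
keeps good rows and error rows the same width.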
Runners and README updated for the new CLIs.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source rolling
of part files: each input zst now produces one parquet file, since Spark
handles big parquet files via internal row groups.
Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.
Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>