Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing; only the
schema and JSON parser differ between types. Drop the per-source part
rolling and write one parquet file per input zst, since Spark handles
large parquet files via internal row groups.
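As a rough illustration of the one-parquet-per-zst rule, the sketch
below maps an input path to its single output path. It is not the
code in the merged scripts: the function name output_path_for, the
directory names, and the RC_2023-01.zst example file name are all
assumptions made for the example.

```python
from pathlib import Path


def output_path_for(input_zst: Path, out_dir: Path) -> Path:
    """Map one compressed dump to exactly one parquet file.

    Hypothetical helper: there is no part rolling, so the output name
    is just the input name with its final extension swapped.
    """
    # with_suffix replaces only the trailing ".zst" with ".parquet".
    return out_dir / input_zst.with_suffix(".parquet").name


if __name__ == "__main__":
    # Hypothetical input and output locations.
    print(output_path_for(Path("in/RC_2023-01.zst"), Path("tmp/comments")))
```

Because the mapping is one-to-one, a rerun for a single input
overwrites a single output and never leaves stale part files behind.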
Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything;
add_new_month.sh takes a YYYY-MM argument and parses just that month
before re-running the Spark sort. Every step in the runners is a
separate command so individual stages can be copied out and run
standalone for debugging.
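The YYYY-MM argument contract could be checked along these lines.
This is only a sketch in Python, not what add_new_month.sh does; the
name parse_month and the choice to reject out-of-range months are
assumptions.

```python
import re

# Four-digit year, a dash, and a two-digit month from 01 to 12.
MONTH_RE = re.compile(r"(\d{4})-(0[1-9]|1[0-2])")


def parse_month(arg: str) -> tuple[int, int]:
    """Return (year, month) for a YYYY-MM string, or raise ValueError."""
    match = MONTH_RE.fullmatch(arg)
    if match is None:
        raise ValueError(f"expected YYYY-MM, got {arg!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_month("2024-03"))
```

Failing early on a malformed month avoids a run that silently
matches no input files and then re-sorts unchanged data.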
Also fix several latent bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pushshift's files.pushshift.io archive has been gone since Reddit cut
off third-party API access in 2023, so the dumps/ pull and SHA-check
scripts no longer work. The old/ directory of pre-refactor scripts was
likewise superseded by current versions in similarities/.
README rewritten to credit Nate as original developer, name current
maintainers, document the directory layout, point at the CDSC wiki for
the ArcticShift/torrent-based workflow, fix several stale script paths,
and fix the tf-normalization formula (normalize by max, not sum).
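To make the max-versus-sum distinction concrete, here is a minimal
sketch of max-normalized term frequency. It assumes plain division by
the largest raw count in the document; the README's exact formula
(for example, any smoothing constants) is not restated in this
message, and the function name max_normalized_tf is hypothetical.

```python
from collections import Counter


def max_normalized_tf(tokens):
    """Divide each term's raw count by the largest count in the document.

    Assumed form of max normalization; sum normalization would divide
    by the total number of tokens instead.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    peak = max(counts.values())
    return {term: n / peak for term, n in counts.items()}


if __name__ == "__main__":
    print(max_normalized_tf(["a", "a", "b"]))
```

Under this form the most frequent term always scores 1.0, whereas
sum normalization would give it its share of the total token count.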
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>