18
0
Commit Graph

10 Commits

Author SHA1 Message Date
bf6ccbc84a datasets/helper.py: use zstandard library for .zst decompression
The Python environment runs inside a Singularity container that cannot
exec the host's /usr/bin/zstd via subprocess. Replace the subprocess
call with the zstandard Python library, which was already a dependency.
Other formats (bz2, xz, gz) still use subprocess as before.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:48:04 -07:00
18925dfe5b datasets/: add PYTHON variable to add_months scripts
GNU parallel spawns fresh shells that don't inherit the active venv.
Using an explicit PYTHON path ensures the right interpreter is used in
parallel tasks. Defaults to python3 but can be overridden:

  PYTHON=/path/to/venv/bin/python3 ./add_months.sh ...

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:42:05 -07:00
926e9bc364 datasets/: fat-node add_months.sh; multinode variant as separate script
add_months.sh now targets a single fat node directly: starts a local
Spark cluster via start_spark_cluster.sh, submits jobs, stops the
cluster. No salloc needed.

add_months_multinode.sh is a new script for the multi-node case using
start_spark_and_run.sh from a login node. Usage takes NODES as first arg.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:35:12 -07:00
526dc03732 datasets/add_months.sh: stop before copy step to force manual verification
The script now exits after Part 2 so the copy and cleanup commands must
be run manually. This prevents the live datasets from being touched
without a deliberate verification step in between.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:22:03 -07:00
6b18840604 datasets/: stage new layer before touching live datasets in add_months
Replace mode='append'-direct-to-live approach with a safer staging
workflow: Part 2 writes the new sorted layer to temp staging directories,
the user verifies, then a separate copy step adds the files to the live
datasets. Live datasets are never touched until the copy step, and the
copy only adds files — nothing is deleted or overwritten.

- sort_and_write gains out_by_subreddit/out_by_author params (replaces
  mode param) so Part 2 can target staging paths
- comments_part2.py, submissions_part2.py: expose new params via Fire
- add_months.sh: rewritten with explicit staging dirs, verify checkpoint,
  and find-based copy step

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 18:17:38 -07:00
2d1d760142 datasets/: replace add_new_month with layered append workflow
Add add_months.sh and merge_layers.sh implementing a layered append
strategy for incremental dataset updates. Each incremental run appends
new sorted partition files alongside existing ones rather than re-sorting
the full corpus, which is prohibitively slow at this dataset scale.

- dumps_helper.py: sort_and_write gains indir/mode params; new
  merge_layers function collapses accumulated layers via atomic rename
- comments_part2.py, submissions_part2.py: expose --indir/--mode via Fire
- add_months.sh: new layered append script (not yet tested)
- merge_layers.sh: new layer collapse script (not yet tested)
- comments_merge.py, submissions_merge.py: Spark entry points for merge
- add_new_month.sh: deleted (full re-sort each add is redundant with
  build_from_scratch at corpus scale)
- README.md: document three workflows; flag untested sections

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:59:36 -07:00
1851132a06 move dataset + similarity docs from wiki into repo READMEs
The wiki page CommunityData:CDSC Reddit had a detailed Hyak walkthrough
(Steps 1-7) for refreshing the parquet datasets and a long TF-IDF methods
section, both of which duplicated or risked drifting from the actual code.
Move both into the repo so they stay in sync with the scripts they
describe:

- datasets/README.md: expand with the wiki's "Building Parquet Datasets"
  prose and the Step 1-7 Hyak walkthrough (ported verbatim where possible,
  adapted to the new script names and dropping obsolete notes about
  pull_pushshift_*.sh / check_*_shas.py).
- similarities/README.md (new): port the wiki's Subreddit Similarity
  section — TF-IDF math, PMI phrase detection, cosine similarity — with
  MediaWiki math converted to markdown LaTeX and script references
  updated to current paths.

The wiki page has been trimmed to a landing page that points at these
README files in gitea.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 17:20:21 -07:00
33150243cd datasets/: split parquet scripts; share logic in dumps_helper.py
Follows the helper-module pattern used in similarities/. Replaces
parquet_part1.py and parquet_part2.py (the merged single-file versions
from the previous commit) with:

- dumps_helper.py — schemas, simdjson parser, a generic parse_record
  loop with per-field handler dispatch, and parse_dump / gen_task_list
  / sort_and_write workers. The only per-type code is the field-handler
  dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py — thin Part 1 entry points
  with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py — thin Part 2 entry points
  for the Spark sort. pyspark is imported lazily inside sort_and_write
  so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is
faster on the comments dumps. Field-handler dicts make adding a new
type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't
include time_edited (only the schema did), so error-path rows were
short by one element vs. the schema and would have failed pandas /
pyarrow alignment for any row that hit a JSON parse error. The new
FIELDS lists match the schemas exactly, and the _edited handler
returns a (edited, time_edited) tuple that the generic parse loop
expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:51:41 -07:00
8965a251b6 refactor datasets/ pipeline; add build/add-month workflows
Replace the four per-type scripts (comments/submissions x part1/part2)
with two merged scripts that share all of their plumbing — only the
schema and JSON parser differ between types. Drop the per-source part
rolling; one parquet per input zst, since Spark handles big parquet
files via internal row groups.

Add two thin runner scripts for the two common workflows:
build_from_scratch.sh wipes the temp dirs and processes everything,
add_new_month.sh takes YYYY-MM and parses just that month before
re-running the Spark sort. Every step in the runners is a separate
command so individual stages can be copied out and run standalone
for debugging.

Also fixes several lurking bugs in the original code: the hardcoded
/gscratch/comdata/users/nathante/ output path in comments Part 2;
the df2 = df.sortWithinPartitions typo in submissions Part 2 that
threw away the preceding global sort; references to a missing
parse_submissions.sh in the old .sh runners; and the asymmetry where
comments_2_parquet_part1.py wasn't per-file/fire-driven the way
submissions_2_parquet_part1.py was.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:30:54 -07:00
Nate E TeBlunthuis
e6294b5b90 Refactor and reorganze. 2020-12-08 17:32:20 -08:00