refactor datasets/ pipeline; add build/add-month workflows

Replace the four per-type scripts (comments/submissions x part1/part2) with two merged scripts that share all of their plumbing; only the schema and JSON parser differ between types. Drop the per-source part rolling; one parquet per input zst, since Spark handles big parquet files via internal row groups.

Add two thin runner scripts for the two common workflows: build_from_scratch.sh wipes the temp dirs and processes everything; add_new_month.sh takes YYYY-MM and parses just that month before re-running the Spark sort. Every step in the runners is a separate command so individual stages can be copied out and run standalone for debugging.

Also fixes several lurking bugs in the original code:

- the hardcoded /gscratch/comdata/users/nathante/ output path in comments Part 2
- the df2 = df.sortWithinPartitions typo in submissions Part 2 that threw away the preceding global sort
- references to a missing parse_submissions.sh in the old .sh runners
- the asymmetry where comments_2_parquet_part1.py wasn't per-file/fire-driven the way submissions_2_parquet_part1.py was

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-25 16:30:54 -07:00
parent d201930951
commit 8965a251b6
12 changed files with 485 additions and 327 deletions
--- a/datasets/comments_2_parquet.sh
+++ b/datasets/comments_2_parquet.sh
@@ -1,10 +0,0 @@
-## needs to be run by hand since i don't have a nice way of waiting on a parallel-sql job to complete 
-
-#!/usr/bin/env bash
-echo "#!/usr/bin/bash" > job_script.sh
-#echo "source $(pwd)/../bin/activate" >> job_script.sh
-echo "python3 $(pwd)/comments_2_parquet_part1.py" >> job_script.sh
-
-srun -p comdata -A comdata --nodes=1 --mem=120G --time=48:00:00 --pty job_script.sh
-
-start_spark_and_run.sh 1 $(pwd)/comments_2_parquet_part2.py