Follows the helper-module pattern used in similarities/.

Replaces parquet_part1.py and parquet_part2.py (the merged single-file versions from the previous commit) with:

- dumps_helper.py: schemas, simdjson parser, a generic parse_record loop with per-field handler dispatch, and parse_dump / gen_task_list / sort_and_write workers. The only per-type code is the field-handler dicts and the type-config dicts (COMMENTS, SUBMISSIONS) at the top.
- comments_part1.py, submissions_part1.py: thin Part 1 entry points with fire CLIs (parse_dump, gen_task_list).
- comments_part2.py, submissions_part2.py: thin Part 2 entry points for the Spark sort. pyspark is imported lazily inside sort_and_write so Part 1 callers don't pay the import cost.

Unifies on simdjson for both types (drops the json import), which is faster on the comments dumps. Field-handler dicts make adding a new type or field a one-place edit.

Also fixes a latent bug in the original: the FIELDS lists didn't include time_edited (only the schema did), so error-path rows were short by one element vs. the schema and would have failed pandas / pyarrow alignment for any row that hit a JSON parse error. The new FIELDS lists match the schemas exactly, and the _edited handler returns a (edited, time_edited) tuple that the generic parse loop expands.

Runners and README updated for the new CLIs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
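The handler dispatch, tuple expansion, and error-path alignment described in the commit message could be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of dumps_helper.py: the field names, `SOURCE_KEYS`, `HANDLERS`, and the exact behavior of `_edited` are hypothetical, and the standard `json` module stands in for simdjson so the sketch stays self-contained.

```python
import json

# Hypothetical output columns; the point of the fix is that this list
# matches the schema column-for-column, including time_edited.
FIELDS = ["id", "author", "edited", "time_edited"]

# Hypothetical source keys read from each JSON record, in output order.
# time_edited has no key of its own; the 'edited' handler produces it.
SOURCE_KEYS = ["id", "author", "edited"]


def _edited(value):
    """Split the raw 'edited' value into an (edited, time_edited) pair.

    Assumption: the raw value is either missing/False or an edit timestamp.
    """
    if value is None or value is False:
        return (False, None)
    if value is True:
        return (True, None)
    return (True, int(value))


# Hypothetical handler table: source key -> function. A handler may return
# a tuple, which the parse loop expands into several columns.
HANDLERS = {"edited": _edited}


def parse_record(line):
    """Return one row with exactly len(FIELDS) elements, even on errors."""
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # Error path: pad to the full schema width so rows stay aligned.
        return [None] * len(FIELDS)

    row = []
    for key in SOURCE_KEYS:
        value = record.get(key)
        handler = HANDLERS.get(key)
        result = handler(value) if handler else value
        if isinstance(result, tuple):
            row.extend(result)  # one source key, several output columns
        else:
            row.append(result)
    return row
```

Because both the normal path and the error path yield `len(FIELDS)` elements, every row lines up with the schema regardless of whether the input line parsed.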
25 lines
719 B
Python
Executable File
#!/usr/bin/env python3
"""Part 1 for comments: parse one RC_*.zst dump into a parquet file.

CLI:
    comments_part1.py parse_dump RC_2018-08.zst
    comments_part1.py gen_task_list
    comments_part1.py parse_dump RC_2018-08.zst --dumpdir=/tmp/in --outdir=/tmp/out
"""

import fire
from dumps_helper import COMMENTS, parse_dump, gen_task_list


def _parse_dump(partition, dumpdir=None, outdir=None):
    parse_dump(COMMENTS, partition, dumpdir=dumpdir, outdir=outdir)


def _gen_task_list(dumpdir=None, tasklist=None):
    gen_task_list(COMMENTS, 'comments_part1.py', dumpdir=dumpdir, tasklist=tasklist)


if __name__ == "__main__":
    fire.Fire({'parse_dump': _parse_dump,
               'gen_task_list': _gen_task_list})