Commit Graph

13 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
5ebdb26d82 make resume with jsonl output fault tolerant. 2025-12-23 09:09:51 -08:00
Nathan TeBlunthuis
3f1a9ba862 refactor and enable jsonl output. 2025-12-21 23:42:18 -08:00
Nathan TeBlunthuis
6988a281dc output parquet files in chunks to avoid memory issues with parquet. 2025-12-20 21:45:39 -08:00
Nathan TeBlunthuis
38dabd0547 only merge the correct partitioned files. 2025-12-19 11:47:18 -08:00
Nathan TeBlunthuis
5d1a246898 don't try to remove files that don't exist. 2025-12-13 11:57:47 -08:00
Nathan TeBlunthuis
6b4f3939a5 more work on resuming. 2025-12-10 21:07:52 -08:00
Nathan TeBlunthuis
c3d31b4ab5 handle case when we have a valid resume file, but a corrupted original. 2025-12-10 20:33:04 -08:00
Nathan TeBlunthuis
f4a9491ff2 improve print debugging. 2025-12-10 19:50:47 -08:00
Nathan TeBlunthuis
c6e96c2f54 try/catch opening original file in resume. 2025-12-10 19:49:29 -08:00
Nathan TeBlunthuis
f427291fd8 add logic for resuming after a resume. 2025-12-10 19:26:54 -08:00
Nathan TeBlunthuis
d1fc094c96 don't put checkpoint files inside namespace directories. 2025-12-07 06:24:04 -08:00
Nathan TeBlunthuis
783f5fd8bc improve resume logic. 2025-12-07 06:06:26 -08:00
Nathan TeBlunthuis
577ddc87f5 Add per-namespace resume support for partitioned parquet output.
- Implement per-namespace resume points (dict mapping namespace -> (pageid, revid))
  to correctly handle interleaved dump ordering in partitioned output
- Extract resume functionality to dedicated resume.py module
- Add graceful shutdown handling via shutdown_requested flag (CLI-level only)
- Use lazy ParquetWriter creation to avoid empty files on early exit
- Refactor writing logic to _write_batch() helper method
- Simplify control flow by replacing continue statements with should_write flag
2025-12-06 06:56:19 -08:00
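
The per-namespace resume points described in commit 577ddc87f5 can be illustrated with a minimal sketch. This is not the code from resume.py. The function names, the `(namespace, pageid, revid)` tuple layout, and the assumption that revisions within a single namespace arrive in increasing `(pageid, revid)` order are all hypothetical and chosen only for illustration. Only the dict shape (namespace -> (pageid, revid)) and the `should_write` name come from the commit message.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical types. The commit message only states that resume state is
# a dict mapping namespace -> (pageid, revid).
ResumePoint = Tuple[int, int]      # (pageid, revid)
Revision = Tuple[int, int, int]    # (namespace, pageid, revid)


def update_resume_points(
    points: Dict[int, ResumePoint], written: Iterable[Revision]
) -> Dict[int, ResumePoint]:
    """Record the last (pageid, revid) written for each namespace."""
    for namespace, pageid, revid in written:
        points[namespace] = (pageid, revid)
    return points


def should_write(rev: Revision, points: Dict[int, ResumePoint]) -> bool:
    """Return True if rev lies past the resume point of its own namespace.

    Assumes (hypothetically) that within one namespace revisions arrive in
    increasing (pageid, revid) order, so a tuple comparison is enough.
    """
    namespace, pageid, revid = rev
    point = points.get(namespace)
    if point is None:
        # Nothing was written for this namespace before the interruption.
        return True
    return (pageid, revid) > point


def revisions_to_write(
    revisions: Iterable[Revision], points: Dict[int, ResumePoint]
) -> Iterator[Revision]:
    """Yield only revisions that still need writing, using a flag
    instead of `continue` to skip already-written rows."""
    for rev in revisions:
        write = should_write(rev, points)
        if write:
            yield rev
```

For example, with an interleaved dump `[(0, 1, 10), (1, 2, 11), (0, 3, 12), (1, 4, 13), (0, 5, 14)]` and resume points `{0: (3, 12), 1: (2, 11)}`, `revisions_to_write` yields `(1, 4, 13)` and `(0, 5, 14)`. Each namespace is compared only against its own resume point, which is the property the commit message attributes to the per-namespace design.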