Nathan TeBlunthuis
4b8288c016
add some debug lines.
2026-01-06 19:58:18 -08:00
Nathan TeBlunthuis
93f6ed0ff5
fix bug by truncating corrupted jsonl lines.
2025-12-23 19:52:37 -08:00
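
A minimal sketch of what truncating corrupted JSONL lines before a resume could look like. It is not the code from this commit: the function name `truncate_corrupt_jsonl` is hypothetical, and the sketch assumes that corruption only affects the tail of the file (for example, a line left half-written when the process was killed).

```python
import json


def truncate_corrupt_jsonl(path):
    """Cut a JSONL file back to its last complete, parseable line.

    Hypothetical helper: returns the number of valid lines kept.
    """
    valid_end = 0  # byte offset just past the last good line
    kept = 0
    with open(path, "rb") as f:
        for raw in f:
            # A line without a trailing newline was cut off mid-write.
            if not raw.endswith(b"\n"):
                break
            try:
                json.loads(raw)
            except ValueError:  # covers JSONDecodeError and bad UTF-8
                break
            valid_end += len(raw)
            kept += 1
    # Drop everything after the last good line.
    with open(path, "r+b") as f:
        f.truncate(valid_end)
    return kept
```

Working in bytes keeps the truncation offset exact even when the damaged tail is not valid UTF-8.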
Nathan TeBlunthuis
5ebdb26d82
make resume with jsonl output fault tolerant.
2025-12-23 09:09:51 -08:00
Nathan TeBlunthuis
3f1a9ba862
refactor and enable jsonl output.
2025-12-21 23:42:18 -08:00
Nathan TeBlunthuis
6988a281dc
output parquet files in chunks to avoid memory issues with parquet.
2025-12-20 21:45:39 -08:00
Nathan TeBlunthuis
38dabd0547
only merge the correct partitioned files.
2025-12-19 11:47:18 -08:00
Nathan TeBlunthuis
5d1a246898
don't try to remove files that don't exist.
2025-12-13 11:57:47 -08:00
Nathan TeBlunthuis
6b4f3939a5
more work on resuming.
2025-12-10 21:07:52 -08:00
Nathan TeBlunthuis
c3d31b4ab5
handle case when we have a valid resume file, but a corrupted original.
2025-12-10 20:33:04 -08:00
Nathan TeBlunthuis
f4a9491ff2
improve print debugging.
2025-12-10 19:50:47 -08:00
Nathan TeBlunthuis
c6e96c2f54
try/catch opening original file in resume.
2025-12-10 19:49:29 -08:00
Nathan TeBlunthuis
f427291fd8
add logic for resuming after a resume.
2025-12-10 19:26:54 -08:00
Nathan TeBlunthuis
d1fc094c96
don't put checkpoint files inside namespace directories.
2025-12-07 06:24:04 -08:00
Nathan TeBlunthuis
783f5fd8bc
improve resume logic.
2025-12-07 06:06:26 -08:00
Nathan TeBlunthuis
577ddc87f5
Add per-namespace resume support for partitioned parquet output.
- Implement per-namespace resume points (dict mapping namespace -> (pageid, revid))
  to correctly handle interleaved dump ordering in partitioned output
- Extract resume functionality to dedicated resume.py module
- Add graceful shutdown handling via shutdown_requested flag (CLI-level only)
- Use lazy ParquetWriter creation to avoid empty files on early exit
- Refactor writing logic to _write_batch() helper method
- Simplify control flow by replacing continue statements with should_write flag
2025-12-06 06:56:19 -08:00
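
The per-namespace resume points described in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `resume.py`: the names `should_write` and `advance` are hypothetical (only the `should_write` flag name appears in the commit message), and the sketch assumes that within a single namespace revisions arrive in ascending `(pageid, revid)` order.

```python
from typing import Dict, Tuple

# namespace -> (pageid, revid) of the last revision already written
ResumePoints = Dict[int, Tuple[int, int]]


def should_write(namespace: int, pageid: int, revid: int,
                 resume_points: ResumePoints) -> bool:
    """Return True if this revision still needs to be written."""
    point = resume_points.get(namespace)
    if point is None:
        # No output exists yet for this namespace, so nothing to skip.
        return True
    # Tuples compare element by element: pageid first, then revid.
    return (pageid, revid) > point


def advance(namespace: int, pageid: int, revid: int,
            resume_points: ResumePoints) -> None:
    """Record that this revision has been written for its namespace."""
    resume_points[namespace] = (pageid, revid)
```

Keeping one resume point per namespace rather than a single global one means that a namespace whose partition is further along does not cause revisions from a lagging namespace to be skipped when the dump interleaves them.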