Commit Graph

185 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
2c54425726 use the wikidiff2 diff timeout instead of async. 2025-12-13 14:29:16 -08:00
Nathan TeBlunthuis
5d1a246898 don't try to remove files that don't exist. 2025-12-13 11:57:47 -08:00
Nathan TeBlunthuis
70a10db228 save work after a time limit. 2025-12-11 08:30:32 -08:00
Nathan TeBlunthuis
1001c780fa start fresh if output and resume are both broken. 2025-12-10 21:20:52 -08:00
Nathan TeBlunthuis
6b4f3939a5 more work on resuming. 2025-12-10 21:07:52 -08:00
Nathan TeBlunthuis
c3d31b4ab5 handle case when we have a valid resume file, but a corrupted original. 2025-12-10 20:33:04 -08:00
Nathan TeBlunthuis
f4a9491ff2 improve print debugging. 2025-12-10 19:50:47 -08:00
Nathan TeBlunthuis
c6e96c2f54 try/catch opening original file in resume. 2025-12-10 19:49:29 -08:00
Nathan TeBlunthuis
f427291fd8 add logic for resuming after a resume. 2025-12-10 19:26:54 -08:00
Nathan TeBlunthuis
d1fc094c96 don't put checkpoint files inside namespace directories. 2025-12-07 06:24:04 -08:00
Nathan TeBlunthuis
783f5fd8bc improve resume logic. 2025-12-07 06:06:26 -08:00
Nathan TeBlunthuis
577ddc87f5 Add per-namespace resume support for partitioned parquet output.
- Implement per-namespace resume points (dict mapping namespace -> (pageid, revid))
  to correctly handle interleaved dump ordering in partitioned output
- Extract resume functionality to dedicated resume.py module
- Add graceful shutdown handling via shutdown_requested flag (CLI-level only)
- Use lazy ParquetWriter creation to avoid empty files on early exit
- Refactor writing logic to _write_batch() helper method
- Simplify control flow by replacing continue statements with should_write flag
2025-12-06 06:56:19 -08:00
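
Note on 577ddc87f5: the per-namespace resume points described in this commit message can be illustrated with a minimal sketch. This is not the code in resume.py. The names `should_write`, `revisions_to_write`, `ResumePoint` and `Revision` are hypothetical. The sketch assumes that, within one namespace, pages arrive in increasing pageid order and revisions in increasing revid order, so a tuple comparison is enough to tell whether a row was already written.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical types for illustration only; not taken from resume.py.
ResumePoint = Tuple[int, int]    # (pageid, revid) of the last row written
Revision = Tuple[int, int, int]  # (namespace, pageid, revid)


def should_write(resume_points: Dict[int, ResumePoint],
                 namespace: int, pageid: int, revid: int) -> bool:
    """Return True if this revision comes after its namespace's resume point.

    Assumes (pageid, revid) increases monotonically within a namespace.
    """
    point = resume_points.get(namespace)
    if point is None:
        # Nothing was written for this namespace yet, so write everything.
        return True
    return (pageid, revid) > point


def revisions_to_write(revisions: Iterable[Revision],
                       resume_points: Dict[int, ResumePoint]) -> Iterator[Revision]:
    """Filter an interleaved stream, skipping rows already on disk."""
    for namespace, pageid, revid in revisions:
        if should_write(resume_points, namespace, pageid, revid):
            yield (namespace, pageid, revid)


# Namespaces are interleaved in the stream, so each keeps its own resume point.
resume_points = {0: (10, 105), 1: (7, 300)}
stream = [(0, 10, 104), (1, 7, 300), (0, 10, 106), (1, 8, 1), (2, 3, 1)]
remaining = list(revisions_to_write(stream, resume_points))
```

A single global resume point would not work for this stream: namespace 1 is still at page 7 while namespace 0 has already reached page 10, which is presumably why the commit keys the resume point by namespace.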
Nathan TeBlunthuis
d69d8b0df2 fix baseline output for new columns. 2025-12-02 19:22:08 -08:00
Nathan TeBlunthuis
5ce9808b50 add templates and headings to wikiq. 2025-12-02 17:51:08 -08:00
Nathan TeBlunthuis
d3517ed5ca extract wikilinks. 2025-12-02 14:09:29 -08:00
Nathan TeBlunthuis
329341efb6 improve tests. 2025-12-02 13:52:12 -08:00
Nathan TeBlunthuis
76626a2785 Start working on adding columns from mwparserfromhell. 2025-12-02 12:26:03 -08:00
Nathan TeBlunthuis
b46f98a875 make --resume work with partitioned namespaces. 2025-12-01 07:19:52 -08:00
Nathan TeBlunthuis
3c26185739 enable --resuming from interrupted jobs. 2025-11-30 20:36:31 -08:00
Nathan TeBlunthuis
95b33123e3 revert previous and decrease timeout. 2025-11-28 20:29:51 -08:00
Nathan TeBlunthuis
5c4fc6d5a0 let cache capacity be large. 2025-11-28 19:22:43 -08:00
Nathan TeBlunthuis
77c7d2ba97 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs 2025-11-24 11:03:24 -08:00
Nathan TeBlunthuis
c40930d7d2 use ssh for gitea. 2025-11-24 11:01:36 -08:00
Nathan TeBlunthuis
3b4c9c4441 index particular commit in pywikidiff2. 2025-11-18 20:17:52 -08:00
Nathan TeBlunthuis
3a480940e9 fix error logging. 2025-08-07 09:38:42 -07:00
Nathan TeBlunthuis
a1f94078c4 improve style. 2025-08-07 09:20:49 -07:00
Nathan TeBlunthuis
329d682f4c fix asyncio bug. 2025-08-07 09:10:16 -07:00
Nathan TeBlunthuis
19f67b3679 try fixing coro issue. 2025-08-07 08:58:45 -07:00
Nathan TeBlunthuis
9b3237014d fix a couple possible bugs. 2025-08-05 23:20:04 -07:00
Nathan TeBlunthuis
bd8c30d80f fix raising exception. 2025-08-04 07:57:31 -07:00
Nathan TeBlunthuis
adf02310ef bugfix 2025-08-03 21:54:41 -07:00
Nathan TeBlunthuis
a563eaf6fc Timeout diffs. 2025-08-03 20:04:51 -07:00
Nathan TeBlunthuis
730c678f51 disable cache limits. 2025-08-03 11:50:57 -07:00
Nathan TeBlunthuis
77f367d95e Revert changes related to row-buffering to just "increase cache size."
This reverts commit 1f08c01cf1.
2025-08-03 09:37:35 -07:00
Nathan TeBlunthuis
1f08c01cf1 increase cache size. 2025-08-03 09:24:35 -07:00
Nathan TeBlunthuis
2f853a879d reduce memory a titch more. 2025-08-01 20:10:38 -07:00
Nathan TeBlunthuis
9799919470 reduce memory even more. 2025-08-01 19:59:36 -07:00
Nathan TeBlunthuis
7528dc8b8e try reducing memory more. 2025-08-01 19:52:18 -07:00
Nathan TeBlunthuis
615d630ff0 reduce memory usage. 2025-08-01 19:45:21 -07:00
Nathan TeBlunthuis
32bc05ddfd set cache limits. 2025-08-01 19:30:46 -07:00
Nathan TeBlunthuis
ef78310580 reduce cache more. 2025-08-01 19:25:50 -07:00
Nathan TeBlunthuis
40a92d2db6 reduce wd2 cache size 2025-08-01 19:18:26 -07:00
Nathan TeBlunthuis
6bec0de9b2 configure wikidiff2. 2025-08-01 18:53:07 -07:00
Nathan TeBlunthuis
54e996b910 configure pywikidiff2 cache limits. 2025-08-01 09:24:54 -07:00
Nathan TeBlunthuis
83c92d1a37 decrease moved paragraph detection cutoff to see if that fixes memory issue. 2025-07-22 13:29:01 -07:00
Nathan TeBlunthuis
076df15740 force garbage collection. 2025-07-22 13:13:18 -07:00
Nathan TeBlunthuis
6557e25af7 make a new pywikidiff2 object for each revision to reduce memory. 2025-07-22 09:50:30 -07:00
Nathan TeBlunthuis
d20075b323 add memray for debugging memory usage. 2025-07-17 15:17:23 -07:00
Nathan TeBlunthuis
6d03cac28d decrease batch_size. 2025-07-15 19:37:26 -07:00
Nathan TeBlunthuis
3a44cfd4da increase batch size. 2025-07-15 19:09:36 -07:00