Commit Graph

  • c7eb374ceb use signalling to timeout mwparserfromhell instead of asyncio. [jsonl-output] Nathan TeBlunthuis 2026-01-07 12:42:37 -08:00
  • 4b8288c016 add some debug lines. Nathan TeBlunthuis 2026-01-06 19:58:18 -08:00
  • 8590e5f920 fix jsonl.d output. Nathan TeBlunthuis 2025-12-30 11:26:24 -08:00
  • 93f6ed0ff5 fix bug by truncating corrupted jsonl lines. Nathan TeBlunthuis 2025-12-23 19:52:37 -08:00
  • 5ebdb26d82 make resume with jsonl output fault tolerant. Nathan TeBlunthuis 2025-12-23 09:09:51 -08:00
  • 9e6b0fb64c make updating the checkpoint files atomic. Nathan TeBlunthuis 2025-12-23 08:41:38 -08:00
  • d822085698 support .jsonl.d Nathan TeBlunthuis 2025-12-22 20:13:04 -08:00
  • 618c343898 allow output dir to be jsonl.d. Nathan TeBlunthuis 2025-12-21 23:50:00 -08:00
  • 3f1a9ba862 refactor and enable jsonl output. Nathan TeBlunthuis 2025-12-21 23:42:18 -08:00
  • 6988a281dc output parquet files in chunks to avoid memory issues with parquet. Nathan TeBlunthuis 2025-12-20 21:45:39 -08:00
  • 6a4bf81e1a add test for two wikiq jobs in the same directory. Nathan TeBlunthuis 2025-12-19 11:50:56 -08:00
  • 38dabd0547 only merge the correct partitioned files. Nathan TeBlunthuis 2025-12-19 11:47:18 -08:00
  • 006feb795c fix interruption handling by breaking the diff loop. Nathan TeBlunthuis 2025-12-18 18:00:30 -08:00
  • d7f5abef2d resume starts fresh if the first run didn't happen Nathan TeBlunthuis 2025-12-13 15:41:44 -08:00
  • 2c54425726 use the wikidiff2 diff timeout instead of async. Nathan TeBlunthuis 2025-12-13 14:29:16 -08:00
  • 5d1a246898 don't try to remove files that don't exist. Nathan TeBlunthuis 2025-12-13 11:57:47 -08:00
  • 70a10db228 save work after a time limit. Nathan TeBlunthuis 2025-12-11 08:30:32 -08:00
  • 1001c780fa start fresh if output and resume are both broken. Nathan TeBlunthuis 2025-12-10 21:20:52 -08:00
  • 6b4f3939a5 more work on resuming. Nathan TeBlunthuis 2025-12-10 21:07:52 -08:00
  • c3d31b4ab5 handle case when we have a valid resume file, but a corrupted original. Nathan TeBlunthuis 2025-12-10 20:33:04 -08:00
  • f4a9491ff2 improve print debugging. Nathan TeBlunthuis 2025-12-10 19:50:47 -08:00
  • c6e96c2f54 try/catch opening original file in resume. Nathan TeBlunthuis 2025-12-10 19:49:29 -08:00
  • f427291fd8 add logic for resuming after a resume. Nathan TeBlunthuis 2025-12-10 19:26:54 -08:00
  • d1fc094c96 don't put checkpoint files inside namespace directories. Nathan TeBlunthuis 2025-12-07 06:24:04 -08:00
  • 783f5fd8bc improve resume logic. Nathan TeBlunthuis 2025-12-07 06:06:26 -08:00
  • 577ddc87f5 Add per-namespace resume support for partitioned parquet output. Nathan TeBlunthuis 2025-12-06 06:56:19 -08:00
  • d69d8b0df2 fix baseline output for new columns. Nathan TeBlunthuis 2025-12-02 19:22:08 -08:00
  • 5ce9808b50 add templates and headings to wikiq. Nathan TeBlunthuis 2025-12-02 17:51:08 -08:00
  • d3517ed5ca extract wikilinks. Nathan TeBlunthuis 2025-12-02 14:09:29 -08:00
  • 329341efb6 improve tests. Nathan TeBlunthuis 2025-12-02 13:52:12 -08:00
  • 76626a2785 Start working on adding columns from mwparserfromhell. Nathan TeBlunthuis 2025-12-02 12:26:03 -08:00
  • b46f98a875 make --resume work with partitioned namespaces. Nathan TeBlunthuis 2025-12-01 07:19:52 -08:00
  • 3c26185739 enable --resuming from interrupted jobs. Nathan TeBlunthuis 2025-11-30 20:36:31 -08:00
  • 95b33123e3 revert previous and decrease timeout. Nathan TeBlunthuis 2025-11-28 20:29:51 -08:00
  • 5c4fc6d5a0 let cache capacity be large. Nathan TeBlunthuis 2025-11-28 19:22:43 -08:00
  • 77c7d2ba97 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs Nathan TeBlunthuis 2025-11-24 11:03:24 -08:00
  • c40930d7d2 use ssh for gitea. Nathan TeBlunthuis 2025-11-24 11:01:36 -08:00
  • 3b4c9c4441 index particular commit in pywikidiff2. [compute-diffs] Nathan TeBlunthuis 2025-11-18 20:17:52 -08:00
  • 3a480940e9 fix error logging. Nathan TeBlunthuis 2025-08-07 09:38:42 -07:00
  • a1f94078c4 improve style. Nathan TeBlunthuis 2025-08-07 09:20:49 -07:00
  • 329d682f4c fix asyncio bug. Nathan TeBlunthuis 2025-08-07 09:10:16 -07:00
  • 19f67b3679 try fixing coro issue. Nathan TeBlunthuis 2025-08-07 08:58:45 -07:00
  • 9b3237014d fix a couple possible bugs. Nathan TeBlunthuis 2025-08-05 23:20:04 -07:00
  • bd8c30d80f fix raising exception. Nathan TeBlunthuis 2025-08-04 07:57:31 -07:00
  • adf02310ef bugfix Nathan TeBlunthuis 2025-08-03 21:54:41 -07:00
  • a563eaf6fc Timeout diffs. Nathan TeBlunthuis 2025-08-03 20:02:18 -07:00
  • 730c678f51 disable cache limits. Nathan TeBlunthuis 2025-08-03 11:50:57 -07:00
  • 77f367d95e Revert changes related to row-buffering to just "increase cache size." Nathan TeBlunthuis 2025-08-03 09:36:42 -07:00
  • 1f08c01cf1 increase cache size. Nathan TeBlunthuis 2025-08-03 09:24:35 -07:00
  • 2f853a879d reduce memory a tich more. Nathan TeBlunthuis 2025-08-01 20:10:38 -07:00
  • 9799919470 reduce memory even more. Nathan TeBlunthuis 2025-08-01 19:59:36 -07:00
  • 7528dc8b8e try reducing memory more. Nathan TeBlunthuis 2025-08-01 19:52:18 -07:00
  • 615d630ff0 reduce memory usage. Nathan TeBlunthuis 2025-08-01 19:45:21 -07:00
  • 32bc05ddfd set cache limits. Nathan TeBlunthuis 2025-08-01 19:30:46 -07:00
  • ef78310580 reduce cache more. Nathan TeBlunthuis 2025-08-01 19:25:50 -07:00
  • 40a92d2db6 reduce wd2 cache size Nathan TeBlunthuis 2025-08-01 19:18:26 -07:00
  • 6bec0de9b2 configure wikidiff2. Nathan TeBlunthuis 2025-08-01 18:52:18 -07:00
  • 54e996b910 configure pywikidiff2 cache limits. Nathan TeBlunthuis 2025-08-01 09:24:54 -07:00
  • 83c92d1a37 decrease moved paragraph detection cutoff to see if that fixes memory issue. Nathan TeBlunthuis 2025-07-22 13:29:01 -07:00
  • 076df15740 force garbage collection. Nathan TeBlunthuis 2025-07-22 13:13:18 -07:00
  • 6557e25af7 make a new pywikidiff2 object for each revision to reduce memory. Nathan TeBlunthuis 2025-07-22 09:50:30 -07:00
  • d20075b323 add memray for debugging memory usage. Nathan TeBlunthuis 2025-07-17 15:17:23 -07:00
  • 6d03cac28d decrease batch_size. Nathan TeBlunthuis 2025-07-15 19:37:26 -07:00
  • 3a44cfd4da increase batch size. Nathan TeBlunthuis 2025-07-15 19:09:36 -07:00
  • 0fbe788e31 use ichunked instead of chunked. Nathan TeBlunthuis 2025-07-15 18:25:44 -07:00
  • 37d095199a inc version. Nathan TeBlunthuis 2025-07-15 15:37:55 -07:00
  • 6b04791de2 reduce batch size. Nathan TeBlunthuis 2025-07-15 15:31:00 -07:00
  • 507335941d Revert "Merge branch 'compute-diffs' into HEAD" Nathan TeBlunthuis 2025-07-15 15:23:50 -07:00
  • 907a35323e Merge branch 'compute-diffs' into HEAD Nathan TeBlunthuis 2025-07-15 15:23:13 -07:00
  • c40506137b make wikiq memory efficient again via batch processing. Nathan TeBlunthuis 2025-07-15 15:20:17 -07:00
  • e53e7ada5d try fixing the memory problem. Nathan TeBlunthuis 2025-07-14 18:58:27 -07:00
  • 76d54ae597 support partitioning output parquet by namespace. Nathan TeBlunthuis 2025-07-07 20:58:43 -07:00
  • c9fb94ccc0 fix tests. Nathan TeBlunthuis 2025-07-07 20:25:00 -07:00
  • ac1dd47b08 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs Nathan TeBlunthuis 2025-07-07 20:16:38 -07:00
  • c597a6b7f4 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -07:00
  • a2984bc656 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -07:00
  • 56c90fe1cc add missing files + add sorted_columns metadata. Nathan TeBlunthuis 2025-07-07 19:08:31 -07:00
  • d6c4c0a416 add (optional) diff and text columns to output. Nathan TeBlunthuis 2025-07-07 14:39:52 -07:00
  • a8e9e7f4fd wikidiff2 integration: pwr complete. Nathan TeBlunthuis 2025-07-07 12:06:43 -07:00
  • 58c595bf0b add test files. Nathan TeBlunthuis 2025-07-07 11:22:56 -07:00
  • cc96bb5f3f remove server. Nathan TeBlunthuis 2025-07-07 11:21:16 -07:00
  • 14e819e565 compare pywikidiff2 to making requests to wikidiff2. Nathan TeBlunthuis 2025-07-07 10:51:11 -07:00
  • 4654911533 almost there. working out edge cases. Nathan TeBlunthuis 2025-07-03 21:32:44 -07:00
  • cf1fb61a84 WIP: fixing bugs and adding newlines to output. Nathan TeBlunthuis 2025-07-02 13:31:32 -07:00
  • c4acc711d2 finish support for paragraph move. Nathan TeBlunthuis 2025-07-01 11:16:08 -07:00
  • 20de5b93f9 Merge branch 'tmp' into compute-diffs Nathan TeBlunthuis 2025-06-30 20:52:23 -05:00
  • 37734ed092 add test. Nathan TeBlunthuis 2025-06-30 15:45:56 -07:00
  • 5a3e4102b5 got wikidiff2 persistence working except for paragraph moves. Nathan TeBlunthuis 2025-06-30 15:37:54 -07:00
  • 186cb82fb8 some work on wiki_diff_matcher.py Nathan TeBlunthuis 2025-06-27 07:13:41 -07:00
  • bc7f186112 Start interoperability between wikidiff2 and deltas Will Beason 2025-06-26 16:08:50 -05:00
  • 1ec8bfaad4 Add php.ini file Will Beason 2025-06-24 09:24:35 -05:00
  • 94454ffca3 Add PHP server file Will Beason 2025-06-23 14:17:53 -05:00
  • 62db384aa4 Pass arrays of diffs instead of incremental Will Beason 2025-06-23 14:17:01 -05:00
  • 96915a074b Add call to compute diffs via local PHP server Will Beason 2025-06-23 13:09:27 -05:00
  • 0d9ab003f0 Fix tests for new field Will Beason 2025-06-17 12:44:07 -05:00
  • 4bbed4a196 Merge branch 'parquet_support' into test-parquet Will Beason 2025-06-17 12:20:19 -05:00
  • 11d2587471 Add docs and rename import pc -> pacsv Will Beason 2025-06-17 11:46:16 -05:00
  • 586ae85c65 Conform to 3.9 union type formatting Will Beason 2025-06-17 11:41:46 -05:00
  • 390499dd90 Pin to python 3.9 Will Beason 2025-06-17 11:37:20 -05:00
  • 84d464ea38 Remove unnecessary re-conversion to list(revs) Will Beason 2025-06-17 11:23:24 -05:00