Commit Graph

  • d7f5abef2d resume starts fresh if the first run didn't happen (compute-diffs) Nathan TeBlunthuis 2025-12-13 15:41:44 -0800
  • 2c54425726 use the wikidiff2 diff timeout instead of async. Nathan TeBlunthuis 2025-12-13 14:29:16 -0800
  • 5d1a246898 don't try to remove files that don't exist. Nathan TeBlunthuis 2025-12-13 11:57:47 -0800
  • 70a10db228 save work after a time limit. Nathan TeBlunthuis 2025-12-11 08:30:32 -0800
  • 1001c780fa start fresh if output and resume are both broken. Nathan TeBlunthuis 2025-12-10 21:20:52 -0800
  • 6b4f3939a5 more work on resuming. Nathan TeBlunthuis 2025-12-10 21:07:52 -0800
  • c3d31b4ab5 handle case when we have a valid resume file, but a corrupted original. Nathan TeBlunthuis 2025-12-10 20:33:04 -0800
  • f4a9491ff2 improve print debugging. Nathan TeBlunthuis 2025-12-10 19:50:47 -0800
  • c6e96c2f54 try/catch opening original file in resume. Nathan TeBlunthuis 2025-12-10 19:49:29 -0800
  • f427291fd8 add logic for resuming after a resume. Nathan TeBlunthuis 2025-12-10 19:26:54 -0800
  • d1fc094c96 don't put checkpoint files inside namespace directories. Nathan TeBlunthuis 2025-12-07 06:24:04 -0800
  • 783f5fd8bc improve resume logic. Nathan TeBlunthuis 2025-12-07 06:06:26 -0800
  • 577ddc87f5 Add per-namespace resume support for partitioned parquet output. Nathan TeBlunthuis 2025-12-06 06:56:19 -0800
  • d69d8b0df2 fix baseline output for new columns. Nathan TeBlunthuis 2025-12-02 19:22:08 -0800
  • 5ce9808b50 add templates and headings to wikiq. Nathan TeBlunthuis 2025-12-02 17:51:08 -0800
  • d3517ed5ca extract wikilinks. Nathan TeBlunthuis 2025-12-02 14:09:29 -0800
  • 329341efb6 improve tests. Nathan TeBlunthuis 2025-12-02 13:52:12 -0800
  • 76626a2785 Start working on adding columns from mwparserfromhell. Nathan TeBlunthuis 2025-12-02 12:26:03 -0800
  • b46f98a875 make --resume work with partitioned namespaces. Nathan TeBlunthuis 2025-12-01 07:19:52 -0800
  • 3c26185739 enable --resuming from interrupted jobs. Nathan TeBlunthuis 2025-11-30 20:36:31 -0800
  • 95b33123e3 revert previous and decrease timeout. Nathan TeBlunthuis 2025-11-28 20:29:51 -0800
  • 5c4fc6d5a0 let cache capacity be large. Nathan TeBlunthuis 2025-11-28 19:22:43 -0800
  • 77c7d2ba97 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs Nathan TeBlunthuis 2025-11-24 11:03:24 -0800
  • c40930d7d2 use ssh for gitea. Nathan TeBlunthuis 2025-11-24 11:01:36 -0800
  • 3b4c9c4441 index particular commit in pywikidiff2. Nathan TeBlunthuis 2025-11-18 20:17:52 -0800
  • 3a480940e9 fix error logging. Nathan TeBlunthuis 2025-08-07 09:38:42 -0700
  • a1f94078c4 improve style. Nathan TeBlunthuis 2025-08-07 09:20:49 -0700
  • 329d682f4c fix asyncio bug. Nathan TeBlunthuis 2025-08-07 09:10:16 -0700
  • 19f67b3679 try fixing coro issue. Nathan TeBlunthuis 2025-08-07 08:58:45 -0700
  • 9b3237014d fix a couple possible bugs. Nathan TeBlunthuis 2025-08-05 23:20:04 -0700
  • bd8c30d80f fix raising exception. Nathan TeBlunthuis 2025-08-04 07:57:31 -0700
  • adf02310ef bugfix Nathan TeBlunthuis 2025-08-03 21:54:41 -0700
  • a563eaf6fc Timeout diffs. Nathan TeBlunthuis 2025-08-03 20:02:18 -0700
  • 730c678f51 disable cache limits. Nathan TeBlunthuis 2025-08-03 11:50:57 -0700
  • 77f367d95e Revert changes related to row-buffering to just "increase cache size." Nathan TeBlunthuis 2025-08-03 09:36:42 -0700
  • 1f08c01cf1 increase cache size. Nathan TeBlunthuis 2025-08-03 09:24:35 -0700
  • 2f853a879d reduce memory a titch more. Nathan TeBlunthuis 2025-08-01 20:10:38 -0700
  • 9799919470 reduce memory even more. Nathan TeBlunthuis 2025-08-01 19:59:36 -0700
  • 7528dc8b8e try reducing memory more. Nathan TeBlunthuis 2025-08-01 19:52:18 -0700
  • 615d630ff0 reduce memory usage. Nathan TeBlunthuis 2025-08-01 19:45:21 -0700
  • 32bc05ddfd set cache limits. Nathan TeBlunthuis 2025-08-01 19:30:46 -0700
  • ef78310580 reduce cache more. Nathan TeBlunthuis 2025-08-01 19:25:50 -0700
  • 40a92d2db6 reduce wd2 cache size Nathan TeBlunthuis 2025-08-01 19:18:26 -0700
  • 6bec0de9b2 configure wikidiff2. Nathan TeBlunthuis 2025-08-01 18:52:18 -0700
  • 54e996b910 configure pywikidiff2 cache limits. Nathan TeBlunthuis 2025-08-01 09:24:54 -0700
  • 83c92d1a37 decrease moved paragraph detection cutoff to see if that fixes memory issue. Nathan TeBlunthuis 2025-07-22 13:29:01 -0700
  • 076df15740 force garbage collection. Nathan TeBlunthuis 2025-07-22 13:13:18 -0700
  • 6557e25af7 make a new pywikidiff2 object for each revision to reduce memory. Nathan TeBlunthuis 2025-07-22 09:50:30 -0700
  • d20075b323 add memray for debugging memory usage. Nathan TeBlunthuis 2025-07-17 15:17:23 -0700
  • 6d03cac28d decrease batch_size. Nathan TeBlunthuis 2025-07-15 19:37:26 -0700
  • 3a44cfd4da increase batch size. Nathan TeBlunthuis 2025-07-15 19:09:36 -0700
  • 0fbe788e31 use ichunked instead of chunked. Nathan TeBlunthuis 2025-07-15 18:25:44 -0700
  • 37d095199a inc version. Nathan TeBlunthuis 2025-07-15 15:37:55 -0700
  • 6b04791de2 reduce batch size. Nathan TeBlunthuis 2025-07-15 15:31:00 -0700
  • 507335941d Revert "Merge branch 'compute-diffs' into HEAD" Nathan TeBlunthuis 2025-07-15 15:23:50 -0700
  • 907a35323e Merge branch 'compute-diffs' into HEAD Nathan TeBlunthuis 2025-07-15 15:23:13 -0700
  • c40506137b make wikiq memory efficient again via batch processing. Nathan TeBlunthuis 2025-07-15 15:20:17 -0700
  • e53e7ada5d try fixing the memory problem. Nathan TeBlunthuis 2025-07-14 18:58:27 -0700
  • 76d54ae597 support partitioning output parquet by namespace. Nathan TeBlunthuis 2025-07-07 20:58:43 -0700
  • c9fb94ccc0 fix tests. Nathan TeBlunthuis 2025-07-07 20:25:00 -0700
  • ac1dd47b08 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs Nathan TeBlunthuis 2025-07-07 20:16:38 -0700
  • c597a6b7f4 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -0700
  • a2984bc656 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -0700
  • 56c90fe1cc add missing files + add sorted_columns metadata. Nathan TeBlunthuis 2025-07-07 19:08:31 -0700
  • d6c4c0a416 add (optional) diff and text columns to output. Nathan TeBlunthuis 2025-07-07 14:39:52 -0700
  • a8e9e7f4fd wikidiff2 integration: pwr complete. Nathan TeBlunthuis 2025-07-07 12:06:43 -0700
  • 58c595bf0b add test files. Nathan TeBlunthuis 2025-07-07 11:22:56 -0700
  • cc96bb5f3f remove server. Nathan TeBlunthuis 2025-07-07 11:21:16 -0700
  • 14e819e565 compare pywikidiff2 to making requests to wikidiff2. Nathan TeBlunthuis 2025-07-07 10:51:11 -0700
  • 4654911533 almost there. working out edge cases. Nathan TeBlunthuis 2025-07-03 21:32:44 -0700
  • cf1fb61a84 WIP: fixing bugs and adding newlines to output. Nathan TeBlunthuis 2025-07-02 13:31:32 -0700
  • c4acc711d2 finish support for paragraph move. Nathan TeBlunthuis 2025-07-01 11:16:08 -0700
  • 20de5b93f9 Merge branch 'tmp' into compute-diffs Nathan TeBlunthuis 2025-06-30 20:52:23 -0500
  • 37734ed092 add test. Nathan TeBlunthuis 2025-06-30 15:45:56 -0700
  • 5a3e4102b5 got wikidiff2 persistence working except for paragraph moves. Nathan TeBlunthuis 2025-06-30 15:37:54 -0700
  • 186cb82fb8 some work on wiki_diff_matcher.py Nathan TeBlunthuis 2025-06-27 07:13:41 -0700
  • bc7f186112 Start interoperability between wikidiff2 and deltas Will Beason 2025-06-26 16:08:50 -0500
  • 1ec8bfaad4 Add php.ini file Will Beason 2025-06-24 09:24:35 -0500
  • 94454ffca3 Add PHP server file Will Beason 2025-06-23 14:17:53 -0500
  • 62db384aa4 Pass arrays of diffs instead of incremental Will Beason 2025-06-23 14:17:01 -0500
  • 96915a074b Add call to compute diffs via local PHP server Will Beason 2025-06-23 13:09:27 -0500
  • 0d9ab003f0 Fix tests for new field parquet_support Will Beason 2025-06-17 12:44:07 -0500
  • 4bbed4a196 Merge branch 'parquet_support' into test-parquet Will Beason 2025-06-17 12:20:19 -0500
  • 11d2587471 Add docs and rename import pc -> pacsv Will Beason 2025-06-17 11:46:16 -0500
  • 586ae85c65 Conform to 3.9 union type formatting Will Beason 2025-06-17 11:41:46 -0500
  • 390499dd90 Pin to python 3.9 Will Beason 2025-06-17 11:37:20 -0500
  • 84d464ea38 Remove unnecessary re-conversion to list(revs) Will Beason 2025-06-17 11:23:24 -0500
  • 3e8ae205e8 Factor out revision mutation logic into its own function Will Beason 2025-06-17 11:02:45 -0500
  • 8c707f5ef3 Remove unused code Will Beason 2025-06-03 17:20:05 -0500
  • b50c51a215 Get regex working Will Beason 2025-06-03 16:02:18 -0500
  • 89465b29f4 Re-add special case where revert radius is zero Will Beason 2025-06-03 15:18:21 -0500
  • 17c7f208ab Add collapsed_revs back Will Beason 2025-06-03 15:08:57 -0500
  • 123b9a18a8 Fix revert column behavior Will Beason 2025-06-03 15:03:33 -0500
  • 06a784ef27 Get columnar refactor partially working Will Beason 2025-06-03 12:51:31 -0500
  • 8b0f775610 Begin move to columnar types Will Beason 2025-06-03 08:52:57 -0500
  • f916af9836 Allow specifying output file basename instead of just directory Will Beason 2025-06-02 14:13:13 -0500
  • 9ee5ecfc91 Separate revision iteration and field collation logic Will Beason 2025-05-30 14:09:16 -0500
  • f9383440a0 Fix tests Will Beason 2025-05-30 13:56:31 -0500
  • 032fec3198 Remove unnecessary urlencode tests Will Beason 2025-05-30 13:20:10 -0500
  • 0d56267ae0 Get parquet libraries writing files Will Beason 2025-05-30 13:06:26 -0500