Commit Graph

161 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
3a480940e9 fix error logging. 2025-08-07 09:38:42 -07:00
Nathan TeBlunthuis
a1f94078c4 improve style. 2025-08-07 09:20:49 -07:00
Nathan TeBlunthuis
329d682f4c fix asyncio bug. 2025-08-07 09:10:16 -07:00
Nathan TeBlunthuis
19f67b3679 try fixing coro issue. 2025-08-07 08:58:45 -07:00
Nathan TeBlunthuis
9b3237014d fix a couple possible bugs. 2025-08-05 23:20:04 -07:00
Nathan TeBlunthuis
bd8c30d80f fix raising exception. 2025-08-04 07:57:31 -07:00
Nathan TeBlunthuis
adf02310ef bugfix 2025-08-03 21:54:41 -07:00
Nathan TeBlunthuis
a563eaf6fc Timeout diffs. 2025-08-03 20:04:51 -07:00
Nathan TeBlunthuis
730c678f51 disable cache limits. 2025-08-03 11:50:57 -07:00
Nathan TeBlunthuis
77f367d95e Revert changes related to row-buffering to just "increase cache size."
This reverts commit 1f08c01cf1.
2025-08-03 09:37:35 -07:00
Nathan TeBlunthuis
1f08c01cf1 increase cache size. 2025-08-03 09:24:35 -07:00
Nathan TeBlunthuis
2f853a879d reduce memory a tich more. 2025-08-01 20:10:38 -07:00
Nathan TeBlunthuis
9799919470 reduce memory even more. 2025-08-01 19:59:36 -07:00
Nathan TeBlunthuis
7528dc8b8e try reducing memory more. 2025-08-01 19:52:18 -07:00
Nathan TeBlunthuis
615d630ff0 reduce memory usage. 2025-08-01 19:45:21 -07:00
Nathan TeBlunthuis
32bc05ddfd set cache limits. 2025-08-01 19:30:46 -07:00
Nathan TeBlunthuis
ef78310580 reduce cache more. 2025-08-01 19:25:50 -07:00
Nathan TeBlunthuis
40a92d2db6 reduce wd2 cache size 2025-08-01 19:18:26 -07:00
Nathan TeBlunthuis
6bec0de9b2 configure wikidiff2. 2025-08-01 18:53:07 -07:00
Nathan TeBlunthuis
54e996b910 configure pywikidiff2 cache limits. 2025-08-01 09:24:54 -07:00
Nathan TeBlunthuis
83c92d1a37 decrease moved paragraph detection cutoff to see if that fixes memory issue. 2025-07-22 13:29:01 -07:00
Nathan TeBlunthuis
076df15740 force garbage collection. 2025-07-22 13:13:18 -07:00
Nathan TeBlunthuis
6557e25af7 make a new pywikidiff2 object for each revision to reduce memory. 2025-07-22 09:50:30 -07:00
Nathan TeBlunthuis
d20075b323 add memray for debugging memory usage. 2025-07-17 15:17:23 -07:00
Nathan TeBlunthuis
6d03cac28d decrease batch_size. 2025-07-15 19:37:26 -07:00
Nathan TeBlunthuis
3a44cfd4da increase batch size. 2025-07-15 19:09:36 -07:00
Nathan TeBlunthuis
0fbe788e31 use ichunked instead of chunked. 2025-07-15 18:25:44 -07:00
Nathan TeBlunthuis
37d095199a inc version. 2025-07-15 15:37:55 -07:00
Nathan TeBlunthuis
6b04791de2 reduce batch size. 2025-07-15 15:31:00 -07:00
Nathan TeBlunthuis
507335941d Revert "Merge branch 'compute-diffs' into HEAD"
This reverts commit 907a35323e, reversing
changes made to c40506137b.
2025-07-15 15:23:50 -07:00
Nathan TeBlunthuis
907a35323e Merge branch 'compute-diffs' into HEAD 2025-07-15 15:23:13 -07:00
Nathan TeBlunthuis
c40506137b make wikiq memory efficient again via batch processing. 2025-07-15 15:20:17 -07:00
Nathan TeBlunthuis
e53e7ada5d try fixing the memory problem. 2025-07-14 18:58:27 -07:00
Nathan TeBlunthuis
76d54ae597 support partitioning output parquet by namespace. 2025-07-07 20:58:43 -07:00
Nathan TeBlunthuis
c9fb94ccc0 fix tests. 2025-07-07 20:25:00 -07:00
Nathan TeBlunthuis
ac1dd47b08 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs 2025-07-07 20:16:38 -07:00
Nathan TeBlunthuis
c597a6b7f4 refactor into src-layout package. 2025-07-07 20:14:13 -07:00
Nathan TeBlunthuis
a2984bc656 refactor into src-layout package. 2025-07-07 20:13:17 -07:00
Nathan TeBlunthuis
56c90fe1cc add missing files + add sorted_columns metadata. 2025-07-07 19:08:31 -07:00
Nathan TeBlunthuis
d6c4c0a416 add (optional) diff and text columns to output. 2025-07-07 14:39:52 -07:00
Nathan TeBlunthuis
a8e9e7f4fd wikidiff2 integration: pwr complete.
test for pwr based on wikidiff2.
2025-07-07 12:18:22 -07:00
Nathan TeBlunthuis
58c595bf0b add test files. 2025-07-07 11:29:10 -07:00
Nathan TeBlunthuis
cc96bb5f3f remove server. 2025-07-07 11:21:28 -07:00
Nathan TeBlunthuis
14e819e565 compare pywikidiff2 to making requests to wikidiff2. 2025-07-07 10:51:11 -07:00
Nathan TeBlunthuis
4654911533 almost there. working out edge cases. 2025-07-03 21:32:44 -07:00
Nathan TeBlunthuis
cf1fb61a84 WIP: fixing bugs and adding newlines to output. 2025-07-02 13:31:32 -07:00
Nathan TeBlunthuis
c4acc711d2 finish support for paragraph move. 2025-07-01 11:19:00 -07:00
Nathan TeBlunthuis
20de5b93f9 Merge branch 'tmp' into compute-diffs 2025-06-30 20:52:23 -05:00
Nathan TeBlunthuis
37734ed092 add test. 2025-06-30 15:45:56 -07:00
Nathan TeBlunthuis
5a3e4102b5 got wikidiff2 persistence working except for paragraph moves. 2025-06-30 15:37:54 -07:00