Nathan TeBlunthuis
d1fc094c96
don't put checkpoint files inside namespace directories.
2025-12-07 06:24:04 -08:00
Nathan TeBlunthuis
783f5fd8bc
improve resume logic.
2025-12-07 06:06:26 -08:00
Nathan TeBlunthuis
577ddc87f5
Add per-namespace resume support for partitioned parquet output.
- Implement per-namespace resume points (dict mapping namespace -> (pageid, revid))
  to correctly handle interleaved dump ordering in partitioned output
- Extract resume functionality to dedicated resume.py module
- Add graceful shutdown handling via shutdown_requested flag (CLI-level only)
- Use lazy ParquetWriter creation to avoid empty files on early exit
- Refactor writing logic to _write_batch() helper method
- Simplify control flow by replacing continue statements with should_write flag
2025-12-06 06:56:19 -08:00
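The per-namespace resume points and the should_write flag described in this commit can be sketched as follows. This is a minimal illustration, not the code in resume.py: the names `ResumePoint`, `Revision`, `should_write`, and `filter_resumed` are hypothetical, and the sketch assumes that within a single namespace revisions arrive ordered by (pageid, revid). Because namespaces are interleaved in the dump, each partition may have stopped at a different position, so one global resume point would skip or duplicate rows in some partitions.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical types for illustration only.
ResumePoint = Tuple[int, int]    # last (pageid, revid) written for a namespace
Revision = Tuple[int, int, int]  # (namespace, pageid, revid)


def should_write(rev: Revision, resume_points: Dict[int, ResumePoint]) -> bool:
    """Return True if this revision still needs to be written."""
    namespace, pageid, revid = rev
    point = resume_points.get(namespace)
    if point is None:
        # No output existed for this namespace before the interruption.
        return True
    # Assumption: within a namespace, revisions are ordered by (pageid, revid),
    # so anything at or before the resume point was already written.
    return (pageid, revid) > point


def filter_resumed(
    revisions: Iterable[Revision], resume_points: Dict[int, ResumePoint]
) -> Iterator[Revision]:
    """Yield only revisions that come after their namespace's resume point."""
    for rev in revisions:
        # A flag decides whether to write, instead of `continue` statements.
        write = should_write(rev, resume_points)
        if write:
            yield rev


if __name__ == "__main__":
    resume = {0: (10, 105), 1: (7, 50)}
    interleaved = [(0, 10, 100), (1, 7, 50), (0, 10, 110), (1, 8, 60), (2, 3, 1)]
    for rev in filter_resumed(interleaved, resume):
        print(rev)
```

In this sketch, namespace 2 has no resume point, so all of its revisions are written, while namespaces 0 and 1 each pick up just past their own last written row.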
Nathan TeBlunthuis
d69d8b0df2
fix baseline output for new columns.
2025-12-02 19:22:08 -08:00
Nathan TeBlunthuis
5ce9808b50
add templates and headings to wikiq.
2025-12-02 17:51:08 -08:00
Nathan TeBlunthuis
d3517ed5ca
extract wikilinks.
2025-12-02 14:09:29 -08:00
Nathan TeBlunthuis
329341efb6
improve tests.
2025-12-02 13:52:12 -08:00
Nathan TeBlunthuis
76626a2785
Start working on adding columns from mwparserfromhell.
2025-12-02 12:26:03 -08:00
Nathan TeBlunthuis
b46f98a875
make --resume work with partitioned namespaces.
2025-12-01 07:19:52 -08:00
Nathan TeBlunthuis
3c26185739
enable resuming from interrupted jobs with --resume.
2025-11-30 20:36:31 -08:00
Nathan TeBlunthuis
95b33123e3
revert previous and decrease timeout.
2025-11-28 20:29:51 -08:00
Nathan TeBlunthuis
5c4fc6d5a0
let cache capacity be large.
2025-11-28 19:22:43 -08:00
Nathan TeBlunthuis
77c7d2ba97
Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs
2025-11-24 11:03:24 -08:00
Nathan TeBlunthuis
c40930d7d2
use ssh for gitea.
2025-11-24 11:01:36 -08:00
Nathan TeBlunthuis
3b4c9c4441
index particular commit in pywikidiff2.
2025-11-18 20:17:52 -08:00
Nathan TeBlunthuis
3a480940e9
fix error logging.
2025-08-07 09:38:42 -07:00
Nathan TeBlunthuis
a1f94078c4
improve style.
2025-08-07 09:20:49 -07:00
Nathan TeBlunthuis
329d682f4c
fix asyncio bug.
2025-08-07 09:10:16 -07:00
Nathan TeBlunthuis
19f67b3679
try fixing coro issue.
2025-08-07 08:58:45 -07:00
Nathan TeBlunthuis
9b3237014d
fix a couple possible bugs.
2025-08-05 23:20:04 -07:00
Nathan TeBlunthuis
bd8c30d80f
fix raising exception.
2025-08-04 07:57:31 -07:00
Nathan TeBlunthuis
adf02310ef
bugfix
2025-08-03 21:54:41 -07:00
Nathan TeBlunthuis
a563eaf6fc
Timeout diffs.
2025-08-03 20:04:51 -07:00
Nathan TeBlunthuis
730c678f51
disable cache limits.
2025-08-03 11:50:57 -07:00
Nathan TeBlunthuis
77f367d95e
Revert changes related to row-buffering to just "increase cache size."
This reverts commit 1f08c01cf1.
2025-08-03 09:37:35 -07:00
Nathan TeBlunthuis
1f08c01cf1
increase cache size.
2025-08-03 09:24:35 -07:00
Nathan TeBlunthuis
2f853a879d
reduce memory a tich more.
2025-08-01 20:10:38 -07:00
Nathan TeBlunthuis
9799919470
reduce memory even more.
2025-08-01 19:59:36 -07:00
Nathan TeBlunthuis
7528dc8b8e
try reducing memory more.
2025-08-01 19:52:18 -07:00
Nathan TeBlunthuis
615d630ff0
reduce memory usage.
2025-08-01 19:45:21 -07:00
Nathan TeBlunthuis
32bc05ddfd
set cache limits.
2025-08-01 19:30:46 -07:00
Nathan TeBlunthuis
ef78310580
reduce cache more.
2025-08-01 19:25:50 -07:00
Nathan TeBlunthuis
40a92d2db6
reduce wd2 cache size
2025-08-01 19:18:26 -07:00
Nathan TeBlunthuis
6bec0de9b2
configure wikidiff2.
2025-08-01 18:53:07 -07:00
Nathan TeBlunthuis
54e996b910
configure pywikidiff2 cache limits.
2025-08-01 09:24:54 -07:00
Nathan TeBlunthuis
83c92d1a37
decrease moved paragraph detection cutoff to see if that fixes memory issue.
2025-07-22 13:29:01 -07:00
Nathan TeBlunthuis
076df15740
force garbage collection.
2025-07-22 13:13:18 -07:00
Nathan TeBlunthuis
6557e25af7
make a new pywikidiff2 object for each revision to reduce memory.
2025-07-22 09:50:30 -07:00
Nathan TeBlunthuis
d20075b323
add memray for debugging memory usage.
2025-07-17 15:17:23 -07:00
Nathan TeBlunthuis
6d03cac28d
decrease batch_size.
2025-07-15 19:37:26 -07:00
Nathan TeBlunthuis
3a44cfd4da
increase batch size.
2025-07-15 19:09:36 -07:00
Nathan TeBlunthuis
0fbe788e31
use ichunked instead of chunked.
2025-07-15 18:25:44 -07:00
Nathan TeBlunthuis
37d095199a
inc version.
2025-07-15 15:37:55 -07:00
Nathan TeBlunthuis
6b04791de2
reduce batch size.
2025-07-15 15:31:00 -07:00
Nathan TeBlunthuis
507335941d
Revert "Merge branch 'compute-diffs' into HEAD"
This reverts commit 907a35323e, reversing
changes made to c40506137b.
2025-07-15 15:23:50 -07:00
Nathan TeBlunthuis
907a35323e
Merge branch 'compute-diffs' into HEAD
2025-07-15 15:23:13 -07:00
Nathan TeBlunthuis
c40506137b
make wikiq memory efficient again via batch processing.
2025-07-15 15:20:17 -07:00
Nathan TeBlunthuis
e53e7ada5d
try fixing the memory problem.
2025-07-14 18:58:27 -07:00
Nathan TeBlunthuis
76d54ae597
support partitioning output parquet by namespace.
2025-07-07 20:58:43 -07:00
Nathan TeBlunthuis
c9fb94ccc0
fix tests.
2025-07-07 20:25:00 -07:00