Commit Graph

66 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
6b4f3939a5 more work on resuming. 2025-12-10 21:07:52 -08:00
Nathan TeBlunthuis
c3d31b4ab5 handle case when we have a valid resume file, but a corrupted original. 2025-12-10 20:33:04 -08:00
Nathan TeBlunthuis
f427291fd8 add logic for resuming after a resume. 2025-12-10 19:26:54 -08:00
Nathan TeBlunthuis
577ddc87f5 Add per-namespace resume support for partitioned parquet output.
- Implement per-namespace resume points (dict mapping namespace -> (pageid, revid))
  to correctly handle interleaved dump ordering in partitioned output
- Extract resume functionality to dedicated resume.py module
- Add graceful shutdown handling via shutdown_requested flag (CLI-level only)
- Use lazy ParquetWriter creation to avoid empty files on early exit
- Refactor writing logic to _write_batch() helper method
- Simplify control flow by replacing continue statements with should_write flag
2025-12-06 06:56:19 -08:00
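The per-namespace resume points described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in resume.py: the function name `revisions_to_write`, the tuple shapes, and the exact-match rule for detecting the resume point are all assumptions. Only the dict mapping namespace -> (pageid, revid) and the `should_write` flag come from the commit message.

```python
from typing import Dict, Iterable, Iterator, Set, Tuple

# Hypothetical shapes, for illustration only.
ResumePoint = Tuple[int, int]    # (pageid, revid) of the last row already written
Revision = Tuple[int, int, int]  # (namespace, pageid, revid)


def revisions_to_write(
    revisions: Iterable[Revision],
    resume_points: Dict[int, ResumePoint],
) -> Iterator[Revision]:
    """Yield only the revisions that come after each namespace's resume point.

    Assumes that, within a single namespace, the dump replays revisions in the
    same order as the interrupted run. Namespaces may interleave freely, which
    is why one global resume point would not be enough.
    """
    reached: Set[int] = set()  # namespaces whose resume point has been passed
    for namespace, pageid, revid in revisions:
        should_write = True
        if namespace in resume_points and namespace not in reached:
            # Still at or before the resume point: this row was already written.
            should_write = False
            if (pageid, revid) == resume_points[namespace]:
                reached.add(namespace)
        if should_write:
            yield (namespace, pageid, revid)
```

A namespace with no entry in `resume_points` has no existing output in this sketch, so all of its revisions are written.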
Nathan TeBlunthuis
d69d8b0df2 fix baseline output for new columns. 2025-12-02 19:22:08 -08:00
Nathan TeBlunthuis
5ce9808b50 add templates and headings to wikiq. 2025-12-02 17:51:08 -08:00
Nathan TeBlunthuis
d3517ed5ca extract wikilinks. 2025-12-02 14:09:29 -08:00
Nathan TeBlunthuis
329341efb6 improve tests. 2025-12-02 13:52:12 -08:00
Nathan TeBlunthuis
76626a2785 Start working on adding columns from mwparserfromhell. 2025-12-02 12:26:03 -08:00
Nathan TeBlunthuis
b46f98a875 make --resume work with partitioned namespaces. 2025-12-01 07:19:52 -08:00
Nathan TeBlunthuis
3c26185739 enable --resuming from interrupted jobs. 2025-11-30 20:36:31 -08:00
Nathan TeBlunthuis
c40506137b make wikiq memory efficient again via batch processing. 2025-07-15 15:20:17 -07:00
Nathan TeBlunthuis
76d54ae597 support partitioning output parquet by namespace. 2025-07-07 20:58:43 -07:00
Nathan TeBlunthuis
c9fb94ccc0 fix tests. 2025-07-07 20:25:00 -07:00
Nathan TeBlunthuis
c597a6b7f4 refactor into src-layout package. 2025-07-07 20:14:13 -07:00
Nathan TeBlunthuis
56c90fe1cc add missing files + add sorted_columns metadata. 2025-07-07 19:08:31 -07:00
Nathan TeBlunthuis
d6c4c0a416 add (optional) diff and text columns to output. 2025-07-07 14:39:52 -07:00
Nathan TeBlunthuis
a8e9e7f4fd wikidiff2 integration: pwr complete.
test for pwr based on wikidiff2.
2025-07-07 12:18:22 -07:00
Nathan TeBlunthuis
58c595bf0b add test files. 2025-07-07 11:29:10 -07:00
Nathan TeBlunthuis
cc96bb5f3f remove server. 2025-07-07 11:21:28 -07:00
Nathan TeBlunthuis
14e819e565 compare pywikidiff2 to making requests to wikidiff2. 2025-07-07 10:51:11 -07:00
Nathan TeBlunthuis
4654911533 almost there. working out edge cases. 2025-07-03 21:32:44 -07:00
Nathan TeBlunthuis
cf1fb61a84 WIP: fixing bugs and adding newlines to output. 2025-07-02 13:31:32 -07:00
Nathan TeBlunthuis
c4acc711d2 finish support for paragraph move. 2025-07-01 11:19:00 -07:00
Nathan TeBlunthuis
37734ed092 add test. 2025-06-30 15:45:56 -07:00
Will Beason
bc7f186112 Start interoperability between wikidiff2 and deltas
The big challenges here (and remaining) are as follows:

1. Deltas requires changes to be given at the token level,
whereas wikidiff2 reports changes at the byte level. Thus,
it is often required to tokenize sequences of text to convert
to the desired token indices. As-is this is done inefficiently,
often requiring re-tokenization of previously-tokenized sequences.
A better implementation would incrementally tokenize, or
automatically find the referenced sequences.

2. Deltas only allows for Equal/Insert/Delete operations,
while wikidiff2 also detects paragraph moves. These paragraph
moves are NOT equivalent to Equal, as the moved paragraphs
are not guaranteed to be equivalent, just very similar.
Wikidiff2 does not report changes to moved paragraphs, so
to preserve token persistence, a difference algorithm
would need to be performed on the before/after sequences.
A stopgap (currently implemented) is to turn these
into strict deletions/insertions.

3. There appears to be a lot of memory consumption, and
sometimes this results in memory overflow. I am unsure
if this is a memory leak or simply that re-tokenizing
causes significant enough memory throughput that
my machine can't handle it.

4. Deltas expects all tokens in the before/after text to
be covered by segment ranges of Equal/Insert/Delete, but
wikidiff2 does not appear to ever emit any Equal ranges,
instead skipping them. These ranges must be computed
and inserted in sequence. As-is the code does not correctly
handle unchanged text at the end of pages.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-26 16:08:50 -05:00
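Challenge 4 above (inserting the Equal ranges that wikidiff2 skips, including unchanged text at the end of a page) can be sketched as below. This is an illustration under stated assumptions, not the code from this commit: the operation tuples `(name, a_start, a_end, b_start, b_end)` are a hypothetical shape using end-exclusive token indices, they are not the deltas library's own classes, and the input is assumed to be already converted to token indices and sorted by position.

```python
from typing import List, Tuple

# Hypothetical operation shape: (name, a_start, a_end, b_start, b_end),
# where a_* index tokens in the "before" text and b_* in the "after" text.
Op = Tuple[str, int, int, int, int]


def fill_equal_ranges(changes: List[Op], a_len: int, b_len: int) -> List[Op]:
    """Insert "equal" operations into the gaps between insert/delete operations.

    `changes` holds only "insert" and "delete" operations, sorted by position.
    The result covers every token of both texts exactly once.
    """
    ops: List[Op] = []
    a_pos = b_pos = 0  # first token not yet covered in each text
    for name, a1, a2, b1, b2 in changes:
        if a1 - a_pos != b1 - b_pos:
            raise ValueError("unchanged gap differs between before and after")
        if a1 > a_pos:
            ops.append(("equal", a_pos, a1, b_pos, b1))
        ops.append((name, a1, a2, b1, b2))
        a_pos, b_pos = a2, b2
    # Unchanged text after the last change must be covered as well.
    if a_len - a_pos != b_len - b_pos:
        raise ValueError("trailing unchanged text differs in length")
    if a_len > a_pos:
        ops.append(("equal", a_pos, a_len, b_pos, b_len))
    return ops
```

Under the stopgap described in challenge 2, a paragraph move would reach this function as one "delete" and one "insert" rather than as a distinct operation.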
Will Beason
96915a074b Add call to compute diffs via local PHP server
This is inefficient as it requires an individual request per diff.

Going to try collecting the revision texts to reduce communication
overhead.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 13:09:27 -05:00
Will Beason
0d9ab003f0 Fix tests for new field
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 12:44:07 -05:00
Will Beason
586ae85c65 Conform to 3.9 union type formatting
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:41:46 -05:00
Will Beason
123b9a18a8 Fix revert column behavior
Now all columns are tested in the parquet test.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:03:33 -05:00
Will Beason
06a784ef27 Get columnar refactor partially working
Noargs works, now to do persistence.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 12:51:31 -05:00
Will Beason
f916af9836 Allow specifying output file basename instead of just directory
This is optional, and doesn't impact existing users as preexisting
behavior when users specify an output directory is unchanged.

This makes tests not need to copy large files as part of their
execution, as they can ask for files to be written to explicit
locations.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-02 14:13:13 -05:00
Will Beason
f9383440a0 Fix tests
Surprisingly, replacing list<str> with str doesn't break anything,
even baselines.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:56:31 -05:00
Will Beason
032fec3198 Remove unnecessary urlencode tests
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:20:10 -05:00
Will Beason
aec6e5fafa Refactor collapse user logic
Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:20:34 -05:00
Will Beason
c0e629a313 Add ability to disable revert detection
Also add test to ensure functionality works.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 11:59:10 -05:00
Will Beason
9009bb6fa4 Merge branch 'parquet_support' into test-parquet 2025-05-29 10:21:30 -05:00
Nathan TeBlunthuis
2a2b611d79 Fix issue with .7z archives
Before, only fandom wiki dumps were compressed with .7z.
These archives can have several .xml files in the .7z, not just one.
So we need to have a flag for the fandom-2020 dumps.

This fixes the bug so .7z archives work in either case.
2025-05-28 21:49:11 -07:00
Will Beason
ee01ce3e61 Get Parquet test working
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 16:48:58 -05:00
Will Beason
52757a8239 Add noargs test for ikwiki
This way we can ensure that the parquet code outputs equivalent output.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 15:04:10 -05:00
Will Beason
3f94144b1b Begin adding test for parquet export
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings - previously the code used the empty
string to denote anonymous editors.

Tests failing. Don't merge yet.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 13:17:30 -05:00
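The type issue described in this commit can be illustrated with a small normalization step. This is a sketch, not wikiq's actual code: the function name `normalize_editor_id` is hypothetical, and no Parquet library is called here. The idea is only that anonymous editors become None instead of the empty string, so the column holds nothing but integers and nulls.

```python
from typing import List, Optional, Union


def normalize_editor_id(raw: Union[int, str, None]) -> Optional[int]:
    """Map an editor id to int, using None (not "") for anonymous editors."""
    if raw is None or raw == "":
        return None
    return int(raw)


def normalize_editor_ids(raw_ids: List[Union[int, str, None]]) -> List[Optional[int]]:
    """Normalize a whole column so it can be stored as nullable int64."""
    return [normalize_editor_id(raw) for raw in raw_ids]
```

For example, `normalize_editor_ids([12, "", "34", None])` gives a list containing only integers and None.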
Will Beason
df0ad1de63 Finish test standardization
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 10:11:58 -05:00
Will Beason
f3e6cc9392 Begin refactor of tests to make new tests easier to write
Handle file naming logic centrally rather than requiring a dedicated
class per input file.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 09:11:36 -05:00
Will Beason
c8b14c3303 Refactor test temporary file logic and wikiq call pattern
Test file refreshing and path computation is now handled by a helper.

The wikiq command is now constructed and handled by a single method
rather than in several ad-hoc ways.

The last places relying on the working directory are now removed.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 16:24:07 -05:00
Will Beason
4d3900b541 Standardize calling for wikiq in tests
This way failures show the output of stderr/etc.

Also create path constant strings for use in tests to avoid repetition
and make changes easier.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 14:27:49 -05:00
Will Beason
ebc57864f2 Make tests runnable from anywhere
Tests no longer implicitly require that the caller be in
a specific working directory.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 13:40:57 -05:00
Will Beason
3d0bf89938 Move main logic to main()
This avoids:
1) Running the main function when sourcing the file
2) Creating many globally-scoped variables in the main logic

Also begin refactor of test output file logic

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 11:10:42 -05:00
Will Beason
6d133575c7 Remove resource leaks from tests
Close subprocesses within tests to fix resource leak warning.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:08:47 -05:00
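One common way to avoid this kind of resource leak warning is to use `subprocess.Popen` as a context manager, which closes the pipes and waits for the process on exit. The sketch below is an assumption about the general technique, not the change made in this commit, and the helper name `run_and_capture` is hypothetical.

```python
import subprocess
from typing import List, Tuple


def run_and_capture(cmd: List[str]) -> Tuple[int, str, str]:
    """Run a command and return (returncode, stdout, stderr).

    The with-block closes the stdout/stderr pipes and waits for the child
    process, so no open file handles outlive the call.
    """
    with subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    ) as proc:
        stdout, stderr = proc.communicate()
    return proc.returncode, stdout, stderr
```

Returning stderr alongside stdout also lets a failing test show what the command printed.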
Will Beason
09a84e7d11 Reformat Wikiq_Unit_Test.py
Separate out reformatting from editing.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:07:39 -05:00
Will Beason
4804ecc4b3 Add additional test dependencies
These are now noted in requirements.txt

Also make dependency on 7zip and ffmpeg explicit in README

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 12:29:49 -05:00