Commit Graph

120 Commits

Author SHA1 Message Date
Nathan TeBlunthuis
58c595bf0b add test files. 2025-07-07 11:29:10 -07:00
Nathan TeBlunthuis
cc96bb5f3f remove server. 2025-07-07 11:21:28 -07:00
Nathan TeBlunthuis
14e819e565 compare pywikidiff2 to making requests to wikidiff2. 2025-07-07 10:51:11 -07:00
Nathan TeBlunthuis
4654911533 almost there. working out edge cases. 2025-07-03 21:32:44 -07:00
Nathan TeBlunthuis
cf1fb61a84 WIP: fixing bugs and adding newlines to output. 2025-07-02 13:31:32 -07:00
Nathan TeBlunthuis
c4acc711d2 finish support for paragraph move. 2025-07-01 11:19:00 -07:00
Nathan TeBlunthuis
20de5b93f9 Merge branch 'tmp' into compute-diffs 2025-06-30 20:52:23 -05:00
Nathan TeBlunthuis
37734ed092 add test. 2025-06-30 15:45:56 -07:00
Nathan TeBlunthuis
5a3e4102b5 got wikidiff2 persistence working except for paragraph moves. 2025-06-30 15:37:54 -07:00
Nathan TeBlunthuis
186cb82fb8 some work on wiki_diff_matcher.py 2025-06-27 07:13:41 -07:00
Will Beason
bc7f186112 Start interoperability between wikidiff2 and deltas
The big challenges here (and remaining) are as follows:

1. Deltas requires changes to be given at the token level,
whereas wikidiff2 reports changes at the byte level. Thus,
it is often required to tokenize sequences of text to convert
to the desired token indices. As-is this is done inefficiently,
often requiring re-tokenization of previously-tokenized sequences.
A better implementation would incrementally tokenize, or
automatically find the referenced sequences.

2. Deltas only allows for Equal/Insert/Delete operations,
while wikidiff2 also detects paragraph moves. These paragraph
moves are NOT equivalent to Equal, as the moved paragraphs
are not guaranteed to be equivalent, just very similar.
Wikidiff2 does not report changes to moved paragraphs, so
to preserve token persistence, a difference algorithm
would need to be performed on the before/after sequences.
A stopgap (currently implemented) is to turn these
into strict deletions/insertions.

3. There appears to be a lot of memory consumption, and
sometimes this results in memory overflow. I am unsure
if this is a memory leak or simply that re-tokenizing
causes significant enough memory throughput that
my machine can't handle it.

4. Deltas expects all tokens in the before/after text to
be covered by segment ranges of Equal/Insert/Delete, but
wikidiff2 does not appear to ever emit any Equal ranges,
instead skipping them. These ranges must be computed
and inserted in sequence. As-is the code does not correctly
handle unchanged text at the end of pages.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-26 16:08:50 -05:00
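Note on bc7f186112: challenges 1 and 4 in the message above can be illustrated with a short Python sketch. This is not the code from the commit. The function names, the tuple format for ranges, and the single-sequence view of ranges are all assumptions made for illustration (the real deltas operations track positions in both the before and after text). The first two helpers record each token's UTF-8 byte offset once so that later byte positions can be mapped to token indices without re-tokenizing; the third fills the gaps between reported changes with explicit Equal ranges, including unchanged text at the end of the page.

```python
from bisect import bisect_right


def token_byte_starts(tokens):
    """Return the UTF-8 byte offset at which each token starts.

    Assumes the tokens concatenate back to the original text.
    """
    starts, pos = [], 0
    for tok in tokens:
        starts.append(pos)
        pos += len(tok.encode("utf-8"))
    return starts


def byte_to_token_index(starts, byte_offset):
    """Map a byte offset to the index of the token containing it."""
    return bisect_right(starts, byte_offset) - 1


def fill_equal_ranges(changed, n_tokens):
    """Insert explicit 'equal' ranges around sorted, non-overlapping
    half-open (start, end) token ranges that were reported as changed.

    Hypothetical simplification: ranges refer to one sequence only.
    """
    out, cursor = [], 0
    for start, end in changed:
        if start > cursor:
            out.append(("equal", cursor, start))
        out.append(("changed", start, end))
        cursor = end
    if cursor < n_tokens:
        # Covers unchanged text at the end of the page.
        out.append(("equal", cursor, n_tokens))
    return out


if __name__ == "__main__":
    tokens = ["héllo", " ", "world"]
    starts = token_byte_starts(tokens)
    print(starts)
    print(byte_to_token_index(starts, 6))
    print(fill_equal_ranges([(2, 4), (6, 7)], 10))
```

Computing the offsets once per revision and looking them up with a binary search is one way to avoid the repeated tokenization described in challenge 1; whether it fits the actual wikidiff2 output format would need to be checked against that output.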
Will Beason
1ec8bfaad4 Add php.ini file
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-24 09:24:35 -05:00
Will Beason
94454ffca3 Add PHP server file
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 14:17:53 -05:00
Will Beason
62db384aa4 Pass arrays of diffs instead of incremental
This is 3.5x faster

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 14:17:01 -05:00
Will Beason
96915a074b Add call to compute diffs via local PHP server
This is inefficient as it requires an individual request per diff.

Going to try collecting the revision texts to reduce communication
overhead.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 13:09:27 -05:00
Will Beason
0d9ab003f0 Fix tests for new field
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 12:44:07 -05:00
Will Beason
4bbed4a196 Merge branch 'parquet_support' into test-parquet 2025-06-17 12:20:19 -05:00
Will Beason
11d2587471 Add docs and rename import pc -> pacsv
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:46:16 -05:00
Will Beason
586ae85c65 Conform to 3.9 union type formatting
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:41:46 -05:00
Will Beason
390499dd90 Pin to python 3.9
Since our execution environment requires this

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:37:20 -05:00
Will Beason
84d464ea38 Remove unnecessary re-conversion to list(revs)
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:23:24 -05:00
Will Beason
3e8ae205e8 Factor out revision mutation logic into its own function
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:02:45 -05:00
Will Beason
8c707f5ef3 Remove unused code
This should help PR readability.

There is likely still some unused code, but that should be the
bulk of it.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 17:20:05 -05:00
Will Beason
b50c51a215 Get regex working
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 16:02:18 -05:00
Will Beason
89465b29f4 Re-add special case where revert radius is zero
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:18:21 -05:00
Will Beason
17c7f208ab Add collapsed_revs back
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:08:57 -05:00
Will Beason
123b9a18a8 Fix revert column behavior
Now all columns are tested in the parquet test.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:03:33 -05:00
Will Beason
06a784ef27 Get columnar refactor partially working
Noargs works, now to do persistence.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 12:51:31 -05:00
Will Beason
8b0f775610 Begin move to columnar types
This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 08:52:57 -05:00
Will Beason
f916af9836 Allow specifying output file basename instead of just directory
This is optional, and doesn't impact existing users as preexisting
behavior when users specify an output directory is unchanged.

This makes tests not need to copy large files as part of their
execution, as they can ask files to be written to explicit
locations.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-02 14:13:13 -05:00
Will Beason
9ee5ecfc91 Separate revision iteration and field collation logic
This way we're not adding temporary fields to objects that don't
normally have these fields.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 14:09:16 -05:00
Will Beason
f9383440a0 Fix tests
Surprisingly replacing list<str> with str doesn't break anything,
even baselines.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:56:31 -05:00
Will Beason
032fec3198 Remove unnecessary urlencode tests
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:20:10 -05:00
Will Beason
0d56267ae0 Get parquet libraries writing files
Tests broken due to url encoding, which can likely now be removed.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:06:26 -05:00
Nathan TeBlunthuis
260e2b177c fix order of fields. 2025-05-29 18:32:16 -07:00
Nathan TeBlunthuis
a13d7f1deb typo fix. 2025-05-29 18:25:08 -07:00
Nathan TeBlunthuis
ffbd180001 make editorid null not '' in parquet. 2025-05-29 18:24:33 -07:00
Nathan TeBlunthuis
606a399450 handle empty comments which are 'False' somehow. 2025-05-29 18:14:58 -07:00
Nathan TeBlunthuis
a9f76a0f62 change order of fields. 2025-05-29 18:10:59 -07:00
Nathan TeBlunthuis
f39ceefa4a Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support 2025-05-29 18:05:28 -07:00
Nathan TeBlunthuis
13ee160708 bugfix. 2025-05-29 18:04:41 -07:00
Nathan TeBlunthuis
bd22d26291 update deps and add edit_summary to wikiq output. 2025-05-29 18:02:14 -07:00
Will Beason
4dde25c508 Refactor revision logic to make more straightforward
Use groupby so we don't have to deal with edge cases and compare
revisions directly.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:46:45 -05:00
Will Beason
aec6e5fafa Refactor collapse user logic
Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:20:34 -05:00
Will Beason
c0e629a313 Add ability to disable revert detection
Also add test to ensure functionality works.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 11:59:10 -05:00
Will Beason
9009bb6fa4 Merge branch 'parquet_support' into test-parquet 2025-05-29 10:21:30 -05:00
Will Beason
ab280dd765 Remove requirements.txt and add uv.lock to ignored files.
We can choose to check in uv.lock later if we want.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 10:05:49 -05:00
Nathan TeBlunthuis
22d14dc5f2 Remove dependency on pytest. 2025-05-28 21:54:31 -07:00
Nathan TeBlunthuis
5a10f59dc4 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support 2025-05-28 23:52:59 -05:00
Nathan TeBlunthuis
b8cdc82fc2 add ipython for dev 2025-05-28 23:52:37 -05:00