The big challenges here (which remain unresolved) are as follows:
1. Deltas requires changes to be given at the token level,
whereas wikidiff2 reports changes at the byte level. Thus,
sequences of text often need to be tokenized to convert byte
offsets into the desired token indices. As-is this is done
inefficiently, often re-tokenizing previously-tokenized
sequences. A better implementation would tokenize
incrementally, or automatically find the referenced sequences.
2. Deltas only allows Equal/Insert/Delete operations, while
wikidiff2 also detects paragraph moves. These paragraph moves
are NOT equivalent to Equal, as the moved paragraphs are not
guaranteed to be identical, just very similar. Wikidiff2 does
not report changes within moved paragraphs, so to preserve
token persistence, a difference algorithm would need to be run
on the before/after sequences. A stopgap (currently
implemented) is to turn these moves into strict
deletions/insertions.
3. Memory consumption appears to be high, and sometimes this
results in running out of memory. I am unsure whether this is
a memory leak or simply that re-tokenizing generates enough
memory churn that my machine can't handle it.
4. Deltas expects all tokens in the before/after text to be
covered by Equal/Insert/Delete segment ranges, but wikidiff2
does not appear to ever emit Equal ranges, instead skipping
them. These ranges must be computed and inserted in sequence
(see the sketch after this list). As-is the code does not
correctly handle unchanged text at the end of pages.
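For item 4, here is a minimal sketch of the gap-filling step,
assuming wikidiff2's changes have already been converted to
token-index ranges. The Segment tuple and fill_equal_segments
helper are illustrative stand-ins, not part of deltas or wikiq:

    from typing import List, NamedTuple


    class Segment(NamedTuple):
        """Stand-in for a deltas-style operation over token ranges."""
        op: str  # "equal", "insert", or "delete"
        a1: int  # start token index in the "before" text
        a2: int  # end token index (exclusive) in the "before" text
        b1: int  # start token index in the "after" text
        b2: int  # end token index (exclusive) in the "after" text


    def fill_equal_segments(changes: List[Segment], a_len: int,
                            b_len: int) -> List[Segment]:
        """Add Equal segments for the spans wikidiff2 skips,
        including unchanged text at the end of the page."""
        result = []
        a_pos, b_pos = 0, 0
        for change in changes:  # must be sorted by position
            # Tokens between the previous change and this one are unchanged.
            if change.a1 > a_pos or change.b1 > b_pos:
                result.append(Segment("equal", a_pos, change.a1, b_pos, change.b1))
            result.append(change)
            a_pos, b_pos = change.a2, change.b2
        # Cover unchanged text at the end of the page.
        if a_pos < a_len or b_pos < b_len:
            result.append(Segment("equal", a_pos, a_len, b_pos, b_len))
        return result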
Signed-off-by: Will Beason <willbeason@gmail.com>
This is inefficient, as it requires an individual request per
diff. Going to try collecting the revision texts to reduce
communication overhead.
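A minimal sketch of the batching idea; the batch size and the
way a batch would be turned into a single request are
assumptions, not the final design:

    from itertools import islice
    from typing import Iterable, Iterator, List, TypeVar

    T = TypeVar("T")


    def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
        """Yield successive batches of up to `size` items."""
        it = iter(items)
        while batch := list(islice(it, size)):
            yield batch


    # One request would then fetch the texts for a whole batch of
    # revisions instead of one request per diff. The ids are made up.
    for batch in chunked(range(1000, 1010), 4):
        print(batch)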
Signed-off-by: Will Beason <willbeason@gmail.com>
This is optional and doesn't impact existing users, as the
preexisting behavior when users specify an output directory is
unchanged. This means tests no longer need to copy large files
as part of their execution; they can instead ask for files to
be written to explicit locations.
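A rough sketch of the idea: an optional, explicit output
location that falls back to the directory-based behavior when
not given. The flag names here are assumptions, not wikiq's
actual CLI:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("-o", "--output-dir", default=".",
                        help="Directory for output (preexisting behavior).")
    # Hypothetical new optional flag: an explicit output file location.
    parser.add_argument("--output-file", default=None,
                        help="Exact path to write output to; overrides the "
                             "directory-based default.")
    args = parser.parse_args(["--output-file", "/tmp/wikiq-test/out.tsv"])

    if args.output_file is None:
        # Old behavior: derive the filename inside the output directory.
        args.output_file = os.path.join(args.output_dir, "dump.tsv")
    print(args.output_file)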
Signed-off-by: Will Beason <willbeason@gmail.com>
Use a simple loop for when we aren't collapsing users.
Add a test which covers the case where users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
Before, only fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps. This fixes the
bug so .7z archives work in either case.
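This commit doesn't say how the archives are read; as one
possible sketch, using py7zr (an assumption) to handle a .7z
with several .xml members:

    import py7zr

    ARCHIVE = "pages.7z"  # placeholder path for illustration

    with py7zr.SevenZipFile(ARCHIVE, mode="r") as archive:
        xml_members = [n for n in archive.getnames() if n.endswith(".xml")]
        # read() maps member names to BytesIO objects. For full-size dumps
        # you would stream or extract instead of reading into memory.
        contents = archive.read(targets=xml_members)

    for name, data in contents.items():
        print(name, len(data.getvalue()), "bytes")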
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
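A sketch of what that comparison might look like with pandas;
the specific "smoothing" shown here (casting shared columns to
strings) is an assumption about what the test needs:

    import pandas as pd
    from pandas.testing import assert_frame_equal

    # Placeholder paths for illustration.
    tsv_df = pd.read_table("output.tsv")
    parquet_df = pd.read_parquet("output.parquet")

    # Align column order and index, then cast to a common dtype so that,
    # e.g., int64-vs-Int64 or NaN-vs-None differences don't fail the check.
    parquet_df = parquet_df[list(tsv_df.columns)]
    tsv_df = tsv_df.astype(str).reset_index(drop=True)
    parquet_df = parquet_df.astype(str).reset_index(drop=True)

    assert_frame_equal(tsv_df, parquet_df)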
Signed-off-by: Will Beason <willbeason@gmail.com>
Changed the logic for handling anonymous edits so that wikiq
handles the type for editor ids consistently. Parquet can mix
int64 and None, but not int64 and strings; previously the code
used the empty string to denote anonymous editors.
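A minimal sketch of the resulting convention, using pandas'
nullable Int64 dtype; whether wikiq builds the column exactly
this way is an assumption, and the ids are made up:

    import pandas as pd

    # None (not "") marks anonymous editors; the ids are illustrative.
    editor_ids = [12345, None, 67890, None]

    # The nullable Int64 dtype keeps the column integer-typed while
    # allowing missing values, which Parquet stores as int64 plus nulls.
    editors = pd.Series(editor_ids, dtype="Int64", name="editor_id")
    editors.to_frame().to_parquet("/tmp/editors.parquet")
    print(editors)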
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com>
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
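Roughly, the split looks like this; everything below other
than the two class names is an illustrative assumption:

    import os
    import unittest


    class WikiqTester:
        """Creates and manages the variables a test needs."""

        def __init__(self, wiki: str, suffix: str):
            # Illustrative layout, not wikiq's actual test directories.
            self.input_file = os.path.join("dumps", f"{wiki}.xml.bz2")
            self.output_dir = os.path.join("test_output", suffix)
            os.makedirs(self.output_dir, exist_ok=True)


    class WikiqTestCase(unittest.TestCase):
        """Holds the test logic itself; setup is delegated to WikiqTester."""

        def test_output_dir_created(self):
            tester = WikiqTester(wiki="examplewiki", suffix="basic")
            self.assertTrue(os.path.isdir(tester.output_dir))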
Signed-off-by: Will Beason <willbeason@gmail.com>
Test file refreshing and path computation are now handled by
a helper. The wikiq command is now constructed and run by a
single method rather than in several ad-hoc ways. The last
places relying on the working directory have been removed.
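A sketch of what that single command-building method might
look like; the flag names and layout are assumptions:

    import subprocess
    import sys
    from typing import Sequence


    def run_wikiq(input_file: str, output_dir: str,
                  extra_args: Sequence[str] = ()) -> subprocess.CompletedProcess:
        """Construct and run the wikiq command in one place, instead of
        building it ad hoc at each call site."""
        command = [sys.executable, "wikiq", input_file, "-o", output_dir,
                   *extra_args]
        # Capture output so callers can report stderr on failure.
        return subprocess.run(command, capture_output=True, text=True)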
Signed-off-by: Will Beason <willbeason@gmail.com>
This way, failures show the output of stderr, etc. Also
create path constants for use in tests to avoid repetition and
make changes easier.
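For example (with illustrative constants and a stand-in
command in place of the real wikiq invocation):

    import subprocess
    import unittest

    # Shared path constants avoid repeating literal paths across tests.
    TEST_DUMP = "test/dumps/example.xml.bz2"  # illustrative
    TEST_OUTPUT_DIR = "test/output"           # illustrative


    class WikiqRunTest(unittest.TestCase):
        def test_run_succeeds(self):
            # Stand-in for the real wikiq invocation.
            proc = subprocess.run(["echo", TEST_DUMP, TEST_OUTPUT_DIR],
                                  capture_output=True, text=True)
            # Putting stderr in the message means failures show it.
            self.assertEqual(proc.returncode, 0,
                             msg=f"wikiq failed:\n{proc.stderr}")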
Signed-off-by: Will Beason <willbeason@gmail.com>
This avoids:
1) the main function running when the file is imported
2) creating many globally-scoped variables in the main logic
Also begin refactoring the test output file logic.
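Presumably the change is the standard Python entry-point
pattern, roughly:

    def main():
        # Logic that used to run at module scope now lives here, so its
        # variables are local instead of global.
        print("running the main logic")


    if __name__ == "__main__":
        # Runs only when the file is executed directly, not when imported.
        main()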
Signed-off-by: Will Beason <willbeason@gmail.com>