This is inefficient because it requires an individual request per diff.
Next step is to try collecting the revision texts to reduce
communication overhead.
Signed-off-by: Will Beason <willbeason@gmail.com>
This should help PR readability.
There is likely still some unused code, but that should be the
bulk of it.
Signed-off-by: Will Beason <willbeason@gmail.com>
This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.
Signed-off-by: Will Beason <willbeason@gmail.com>
This is optional and doesn't impact existing users: the preexisting
behavior when users specify an output directory is unchanged.
Tests no longer need to copy large files as part of their
execution, since they can ask for files to be written to explicit
locations.
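
A rough sketch of the distinction described above, not the actual wikiq
code. The helper name `resolve_output`, the default `.tsv` suffix, and the
rule "a path with a file extension is an explicit file, anything else is
a directory" are all assumptions made for illustration.

```python
from pathlib import Path


def resolve_output(input_file: str, output: str, suffix: str = ".tsv") -> Path:
    """Return the path the output should be written to.

    Hypothetical helper: treats `output` as an explicit file when it has
    an extension, and as a directory otherwise.
    """
    out = Path(output)
    if out.suffix:
        # Explicit file location requested (e.g. by a test): use it as-is.
        return out
    # Directory case: derive the file name from the input dump's name,
    # dropping every extension (e.g. "name.xml.7z" -> "name").
    stem = Path(input_file).name.split(".")[0]
    return out / (stem + suffix)


dir_case = resolve_output("dumps/sailormoon.xml.7z", "out")
file_case = resolve_output("dumps/sailormoon.xml.7z", "tmp/result.parquet")
```

Under these assumptions a test can pass a temporary file path directly
instead of copying a large output file after the run.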
Signed-off-by: Will Beason <willbeason@gmail.com>
Use a simple loop when we aren't collapsing users.
Add a test covering the case where users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
Previously, only fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps.
This fixes the bug so that .7z archives work in either case.
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
Signed-off-by: Will Beason <willbeason@gmail.com>
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings. Previously the code used the empty
string to denote anonymous editors.
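
A minimal sketch of the idea, not the actual wikiq code. The helper name
`normalize_editor_id` is hypothetical, and it assumes anonymous editors
previously arrived as an empty string, as described above.

```python
from typing import Optional, Union


def normalize_editor_id(raw: Union[int, str, None]) -> Optional[int]:
    """Return the editor id as an int, or None for anonymous editors.

    Hypothetical helper: maps the old empty-string marker to None so the
    column only ever holds ints and None.
    """
    if raw is None or raw == "":
        return None
    return int(raw)


# Mixed inputs collapse to a single int-or-None representation.
ids = [normalize_editor_id(v) for v in [42, "", "7", None]]
```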
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com>
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
Signed-off-by: Will Beason <willbeason@gmail.com>
Test file refreshing and path computation are now handled by a helper.
The wikiq command is now constructed and handled by a single method
rather than in several ad-hoc ways.
The last places relying on the working directory are now removed.
Signed-off-by: Will Beason <willbeason@gmail.com>