Use a simple loop when we aren't collapsing users.
Add a test that covers the case where users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
Before, only fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps.
This fixes the bug so that .7z archives work in either case.
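
A minimal sketch of how such a flag might select what is streamed out of
the archive. The helper name `build_7z_command`, the `fandom_2020`
parameter, and the wildcard patterns are assumptions for illustration and
are not taken from this commit; only `7za x -so` (extract to stdout) is
standard 7-Zip usage.

```python
from typing import List


def build_7z_command(archive_path: str, fandom_2020: bool = False) -> List[str]:
    """Build a 7za command that streams archive members to stdout.

    Hypothetical helper: the name, the flag, and the patterns are assumptions.
    """
    # "x" extracts; "-so" writes the extracted data to stdout.
    cmd = ["7za", "x", "-so", archive_path]
    # Assumption: with the flag set, only the .xml members are streamed;
    # otherwise every member of the archive is streamed.
    cmd.append("*.xml" if fandom_2020 else "*")
    return cmd
```

Returning the command as a list keeps it usable with subprocess without
going through a shell, so the wildcard is passed to 7za unexpanded.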
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
Signed-off-by: Will Beason <willbeason@gmail.com>
Changed the logic for handling anonymous edits so that wikiq handles
the type of editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings; previously the code used the empty
string to denote anonymous editors.
Tests failing. Don't merge yet.
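
The typing rule for editor ids described above can be sketched as follows.
This is an illustration only: `normalize_editor_id` is a hypothetical name,
and the assumption that raw ids arrive as strings is not confirmed by this
commit.

```python
from typing import Optional


def normalize_editor_id(raw: Optional[str]) -> Optional[int]:
    """Return the editor id as an int, or None for an anonymous edit."""
    # Assumption: anonymous editors show up as None or as the empty string.
    if raw is None or raw == "":
        return None
    return int(raw)


# Every value is now either an int or None, never a string.
editor_ids = [normalize_editor_id(v) for v in ["42", "", None, "7"]]
```

With only ints and None in the column, it can be stored as a nullable
integer column instead of a mixed int/string one.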
Signed-off-by: Will Beason <willbeason@gmail.com>
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
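
A rough sketch of this split, assuming a unittest-based suite. The
constructor arguments, attribute names, and directory layout below are
hypothetical and only illustrate the division of responsibilities.

```python
import os
import unittest


class WikiqTester:
    """Hypothetical helper that computes the variables a test needs."""

    def __init__(self, wiki: str, case_name: str, base_dir: str = "tests"):
        # Assumed layout: inputs under dumps/, outputs under output/.
        self.input_file = os.path.join(base_dir, "dumps", f"{wiki}.xml.7z")
        self.output_file = os.path.join(
            base_dir, "output", f"{wiki}_{case_name}.tsv"
        )


class WikiqTestCase(unittest.TestCase):
    """Test logic lives here and delegates setup to WikiqTester."""

    def test_paths_are_derived_from_names(self):
        tester = WikiqTester("example_wiki", "noargs")
        self.assertTrue(tester.input_file.endswith("example_wiki.xml.7z"))
        self.assertTrue(tester.output_file.endswith("example_wiki_noargs.tsv"))
```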
Signed-off-by: Will Beason <willbeason@gmail.com>
Test file refreshing and path computation are now handled by a helper.
The wikiq command is now constructed and handled by a single method
rather than in several ad-hoc ways.
The last places relying on the working directory are now removed.
Signed-off-by: Will Beason <willbeason@gmail.com>
This way, failures show the output of stderr, etc.
Also create path constant strings for use in tests to avoid repetition
and make changes easier.
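
One way to get failures that include stderr, sketched with the standard
library. `run_and_report` is a hypothetical name and not necessarily how
the tests in this commit do it.

```python
import subprocess
import sys
from typing import List


def run_and_report(cmd: List[str]) -> str:
    """Run cmd and return its stdout; on failure, raise with stderr attached."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError as exc:
        # Include the captured stderr so the failure message is self-explanatory.
        raise AssertionError(
            f"command {cmd} exited with {exc.returncode}\nstderr:\n{exc.stderr}"
        ) from exc
    return proc.stdout


# Usage: a command that succeeds simply returns its stdout.
print(run_and_report([sys.executable, "-c", "print('ok')"]))
```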
Signed-off-by: Will Beason <willbeason@gmail.com>
This avoids:
1) Running the main function when sourcing the file
2) Creating many globally scoped variables in the main logic
Also begin refactoring the test output file logic.
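
The two points above match the usual Python main-guard pattern, sketched
here under the assumption that the main logic was moved into a function.
The body of `main` is a placeholder, not wikiq's actual logic.

```python
def main() -> int:
    # Names defined here are local to main(), not module-level globals.
    total = sum(range(5))
    print(f"total={total}")
    return total


# Runs only when the file is executed directly, not when it is imported.
if __name__ == "__main__":
    main()
```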
Signed-off-by: Will Beason <willbeason@gmail.com>
The "mw" and "numpy" dependencies were unneeded.
Spaces and tabs were inconsistently used.
They are now used consistently; the changes were made by an auto-formatter.
Signed-off-by: Will Beason <willbeason@gmail.com>