Commit Graph

64 Commits

Author SHA1 Message Date
Will Beason
8c707f5ef3 Remove unused code
This should help PR readability.

There is likely still some unused code, but that should be the
bulk of it.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 17:20:05 -05:00
Will Beason
b50c51a215 Get regex working
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 16:02:18 -05:00
Will Beason
89465b29f4 Re-add special case where revert radius is zero
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:18:21 -05:00
Will Beason
17c7f208ab Add collapsed_revs back
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:08:57 -05:00
Will Beason
123b9a18a8 Fix revert column behavior
Now all columns are tested in the parquet test.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:03:33 -05:00
Will Beason
06a784ef27 Get columnar refactor partially working
Noargs works, now to do persistence.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 12:51:31 -05:00
Will Beason
8b0f775610 Begin move to columnar types
This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 08:52:57 -05:00
Will Beason
f916af9836 Allow specifying output file basename instead of just directory
This is optional, and doesn't impact existing users as preexisting
behavior when users specify an output directory is unchanged.

This makes tests not need to copy large files as part of their
execution, as they can ask files to be written to explicit
locations.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-02 14:13:13 -05:00
Will Beason
9ee5ecfc91 Separate revision iteration and field collation logic
This way we're not adding temporary fields to objects that don't
normally have these fields.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 14:09:16 -05:00
Will Beason
f9383440a0 Fix tests
Surprisingly replacing list<str> with str doesn't break anything,
even baselines.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:56:31 -05:00
Will Beason
032fec3198 Remove unnecessary urlencode tests
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:20:10 -05:00
Will Beason
0d56267ae0 Get parquet libraries writing files
Tests broken due to url encoding, which can likely now be removed.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:06:26 -05:00
Will Beason
4dde25c508 Refactor revision logic to make more straightforward
Use groupby so we don't have to deal with edge cases and compare
revisions directly.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:46:45 -05:00
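
As a rough illustration of the groupby approach mentioned in this commit, the sketch below groups consecutive revisions by editor with `itertools.groupby`. Whether the commit uses `itertools.groupby` or another groupby is an assumption, and the `revisions` records and their `rev_id` and `editor` fields are hypothetical, not taken from the repository.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical revision records; the real field names may differ.
revisions = [
    {"rev_id": 1, "editor": "alice"},
    {"rev_id": 2, "editor": "alice"},
    {"rev_id": 3, "editor": "bob"},
    {"rev_id": 4, "editor": "alice"},
]

# groupby only merges *adjacent* items with equal keys, so each run of
# consecutive edits by one editor becomes one group, with no manual
# "compare to previous revision" bookkeeping or first/last edge cases.
runs = [
    (editor, [rev["rev_id"] for rev in group])
    for editor, group in groupby(revisions, key=itemgetter("editor"))
]

print(runs)
```

Because only adjacent items are merged, the later edit by the same editor starts a new run rather than joining the first one.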
Will Beason
aec6e5fafa Refactor collapse user logic
Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:20:34 -05:00
Will Beason
c0e629a313 Add ability to disable revert detection
Also add test to ensure functionality works.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 11:59:10 -05:00
Will Beason
9009bb6fa4 Merge branch 'parquet_support' into test-parquet 2025-05-29 10:21:30 -05:00
Nathan TeBlunthuis
2a2b611d79 Fix issue with .7z archives
Before, only fandom wikis dumps were compressed with .7z.
These archives can have several .xml files in the .7z; not just one.
So we need to have a flag for the fandom-2020 dumps.

This fixes the bug so .7z archives work in either case.
2025-05-28 21:49:11 -07:00
Nathan TeBlunthuis
383ee03250 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support 2025-05-28 21:09:13 -07:00
Nathan TeBlunthuis
3c7fb088d6 fix schema bugs. 2025-05-28 20:54:42 -07:00
Will Beason
ee01ce3e61 Get Parquet test working
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 16:48:58 -05:00
Will Beason
52757a8239 Add noargs test for ikwiki
This way we can ensure that the parquet code outputs equivalent output.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 15:04:10 -05:00
Will Beason
3f94144b1b Begin adding test for parquet export
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings - previously the code used the empty
string to denote anonymous editors.

Tests failing. Don't merge yet.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 13:17:30 -05:00
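
The type constraint described in this commit can be sketched as a small normalization step: map the legacy empty-string marker to `None` so that every editor id is either an integer or missing. The helper name `normalize_editor_id` is hypothetical and only illustrates the idea; it is not the code in the repository.

```python
from typing import Optional, Union


def normalize_editor_id(raw: Union[int, str, None]) -> Optional[int]:
    """Return an int for real editor ids and None for anonymous editors.

    Hypothetical helper: the empty string is treated as the legacy
    marker for an anonymous editor.
    """
    if raw is None or raw == "":
        return None
    return int(raw)


# A mixed input column collapses to a single nullable integer type.
editor_ids = [normalize_editor_id(value) for value in [42, "", "7", None]]
print(editor_ids)
```

With every value either `int` or `None`, the column fits a nullable int64 type instead of mixing integers and strings.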
Will Beason
3d0bf89938 Move main logic to main()
This avoids:
1) the main function running when sourcing the file
2) Creating many globally-scoped variables in the main logic

Also begin refactor of test output file logic

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 11:10:42 -05:00
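
The two points in this commit follow the usual Python entry-point pattern, sketched below under assumptions: the argument names and the body of `main()` are hypothetical placeholders, not the tool's actual interface.

```python
import argparse
from typing import List, Optional


def main(argv: Optional[List[str]] = None) -> int:
    # Everything defined here is local to main(), not a module-level global.
    parser = argparse.ArgumentParser(description="hypothetical example CLI")
    parser.add_argument("dumpfiles", nargs="*", help="input files (hypothetical)")
    args = parser.parse_args(argv)
    for path in args.dumpfiles:
        print(f"would process {path}")
    return 0


# Runs only when the file is executed as a script, not when it is imported.
if __name__ == "__main__":
    main()
```

A test can then import the module and call `main(["some_file.xml"])` directly instead of launching a separate process.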
Will Beason
6d133575c7 Remove resource leaks from tests
Close subprocesses within tests to fix resource leak warning.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:08:47 -05:00
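
One common way to avoid this kind of resource-leak warning is to run the child process inside a `with` block, which closes its pipes and waits for it on exit. The sketch below assumes that approach; the helper name `run_and_capture` is hypothetical and the repository's tests may close their subprocesses differently.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a child process and return (returncode, stdout).

    Using Popen as a context manager closes the pipes and waits for the
    child when the block exits, so nothing is left open afterwards.
    """
    with subprocess.Popen(args, stdout=subprocess.PIPE, text=True) as proc:
        out, _ = proc.communicate()
    return proc.returncode, out


code, out = run_and_capture([sys.executable, "-c", "print('ok')"])
print(code, out.strip())
```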
Will Beason
9c5bf577e6 Remove unused dependencies and fix spacing
The "mw" and "numpy" dependencies were unneeded.

Spaces and tabs were inconsistently used.
They are now used consistently, changes via auto-formatter.

Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 14:15:01 -05:00
1aea601a30 [Bugfix] Call the correct matchmake function. 2021-11-16 16:53:21 -08:00
c437b357db rename matchmake functions 2021-11-11 19:09:41 -08:00
bb83d62b74 Add some descriptive comments. 2021-10-19 16:55:24 -07:00
b1bea09ad6 fix bugs and unit tests 2021-10-18 13:33:05 -07:00
9a0c157ebb bugfix 2021-10-18 10:15:03 -07:00
ae870fed0b parquet path is code-complete 2021-10-17 21:46:31 -07:00
26f6d8f984 remove dependency on pandas. 2021-10-17 20:24:33 -07:00
ae9a241747 use dataclasses and pyarrow for types. 2021-10-17 20:21:22 -07:00
d8d20f670b initial work on parquet support 2021-10-17 13:22:22 -07:00
cdfa77d66d remove commented code 2019-11-11 11:28:48 -08:00
02b3250a36 refactor regex matching in a tidier object oriented style 2019-11-09 13:07:46 -08:00
414cc5ff2d validate tests and add asserts and baselines for regex tests. 2019-11-09 12:19:55 -08:00
sohyeonhwang
f147e1d899 merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests 2019-11-07 13:28:17 -06:00
c4416d0f1b make revert radius configurable 2019-10-07 13:57:49 -07:00
7b856bec86 Merge branch 'master' into regex_scanner 2019-10-05 18:17:03 -07:00
17529cdd48 bugfix, remove old legacy persistence flag 2019-10-05 16:13:11 -07:00
sohyeonhwang
7bf4559ceb changes for regex scanner addition 2019-10-05 15:36:58 -05:00
fb052ffa33 don't compute persistence by default 2019-09-22 15:54:17 -07:00
e871023ff5 elaborate docstring for persistence 2019-09-22 15:11:59 -07:00
7d62ff9fb7 improve help for namespace-include 2018-09-03 11:30:12 -07:00
Nate E TeBlunthuis
f784c77f60 add namespace filter parameter 2018-09-03 11:13:48 -07:00
317bafb50d Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence 2018-08-23 19:00:49 -07:00
7cd0bf3b9e Add parameter for selecting specific namespaces. 2018-08-23 18:49:32 -07:00
d93769c21f Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence 2018-08-23 18:27:09 -07:00
Nate E TeBlunthuis
afd40c1a45 Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence 2018-08-23 18:25:51 -07:00