Will Beason
9ee5ecfc91
Separate revision iteration and field collation logic
...
This way we're not adding temporary fields to objects that don't
normally have these fields.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 14:09:16 -05:00
Will Beason
f9383440a0
Fix tests
...
Surprisingly, replacing list<str> with str doesn't break anything,
not even the baselines.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:56:31 -05:00
Will Beason
032fec3198
Remove unnecessary urlencode tests
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:20:10 -05:00
Will Beason
0d56267ae0
Get parquet libraries writing files
...
Tests broken due to url encoding, which can likely now be removed.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:06:26 -05:00
Will Beason
4dde25c508
Refactor revision logic to make more straightforward
...
Use groupby so we don't have to deal with edge cases and compare
revisions directly.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:46:45 -05:00
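The groupby idea in the commit above can be sketched roughly as follows. This is an illustration only, not the project's code: the `Revision` class, its fields, and `last_of_each_run` are hypothetical names, and the sketch assumes the grouping key is the editor of consecutive revisions.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Revision:
    # Hypothetical stand-in for a revision record; not the real wikiq type.
    rev_id: int
    user: str


def last_of_each_run(revisions):
    """Keep the last revision from each run of consecutive edits by one user.

    itertools.groupby only groups *adjacent* items with equal keys, so an
    empty input, a single revision, or a user who returns later all fall
    out of the same code path without special cases.
    """
    return [list(run)[-1] for _, run in groupby(revisions, key=lambda r: r.user)]


revs = [Revision(1, "a"), Revision(2, "a"), Revision(3, "b"), Revision(4, "a")]
kept = [r.rev_id for r in last_of_each_run(revs)]
```

Because user "a" appears in two separate runs, both runs are kept; only adjacent revisions are merged.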
Will Beason
aec6e5fafa
Refactor collapse user logic
...
Use a simple loop when we aren't collapsing users.
Add a test which covers the case when users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:20:34 -05:00
Will Beason
c0e629a313
Add ability to disable revert detection
...
Also add test to ensure functionality works.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 11:59:10 -05:00
Will Beason
9009bb6fa4
Merge branch 'parquet_support' into test-parquet
2025-05-29 10:21:30 -05:00
Nathan TeBlunthuis
2a2b611d79
Fix issue with .7z archives
...
Previously, only Fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps.
This fixes the bug so that .7z archives work in either case.
2025-05-28 21:49:11 -07:00
Nathan TeBlunthuis
383ee03250
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-28 21:09:13 -07:00
Nathan TeBlunthuis
3c7fb088d6
fix schema bugs.
2025-05-28 20:54:42 -07:00
Will Beason
ee01ce3e61
Get Parquet test working
...
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 16:48:58 -05:00
Will Beason
52757a8239
Add noargs test for ikwiki
...
This way we can ensure that the parquet code produces equivalent output.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 15:04:10 -05:00
Will Beason
3f94144b1b
Begin adding test for parquet export
...
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings - previously the code used the empty
string to denote anonymous editors.
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 13:17:30 -05:00
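The type constraint described in the commit above (a column may hold int64 and None, but not int64 and strings) suggests normalizing editor ids before writing. A minimal sketch, assuming the raw value arrives as a string or None; `normalize_editor_id` is a hypothetical helper, not a function from wikiq:

```python
from typing import Optional


def normalize_editor_id(raw: Optional[str]) -> Optional[int]:
    """Map a raw editor id to int, using None (not "") for anonymous editors."""
    if raw is None or raw == "":
        return None
    return int(raw)


# Every value is now either an int or None, so the column has one type.
column = [normalize_editor_id(v) for v in ["42", "", None, "7"]]
```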
Will Beason
3d0bf89938
Move main logic to main()
...
This avoids:
1) running the main function when sourcing the file
2) creating many globally-scoped variables in the main logic
Also begin refactoring the test output file logic.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 11:10:42 -05:00
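The pattern behind the commit above is the usual Python entry-point guard. A small sketch under assumed names (the `main` signature and its body are illustrative, not the script's actual logic):

```python
import sys


def main(argv=None):
    """Entry point; all working variables stay local to this function."""
    paths = sys.argv[1:] if argv is None else list(argv)
    # Placeholder for the real work: here we only count the inputs.
    return len(paths)


# Importing this module does not trigger main(); running it as a script does.
if __name__ == "__main__":
    main()
```

Tests can then import the module and call `main([...])` directly with explicit arguments.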
Will Beason
6d133575c7
Remove resource leaks from tests
...
Close subprocesses within tests to fix resource leak warning.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:08:47 -05:00
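One common way to avoid the resource leak warning mentioned above is to use `subprocess.Popen` as a context manager, which closes the pipes and waits for the process on exit. This is a sketch of that general technique, not necessarily how the tests were changed; `run_and_capture` is a hypothetical helper:

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command, return (returncode, stdout), and leave nothing open."""
    with subprocess.Popen(args, stdout=subprocess.PIPE, text=True) as proc:
        out, _ = proc.communicate()
    # Leaving the with-block closes proc.stdout and waits for the process.
    return proc.returncode, out


code, out = run_and_capture([sys.executable, "-c", "print('ok')"])
```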
Will Beason
9c5bf577e6
Remove unused dependencies and fix spacing
...
The "mw" and "numpy" dependencies were unneeded.
Spaces and tabs were inconsistently used.
They are now used consistently, changes via auto-formatter.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 14:15:01 -05:00
1aea601a30
[Bugfix] Call the correct matchmake function.
2021-11-16 16:53:21 -08:00
c437b357db
rename matchmake functions
2021-11-11 19:09:41 -08:00
bb83d62b74
Add some descriptive comments.
2021-10-19 16:55:24 -07:00
b1bea09ad6
fix bugs and unit tests
2021-10-18 13:33:05 -07:00
9a0c157ebb
bugfix
2021-10-18 10:15:03 -07:00
ae870fed0b
parquet path is code-complete
2021-10-17 21:46:31 -07:00
26f6d8f984
remove dependency on pandas.
2021-10-17 20:24:33 -07:00
ae9a241747
use dataclasses and pyarrow for types.
2021-10-17 20:21:22 -07:00
d8d20f670b
initial work on parquet support
2021-10-17 13:22:22 -07:00
cdfa77d66d
remove commented code
2019-11-11 11:28:48 -08:00
02b3250a36
refactor regex matching in a tidier object oriented style
2019-11-09 13:07:46 -08:00
414cc5ff2d
validate tests and add asserts and baselines for regex tests.
2019-11-09 12:19:55 -08:00
sohyeonhwang
f147e1d899
merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests
2019-11-07 13:28:17 -06:00
c4416d0f1b
make revert radius configurable
2019-10-07 13:57:49 -07:00
7b856bec86
Merge branch 'master' into regex_scanner
2019-10-05 18:17:03 -07:00
17529cdd48
bugfix, remove old legacy persistence flag
2019-10-05 16:13:11 -07:00
sohyeonhwang
7bf4559ceb
changes for regex scanner addition
2019-10-05 15:36:58 -05:00
fb052ffa33
don't compute persistence by default
2019-09-22 15:54:17 -07:00
e871023ff5
elaborate docstring for persistence
2019-09-22 15:11:59 -07:00
7d62ff9fb7
improve help for namespace-include
2018-09-03 11:30:12 -07:00
Nate E TeBlunthuis
f784c77f60
add namespace filter parameter
2018-09-03 11:13:48 -07:00
317bafb50d
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence
2018-08-23 19:00:49 -07:00
7cd0bf3b9e
Add parameter for selecting specific namespaces.
2018-08-23 18:49:32 -07:00
d93769c21f
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence
2018-08-23 18:27:09 -07:00
Nate E TeBlunthuis
afd40c1a45
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence
2018-08-23 18:25:51 -07:00
Nate E TeBlunthuis
e4222c45dd
add namespace filter parameter
2018-08-23 18:25:08 -07:00
Nate E TeBlunthuis
776b73519a
add namespace filter parameter
2018-08-23 18:23:23 -07:00
Nate E TeBlunthuis
5b6aaad862
add namespace filter parameter
2018-08-23 18:02:56 -07:00
f468d1a5b6
add support for persistence with segment matching
2018-08-20 16:08:16 -07:00
bf396ad366
Prefix page titles with namespace names.
2018-07-09 22:11:17 -07:00
dba793c6ac
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.
2018-07-05 01:16:00 -07:00
d77b0a4965
migrate to mwpersistence. This fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy flag.
2018-07-04 19:06:07 -07:00
7db6288923
migrate reverts to python-mwreverts
2018-07-04 15:29:48 -07:00