Nathan TeBlunthuis
606a399450
handle empty comments which are 'False' somehow.
2025-05-29 18:14:58 -07:00
Nathan TeBlunthuis
a9f76a0f62
change order of fields.
2025-05-29 18:10:59 -07:00
Nathan TeBlunthuis
f39ceefa4a
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-29 18:05:28 -07:00
Nathan TeBlunthuis
13ee160708
bugfix.
2025-05-29 18:04:41 -07:00
Nathan TeBlunthuis
bd22d26291
update deps and add edit_summary to wikiq output.
2025-05-29 18:02:14 -07:00
Will Beason
ab280dd765
Remove requirements.txt and add uv.lock to ignored files.
...
We can choose to check in uv.lock later if we want.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 10:05:49 -05:00
Nathan TeBlunthuis
22d14dc5f2
Remove dependency on pytest.
2025-05-28 21:54:31 -07:00
Nathan TeBlunthuis
5a10f59dc4
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-28 23:52:59 -05:00
Nathan TeBlunthuis
b8cdc82fc2
add ipython for dev
2025-05-28 23:52:37 -05:00
Nathan TeBlunthuis
2a2b611d79
Fix issue with .7z archives
...
Before, only fandom wiki dumps were compressed with .7z.
These archives can have several .xml files in the .7z, not just one.
So we need to have a flag for the fandom-2020 dumps.
This fixes the bug so .7z archives work in either case.
2025-05-28 21:49:11 -07:00
Nathan TeBlunthuis
39fec0820d
use my version of mwxml since it fixes a bug.
2025-05-28 21:13:18 -07:00
Nathan TeBlunthuis
383ee03250
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-28 21:09:13 -07:00
Nathan TeBlunthuis
15e9234903
adding pyproject.toml
2025-05-28 20:59:55 -07:00
Nathan TeBlunthuis
8c7d46472f
Merge branch 'parquet_support' of code:mediawiki_dump_tools into parquet_support
2025-05-28 20:54:52 -07:00
Nathan TeBlunthuis
3c7fb088d6
fix schema bugs.
2025-05-28 20:54:42 -07:00
Will Beason
df0ad1de63
Finish test standardization
...
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 10:11:58 -05:00
Will Beason
f3e6cc9392
Begin refactor of tests to make new tests easier to write
...
Handle file naming logic centrally rather than requiring a dedicated
class per input file.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 09:11:36 -05:00
Will Beason
c8b14c3303
Refactor test temporary file logic and wikiq call pattern
...
Test file refreshing and path computation are now handled by a helper.
The wikiq command is now constructed and handled by a single method
rather than in several ad-hoc ways.
The last places relying on the working directory are now removed.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 16:24:07 -05:00
Will Beason
4d3900b541
Standardize calling for wikiq in tests
...
This way failures show the output of stderr/etc.
Also create path constant strings for use in tests to avoid repetition
and make changes easier.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 14:27:49 -05:00
Will Beason
ebc57864f2
Make tests runnable from anywhere
...
Tests no longer implicitly require that the caller be in
a specific working directory.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 13:40:57 -05:00
Will Beason
3d0bf89938
Move main logic to main()
...
This avoids:
1) running the main function when sourcing the file
2) creating many globally scoped variables in the main logic
Also begin refactor of test output file logic
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-27 11:10:42 -05:00
Will Beason
6d133575c7
Remove resource leaks from tests
...
Close subprocesses within tests to fix resource leak warning.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:08:47 -05:00
Will Beason
09a84e7d11
Reformat Wikiq_Unit_Test.py
...
Separate out reformatting from editing.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 15:07:39 -05:00
Will Beason
9c5bf577e6
Remove unused dependencies and fix spacing
...
The "mw" and "numpy" dependencies were unneeded.
Spaces and tabs were inconsistently used.
They are now used consistently; the changes were made via auto-formatter.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 14:15:01 -05:00
Will Beason
4804ecc4b3
Add additional test dependencies
...
These are now noted in requirements.txt
Also make dependency on 7zip and ffmpeg explicit in README
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 12:29:49 -05:00
Will Beason
7a4c41159c
Exclude JetBrains config folder in .gitignore
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-26 10:48:17 -05:00
1aea601a30
[Bugfix] Call the correct matchmake function.
2021-11-16 16:53:21 -08:00
c437b357db
rename matchmake functions
2021-11-11 19:09:41 -08:00
bb83d62b74
Add some descriptive comments.
2021-10-19 16:55:24 -07:00
c285402683
add todos to readme
2021-10-18 14:14:11 -07:00
b1bea09ad6
fix bugs and unit tests
2021-10-18 13:33:05 -07:00
9a0c157ebb
bugfix
2021-10-18 10:15:03 -07:00
ae870fed0b
parquet path is code-complete
2021-10-17 21:46:31 -07:00
26f6d8f984
remove dependency on pandas.
2021-10-17 20:24:33 -07:00
ae9a241747
use dataclasses and pyarrow for types.
2021-10-17 20:21:22 -07:00
d8d20f670b
initial work on parquet support
2021-10-17 13:22:22 -07:00
cdfa77d66d
remove commented code
2019-11-11 11:28:48 -08:00
02b3250a36
refactor regex matching in a tidier object oriented style
2019-11-09 13:07:46 -08:00
414cc5ff2d
validate tests and add asserts and baselines for regex tests.
2019-11-09 12:19:55 -08:00
sohyeonhwang
4ccde84529
added regex scanner v2's dump unit test file regextest.xml.bz2
2019-11-07 14:06:15 -06:00
sohyeonhwang
f147e1d899
merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests
2019-11-07 13:28:17 -06:00
c84844cfb5
add unit tests for configuring revert_radius
2019-10-07 15:02:30 -07:00
c4416d0f1b
make revert radius configurable
2019-10-07 13:57:49 -07:00
7b856bec86
Merge branch 'master' into regex_scanner
2019-10-05 18:17:03 -07:00
324ccc8e26
update baseline outputs
2019-10-05 16:36:07 -07:00
17529cdd48
bugfix, remove old legacy persistence flag
2019-10-05 16:13:11 -07:00
sohyeonhwang
7bf4559ceb
changes for regex scanner addition
2019-10-05 15:36:58 -05:00
fb052ffa33
don't compute persistence by default
2019-09-22 15:54:17 -07:00
e871023ff5
elaborate docstring for persistence
2019-09-22 15:11:59 -07:00
7d62ff9fb7
improve help for namespace-include
2018-09-03 11:30:12 -07:00