Will Beason
94454ffca3
Add PHP server file
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 14:17:53 -05:00
Will Beason
62db384aa4
Pass arrays of diffs instead of incremental
...
This is 3.5x faster
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 14:17:01 -05:00
Will Beason
96915a074b
Add call to compute diffs via local PHP server
...
This is inefficient as it requires an individual request per diff.
Going to try collecting the revision texts to reduce communication
overhead.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-23 13:09:27 -05:00
Will Beason
0d9ab003f0
Fix tests for new field
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 12:44:07 -05:00
Will Beason
4bbed4a196
Merge branch 'parquet_support' into test-parquet
2025-06-17 12:20:19 -05:00
Will Beason
11d2587471
Add docs and rename import pc -> pacsv
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:46:16 -05:00
Will Beason
586ae85c65
Conform to 3.9 union type formatting
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:41:46 -05:00
Will Beason
390499dd90
Pin to python 3.9
...
Since our execution environment requires this
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:37:20 -05:00
Will Beason
84d464ea38
Remove unnecessary re-conversion to list(revs)
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:23:24 -05:00
Will Beason
3e8ae205e8
Factor out revision mutation logic into its own function
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-17 11:02:45 -05:00
Will Beason
8c707f5ef3
Remove unused code
...
This should help PR readability.
There is likely still some unused code, but that should be the
bulk of it.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 17:20:05 -05:00
Will Beason
b50c51a215
Get regex working
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 16:02:18 -05:00
Will Beason
89465b29f4
Re-add special case where revert radius is zero
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:18:21 -05:00
Will Beason
17c7f208ab
Add collapsed_revs back
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:08:57 -05:00
Will Beason
123b9a18a8
Fix revert column behavior
...
Now all columns are tested in the parquet test.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 15:03:33 -05:00
Will Beason
06a784ef27
Get columnar refactor partially working
...
Noargs works, now to do persistence.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 12:51:31 -05:00
Will Beason
8b0f775610
Begin move to columnar types
...
This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-03 08:52:57 -05:00
Will Beason
f916af9836
Allow specifying output file basename instead of just directory
...
This is optional, and doesn't impact existing users as preexisting
behavior when users specify an output directory is unchanged.
This means tests no longer need to copy large files as part of their
execution, as they can ask for files to be written to explicit
locations.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-06-02 14:13:13 -05:00
Will Beason
9ee5ecfc91
Separate revision iteration and field collation logic
...
This way we're not adding temporary fields to objects that don't
normally have these fields.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 14:09:16 -05:00
Will Beason
f9383440a0
Fix tests
...
Surprisingly, replacing list<str> with str doesn't break anything,
even baselines.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:56:31 -05:00
Will Beason
032fec3198
Remove unnecessary urlencode tests
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:20:10 -05:00
Will Beason
0d56267ae0
Get parquet libraries writing files
...
Tests broken due to url encoding, which can likely now be removed.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-30 13:06:26 -05:00
Nathan TeBlunthuis
260e2b177c
fix order of fields.
2025-05-29 18:32:16 -07:00
Nathan TeBlunthuis
a13d7f1deb
typo fix.
2025-05-29 18:25:08 -07:00
Nathan TeBlunthuis
ffbd180001
make editorid null not '' in parquet.
2025-05-29 18:24:33 -07:00
Nathan TeBlunthuis
606a399450
handle empty comments which are 'False' somehow.
2025-05-29 18:14:58 -07:00
Nathan TeBlunthuis
a9f76a0f62
change order of fields.
2025-05-29 18:10:59 -07:00
Nathan TeBlunthuis
f39ceefa4a
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-29 18:05:28 -07:00
Nathan TeBlunthuis
13ee160708
bugfix.
2025-05-29 18:04:41 -07:00
Nathan TeBlunthuis
bd22d26291
update deps and add edit_summary to wikiq output.
2025-05-29 18:02:14 -07:00
Will Beason
4dde25c508
Refactor revision logic to make more straightforward
...
Use groupby so we don't have to deal with edge cases and compare
revisions directly.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:46:45 -05:00
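A minimal sketch of the groupby idea mentioned in the commit above, assuming it refers to Python's `itertools.groupby` applied to consecutive revisions by the same editor. The `Revision` record and the `collapse_by_editor` helper are hypothetical names used only for illustration; they are not taken from the repository.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, NamedTuple, Optional


class Revision(NamedTuple):
    # Hypothetical stand-in for a revision record.
    rev_id: int
    editor: Optional[str]  # None is used here for a deleted user


def collapse_by_editor(revs: Iterable[Revision]) -> Iterator[List[Revision]]:
    """Yield runs of consecutive revisions that share the same editor value."""
    # groupby starts a new run each time the key changes, so there is
    # no manual "compare with the previous revision" bookkeeping.
    for _editor, run in groupby(revs, key=lambda r: r.editor):
        yield list(run)


revs = [Revision(1, "a"), Revision(2, "a"), Revision(3, None), Revision(4, "a")]
runs = list(collapse_by_editor(revs))
```

In this sketch, adjacent revisions whose editor is `None` would fall into one run; whether that is the desired handling of deleted users is a design choice the sketch does not settle.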
Will Beason
aec6e5fafa
Refactor collapse user logic
...
Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 15:20:34 -05:00
Will Beason
c0e629a313
Add ability to disable revert detection
...
Also add test to ensure functionality works.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 11:59:10 -05:00
Will Beason
9009bb6fa4
Merge branch 'parquet_support' into test-parquet
2025-05-29 10:21:30 -05:00
Will Beason
ab280dd765
Remove requirements.txt and add uv.lock to ignored files.
...
We can choose to check in uv.lock later if we want.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-29 10:05:49 -05:00
Nathan TeBlunthuis
22d14dc5f2
Remove dependency on pytest.
2025-05-28 21:54:31 -07:00
Nathan TeBlunthuis
5a10f59dc4
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-28 23:52:59 -05:00
Nathan TeBlunthuis
b8cdc82fc2
add ipython for dev
2025-05-28 23:52:37 -05:00
Nathan TeBlunthuis
2a2b611d79
Fix issue with .7z archives
...
Before, only Fandom wiki dumps were compressed with .7z.
These archives can have several .xml files in the .7z, not just one.
So we need to have a flag for the fandom-2020 dumps.
This fixes the bug so .7z archives work in either case.
2025-05-28 21:49:11 -07:00
Nathan TeBlunthuis
39fec0820d
use my version of mwxml since it fixes a bug.
2025-05-28 21:13:18 -07:00
Nathan TeBlunthuis
383ee03250
Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support
2025-05-28 21:09:13 -07:00
Nathan TeBlunthuis
15e9234903
adding pyproject.toml
2025-05-28 20:59:55 -07:00
Nathan TeBlunthuis
8c7d46472f
Merge branch 'parquet_support' of code:mediawiki_dump_tools into parquet_support
2025-05-28 20:54:52 -07:00
Nathan TeBlunthuis
3c7fb088d6
fix schema bugs.
2025-05-28 20:54:42 -07:00
Will Beason
ee01ce3e61
Get Parquet test working
...
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 16:48:58 -05:00
Will Beason
52757a8239
Add noargs test for ikwiki
...
This way we can ensure that the parquet code outputs equivalent output.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 15:04:10 -05:00
Will Beason
d413443740
Add numpy to environment
...
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 13:20:28 -05:00
Will Beason
3f94144b1b
Begin adding test for parquet export
...
Changed logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings - previously the code used the empty
string to denote anonymous editors.
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 13:17:30 -05:00
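A minimal sketch of the editor id handling described in the commit above, assuming anonymous editors were previously marked with the empty string and should now become `None` so the column holds only integers and nulls. The function name `normalize_editor_id` is hypothetical and is not taken from wikiq.

```python
from typing import Optional, Union


def normalize_editor_id(raw: Union[int, str, None]) -> Optional[int]:
    """Map the old '' marker for anonymous editors to None; keep real ids as int."""
    if raw is None or raw == "":
        return None
    return int(raw)


# A column built this way contains only ints and None, never strings.
editor_ids = [normalize_editor_id(v) for v in [42, "", "7", None]]
```

Normalizing at the point where values enter the column keeps the type decision in one place rather than spreading empty-string checks across the writers.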
Will Beason
df0ad1de63
Finish test standardization
...
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 10:11:58 -05:00
Will Beason
f3e6cc9392
Begin refactor of tests to make new tests easier to write
...
Handle file naming logic centrally rather than requiring a dedicated
class per input file.
Signed-off-by: Will Beason <willbeason@gmail.com>
2025-05-28 09:11:36 -05:00