The big challenges here (which remain unresolved) are as follows:
1. Deltas requires changes to be given at the token level,
whereas wikidiff2 reports changes at the byte level. Thus,
sequences of text often need to be tokenized to convert byte
offsets into the desired token indices. As-is this is done
inefficiently, often re-tokenizing previously-tokenized
sequences. A better implementation would tokenize
incrementally, or automatically find the referenced sequences.
2. Deltas only allows Equal/Insert/Delete operations, while
wikidiff2 also detects paragraph moves. These paragraph moves
are NOT equivalent to Equal, as the moved paragraphs are not
guaranteed to be identical, just very similar. Wikidiff2 does
not report changes within moved paragraphs, so to preserve
token persistence, a difference algorithm would need to be run
on the before/after sequences. A stopgap (currently
implemented) is to turn these moves into strict
deletions/insertions.
3. Memory consumption appears to be high, and sometimes this
results in running out of memory. I am unsure whether this is
a memory leak or simply that re-tokenizing generates enough
memory churn that my machine can't handle it.
4. Deltas expects all tokens in the before/after text to be
covered by Equal/Insert/Delete segment ranges, but wikidiff2
does not appear to ever emit Equal ranges, instead skipping
them. These ranges must be computed and inserted in sequence
(see the sketch after this list). As-is the code does not
correctly handle unchanged text at the end of pages.
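For item 4, here is a minimal sketch of the gap-filling step,
assuming wikidiff2's changes have already been converted to
token-index ranges. The Segment tuple and fill_equal_segments
helper are illustrative stand-ins, not part of deltas or wikiq:

    from typing import List, NamedTuple


    class Segment(NamedTuple):
        """Stand-in for a deltas-style operation over token ranges."""
        op: str  # "equal", "insert", or "delete"
        a1: int  # start token index in the "before" text
        a2: int  # end token index (exclusive) in the "before" text
        b1: int  # start token index in the "after" text
        b2: int  # end token index (exclusive) in the "after" text


    def fill_equal_segments(changes: List[Segment], a_len: int,
                            b_len: int) -> List[Segment]:
        """Add Equal segments for the spans wikidiff2 skips,
        including unchanged text at the end of the page."""
        result = []
        a_pos, b_pos = 0, 0
        for change in changes:  # must be sorted by position
            # Tokens between the previous change and this one are unchanged.
            if change.a1 > a_pos or change.b1 > b_pos:
                result.append(Segment("equal", a_pos, change.a1, b_pos, change.b1))
            result.append(change)
            a_pos, b_pos = change.a2, change.b2
        # Cover unchanged text at the end of the page.
        if a_pos < a_len or b_pos < b_len:
            result.append(Segment("equal", a_pos, a_len, b_pos, b_len))
        return result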
Signed-off-by: Will Beason <willbeason@gmail.com>
This is inefficient, as it requires an individual request per
diff. Going to try collecting the revision texts to reduce
communication overhead.
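A minimal sketch of the batching idea; the batch size and the
way a batch would be turned into a single request are
assumptions, not the final design:

    from itertools import islice
    from typing import Iterable, Iterator, List, TypeVar

    T = TypeVar("T")


    def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
        """Yield successive batches of up to `size` items."""
        it = iter(items)
        while batch := list(islice(it, size)):
            yield batch


    # One request would then fetch the texts for a whole batch of
    # revisions instead of one request per diff. The ids are made up.
    for batch in chunked(range(1000, 1010), 4):
        print(batch)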
Signed-off-by: Will Beason <willbeason@gmail.com>
This is optional and doesn't impact existing users, as the
preexisting behavior when users specify an output directory is
unchanged. This means tests no longer need to copy large files
as part of their execution; they can instead ask for files to
be written to explicit locations.
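A rough sketch of the idea: an optional, explicit output
location that falls back to the directory-based behavior when
not given. The flag names here are assumptions, not wikiq's
actual CLI:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("-o", "--output-dir", default=".",
                        help="Directory for output (preexisting behavior).")
    # Hypothetical new optional flag: an explicit output file location.
    parser.add_argument("--output-file", default=None,
                        help="Exact path to write output to; overrides the "
                             "directory-based default.")
    args = parser.parse_args(["--output-file", "/tmp/wikiq-test/out.tsv"])

    if args.output_file is None:
        # Old behavior: derive the filename inside the output directory.
        args.output_file = os.path.join(args.output_dir, "dump.tsv")
    print(args.output_file)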
Signed-off-by: Will Beason <willbeason@gmail.com>
Use a simple loop for when we aren't collapsing users.
Add a test which covers the case where users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com>
Before, only fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps. This fixes the
bug so .7z archives work in either case.
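This commit doesn't say how the archives are read; as one
possible sketch, using py7zr (an assumption) to handle a .7z
with several .xml members:

    import py7zr

    ARCHIVE = "pages.7z"  # placeholder path for illustration

    with py7zr.SevenZipFile(ARCHIVE, mode="r") as archive:
        xml_members = [n for n in archive.getnames() if n.endswith(".xml")]
        # read() maps member names to BytesIO objects. For full-size dumps
        # you would stream or extract instead of reading into memory.
        contents = archive.read(targets=xml_members)

    for name, data in contents.items():
        print(name, len(data.getvalue()), "bytes")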
This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
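A sketch of what that comparison might look like with pandas;
the specific "smoothing" shown here (casting shared columns to
strings) is an assumption about what the test needs:

    import pandas as pd
    from pandas.testing import assert_frame_equal

    # Placeholder paths for illustration.
    tsv_df = pd.read_table("output.tsv")
    parquet_df = pd.read_parquet("output.parquet")

    # Align column order and index, then cast to a common dtype so that,
    # e.g., int64-vs-Int64 or NaN-vs-None differences don't fail the check.
    parquet_df = parquet_df[list(tsv_df.columns)]
    tsv_df = tsv_df.astype(str).reset_index(drop=True)
    parquet_df = parquet_df.astype(str).reset_index(drop=True)

    assert_frame_equal(tsv_df, parquet_df)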
Signed-off-by: Will Beason <willbeason@gmail.com>
Changed the logic for handling anonymous edits so that wikiq
handles the type for editor ids consistently. Parquet can mix
int64 and None, but not int64 and strings; previously the code
used the empty string to denote anonymous editors.
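A minimal sketch of the resulting convention, using pandas'
nullable Int64 dtype; whether wikiq builds the column exactly
this way is an assumption, and the ids are made up:

    import pandas as pd

    # None (not "") marks anonymous editors; the ids are illustrative.
    editor_ids = [12345, None, 67890, None]

    # The nullable Int64 dtype keeps the column integer-typed while
    # allowing missing values, which Parquet stores as int64 plus nulls.
    editors = pd.Series(editor_ids, dtype="Int64", name="editor_id")
    editors.to_frame().to_parquet("/tmp/editors.parquet")
    print(editors)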
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com>
Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
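Roughly, the split looks like this; everything below other
than the two class names is an illustrative assumption:

    import os
    import unittest


    class WikiqTester:
        """Creates and manages the variables a test needs."""

        def __init__(self, wiki: str, suffix: str):
            # Illustrative layout, not wikiq's actual test directories.
            self.input_file = os.path.join("dumps", f"{wiki}.xml.bz2")
            self.output_dir = os.path.join("test_output", suffix)
            os.makedirs(self.output_dir, exist_ok=True)


    class WikiqTestCase(unittest.TestCase):
        """Holds the test logic itself; setup is delegated to WikiqTester."""

        def test_output_dir_created(self):
            tester = WikiqTester(wiki="examplewiki", suffix="basic")
            self.assertTrue(os.path.isdir(tester.output_dir))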
Signed-off-by: Will Beason <willbeason@gmail.com>
Test file refreshing and path computation are now handled by
a helper. The wikiq command is now constructed and run by a
single method rather than in several ad-hoc ways. The last
places relying on the working directory have been removed.
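A sketch of what that single command-building method might
look like; the flag names and layout are assumptions:

    import subprocess
    import sys
    from typing import Sequence


    def run_wikiq(input_file: str, output_dir: str,
                  extra_args: Sequence[str] = ()) -> subprocess.CompletedProcess:
        """Construct and run the wikiq command in one place, instead of
        building it ad hoc at each call site."""
        command = [sys.executable, "wikiq", input_file, "-o", output_dir,
                   *extra_args]
        # Capture output so callers can report stderr on failure.
        return subprocess.run(command, capture_output=True, text=True)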
Signed-off-by: Will Beason <willbeason@gmail.com>
This way, failures show the output of stderr, etc. Also
create path constants for use in tests to avoid repetition and
make changes easier.
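For example (with illustrative constants and a stand-in
command in place of the real wikiq invocation):

    import subprocess
    import unittest

    # Shared path constants avoid repeating literal paths across tests.
    TEST_DUMP = "test/dumps/example.xml.bz2"  # illustrative
    TEST_OUTPUT_DIR = "test/output"           # illustrative


    class WikiqRunTest(unittest.TestCase):
        def test_run_succeeds(self):
            # Stand-in for the real wikiq invocation.
            proc = subprocess.run(["echo", TEST_DUMP, TEST_OUTPUT_DIR],
                                  capture_output=True, text=True)
            # Putting stderr in the message means failures show it.
            self.assertEqual(proc.returncode, 0,
                             msg=f"wikiq failed:\n{proc.stderr}")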
Signed-off-by: Will Beason <willbeason@gmail.com>
This avoids:
1) the main function running when the file is imported
2) creating many globally-scoped variables in the main logic
Also begin refactoring the test output file logic.
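Presumably the change is the standard Python entry-point
pattern, roughly:

    def main():
        # Logic that used to run at module scope now lives here, so its
        # variables are local instead of global.
        print("running the main logic")


    if __name__ == "__main__":
        # Runs only when the file is executed directly, not when imported.
        main()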
Signed-off-by: Will Beason <willbeason@gmail.com>