Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							94454ffca3 
							
						 
					 
					
						
						
							
							Add PHP server file  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-23 14:17:53 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							62db384aa4 
							
						 
					 
					
						
						
							
							Pass arrays of diffs instead of incremental  
						
						
						
						
						This is 3.5x faster
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-23 14:17:01 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							96915a074b 
							
						 
					 
					
						
						
							
							Add call to compute diffs via local PHP server  
						
						
						
						
						This is inefficient as it requires an individual request per diff.
Going to try collecting the revision texts to reduce communication
overhead.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-23 13:09:27 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							0d9ab003f0 
							
						 
					 
					
						
						
							
							Fix tests for new field  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 12:44:07 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							4bbed4a196 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' into test-parquet  
						
						
						
					 
					
						2025-06-17 12:20:19 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							11d2587471 
							
						 
					 
					
						
						
							
							Add docs and rename import pc -> pacsv  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 11:46:16 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							586ae85c65 
							
						 
					 
					
						
						
							
							Conform to 3.9 union type formatting  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 11:41:46 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							390499dd90 
							
						 
					 
					
						
						
							
							Pin to python 3.9  
						
						
						
						
						Since our execution environment requires this
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 11:37:20 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							84d464ea38 
							
						 
					 
					
						
						
							
							Remove unnecessary re-conversion to list(revs)  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 11:23:24 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							3e8ae205e8 
							
						 
					 
					
						
						
							
							Factor out revision mutation logic into its own function  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-17 11:02:45 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							8c707f5ef3 
							
						 
					 
					
						
						
							
							Remove unused code  
						
						
						
						
						This should help PR readability.
There is likely still some unused code, but that should be the
bulk of it.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 17:20:05 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							b50c51a215 
							
						 
					 
					
						
						
							
							Get regex working  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 16:02:18 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							89465b29f4 
							
						 
					 
					
						
						
							
							Re-add special case where revert radius is zero  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 15:18:21 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							17c7f208ab 
							
						 
					 
					
						
						
							
							Add collapsed_revs back  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 15:08:57 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							123b9a18a8 
							
						 
					 
					
						
						
							
							Fix revert column behavior  
						
						
						
						
						Now all columns are tested in the parquet test.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 15:03:33 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							06a784ef27 
							
						 
					 
					
						
						
							
							Get columnar refactor partially working  
						
						
						
						
						Noargs works, now to do persistence.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 12:51:31 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							8b0f775610 
							
						 
					 
					
						
						
							
							Begin move to columnar types  
						
						
						
						
						This will allow making columns optional, as desired, and make
adding new columns straightforward without impacting existing
behavior.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-03 08:52:57 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							f916af9836 
							
						 
					 
					
						
						
							
							Allow specifying output file basename instead of just directory  
						
						
						
						
						This is optional and doesn't affect existing users: behavior when
users specify an output directory is unchanged.
Tests no longer need to copy large files as part of their execution,
since they can ask for files to be written to explicit locations.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-06-02 14:13:13 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							9ee5ecfc91 
							
						 
					 
					
						
						
							
							Separate revision iteration and field collation logic  
						
						
						
						
						This way we're not adding temporary fields to objects that don't
normally have these fields.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-30 14:09:16 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							f9383440a0 
							
						 
					 
					
						
						
							
							Fix tests  
						
						
						
						
						Surprisingly, replacing list<str> with str doesn't break anything,
even baselines.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-30 13:56:31 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							032fec3198 
							
						 
					 
					
						
						
							
							Remove unnecessary urlencode tests  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-30 13:20:10 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							0d56267ae0 
							
						 
					 
					
						
						
							
							Get parquet libraries writing files  
						
						
						
						
						Tests broken due to url encoding, which can likely now be removed.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-30 13:06:26 -05:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							260e2b177c 
							
						 
					 
					
						
						
							
							fix order of fields.  
						
						
						
					 
					
						2025-05-29 18:32:16 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							a13d7f1deb 
							
						 
					 
					
						
						
							
							typo fix.  
						
						
						
					 
					
						2025-05-29 18:25:08 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							ffbd180001 
							
						 
					 
					
						
						
							
							make editorid null not '' in parquet.  
						
						
						
					 
					
						2025-05-29 18:24:33 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							606a399450 
							
						 
					 
					
						
						
							
							handle empty comments which are 'False' somehow.  
						
						
						
					 
					
						2025-05-29 18:14:58 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							a9f76a0f62 
							
						 
					 
					
						
						
							
							change order of fields.  
						
						
						
					 
					
						2025-05-29 18:10:59 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							f39ceefa4a 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support  
						
						
						
					 
					
						2025-05-29 18:05:28 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							13ee160708 
							
						 
					 
					
						
						
							
							bugfix.  
						
						
						
					 
					
						2025-05-29 18:04:41 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							bd22d26291 
							
						 
					 
					
						
						
							
							update deps and add edit_summary to wikiq output.  
						
						
						
					 
					
						2025-05-29 18:02:14 -07:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							4dde25c508 
							
						 
					 
					
						
						
							
							Refactor revision logic to make more straightforward  
						
						
						
						
						Use groupby so we don't have to deal with edge cases and compare
revisions directly.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-29 15:46:45 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							aec6e5fafa 
							
						 
					 
					
						
						
							
							Refactor collapse user logic  
						
						
						
						
						Use simple loop for when we aren't collapsing users.
Add test which covers case when users are deleted.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-29 15:20:34 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							c0e629a313 
							
						 
					 
					
						
						
							
							Add ability to disable revert detection  
						
						
						
						
						Also add test to ensure functionality works.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-29 11:59:10 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							9009bb6fa4 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' into test-parquet  
						
						
						
					 
					
						2025-05-29 10:21:30 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							ab280dd765 
							
						 
					 
					
						
						
							
							Remove requirements.txt and add uv.lock to ignored files.  
						
						
						
						
						We can choose to check in uv.lock later if we want.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-29 10:05:49 -05:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							22d14dc5f2 
							
						 
					 
					
						
						
							
							Remove dependency on pytest.  
						
						
						
					 
					
						2025-05-28 21:54:31 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							5a10f59dc4 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support  
						
						
						
					 
					
						2025-05-28 23:52:59 -05:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							b8cdc82fc2 
							
						 
					 
					
						
						
							
							add ipython for dev  
						
						
						
					 
					
						2025-05-28 23:52:37 -05:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							2a2b611d79 
							
						 
					 
					
						
						
							
							Fix issue with .7z archives  
						
						
						
						
						Previously, only Fandom wiki dumps were compressed with .7z.
These archives can contain several .xml files, not just one,
so we need a flag for the fandom-2020 dumps.
This fixes the bug so that .7z archives work in either case. 
						
					 
					
						2025-05-28 21:49:11 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							39fec0820d 
							
						 
					 
					
						
						
							
							use my version of mwxml since it fixes a bug.  
						
						
						
					 
					
						2025-05-28 21:13:18 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							383ee03250 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support  
						
						
						
					 
					
						2025-05-28 21:09:13 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							15e9234903 
							
						 
					 
					
						
						
							
							adding pyproject.toml  
						
						
						
					 
					
						2025-05-28 20:59:55 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							8c7d46472f 
							
						 
					 
					
						
						
							
							Merge branch 'parquet_support' of code:mediawiki_dump_tools into parquet_support  
						
						
						
					 
					
						2025-05-28 20:54:52 -07:00 
						 
				 
			
				
					
						
							
							
								Nathan TeBlunthuis 
							
						 
					 
					
						
						
						
						
							
						
						
							3c7fb088d6 
							
						 
					 
					
						
						
							
							fix schema bugs.  
						
						
						
					 
					
						2025-05-28 20:54:42 -07:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							ee01ce3e61 
							
						 
					 
					
						
						
							
							Get Parquet test working  
						
						
						
						
						This requires some data smoothing to get read_table and read_parquet
DataFrames to look close enough, but the test now passes and validates
that the data match.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 16:48:58 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							52757a8239 
							
						 
					 
					
						
						
							
							Add noargs test for ikwiki  
						
						
						
						
						This way we can ensure that the parquet code outputs equivalent output.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 15:04:10 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							d413443740 
							
						 
					 
					
						
						
							
							Add numpy to environment  
						
						
						
						
						Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 13:20:28 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							3f94144b1b 
							
						 
					 
					
						
						
							
							Begin adding test for parquet export  
						
						
						
						
						Changed the logic for handling anonymous edits so that wikiq handles
the type for editor ids consistently. Parquet can mix int64 and
None, but not int64 and strings; previously the code used the empty
string to denote anonymous editors.
Tests failing. Don't merge yet.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 13:17:30 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							df0ad1de63 
							
						 
					 
					
						
						
							
							Finish test standardization  
						
						
						
						
						Test logic is executed within the WikiqTestCase, while WikiqTester
handles creating and managing the variables tests need.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 10:11:58 -05:00 
						 
				 
			
				
					
						
							
							
								Will Beason 
							
						 
					 
					
						
						
						
						
							
						
						
							f3e6cc9392 
							
						 
					 
					
						
						
							
							Begin refactor of tests to make new tests easier to write  
						
						
						
						
						Handle file naming logic centrally rather than requiring a dedicated
class per input file.
Signed-off-by: Will Beason <willbeason@gmail.com> 
						
					 
					
						2025-05-28 09:11:36 -05:00