Commit Graph

  • 3e8ae205e8 Factor out revision mutation logic into its own function Will Beason 2025-06-17 11:02:45 -05:00
  • 8c707f5ef3 Remove unused code Will Beason 2025-06-03 17:20:05 -05:00
  • b50c51a215 Get regex working Will Beason 2025-06-03 16:02:18 -05:00
  • 89465b29f4 Re-add special case where revert radius is zero Will Beason 2025-06-03 15:18:21 -05:00
  • 17c7f208ab Add collapsed_revs back Will Beason 2025-06-03 15:08:57 -05:00
  • 123b9a18a8 Fix revert column behavior Will Beason 2025-06-03 15:03:33 -05:00
  • 06a784ef27 Get columnar refactor partially working Will Beason 2025-06-03 12:51:31 -05:00
  • 8b0f775610 Begin move to columnar types Will Beason 2025-06-03 08:52:57 -05:00
  • f916af9836 Allow specifying output file basename instead of just directory Will Beason 2025-06-02 14:13:13 -05:00
  • 9ee5ecfc91 Separate revision iteration and field collation logic Will Beason 2025-05-30 14:09:16 -05:00
  • f9383440a0 Fix tests Will Beason 2025-05-30 13:56:31 -05:00
  • 032fec3198 Remove unnecessary urlencode tests Will Beason 2025-05-30 13:20:10 -05:00
  • 0d56267ae0 Get parquet libraries writing files Will Beason 2025-05-30 13:06:26 -05:00
  • 260e2b177c fix order of fields. Nathan TeBlunthuis 2025-05-29 18:32:16 -07:00
  • a13d7f1deb typo fix. Nathan TeBlunthuis 2025-05-29 18:25:08 -07:00
  • ffbd180001 make editorid null not '' in parquet. Nathan TeBlunthuis 2025-05-29 18:24:33 -07:00
  • 606a399450 handle empty comments which are 'False' somehow. Nathan TeBlunthuis 2025-05-29 18:14:58 -07:00
  • a9f76a0f62 change order of fields. Nathan TeBlunthuis 2025-05-29 18:10:59 -07:00
  • f39ceefa4a Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-29 18:05:28 -07:00
  • 13ee160708 bugfix. Nathan TeBlunthuis 2025-05-29 18:04:41 -07:00
  • bd22d26291 update deps and add edit_summary to wikiq output. Nathan TeBlunthuis 2025-05-29 18:02:14 -07:00
  • 4dde25c508 Refactor revision logic to make more straightforward Will Beason 2025-05-29 15:46:45 -05:00
  • aec6e5fafa Refactor collapse user logic Will Beason 2025-05-29 15:20:34 -05:00
  • c0e629a313 Add ability to disable revert detection Will Beason 2025-05-29 11:59:10 -05:00
  • 9009bb6fa4 Merge branch 'parquet_support' into test-parquet Will Beason 2025-05-29 10:21:30 -05:00
  • ab280dd765 Remove requirements.txt and add uv.lock to ignored files. Will Beason 2025-05-29 10:05:49 -05:00
  • 22d14dc5f2 Remove dependency on pytest. Nathan TeBlunthuis 2025-05-28 21:54:31 -07:00
  • 5a10f59dc4 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 23:52:59 -05:00
  • b8cdc82fc2 add ipython for dev Nathan TeBlunthuis 2025-05-28 23:52:37 -05:00
  • 2a2b611d79 Fix issue with .7z archives Nathan TeBlunthuis 2025-05-28 21:31:41 -07:00
  • 39fec0820d use my version of mwxml since it fixes a bug. Nathan TeBlunthuis 2025-05-28 21:13:18 -07:00
  • 383ee03250 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 21:09:13 -07:00
  • 15e9234903 adding pyproject.toml (ref: parquet_support) Nathan TeBlunthuis 2025-05-28 20:59:55 -07:00
  • 8c7d46472f Merge branch 'parquet_support' of code:mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 20:54:52 -07:00
  • 3c7fb088d6 fix schema bugs. Nathan TeBlunthuis 2025-05-28 20:54:42 -07:00
  • ee01ce3e61 Get Parquet test working Will Beason 2025-05-28 16:48:58 -05:00
  • 52757a8239 Add noargs test for ikwiki Will Beason 2025-05-28 15:04:10 -05:00
  • d413443740 Add numpy to environment Will Beason 2025-05-28 13:20:28 -05:00
  • 3f94144b1b Begin adding test for parquet export Will Beason 2025-05-28 13:17:30 -05:00
  • df0ad1de63 Finish test standardization Will Beason 2025-05-28 10:11:58 -05:00
  • f3e6cc9392 Begin refactor of tests to make new tests easier to write Will Beason 2025-05-28 09:11:36 -05:00
  • c8b14c3303 Refactor test temporary file logic and wikiq call pattern Will Beason 2025-05-27 16:24:07 -05:00
  • 4d3900b541 Standardize calling for wikiq in tests Will Beason 2025-05-27 14:27:49 -05:00
  • ebc57864f2 Make tests runnable from anywhere Will Beason 2025-05-27 13:40:57 -05:00
  • 3d0bf89938 Move main logic to main() Will Beason 2025-05-27 11:10:42 -05:00
  • 6d133575c7 Remove resource leaks from tests Will Beason 2025-05-26 15:08:47 -05:00
  • 09a84e7d11 Reformat Wikiq_Unit_Test.py Will Beason 2025-05-26 15:07:39 -05:00
  • 9c5bf577e6 Remove unused dependencies and fix spacing Will Beason 2025-05-26 14:15:01 -05:00
  • 4804ecc4b3 Add additional test dependencies Will Beason 2025-05-26 12:29:49 -05:00
  • 7a4c41159c Exclude JetBrains config folder in .gitignore Will Beason 2025-05-26 10:48:17 -05:00
  • 933ca753ed code review. (ref: mako_changes-20230429) Nathan TeBlunthuis 2023-05-03 10:23:30 -07:00
  • 54fa6221a8 fix because pandas testing API has changed Benjamin Mako Hill 2023-04-29 11:52:13 -07:00
  • 9dcd337315 rename variables to be more consistent Benjamin Mako Hill 2023-04-29 11:44:48 -07:00
  • 2ff4d60613 added counting functionality to regex code Benjamin Mako Hill 2023-04-29 11:40:03 -07:00
  • 4729371d5a updated README file Benjamin Mako Hill 2023-04-28 14:40:18 -07:00
  • 7e6cd5b386 make sure that content is defined before testing for search patterns Benjamin Mako Hill 2023-04-28 14:30:42 -07:00
  • 556285b198 added a line to fix persistence with deleted revs Benjamin Mako Hill 2023-04-28 14:21:21 -07:00
  • b124f9c7c8 write regex captures to parquet arrays. (ref: redirects) Nathan TeBlunthuis 2022-03-29 17:52:26 -07:00
  • 32283aa4da add a minor comment on the source of the redirect regex Nathan TeBlunthuis 2022-03-10 15:07:27 -08:00
  • 595728d8da resolve redirects if siteinfo is provided Nathan TeBlunthuis 2022-03-10 13:31:03 -08:00
  • 3e645b5e58 bugfix. column name text_chars Nathan TeBlunthuis 2022-03-08 20:17:20 -08:00
  • 1aea601a30 [Bugfix] Call the correct matchmake function. Nathan TeBlunthuis 2021-11-16 16:53:21 -08:00
  • c437b357db rename matchmake functions Nathan TeBlunthuis 2021-11-11 19:09:41 -08:00
  • bb83d62b74 Add some descriptive comments. Nathan TeBlunthuis 2021-10-19 16:55:24 -07:00
  • c285402683 add todos to readme Nathan TeBlunthuis 2021-10-18 14:14:11 -07:00
  • b1bea09ad6 fix bugs and unit tests Nathan TeBlunthuis 2021-10-18 13:33:05 -07:00
  • 9a0c157ebb bugfix Nathan TeBlunthuis 2021-10-18 10:15:03 -07:00
  • ae870fed0b parquet path is code-complete Nathan TeBlunthuis 2021-10-17 21:46:31 -07:00
  • 26f6d8f984 remove dependency on pandas. Nathan TeBlunthuis 2021-10-17 20:24:33 -07:00
  • ae9a241747 use dataclasses and pyarrow for types. Nathan TeBlunthuis 2021-10-17 20:21:22 -07:00
  • d8d20f670b initial work on parquet support Nathan TeBlunthuis 2021-10-17 13:22:22 -07:00
  • 950ed8fde9 regex scanner groups findall tuple bug fixed (ref: regex_scanner) sohyeonhwang 2019-12-12 07:47:07 -06:00
  • 097c60a7bc handling empty text sohyeonhwang 2019-12-03 16:44:53 -06:00
  • cdfa77d66d remove commented code (ref: master) Nathan TeBlunthuis 2019-11-11 11:28:48 -08:00
  • 02b3250a36 refactor regex matching in a tidier object oriented style Nathan TeBlunthuis 2019-11-09 13:07:46 -08:00
  • 414cc5ff2d validate tests and add asserts and baselines for regex tests. Nathan TeBlunthuis 2019-11-09 12:19:55 -08:00
  • 4ccde84529 added regex scanner v2's dump unit test file regextest.xml.bz2 sohyeonhwang 2019-11-07 14:06:15 -06:00
  • f147e1d899 merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests sohyeonhwang 2019-11-07 13:28:17 -06:00
  • c84844cfb5 add unit tests for configuring revert_radius groceryheist 2019-10-07 15:02:30 -07:00
  • c4416d0f1b make revert radius configurable groceryheist 2019-10-07 13:57:49 -07:00
  • 7b856bec86 Merge branch 'master' into regex_scanner groceryheist 2019-10-05 18:17:03 -07:00
  • 324ccc8e26 update baseline outputs groceryheist 2019-10-05 16:36:07 -07:00
  • 17529cdd48 bugfix, remove old legacy persistence flag groceryheist 2019-10-05 16:13:11 -07:00
  • 7bf4559ceb changes for regex scanner addition sohyeonhwang 2019-10-05 15:36:58 -05:00
  • fb052ffa33 edont compute persistence by default groceryheist 2019-09-22 15:54:17 -07:00
  • e871023ff5 elaborate docstring for persistence groceryheist 2019-09-22 15:11:59 -07:00
  • 2d5008113b add flag for excluding whitespace and punctuation tests groceryheist 2018-12-12 16:38:47 -08:00
  • 19eda6dd0e use only a part of the sailormoon wiki groceryheist 2018-12-12 16:12:57 -08:00
  • 4089ebae92 create state where all tests pass groceryheist 2018-12-12 16:08:00 -08:00
  • 1b81b9542d make sailormoon smaller groceryheist 2018-12-12 14:57:03 -08:00
  • 9c5a1b18f0 add test files groceryheist 2018-12-12 14:56:48 -08:00
  • 26ea272114 wikiq mostly functional, but reverters take all the credit for the content they restore. groceryheist 2018-12-11 19:36:49 -08:00
  • 0c2d72b881 checking in work to deepen migration to new mediawikiutils groceryheist 2018-12-11 00:31:50 -08:00
  • 7d62ff9fb7 improve help for namespace-include groceryheist 2018-09-03 11:30:12 -07:00
  • f7f5bf8fd4 sub assertEquals assertEqual (ref: advanced_persistence) groceryheist 2018-09-03 11:21:49 -07:00
  • f784c77f60 add namespace filter parameter Nate E TeBlunthuis 2018-08-23 18:25:08 -07:00
  • df18d6e280 Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq (ref: user_level_wikiq) groceryheist 2018-08-31 16:03:07 -07:00
  • 3af71f03e0 Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq groceryheist 2018-08-31 16:02:05 -07:00
  • 1d5a9b53b8 Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq groceryheist 2018-08-31 16:02:05 -07:00
  • cc551eef6e Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq groceryheist 2018-08-31 16:01:07 -07:00