Commit Graph

  • 83c92d1a37 decrease moved paragraph detection cutoff to see if that fixes memory issue. (compute-diffs) Nathan TeBlunthuis 2025-07-22 13:29:01 -0700
  • 076df15740 force garbage collection. Nathan TeBlunthuis 2025-07-22 13:13:18 -0700
  • 6557e25af7 make a new pywikidiff2 object for each revision to reduce memory. Nathan TeBlunthuis 2025-07-22 09:50:30 -0700
  • d20075b323 add memray for debugging memory usage. Nathan TeBlunthuis 2025-07-17 15:17:23 -0700
  • 6d03cac28d decrease batch_size. Nathan TeBlunthuis 2025-07-15 19:37:26 -0700
  • 3a44cfd4da increase batch size. Nathan TeBlunthuis 2025-07-15 19:09:36 -0700
  • 0fbe788e31 use ichunked instead of chunked. Nathan TeBlunthuis 2025-07-15 18:25:44 -0700
  • 37d095199a inc version. Nathan TeBlunthuis 2025-07-15 15:37:55 -0700
  • 6b04791de2 reduce batch size. Nathan TeBlunthuis 2025-07-15 15:31:00 -0700
  • 507335941d Revert "Merge branch 'compute-diffs' into HEAD" Nathan TeBlunthuis 2025-07-15 15:23:50 -0700
  • 907a35323e Merge branch 'compute-diffs' into HEAD Nathan TeBlunthuis 2025-07-15 15:23:13 -0700
  • c40506137b make wikiq memory efficient again via batch processing. Nathan TeBlunthuis 2025-07-15 15:20:17 -0700
  • e53e7ada5d try fixing the memory problem. Nathan TeBlunthuis 2025-07-14 18:58:27 -0700
  • 76d54ae597 support partitioning output parquet by namespace. Nathan TeBlunthuis 2025-07-07 20:58:43 -0700
  • c9fb94ccc0 fix tests. Nathan TeBlunthuis 2025-07-07 20:25:00 -0700
  • ac1dd47b08 Merge branch 'compute-diffs' of gitea:collective/mediawiki_dump_tools into compute-diffs Nathan TeBlunthuis 2025-07-07 20:16:38 -0700
  • c597a6b7f4 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -0700
  • a2984bc656 refactor into src-layout package. Nathan TeBlunthuis 2025-07-07 20:13:17 -0700
  • 56c90fe1cc add missing files + add sorted_columns metadata. Nathan TeBlunthuis 2025-07-07 19:08:31 -0700
  • d6c4c0a416 add (optional) diff and text columns to output. Nathan TeBlunthuis 2025-07-07 14:39:52 -0700
  • a8e9e7f4fd wikidiff2 integration: pwr complete. Nathan TeBlunthuis 2025-07-07 12:06:43 -0700
  • 58c595bf0b add test files. Nathan TeBlunthuis 2025-07-07 11:22:56 -0700
  • cc96bb5f3f remove server. Nathan TeBlunthuis 2025-07-07 11:21:16 -0700
  • 14e819e565 compare pywikidiff2 to making requests to wikidiff2. Nathan TeBlunthuis 2025-07-07 10:51:11 -0700
  • 4654911533 almost there. working out edge cases. Nathan TeBlunthuis 2025-07-03 21:32:44 -0700
  • cf1fb61a84 WIP: fixing bugs and adding newlines to output. Nathan TeBlunthuis 2025-07-02 13:31:32 -0700
  • c4acc711d2 finish support for paragraph move. Nathan TeBlunthuis 2025-07-01 11:16:08 -0700
  • 20de5b93f9 Merge branch 'tmp' into compute-diffs Nathan TeBlunthuis 2025-06-30 20:52:23 -0500
  • 37734ed092 add test. Nathan TeBlunthuis 2025-06-30 15:45:56 -0700
  • 5a3e4102b5 got wikidiff2 persistence working except for paragraph moves. Nathan TeBlunthuis 2025-06-30 15:37:54 -0700
  • 186cb82fb8 some work on wiki_diff_matcher.py Nathan TeBlunthuis 2025-06-27 07:13:41 -0700
  • bc7f186112 Start interoperability between wikidiff2 and deltas Will Beason 2025-06-26 16:08:50 -0500
  • 1ec8bfaad4 Add php.ini file Will Beason 2025-06-24 09:24:35 -0500
  • 94454ffca3 Add PHP server file Will Beason 2025-06-23 14:17:53 -0500
  • 62db384aa4 Pass arrays of diffs instead of incremental Will Beason 2025-06-23 14:17:01 -0500
  • 96915a074b Add call to compute diffs via local PHP server Will Beason 2025-06-23 13:09:27 -0500
  • 0d9ab003f0 Fix tests for new field parquet_support Will Beason 2025-06-17 12:44:07 -0500
  • 4bbed4a196 Merge branch 'parquet_support' into test-parquet Will Beason 2025-06-17 12:20:19 -0500
  • 11d2587471 Add docs and rename import pc -> pacsv Will Beason 2025-06-17 11:46:16 -0500
  • 586ae85c65 Conform to 3.9 union type formatting Will Beason 2025-06-17 11:41:46 -0500
  • 390499dd90 Pin to python 3.9 Will Beason 2025-06-17 11:37:20 -0500
  • 84d464ea38 Remove unnecessary re-conversion to list(revs) Will Beason 2025-06-17 11:23:24 -0500
  • 3e8ae205e8 Factor out revision mutation logic into its own function Will Beason 2025-06-17 11:02:45 -0500
  • 8c707f5ef3 Remove unused code Will Beason 2025-06-03 17:20:05 -0500
  • b50c51a215 Get regex working Will Beason 2025-06-03 16:02:18 -0500
  • 89465b29f4 Re-add special case where revert radius is zero Will Beason 2025-06-03 15:18:21 -0500
  • 17c7f208ab Add collapsed_revs back Will Beason 2025-06-03 15:08:57 -0500
  • 123b9a18a8 Fix revert column behavior Will Beason 2025-06-03 15:03:33 -0500
  • 06a784ef27 Get columnar refactor partially working Will Beason 2025-06-03 12:51:31 -0500
  • 8b0f775610 Begin move to columnar types Will Beason 2025-06-03 08:52:57 -0500
  • f916af9836 Allow specifying output file basename instead of just directory Will Beason 2025-06-02 14:13:13 -0500
  • 9ee5ecfc91 Separate revision iteration and field collation logic Will Beason 2025-05-30 14:09:16 -0500
  • f9383440a0 Fix tests Will Beason 2025-05-30 13:56:31 -0500
  • 032fec3198 Remove unnecessary urlencode tests Will Beason 2025-05-30 13:20:10 -0500
  • 0d56267ae0 Get parquet libraries writing files Will Beason 2025-05-30 13:06:26 -0500
  • 260e2b177c fix order of fields. Nathan TeBlunthuis 2025-05-29 18:32:16 -0700
  • a13d7f1deb typo fix. Nathan TeBlunthuis 2025-05-29 18:25:08 -0700
  • ffbd180001 make editorid null not '' in parquet. Nathan TeBlunthuis 2025-05-29 18:24:33 -0700
  • 606a399450 handle empty comments which are 'False' somehow. Nathan TeBlunthuis 2025-05-29 18:14:58 -0700
  • a9f76a0f62 change order of fields. Nathan TeBlunthuis 2025-05-29 18:10:59 -0700
  • f39ceefa4a Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-29 18:05:28 -0700
  • 13ee160708 bugfix. Nathan TeBlunthuis 2025-05-29 18:04:41 -0700
  • bd22d26291 update deps and add edit_summary to wikiq output. Nathan TeBlunthuis 2025-05-29 18:02:14 -0700
  • 4dde25c508 Refactor revision logic to make more straightforward Will Beason 2025-05-29 15:46:45 -0500
  • aec6e5fafa Refactor collapse user logic Will Beason 2025-05-29 15:20:34 -0500
  • c0e629a313 Add ability to disable revert detection Will Beason 2025-05-29 11:59:10 -0500
  • 9009bb6fa4 Merge branch 'parquet_support' into test-parquet Will Beason 2025-05-29 10:21:30 -0500
  • ab280dd765 Remove requirements.txt and add uv.lock to ignored files. Will Beason 2025-05-29 10:05:49 -0500
  • 22d14dc5f2 Remove dependency on pytest. Nathan TeBlunthuis 2025-05-28 21:54:31 -0700
  • 5a10f59dc4 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 23:52:59 -0500
  • b8cdc82fc2 add ipython for dev Nathan TeBlunthuis 2025-05-28 23:52:37 -0500
  • 2a2b611d79 Fix issue with .7z archives Nathan TeBlunthuis 2025-05-28 21:31:41 -0700
  • 39fec0820d use my version of mwxml since it fixes a bug. Nathan TeBlunthuis 2025-05-28 21:13:18 -0700
  • 383ee03250 Merge branch 'parquet_support' of gitea:collective/mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 21:09:13 -0700
  • 15e9234903 adding pyproject.toml Nathan TeBlunthuis 2025-05-28 20:59:55 -0700
  • 8c7d46472f Merge branch 'parquet_support' of code:mediawiki_dump_tools into parquet_support Nathan TeBlunthuis 2025-05-28 20:54:52 -0700
  • 3c7fb088d6 fix schema bugs. Nathan TeBlunthuis 2025-05-28 20:54:42 -0700
  • ee01ce3e61 Get Parquet test working Will Beason 2025-05-28 16:48:58 -0500
  • 52757a8239 Add noargs test for ikwiki Will Beason 2025-05-28 15:04:10 -0500
  • d413443740 Add numpy to environment Will Beason 2025-05-28 13:20:28 -0500
  • 3f94144b1b Begin adding test for parquet export Will Beason 2025-05-28 13:17:30 -0500
  • df0ad1de63 Finish test standardization (will-setup) Will Beason 2025-05-28 10:11:58 -0500
  • f3e6cc9392 Begin refactor of tests to make new tests easier to write Will Beason 2025-05-28 09:11:36 -0500
  • c8b14c3303 Refactor test temporary file logic and wikiq call pattern Will Beason 2025-05-27 16:24:07 -0500
  • 4d3900b541 Standardize calling for wikiq in tests Will Beason 2025-05-27 14:27:49 -0500
  • ebc57864f2 Make tests runnable from anywhere Will Beason 2025-05-27 13:40:57 -0500
  • 3d0bf89938 Move main logic to main() Will Beason 2025-05-27 11:10:42 -0500
  • 6d133575c7 Remove resource leaks from tests Will Beason 2025-05-26 15:08:47 -0500
  • 09a84e7d11 Reformat Wikiq_Unit_Test.py Will Beason 2025-05-26 15:07:39 -0500
  • 9c5bf577e6 Remove unused dependencies and fix spacing Will Beason 2025-05-26 14:15:01 -0500
  • 4804ecc4b3 Add additional test dependencies Will Beason 2025-05-26 12:29:49 -0500
  • 7a4c41159c Exclude JetBrains config folder in .gitignore Will Beason 2025-05-26 10:48:17 -0500
  • 933ca753ed code review. (mako_changes-20230429) Nathan TeBlunthuis 2023-05-03 10:23:30 -0700
  • 54fa6221a8 fix because pandas testing API has changed Benjamin Mako Hill 2023-04-29 11:52:13 -0700
  • 9dcd337315 rename variables to be more consistent Benjamin Mako Hill 2023-04-29 11:44:48 -0700
  • 2ff4d60613 added counting functionality to regex code Benjamin Mako Hill 2023-04-29 11:40:03 -0700
  • 4729371d5a updated README file Benjamin Mako Hill 2023-04-28 14:40:18 -0700
  • 7e6cd5b386 make sure that content is defined before testing for search patterns Benjamin Mako Hill 2023-04-28 14:30:42 -0700
  • 556285b198 added a line to fix persistence with deleted revs Benjamin Mako Hill 2023-04-28 14:21:21 -0700
  • b124f9c7c8 write regex captures to parquet arrays. (redirects) Nathan TeBlunthuis 2022-03-29 17:52:26 -0700