Benjamin Mako Hill 2ff4d60613 added counting functionality to regex code
The regex code has historically returned the actual matched patterns and the
named capture groups within regexes.  When trying to count common and/or large
patterns, this leads to very large outputs.

I've added two new options, -RPc and -CPc, that cause wikiq to return
counts of each pattern (0 when there are no matches). The options apply to all
comment or revision patterns. I considered interfaces that would make it
possible to count some patterns but not others, but concluded this would be too
complicated an interface.
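
The counting behavior described above might look roughly like the following
minimal sketch. It assumes Python's standard ``re`` module; the helper name
``count_matches`` and the example labels and patterns are hypothetical and are
not taken from wikiq's code.

```python
import re


def count_matches(pattern, text):
    """Return the number of non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


# Hypothetical labels and patterns, for illustration only.
patterns = {
    "link": r"\[\[[^\]]+\]\]",
    "template": r"\{\{[^}]+\}\}",
}

text = "See [[Main Page]] and [[Help]]."

# One count per pattern; a pattern with no matches yields 0 rather than
# being left out of the result.
counts = {label: count_matches(p, text) for label, p in patterns.items()}
```

Returning one integer per pattern keeps the output size fixed no matter how
often or how widely a pattern matches.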

This code should be checked before it's merged.
2023-04-29 11:40:03 -07:00
test validate tests and add asserts and baselines for regex tests. 2019-11-09 12:19:55 -08:00
.gitignore added list of compressed dump files to .gitignore 2015-07-23 12:16:31 -07:00
.gitmodules migrate to mwpersistence. this fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy. 2018-07-04 19:06:07 -07:00
README.rst updated README file 2023-04-28 14:40:18 -07:00
wikiq added counting functionality to regex code 2023-04-29 11:40:03 -07:00

When you install this from git, you will need to first clone the repository::

  git clone git://projects.mako.cc/mediawiki_dump_tools

From within the repository working directory, initialize and set up the
submodule like::

  git submodule init
  git submodule update


Wikimedia dumps are usually in a compressed format such as 7z (most common),
gz, or bz2. Wikiq uses your computer's compression software to read these
files. Therefore wikiq depends on `7za`, `gzcat`, and `zcat`. 
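
As a rough illustration of reading a dump through an external decompressor,
the sketch below picks a command from the file extension and streams its
output. This is not wikiq's actual code: the mapping, the function names, and
the use of ``bzcat`` for bz2 files are assumptions made for this example.

```python
import subprocess

# Assumed mapping from file suffix to decompression command (illustrative
# only; wikiq's own choices may differ). Each command writes the
# decompressed data to standard output.
DECOMPRESSORS = {
    ".7z": ["7za", "x", "-so"],
    ".gz": ["zcat"],
    ".bz2": ["bzcat"],
}


def decompress_command(filename):
    """Build the command that writes the decompressed file to stdout."""
    for suffix, cmd in DECOMPRESSORS.items():
        if filename.endswith(suffix):
            return cmd + [filename]
    # Unrecognized suffix: assume the file is not compressed.
    return ["cat", filename]


def open_dump(filename):
    """Start the decompressor and return a readable binary stream."""
    proc = subprocess.Popen(decompress_command(filename),
                            stdout=subprocess.PIPE)
    return proc.stdout
```

Streaming from a subprocess means the dump never has to be fully decompressed
on disk or held in memory, which matters for large dumps.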

There are also a series of Python dependencies. You can install these using pip
with a command like::

  pip3 install mwbase mwreverts mwxml mwtypes mwcli mwdiffs mwpersistence