When you install this from git, you will need to first clone the repository:
git clone git://projects.mako.cc/mediawiki_dump_tools
From within the repository working directory, initialize and set up the submodule as follows:
git submodule init
git submodule update
Wikimedia dumps are usually in a compressed format such as 7z (most common), gz, or bz2. Wikiq uses your computer's compression software to read these files. Therefore wikiq depends on `7za`, `gzcat`, and `zcat`.
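As a rough illustration of reading a dump through the system's compression tools, the sketch below picks a command from the file extension and streams its output. It is not wikiq's actual code: the names `DECOMPRESSORS`, `decompress_command`, and `open_dump` are hypothetical, and the use of `bzcat` for bz2 files is an assumption, since only `7za`, `gzcat`, and `zcat` are named above.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to a command that writes the
# decompressed data to stdout. "bzcat" is an assumption; the README only
# names 7za, gzcat, and zcat.
DECOMPRESSORS = {
    ".7z": ["7za", "x", "-so"],
    ".gz": ["zcat"],
    ".bz2": ["bzcat"],
}


def decompress_command(path):
    """Return the argument list that would decompress `path` to stdout."""
    suffix = Path(path).suffix.lower()
    if suffix not in DECOMPRESSORS:
        raise ValueError(f"unsupported dump format: {suffix!r}")
    return DECOMPRESSORS[suffix] + [str(path)]


def open_dump(path):
    """Start the external tool and return the process.

    The caller reads the decompressed XML from `proc.stdout` and should
    call `proc.wait()` when finished.
    """
    return subprocess.Popen(decompress_command(path), stdout=subprocess.PIPE)
```

Streaming from a pipe in this way avoids writing the decompressed dump to disk, which matters because full dumps can be very large.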
Dependencies
These non-Python dependencies must be installed on your system for wikiq and its associated tests to work.
- 7zip
- ffmpeg
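One quick way to confirm that these tools are available is to look them up on the `PATH`. The sketch below assumes the executables are named `7za` and `ffmpeg`, which may differ between distributions; `missing_tools` is a hypothetical helper, not part of wikiq.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Executable names are assumptions; adjust them for your system.
    missing = missing_tools(["7za", "ffmpeg"])
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All non-Python dependencies were found.")
```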
A new diff engine based on `wikidiff2` can be used for word persistence. Wikiq can also output the diffs between each page revision. This requires installing wikidiff2 on your system. On Debian or Ubuntu Linux, this can be done via:
apt-get install php-wikidiff2
You may also have to run:
sudo phpenmod wikidiff2
Tests
To run tests:
python -m unittest test.Wikiq_Unit_Test
TODO:
- versions of deltas?