Initial commit
p# new file: runwikiq.sh
This commit is contained in:
5
mediawiki_dump_tools/.gitignore
vendored
Normal file
@@ -0,0 +1,5 @@
*.xml.gz
*.7z
*.xml.bz2
*.xml.xz
*.swp
3
mediawiki_dump_tools/.gitmodules
vendored
Normal file
@@ -0,0 +1,3 @@
[submodule "Mediawiki-Utilities"]
	path = Mediawiki-Utilities
	url = https://github.com/halfak/Mediawiki-Utilities.git
46
mediawiki_dump_tools/Mediawiki-Utilities/.gitignore
vendored
Normal file
@@ -0,0 +1,46 @@
# Demo files
demo_*

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# Temporary text editor files
*~

# C extensions
*.so

# Distribution / packaging
.Python
env/
bin/
build/
develop-eggs/
dist/
eggs/
#lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.cache
nosetests.xml
coverage.xml

# Sphinx documentation
doc/_build/
doc/.buildfile
*.toctree
19
mediawiki_dump_tools/Mediawiki-Utilities/CHANGE_LOG.rst
Normal file
@@ -0,0 +1,19 @@
v0.4.4
======

Adds API helper for persistence tracking and example script.

v0.4.0
======

Adds api.collections.users

v0.3.8
======

Adds support for spaces in XML dump filenames when using the dump mapper.

v0.3.7
======

Fixes pickling issues in Timestamp
21
mediawiki_dump_tools/Mediawiki-Utilities/LICENSE
Normal file
@@ -0,0 +1,21 @@
The MIT License (MIT)

Copyright (c) 2014 Aaron Halfaker

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
1
mediawiki_dump_tools/Mediawiki-Utilities/MANIFEST.in
Normal file
@@ -0,0 +1 @@
include LICENSE README.rst
25
mediawiki_dump_tools/Mediawiki-Utilities/README.rst
Normal file
@@ -0,0 +1,25 @@
===================
MediaWiki Utilities
===================

MediaWiki Utilities is an open source (MIT Licensed) library developed by Aaron Halfaker for extracting and processing data from MediaWiki installations, slave databases and xml dumps.

**Install with pip:** ``pip install mediawiki-utilities``

**Note:** *Use of this library requires Python 3 or later.*

**Documentation:** http://pythonhosted.org/mediawiki-utilities/

About the author
================
:name:
    Aaron Halfaker
:email:
    aaron.halfaker@gmail.com
:website:
    http://halfaker.info --
    http://en.wikipedia.org/wiki/User:EpochFail

Contributors
============
None yet. See http://github.com/halfak/mediawiki-utilities. Pull requests are encouraged.
63
mediawiki_dump_tools/Mediawiki-Utilities/WORK_LOG.rst
Normal file
@@ -0,0 +1,63 @@
2014-06-02
After some reading, it looks like py3 will do something reasonable with re-raised errors, so I'm just going to let the error be re-raised and call it good.

2014-05-31
I figured out that you just plain can't get a stack trace out of a multiprocessing.Process in such a way that you can re-associate it with its exception on the other side. I'm now working on putting together a picklable container exception that I can use to manage and format the exceptions that come out of a mapping function. It's not going great.
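
The container idea above can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not what the library ended up doing: the traceback is formatted to text inside the worker, and the text travels across the process boundary in an exception whose fields all live in ``args``. The names ``CapturedError`` and ``safe_call`` are hypothetical and do not appear in the library.

```python
import traceback


class CapturedError(Exception):
    """Hypothetical picklable container for an exception raised in a worker."""

    def __init__(self, type_name, message, formatted_traceback):
        # Passing every field to Exception.__init__ keeps them in self.args,
        # which is what pickle uses to rebuild the instance.
        super().__init__(type_name, message, formatted_traceback)
        self.type_name = type_name
        self.message = message
        self.formatted_traceback = formatted_traceback

    @classmethod
    def from_exception(cls, exc):
        # Format the traceback to text while it still exists; traceback
        # objects themselves cannot be pickled.
        text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        return cls(type(exc).__name__, str(exc), text)

    def __str__(self):
        return "{0}: {1}\n{2}".format(
            self.type_name, self.message, self.formatted_traceback
        )


def safe_call(func, *args):
    """Run func and return ("ok", result) or ("error", CapturedError)."""
    try:
        return "ok", func(*args)
    except Exception as exc:
        return "error", CapturedError.from_exception(exc)
```

A mapping function wrapped in ``safe_call`` would hand back plain tuples, so the parent process can decide whether to log the formatted traceback or raise. Round-tripping a ``CapturedError`` through ``pickle.dumps`` and ``pickle.loads`` should preserve all three fields, because they are stored in ``args``.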

2014-04-08
I've been extending the API. I added list=deletedrevs and tested (fixed) the api.Session.login() method. It all seems to work now. I also did some minor cleanup on lib.title.Parser to make the method names more explicit.

I'd like to start tracking changes so that I can build changelists to go with new versions. For now, I'll keep track of substantial changes here.

* Released 0.2.1
* Added list=deletedrevs to api module

2014-03-27
I just fixed up the structure for lib.reverts.database.check() and check_row(). You can give check_row() a database row or check() a rev_id and page_id. The functions should then either return None or the first reverting revision they encounter.
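
To illustrate the "return None or the first reverting revision" contract, here is a minimal sketch of identity-revert detection over an in-memory history. It assumes each revision carries a content checksum and that a revert restores byte-identical content; the function name, the ``(rev_id, checksum)`` tuples and the ``radius`` default are all hypothetical, and this is not the code in lib.reverts.database.

```python
def first_reverting_revision(history, index, radius=15):
    """Return the rev_id of the first revision that reverts history[index].

    history is a chronological list of (rev_id, checksum) tuples.
    A later revision counts as reverting when its checksum matches a
    revision that came before history[index], within `radius` revisions
    on either side. Returns None when no such revision is found.
    """
    # Checksums of the page states that existed shortly before the target.
    earlier = {
        checksum for _, checksum in history[max(0, index - radius):index]
    }
    target_checksum = history[index][1]

    for rev_id, checksum in history[index + 1:index + 1 + radius]:
        # Restoring an earlier state undoes the target, unless the target
        # itself was already identical to that earlier state.
        if checksum in earlier and checksum != target_checksum:
            return rev_id
    return None
```

With the history ``[(1, "a"), (2, "b"), (3, "a"), (4, "c")]``, revision 2 is reverted by revision 3, while revisions 1, 3 and 4 have no reverting revision.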

I like this pattern. Lib gets to reference core, but not vice versa. I need to talk to the Wikimetrics people about implementing some of the metrics within a new lib. Yet, one of the cool things about libs is that they don't necessarily need to be packaged with core. So you could write something that makes use of core and other libs as a standalone package first and incorporate it later. :D

2014-03-20
Just a quick update today. I realized that database.DB.add_args was setting
default values that won't make sense for anyone but me personally. I cleared that up and added a way to set your own defaults.

2014-03-18
Refactoring! I've got a user. He immediately found problems. So I'm fixing them aggressively. I just renamed the library back to "mw". I also renamed the dump processing module to "xml_dump". I hope that these name changes will make more sense.

I also moved the revert detection functionality out of the database module and into the lib.reverts module. I think that this makes more sense. If it is a core functionality, it should live in core. If it is a library, it should only have other libraries depend on it. If I need to write a magical DB abstractor in lib, so be it.

2014-02-08
It's time to kill `mw.lib.changes`. I just don't see that working as a core
part of this library. It might make sense to build up another library
to handle changes. I'll have to get back to that at some other time.

2013-12-23
Still hacking on `mw.lib.changes`. It's the same set of issues described in
the last log. I'm making progress building a params parser. I think that my strategy is going to be to let the user handle params parsing themselves with a new `types.Protection` type.

Oh! And I did get `types.TimestampType` extended to have a `strptime` method.
That's all nice and tested.

Note that I think it might be a good idea to consolidate all defaults for
better documentation.

Anyway. All tests are passing. It's time to work on something else for a
little while.

2013-12-19
Still working on `mw.lib.changes`. I like the structure for the most part. It looks like I'm going to have to join `revision` and `logging` to `recentchanges` in order to construct an appropriate `change.Change` from a row. That means I'm going to need a funny new method on `database.RecentChanges`. That's going to confuse people. Boo.

I also need to figure out a way to configure for the lame timestamp format that appears in blocks and page protections. I think I'm going to extend `types.TimestampType` to have a `strptime` method.

2013-12-18
Tests passing. HistoricalMap was fine. Will be code-complete once lib.changes is done. Still need to figure out how I'm going to configure a title parser and pass it into the change constructor. Also, I rediscovered how stupid the recentchanges table is.

OK.. New lame thing. So, when you "protect" a page, the log keeps the following type of value in log_params:
``\u200e[edit=autoconfirmed] (expires 03:20, 21 November 2013 (UTC))``

That date format... It's not the long or short format for `Timestamp`. I think it is a custom format that changes on a wiki-to-wiki basis.
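
As a rough sketch of what handling that value might look like, the snippet below pulls the expiry out of a ``log_params`` string with the standard library only. It assumes the English-language layout shown above and a locale with English month names; the names ``EXPIRY_RE``, ``EXPIRY_FORMAT`` and ``parse_protection_expiry`` are hypothetical, and the format is left as a parameter precisely because it may differ between wikis.

```python
import re
from datetime import datetime

# Assumed layout: "(expires HH:MM, D Month YYYY (UTC))", as in the example above.
EXPIRY_RE = re.compile(r"\(expires (\d{2}:\d{2}, \d{1,2} \w+ \d{4}) \(UTC\)\)")

# strptime format matching the captured group; %B relies on English month
# names being available in the current locale.
EXPIRY_FORMAT = "%H:%M, %d %B %Y"


def parse_protection_expiry(log_params, fmt=EXPIRY_FORMAT):
    """Return the expiry as a naive UTC datetime, or None if none is found."""
    match = EXPIRY_RE.search(log_params)
    if match is None:
        return None
    return datetime.strptime(match.group(1), fmt)
```

A wiki with a different date layout would need both a different pattern and a different ``fmt``, which is the configuration problem the entry above describes.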

I feel sad. This made my day worse. It's important to remind myself of the fact that MediaWiki was not designed to allow me to reverse engineer it.

2013-12-17
Test on revert detector failing since simplifying restructure. I'm not sure what the issue is, but I suspect that I broke something in util.ordered.HistoricalMap. -halfak
@@ -0,0 +1,5 @@
python3-mediawiki-utilities (0.4.16) UNRELEASED; urgency=medium

  * Initial version of the package

 -- yuvipanda <yuvipanda@riseup.net>  Tue, 04 Aug 2015 16:42:51 -0700
1
mediawiki_dump_tools/Mediawiki-Utilities/debian/compat
Normal file
@@ -0,0 +1 @@
9
18
mediawiki_dump_tools/Mediawiki-Utilities/debian/control
Normal file
@@ -0,0 +1,18 @@
Source: python3-mediawiki-utilities
Maintainer: Aaron Halfaker <aaron.halfaker@gmail.com>
Section: python
Priority: optional
Build-Depends: python3-setuptools, python3-all, debhelper (>= 9), python3-nose, python3-pymysql, python3-requests
Standards-Version: 3.9.6

Package: python3-mediawiki-utilities
Architecture: all
Depends: ${misc:Depends}, ${python3:Depends}
Description: Infrastructure for running webservices on tools.wmflabs.org
 Provides scripts and a python package for running and controlling
 user provided webservices on tools.wmflabs.org.
 .
 webservice-new is the user facing script that can start / stop / restart
 webservices when run from commandline in bastion hosts.
 webservice-runner is the script that starts on the exec hosts and
 exec's to the appropriate command to run the webserver itself.
26
mediawiki_dump_tools/Mediawiki-Utilities/debian/copyright
Normal file
@@ -0,0 +1,26 @@
Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: mediawiki-utilities

Files: *
Copyright: 2014 Aaron Halfaker <aaron.halfaker@gmail.com>
License: MIT

License: MIT
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
 "Software"), to deal in the Software without restriction, including
 without limitation the rights to use, copy, modify, merge, publish,
 distribute, sublicense, and/or sell copies of the Software, and to
 permit persons to whom the Software is furnished to do so, subject to
 the following conditions:
 .
 The above copyright notice and this permission notice shall be
 included in all copies or substantial portions of the Software.
 .
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
 CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
4
mediawiki_dump_tools/Mediawiki-Utilities/debian/rules
Executable file
@@ -0,0 +1,4 @@
#!/usr/bin/make -f

%:
	dh $@ --with python3 --buildsystem=pybuild
182
mediawiki_dump_tools/Mediawiki-Utilities/doc/Makefile
Normal file
@@ -0,0 +1,182 @@
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS = -v
SPHINXBUILD = sphinx-build
PAPER =
BUILDDIR = _build

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4 = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"

clean:
	rm -rf $(BUILDDIR)/*

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/mediawiki-utilities.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/mediawiki-utilities.qhc"

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/mediawiki-utilities"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/mediawiki-utilities"
	@echo "# devhelp"

epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."

htmlzip: html
	cd _build/html/ && \
	zip -r ../../html.zip * && \
	cd ../../
0
mediawiki_dump_tools/Mediawiki-Utilities/doc/_static/PLACEHOLDER
vendored
Normal file
0
mediawiki_dump_tools/Mediawiki-Utilities/doc/_templates/PLACEHOLDER
vendored
Normal file
267
mediawiki_dump_tools/Mediawiki-Utilities/doc/conf.py
Normal file
@@ -0,0 +1,267 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# mediawiki-utilities documentation build configuration file, created by
# sphinx-quickstart on Thu Apr 10 17:31:47 2014.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import os

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('../'))
import mw

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.doctest',
    'sphinx.ext.todo',
    'sphinx.ext.coverage',
    'sphinx.ext.mathjax',
    'sphinx.ext.viewcode',
]

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

# The suffix of source filenames.
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = 'mediawiki-utilities'
copyright = '2014, Aaron Halfaker'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = mw.__version__
# The full version, including alpha/beta/rc tags.
release = mw.__version__

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False


# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'default'

# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar. Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
#html_use_smartypants = True

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Output file base name for HTML help builder.
htmlhelp_basename = 'mediawiki-utilitiesdoc'


# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
# The paper size ('letterpaper' or 'a4paper').
#'papersize': 'letterpaper',

# The font size ('10pt', '11pt' or '12pt').
#'pointsize': '10pt',

# Additional stuff for the LaTeX preamble.
#'preamble': '',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    ('index', 'mediawiki-utilities.tex', 'mediawiki-utilities Documentation',
     'Aaron Halfaker', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True


# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    ('index', 'mediawiki-utilities', 'mediawiki-utilities Documentation',
     ['Aaron Halfaker'], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False


# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    ('index', 'mediawiki-utilities', 'mediawiki-utilities Documentation',
     'Aaron Halfaker', 'mediawiki-utilities', 'One line description of project.',
     'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False
77
mediawiki_dump_tools/Mediawiki-Utilities/doc/core/api.rst
Normal file
@@ -0,0 +1,77 @@
|
||||
.. _mw.api:
|
||||
|
||||
===================================
|
||||
mw.api -- MediaWiki API abstraction
|
||||
===================================
|
||||
|
||||
This module contains a set of utilities for interacting with the MediaWiki API.
|
||||
|
||||
Here's an example of a common usage pattern:
|
||||
|
||||
>>> from mw import api
|
||||
>>>
|
||||
>>> session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
>>>
|
||||
>>> revisions = session.revisions.query(
|
||||
... properties={'ids', 'content'},
|
||||
... titles={"User:EpochFail"},
|
||||
... direction="newer",
|
||||
... limit=3
|
||||
... )
|
||||
>>>
|
||||
>>> for rev in revisions:
|
||||
... print(
|
||||
... "rev_id={0}, length={1} characters".format(
|
||||
... rev['revid'],
|
||||
... len(rev.get('*', ""))
|
||||
... )
|
||||
... )
|
||||
...
|
||||
rev_id=190055192, length=124 characters
|
||||
rev_id=276121340, length=132 characters
|
||||
rev_id=276121389, length=124 characters
|
||||
|
||||
Session
|
||||
=======
|
||||
|
||||
.. autoclass:: mw.api.Session
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
|
||||
Collections
|
||||
===========
|
||||
|
||||
.. autoclass:: mw.api.DeletedRevisions
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.api.Pages
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.api.RecentChanges
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.api.Revisions
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.api.SiteInfo
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.api.UserContribs
|
||||
:members:
|
||||
|
||||
Errors
|
||||
======
|
||||
|
||||
|
||||
.. autoclass:: mw.api.errors.APIError
|
||||
:members:
|
||||
:inherited-members:
|
||||
|
||||
.. autoclass:: mw.api.errors.AuthenticationError
|
||||
:members:
|
||||
:inherited-members:
|
||||
|
||||
.. autoclass:: mw.api.errors.MalformedResponse
|
||||
:members:
|
||||
:inherited-members:
|
||||
@@ -0,0 +1,53 @@
|
||||
.. _mw.database:
|
||||
|
||||
=========================================
|
||||
mw.database -- MySQL database abstraction
|
||||
=========================================
|
||||
|
||||
This module contains a set of utilities for interacting with MediaWiki databases.
|
||||
|
||||
Here's an example of a common usage pattern:
|
||||
::
|
||||
|
||||
from mw import database
|
||||
|
||||
db = database.DB.from_params(
|
||||
host="s1-analytics-slave.eqiad.wmnet",
|
||||
read_default_file="~/.my.cnf",
|
||||
user="research",
|
||||
db="enwiki"
|
||||
)
|
||||
revisions = db.revisions.query(user_id=9133062)
|
||||
|
||||
for rev_row in revisions:
|
||||
rev_row['rev_id']
|
||||
|
||||
|
||||
DB
|
||||
======
|
||||
|
||||
.. autoclass:: mw.database.DB
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
|
||||
Collections
|
||||
===========
|
||||
|
||||
.. autoclass:: mw.database.Archives
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.database.AllRevisions
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.database.Pages
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.database.RecentChanges
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.database.Revisions
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.database.Users
|
||||
:members:
|
||||
@@ -0,0 +1,52 @@
|
||||
.. _mw.xml_dump:
|
||||
|
||||
==================================
|
||||
mw.xml_dump -- XML dump processing
|
||||
==================================
|
||||
|
||||
.. automodule:: mw.xml_dump
|
||||
|
||||
The map() function
|
||||
==================
|
||||
|
||||
.. autofunction:: mw.xml_dump.map
|
||||
|
||||
Iteration
|
||||
=========
|
||||
|
||||
.. autoclass:: mw.xml_dump.Iterator
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Page
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Redirect
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Revision
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Comment
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Contributor
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. autoclass:: mw.xml_dump.Text
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
Errors
|
||||
======
|
||||
|
||||
.. autoclass:: mw.xml_dump.errors.FileTypeError
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.xml_dump.errors.MalformedXML
|
||||
:members:
|
||||
100
mediawiki_dump_tools/Mediawiki-Utilities/doc/index.rst
Normal file
@@ -0,0 +1,100 @@
|
||||
.. mediawiki-utilities documentation master file, created by
|
||||
sphinx-quickstart on Thu Apr 10 17:31:47 2014.
|
||||
You can adapt this file completely to your liking, but it should at least
|
||||
contain the root `toctree` directive.
|
||||
|
||||
===================
|
||||
MediaWiki Utilities
|
||||
===================
|
||||
|
||||
MediaWiki Utilities is an open source (MIT-licensed) library developed by Aaron Halfaker for extracting and processing data from MediaWiki installations, slave databases, and XML dumps.
|
||||
|
||||
**Install with pip:** ``pip install mediawiki-utilities``
|
||||
|
||||
**Note:** *Use of this library requires Python 3 or later.*
|
||||
|
||||
Types
|
||||
=====
|
||||
:ref:`mw.Timestamp <mw.types>`
|
||||
A simple datatype for handling MediaWiki's various time formats.
|
||||
|
||||
Core modules
|
||||
============
|
||||
|
||||
:ref:`mw.api <mw.api>`
|
||||
A set of utilities for interacting with MediaWiki's web API.
|
||||
|
||||
* :class:`~mw.api.Session` -- Constructs an API session with a MediaWiki installation. Contains convenience methods for accessing ``prop=revisions``, ``list=usercontribs``, ``meta=siteinfo``, ``list=deletedrevs`` and ``list=recentchanges``.
|
||||
|
||||
:ref:`mw.database <mw.database>`
|
||||
A set of utilities for interacting with MediaWiki's database.
|
||||
|
||||
* :class:`~mw.database.DB` -- Constructs a MySQL database connector with convenience methods for accessing ``revision``, ``archive``, ``page``, ``user``, and ``recentchanges``.
|
||||
|
||||
:ref:`mw.xml_dump <mw.xml_dump>`
|
||||
A set of utilities for processing MediaWiki's XML database dumps quickly and without dealing with streaming XML.
|
||||
|
||||
* :func:`~mw.xml_dump.map` -- Applies a function to a set of dump files (:class:`~mw.xml_dump.Iterator`) using :class:`multiprocessing` and aggregates the output.
|
||||
* :class:`~mw.xml_dump.Iterator` -- Constructs an iterator over a standard XML dump. Dumps contain site_info and pages. Pages contain metadata and revisions. Revisions contain metadata and text. This is probably why you are here.
|
||||
|
||||
Libraries
|
||||
=========
|
||||
|
||||
:ref:`mw.lib.persistence <mw.lib.persistence>`
|
||||
A set of utilities for tracking the persistence of content between revisions.
|
||||
|
||||
* :class:`~mw.lib.persistence.State` -- Constructs an object that represents the current content persistence state of a page. Reports useful details about the persistence of content when updated.
|
||||
|
||||
:ref:`mw.lib.reverts <mw.lib.reverts>`
|
||||
A set of utilities for performing revert detection.
|
||||
|
||||
* :func:`~mw.lib.reverts.detect` -- Detects reverts in a sequence of revision events.
|
||||
* :class:`~mw.lib.reverts.Detector` -- Constructs an identity revert detector that can be updated manually over the history of a page.
|
||||
|
||||
:ref:`mw.lib.sessions <mw.lib.sessions>`
|
||||
A set of utilities for grouping revisions and other events into sessions.
|
||||
|
||||
* :func:`~mw.lib.sessions.cluster` -- Clusters a sequence of user actions into sessions.
|
||||
* :class:`~mw.lib.sessions.Cache` -- Constructs a cache of recent user actions that can be updated manually in order to detect sessions.
|
||||
|
||||
:ref:`mw.lib.title <mw.lib.title>`
|
||||
A set of utilities for normalizing and parsing page titles.
|
||||
|
||||
* :func:`~mw.lib.title.normalize` -- Normalizes a page title.
|
||||
* :class:`~mw.lib.title.Parser` -- Constructs a parser with a set of namespaces that can be used to parse and normalize page titles.
|
||||
|
||||
About the author
|
||||
================
|
||||
:name:
|
||||
Aaron Halfaker
|
||||
:email:
|
||||
aaron.halfaker@gmail.com
|
||||
:website:
|
||||
http://halfaker.info --
|
||||
http://en.wikipedia.org/wiki/User:EpochFail
|
||||
|
||||
|
||||
Contributors
|
||||
============
|
||||
None yet. See http://github.com/halfak/mediawiki-utilities. Pull requests are encouraged.
|
||||
|
||||
|
||||
Indices and tables
|
||||
==================
|
||||
|
||||
.. toctree::
|
||||
:maxdepth: 2
|
||||
|
||||
types
|
||||
core/api
|
||||
core/database
|
||||
core/xml_dump
|
||||
lib/persistence
|
||||
lib/reverts
|
||||
lib/sessions
|
||||
lib/title
|
||||
|
||||
* :ref:`genindex`
|
||||
* :ref:`modindex`
|
||||
* :ref:`search`
|
||||
|
||||
@@ -0,0 +1,35 @@
|
||||
.. _mw.lib.persistence:
|
||||
|
||||
=======================================================
|
||||
mw.lib.persistence -- tracking content between revisions
|
||||
=======================================================
|
||||
|
||||
.. autoclass:: mw.lib.persistence.State
|
||||
:members:
|
||||
|
||||
Tokenization
|
||||
============
|
||||
|
||||
.. autoclass:: mw.lib.persistence.Tokens
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.lib.persistence.Token
|
||||
:members:
|
||||
|
||||
.. automodule:: mw.lib.persistence.tokenization
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
Difference
|
||||
==========
|
||||
|
||||
.. automodule:: mw.lib.persistence.difference
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
Constants
|
||||
=========
|
||||
|
||||
.. automodule:: mw.lib.persistence.defaults
|
||||
:members:
|
||||
:member-order: bysource
|
||||
30
mediawiki_dump_tools/Mediawiki-Utilities/doc/lib/reverts.rst
Normal file
@@ -0,0 +1,30 @@
|
||||
.. _mw.lib.reverts:
|
||||
|
||||
=============================================
|
||||
mw.lib.reverts -- detecting reverts
|
||||
=============================================
|
||||
|
||||
.. automodule:: mw.lib.reverts
|
||||
|
||||
.. autofunction:: mw.lib.reverts.detect
|
||||
|
||||
.. autoclass:: mw.lib.reverts.Revert
|
||||
|
||||
.. autoclass:: mw.lib.reverts.Detector
|
||||
:members:
|
||||
|
||||
Convenience functions
|
||||
=====================
|
||||
.. automodule:: mw.lib.reverts.api
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
.. automodule:: mw.lib.reverts.database
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
Constants
|
||||
=========
|
||||
|
||||
.. automodule:: mw.lib.reverts.defaults
|
||||
:members:
|
||||
@@ -0,0 +1,18 @@
|
||||
.. _mw.lib.sessions:
|
||||
|
||||
===================================
|
||||
mw.lib.sessions -- event clustering
|
||||
===================================
|
||||
|
||||
.. autofunction:: mw.lib.sessions.cluster
|
||||
|
||||
.. autoclass:: mw.lib.sessions.Session
|
||||
|
||||
.. autoclass:: mw.lib.sessions.Cache
|
||||
:members:
|
||||
|
||||
Constants
|
||||
=========
|
||||
|
||||
.. automodule:: mw.lib.sessions.defaults
|
||||
:members:
|
||||
15
mediawiki_dump_tools/Mediawiki-Utilities/doc/lib/title.rst
Normal file
@@ -0,0 +1,15 @@
|
||||
.. _mw.lib.title:
|
||||
|
||||
============================================================
|
||||
mw.lib.title -- parsing and normalizing titles
|
||||
============================================================
|
||||
|
||||
.. autofunction:: mw.lib.title.normalize
|
||||
|
||||
|
||||
Title parser
|
||||
================
|
||||
.. autoclass:: mw.lib.title.Parser
|
||||
:members:
|
||||
:member-order: bysource
|
||||
|
||||
11
mediawiki_dump_tools/Mediawiki-Utilities/doc/types.rst
Normal file
@@ -0,0 +1,11 @@
|
||||
.. _mw.types:
|
||||
|
||||
========================
|
||||
mw.types -- common types
|
||||
========================
|
||||
|
||||
.. autoclass:: mw.Timestamp
|
||||
:members:
|
||||
|
||||
.. autoclass:: mw.Namespace
|
||||
:members:
|
||||
@@ -0,0 +1,37 @@
|
||||
"""
|
||||
Prints the rev_id, characters and hash of all revisions to Willy_on_Wheels.
|
||||
"""
|
||||
import getpass
|
||||
import hashlib
|
||||
import os
|
||||
import sys
|
||||
|
||||
try:
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
|
||||
from mw import api
|
||||
except: raise
|
||||
|
||||
|
||||
|
||||
api_session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
print("(EN) Wikipedia credentials...")
|
||||
username = input("Username: ")
|
||||
password = getpass.getpass("Password: ")
|
||||
api_session.login(username, password)
|
||||
|
||||
revisions = api_session.deleted_revisions.query(
|
||||
properties={'ids', 'content'},
|
||||
titles={'Willy on Wheels'},
|
||||
direction="newer"
|
||||
)
|
||||
|
||||
for rev in revisions:
|
||||
print(
|
||||
"{0} ({1} chars): {2}".format(
|
||||
rev['revid'],
|
||||
len(rev.get('*', "")),
|
||||
hashlib.sha1(bytes(rev.get('*', ""), 'utf8')).hexdigest()
|
||||
)
|
||||
)
|
||||
19
mediawiki_dump_tools/Mediawiki-Utilities/examples/api.py
Normal file
@@ -0,0 +1,19 @@
|
||||
"""
|
||||
Prints the rev_id of all revisions to User:TestAccountForMWUtils.
|
||||
"""
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
|
||||
from mw import api
|
||||
|
||||
api_session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
revisions = api_session.revisions.query(
|
||||
properties={'ids'},
|
||||
titles={'User:TestAccountForMWUtils'}
|
||||
)
|
||||
|
||||
for rev in revisions:
|
||||
print(rev['revid'])
|
||||
@@ -0,0 +1,30 @@
|
||||
"""
|
||||
Prints the rcid, type, timestamp and hash of the 10 oldest changes in recent_changes.
|
||||
"""
|
||||
import os
|
||||
import sys
|
||||
|
||||
try:
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
from mw import api
|
||||
except:
|
||||
raise
|
||||
|
||||
api_session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
changes = api_session.recent_changes.query(
|
||||
type={'edit', 'new'},
|
||||
properties={'ids', 'sha1', 'timestamp'},
|
||||
direction="newer",
|
||||
limit=10
|
||||
)
|
||||
|
||||
for change in changes:
|
||||
print(
|
||||
"{0} ({1}) @ {2}: {3}".format(
|
||||
change['rcid'],
|
||||
change['type'],
|
||||
change['timestamp'],
|
||||
change.get('sha1', "")
|
||||
)
|
||||
)
|
||||
@@ -0,0 +1,28 @@
|
||||
"""
|
||||
Prints the rev_id, characters and hash of all revisions to User:EpochFail.
|
||||
"""
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
|
||||
import hashlib
|
||||
from mw import api
|
||||
|
||||
api_session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
revisions = api_session.revisions.query(
|
||||
properties={'ids', 'content'},
|
||||
titles={"User:EpochFail"},
|
||||
direction="newer",
|
||||
limit=51
|
||||
)
|
||||
|
||||
for rev in revisions:
|
||||
print(
|
||||
"{0} ({1} chars): {2}".format(
|
||||
rev['revid'],
|
||||
len(rev.get('*', "")),
|
||||
hashlib.sha1(bytes(rev.get('*', ""), 'utf8')).hexdigest()
|
||||
)
|
||||
)
|
||||
@@ -0,0 +1,20 @@
|
||||
"""
|
||||
Prints the user documents for "EpochFail" and "Halfak (WMF)".
|
||||
"""
|
||||
import os
|
||||
import sys
|
||||
|
||||
try:
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
from mw import api
|
||||
except:
|
||||
raise
|
||||
|
||||
api_session = api.Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
user_docs = api_session.users.query(
|
||||
users=["EpochFail", "Halfak (WMF)"]
|
||||
)
|
||||
|
||||
for user_doc in user_docs:
|
||||
print(user_doc)
|
||||
@@ -0,0 +1,31 @@
|
||||
"""
|
||||
Prints the ID, name and edit count of 10 users registered after 2014-01-01.
|
||||
"""
|
||||
import os
|
||||
import sys
|
||||
|
||||
try:
|
||||
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
from mw import database
|
||||
|
||||
except:
|
||||
raise
|
||||
|
||||
|
||||
|
||||
db = database.DB.from_params(
|
||||
host="analytics-store.eqiad.wmnet",
|
||||
read_default_file="~/.my.cnf",
|
||||
user="research",
|
||||
db="enwiki"
|
||||
)
|
||||
|
||||
users = db.users.query(
|
||||
registered_after="20140101000000",
|
||||
direction="newer",
|
||||
limit=10
|
||||
)
|
||||
|
||||
for user in users:
|
||||
print("{user_id}:{user_name} -- {user_editcount} edits".format(**user))
|
||||
59
mediawiki_dump_tools/Mediawiki-Utilities/examples/dump.xml
Normal file
@@ -0,0 +1,59 @@
|
||||
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||
xsi:schemaLocation="//www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
|
||||
version="0.8" xml:lang="en">
|
||||
<siteinfo>
|
||||
<sitename>Wikipedia</sitename>
|
||||
<base>http://en.wikipedia.org/wiki/Main_Page</base>
|
||||
<generator>MediaWiki 1.22wmf2</generator>
|
||||
<case>first-letter</case>
|
||||
<namespaces>
|
||||
<namespace key="0" case="first-letter" />
|
||||
<namespace key="1" case="first-letter">Talk</namespace>
|
||||
</namespaces>
|
||||
</siteinfo>
|
||||
<page>
|
||||
<title>Foo</title>
|
||||
<ns>0</ns>
|
||||
<id>1</id>
|
||||
<revision>
|
||||
<id>1</id>
|
||||
<timestamp>2004-08-09T09:04:08Z</timestamp>
|
||||
<contributor>
|
||||
<username>Gen0cide</username>
|
||||
<id>92182</id>
|
||||
</contributor>
|
||||
<text xml:space="preserve">Revision 1 text</text>
|
||||
<sha1>g9chqqg94myzq11c56ixvq7o1yg75n9</sha1>
|
||||
<model>wikitext</model>
|
||||
<format>text/x-wiki</format>
|
||||
</revision>
|
||||
<revision>
|
||||
<id>2</id>
|
||||
<timestamp>2004-08-10T09:04:08Z</timestamp>
|
||||
<contributor>
|
||||
<ip>222.152.210.109</ip>
|
||||
</contributor>
|
||||
<text xml:space="preserve">Revision 2 text</text>
|
||||
<sha1>g9chqqg94myzq11c56ixvq7o1yg75n9</sha1>
|
||||
<model>wikitext</model>
|
||||
<comment>Comment 2</comment>
|
||||
<format>text/x-wiki</format>
|
||||
</revision>
|
||||
</page>
|
||||
<page>
|
||||
<title>Bar</title>
|
||||
<ns>1</ns>
|
||||
<id>2</id>
|
||||
<revision>
|
||||
<id>3</id>
|
||||
<timestamp>2004-08-11T09:04:08Z</timestamp>
|
||||
<contributor>
|
||||
<ip>222.152.210.22</ip>
|
||||
</contributor>
|
||||
<text xml:space="preserve">Revision 3 text</text>
|
||||
<sha1>g9chqqg94myzq11c56ixvq7o1yg75n9</sha1>
|
||||
<model>wikitext</model>
|
||||
<format>text/x-wiki</format>
|
||||
</revision>
|
||||
</page>
|
||||
</mediawiki>
|
||||
31
mediawiki_dump_tools/Mediawiki-Utilities/examples/dump2.xml
Normal file
@@ -0,0 +1,31 @@
|
||||
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
|
||||
xsi:schemaLocation="//www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
|
||||
version="0.8" xml:lang="en">
|
||||
<siteinfo>
|
||||
<sitename>Wikipedia</sitename>
|
||||
<base>http://en.wikipedia.org/wiki/Main_Page</base>
|
||||
<generator>MediaWiki 1.22wmf2</generator>
|
||||
<case>first-letter</case>
|
||||
<namespaces>
|
||||
<namespace key="0" case="first-letter" />
|
||||
<namespace key="1" case="first-letter">Talk</namespace>
|
||||
</namespaces>
|
||||
</siteinfo>
|
||||
<page>
|
||||
<title>Herp</title>
|
||||
<ns>1</ns>
|
||||
<id>2</id>
|
||||
<revision>
|
||||
<id>4</id>
|
||||
<timestamp>2004-08-11T09:04:08Z</timestamp>
|
||||
<contributor>
|
||||
<id>10</id>
|
||||
<name>FOobar!?</name>
|
||||
</contributor>
|
||||
<text xml:space="preserve">Revision 4 text</text>
|
||||
<sha1>g9chqqg94myzq11c56ixvq7o1yg75n9</sha1>
|
||||
<model>wikitext</model>
|
||||
<format>text/x-wiki</format>
|
||||
</revision>
|
||||
</page>
|
||||
</mediawiki>
|
||||
@@ -0,0 +1,19 @@
|
||||
import pprint
|
||||
import re
|
||||
|
||||
from mw.api import Session
|
||||
from mw.lib import persistence
|
||||
|
||||
session = Session("https://en.wikipedia.org/w/api.php")
|
||||
|
||||
rev, tokens_added, future_revs = persistence.api.score(session, 560561013,
|
||||
properties={'user'})
|
||||
|
||||
words_re = re.compile(r"\w+", re.UNICODE)  # raw string avoids an invalid "\w" escape
|
||||
|
||||
print("Words added")
|
||||
for token in tokens_added:
|
||||
if words_re.search(token.text):
|
||||
print("'{0}' survived:".format(token.text))
|
||||
for frev in token.revisions:
|
||||
print("\t{revid} by {user}".format(**frev))
|
||||
@@ -0,0 +1,18 @@
|
||||
"""
|
||||
Prints the reverting rev_id, rev_id and reverted to rev_id of all reverted
|
||||
revisions made by user "PermaNoob".
|
||||
"""
|
||||
from mw.api import Session
|
||||
from mw.lib import reverts
|
||||
|
||||
session = Session("https://en.wikipedia.org/w/api.php")
|
||||
revisions = session.user_contribs.query(user={"PermaNoob"}, direction="newer")
|
||||
|
||||
for rev in revisions:
|
||||
revert = reverts.api.check_rev(session, rev, window=60*60*24*2)
|
||||
if revert is not None:
|
||||
print("{0} reverted {1} to {2}".format(
|
||||
revert.reverting['revid'],
|
||||
rev['revid'],
|
||||
revert.reverted_to['revid'])
|
||||
)
|
||||
@@ -0,0 +1,23 @@
|
||||
"""
|
||||
Prints the reverting rev_id, rev_id and reverted to rev_id of all reverted
|
||||
revisions made by user with ID 9133062.
|
||||
"""
|
||||
from mw.database import DB
|
||||
from mw.lib import reverts
|
||||
|
||||
db = DB.from_params(
|
||||
host="s1-analytics-slave.eqiad.wmnet",
|
||||
read_default_file="~/.my.cnf",
|
||||
user="research",
|
||||
db="enwiki"
|
||||
)
|
||||
revisions = db.revisions.query(user_id=9133062)
|
||||
|
||||
for rev_row in revisions:
|
||||
revert = reverts.database.check_row(db, rev_row)
|
||||
if revert is not None:
|
||||
print("{0} reverted {1} to {2}".format(
|
||||
revert.reverting['rev_id'],
|
||||
rev_row['rev_id'],
|
||||
revert.reverted_to['rev_id'])
|
||||
)
|
||||
@@ -0,0 +1,21 @@
|
||||
"""
|
||||
Prints all reverted revisions of User:EpochFail.
|
||||
"""
|
||||
from mw.api import Session
|
||||
from mw.lib import reverts
|
||||
|
||||
# Gather a page's revisions from the API
|
||||
api_session = Session("https://en.wikipedia.org/w/api.php")
|
||||
revs = api_session.revisions.query(
|
||||
titles={"User:EpochFail"},
|
||||
properties={'ids', 'sha1'},
|
||||
direction="newer"
|
||||
)
|
||||
|
||||
# Create a revision event iterator
|
||||
rev_events = ((rev['sha1'], rev) for rev in revs)
|
||||
|
||||
# Detect and print reverts
|
||||
for revert in reverts.detect(rev_events):
|
||||
print("{0} reverted back to {1}".format(revert.reverting['revid'],
|
||||
revert.reverted_to['revid']))
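The example above relies on identity reverts: a revision counts as a revert when its checksum matches that of an earlier revision, and everything in between is treated as reverted. The following is a minimal standalone sketch of that idea, not the implementation behind `reverts.detect`. The function name `detect_identity_reverts`, the `radius` parameter and its default of 15 are assumptions chosen for illustration.

```python
from collections import deque


def detect_identity_reverts(events, radius=15):
    """Yield (reverting, reverteds, reverted_to) for (checksum, rev) events.

    Hypothetical sketch: a revision reverts when its checksum equals the
    checksum of one of the previous `radius` revisions.
    """
    recent = deque(maxlen=radius)  # the last `radius` (checksum, rev) pairs
    for checksum, rev in events:
        history = list(recent)
        # Search from the newest entry backwards for a matching checksum.
        for i in range(len(history) - 1, -1, -1):
            if history[i][0] == checksum:
                reverteds = [r for _, r in history[i + 1:]]
                # An immediate repeat (no revisions in between) is not a revert.
                if reverteds:
                    yield rev, reverteds, history[i][1]
                break
        recent.append((checksum, rev))


events = [("aaa", 1), ("bbb", 2), ("ccc", 3), ("aaa", 4)]
for reverting, reverteds, reverted_to in detect_identity_reverts(events):
    print("{0} reverted {1} back to {2}".format(reverting, reverteds, reverted_to))
```

Bounding the history with a fixed-size `deque` keeps memory constant on long page histories, at the cost of missing reverts that reach further back than `radius` revisions.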
|
||||
@@ -0,0 +1,17 @@
|
||||
"""
|
||||
Prints out session information for user "TestAccountForMWUtils".
|
||||
"""
|
||||
from mw.api import Session
|
||||
from mw.lib import sessions
|
||||
|
||||
# Gather a user's revisions from the API
|
||||
api_session = Session("https://en.wikipedia.org/w/api.php")
|
||||
revs = api_session.user_contribs.query(
|
||||
user={"TestAccountForMWUtils"},
|
||||
direction="newer"
|
||||
)
|
||||
rev_events = ((rev['user'], rev['timestamp'], rev) for rev in revs)
|
||||
|
||||
# Extract and print sessions
|
||||
for user, session in sessions.cluster(rev_events):
|
||||
print("{0}'s session with {1} revisions".format(user, len(session)))
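Session clustering of the kind shown above can be pictured as grouping each user's events until a gap of inactivity exceeds some cutoff. The sketch below illustrates that idea only and is not the code behind `sessions.cluster`. The name `cluster_sessions`, the one-hour `cutoff` default and the use of plain Unix timestamps (rather than the API's timestamp strings) are all assumptions.

```python
def cluster_sessions(events, cutoff=3600):
    """Group (user, unix_timestamp, payload) events into per-user sessions.

    Hypothetical sketch: events must be sorted by timestamp. A user's
    session ends when more than `cutoff` seconds pass between two events.
    """
    active = {}  # user -> (timestamp of last event, payloads so far)
    for user, timestamp, payload in events:
        if user in active:
            last_seen, payloads = active[user]
            if timestamp - last_seen > cutoff:
                yield user, payloads  # the previous session has expired
                payloads = []
        else:
            payloads = []
        payloads.append(payload)
        active[user] = (timestamp, payloads)

    # Flush the sessions that were still open when the input ended.
    for user, (_, payloads) in active.items():
        yield user, payloads


events = [("A", 0, "r1"), ("A", 600, "r2"), ("B", 700, "r3"), ("A", 10000, "r4")]
for user, session in cluster_sessions(events):
    print("{0}'s session with {1} revisions".format(user, len(session)))
```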
|
||||
@@ -0,0 +1,26 @@
|
||||
"""
|
||||
Demonstrates title normalization and parsing.
|
||||
"""
|
||||
import sys
|
||||
import os
|
||||
|
||||
sys.path.insert(0, os.path.abspath(os.getcwd()))
|
||||
|
||||
from mw.api import Session
|
||||
from mw.lib import title
|
||||
|
||||
# Normalize titles
|
||||
title.normalize("foo bar")
|
||||
# > "Foo_bar"
|
||||
|
||||
# Construct a title parser from the API
|
||||
api_session = Session("https://en.wikipedia.org/w/api.php")
|
||||
parser = title.Parser.from_api(api_session)
|
||||
|
||||
# Handles normalization
|
||||
parser.parse("user:epochFail")
|
||||
# > 2, "EpochFail"
|
||||
|
||||
# Handles namespace aliases
|
||||
parser.parse("WT:foobar")
|
||||
# > 5, "Foobar"
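To make the normalization and namespace handling above concrete, here is a small standalone sketch. It is not the logic of `title.normalize` or `title.Parser`; the helper names `normalize_title` and `parse_title` and the hard-coded `NAMESPACES` mapping are hypothetical, and a real parser would load namespaces and aliases from the wiki's site info.

```python
def normalize_title(title):
    """Replace spaces with underscores and upper-case the first character."""
    title = title.strip().replace(" ", "_")
    return title[:1].upper() + title[1:]


def parse_title(title, namespaces):
    """Split 'prefix:name' into (namespace id, normalized name).

    `namespaces` maps lower-cased prefixes (names and aliases) to ids.
    Unknown prefixes are treated as part of a main-namespace (0) title.
    """
    prefix, sep, rest = title.partition(":")
    ns_id = namespaces.get(normalize_title(prefix).lower()) if sep else None
    if ns_id is None:
        return 0, normalize_title(title)
    return ns_id, normalize_title(rest)


# Illustrative subset only; these entries are assumptions for the demo.
NAMESPACES = {"user": 2, "wikipedia_talk": 5, "wt": 5}

print(normalize_title("foo bar"))
print(parse_title("user:epochFail", NAMESPACES))
print(parse_title("WT:foobar", NAMESPACES))
```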
|
||||
@@ -0,0 +1,27 @@
|
||||
"""
|
||||
Demonstrates some simple Timestamp operations
|
||||
"""
|
||||
from mw import Timestamp
|
||||
|
||||
# Seconds since Unix Epoch
|
||||
str(Timestamp(1234567890))
|
||||
# > '20090213233130'
|
||||
|
||||
# Database format
|
||||
int(Timestamp("20090213233130"))
|
||||
# > 1234567890
|
||||
|
||||
# API format
|
||||
int(Timestamp("2009-02-13T23:31:30Z"))
|
||||
# > 1234567890
|
||||
|
||||
# Difference in seconds
|
||||
Timestamp("2009-02-13T23:31:31Z") - Timestamp(1234567890)
|
||||
# > 1
|
||||
|
||||
# strptime and strftime
|
||||
Timestamp(1234567890).strftime("%Y foobar")
|
||||
# > '2009 foobar'
|
||||
|
||||
str(Timestamp.strptime("2009 derp 10", "%Y derp %m"))
|
||||
# > '20091001000000'
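For readers who want to see what the three formats above mean without installing the library, the same conversions can be sketched with the standard library alone. This is an illustration, not how `mw.Timestamp` is implemented; the constant and function names (`DB_FORMAT`, `API_FORMAT`, `to_db`, `to_unix`) are assumptions.

```python
import calendar
import time

DB_FORMAT = "%Y%m%d%H%M%S"         # database style, e.g. 20090213233130
API_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # API style, e.g. 2009-02-13T23:31:30Z


def to_db(unix_seconds):
    """Format seconds since the Unix epoch as a database-style UTC string."""
    return time.strftime(DB_FORMAT, time.gmtime(unix_seconds))


def to_unix(text, fmt):
    """Parse a UTC timestamp string into seconds since the Unix epoch."""
    return calendar.timegm(time.strptime(text, fmt))


print(to_db(1234567890))
print(to_unix("20090213233130", DB_FORMAT))
print(to_unix("2009-02-13T23:31:30Z", API_FORMAT))
```

`calendar.timegm` is used instead of `time.mktime` because MediaWiki timestamps are in UTC, while `mktime` interprets its argument in local time.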
|
||||
@@ -0,0 +1,14 @@
|
||||
"""
|
||||
Prints out all rev_ids that appear in dump.xml.
|
||||
"""
|
||||
from mw.xml_dump import Iterator
|
||||
|
||||
# Construct dump file iterator
|
||||
dump = Iterator.from_file(open("examples/dump.xml"))
|
||||
|
||||
# Iterate through pages
|
||||
for page in dump:
|
||||
|
||||
# Iterate through a page's revisions
|
||||
for revision in page:
|
||||
print(revision.id)
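The page-then-revision iteration above hides the streaming XML parsing. As a rough picture of what that involves, the sketch below walks a tiny inline dump with the standard library's `xml.etree.ElementTree.iterparse`. It is not the `Iterator` implementation; the `SAMPLE` document, the `NS` constant and the `revision_ids` helper are made up for this illustration, and the namespace matches the export-0.8 schema used by `examples/dump.xml`.

```python
import io
import xml.etree.ElementTree as ET

# A trimmed-down stand-in for examples/dump.xml.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>Foo</title><ns>0</ns><id>1</id>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
  <page>
    <title>Bar</title><ns>1</ns><id>2</id>
    <revision><id>3</id></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.8/}"


def revision_ids(source):
    """Yield (page title, revision id) pairs from a dump file object."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            for rev in elem.iter(NS + "revision"):
                yield title, int(rev.findtext(NS + "id"))
            elem.clear()  # drop the page's children to keep memory bounded


for title, rev_id in revision_ids(io.BytesIO(SAMPLE.encode("utf8"))):
    print(title, rev_id)
```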
|
||||
@@ -0,0 +1,15 @@
|
||||
"""
|
||||
Processes two dump files.
|
||||
"""
|
||||
from mw import xml_dump
|
||||
|
||||
files = ["examples/dump.xml", "examples/dump2.xml"]
|
||||
|
||||
|
||||
def page_info(dump, path):
|
||||
for page in dump:
|
||||
yield page.id, page.namespace, page.title
|
||||
|
||||
|
||||
for page_id, page_namespace, page_title in xml_dump.map(files, page_info):
|
||||
print("\t".join([str(page_id), str(page_namespace), page_title]))
|
||||
3
mediawiki_dump_tools/Mediawiki-Utilities/mw/__init__.py
Normal file
@@ -0,0 +1,3 @@
|
||||
from .types import Timestamp, Namespace
|
||||
|
||||
__version__ = "0.4.18"
|
||||
@@ -0,0 +1,5 @@
|
||||
from . import errors
|
||||
from .session import Session
|
||||
|
||||
from .collections import Pages, RecentChanges, Revisions, SiteInfo, \
|
||||
UserContribs, DeletedRevisions
|
||||
@@ -0,0 +1,7 @@
|
||||
from .deleted_revisions import DeletedRevisions
|
||||
from .pages import Pages
|
||||
from .recent_changes import RecentChanges
|
||||
from .revisions import Revisions
|
||||
from .site_info import SiteInfo
|
||||
from .user_contribs import UserContribs
|
||||
from .users import Users
|
||||
@@ -0,0 +1,68 @@
|
||||
import re
|
||||
|
||||
|
||||
class Collection:
|
||||
"""
|
||||
Represents a collection of items that can be queried via the API. This is
|
||||
an abstract base class that should be extended.
|
||||
"""
|
||||
|
||||
TIMESTAMP = re.compile(r"[0-9]{4}-?[0-9]{2}-?[0-9]{2}T?" +
|
||||
r"[0-9]{2}:?[0-9]{2}:?[0-9]{2}Z?")
|
||||
"""
|
||||
A regular expression for matching the API's timestamp format.
|
||||
"""
|
||||
|
||||
DIRECTIONS = {'newer', 'older'}
|
||||
"""
|
||||
A set of potential direction names.
|
||||
"""
|
||||
|
||||
def __init__(self, session):
|
||||
"""
|
||||
:Parameters:
|
||||
session : `mw.api.Session`
|
||||
An api session to use for post & get.
|
||||
"""
|
||||
self.session = session
|
||||
|
||||
def _check_direction(self, direction):
|
||||
if direction is None:
|
||||
return direction
|
||||
else:
|
||||
direction = str(direction)
|
||||
|
||||
assert direction in {None} | self.DIRECTIONS, \
|
||||
"Direction must be one of {0}".format(self.DIRECTIONS)
|
||||
|
||||
return direction
|
||||
|
||||
def _check_timestamp(self, timestamp):
|
||||
if timestamp is None:
|
||||
return timestamp
|
||||
else:
|
||||
timestamp = str(timestamp)
|
||||
|
||||
if not self.TIMESTAMP.match(timestamp):
|
||||
raise TypeError(
|
||||
"{0} is not formatted like ".format(repr(timestamp)) +
|
||||
"a MediaWiki timestamp."
|
||||
)
|
||||
|
||||
return timestamp
|
||||
|
||||
def _items(self, items, none=True, levels=None, type=lambda val: val):
|
||||
|
||||
if none and items is None:
|
||||
return None
|
||||
else:
|
||||
items = {str(type(item)) for item in items}
|
||||
|
||||
if levels is not None:
|
||||
levels = {str(level) for level in levels}
|
||||
|
||||
assert len(items - levels) == 0, \
|
||||
"items {0} not in levels {1}".format(
|
||||
items - levels, levels)
|
||||
|
||||
return "|".join(items)
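The `_items` helper above turns a collection of values into the `|`-separated form that MediaWiki API parameters expect, optionally checking them against a set of allowed values. The standalone sketch below mirrors that behaviour outside the class so it can be run on its own. The name `join_items`, the `convert` argument, the `ValueError` (the original uses an assertion) and the sorting of the output are choices made for this illustration.

```python
def join_items(items, levels=None, convert=lambda val: val):
    """Join values into a '|'-separated API parameter string.

    Hypothetical standalone version of the helper above. Returns None when
    `items` is None so that the parameter can be omitted from the request.
    """
    if items is None:
        return None

    # Convert and stringify each value; the set removes duplicates.
    values = {str(convert(item)) for item in items}

    if levels is not None:
        allowed = {str(level) for level in levels}
        unknown = values - allowed
        if unknown:
            raise ValueError("items {0} not in levels {1}".format(
                sorted(unknown), sorted(allowed)))

    # Sorting is added here only to make the output deterministic.
    return "|".join(sorted(values))


print(join_items({"ids", "content"}, levels={"ids", "content", "sha1"}))
print(join_items([12, "34"], convert=int))
```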
|
||||
@@ -0,0 +1,150 @@
|
||||
import logging
|
||||
import sys
|
||||
|
||||
from ...types import Timestamp
|
||||
from ...util import none_or
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.deletedrevs")
|
||||
|
||||
|
||||
class DeletedRevisions(Collection):
|
||||
PROPERTIES = {'ids', 'flags', 'timestamp', 'user', 'userid', 'size',
|
||||
'sha1', 'contentmodel', 'comment', 'parsedcomment', 'content',
|
||||
'tags'}
|
||||
|
||||
# TODO:
|
||||
# This is *not* the right way to do this, but it should work for all queries.
|
||||
MAX_REVISIONS = 500
|
||||
|
||||
def get(self, rev_id, *args, **kwargs):
|
||||
|
||||
rev_id = int(rev_id)
|
||||
|
||||
revs = list(self.query(revids={rev_id}, **kwargs))
|
||||
|
||||
if len(revs) < 1:
|
||||
raise KeyError(rev_id)
|
||||
else:
|
||||
return revs[0]
|
||||
|
||||
def query(self, *args, limit=sys.maxsize, **kwargs):
|
||||
"""
|
||||
Queries deleted revisions.
|
||||
See https://www.mediawiki.org/wiki/API:Deletedrevs
|
||||
|
||||
:Parameters:
|
||||
titles : set(str)
|
||||
A set of page names to query (note that namespace prefix is expected)
|
||||
start : :class:`mw.Timestamp`
|
||||
A timestamp to start querying from
|
||||
end : :class:`mw.Timestamp`
|
||||
A timestamp to end querying
|
||||
from_title : str
|
||||
A title from which to start querying (alphabetically)
|
||||
to_title : str
|
||||
A title from which to stop querying (alphabetically)
|
||||
prefix : str
|
||||
A title prefix to match on
|
||||
drcontinue : str
|
||||
When more results are available, use this to continue. Note: may only work if drdir is set to newer.
|
||||
unique : bool
|
||||
List only one revision for each page
|
||||
tag : str
|
||||
Only list revisions tagged with this tag
|
||||
user : str
|
||||
Only list revisions saved by this user_text
|
||||
excludeuser : str
|
||||
Do not list revisions saved by this user_text
|
||||
namespace : int
|
||||
Only list pages in this namespace (id)
|
||||
limit : int
|
||||
Limit the number of results
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
properties : set(str)
|
||||
A list of properties to include in the results:
|
||||
|
||||
|
||||
* ids - The ID of the revision.
|
||||
* flags - Revision flags (minor).
|
||||
* timestamp - The timestamp of the revision.
|
||||
* user - User that made the revision.
|
||||
* userid - User ID of the revision creator.
|
||||
* size - Length (bytes) of the revision.
|
||||
* sha1 - SHA-1 (base 16) of the revision.
|
||||
* contentmodel - Content model ID of the revision.
|
||||
* comment - Comment by the user for the revision.
|
||||
* parsedcomment - Parsed comment by the user for the revision.
|
||||
* content - Text of the revision.
|
||||
* tags - Tags for the revision.
|
||||
"""
|
||||
# `limit` means something different here
|
||||
kwargs['limit'] = min(limit, self.MAX_REVISIONS)
|
||||
revisions_yielded = 0
|
||||
done = False
|
||||
while not done and revisions_yielded < limit:
|
||||
rev_docs, query_continue = self._query(*args, **kwargs)
|
||||
for doc in rev_docs:
|
||||
yield doc
|
||||
revisions_yielded += 1
|
||||
if revisions_yielded >= limit:
|
||||
break
|
||||
|
||||
if query_continue != "" and len(rev_docs) > 0:
|
||||
kwargs['query_continue'] = query_continue
|
||||
else:
|
||||
done = True
|
||||
|
||||
def _query(self, titles=None, pageids=None, revids=None,
|
||||
start=None, end=None, query_continue=None, unique=None, tag=None,
|
||||
user=None, excludeuser=None, namespace=None, limit=None,
|
||||
properties=None, direction=None):
|
||||
|
||||
params = {
|
||||
'action': "query",
|
||||
'prop': "deletedrevisions"
|
||||
}
|
||||
|
||||
params['titles'] = self._items(titles)
|
||||
params['pageids'] = self._items(pageids)
|
||||
params['revids'] = self._items(revids)
|
||||
params['drvprop'] = self._items(properties, levels=self.PROPERTIES)
|
||||
params['drvlimit'] = none_or(limit, int)
|
||||
params['drvstart'] = self._check_timestamp(start)
|
||||
params['drvend'] = self._check_timestamp(end)
|
||||
|
||||
params['drvdir'] = self._check_direction(direction)
|
||||
params['drvuser'] = none_or(user, str)
|
||||
params['drvexcludeuser'] = none_or(excludeuser, str)
|
||||
params['drvtag'] = none_or(tag, str)
|
||||
params.update(query_continue or {'continue': ""})
|
||||
|
||||
doc = self.session.get(params)
|
||||
doc_copy = dict(doc)
|
||||
|
||||
try:
|
||||
if 'continue' in doc:
|
||||
query_continue = doc['continue']
|
||||
else:
|
||||
query_continue = ''
|
||||
|
||||
pages = doc['query']['pages'].values()
|
||||
rev_docs = []
|
||||
|
||||
for page_doc in pages:
|
||||
page_rev_docs = page_doc.get('deletedrevisions', [])
|
||||
|
||||
try: del page_doc['deletedrevisions']
|
||||
except KeyError: pass
|
||||
|
||||
for rev_doc in page_rev_docs:
|
||||
rev_doc['page'] = page_doc
|
||||
|
||||
rev_docs.extend(page_rev_docs)
|
||||
|
||||
return rev_docs, query_continue
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
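The `query`/`_query` pair above follows a continuation pattern: request a batch, yield its documents, then feed the returned continuation value into the next request until the limit is reached or nothing more comes back. A minimal standalone sketch of that loop follows. The names `paginate` and `fake_fetch` are hypothetical and not part of the library, and `fake_fetch` merely stands in for a real API call.

```python
def paginate(fetch, limit=None, batch_size=50):
    """Yield docs from `fetch` until it runs dry or `limit` is reached.

    `fetch(n, cont)` is a hypothetical callable that returns
    (docs, next_cont), where next_cont is None when nothing is left.
    """
    yielded = 0
    cont = None
    while True:
        # Never ask for more than the caller still wants.
        want = batch_size if limit is None else min(limit - yielded, batch_size)
        if want <= 0:
            return
        docs, cont = fetch(want, cont)
        for doc in docs:
            yield doc
            yielded += 1
            if limit is not None and yielded >= limit:
                return
        # Stop when there is no continuation or the batch was empty.
        if not cont or not docs:
            return


def fake_fetch(n, cont):
    """Stand-in for an API call: serves the integers 0..119 in batches."""
    start = cont or 0
    docs = list(range(start, min(start + n, 120)))
    next_cont = start + n if start + n < 120 else None
    return docs, next_cont
```

Because the per-request size is recomputed on every pass, the final request asks only for the remainder, so the generator never fetches more than `limit` documents in total.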
@@ -0,0 +1,50 @@
|
||||
import logging
|
||||
|
||||
from ...util import none_or
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.pages")
|
||||
|
||||
|
||||
class Pages(Collection):
|
||||
"""
|
||||
TODO
|
||||
"""
|
||||
|
||||
def _edit(self, title=None, pageid=None, section=None, sectiontitle=None,
|
||||
text=None, token=None, summary=None, minor=None,
|
||||
notminor=None, bot=None, basetimestamp=None,
|
||||
starttimestamp=None, recreate=None, createonly=None,
|
||||
nocreate=None, watch=None, unwatch=None, watchlist=None,
|
||||
md5=None, prependtext=None, appendtext=None, undo=None,
|
||||
undoafter=None, redirect=None, contentformat=None,
|
||||
contentmodel=None, assert_=None, nassert=None,
|
||||
captchaword=None, captchaid=None):
|
||||
params = {
|
||||
'action': "edit"
|
||||
}
|
||||
params['title'] = none_or(title, str)
|
||||
params['pageid'] = none_or(pageid, int)
|
||||
params['section'] = none_or(section, int, levels={'new'})
|
||||
params['sectiontitle'] = none_or(sectiontitle, str)
|
||||
params['text'] = none_or(text, str)
|
||||
params['token'] = none_or(token, str)
|
||||
params['summary'] = none_or(summary, str)
|
||||
params['minor'] = none_or(minor, bool)
|
||||
params['notminor'] = none_or(notminor, bool)
|
||||
params['bot'] = none_or(bot, bool)
|
||||
params['basetimestamp'] = self._check_timestamp(basetimestamp)
|
||||
params['starttimestamp'] = self._check_timestamp(starttimestamp)
|
||||
params['recreate'] = none_or(recreate, bool)
|
||||
params['createonly'] = none_or(createonly, bool)
|
||||
params['nocreate'] = none_or(nocreate, bool)
|
||||
params['watch'] = none_or(watch, bool)
|
||||
params['unwatch'] = none_or(unwatch, bool)
|
||||
params['watchlist'] = none_or(watchlist, bool)
|
||||
params['md5'] = none_or(md5, str)
|
||||
params['prependtext'] = none_or(prependtext, str)
|
||||
params['appendtext'] = none_or(appendtext, str)
|
||||
params['undo'] = none_or(undo, int)
|
||||
params['undoafter'] = none_or(undoafter, int)
|
||||
|
||||
# TODO finish this
|
||||
@@ -0,0 +1,192 @@
|
||||
import logging
|
||||
import re
|
||||
|
||||
from ...util import none_or
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.recent_changes")
|
||||
|
||||
|
||||
class RecentChanges(Collection):
|
||||
"""
|
||||
Recent changes (revisions, page creations, registrations, moves, etc.)
|
||||
"""
|
||||
|
||||
RCCONTINUE = re.compile(r"([0-9]{4}-[0-9]{2}-[0-9]{2}T" +
|
||||
r"[0-9]{2}:[0-9]{2}:[0-9]{2}Z|" +
|
||||
r"[0-9]{14})" +
|
||||
r"\|[0-9]+")
|
||||
|
||||
PROPERTIES = {'user', 'userid', 'comment', 'timestamp', 'title',
|
||||
'ids', 'sizes', 'redirect', 'flags', 'loginfo',
|
||||
'tags', 'sha1'}
|
||||
|
||||
SHOW = {'minor', '!minor', 'bot', '!bot', 'anon', '!anon',
|
||||
'redirect', '!redirect', 'patrolled', '!patrolled'}
|
||||
|
||||
TYPES = {'edit', 'external', 'new', 'log'}
|
||||
|
||||
DIRECTIONS = {'newer', 'older'}
|
||||
|
||||
MAX_CHANGES = 50
|
||||
|
||||
def _check_rccontinue(self, rccontinue):
|
||||
if rccontinue is None:
|
||||
return None
|
||||
elif self.RCCONTINUE.match(rccontinue):
|
||||
return rccontinue
|
||||
else:
|
||||
raise TypeError(
|
||||
"rccontinue {0} is not formatted correctly; expected ".format(rccontinue) +
|
||||
"'%Y-%m-%dT%H:%M:%SZ|<last_rcid>'"
|
||||
)
|
||||
|
||||
def query(self, *args, limit=None, **kwargs):
|
||||
"""
|
||||
Enumerate recent changes.
|
||||
See `<https://www.mediawiki.org/wiki/API:Recentchanges>`_
|
||||
|
||||
:Parameters:
|
||||
start : :class:`mw.Timestamp`
|
||||
The timestamp to start enumerating from
|
||||
end : :class:`mw.Timestamp`
|
||||
The timestamp to end enumerating
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
namespace : int
|
||||
Only list changes in this namespace
|
||||
user : str
|
||||
Only list changes by this user
|
||||
excludeuser : str
|
||||
Don't list changes by this user
|
||||
tag : str
|
||||
Only list changes tagged with this tag
|
||||
properties : set(str)
|
||||
Include additional pieces of information
|
||||
|
||||
* user - Adds the user responsible for the edit and tags if they are an IP
|
||||
* userid - Adds the user id responsible for the edit
|
||||
* comment - Adds the comment for the edit
|
||||
* parsedcomment - Adds the parsed comment for the edit
|
||||
* flags - Adds flags for the edit
|
||||
* timestamp - Adds timestamp of the edit
|
||||
* title - Adds the page title of the edit
|
||||
* ids - Adds the page ID, recent changes ID and the new and old revision ID
|
||||
* sizes - Adds the new and old page length in bytes
|
||||
* redirect - Tags edit if page is a redirect
|
||||
* patrolled - Tags patrollable edits as being patrolled or unpatrolled
|
||||
* loginfo - Adds log information (logid, logtype, etc) to log entries
|
||||
* tags - Lists tags for the entry
|
||||
* sha1 - Adds the content checksum for entries associated with a revision
|
||||
|
||||
token : set(str)
|
||||
Which tokens to obtain for each change
|
||||
|
||||
* patrol
|
||||
|
||||
show : set(str)
|
||||
Show only items that meet these criteria. For example, to see
|
||||
only minor edits done by logged-in users, set
|
||||
show={'minor', '!anon'}.
|
||||
|
||||
* minor
|
||||
* !minor
|
||||
* bot
|
||||
* !bot
|
||||
* anon
|
||||
* !anon
|
||||
* redirect
|
||||
* !redirect
|
||||
* patrolled
|
||||
* !patrolled
|
||||
* unpatrolled
|
||||
limit : int
|
||||
How many total changes to return
|
||||
type : set(str)
|
||||
Which types of changes to show
|
||||
|
||||
* edit
|
||||
* external
|
||||
* new
|
||||
* log
|
||||
|
||||
toponly : bool
|
||||
Only list changes which are the latest revision
|
||||
rccontinue : str
|
||||
Use this to continue loading results from where you last left off
|
||||
"""
|
||||
limit = none_or(limit, int)
|
||||
|
||||
changes_yielded = 0
|
||||
done = False
|
||||
while not done:
|
||||
|
||||
if limit is None:
|
||||
kwargs['limit'] = self.MAX_CHANGES
|
||||
else:
|
||||
kwargs['limit'] = min(limit - changes_yielded, self.MAX_CHANGES)
|
||||
|
||||
rc_docs, rccontinue = self._query(*args, **kwargs)
|
||||
|
||||
for doc in rc_docs:
|
||||
yield doc
|
||||
changes_yielded += 1
|
||||
|
||||
if limit is not None and changes_yielded >= limit:
|
||||
done = True
|
||||
break
|
||||
|
||||
if rccontinue is not None and len(rc_docs) > 0:
|
||||
|
||||
kwargs['rccontinue'] = rccontinue
|
||||
else:
|
||||
done = True
|
||||
|
||||
def _query(self, start=None, end=None, direction=None, namespace=None,
|
||||
user=None, excludeuser=None, tag=None, properties=None,
|
||||
token=None, show=None, limit=None, type=None,
|
||||
toponly=None, rccontinue=None):
|
||||
|
||||
params = {
|
||||
'action': "query",
|
||||
'list': "recentchanges"
|
||||
}
|
||||
|
||||
params['rcstart'] = none_or(start, str)
|
||||
params['rcend'] = none_or(end, str)
|
||||
|
||||
assert direction in {None} | self.DIRECTIONS, \
|
||||
"Direction must be one of {0}".format(self.DIRECTIONS)
|
||||
|
||||
params['rcdir'] = direction
|
||||
params['rcnamespace'] = none_or(namespace, int)
|
||||
params['rcuser'] = none_or(user, str)
|
||||
params['rcexcludeuser'] = none_or(excludeuser, str)
|
||||
params['rctag'] = none_or(tag, str)
|
||||
params['rcprop'] = self._items(properties, levels=self.PROPERTIES)
|
||||
params['rctoken'] = self._items(token)
|
||||
params['rcshow'] = self._items(show, levels=self.SHOW)
|
||||
params['rclimit'] = none_or(limit, int)
|
||||
params['rctype'] = self._items(type, levels=self.TYPES)
|
||||
params['rctoponly'] = none_or(toponly, bool)
|
||||
params['rccontinue'] = self._check_rccontinue(rccontinue)
|
||||
|
||||
doc = self.session.get(params)
|
||||
|
||||
try:
|
||||
rc_docs = doc['query']['recentchanges']
|
||||
|
||||
if 'query-continue' in doc:
|
||||
rccontinue = \
|
||||
doc['query-continue']['recentchanges']['rccontinue']
|
||||
elif len(rc_docs) > 0:
|
||||
rccontinue = "|".join([rc_docs[-1]['timestamp'],
|
||||
str(rc_docs[-1]['rcid'] + 1)])
|
||||
else:
|
||||
pass # Leave it be
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
|
||||
return rc_docs, rccontinue
|
||||
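The `rccontinue` handling above can be tried in isolation. The sketch below copies the `RCCONTINUE` pattern from the class and adds two helpers, `is_valid_rccontinue` and `fallback_rccontinue`, which are hypothetical names used only for illustration. One deliberate difference: the sketch uses `fullmatch`, whereas the class calls `match`, which only anchors at the start of the string and so tolerates trailing characters.

```python
import re

# Same pattern as RecentChanges.RCCONTINUE: an ISO 8601 or 14-digit
# timestamp, a pipe, then a numeric rcid.
RCCONTINUE = re.compile(
    r"([0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z|[0-9]{14})"
    r"\|[0-9]+"
)


def is_valid_rccontinue(value):
    """Return True for None or a fully well-formed continuation string."""
    return value is None or RCCONTINUE.fullmatch(value) is not None


def fallback_rccontinue(rc_docs):
    """Build a continuation value from the last change in a batch.

    Mirrors the branch taken when the response carries no
    'query-continue' block: last timestamp plus the next rcid.
    """
    if not rc_docs:
        return None
    last = rc_docs[-1]
    return "|".join([last["timestamp"], str(last["rcid"] + 1)])
```

For a batch whose last change has timestamp `2014-01-01T00:00:00Z` and rcid 41, the fallback yields `2014-01-01T00:00:00Z|42`, which passes the validity check.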
@@ -0,0 +1,220 @@
|
||||
import logging
|
||||
|
||||
from ...util import none_or
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.revisions")
|
||||
|
||||
|
||||
class Revisions(Collection):
|
||||
"""
|
||||
A collection of revisions indexed by title, page_id and user_text.
|
||||
Note that revisions of deleted pages are queryable via
|
||||
:class:`mw.api.DeletedRevs`.
|
||||
"""
|
||||
|
||||
PROPERTIES = {'ids', 'flags', 'timestamp', 'user', 'userid', 'size',
|
||||
'sha1', 'contentmodel', 'comment', 'parsedcomment',
|
||||
'content', 'tags', 'flagged'}
|
||||
|
||||
DIFF_TO = {'prev', 'next', 'cur'}
|
||||
|
||||
# This is *not* the right way to do this, but it should work for all queries.
|
||||
MAX_REVISIONS = 50
|
||||
|
||||
def get(self, rev_id, **kwargs):
|
||||
"""
|
||||
Get a single revision based on its ID. Raises a :py:class:`KeyError`
|
||||
if the rev_id cannot be found.
|
||||
|
||||
:Parameters:
|
||||
rev_id : int
|
||||
Revision ID
|
||||
``**kwargs``
|
||||
Passed to :py:meth:`query`
|
||||
|
||||
:Returns:
|
||||
A single rev dict
|
||||
"""
|
||||
rev_id = int(rev_id)
|
||||
|
||||
revs = list(self.query(revids={rev_id}, **kwargs))
|
||||
|
||||
if len(revs) < 1:
|
||||
raise KeyError(rev_id)
|
||||
else:
|
||||
return revs[0]
|
||||
|
||||
def query(self, *args, limit=None, **kwargs):
|
||||
"""
|
||||
Get revision information.
|
||||
See `<https://www.mediawiki.org/wiki/API:Properties#revisions_.2F_rv>`_
|
||||
|
||||
:Parameters:
|
||||
properties : set(str)
|
||||
Which properties to get for each revision:
|
||||
|
||||
* ids - The ID of the revision
|
||||
* flags - Revision flags (minor)
|
||||
* timestamp - The timestamp of the revision
|
||||
* user - User that made the revision
|
||||
* userid - User id of revision creator
|
||||
* size - Length (bytes) of the revision
|
||||
* sha1 - SHA-1 (base 16) of the revision
|
||||
* contentmodel - Content model id
|
||||
* comment - Comment by the user for revision
|
||||
* parsedcomment - Parsed comment by the user for the revision
|
||||
* content - Text of the revision
|
||||
* tags - Tags for the revision
|
||||
limit : int
|
||||
Limit how many revisions will be returned
|
||||
No more than 500 (5000 for bots) allowed
|
||||
start_id : int
|
||||
From which revision id to start enumeration (enum)
|
||||
end_id : int
|
||||
Stop revision enumeration on this revid
|
||||
start : :class:`mw.Timestamp`
|
||||
From which revision timestamp to start enumeration (enum)
|
||||
end : :class:`mw.Timestamp`
|
||||
Enumerate up to this timestamp
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
user : str
|
||||
Only include revisions made by user_text
|
||||
excludeuser : str
|
||||
Exclude revisions made by user
|
||||
tag : str
|
||||
Only list revisions tagged with this tag
|
||||
expandtemplates : bool
|
||||
Expand templates in revision content (requires "content" property)
|
||||
generatexml : bool
|
||||
Generate XML parse tree for revision content (requires "content" property)
|
||||
parse : bool
|
||||
Parse revision content (requires "content" property)
|
||||
section : int
|
||||
Only retrieve the content of this section number
|
||||
token : set(str)
|
||||
Which tokens to obtain for each revision
|
||||
|
||||
* rollback - See `<https://www.mediawiki.org/wiki/API:Edit_-_Rollback#Token>`_
|
||||
rvcontinue : str
|
||||
When more results are available, use this to continue
|
||||
diffto : int
|
||||
Revision ID to diff each revision to. Use "prev", "next" and
|
||||
"cur" for the previous, next and current revision respectively
|
||||
difftotext : str
|
||||
Text to diff each revision to. Only diffs a limited number of
|
||||
revisions. Overrides diffto. If section is set, only that
|
||||
section will be diffed against this text
|
||||
contentformat : str
|
||||
Serialization format used for difftotext and expected for output of content
|
||||
|
||||
* text/x-wiki
|
||||
* text/javascript
|
||||
* text/css
|
||||
* text/plain
|
||||
* application/json
|
||||
|
||||
:Returns:
|
||||
An iterator of rev dicts returned from the API.
|
||||
"""
|
||||
|
||||
revisions_yielded = 0
|
||||
done = False
|
||||
while not done:
|
||||
if limit is None:
|
||||
kwargs['limit'] = self.MAX_REVISIONS
|
||||
else:
|
||||
kwargs['limit'] = min(limit - revisions_yielded, self.MAX_REVISIONS)
|
||||
|
||||
rev_docs, rvcontinue = self._query(*args, **kwargs)
|
||||
|
||||
for doc in rev_docs:
|
||||
yield doc
|
||||
revisions_yielded += 1
|
||||
|
||||
if limit is not None and revisions_yielded >= limit:
|
||||
done = True
|
||||
break
|
||||
|
||||
if rvcontinue is not None and len(rev_docs) > 0:
|
||||
kwargs['rvcontinue'] = rvcontinue
|
||||
else:
|
||||
done = True
|
||||
|
||||
|
||||
def _query(self, revids=None, titles=None, pageids=None, properties=None,
|
||||
limit=None, start_id=None, end_id=None, start=None,
|
||||
end=None, direction=None, user=None, excludeuser=None,
|
||||
tag=None, expandtemplates=None, generatexml=None,
|
||||
parse=None, section=None, token=None, rvcontinue=None,
|
||||
diffto=None, difftotext=None, contentformat=None):
|
||||
|
||||
params = {
|
||||
'action': "query",
|
||||
'prop': "revisions",
|
||||
'rawcontinue': ''
|
||||
}
|
||||
|
||||
params['revids'] = self._items(revids, type=int)
|
||||
params['titles'] = self._items(titles)
|
||||
params['pageids'] = self._items(pageids, type=int)
|
||||
|
||||
params['rvprop'] = self._items(properties, levels=self.PROPERTIES)
|
||||
|
||||
if revids is None: # Can't have a limit unless revids is None
|
||||
params['rvlimit'] = none_or(limit, int)
|
||||
|
||||
params['rvstartid'] = none_or(start_id, int)
|
||||
params['rvendid'] = none_or(end_id, int)
|
||||
params['rvstart'] = self._check_timestamp(start)
|
||||
params['rvend'] = self._check_timestamp(end)
|
||||
|
||||
params['rvdir'] = self._check_direction(direction)
|
||||
params['rvuser'] = none_or(user, str)
|
||||
params['rvexcludeuser'] = none_or(excludeuser, str)
|
||||
params['rvtag'] = none_or(tag, str)
|
||||
params['rvexpandtemplates'] = none_or(expandtemplates, bool)
|
||||
params['rvgeneratexml'] = none_or(generatexml, bool)
|
||||
params['rvparse'] = none_or(parse, bool)
|
||||
params['rvsection'] = none_or(section, int)
|
||||
params['rvtoken'] = none_or(token, str)
|
||||
params['rvcontinue'] = none_or(rvcontinue, str)
|
||||
params['rvdiffto'] = self._check_diffto(diffto)
|
||||
params['rvdifftotext'] = none_or(difftotext, str)
|
||||
params['rvcontentformat'] = none_or(contentformat, str)
|
||||
|
||||
doc = self.session.get(params)
|
||||
|
||||
try:
|
||||
if 'query-continue' in doc:
|
||||
rvcontinue = doc['query-continue']['revisions']['rvcontinue']
|
||||
else:
|
||||
rvcontinue = None
|
||||
|
||||
pages = doc['query'].get('pages', {}).values()
|
||||
rev_docs = []
|
||||
|
||||
for page_doc in pages:
|
||||
if 'missing' in page_doc or 'revisions' not in page_doc: continue
|
||||
|
||||
page_rev_docs = page_doc['revisions']
|
||||
del page_doc['revisions']
|
||||
|
||||
for rev_doc in page_rev_docs:
|
||||
rev_doc['page'] = page_doc
|
||||
|
||||
rev_docs.extend(page_rev_docs)
|
||||
|
||||
return rev_docs, rvcontinue
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
|
||||
|
||||
def _check_diffto(self, diffto):
|
||||
if diffto is None or diffto in self.DIFF_TO:
|
||||
return diffto
|
||||
else:
|
||||
return int(diffto)
|
||||
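The response handling in `Revisions._query` flattens a page-keyed API document into a flat list of revision dicts, each carrying its page metadata under a `page` key. Below is a minimal sketch of that step on a hand-written document. The function name `flatten_revisions` and the sample data are assumptions for illustration. Unlike the method above, which deletes `revisions` from each page dict in place, this variant leaves its input untouched.

```python
def flatten_revisions(doc):
    """Return a flat list of revision dicts, each tagged with its page."""
    rev_docs = []
    for page_doc in doc["query"].get("pages", {}).values():
        # Skip pages that do not exist or came back without revisions.
        if "missing" in page_doc or "revisions" not in page_doc:
            continue
        # Page metadata is everything except the revision list itself.
        page_meta = {k: v for k, v in page_doc.items() if k != "revisions"}
        for rev_doc in page_doc["revisions"]:
            rev_docs.append(dict(rev_doc, page=page_meta))
    return rev_docs


# Hypothetical response shape, trimmed to the keys used above.
sample = {
    "query": {
        "pages": {
            "1": {"pageid": 1, "title": "A",
                  "revisions": [{"revid": 10}, {"revid": 11}]},
            "-1": {"title": "B", "missing": ""},
        }
    }
}
revs = flatten_revisions(sample)
```

Copying rather than mutating costs a little memory per revision, but it lets the caller keep using the original response afterwards.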
@@ -0,0 +1,81 @@
|
||||
import logging
|
||||
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.site_info")
|
||||
|
||||
|
||||
class SiteInfo(Collection):
|
||||
"""
|
||||
General information about the site.
|
||||
"""
|
||||
|
||||
PROPERTIES = {'general', 'namespaces', 'namespacealiases',
|
||||
'specialpagealiases', 'magicwords', 'interwikimap',
|
||||
'dbrepllag', 'statistics', 'usergroups', 'extensions',
|
||||
'fileextensions', 'rightsinfo', 'languages', 'skins',
|
||||
'extensiontags', 'functionhooks', 'showhooks',
|
||||
'variables', 'protocols'}
|
||||
|
||||
FILTERIW = {'local', '!local'}
|
||||
|
||||
def query(self, properties=None, filteriw=None, showalldb=None,
|
||||
numberinggroup=None, inlanguagecode=None):
|
||||
"""
|
||||
General information about the site.
|
||||
See `<https://www.mediawiki.org/wiki/API:Meta#siteinfo_.2F_si>`_
|
||||
|
||||
:Parameters:
|
||||
properties: set(str)
|
||||
Which sysinfo properties to get:
|
||||
|
||||
* general - Overall system information
|
||||
* namespaces - List of registered namespaces and their canonical names
|
||||
* namespacealiases - List of registered namespace aliases
|
||||
* specialpagealiases - List of special page aliases
|
||||
* magicwords - List of magic words and their aliases
|
||||
* statistics - Returns site statistics
|
||||
* interwikimap - Returns interwiki map (optionally filtered, optionally localised by using siinlanguagecode)
|
||||
* dbrepllag - Returns database server with the highest replication lag
|
||||
* usergroups - Returns user groups and the associated permissions
|
||||
* extensions - Returns extensions installed on the wiki
|
||||
* fileextensions - Returns list of file extensions allowed to be uploaded
|
||||
* rightsinfo - Returns wiki rights (license) information if available
|
||||
* restrictions - Returns information on available restriction (protection) types
|
||||
* languages - Returns a list of languages MediaWiki supports (optionally localised by using siinlanguagecode)
|
||||
* skins - Returns a list of all enabled skins
|
||||
* extensiontags - Returns a list of parser extension tags
|
||||
* functionhooks - Returns a list of parser function hooks
|
||||
* showhooks - Returns a list of all subscribed hooks (contents of $wgHooks)
|
||||
* variables - Returns a list of variable IDs
|
||||
* protocols - Returns a list of protocols that are allowed in external links.
|
||||
* defaultoptions - Returns the default values for user preferences.
|
||||
filteriw : str
|
||||
"local" or "!local" Return only local or only nonlocal entries of the interwiki map
|
||||
showalldb : bool
|
||||
List all database servers, not just the one lagging the most
|
||||
numberinggroup : bool
|
||||
Lists the number of users in user groups
|
||||
inlanguagecode : str
|
||||
Language code for localised language names (best effort, use CLDR extension)
|
||||
"""
|
||||
|
||||
siprop = self._items(properties, levels=self.PROPERTIES)
|
||||
|
||||
doc = self.session.get(
|
||||
{
|
||||
'action': "query",
|
||||
'meta': "siteinfo",
|
||||
'siprop': siprop,
|
||||
'sifilteriw': filteriw,
|
||||
'sishowalldb': showalldb,
|
||||
'sinumberingroup': numberinggroup,
|
||||
'siinlanguagecode': inlanguagecode
|
||||
}
|
||||
)
|
||||
|
||||
try:
|
||||
return doc['query']
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
@@ -0,0 +1,132 @@
|
||||
import logging
|
||||
|
||||
from ...util import none_or
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.user_contribs")
|
||||
|
||||
|
||||
class UserContribs(Collection):
|
||||
"""
|
||||
A collection of revisions indexed by user.
|
||||
"""
|
||||
|
||||
PROPERTIES = {'ids', 'title', 'timestamp', 'comment', 'parsedcomment',
|
||||
'size', 'sizediff', 'flags', 'patrolled', 'tags'}
|
||||
|
||||
SHOW = {'minor', '!minor', 'patrolled', '!patrolled'}
|
||||
|
||||
MAX_REVISIONS = 50
|
||||
|
||||
def query(self, *args, limit=None, **kwargs):
|
||||
"""
|
||||
Get a user's revisions.
|
||||
See `<https://www.mediawiki.org/wiki/API:Usercontribs>`_
|
||||
|
||||
:Parameters:
|
||||
limit : int
|
||||
The maximum number of contributions to return.
|
||||
start : :class:`mw.Timestamp`
|
||||
The start timestamp to return from
|
||||
end : :class:`mw.Timestamp`
|
||||
The end timestamp to return to
|
||||
user : set(str)
|
||||
The users to retrieve contributions for. Maximum number of values 50 (500 for bots)
|
||||
userprefix : set(str)
|
||||
Retrieve contributions for all users whose names begin with this value.
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
namespace : int
|
||||
Only list contributions in these namespaces
|
||||
properties : set(str)
|
||||
Include additional pieces of information
|
||||
|
||||
* ids - Adds the page ID and revision ID
|
||||
* title - Adds the title and namespace ID of the page
|
||||
* timestamp - Adds the timestamp of the edit
|
||||
* comment - Adds the comment of the edit
|
||||
* parsedcomment - Adds the parsed comment of the edit
|
||||
* size - Adds the new size of the edit
|
||||
* sizediff - Adds the size delta of the edit against its parent
|
||||
* flags - Adds flags of the edit
|
||||
* patrolled - Tags patrolled edits
|
||||
* tags - Lists tags for the edit
|
||||
show : set(str)
|
||||
Show only items that meet these criteria, e.g. non-minor edits only: ucshow=!minor.
|
||||
NOTE: If ucshow=patrolled or ucshow=!patrolled is set, revisions older than
|
||||
$wgRCMaxAge (2592000) won't be shown
|
||||
|
||||
* minor
|
||||
* !minor
|
||||
* patrolled
|
||||
* !patrolled
|
||||
* top
|
||||
* !top
|
||||
* new
|
||||
* !new
|
||||
tag : str
|
||||
Only list revisions tagged with this tag
|
||||
toponly : bool
|
||||
DEPRECATED! Only list changes which are the latest revision
|
||||
"""
|
||||
limit = none_or(limit, int)
|
||||
|
||||
revisions_yielded = 0
|
||||
done = False
|
||||
while not done:
|
||||
|
||||
if limit is None:
|
||||
kwargs['limit'] = self.MAX_REVISIONS
|
||||
else:
|
||||
kwargs['limit'] = min(limit - revisions_yielded, self.MAX_REVISIONS)
|
||||
|
||||
uc_docs, uccontinue = self._query(*args, **kwargs)
|
||||
|
||||
for doc in uc_docs:
|
||||
yield doc
|
||||
revisions_yielded += 1
|
||||
|
||||
if limit is not None and revisions_yielded >= limit:
|
||||
done = True
|
||||
break
|
||||
|
||||
if uccontinue is None or len(uc_docs) == 0:
|
||||
done = True
|
||||
else:
|
||||
kwargs['uccontinue'] = uccontinue
|
||||
|
||||
def _query(self, user=None, userprefix=None, limit=None, start=None,
|
||||
end=None, direction=None, namespace=None, properties=None,
|
||||
show=None, tag=None, toponly=None,
|
||||
uccontinue=None):
|
||||
|
||||
params = {
|
||||
'action': "query",
|
||||
'list': "usercontribs"
|
||||
}
|
||||
params['uclimit'] = none_or(limit, int)
|
||||
params['ucstart'] = self._check_timestamp(start)
|
||||
params['ucend'] = self._check_timestamp(end)
|
||||
if uccontinue is not None:
|
||||
params.update(uccontinue)
|
||||
params['ucuser'] = self._items(user, type=str)
|
||||
params['ucuserprefix'] = self._items(userprefix, type=str)
|
||||
params['ucdir'] = self._check_direction(direction)
|
||||
params['ucnamespace'] = none_or(namespace, int)
|
||||
params['ucprop'] = self._items(properties, levels=self.PROPERTIES)
|
||||
params['ucshow'] = self._items(show, levels=self.SHOW)
|
||||
|
||||
doc = self.session.get(params)
|
||||
try:
|
||||
if 'query-continue' in doc:
|
||||
uccontinue = doc['query-continue']['usercontribs']
|
||||
else:
|
||||
uccontinue = None
|
||||
|
||||
uc_docs = doc['query']['usercontribs']
|
||||
|
||||
return uc_docs, uccontinue
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
@@ -0,0 +1,83 @@
|
||||
import logging
|
||||
|
||||
from ...util import none_or
|
||||
from ..errors import MalformedResponse
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.api.collections.users")
|
||||
|
||||
|
||||
class Users(Collection):
|
||||
"""
|
||||
A collection of information about users
|
||||
"""
|
||||
|
||||
PROPERTIES = {'blockinfo', 'implicitgroups', 'groups', 'registration',
|
||||
'emailable', 'editcount', 'gender'}
|
||||
|
||||
SHOW = {'minor', '!minor', 'patrolled', '!patrolled'}
|
||||
|
||||
MAX_REVISIONS = 50
|
||||
|
||||
def query(self, *args, **kwargs):
|
||||
"""
|
||||
Get a user's metadata.
|
||||
See `<https://www.mediawiki.org/wiki/API:Users>`_
|
||||
|
||||
:Parameters:
|
||||
users : set(str)
|
||||
The usernames of the users to be retrieved.
|
||||
|
||||
properties : set(str)
|
||||
Include additional pieces of information
|
||||
|
||||
blockinfo - Tags if the user is blocked, by whom, and
|
||||
for what reason
|
||||
groups - Lists all the groups the user(s) belongs to
|
||||
implicitgroups - Lists all the groups a user is automatically
|
||||
a member of
|
||||
rights - Lists all the rights the user(s) has
|
||||
editcount - Adds the user's edit count
|
||||
registration - Adds the user's registration timestamp
|
||||
emailable - Tags if the user can and wants to receive
|
||||
email through [[Special:Emailuser]]
|
||||
gender - Tags the gender of the user. Returns "male",
|
||||
"female", or "unknown"
|
||||
"""
|
||||
done = False
|
||||
while not done:
|
||||
|
||||
us_docs, query_continue = self._query(*args, **kwargs)
|
||||
|
||||
for doc in us_docs:
|
||||
yield doc
|
||||
|
||||
if query_continue is None or len(us_docs) == 0:
|
||||
done = True
|
||||
else:
|
||||
kwargs['query_continue'] = query_continue
|
||||
|
||||
def _query(self, users, query_continue=None, properties=None):
|
||||
|
||||
params = {
|
||||
'action': "query",
|
||||
'list': "users"
|
||||
}
|
||||
params['ususers'] = self._items(users, type=str)
|
||||
params['usprop'] = self._items(properties, levels=self.PROPERTIES)
|
||||
if query_continue is not None:
|
||||
params.update(query_continue)
|
||||
|
||||
doc = self.session.get(params)
|
||||
try:
|
||||
if 'query-continue' in doc:
|
||||
query_continue = doc['query-continue']['users']
|
||||
else:
|
||||
query_continue = None
|
||||
|
||||
us_docs = doc['query']['users']
|
||||
|
||||
return us_docs, query_continue
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)
|
||||
48
mediawiki_dump_tools/Mediawiki-Utilities/mw/api/errors.py
Normal file
@@ -0,0 +1,48 @@
|
||||
class DocError(Exception):
|
||||
def __init__(self, message, doc):
|
||||
super().__init__(message)
|
||||
|
||||
self.doc = doc
|
||||
"""
|
||||
The document returned by the API that brought about this error.
|
||||
"""
|
||||
|
||||
|
||||
class APIError(DocError):
|
||||
def __init__(self, doc):
|
||||
|
||||
code = doc.get('error', {}).get('code')
|
||||
message = doc.get('error', {}).get('message')
|
||||
|
||||
super().__init__("{0}:{1}".format(code, message), doc)
|
||||
|
||||
self.code = code
|
||||
"""
|
||||
The error code returned by the api -- if available.
|
||||
"""
|
||||
|
||||
self.message = message
|
||||
"""
|
||||
The error message returned by the api -- if available.
|
||||
"""
|
||||
|
||||
class AuthenticationError(DocError):
|
||||
def __init__(self, doc):
|
||||
result = doc['login']['result']
|
||||
super().__init__(result, doc)
|
||||
|
||||
self.result = result
|
||||
"""
|
||||
The result code of an authentication attempt.
|
||||
"""
|
||||
|
||||
|
||||
class MalformedResponse(DocError):
|
||||
def __init__(self, key, doc):
|
||||
|
||||
super().__init__("Expected to find '{0}' in result.".format(key), doc)
|
||||
|
||||
self.key = key
|
||||
"""
|
||||
The expected, but missing key from the API call.
|
||||
"""
|
||||
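The exception classes above all carry the offending API document on `doc`, so a caller can inspect the raw response after a failure. The sketch below re-declares trimmed copies of `DocError` and `MalformedResponse` so that it runs on its own. `extract_users` is a hypothetical helper that mirrors the `try`/`except KeyError` blocks in the collections. It passes `e.args[0]` rather than `str(e)`, because `str()` of a `KeyError` wraps the key in quotes, which would double the quotes in the message.

```python
class DocError(Exception):
    """Trimmed copy: an error that keeps the API document that caused it."""

    def __init__(self, message, doc):
        super().__init__(message)
        self.doc = doc


class MalformedResponse(DocError):
    """Trimmed copy: raised when an expected key is missing."""

    def __init__(self, key, doc):
        super().__init__("Expected to find '{0}' in result.".format(key), doc)
        self.key = key


def extract_users(doc):
    """Pull the user list out of a response, or fail with context."""
    try:
        return doc["query"]["users"]
    except KeyError as e:
        # e.args[0] is the bare missing key, without repr() quoting.
        raise MalformedResponse(e.args[0], doc)


try:
    extract_users({})
except MalformedResponse as err:
    caught = err
```

Here `caught.key` names the missing key and `caught.doc` holds the response that lacked it.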
134
mediawiki_dump_tools/Mediawiki-Utilities/mw/api/session.py
Normal file
@@ -0,0 +1,134 @@
|
||||
import logging
|
||||
|
||||
from ..util import api
|
||||
from .collections import (DeletedRevisions, Pages, RecentChanges, Revisions,
|
||||
SiteInfo, UserContribs, Users)
|
||||
from .errors import APIError, AuthenticationError, MalformedResponse
|
||||
|
||||
logger = logging.getLogger("mw.api.session")
|
||||
|
||||
DEFAULT_USER_AGENT = "MediaWiki-Utilities"
|
||||
"""
|
||||
The default User-Agent to be sent with requests to the API.
|
||||
"""
|
||||
|
||||
class Session(api.Session):
|
||||
"""
|
||||
Represents a connection to a MediaWiki API.
|
||||
|
||||
Cookies and other session information are preserved.
|
||||
|
||||
:Parameters:
|
||||
uri : str
|
||||
The base URI for the API to use. Usually ends in "api.php"
|
||||
user_agent : str
|
||||
The User-Agent to be sent with requests. A warning is logged if
|
||||
left to default value.
|
||||
"""
|
||||
|
||||
def __init__(self, uri, *args, user_agent=DEFAULT_USER_AGENT, **kwargs):
|
||||
"""
|
||||
Constructs a new :class:`Session`.
|
||||
"""
|
||||
|
||||
if user_agent == DEFAULT_USER_AGENT:
|
||||
logger.warning("Sending requests with default User-Agent. " +
|
||||
"Set 'user_agent' on api.Session to quiet this " +
|
||||
"message.")
|
||||
|
||||
if 'headers' in kwargs:
|
||||
kwargs['headers']['User-Agent'] = str(user_agent)
|
||||
else:
|
||||
kwargs['headers'] = {'User-Agent': str(user_agent)}
|
||||
|
||||
super().__init__(uri, *args, **kwargs)
|
||||
|
||||
self.pages = Pages(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.Pages`.
|
||||
"""
|
||||
|
||||
self.revisions = Revisions(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.Revisions`.
|
||||
"""
|
||||
|
||||
self.recent_changes = RecentChanges(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.RecentChanges`.
|
||||
"""
|
||||
|
||||
self.site_info = SiteInfo(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.SiteInfo`.
|
||||
"""
|
||||
|
||||
self.user_contribs = UserContribs(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.UserContribs`.
|
||||
"""
|
||||
|
||||
self.users = Users(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.Users`.
|
||||
"""
|
||||
|
||||
self.deleted_revisions = DeletedRevisions(self)
|
||||
"""
|
||||
An instance of :class:`mw.api.DeletedRevisions`.
|
||||
"""
|
||||
|
||||
def login(self, username, password, token=None):
|
||||
"""
|
||||
Performs a login operation. This method usually makes two requests to
|
||||
the API -- one to get a token and one to use the token to log in. If
|
||||
authentication fails, this method will throw an
|
||||
:class:`.errors.AuthenticationError`.
|
||||
|
||||
:Parameters:
|
||||
username : str
|
||||
Your username
|
||||
password : str
|
||||
Your password
|
||||
|
||||
:Returns:
|
||||
The response in a json :py:class:`dict`
|
||||
"""
|
||||
|
||||
doc = self.post(
|
||||
{
|
||||
'action': "login",
|
||||
'lgname': username,
|
||||
'lgpassword': password,
|
||||
'lgtoken': token, # If None, we'll be getting a token
|
||||
}
|
||||
)
|
||||
|
||||
|
||||
try:
|
||||
if doc['login']['result'] == "Success":
|
||||
return doc
|
||||
elif doc['login']['result'] == "NeedToken":
|
||||
|
||||
if token is not None:
|
||||
# Woops. We've been here before. Better error out.
|
||||
raise AuthenticationError(doc)
|
||||
else:
|
||||
token = doc['login']['token']
|
||||
return self.login(username, password, token=token)
|
||||
else:
|
||||
raise AuthenticationError(doc)
|
||||
|
||||
except KeyError as e:
|
||||
raise MalformedResponse(str(e), doc)  # KeyError has no .message in Python 3
|
||||
|
||||
|
||||
def request(self, type, params, **kwargs):
|
||||
params.update({'format': "json"})
|
||||
|
||||
doc = super().request(type, params, **kwargs).json()
|
||||
|
||||
if 'error' in doc:
|
||||
raise APIError(doc)
|
||||
|
||||
return doc
|
||||
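The two-request login handshake described in the `login` docstring can be sketched in isolation. This is a minimal sketch, not the library's implementation: `FakeSession`, the token value, and the `RuntimeError` are hypothetical stand-ins so the flow can run without a wiki.

```python
class FakeSession:
    """Hypothetical stand-in for an API session; records what was posted."""

    def __init__(self):
        self.calls = []

    def post(self, params):
        self.calls.append(dict(params))
        if params.get('lgtoken') is None:
            # First request: the server answers with a token to echo back.
            return {'login': {'result': "NeedToken", 'token': "abc123"}}
        return {'login': {'result': "Success"}}


def login(session, username, password, token=None):
    """Two-step login: fetch a token, then retry exactly once with it."""
    doc = session.post({
        'action': "login",
        'lgname': username,
        'lgpassword': password,
        'lgtoken': token,
    })
    result = doc['login']['result']
    if result == "Success":
        return doc
    if result == "NeedToken" and token is None:
        # Only recurse when no token was sent, so a second NeedToken fails.
        return login(session, username, password,
                     token=doc['login']['token'])
    raise RuntimeError("login failed: {0!r}".format(doc))


session = FakeSession()
doc = login(session, "ExampleUser", "example-password")
```

Checking `token is None` before recursing is what keeps the handshake from looping if the server keeps asking for a token.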
@@ -0,0 +1,4 @@
|
||||
# from . import errors
|
||||
from .db import DB
|
||||
from .collections import Pages, RecentChanges, Revisions, Archives, \
|
||||
AllRevisions, Users
|
||||
@@ -0,0 +1,4 @@
|
||||
from .pages import Pages
|
||||
from .recent_changes import RecentChanges
|
||||
from .revisions import Revisions, Archives, AllRevisions
|
||||
from .users import Users
|
||||
@@ -0,0 +1,11 @@
|
||||
class Collection:
|
||||
DIRECTIONS = {'newer', 'older'}
|
||||
|
||||
def __init__(self, db):
|
||||
self.db = db
|
||||
|
||||
def __str__(self):
|
||||
return self.__repr__()
|
||||
|
||||
def __repr__(self):
|
||||
return "{0}({1})".format(self.__class__.__name__, repr(self.db))
|
||||
@@ -0,0 +1,65 @@
|
||||
import logging
|
||||
|
||||
from ...util import none_or
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.database.collections.pages")
|
||||
|
||||
|
||||
class Pages(Collection):
|
||||
def get(self, page_id=None, namespace_title=None, rev_id=None):
|
||||
"""
|
||||
Gets a single page based on a legitimate identifier of the page. Note
|
||||
that namespace_title expects a tuple of namespace ID and title.
|
||||
|
||||
:Parameters:
|
||||
page_id : int
|
||||
Page ID
|
||||
namespace_title : ( int, str )
|
||||
the page's namespace ID and title
|
||||
rev_id : int
|
||||
a revision ID included in the page's history
|
||||
|
||||
:Returns:
|
||||
the matching page row, or None if no page matches
|
||||
"""
|
||||
|
||||
page_id = none_or(page_id, int)
|
||||
namespace_title = none_or(namespace_title, tuple)
|
||||
rev_id = none_or(rev_id, int)
|
||||
|
||||
query = """
|
||||
SELECT page.*
|
||||
FROM page
|
||||
"""
|
||||
values = []
|
||||
|
||||
if page_id is not None:
|
||||
query += """
|
||||
WHERE page_id = %s
|
||||
"""
|
||||
values.append(page_id)
|
||||
|
||||
elif namespace_title is not None:
|
||||
namespace, title = namespace_title
|
||||
|
||||
query += " WHERE page_namespace = %s and page_title = %s "
|
||||
values.extend([int(namespace), str(title)])
|
||||
|
||||
elif rev_id is not None:
|
||||
query += """
|
||||
WHERE page_id = (SELECT rev_page FROM revision WHERE rev_id = %s)
|
||||
"""
|
||||
values.append(rev_id)
|
||||
|
||||
else:
|
||||
raise TypeError("Must specify a page identifier.")
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
|
||||
cursor.execute(
|
||||
query,
|
||||
values
|
||||
)
|
||||
|
||||
for row in cursor:
|
||||
return row
|
||||
@@ -0,0 +1,128 @@
|
||||
import logging
|
||||
import time
|
||||
|
||||
from ...types import Timestamp
|
||||
from ...util import none_or
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.database.collections.recent_changes")
|
||||
|
||||
|
||||
class RecentChanges(Collection):
|
||||
# (https://www.mediawiki.org/wiki/Manual:Recentchanges_table)
|
||||
TYPES = {
|
||||
'edit': 0, # edit of existing page
|
||||
'new': 1, # new page
|
||||
'move': 2, # Marked as obsolete
|
||||
'log': 3, # log action (introduced in MediaWiki 1.2)
|
||||
'move_over_redirect': 4, # Marked as obsolete
|
||||
'external': 5 # An external recent change. Primarily used by Wikidata
|
||||
}
|
||||
|
||||
def listen(self, last=None, types=None, max_wait=5):
|
||||
"""
|
||||
Listens to the recent changes table. Given no parameters, this function
|
||||
will return an iterator over the entire recentchanges table and then
|
||||
continue to "listen" for new changes, polling every ``max_wait`` seconds.
|
||||
|
||||
:Parameters:
|
||||
last : dict
|
||||
a recentchanges row to pick up after
|
||||
types : set ( str )
|
||||
a set of recentchanges types to filter for
|
||||
max_wait : float
|
||||
the maximum number of seconds to wait between repeated queries
|
||||
|
||||
:Returns:
|
||||
A never-ending iterator over change rows.
|
||||
"""
|
||||
while True:
|
||||
if last is not None:
|
||||
after = last['rc_timestamp']
|
||||
after_id = last['rc_id']
|
||||
else:
|
||||
after = None
|
||||
after_id = None
|
||||
|
||||
start = time.time()
|
||||
rcs = self.query(after=after, after_id=after_id, types=types, direction="newer")
|
||||
|
||||
count = 0
|
||||
for rc in rcs:
|
||||
yield rc
|
||||
last = rc  # resume after this row on the next poll
|
||||
|
||||
time.sleep(max(0, max_wait - (time.time() - start)))  # never sleep a negative duration
|
||||
|
||||
def query(self, before=None, after=None, before_id=None, after_id=None,
|
||||
types=None, direction=None, limit=None):
|
||||
"""
|
||||
Queries the ``recentchanges`` table. See
|
||||
`<https://www.mediawiki.org/wiki/Manual:Recentchanges_table>`_
|
||||
|
||||
:Parameters:
|
||||
before : :class:`mw.Timestamp`
|
||||
The maximum timestamp
|
||||
after : :class:`mw.Timestamp`
|
||||
The minimum timestamp
|
||||
before_id : int
|
||||
The maximum ``rc_id`` (exclusive)
|
||||
after_id : int
|
||||
The minimum ``rc_id`` (exclusive)
|
||||
types : set ( str )
|
||||
Which types of changes to return?
|
||||
|
||||
* ``edit`` -- Edits to existing pages
|
||||
* ``new`` -- Edits that create new pages
|
||||
* ``move`` -- (obsolete)
|
||||
* ``log`` -- Log actions (introduced in MediaWiki 1.2)
|
||||
* ``move_over_redirect`` -- (obsolete)
|
||||
* ``external`` -- An external recent change. Primarily used by Wikidata
|
||||
|
||||
direction : str
|
||||
"older" or "newer"
|
||||
limit : int
|
||||
limit the number of records returned
|
||||
"""
|
||||
before = none_or(before, Timestamp)
|
||||
after = none_or(after, Timestamp)
|
||||
before_id = none_or(before_id, int)
|
||||
after_id = none_or(after_id, int)
|
||||
types = none_or(types, levels=self.TYPES)
|
||||
direction = none_or(direction, levels=self.DIRECTIONS)
|
||||
limit = none_or(limit, int)
|
||||
|
||||
query = """
|
||||
SELECT * FROM recentchanges
|
||||
WHERE 1
|
||||
"""
|
||||
values = []
|
||||
|
||||
if before is not None:
|
||||
query += " AND rc_timestamp < %s "
|
||||
values.append(before.short_format())
|
||||
if after is not None:
|
||||
query += " AND rc_timestamp > %s "
|
||||
values.append(after.short_format())
|
||||
if before_id is not None:
|
||||
query += " AND rc_id < %s "
|
||||
values.append(before_id)
|
||||
if after_id is not None:
|
||||
query += " AND rc_id > %s "
|
||||
values.append(after_id)
|
||||
if types is not None:
|
||||
query += " AND rc_type IN ({0}) ".format(
",".join(str(self.TYPES[t]) for t in types)  # str.join needs strings, not ints
)
|
||||
|
||||
if direction is not None:
|
||||
direction = ("ASC " if direction == "newer" else "DESC ")
|
||||
query += " ORDER BY rc_timestamp {0}, rc_id {0}".format(direction)
|
||||
|
||||
if limit is not None:
|
||||
query += " LIMIT %s "
|
||||
values.append(limit)
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
cursor.execute(query, values)
|
||||
for row in cursor:
|
||||
yield row
|
||||
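The incremental query construction used by `RecentChanges.query` can be sketched as a standalone function. This is a hedged illustration rather than the module's code: `build_query` and the reduced `TYPES` table are hypothetical, and type codes are bound as parameters here instead of being formatted into the SQL string.

```python
TYPES = {'edit': 0, 'new': 1, 'log': 3}  # reduced, illustrative subset


def build_query(after=None, after_id=None, types=None, direction=None,
                limit=None):
    """Returns (sql, values) using %s placeholders, as pymysql expects."""
    query = "SELECT * FROM recentchanges WHERE 1"
    values = []

    if after is not None:
        query += " AND rc_timestamp > %s"
        values.append(after)
    if after_id is not None:
        query += " AND rc_id > %s"
        values.append(after_id)
    if types is not None:
        codes = sorted(TYPES[t] for t in types)
        # One placeholder per type code keeps the values out of the SQL text.
        query += " AND rc_type IN ({0})".format(",".join(["%s"] * len(codes)))
        values.extend(codes)
    if direction is not None:
        order = "ASC" if direction == "newer" else "DESC"
        query += " ORDER BY rc_timestamp {0}, rc_id {0}".format(order)
    if limit is not None:
        query += " LIMIT %s"
        values.append(int(limit))

    return query, values


query, values = build_query(after="20140101000000", types={'edit', 'new'},
                            direction="newer", limit=10)
```

Only the sort keyword is interpolated into the string, and it is chosen from two fixed literals, so no caller-supplied text reaches the SQL.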
@@ -0,0 +1,410 @@
|
||||
import logging
|
||||
import time
|
||||
from itertools import chain
|
||||
|
||||
from ...types import Timestamp
|
||||
from ...util import iteration, none_or
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.database.collections.revisions")
|
||||
|
||||
|
||||
class AllRevisions(Collection):
|
||||
def get(self, rev_id, include_page=False):
|
||||
"""
|
||||
Gets a single revision by ID. Checks both the ``revision`` and
|
||||
``archive`` tables. This method throws a :class:`KeyError` if a
|
||||
revision cannot be found.
|
||||
|
||||
:Parameters:
|
||||
rev_id : int
|
||||
Revision ID
|
||||
include_page : bool
|
||||
Join revision returned against ``page``
|
||||
|
||||
:Returns:
|
||||
A revision row
|
||||
"""
|
||||
rev_id = int(rev_id)
|
||||
try:
|
||||
rev_row = self.db.revisions.get(rev_id, include_page=include_page)
|
||||
except KeyError as e:
|
||||
rev_row = self.db.archives.get(rev_id)
|
||||
|
||||
return rev_row
|
||||
|
||||
def query(self, *args, **kwargs):
|
||||
"""
|
||||
Queries revisions (includes archived revisions of deleted pages)
|
||||
|
||||
:Parameters:
|
||||
page_id : int
|
||||
Page identifier. Filter revisions to this page.
|
||||
user_id : int
|
||||
User identifier. Filter revisions to those made by this user.
|
||||
user_text : str
|
||||
User text (user_name or IP address). Filter revisions to those
|
||||
made by this user.
|
||||
before : :class:`mw.Timestamp`
|
||||
Filter revisions to those made before this timestamp.
|
||||
after : :class:`mw.Timestamp`
|
||||
Filter revisions to those made after this timestamp.
|
||||
before_id : int
|
||||
Filter revisions to those with an ID before this ID
|
||||
after_id : int
|
||||
Filter revisions to those with an ID after this ID
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
limit : int
|
||||
Limit the number of results
|
||||
include_page : bool
|
||||
Join revisions returned against ``page``
|
||||
|
||||
:Returns:
|
||||
An iterator over revision rows.
|
||||
"""
|
||||
|
||||
revisions = self.db.revisions.query(*args, **kwargs)
|
||||
archives = self.db.archives.query(*args, **kwargs)
|
||||
|
||||
if kwargs.get('direction') is not None:
|
||||
direction = kwargs['direction']
|
||||
if direction not in self.DIRECTIONS:
|
||||
raise TypeError("direction must be in {0}".format(self.DIRECTIONS))
|
||||
|
||||
if direction == "newer":
|
||||
collated_revisions = iteration.sequence(
|
||||
revisions,
|
||||
archives,
|
||||
compare=lambda r1, r2:\
|
||||
(r1['rev_timestamp'], r1['rev_id']) <=
|
||||
(r2['rev_timestamp'], r2['rev_id'])
|
||||
)
|
||||
else: # direction == "older"
|
||||
collated_revisions = iteration.sequence(
|
||||
revisions,
|
||||
archives,
|
||||
compare=lambda r1, r2:\
|
||||
(r1['rev_timestamp'], r1['rev_id']) >=
|
||||
(r2['rev_timestamp'], r2['rev_id'])
|
||||
)
|
||||
else:
|
||||
collated_revisions = chain(revisions, archives)
|
||||
|
||||
if kwargs.get('limit') is not None:
|
||||
limit = kwargs['limit']
|
||||
|
||||
for i, rev in enumerate(collated_revisions):
|
||||
yield rev
|
||||
if i + 1 >= limit:  # i is zero-based; stop once `limit` rows were yielded
|
||||
break
|
||||
|
||||
else:
|
||||
for rev in collated_revisions:
|
||||
yield rev
|
||||
|
||||
|
||||
class Revisions(Collection):
|
||||
|
||||
def get(self, rev_id, include_page=False):
|
||||
"""
|
||||
Gets a single revision by ID. Checks the ``revision`` table. This
|
||||
method throws a :class:`KeyError` if a revision cannot be found.
|
||||
|
||||
:Parameters:
|
||||
rev_id : int
|
||||
Revision ID
|
||||
include_page : bool
|
||||
Join revision returned against ``page``
|
||||
|
||||
:Returns:
|
||||
A revision row
|
||||
"""
|
||||
rev_id = int(rev_id)
|
||||
|
||||
query = """
|
||||
SELECT *, FALSE AS archived FROM revision
|
||||
"""
|
||||
if include_page:
|
||||
query += """
|
||||
INNER JOIN page ON page_id = rev_page
|
||||
"""
|
||||
|
||||
query += " WHERE rev_id = %s"
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
cursor.execute(query, [rev_id])
|
||||
|
||||
for row in cursor:
|
||||
return row
|
||||
|
||||
raise KeyError(rev_id)
|
||||
|
||||
def query(self, page_id=None, user_id=None, user_text=None,
|
||||
before=None, after=None, before_id=None, after_id=None,
|
||||
direction=None, limit=None, include_page=False):
|
||||
"""
|
||||
Queries revisions (excludes revisions to deleted pages)
|
||||
|
||||
:Parameters:
|
||||
page_id : int
|
||||
Page identifier. Filter revisions to this page.
|
||||
user_id : int
|
||||
User identifier. Filter revisions to those made by this user.
|
||||
user_text : str
|
||||
User text (user_name or IP address). Filter revisions to those
|
||||
made by this user.
|
||||
before : :class:`mw.Timestamp`
|
||||
Filter revisions to those made before this timestamp.
|
||||
after : :class:`mw.Timestamp`
|
||||
Filter revisions to those made after this timestamp.
|
||||
before_id : int
|
||||
Filter revisions to those with an ID before this ID
|
||||
after_id : int
|
||||
Filter revisions to those with an ID after this ID
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
limit : int
|
||||
Limit the number of results
|
||||
include_page : bool
|
||||
Join revisions returned against ``page``
|
||||
|
||||
:Returns:
|
||||
An iterator over revision rows.
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
page_id = none_or(page_id, int)
|
||||
user_id = none_or(user_id, int)
|
||||
user_text = none_or(user_text, str)
|
||||
before = none_or(before, Timestamp)
|
||||
after = none_or(after, Timestamp)
|
||||
before_id = none_or(before_id, int)
|
||||
after_id = none_or(after_id, int)
|
||||
direction = none_or(direction, levels=self.DIRECTIONS)
|
||||
include_page = bool(include_page)
|
||||
|
||||
query = """
|
||||
SELECT *, FALSE AS archived FROM revision
|
||||
"""
|
||||
|
||||
if include_page:
|
||||
query += """
|
||||
INNER JOIN page ON page_id = rev_page
|
||||
"""
|
||||
|
||||
query += """
|
||||
WHERE 1
|
||||
"""
|
||||
values = []
|
||||
|
||||
if page_id is not None:
|
||||
query += " AND rev_page = %s "
|
||||
values.append(page_id)
|
||||
if user_id is not None:
|
||||
query += " AND rev_user = %s "
|
||||
values.append(user_id)
|
||||
if user_text is not None:
|
||||
query += " AND rev_user_text = %s "
|
||||
values.append(user_text)
|
||||
if before is not None:
|
||||
query += " AND rev_timestamp < %s "
|
||||
values.append(before.short_format())
|
||||
if after is not None:
|
||||
query += " AND rev_timestamp > %s "
|
||||
values.append(after.short_format())
|
||||
if before_id is not None:
|
||||
query += " AND rev_id < %s "
|
||||
values.append(before_id)
|
||||
if after_id is not None:
|
||||
query += " AND rev_id > %s "
|
||||
values.append(after_id)
|
||||
|
||||
if direction is not None:
|
||||
|
||||
direction = ("ASC " if direction == "newer" else "DESC ")
|
||||
|
||||
if before_id is not None or after_id is not None:
|
||||
query += " ORDER BY rev_id {0}, rev_timestamp {0}".format(direction)
|
||||
else:
|
||||
query += " ORDER BY rev_timestamp {0}, rev_id {0}".format(direction)
|
||||
|
||||
if limit is not None:
|
||||
query += " LIMIT %s "
|
||||
values.append(limit)
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
|
||||
cursor.execute(query, values)
|
||||
count = 0
|
||||
for row in cursor:
|
||||
yield row
|
||||
count += 1
|
||||
|
||||
logger.debug("%s revisions read in %s seconds" % (count, time.time() - start_time))
|
||||
|
||||
|
||||
class Archives(Collection):
|
||||
def get(self, rev_id):
|
||||
"""
|
||||
Gets a single revision by ID. Checks the ``archive`` table. This
|
||||
method throws a :class:`KeyError` if a revision cannot be found.
|
||||
|
||||
:Parameters:
|
||||
rev_id : int
|
||||
Revision ID
|
||||
|
||||
:Returns:
|
||||
A revision row
|
||||
"""
|
||||
rev_id = int(rev_id)
|
||||
|
||||
query = """
|
||||
SELECT
|
||||
ar_id,
|
||||
ar_rev_id AS rev_id,
|
||||
ar_page_id AS rev_page,
|
||||
ar_page_id AS page_id,
|
||||
ar_title AS page_title,
|
||||
ar_namespace AS page_namespace,
|
||||
ar_text_id AS rev_text_id,
|
||||
ar_comment AS rev_comment,
|
||||
ar_user AS rev_user,
|
||||
ar_user_text AS rev_user_text,
|
||||
ar_timestamp AS rev_timestamp,
|
||||
ar_minor_edit AS rev_minor_edit,
|
||||
ar_deleted AS rev_deleted,
|
||||
ar_len AS rev_len,
|
||||
ar_parent_id AS rev_parent_id,
|
||||
ar_sha1 AS rev_sha1,
|
||||
TRUE AS archived
|
||||
FROM archive
|
||||
WHERE ar_rev_id = %s
|
||||
"""
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
cursor.execute(query, [rev_id])
|
||||
for row in cursor:
|
||||
return row
|
||||
|
||||
raise KeyError(rev_id)
|
||||
|
||||
def query(self, page_id=None, user_id=None, user_text=None,
|
||||
before=None, after=None, before_id=None, after_id=None,
|
||||
before_ar_id=None, after_ar_id=None,
|
||||
direction=None, limit=None, include_page=True):
|
||||
"""
|
||||
Queries archived revisions (revisions of deleted pages)
|
||||
|
||||
:Parameters:
|
||||
page_id : int
|
||||
Page identifier. Filter revisions to this page.
|
||||
user_id : int
|
||||
User identifier. Filter revisions to those made by this user.
|
||||
user_text : str
|
||||
User text (user_name or IP address). Filter revisions to those
|
||||
made by this user.
|
||||
before : :class:`mw.Timestamp`
|
||||
Filter revisions to those made before this timestamp.
|
||||
after : :class:`mw.Timestamp`
|
||||
Filter revisions to those made after this timestamp.
|
||||
before_id : int
|
||||
Filter revisions to those with an ID before this ID
|
||||
after_id : int
|
||||
Filter revisions to those with an ID after this ID
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
limit : int
|
||||
Limit the number of results
|
||||
include_page : bool
|
||||
This field is ignored. It's only here for compatibility with
|
||||
:class:`mw.database.Revisions`.
|
||||
|
||||
:Returns:
|
||||
An iterator over revision rows.
|
||||
"""
|
||||
page_id = none_or(page_id, int)
|
||||
user_id = none_or(user_id, int)
|
||||
before = none_or(before, Timestamp)
|
||||
after = none_or(after, Timestamp)
|
||||
before_id = none_or(before_id, int)
|
||||
after_id = none_or(after_id, int)
|
||||
direction = none_or(direction, levels=self.DIRECTIONS)
|
||||
limit = none_or(limit, int)
|
||||
|
||||
start_time = time.time()
|
||||
cursor = self.db.shared_connection.cursor()
|
||||
|
||||
query = """
|
||||
SELECT
|
||||
ar_id,
|
||||
ar_rev_id AS rev_id,
|
||||
ar_page_id AS rev_page,
|
||||
ar_page_id AS page_id,
|
||||
ar_title AS page_title,
|
||||
ar_namespace AS page_namespace,
|
||||
ar_text_id AS rev_text_id,
|
||||
ar_comment AS rev_comment,
|
||||
ar_user AS rev_user,
|
||||
ar_user_text AS rev_user_text,
|
||||
ar_timestamp AS rev_timestamp,
|
||||
ar_minor_edit AS rev_minor_edit,
|
||||
ar_deleted AS rev_deleted,
|
||||
ar_len AS rev_len,
|
||||
ar_parent_id AS rev_parent_id,
|
||||
ar_sha1 AS rev_sha1,
|
||||
TRUE AS archived
|
||||
FROM archive
|
||||
"""
|
||||
|
||||
query += """
|
||||
WHERE 1
|
||||
"""
|
||||
values = []
|
||||
|
||||
if page_id is not None:
|
||||
query += " AND ar_page_id = %s "
|
||||
values.append(page_id)
|
||||
if user_id is not None:
|
||||
query += " AND ar_user = %s "
|
||||
values.append(user_id)
|
||||
if user_text is not None:
|
||||
query += " AND ar_user_text = %s "
|
||||
values.append(user_text)
|
||||
if before is not None:
|
||||
query += " AND ar_timestamp < %s "
|
||||
values.append(before.short_format())
|
||||
if after is not None:
|
||||
query += " AND ar_timestamp > %s "
|
||||
values.append(after.short_format())
|
||||
if before_id is not None:
|
||||
query += " AND ar_rev_id < %s "
|
||||
values.append(before_id)
|
||||
if after_id is not None:
|
||||
query += " AND ar_rev_id > %s "
|
||||
values.append(after_id)
|
||||
if before_ar_id is not None:
|
||||
query += " AND ar_id < %s "
|
||||
values.append(before_ar_id)
|
||||
if after_ar_id is not None:
|
||||
query += " AND ar_id > %s "
|
||||
values.append(after_ar_id)
|
||||
|
||||
if direction is not None:
|
||||
dir = ("ASC " if direction == "newer" else "DESC ")
|
||||
|
||||
if before is not None or after is not None:
|
||||
query += " ORDER BY ar_timestamp {0}, ar_rev_id {0}".format(dir)
|
||||
elif before_id is not None or after_id is not None:
|
||||
query += " ORDER BY ar_rev_id {0}, ar_timestamp {0}".format(dir)
|
||||
else:
|
||||
query += " ORDER BY ar_id {0}".format(dir)
|
||||
|
||||
if limit is not None:
|
||||
query += " LIMIT %s "
|
||||
values.append(limit)
|
||||
|
||||
cursor.execute(query, values)
|
||||
count = 0
|
||||
for row in cursor:
|
||||
yield row
|
||||
count += 1
|
||||
|
||||
logger.debug("%s revisions read in %s seconds" % (count, time.time() - start_time))
|
||||
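`AllRevisions.query` collates two already-sorted row streams (live and archived revisions) by `(rev_timestamp, rev_id)`. The sketch below shows the same idea using the standard library's `heapq.merge` in place of the package's own `iteration.sequence` helper, whose implementation is not shown here. `collate` and the sample rows are hypothetical.

```python
import heapq
from itertools import islice


def collate(revisions, archives, direction="newer", limit=None):
    """Merges two sorted row iterators into one ordered, optionally capped stream.

    Both inputs must already be sorted in the requested direction.
    """
    key = lambda row: (row['rev_timestamp'], row['rev_id'])
    merged = heapq.merge(revisions, archives, key=key,
                         reverse=(direction == "older"))
    # islice(..., None) yields everything; an int caps the row count exactly.
    return islice(merged, limit)


live = [
    {'rev_id': 1, 'rev_timestamp': "20140101000000"},
    {'rev_id': 4, 'rev_timestamp': "20140104000000"},
]
deleted = [
    {'rev_id': 2, 'rev_timestamp': "20140102000000"},
    {'rev_id': 3, 'rev_timestamp': "20140103000000"},
]

newer_ids = [r['rev_id'] for r in collate(live, deleted, limit=3)]
older_ids = [r['rev_id'] for r in collate(reversed(live), reversed(deleted),
                                          direction="older")]
```

Using `islice` for the cap avoids the off-by-one that is easy to introduce with a manual `enumerate` counter.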
@@ -0,0 +1,154 @@
|
||||
import logging
|
||||
import time
|
||||
|
||||
from ...types import Timestamp
|
||||
from ...util import none_or
|
||||
from .collection import Collection
|
||||
|
||||
logger = logging.getLogger("mw.database.collections.users")
|
||||
|
||||
|
||||
class Users(Collection):
|
||||
CREATION_ACTIONS = {'newusers', 'create', 'create2', 'autocreate',
|
||||
'byemail'}
|
||||
|
||||
def get(self, user_id=None, user_name=None):
|
||||
"""
|
||||
Gets a single user row from the database. Raises a :class:`KeyError`
|
||||
if a user cannot be found.
|
||||
|
||||
:Parameters:
|
||||
user_id : int
|
||||
User ID
|
||||
user_name : str
|
||||
User's name
|
||||
|
||||
:Returns:
|
||||
A user row.
|
||||
"""
|
||||
user_id = none_or(user_id, int)
|
||||
user_name = none_or(user_name, str)
|
||||
|
||||
query = """
|
||||
SELECT user.*
|
||||
FROM user
|
||||
"""
|
||||
values = []
|
||||
|
||||
if user_id is not None:
|
||||
query += """
|
||||
WHERE user_id = %s
|
||||
"""
|
||||
values.append(user_id)
|
||||
|
||||
elif user_name is not None:
|
||||
query += """
|
||||
WHERE user_name = %s
|
||||
"""
|
||||
values.append(user_name)
|
||||
|
||||
else:
|
||||
raise TypeError("Must specify a user identifier.")
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
|
||||
cursor.execute(
|
||||
query,
|
||||
values
|
||||
)
|
||||
|
||||
for row in cursor:
|
||||
return row
|
||||
|
||||
raise KeyError(user_id if user_id is not None else user_name)
|
||||
|
||||
def query(self, registered_before=None, registered_after=None,
|
||||
before_id=None, after_id=None, limit=None,
|
||||
direction=None, self_created_only=False):
|
||||
"""
|
||||
Queries users based on various filtering parameters.
|
||||
|
||||
:Parameters:
|
||||
registered_before : :class:`mw.Timestamp`
|
||||
A timestamp to search before (inclusive)
|
||||
registered_after : :class:`mw.Timestamp`
|
||||
A timestamp to search after (inclusive)
|
||||
before_id : int
|
||||
A user_id to search before (inclusive)
|
||||
after_id : int
|
||||
A user_id to search after (inclusive)
|
||||
direction : str
|
||||
"newer" or "older"
|
||||
limit : int
|
||||
Limit the results to at most this number
|
||||
self_created_only : bool
|
||||
limit results to self_created user accounts
|
||||
|
||||
:Returns:
|
||||
an iterator over ``user`` table rows
|
||||
"""
|
||||
start_time = time.time()
|
||||
|
||||
registered_before = none_or(registered_before, Timestamp)
|
||||
registered_after = none_or(registered_after, Timestamp)
|
||||
before_id = none_or(before_id, int)
|
||||
after_id = none_or(after_id, int)
|
||||
direction = none_or(direction, levels=self.DIRECTIONS)
|
||||
limit = none_or(limit, int)
|
||||
self_created_only = bool(self_created_only)
|
||||
|
||||
query = """
|
||||
SELECT user.*
|
||||
FROM user
|
||||
"""
|
||||
values = []
|
||||
|
||||
if self_created_only:
|
||||
query += """
|
||||
INNER JOIN logging ON
|
||||
log_user = user_id AND
|
||||
log_type = "newusers" AND
|
||||
log_action = "create"
|
||||
"""
|
||||
|
||||
query += "WHERE 1 "
|
||||
|
||||
if registered_before is not None:
|
||||
query += "AND user_registration <= %s "
|
||||
values.append(registered_before.short_format())
|
||||
if registered_after is not None:
|
||||
query += "AND user_registration >= %s "
|
||||
values.append(registered_after.short_format())
|
||||
if before_id is not None:
|
||||
query += "AND user_id <= %s "
|
||||
values.append(before_id)
|
||||
if after_id is not None:
|
||||
query += "AND user_id >= %s "
|
||||
values.append(after_id)
|
||||
|
||||
query += "GROUP BY user_id " # In case of duplicate log events
|
||||
|
||||
if direction is not None:
|
||||
if registered_before is not None or registered_after is not None:
|
||||
if direction == "newer":
|
||||
query += "ORDER BY user_registration ASC "
|
||||
else:
|
||||
query += "ORDER BY user_registration DESC "
|
||||
else:
|
||||
if direction == "newer":
|
||||
query += "ORDER BY user_id ASC "
|
||||
else:
|
||||
query += "ORDER BY user_id DESC "
|
||||
|
||||
if limit is not None:
|
||||
query += "LIMIT %s "
|
||||
values.append(limit)
|
||||
|
||||
cursor = self.db.shared_connection.cursor()
|
||||
cursor.execute(query, values)
|
||||
|
||||
count = 0
|
||||
for row in cursor:
|
||||
yield row
|
||||
count += 1
|
||||
|
||||
logger.debug("%s users queried in %s seconds" % (count, time.time() - start_time))
|
||||
134
mediawiki_dump_tools/Mediawiki-Utilities/mw/database/db.py
Normal file
@@ -0,0 +1,134 @@
|
||||
import getpass
|
||||
import logging
|
||||
import os
|
||||
|
||||
import pymysql
|
||||
import pymysql.cursors
|
||||
|
||||
from .collections import AllRevisions, Archives, Pages, Revisions, Users
|
||||
|
||||
logger = logging.getLogger("mw.database.db")
|
||||
|
||||
|
||||
class DB:
|
||||
"""
|
||||
Represents a connection to a MySQL database.
|
||||
|
||||
:Parameters:
|
||||
connection : :class:`pymysql.connections.Connection`
|
||||
A connection to a MediaWiki database
|
||||
"""
|
||||
|
||||
def __init__(self, connection):
|
||||
self.shared_connection = connection
|
||||
self.shared_connection.cursorclass = pymysql.cursors.DictCursor
|
||||
|
||||
self.revisions = Revisions(self)
|
||||
"""
|
||||
An instance of :class:`mw.database.Revisions`.
|
||||
"""
|
||||
|
||||
self.archives = Archives(self)
|
||||
"""
|
||||
An instance of :class:`mw.database.Archives`.
|
||||
"""
|
||||
|
||||
self.all_revisions = AllRevisions(self)
|
||||
"""
|
||||
An instance of :class:`mw.database.AllRevisions`.
|
||||
"""
|
||||
|
||||
self.pages = Pages(self)
|
||||
"""
|
||||
An instance of :class:`mw.database.Pages`.
|
||||
"""
|
||||
|
||||
self.users = Users(self)
|
||||
"""
|
||||
An instance of :class:`mw.database.Users`.
|
||||
"""
|
||||
|
||||
def __repr__(self):
|
||||
return "%s(%r)" % (self.__class__.__name__, self.shared_connection)  # DB stores no args/kwargs
|
||||
|
||||
def __str__(self):
|
||||
return self.__repr__()
|
||||
|
||||
@classmethod
|
||||
def add_arguments(cls, parser, defaults=None):
|
||||
"""
|
||||
Adds the arguments to an :class:`argparse.ArgumentParser` in order to
|
||||
create a database connection.
|
||||
"""
|
||||
defaults = defaults if defaults is not None else {}
|
||||
|
||||
default_host = defaults.get('host', "localhost")
|
||||
parser.add_argument(
|
||||
'--host', '-h',
|
||||
help="MySQL database host to connect to (defaults to {0})".format(default_host),
|
||||
default=default_host
|
||||
)
|
||||
|
||||
default_database = defaults.get('database', getpass.getuser())
|
||||
parser.add_argument(
|
||||
'--database', '-d',
|
||||
help="MySQL database name to connect to (defaults to {0})".format(default_database),
|
||||
default=default_database
|
||||
)
|
||||
|
||||
default_defaults_file = defaults.get('defaults-file', os.path.expanduser("~/.my.cnf"))
|
||||
parser.add_argument(
|
||||
'--defaults-file',
|
||||
help="MySQL defaults file (defaults to {0})".format(default_defaults_file),
|
||||
default=default_defaults_file
|
||||
)
|
||||
|
||||
default_user = defaults.get('user', getpass.getuser())
|
||||
parser.add_argument(
|
||||
'--user', '-u',
|
||||
help="MySQL user (defaults to {0})".format(default_user),
|
||||
default=default_user
|
||||
)
|
||||
return parser
|
||||
|
||||
@classmethod
|
||||
def from_arguments(cls, args):
|
||||
"""
|
||||
Constructs a :class:`~mw.database.DB`.
|
||||
Consumes :class:`argparse.ArgumentParser` arguments given by
|
||||
:meth:`add_arguments` in order to create a :class:`DB`.
|
||||
|
||||
:Parameters:
|
||||
args : :class:`argparse.Namespace`
|
||||
A collection of argument values returned by :class:`argparse.ArgumentParser`'s :meth:`parse_args()`
|
||||
"""
|
||||
connection = pymysql.connect(
|
||||
args.host,
|
||||
args.user,
|
||||
database=args.database,
|
||||
read_default_file=args.defaults_file
|
||||
)
|
||||
return cls(connection)
|
||||
|
||||
@classmethod
|
||||
def from_params(cls, *args, **kwargs):
|
||||
"""
|
||||
Constructs a :class:`~mw.database.DB`. Passes `*args` and `**kwargs`
|
||||
to :func:`pymysql.connect` and configures the connection.
|
||||
|
||||
:Parameters:
|
||||
*args, **kwargs
Positional and keyword arguments passed through to :func:`pymysql.connect`. A ``db`` keyword is renamed to ``database``.
|
||||
"""
|
||||
kwargs['cursorclass'] = pymysql.cursors.DictCursor
|
||||
if 'db' in kwargs:
|
||||
kwargs['database'] = kwargs['db']
|
||||
del kwargs['db']
|
||||
connection = pymysql.connect(*args, **kwargs)
|
||||
return cls(connection)
|
||||
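The `add_arguments` / `from_arguments` pair is meant to be wired into an `argparse` parser. The sketch below mirrors that wiring without connecting to a database. It is an illustration under stated assumptions: the function is a trimmed, hypothetical copy, the fallback values `"localhost"`, `"wiki"` and `"anonymous"` are placeholders, and the parser is built with `add_help=False` because argparse reserves `-h` for help by default and would reject a second `-h` option.

```python
import argparse


def add_arguments(parser, defaults=None):
    """Registers connection options, falling back to a defaults mapping."""
    defaults = defaults if defaults is not None else {}

    parser.add_argument('--host', '-h',
                        default=defaults.get('host', "localhost"))
    parser.add_argument('--database', '-d',
                        default=defaults.get('database', "wiki"))
    parser.add_argument('--user', '-u',
                        default=defaults.get('user', "anonymous"))
    return parser


# add_help=False frees up '-h' so it can be used for the host option.
parser = argparse.ArgumentParser(add_help=False)
add_arguments(parser, defaults={'database': "enwiki"})
args = parser.parse_args(['-h', "db.example.org", '-u', "reader"])
```

The resulting namespace carries `host`, `database` and `user` attributes, which is the shape `from_arguments` reads from.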
@@ -0,0 +1,14 @@
|
||||
"""
|
||||
A package with utilities for managing the persistent word analysis across text
|
||||
versions of a document. `PersistenceState` is the highest level of the
|
||||
interface and the part of the system that's most interesting externally. `Word`s
|
||||
are also very important. The current implementation of `Word` only accounts for
the number of revisions in which a Word is visible. If persistent word
views (or something similar) are intended to be kept, refactoring will be
|
||||
necessary.
|
||||
"""
|
||||
|
||||
from .state import State
|
||||
from .tokens import Tokens, Token
|
||||
from . import defaults
|
||||
from . import api
|
||||
@@ -0,0 +1,85 @@
|
||||
from .. import reverts
|
||||
from ...util import none_or
|
||||
from .state import State
|
||||
|
||||
|
||||
def track(session, rev_id, page_id=None, revert_radius=reverts.defaults.RADIUS,
|
||||
future_revisions=reverts.defaults.RADIUS, properties=None):
|
||||
"""
|
||||
Computes a persistence score for a revision by processing the revisions
|
||||
that took place around it.
|
||||
|
||||
:Parameters:
|
||||
session : :class:`mw.api.Session`
|
||||
An API session to make use of
|
||||
rev_id : int
|
||||
the ID of the revision to check
|
||||
page_id : int
|
||||
the ID of the page the revision occupies (slower if not provided)
|
||||
revert_radius : int
|
||||
a positive integer indicating the maximum number of revisions that can be reverted
|
||||
"""
|
||||
|
||||
if not hasattr(session, "revisions"):
|
||||
raise TypeError("session is wrong type. Expected a mw.api.Session.")
|
||||
|
||||
rev_id = int(rev_id)
|
||||
page_id = none_or(page_id, int)
|
||||
revert_radius = int(revert_radius)
|
||||
if revert_radius < 1:
|
||||
raise TypeError("invalid radius. Expected a positive integer.")
|
||||
properties = set(properties) if properties is not None else set()
|
||||
|
||||
|
||||
# If we don't have the page_id, we're going to need to look it up
|
||||
if page_id is None:
|
||||
rev = session.revisions.get(rev_id, properties={'ids'})
|
||||
page_id = rev['page']['pageid']
|
||||
|
||||
# Load history and current rev
|
||||
current_and_past_revs = list(session.revisions.query(
|
||||
pageids={page_id},
|
||||
limit=revert_radius + 1,
|
||||
start_id=rev_id,
|
||||
direction="older",
|
||||
properties={'ids', 'timestamp', 'content', 'sha1'} | properties
|
||||
))
|
||||
|
||||
try:
|
||||
# Extract current rev and reorder history
|
||||
current_rev, past_revs = (
|
||||
current_and_past_revs[0], # Current rev is the first one returned
|
||||
reversed(current_and_past_revs[1:]) # The rest are past revs, but they are in the wrong order
|
||||
)
|
||||
except IndexError:
|
||||
# Only way to get here is if there isn't enough history. Couldn't be
|
||||
# reverted. Just return None.
|
||||
return None
|
||||
|
||||
# Load future revisions
|
||||
future_revs = session.revisions.query(
|
||||
pageids={page_id},
|
||||
limit=future_revisions,
|
||||
start_id=rev_id + 1, # Ensures that we skip the current revision
|
||||
direction="newer",
|
||||
properties={'ids', 'timestamp', 'content', 'sha1'} | properties
|
||||
)
|
||||
|
||||
state = State(revert_radius=revert_radius)
|
||||
|
||||
# Process old revisions
|
||||
for rev in past_revs:
|
||||
state.process(rev.get('*', ""), rev, rev.get('sha1'))
|
||||
|
||||
# Process current revision
|
||||
_, tokens_added, _ = state.process(current_rev.get('*', ""), current_rev,
|
||||
current_rev.get('sha1'))
|
||||
|
||||
# Process new revisions
|
||||
future_revs = list(future_revs)
|
||||
for rev in future_revs:
|
||||
state.process(rev.get('*', ""), rev, rev.get('sha1'))
|
||||
|
||||
return current_rev, tokens_added, future_revs
|
||||
|
||||
score = track
|
||||
@@ -0,0 +1,11 @@
|
||||
from . import tokenization, difference
|
||||
|
||||
TOKENIZE = tokenization.wikitext_split
|
||||
"""
|
||||
The standard tokenizing function.
|
||||
"""
|
||||
|
||||
DIFF = difference.sequence_matcher
|
||||
"""
|
||||
The standard diff function.
|
||||
"""
|
||||
@@ -0,0 +1,49 @@
|
||||
from difflib import SequenceMatcher
|
||||
|
||||
|
||||
def sequence_matcher(old, new):
|
||||
"""
|
||||
Generates a sequence of operations using :class:`difflib.SequenceMatcher`.
|
||||
|
||||
:Parameters:
|
||||
old : list( `hashable` )
|
||||
Old tokens
|
||||
new : list( `hashable` )
|
||||
New tokens
|
||||
|
||||
:Returns:
|
||||
Minimal operations needed to convert `old` to `new`
|
||||
"""
|
||||
sm = SequenceMatcher(None, list(old), list(new))
|
||||
return sm.get_opcodes()
|
||||
|
||||
|
||||
def apply(ops, old, new):
|
||||
"""
|
||||
Applies operations (delta) to copy items from `old` to `new`.
|
||||
|
||||
:Parameters:
|
||||
ops : list((op, a1, a2, b1, b2))
|
||||
Operations to perform
|
||||
old : list( `hashable` )
|
||||
Old tokens
|
||||
new : list( `hashable` )
|
||||
New tokens
|
||||
:Returns:
|
||||
An iterator over elements matching `new` but copied from `old`
|
||||
"""
|
||||
for code, a_start, a_end, b_start, b_end in ops:
|
||||
if code == "insert":
|
||||
for t in new[b_start:b_end]:
|
||||
yield t
|
||||
elif code == "replace":
|
||||
for t in new[b_start:b_end]:
|
||||
yield t
|
||||
elif code == "equal":
|
||||
for t in old[a_start:a_end]:
|
||||
yield t
|
||||
elif code == "delete":
|
||||
pass
|
||||
else:
|
||||
assert False, \
|
||||
"encountered an unrecognized operation code: " + repr(code)
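As a rough illustration of the two helpers above, the sketch below diffs two small token lists with `difflib.SequenceMatcher` and rebuilds the new sequence from the resulting opcodes. The names `rebuild`, `old_tokens` and `new_tokens` are made up for this example and are not part of the library.

```python
from difflib import SequenceMatcher


def rebuild(ops, old, new):
    """Replay opcodes: unchanged items come from `old`, new items from `new`."""
    result = []
    for code, a_start, a_end, b_start, b_end in ops:
        if code in ("insert", "replace"):
            result.extend(new[b_start:b_end])
        elif code == "equal":
            result.extend(old[a_start:a_end])
        elif code != "delete":
            raise ValueError("unrecognized operation code: " + repr(code))
    return result


# Hypothetical token lists, already split into words and whitespace.
old_tokens = ["Apples", " ", "are", " ", "red", "."]
new_tokens = ["Apples", " ", "are", " ", "blue", "."]

ops = SequenceMatcher(None, old_tokens, new_tokens).get_opcodes()
rebuilt = rebuild(ops, old_tokens, new_tokens)
```

Replaying the opcodes should always reproduce the new sequence, which is the property the unit test for `sequence_matcher` relies on.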
|
||||
@@ -0,0 +1,149 @@
|
||||
from hashlib import sha1
|
||||
|
||||
from . import defaults
|
||||
from .. import reverts
|
||||
from .tokens import Token, Tokens
|
||||
|
||||
|
||||
class Version:
|
||||
__slots__ = ('tokens',)
|
||||
|
||||
def __init__(self):
|
||||
self.tokens = None
|
||||
|
||||
|
||||
class State:
|
||||
"""
|
||||
Represents the state of word persistence in a page.
|
||||
See `<https://meta.wikimedia.org/wiki/Research:Content_persistence>`_
|
||||
|
||||
:Parameters:
|
||||
tokenize : function( `str` ) --> list( `str` )
|
||||
A tokenizing function
|
||||
diff : function(list( `str` ), list( `str` )) --> list( `ops` )
|
||||
A function to perform a difference between token lists
|
||||
revert_radius : int
|
||||
a positive integer indicating the maximum revision distance that a revert can span.
|
||||
revert_detector : :class:`mw.lib.reverts.Detector`
|
||||
a revert detector to start process with
|
||||
:Example:
|
||||
>>> from pprint import pprint
|
||||
>>> from mw.lib import persistence
|
||||
>>>
|
||||
>>> state = persistence.State()
|
||||
>>>
|
||||
>>> pprint(state.process("Apples are red.", revision=1))
|
||||
([Token(text='Apples', revisions=[1]),
|
||||
Token(text=' ', revisions=[1]),
|
||||
Token(text='are', revisions=[1]),
|
||||
Token(text=' ', revisions=[1]),
|
||||
Token(text='red', revisions=[1]),
|
||||
Token(text='.', revisions=[1])],
|
||||
[Token(text='Apples', revisions=[1]),
|
||||
Token(text=' ', revisions=[1]),
|
||||
Token(text='are', revisions=[1]),
|
||||
Token(text=' ', revisions=[1]),
|
||||
Token(text='red', revisions=[1]),
|
||||
Token(text='.', revisions=[1])],
|
||||
[])
|
||||
>>> pprint(state.process("Apples are blue.", revision=2))
|
||||
([Token(text='Apples', revisions=[1, 2]),
|
||||
Token(text=' ', revisions=[1, 2]),
|
||||
Token(text='are', revisions=[1, 2]),
|
||||
Token(text=' ', revisions=[1, 2]),
|
||||
Token(text='blue', revisions=[2]),
|
||||
Token(text='.', revisions=[1, 2])],
|
||||
[Token(text='blue', revisions=[2])],
|
||||
[Token(text='red', revisions=[1])])
|
||||
>>> pprint(state.process("Apples are red.", revision=3)) # A revert!
|
||||
([Token(text='Apples', revisions=[1, 2, 3]),
|
||||
Token(text=' ', revisions=[1, 2, 3]),
|
||||
Token(text='are', revisions=[1, 2, 3]),
|
||||
Token(text=' ', revisions=[1, 2, 3]),
|
||||
Token(text='red', revisions=[1, 3]),
|
||||
Token(text='.', revisions=[1, 2, 3])],
|
||||
[],
|
||||
[])
|
||||
"""
|
||||
|
||||
def __init__(self, tokenize=defaults.TOKENIZE, diff=defaults.DIFF,
|
||||
revert_radius=reverts.defaults.RADIUS,
|
||||
revert_detector=None):
|
||||
self.tokenize = tokenize
|
||||
self.diff = diff
|
||||
|
||||
# Either pass a detector or the revert radius so I can make one
|
||||
if revert_detector is None:
|
||||
self.revert_detector = reverts.Detector(int(revert_radius))
|
||||
else:
|
||||
self.revert_detector = revert_detector
|
||||
|
||||
# Stores the last tokens
|
||||
self.last = None
|
||||
|
||||
def process(self, text, revision=None, checksum=None):
|
||||
"""
|
||||
Modifies the internal state based on a change to the content and returns
|
||||
the sets of words added and removed.
|
||||
|
||||
:Parameters:
|
||||
text : str
|
||||
The text content of a revision
|
||||
revision : `mixed`
|
||||
Revision meta data
|
||||
checksum : str
|
||||
A checksum hash of the text content (will be generated if not provided)
|
||||
|
||||
:Returns:
|
||||
Three :class:`~mw.lib.persistence.Tokens` lists
|
||||
|
||||
current_tokens : :class:`~mw.lib.persistence.Tokens`
|
||||
A sequence of :class:`~mw.lib.persistence.Token` for the
|
||||
processed revision
|
||||
tokens_added : :class:`~mw.lib.persistence.Tokens`
|
||||
A set of tokens that were inserted by the processed revision
|
||||
tokens_removed : :class:`~mw.lib.persistence.Tokens`
|
||||
A sequence of :class:`~mw.lib.persistence.Token` removed by the
|
||||
processed revision
|
||||
|
||||
"""
|
||||
if checksum is None:
|
||||
checksum = sha1(bytes(text, 'utf8')).hexdigest()
|
||||
|
||||
version = Version()
|
||||
|
||||
revert = self.revert_detector.process(checksum, version)
|
||||
if revert is not None: # Revert
|
||||
|
||||
# Empty words.
|
||||
tokens_added = Tokens()
|
||||
tokens_removed = Tokens()
|
||||
|
||||
# Extract reverted_to revision
|
||||
_, _, reverted_to = revert
|
||||
version.tokens = reverted_to.tokens
|
||||
|
||||
else:
|
||||
|
||||
if self.last is None: # First version of the page!
|
||||
|
||||
version.tokens = Tokens(Token(t) for t in self.tokenize(text))
|
||||
tokens_added = version.tokens
|
||||
tokens_removed = Tokens()
|
||||
|
||||
else:
|
||||
|
||||
# NOTICE: HEAVY COMPUTATION HERE!!!
|
||||
#
|
||||
# OK. It's not that heavy. It's just performing a diff,
|
||||
# but you're still going to spend most of your time here.
|
||||
# Diffs usually run in O(n^2) -- O(n^3) time and most tokenizers
|
||||
# produce a lot of tokens.
|
||||
version.tokens, tokens_added, tokens_removed = \
|
||||
self.last.tokens.compare(self.tokenize(text), self.diff)
|
||||
|
||||
version.tokens.persist(revision)
|
||||
|
||||
self.last = version
|
||||
|
||||
return version.tokens, tokens_added, tokens_removed
|
||||
@@ -0,0 +1,12 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from .. import difference
|
||||
|
||||
|
||||
def test_sequence_matcher():
|
||||
t1 = "foobar derp hepl derpl"
|
||||
t2 = "fooasldal 3 hepl asl a derpl"
|
||||
|
||||
ops = difference.sequence_matcher(t1, t2)
|
||||
|
||||
eq_("".join(difference.apply(ops, t1, t2)), t2)
|
||||
@@ -0,0 +1,25 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from ..state import State
|
||||
|
||||
|
||||
def test_state():
|
||||
contents_revisions = [
|
||||
("Apples are red.", 0),
|
||||
("Apples are blue.", 1),
|
||||
("Apples are red.", 2),
|
||||
("Apples are tasty and red.", 3),
|
||||
("Apples are tasty and blue.", 4)
|
||||
]
|
||||
|
||||
state = State()
|
||||
|
||||
token_sets = [state.process(c, r) for c, r in contents_revisions]
|
||||
|
||||
for i, (content, revision) in enumerate(contents_revisions):
|
||||
eq_("".join(token_sets[i][0].texts()), content)
|
||||
|
||||
eq_(token_sets[0][0][0].text, "Apples")
|
||||
eq_(len(token_sets[0][0][0].revisions), 5)
|
||||
eq_(token_sets[0][0][4].text, "red")
|
||||
eq_(len(token_sets[0][0][4].revisions), 3)
|
||||
@@ -0,0 +1,10 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from .. import tokenization
|
||||
|
||||
|
||||
def test_wikitext_split():
|
||||
eq_(
|
||||
list(tokenization.wikitext_split("foo bar herp {{derp}}")),
|
||||
["foo", " ", "bar", " ", "herp", " ", "{{", "derp", "}}"]
|
||||
)
|
||||
@@ -0,0 +1,16 @@
|
||||
import re
|
||||
|
||||
|
||||
def wikitext_split(text):
|
||||
"""
|
||||
Performs the simplest possible split of Latin-character-based languages
|
||||
and wikitext.
|
||||
|
||||
:Parameters:
|
||||
text : str
|
||||
Text to split.
|
||||
"""
|
||||
return re.findall(
|
||||
r"[\w]+|\[\[|\]\]|\{\{|\}\}|\n+| +|&\w+;|'''|''|=+|\{\||\|\}|\|\-|.",
|
||||
text
|
||||
)
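To make the behaviour of the pattern above concrete, here is a small sketch that applies the same regular expression to a short piece of wikitext. The constant name `WIKITEXT_PATTERN` and the sample string are assumptions made for this example only.

```python
import re

# Same pattern as wikitext_split above, copied so the snippet runs on its own.
WIKITEXT_PATTERN = (
    r"[\w]+|\[\[|\]\]|\{\{|\}\}|\n+| +|&\w+;|'''|''|=+|\{\||\|\}|\|\-|."
)

sample = "== Title ==\n'''Bold''' [[Link]]"
tokens = re.findall(WIKITEXT_PATTERN, sample)

# Every character is covered by some alternative, so joining the tokens
# gives back the original text.
roundtrip = "".join(tokens)
```

Because the alternatives are tried left to right, multi-character markup such as `'''` or `[[` is kept together as one token before the single-character fallback `.` is reached.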
|
||||
@@ -0,0 +1,98 @@
|
||||
class Token:
|
||||
"""
|
||||
Represents a chunk of text and the revisions of a page that it survived.
|
||||
"""
|
||||
__slots__ = ('text', 'revisions')
|
||||
|
||||
def __init__(self, text, revisions=None):
|
||||
self.text = text
|
||||
"""
|
||||
The text of the token.
|
||||
"""
|
||||
|
||||
self.revisions = revisions if revisions is not None else []
|
||||
"""
|
||||
The meta data for the revisions that the token has appeared within.
|
||||
"""
|
||||
|
||||
def persist(self, revision):
|
||||
self.revisions.append(revision)
|
||||
|
||||
def __repr__(self):
|
||||
return "{0}({1})".format(
|
||||
self.__class__.__name__,
|
||||
", ".join([
|
||||
"text={0}".format(repr(self.text)),
|
||||
"revisions={0}".format(repr(self.revisions))
|
||||
])
|
||||
)
|
||||
|
||||
|
||||
class Tokens(list):
|
||||
"""
|
||||
Represents a :class:`list` of :class:`~mw.lib.persistence.Token` with some
|
||||
useful helper functions.
|
||||
|
||||
:Example:
|
||||
|
||||
>>> from mw.lib.persistence import Token, Tokens
|
||||
>>>
|
||||
>>> tokens = Tokens()
|
||||
>>> tokens.append(Token("foo"))
|
||||
>>> tokens.extend([Token(" "), Token("bar")])
|
||||
>>>
|
||||
>>> tokens[0]
|
||||
Token(text='foo', revisions=[])
|
||||
>>>
|
||||
>>> "".join(tokens.texts())
|
||||
'foo bar'
|
||||
"""
|
||||
|
||||
def __init__(self, *args, **kwargs):
|
||||
super().__init__(*args, **kwargs)
|
||||
|
||||
def persist(self, revision):
|
||||
for token in self:
|
||||
token.persist(revision)
|
||||
|
||||
def texts(self):
|
||||
for token in self:
|
||||
yield token.text
|
||||
|
||||
def compare(self, new, diff):
|
||||
old = self.texts()
|
||||
|
||||
return self.apply_diff(diff(old, new), self, new)
|
||||
|
||||
@classmethod
|
||||
def apply_diff(cls, ops, old, new):
|
||||
|
||||
tokens = cls()
|
||||
tokens_added = cls()
|
||||
tokens_removed = cls()
|
||||
|
||||
for code, a_start, a_end, b_start, b_end in ops:
|
||||
if code == "insert":
|
||||
for token_text in new[b_start:b_end]:
|
||||
token = Token(token_text)
|
||||
tokens.append(token)
|
||||
tokens_added.append(token)
|
||||
|
||||
elif code == "replace":
|
||||
for token_text in new[b_start:b_end]:
|
||||
token = Token(token_text)
|
||||
tokens.append(token)
|
||||
tokens_added.append(token)
|
||||
|
||||
tokens_removed.extend(t for t in old[a_start:a_end])
|
||||
|
||||
elif code == "equal":
|
||||
tokens.extend(old[a_start:a_end])
|
||||
elif code == "delete":
|
||||
tokens_removed.extend(old[a_start:a_end])
|
||||
|
||||
else:
|
||||
assert False, \
|
||||
"encountered an unrecognized operation code: " + repr(code)
|
||||
|
||||
return (tokens, tokens_added, tokens_removed)
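The idea behind `apply_diff` can be sketched in a self-contained way: token objects from the old revision are reused wherever the diff reports an "equal" span, so their revision lists keep growing, while inserted text gets fresh objects. `MiniToken` and `compare` below are hypothetical stand-ins written for this example, not the library's classes.

```python
from difflib import SequenceMatcher


class MiniToken:
    """Hypothetical stand-in for Token: text plus the revisions it survived."""

    def __init__(self, text):
        self.text = text
        self.revisions = []


def compare(old_tokens, new_texts):
    """Return (current, added, removed), reusing old objects for unchanged text."""
    old_texts = [t.text for t in old_tokens]
    ops = SequenceMatcher(None, old_texts, new_texts).get_opcodes()
    current, added, removed = [], [], []
    for code, a_start, a_end, b_start, b_end in ops:
        if code in ("insert", "replace"):
            fresh = [MiniToken(text) for text in new_texts[b_start:b_end]]
            current.extend(fresh)
            added.extend(fresh)
        if code in ("delete", "replace"):
            removed.extend(old_tokens[a_start:a_end])
        if code == "equal":
            current.extend(old_tokens[a_start:a_end])
    return current, added, removed


# Revision 1: every token is new.
rev1 = [MiniToken(t) for t in ["Apples", " ", "are", " ", "red", "."]]
for token in rev1:
    token.revisions.append(1)

# Revision 2: "red" is replaced by "blue"; everything else persists.
rev2, added, removed = compare(rev1, ["Apples", " ", "are", " ", "blue", "."])
for token in rev2:
    token.revisions.append(2)
```

Sharing objects between revisions is what lets a single `persist` call on the current token list update the history of every surviving token.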
|
||||
@@ -0,0 +1,24 @@
|
||||
"""
|
||||
This module provides a set of utilities for detecting identity reverts in
|
||||
revisioned content.
|
||||
|
||||
To detect reverts in a stream of revisions to a single page, you can use
|
||||
:func:`detect`. If you'll be detecting reverts in a collection of pages or
|
||||
would, for some other reason, prefer to process revisions one at a time,
|
||||
:class:`Detector` and its :meth:`~Detector.process` will allow you to do so.
|
||||
|
||||
To detect reverts one at a time and arbitrarily, you can use the `check()`
|
||||
functions:
|
||||
|
||||
* :func:`database.check` and :func:`database.check_row` use a :class:`mw.database.DB`
|
||||
* :func:`api.check` and :func:`api.check_rev` use a :class:`mw.api.Session`
|
||||
|
||||
Note that these functions are less performant than detecting reverts in a
|
||||
stream of page revisions. They can still be practical when trying to identify
|
||||
reverted revisions in a user's contribution history.
|
||||
"""
|
||||
from .detector import Detector, Revert
|
||||
from .functions import detect, reverts
|
||||
from . import database
|
||||
from . import api
|
||||
from . import defaults
|
||||
134
mediawiki_dump_tools/Mediawiki-Utilities/mw/lib/reverts/api.py
Normal file
@@ -0,0 +1,134 @@
|
||||
from itertools import chain
|
||||
|
||||
from . import defaults
|
||||
from ...types import Timestamp
|
||||
from ...util import none_or
|
||||
from .dummy_checksum import DummyChecksum
|
||||
from .functions import detect
|
||||
|
||||
|
||||
def check_rev(session, rev, **kwargs):
|
||||
"""
|
||||
Checks whether a revision (API revision dict) was reverted (identity) and returns
|
||||
a named tuple of Revert(reverting, reverteds, reverted_to).
|
||||
|
||||
:Parameters:
|
||||
session : :class:`mw.api.Session`
|
||||
An API session to make use of
|
||||
rev : dict
|
||||
a revision dict containing 'revid' and either 'page' (with an 'id') or 'pageid'
|
||||
radius : int
|
||||
a positive integer indicating the maximum number of revisions that can be reverted
|
||||
before : :class:`mw.Timestamp`
|
||||
if set, limits the search for *reverting* revisions to those which were saved before this timestamp
|
||||
properties : set( str )
|
||||
a set of properties to include in revisions (see :class:`mw.api.Revisions`)
|
||||
"""
|
||||
|
||||
# extract rev_id, sha1, page_id
|
||||
if 'revid' in rev:
|
||||
rev_id = rev['revid']
|
||||
else:
|
||||
raise TypeError("rev must have 'revid'")
|
||||
if 'page' in rev:
|
||||
page_id = rev['page']['id']
|
||||
elif 'pageid' in rev:
|
||||
page_id = rev['pageid']
|
||||
else:
|
||||
raise TypeError("rev must have 'page' or 'pageid'")
|
||||
|
||||
# run the regular check
|
||||
return check(session, rev_id, page_id=page_id, **kwargs)
|
||||
|
||||
|
||||
def check(session, rev_id, page_id=None, radius=defaults.RADIUS,
|
||||
before=None, window=None, properties=None):
|
||||
"""
|
||||
Checks whether a revision was reverted (identity) and returns a named tuple
|
||||
of Revert(reverting, reverteds, reverted_to).
|
||||
|
||||
:Parameters:
|
||||
session : :class:`mw.api.Session`
|
||||
An API session to make use of
|
||||
rev_id : int
|
||||
the ID of the revision to check
|
||||
page_id : int
|
||||
the ID of the page the revision occupies (slower if not provided)
|
||||
radius : int
|
||||
a positive integer indicating the maximum number of revisions
|
||||
that can be reverted
|
||||
before : :class:`mw.Timestamp`
|
||||
if set, limits the search for *reverting* revisions to those which
|
||||
were saved before this timestamp
|
||||
window : int
|
||||
if set, limits the search for *reverting* revisions to those which
|
||||
were saved within `window` seconds after the reverted edit
|
||||
properties : set( str )
|
||||
a set of properties to include in revisions (see :class:`mw.api.Revisions`)
|
||||
"""
|
||||
|
||||
if not hasattr(session, "revisions"):
|
||||
raise TypeError("session wrong type. Expected a mw.api.Session.")
|
||||
|
||||
rev_id = int(rev_id)
|
||||
radius = int(radius)
|
||||
if radius < 1:
|
||||
raise TypeError("invalid radius. Expected a positive integer.")
|
||||
|
||||
page_id = none_or(page_id, int)
|
||||
before = none_or(before, Timestamp)
|
||||
properties = set(properties) if properties is not None else set()
|
||||
|
||||
# If we don't have the page_id, we're going to need to look it up
|
||||
if page_id is None:
|
||||
rev = session.revisions.get(rev_id, properties={'ids'})
|
||||
page_id = rev['page']['pageid']
|
||||
|
||||
# Load history and current rev
|
||||
current_and_past_revs = list(session.revisions.query(
|
||||
pageids={page_id},
|
||||
limit=radius + 1,
|
||||
start_id=rev_id,
|
||||
direction="older",
|
||||
properties={'ids', 'timestamp', 'sha1'} | properties
|
||||
))
|
||||
|
||||
try:
|
||||
# Extract current rev and reorder history
|
||||
current_rev, past_revs = (
|
||||
current_and_past_revs[0], # Current rev is the first one returned
|
||||
reversed(current_and_past_revs[1:]) # The rest are past revs, but they are in the wrong order
|
||||
)
|
||||
except IndexError:
|
||||
# Only way to get here is if there isn't enough history. Couldn't be
|
||||
# reverted. Just return None.
|
||||
return None
|
||||
|
||||
if window is not None and before is None:
|
||||
before = Timestamp(current_rev['timestamp']) + window
|
||||
|
||||
# Load future revisions
|
||||
future_revs = session.revisions.query(
|
||||
pageids={page_id},
|
||||
limit=radius,
|
||||
start_id=rev_id + 1, # Ensures that we skip the current revision
|
||||
end=before,
|
||||
direction="newer",
|
||||
properties={'ids', 'timestamp', 'sha1'} | properties
|
||||
)
|
||||
|
||||
# Convert to an iterable of (checksum, rev) pairs for detect() to consume
|
||||
checksum_revisions = chain(
|
||||
((rev['sha1'] if 'sha1' in rev else DummyChecksum(), rev)
|
||||
for rev in past_revs),
|
||||
[(current_rev.get('sha1', DummyChecksum()), current_rev)],
|
||||
((rev['sha1'] if 'sha1' in rev else DummyChecksum(), rev)
|
||||
for rev in future_revs),
|
||||
)
|
||||
|
||||
for revert in detect(checksum_revisions, radius=radius):
|
||||
# Check that this is a relevant revert
|
||||
if rev_id in [rev['revid'] for rev in revert.reverteds]:
|
||||
return revert
|
||||
|
||||
return None
|
||||
@@ -0,0 +1,148 @@
|
||||
import random
|
||||
from itertools import chain
|
||||
|
||||
from . import defaults
|
||||
from ...types import Timestamp
|
||||
from ...util import none_or
|
||||
from .dummy_checksum import DummyChecksum
|
||||
from .functions import detect
|
||||
|
||||
HEX = "1234567890abcdef"
|
||||
|
||||
def random_sha1():
|
||||
return ''.join(random.choice(HEX) for i in range(40))
|
||||
|
||||
"""
|
||||
Simple constant used in order to not do weird things with a dummy revision.
|
||||
"""
|
||||
|
||||
|
||||
def check_row(db, rev_row, **kwargs):
|
||||
"""
|
||||
Checks whether a revision (database row) was reverted (identity) and returns
|
||||
a named tuple of Revert(reverting, reverteds, reverted_to).
|
||||
|
||||
:Parameters:
|
||||
db : :class:`mw.database.DB`
|
||||
A database connection to make use of.
|
||||
rev_row : dict
|
||||
a revision row containing 'rev_id' and 'rev_page' or 'page_id'
|
||||
radius : int
|
||||
a positive integer indicating the maximum number of revisions that can be reverted
|
||||
check_archive : bool
|
||||
should the archive table be checked for reverting revisions?
|
||||
before : `Timestamp`
|
||||
if set, limits the search for *reverting* revisions to those which were saved before this timestamp
|
||||
"""
|
||||
|
||||
# extract rev_id, sha1, page_id
|
||||
if 'rev_id' in rev_row:
|
||||
rev_id = rev_row['rev_id']
|
||||
else:
|
||||
raise TypeError("rev_row must have 'rev_id'")
|
||||
if 'page_id' in rev_row:
|
||||
page_id = rev_row['page_id']
|
||||
elif 'rev_page' in rev_row:
|
||||
page_id = rev_row['rev_page']
|
||||
else:
|
||||
raise TypeError("rev_row must have 'page_id' or 'rev_page'")
|
||||
|
||||
# run the regular check
|
||||
return check(db, rev_id, page_id=page_id, **kwargs)
|
||||
|
||||
|
||||
def check(db, rev_id, page_id=None, radius=defaults.RADIUS, check_archive=False,
|
||||
before=None, window=None):
|
||||
|
||||
"""
|
||||
Checks whether a revision was reverted (identity) and returns a named tuple
|
||||
of Revert(reverting, reverteds, reverted_to).
|
||||
|
||||
:Parameters:
|
||||
db : `mw.database.DB`
|
||||
A database connection to make use of.
|
||||
rev_id : int
|
||||
the ID of the revision to check
|
||||
page_id : int
|
||||
the ID of the page the revision occupies (slower if not provided)
|
||||
radius : int
|
||||
a positive integer indicating the maximum number of revisions that can be reverted
|
||||
check_archive : bool
|
||||
should the archive table be checked for reverting revisions?
|
||||
before : `Timestamp`
|
||||
if set, limits the search for *reverting* revisions to those which were saved before this timestamp
|
||||
window : int
|
||||
if set, limits the search for *reverting* revisions to those which
|
||||
were saved within `window` seconds after the reverted edit
|
||||
"""
|
||||
|
||||
if not (hasattr(db, "revisions") and hasattr(db, "all_revisions")):
|
||||
raise TypeError("db wrong type. Expected a mw.database.DB.")
|
||||
|
||||
rev_id = int(rev_id)
|
||||
radius = int(radius)
|
||||
if radius < 1:
|
||||
raise TypeError("invalid radius. Expected a positive integer.")
|
||||
page_id = none_or(page_id, int)
|
||||
check_archive = bool(check_archive)
|
||||
before = none_or(before, Timestamp)
|
||||
|
||||
# If we are searching the archive, we'll need to use `all_revisions`.
|
||||
if check_archive:
|
||||
dbrevs = db.all_revisions
|
||||
else:
|
||||
dbrevs = db.revisions
|
||||
|
||||
# If we don't have the page_id, we're going to need to look it up
|
||||
if page_id is None:
|
||||
row = dbrevs.get(rev_id=rev_id)
|
||||
page_id = row['rev_page']
|
||||
|
||||
# Load history and current rev
|
||||
current_and_past_revs = list(dbrevs.query(
|
||||
page_id=page_id,
|
||||
limit=radius + 1,
|
||||
before_id=rev_id + 1, # Ensures that we capture the current revision
|
||||
direction="older"
|
||||
))
|
||||
|
||||
try:
|
||||
# Extract current rev and reorder history
|
||||
current_rev, past_revs = (
|
||||
current_and_past_revs[0], # Current rev is the first one returned
|
||||
reversed(current_and_past_revs[1:]) # The rest are past revs, but they are in the wrong order
|
||||
)
|
||||
except IndexError:
|
||||
# Only way to get here is if there isn't enough history. Couldn't be
|
||||
# reverted. Just return None.
|
||||
return None
|
||||
|
||||
if window is not None and before is None:
|
||||
before = Timestamp(current_rev['rev_timestamp']) + window
|
||||
|
||||
# Load future revisions
|
||||
future_revs = dbrevs.query(
|
||||
page_id=page_id,
|
||||
limit=radius,
|
||||
after_id=rev_id,
|
||||
before=before,
|
||||
direction="newer"
|
||||
)
|
||||
|
||||
# Convert to an iterable of (checksum, rev) pairs for detect() to consume
|
||||
checksum_revisions = chain(
|
||||
((rev['rev_sha1'] if rev['rev_sha1'] is not None \
|
||||
else DummyChecksum(), rev)
|
||||
for rev in past_revs),
|
||||
[(current_rev['rev_sha1'] or DummyChecksum(), current_rev)],
|
||||
((rev['rev_sha1'] if rev['rev_sha1'] is not None \
|
||||
else DummyChecksum(), rev)
|
||||
for rev in future_revs)
|
||||
)
|
||||
|
||||
for revert in detect(checksum_revisions, radius=radius):
|
||||
# Check that this is a relevant revert
|
||||
if rev_id in [rev['rev_id'] for rev in revert.reverteds]:
|
||||
return revert
|
||||
|
||||
return None
|
||||
@@ -0,0 +1,24 @@
|
||||
RADIUS = 15
|
||||
"""
|
||||
TODO: Better documentation here. For the time being, see:
|
||||
|
||||
Priedhorsky, R., Chen, J., Lam, S. T. K., Panciera, K., Terveen, L., &
|
||||
Riedl, J. (2007, November). Creating, destroying, and restoring value in
|
||||
Wikipedia. In Proceedings of the 2007 international ACM conference on
|
||||
Supporting group work (pp. 259-268). ACM.
|
||||
"""
|
||||
|
||||
|
||||
class DUMMY_SHA1: pass
|
||||
"""
|
||||
Used when checking for reverts when the checksum of the revision of interest
|
||||
is unknown.
|
||||
|
||||
>>> DUMMY_SHA1 in {"aaa", "bbb"} # or any 40 character hex
|
||||
False
|
||||
>>>
|
||||
>>> DUMMY_SHA1 == DUMMY_SHA1
|
||||
True
|
||||
>>> {DUMMY_SHA1, DUMMY_SHA1}
|
||||
{<class '__main__.DUMMY_SHA1'>}
|
||||
"""
|
||||
@@ -0,0 +1,83 @@
|
||||
from collections import namedtuple
|
||||
|
||||
from ...util import ordered
|
||||
from . import defaults
|
||||
|
||||
Revert = namedtuple("Revert", ['reverting', 'reverteds', 'reverted_to'])
|
||||
"""
|
||||
Represents a revert event. This class behaves like
|
||||
:class:`collections.namedtuple`. Note that the datatypes of `reverting`,
|
||||
`reverteds` and `reverted_to` are not specified since those types will depend
|
||||
on the revision data provided during revert detection.
|
||||
|
||||
:Members:
|
||||
**reverting**
|
||||
The reverting revision data : `mixed`
|
||||
**reverteds**
|
||||
The reverted revision data (ordered from most to least recent) : list( `mixed` )
|
||||
**reverted_to**
|
||||
The reverted-to revision data : `mixed`
|
||||
"""
|
||||
|
||||
|
||||
class Detector(ordered.HistoricalMap):
|
||||
"""
|
||||
Detects revert events in a stream of revisions (to the same page) based on
|
||||
matching checksums. To detect reverts, construct an instance of this class and call
|
||||
:meth:`process` in chronological order (``direction == "newer"``).
|
||||
|
||||
See `<https://meta.wikimedia.org/wiki/R:Identity_revert>`_
|
||||
|
||||
:Parameters:
|
||||
radius : int
|
||||
a positive integer indicating the maximum revision distance that a revert can span.
|
||||
|
||||
:Example:
|
||||
>>> from mw.lib import reverts
|
||||
>>> detector = reverts.Detector()
|
||||
>>>
|
||||
>>> detector.process("aaa", {'rev_id': 1})
|
||||
>>> detector.process("bbb", {'rev_id': 2})
|
||||
>>> detector.process("aaa", {'rev_id': 3})
|
||||
Revert(reverting={'rev_id': 3}, reverteds=[{'rev_id': 2}], reverted_to={'rev_id': 1})
|
||||
>>> detector.process("ccc", {'rev_id': 4})
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self, radius=defaults.RADIUS):
|
||||
"""
|
||||
:Parameters:
|
||||
radius : int
|
||||
a positive integer indicating the maximum revision distance that a revert can span.
|
||||
"""
|
||||
if radius < 1:
|
||||
raise TypeError("invalid radius. Expected a positive integer.")
|
||||
super().__init__(maxlen=radius + 1)
|
||||
|
||||
def process(self, checksum, revision=None):
|
||||
"""
|
||||
Process a new revision and detect a revert if it occurred. Note that
|
||||
you can pass whatever you like as `revision` and it will be returned in
|
||||
the case that a revert occurs.
|
||||
|
||||
:Parameters:
|
||||
checksum : str
|
||||
Any identity-matchable string-based hash of revision content
|
||||
revision : `mixed`
|
||||
Revision meta data. Note that any data will just be returned in the
|
||||
case of a revert.
|
||||
|
||||
:Returns:
|
||||
a :class:`~mw.lib.reverts.Revert` if one occurred or `None`
|
||||
"""
|
||||
revert = None
|
||||
|
||||
if checksum in self: # potential revert
|
||||
|
||||
reverteds = list(self.up_to(checksum))
|
||||
|
||||
if len(reverteds) > 0: # If no reverted revisions, this is a noop
|
||||
revert = Revert(revision, reverteds, self[checksum])
|
||||
|
||||
self.insert(checksum, revision)
|
||||
return revert
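The detection logic can be approximated without `ordered.HistoricalMap`, whose implementation is not shown here. The sketch below keeps a bounded history in a `collections.deque` and walks it from newest to oldest; `SimpleDetector` is a hypothetical name and its behaviour is only assumed to mirror the class above for simple inputs like the docstring example.

```python
from collections import deque, namedtuple

Revert = namedtuple("Revert", ["reverting", "reverteds", "reverted_to"])


class SimpleDetector:
    """Illustrative identity-revert detector over a bounded revision history."""

    def __init__(self, radius=15):
        # Keep radius + 1 entries: up to `radius` reverted revisions plus
        # the revision that was reverted to.
        self.history = deque(maxlen=radius + 1)

    def process(self, checksum, revision=None):
        revert = None
        reverteds = []
        # Walk back in time until a revision with the same checksum is found.
        for past_checksum, past_revision in reversed(self.history):
            if past_checksum == checksum:
                if reverteds:  # identical to the previous revision: a no-op
                    revert = Revert(revision, reverteds, past_revision)
                break
            reverteds.append(past_revision)
        self.history.append((checksum, revision))
        return revert


detector = SimpleDetector(radius=2)
results = [
    detector.process(checksum, {"rev_id": rev_id})
    for rev_id, checksum in enumerate(["aaa", "bbb", "aaa", "ccc"], start=1)
]
```

Only the third call reports a revert here, since revision 3 restores the content of revision 1 and discards revision 2.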
|
||||
@@ -0,0 +1,24 @@
|
||||
class DummyChecksum():
|
||||
"""
|
||||
Used when checking for reverts when the checksum of the revision of interest
|
||||
is unknown. DummyChecksums won't match each other or anything else, but they
|
||||
will match themselves and they are hashable.
|
||||
|
||||
>>> dummy1 = DummyChecksum()
|
||||
>>> dummy1
|
||||
<#140687347334280>
|
||||
>>> dummy1 == dummy1
|
||||
True
|
||||
>>>
|
||||
>>> dummy2 = DummyChecksum()
|
||||
>>> dummy2
|
||||
<#140687347334504>
|
||||
>>> dummy1 == dummy2
|
||||
False
|
||||
>>>
|
||||
>>> {"foo", "bar", dummy1, dummy1, dummy2}
|
||||
{<#140687347334280>, 'foo', <#140687347334504>, 'bar'}
|
||||
"""
|
||||
|
||||
def __str__(self): return repr(self)
|
||||
def __repr__(self): return "<#" + str(id(self)) + ">"
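The properties described in the docstring follow from Python's default object identity semantics. A minimal sketch, using a hypothetical `PlaceholderChecksum` class that mirrors the one above, shows that two placeholders never collide with each other or with real checksum strings:

```python
class PlaceholderChecksum:
    """Stand-in checksum: equal only to itself, hashable by identity."""

    def __repr__(self):
        return "<#" + str(id(self)) + ">"

    def __str__(self):
        return repr(self)


first = PlaceholderChecksum()
second = PlaceholderChecksum()

# Default __eq__ and __hash__ compare by identity, so a set keeps both
# placeholders but collapses repeated references to the same one.
checksums = {"foo", "bar", first, first, second}
```

This is why revisions with a missing sha1 can be fed to a revert detector safely: a placeholder can never be mistaken for a match against another revision.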
|
||||
@@ -0,0 +1,46 @@
|
||||
from .detector import Detector
|
||||
from . import defaults
|
||||
|
||||
|
||||
def detect(checksum_revisions, radius=defaults.RADIUS):
|
||||
"""
|
||||
Detects reverts that occur in a sequence of revisions. Note that
|
||||
`revision` meta data will simply be returned in the case of a revert.
|
||||
|
||||
This function serves as a convenience wrapper around calls to
|
||||
:class:`Detector`'s :meth:`~Detector.process`
|
||||
method.
|
||||
|
||||
:Parameters:
|
||||
checksum_revisions : iter( ( checksum : str, revision : `mixed` ) )
|
||||
an iterable over tuples of checksum and revision meta data
|
||||
radius : int
|
||||
a positive integer indicating the maximum revision distance that a revert can span.
|
||||
|
||||
:Returns:
|
||||
an iterator over :class:`Revert`
|
||||
|
||||
:Example:
|
||||
>>> from mw.lib import reverts
|
||||
>>>
|
||||
>>> checksum_revisions = [
|
||||
... ("aaa", {'rev_id': 1}),
|
||||
... ("bbb", {'rev_id': 2}),
|
||||
... ("aaa", {'rev_id': 3}),
|
||||
... ("ccc", {'rev_id': 4})
|
||||
... ]
|
||||
>>>
|
||||
>>> list(reverts.detect(checksum_revisions))
|
||||
[Revert(reverting={'rev_id': 3}, reverteds=[{'rev_id': 2}], reverted_to={'rev_id': 1})]
|
||||
|
||||
"""
|
||||
|
||||
revert_detector = Detector(radius)
|
||||
|
||||
for checksum, revision in checksum_revisions:
|
||||
revert = revert_detector.process(checksum, revision)
|
||||
if revert is not None:
|
||||
yield revert
|
||||
|
||||
# For backwards compatibility
|
||||
reverts = detect
|
||||
@@ -0,0 +1,33 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from ..detector import Detector
|
||||
|
||||
|
||||
def test_detector():
|
||||
detector = Detector(2)
|
||||
|
||||
eq_(detector.process("a", {'id': 1}), None)
|
||||
|
||||
# Check noop
|
||||
eq_(detector.process("a", {'id': 2}), None)
|
||||
|
||||
# Short revert
|
||||
eq_(detector.process("b", {'id': 3}), None)
|
||||
eq_(
|
||||
detector.process("a", {'id': 4}),
|
||||
({'id': 4}, [{'id': 3}], {'id': 2})
|
||||
)
|
||||
|
||||
# Medium revert
|
||||
eq_(detector.process("c", {'id': 5}), None)
|
||||
eq_(detector.process("d", {'id': 6}), None)
|
||||
eq_(
|
||||
detector.process("a", {'id': 7}),
|
||||
({'id': 7}, [{'id': 6}, {'id': 5}], {'id': 4})
|
||||
)
|
||||
|
||||
# Long (undetected) revert
|
||||
eq_(detector.process("e", {'id': 8}), None)
|
||||
eq_(detector.process("f", {'id': 9}), None)
|
||||
eq_(detector.process("g", {'id': 10}), None)
|
||||
eq_(detector.process("a", {'id': 11}), None)
|
||||
@@ -0,0 +1,23 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from ..functions import reverts
|
||||
|
||||
|
||||
def test_reverts():
|
||||
checksum_revisions = [
|
||||
("a", {'id': 1}),
|
||||
("b", {'id': 2}),
|
||||
("c", {'id': 3}),
|
||||
("a", {'id': 4}),
|
||||
("d", {'id': 5}),
|
||||
("b", {'id': 6}),
|
||||
("a", {'id': 7})
|
||||
]
|
||||
|
||||
expected = [
|
||||
({'id': 4}, [{'id': 3}, {'id': 2}], {'id': 1}),
|
||||
({'id': 7}, [{'id': 6}, {'id': 5}], {'id': 4})
|
||||
]
|
||||
|
||||
for revert in reverts(checksum_revisions, radius=2):
|
||||
eq_(revert, expected.pop(0))
|
||||
@@ -0,0 +1,4 @@
|
||||
from .functions import cluster, sessions
|
||||
from .event import Event
|
||||
from .cache import Cache, Session
|
||||
from . import defaults
|
||||
@@ -0,0 +1,121 @@
|
||||
import logging
|
||||
from collections import namedtuple
|
||||
|
||||
from ...util import Heap
|
||||
from ...types import Timestamp
|
||||
from . import defaults
|
||||
from .event import Event, unpack_events
|
||||
|
||||
|
||||
logger = logging.getLogger("mw.lib.sessions.cache")
|
||||
|
||||
Session = namedtuple("Session", ["user", "events"])
|
||||
"""
|
||||
Represents a user session (a cluster over events for a user). This class
|
||||
behaves like :class:`collections.namedtuple`. Note that the datatype of the
|
||||
items in `events` is not specified since it will depend on the event data
|
||||
provided during session clustering.
|
||||
|
||||
:Members:
|
||||
**user**
|
||||
A hashable user identifier : `hashable`
|
||||
**events**
|
||||
A list of event data : list( `mixed` )
|
||||
"""
|
||||
|
||||
|
||||
class Cache:
|
||||
"""
|
||||
A cache of recent user sessions. Since sessions expire once activities stop
|
||||
for at least `cutoff` seconds, this class manages a cache of *active*
|
||||
sessions.
|
||||
|
||||
:Parameters:
|
||||
cutoff : int
|
||||
Maximum amount of time in seconds between session events
|
||||
|
||||
:Example:
|
||||
>>> from mw.lib import sessions
|
||||
>>>
|
||||
>>> cache = sessions.Cache(cutoff=3600)
|
||||
>>>
|
||||
>>> list(cache.process("Willy on wheels", 100000, {'rev_id': 1}))
|
||||
[]
|
||||
>>> list(cache.process("Walter", 100001, {'rev_id': 2}))
|
||||
[]
|
||||
>>> list(cache.process("Willy on wheels", 100001, {'rev_id': 3}))
|
||||
[]
|
||||
>>> list(cache.process("Walter", 100035, {'rev_id': 4}))
|
||||
[]
|
||||
>>> list(cache.process("Willy on wheels", 103602, {'rev_id': 5}))
|
||||
[Session(user='Willy on wheels', events=[{'rev_id': 1}, {'rev_id': 3}])]
|
||||
>>> list(cache.get_active_sessions())
|
||||
[Session(user='Walter', events=[{'rev_id': 2}, {'rev_id': 4}]), Session(user='Willy on wheels', events=[{'rev_id': 5}])]
|
||||
|
||||
|
||||
"""
|
||||
|
||||
def __init__(self, cutoff=defaults.CUTOFF):
|
||||
self.cutoff = int(cutoff)
|
||||
|
||||
self.recently_active = Heap()
|
||||
self.active_users = {}
|
||||
|
||||
def process(self, user, timestamp, data=None):
|
||||
"""
|
||||
Processes a user event.
|
||||
|
||||
:Parameters:
|
||||
user : `hashable`
|
||||
A hashable value to identify a user (`int` or `str` are OK)
|
||||
timestamp : :class:`mw.Timestamp`
|
||||
The timestamp of the event
|
||||
data : `mixed`
|
||||
Event meta data
|
||||
|
||||
:Returns:
|
||||
A generator of :class:`~mw.lib.sessions.Session` expired after
|
||||
processing the user event.
|
||||
"""
|
||||
event = Event(user, Timestamp(timestamp), data)
|
||||
|
||||
for user, events in self._clear_expired(event.timestamp):
|
||||
yield Session(user, unpack_events(events))
|
||||
|
||||
# Apply event
|
||||
if event.user in self.active_users:
|
||||
events = self.active_users[event.user]
|
||||
else:
|
||||
events = []
|
||||
self.active_users[event.user] = events
|
||||
self.recently_active.push((event.timestamp, events))
|
||||
|
||||
events.append(event)
|
||||
|
||||
def get_active_sessions(self):
|
||||
"""
|
||||
Retrieves the active, unexpired sessions.
|
||||
|
||||
:Returns:
|
||||
A generator of :class:`~mw.lib.sessions.Session`
|
||||
|
||||
"""
|
||||
for last_timestamp, events in self.recently_active:
|
||||
yield Session(events[-1].user, unpack_events(events))
|
||||
|
||||
def _clear_expired(self, timestamp):
|
||||
|
||||
# Cull old sessions
|
||||
while (len(self.recently_active) > 0 and
|
||||
timestamp - self.recently_active.peek()[0] >= self.cutoff):
|
||||
|
||||
_, events = self.recently_active.pop()
|
||||
|
||||
if timestamp - events[-1].timestamp >= self.cutoff:
|
||||
del self.active_users[events[-1].user]
|
||||
yield events[-1].user, events
|
||||
else:
|
||||
self.recently_active.push((events[-1].timestamp, events))
|
||||
|
||||
def __repr__(self):
|
||||
return "{0}({1})".format(self.__class__.__name__, repr(self.cutoff))
|
||||
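The expiry rule used by `Cache` can be sketched without the heap. The version below is a simplified, self-contained illustration: it scans every active user on each event instead of keeping a priority queue, and it uses plain integer timestamps rather than `mw.Timestamp`. The name `cluster_sessions` is hypothetical.

```python
def cluster_sessions(user_events, cutoff=3600):
    """Yield (user, [data, ...]) sessions from time-ordered events.

    Simplified sketch: a linear scan replaces the heap used by Cache.
    """
    active = {}  # user -> list of (timestamp, data) in arrival order

    for user, timestamp, data in user_events:
        # A session expires once its last event is at least `cutoff`
        # seconds older than the event being processed.
        expired = [u for u, events in active.items()
                   if timestamp - events[-1][0] >= cutoff]
        for u in expired:
            yield u, [d for _, d in active.pop(u)]

        active.setdefault(user, []).append((timestamp, data))

    # Flush the sessions that were still active when the input ended.
    for u, events in active.items():
        yield u, [d for _, d in events]
```

The scan costs time proportional to the number of active users per event, which is the cost the heap in `Cache` avoids. When several sessions expire at once, this sketch may also emit them in a different order than the heap-based version.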
@@ -0,0 +1,6 @@
|
||||
CUTOFF = 60 * 60
|
||||
"""
|
||||
TODO: Better documentation here.
|
||||
For the time being, see
|
||||
`<https://meta.wikimedia.org/wiki/Research:Edit_session>`_
|
||||
"""
|
||||
@@ -0,0 +1,19 @@
|
||||
import logging
|
||||
from collections import namedtuple
|
||||
|
||||
logger = logging.getLogger("mw.lib.sessions.event")
|
||||
|
||||
|
||||
# class Event:
|
||||
# __slots__ = ('user', 'timestamp', 'data')
|
||||
#
|
||||
# def __init__(self, user, timestamp, data=None):
|
||||
# self.user = user
|
||||
# self.timestamp = Timestamp(timestamp)
|
||||
# self.data = data
|
||||
|
||||
Event = namedtuple("Event", ['user', 'timestamp', 'data'])
|
||||
|
||||
|
||||
def unpack_events(events):
|
||||
return list(e.data for e in events)
|
||||
@@ -0,0 +1,68 @@
|
||||
import logging
|
||||
|
||||
from .cache import Cache
|
||||
from . import defaults
|
||||
|
||||
logger = logging.getLogger("mw.lib.sessions.functions")
|
||||
|
||||
|
||||
def cluster(user_events, cutoff=defaults.CUTOFF):
|
||||
"""
|
||||
Clusters user sessions from a sequence of user events. Note that `event`
|
||||
data is returned unchanged inside the sessions that are generated.
|
||||
|
||||
This function serves as a convenience wrapper around calls to
|
||||
:class:`~mw.lib.sessions.Cache`'s :meth:`~mw.lib.sessions.Cache.process`
|
||||
method.
|
||||
|
||||
:Parameters:
|
||||
user_events : iter( (user, timestamp, event) )
|
||||
an iterable over tuples of user, timestamp and event data.
|
||||
|
||||
* user : `hashable`
|
||||
* timestamp : :class:`mw.Timestamp`
|
||||
* event : `mixed`
|
||||
|
||||
cutoff : int
|
||||
the maximum time between events within a user session
|
||||
|
||||
:Returns:
|
||||
an iterator over :class:`~mw.lib.sessions.Session`
|
||||
|
||||
:Example:
|
||||
>>> from mw.lib import sessions
|
||||
>>>
|
||||
>>> user_events = [
|
||||
... ("Willy on wheels", 100000, {'rev_id': 1}),
|
||||
... ("Walter", 100001, {'rev_id': 2}),
|
||||
... ("Willy on wheels", 100001, {'rev_id': 3}),
|
||||
... ("Walter", 100035, {'rev_id': 4}),
|
||||
... ("Willy on wheels", 103602, {'rev_id': 5})
|
||||
... ]
|
||||
>>>
|
||||
>>> for user, events in sessions.cluster(user_events):
|
||||
... (user, events)
|
||||
...
|
||||
('Willy on wheels', [{'rev_id': 1}, {'rev_id': 3}])
|
||||
('Walter', [{'rev_id': 2}, {'rev_id': 4}])
|
||||
('Willy on wheels', [{'rev_id': 5}])
|
||||
|
||||
|
||||
"""
|
||||
|
||||
# Construct the session manager
|
||||
cache = Cache(cutoff)
|
||||
|
||||
# Apply the events
|
||||
for user, timestamp, event in user_events:
|
||||
|
||||
for session in cache.process(user, timestamp, event):
|
||||
yield session
|
||||
|
||||
# Yield the left-overs
|
||||
for session in cache.get_active_sessions():
|
||||
yield session
|
||||
|
||||
|
||||
# For backwards compatibility
|
||||
sessions = cluster
|
||||
@@ -0,0 +1,22 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from ..cache import Cache
|
||||
|
||||
|
||||
def test_session_manager():
|
||||
cache = Cache(cutoff=2)
|
||||
|
||||
user_sessions = list(cache.process("foo", 1))
|
||||
eq_(user_sessions, [])
|
||||
|
||||
user_sessions = list(cache.process("bar", 2))
|
||||
eq_(user_sessions, [])
|
||||
|
||||
user_sessions = list(cache.process("foo", 2))
|
||||
eq_(user_sessions, [])
|
||||
|
||||
user_sessions = list(cache.process("bar", 10))
|
||||
eq_(len(user_sessions), 2)
|
||||
|
||||
user_sessions = list(cache.get_active_sessions())
|
||||
eq_(len(user_sessions), 1)
|
||||
@@ -0,0 +1,50 @@
|
||||
from itertools import chain
|
||||
|
||||
from nose.tools import eq_
|
||||
from .. import defaults
|
||||
from ..functions import sessions
|
||||
|
||||
|
||||
EVENTS = {
|
||||
"foo": [
|
||||
[
|
||||
("foo", 1234567890, 1),
|
||||
("foo", 1234567892, 2),
|
||||
("foo", 1234567894, 3)
|
||||
],
|
||||
[
|
||||
("foo", 1234567894 + defaults.CUTOFF, 4),
|
||||
("foo", 1234567897 + defaults.CUTOFF, 5)
|
||||
]
|
||||
],
|
||||
"bar": [
|
||||
[
|
||||
("bar", 1234567891, 6),
|
||||
("bar", 1234567892, 7),
|
||||
("bar", 1234567893, 8)
|
||||
],
|
||||
[
|
||||
("bar", 1234567895 + defaults.CUTOFF, 9),
|
||||
("bar", 1234567898 + defaults.CUTOFF, 0)
|
||||
]
|
||||
]
|
||||
}
|
||||
|
||||
|
||||
def test_group_events():
|
||||
events = []
|
||||
events.extend(chain(*EVENTS['foo']))
|
||||
events.extend(chain(*EVENTS['bar']))
|
||||
|
||||
events.sort()
|
||||
|
||||
user_sessions = sessions(events)
|
||||
|
||||
counts = {
|
||||
'foo': 0,
|
||||
'bar': 0
|
||||
}
|
||||
|
||||
for user, session in user_sessions:
|
||||
eq_(list(e[2] for e in EVENTS[user][counts[user]]), list(session))
|
||||
counts[user] += 1
|
||||
@@ -0,0 +1,2 @@
|
||||
from .functions import normalize
|
||||
from .parser import Parser
|
||||
@@ -0,0 +1,25 @@
|
||||
def normalize(title):
|
||||
"""
|
||||
Normalizes a page title to the database format. E.g. spaces are converted
|
||||
to underscores and the first character in the title is converted to
|
||||
upper-case.
|
||||
|
||||
:Parameters:
|
||||
title : str
|
||||
A page title
|
||||
:Returns:
|
||||
The normalized title.
|
||||
:Example:
|
||||
>>> from mw.lib import title
|
||||
>>>
|
||||
>>> title.normalize("foo bar")
|
||||
'Foo_bar'
|
||||
|
||||
"""
|
||||
if title is None:
|
||||
return title
|
||||
else:
|
||||
if len(title) > 0:
|
||||
return (title[0].upper() + title[1:]).replace(" ", "_")
|
||||
else:
|
||||
return ""
|
||||
171
mediawiki_dump_tools/Mediawiki-Utilities/mw/lib/title/parser.py
Normal file
@@ -0,0 +1,171 @@
|
||||
from ...types import Namespace
|
||||
from ...util import autovivifying, none_or
|
||||
from .functions import normalize
|
||||
|
||||
|
||||
class Parser:
|
||||
"""
|
||||
Constructs a page name parser from a set of :class:`mw.Namespace`. Such a
|
||||
parser can be used to convert a full page name (namespace included with a
|
||||
colon; e.g., ``"Talk:Foo"``) into a namespace ID and
|
||||
:func:`mw.lib.title.normalize`'d page title (e.g., ``(1, "Foo")``).
|
||||
|
||||
:Parameters:
|
||||
namespaces : set( :class:`mw.Namespace` )
|
||||
:Example:
|
||||
>>> from mw import Namespace
|
||||
>>> from mw.lib import title
|
||||
>>>
|
||||
>>> parser = title.Parser(
|
||||
... [
|
||||
... Namespace(0, "", case="first-letter"),
|
||||
... Namespace(1, "Discuss\u00e3o", canonical="Talk", case="first-letter"),
|
||||
... Namespace(2, "Usu\u00e1rio(a)", canonical="User", aliases={"U"}, case="first-letter")
|
||||
... ]
|
||||
... )
|
||||
>>>
|
||||
>>> parser.parse("Discuss\u00e3o:Foo") # Using the standard name
|
||||
(1, 'Foo')
|
||||
>>> parser.parse("Talk:Foo bar") # Using the canonical name
|
||||
(1, 'Foo_bar')
|
||||
>>> parser.parse("U:Foo bar") # Using an alias
|
||||
(2, 'Foo_bar')
|
||||
>>> parser.parse("Herpderp:Foo bar") # Pseudo namespace
|
||||
(0, 'Herpderp:Foo_bar')
|
||||
"""
|
||||
|
||||
def __init__(self, namespaces=None):
|
||||
namespaces = none_or(namespaces, set)
|
||||
|
||||
self.ids = {}
|
||||
self.names = {}
|
||||
|
||||
if namespaces is not None:
|
||||
for namespace in namespaces:
|
||||
self.add_namespace(namespace)
|
||||
|
||||
def parse(self, page_name):
|
||||
"""
|
||||
Parses a page name to extract the namespace.
|
||||
|
||||
:Parameters:
|
||||
page_name : str
|
||||
A page name including the namespace prefix and a colon (if not Main)
|
||||
|
||||
:Returns:
|
||||
A tuple of (namespace : `int`, title : `str`)
|
||||
"""
|
||||
parts = page_name.split(":", 1)
|
||||
if len(parts) == 1:
|
||||
ns_id = 0
|
||||
title = normalize(page_name)
|
||||
else:
|
||||
ns_name, title = parts
|
||||
ns_name, title = normalize(ns_name), normalize(title)
|
||||
|
||||
if self.contains_name(ns_name):
|
||||
ns_id = self.get_namespace(name=ns_name).id
|
||||
else:
|
||||
ns_id = 0
|
||||
title = normalize(page_name)
|
||||
|
||||
return ns_id, title
|
||||
|
||||
def add_namespace(self, namespace):
|
||||
"""
|
||||
Adds a namespace to the parser.
|
||||
|
||||
:Parameters:
|
||||
namespace : :class:`mw.Namespace`
|
||||
A namespace
|
||||
"""
|
||||
self.ids[namespace.id] = namespace
|
||||
self.names[namespace.name] = namespace
|
||||
|
||||
for alias in namespace.aliases:
|
||||
self.names[alias] = namespace
|
||||
|
||||
if namespace.canonical is not None:
|
||||
self.names[namespace.canonical] = namespace
|
||||
|
||||
def contains_name(self, name):
|
||||
return normalize(name) in self.names
|
||||
|
||||
def get_namespace(self, id=None, name=None):
|
||||
"""
|
||||
Gets a namespace from the parser. Throws a :class:`KeyError` if a
|
||||
namespace cannot be found.
|
||||
|
||||
:Parameters:
|
||||
id : int
|
||||
A namespace ID
|
||||
name : str
|
||||
A namespace name (standard, canonical names and aliases
|
||||
will be searched)
|
||||
:Returns:
|
||||
A :class:`mw.Namespace`.
|
||||
"""
|
||||
if id is not None:
|
||||
return self.ids[int(id)]
|
||||
else:
|
||||
return self.names[normalize(name)]
|
||||
|
||||
@classmethod
|
||||
def from_site_info(cls, si_doc):
|
||||
"""
|
||||
Constructs a parser from the result of a :meth:`mw.api.SiteInfo.query`.
|
||||
|
||||
:Parameters:
|
||||
si_doc : dict
|
||||
The result of a site_info request.
|
||||
|
||||
:Returns:
|
||||
An initialized :class:`mw.lib.title.Parser`
|
||||
"""
|
||||
aliases = autovivifying.Dict(vivifier=lambda k: [])
|
||||
# get aliases
|
||||
if 'namespacealiases' in si_doc:
|
||||
for alias_doc in si_doc['namespacealiases']:
|
||||
aliases[alias_doc['id']].append(alias_doc['*'])
|
||||
|
||||
namespaces = []
|
||||
for ns_doc in si_doc['namespaces'].values():
|
||||
namespaces.append(
|
||||
Namespace.from_doc(ns_doc, aliases)
|
||||
)
|
||||
|
||||
return cls(namespaces)
|
||||
|
||||
@classmethod
|
||||
def from_api(cls, session):
|
||||
"""
|
||||
Constructs a parser from a :class:`mw.api.Session`
|
||||
|
||||
:Parameters:
|
||||
session : :class:`mw.api.Session`
|
||||
An open API session
|
||||
|
||||
:Returns:
|
||||
An initialized :class:`mw.lib.title.Parser`
|
||||
"""
|
||||
si_doc = session.site_info.query(
|
||||
properties={'namespaces', 'namespacealiases'}
|
||||
)
|
||||
|
||||
return cls.from_site_info(si_doc)
|
||||
|
||||
@classmethod
|
||||
def from_dump(cls, dump):
|
||||
"""
|
||||
Constructs a parser from a :class:`mw.xml_dump.Iterator`. Note that
|
||||
XML database dumps do not include namespace aliases or canonical names
|
||||
so the parser that will be constructed will only work in common cases.
|
||||
|
||||
:Parameters:
|
||||
dump : :class:`mw.xml_dump.Iterator`
|
||||
An XML dump iterator
|
||||
|
||||
:Returns:
|
||||
An initialized :class:`mw.lib.title.Parser`
|
||||
"""
|
||||
return cls(dump.namespaces)
|
||||
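The lookup that `Parser.parse` performs can be illustrated with a plain dictionary in place of `mw.Namespace` objects. This is a self-contained sketch under that assumption: the `names` mapping and the free function `parse` are hypothetical stand-ins, and `normalize` is re-stated here only so the example runs on its own.

```python
def normalize(title):
    """Capitalize the first character and replace spaces with underscores."""
    if not title:
        return title  # None and "" pass through unchanged
    return (title[0].upper() + title[1:]).replace(" ", "_")


def parse(page_name, names):
    """Split a full page name into (namespace ID, normalized title).

    `names` maps every known namespace name, canonical name and alias
    to its namespace ID (a stand-in for Parser.names).
    """
    prefix, sep, rest = page_name.partition(":")
    if sep and normalize(prefix) in names:
        return names[normalize(prefix)], normalize(rest)
    # No colon, or an unknown prefix: treat the whole name as a Main title.
    return 0, normalize(page_name)


names = {"": 0, "Discuss\u00e3o": 1, "Talk": 1, "User": 2, "U": 2}
print(parse("Talk:Foo bar", names))
print(parse("Herpderp:Foo bar", names))
```

An unknown prefix such as `Herpderp` is kept as part of the title and the page falls back to namespace 0, which mirrors the pseudo-namespace case in the class docstring.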
@@ -0,0 +1,10 @@
|
||||
from nose.tools import eq_
|
||||
|
||||
from ..functions import normalize
|
||||
|
||||
|
||||
def test_normalize():
|
||||
eq_("Foobar", normalize("Foobar")) # Same
|
||||
eq_("Foobar", normalize("foobar")) # Capitalize
|
||||
eq_("FooBar", normalize("fooBar")) # Late capital
|
||||
eq_("Foo_bar", normalize("Foo bar")) # Space
|
||||
Some files were not shown because too many files have changed in this diff