Commit Graph

24 Commits

Author SHA1 Message Date
Nate E TeBlunthuis
f784c77f60 add namespace filter parameter 2018-09-03 11:13:48 -07:00
317bafb50d Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence 2018-08-23 19:00:49 -07:00
7cd0bf3b9e Add parameter for selecting specific namespaces. 2018-08-23 18:49:32 -07:00
d93769c21f Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence 2018-08-23 18:27:09 -07:00
Nate E TeBlunthuis
afd40c1a45 Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence 2018-08-23 18:25:51 -07:00
Nate E TeBlunthuis
e4222c45dd add namespace filter parameter 2018-08-23 18:25:08 -07:00
Nate E TeBlunthuis
829ffcffae Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence 2018-08-23 18:23:36 -07:00
Nate E TeBlunthuis
776b73519a add namespace filter parameter 2018-08-23 18:23:23 -07:00
Nate E TeBlunthuis
5b6aaad862 add namespace filter parameter 2018-08-23 18:02:56 -07:00
f468d1a5b6 add support for persistence with segment matching 2018-08-20 16:08:16 -07:00
bf396ad366 Prefix page titles with namespace names. 2018-07-09 22:11:17 -07:00
dba793c6ac migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project. 2018-07-05 01:16:00 -07:00
d77b0a4965 migrate to mwpersistence. this fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy. 2018-07-04 19:06:07 -07:00
7db6288923 migrate reverts to python-mwreverts 2018-07-04 15:29:48 -07:00
a883cb536b add note to readme about dependency on compression software 2018-07-04 15:20:52 -07:00
e925ac9da1 add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq. 2018-07-04 15:08:30 -07:00
d2746879d0 create baseline tests for xml dump processing 2018-07-03 23:43:47 -07:00
Benjamin Mako Hill
ba886ecf4c a number of small updates and fixes
- fix regex for filename/filetype matches
- unload all files, not just ones that end with xml, in 7z archives
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version
2018-05-17 14:37:20 -07:00
3f9da40747 support 7z archives with multiple files. add urlencode parameter 2017-12-07 15:10:56 -08:00
Benjamin Mako Hill
5d7dceb9e4 fix code to work with bzip files 2017-02-06 18:25:17 -08:00
Benjamin Mako Hill
7d8ec932dd added list of compressed dump files to .gitignore 2015-07-23 12:16:31 -07:00
Benjamin Mako Hill
d934700ee9 added support to parse namespaces from title
This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.
2015-07-23 12:12:20 -07:00
Benjamin Mako Hill
108c8442b2 added README file to document the submodule 2015-07-22 19:55:08 -07:00
Benjamin Mako Hill
eeb0742cc6 created new repository for wikiq with Mediawiki-Utilities as a submodule 2015-07-22 19:44:52 -07:00