daf1851cbb
Use dask to parallelize and scale user level datasets
2018-08-14 14:37:03 -07:00
418fa020e5
Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq
2018-08-12 21:34:12 -07:00
311810a36c
refactor wikiq to separate script from classes and functions. Code reuse in testing.
2018-08-12 21:33:19 -07:00
118b8b1722
move tests to test folder
2018-08-12 18:10:46 -07:00
f69e8b44a6
move tests to test folder
2018-08-12 18:05:59 -07:00
bf396ad366
Prefix page titles with namespace names.
2018-07-09 22:11:17 -07:00
dba793c6ac
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.
2018-07-05 01:16:00 -07:00
d77b0a4965
migrate to mwpersistence. this fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy flag.
2018-07-04 19:06:07 -07:00
7db6288923
migrate reverts to python-mwreverts
2018-07-04 15:29:48 -07:00
a883cb536b
add note to readme about dependency on compression software
2018-07-04 15:20:52 -07:00
e925ac9da1
add tests for wikipedia, malformed xml, and bzip2; correct bz2 bug in wikiq.
2018-07-04 15:08:30 -07:00
d2746879d0
create baseline tests for xml dump processing
2018-07-03 23:43:47 -07:00
Benjamin Mako Hill
ba886ecf4c
a number of small updates and fixes
- fix regex for filename/filetype matches
- unload all files in 7z archives, not just ones that end with xml
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version
2018-05-17 14:37:20 -07:00
3f9da40747
support 7z archives with multiple files. add urlencode parameter
2017-12-07 15:10:56 -08:00
Benjamin Mako Hill
5d7dceb9e4
fix code to work with bzip files
2017-02-06 18:25:17 -08:00
Benjamin Mako Hill
7d8ec932dd
added list of compressed dump files to .gitignore
2015-07-23 12:16:31 -07:00
Benjamin Mako Hill
d934700ee9
added support to parse namespaces from title
This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.
2015-07-23 12:12:20 -07:00
Benjamin Mako Hill
108c8442b2
added README file to document the submodule
2015-07-22 19:55:08 -07:00
Benjamin Mako Hill
eeb0742cc6
created new repository for wikiq with Mediawiki-Utilities as a submodule
2015-07-22 19:44:52 -07:00