39b4e5698f
Use dask to parallelize and scale user level datasets
2018-08-14 14:44:21 -07:00
418fa020e5
Merge branch 'user_level_wikiq' of code.communitydata.cc:mediawiki_dump_tools into user_level_wikiq
2018-08-12 21:34:12 -07:00
311810a36c
refactor wikiq to separate script from classes and functions. Code reuse in testing.
2018-08-12 21:33:19 -07:00
118b8b1722
move tests to test folder
2018-08-12 18:10:46 -07:00
f69e8b44a6
move tests to test folder
2018-08-12 18:05:59 -07:00
bf396ad366
Prefix page titles with namespace names.
2018-07-09 22:11:17 -07:00
dba793c6ac
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.
2018-07-05 01:16:00 -07:00
d77b0a4965
migrate to mwpersistence. this fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy option.
2018-07-04 19:06:07 -07:00
7db6288923
migrate reverts to python-mwreverts
2018-07-04 15:29:48 -07:00
a883cb536b
add note to readme about dependency on compression software
2018-07-04 15:20:52 -07:00
e925ac9da1
add tests for wikipedia, malformed xml, and bzip2; correct bz2 bug in wikiq.
2018-07-04 15:08:30 -07:00
d2746879d0
create baseline tests for xml dump processing
2018-07-03 23:43:47 -07:00
ba886ecf4c
a number of small updates and fixes
- fix regex for filename/filetype matches
- unload all files in 7z archives, not just ones that end with xml
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version
2018-05-17 14:37:20 -07:00
3f9da40747
support 7z archives with multiple files. add urlencode parameter
2017-12-07 15:10:56 -08:00
5d7dceb9e4
fix code to work with bzip files
2017-02-06 18:25:17 -08:00
7d8ec932dd
added list of compressed dump files to .gitignore
2015-07-23 12:16:31 -07:00
d934700ee9
added support to parse namespaces from title
This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.
2015-07-23 12:12:20 -07:00
108c8442b2
added README file to document the submodule
2015-07-22 19:55:08 -07:00
eeb0742cc6
created new repository for wikiq with Mediawiki-Utilities as a submodule
2015-07-22 19:44:52 -07:00