Copyright (C) 2018 Nathan TeBlunthuis. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the file entitled "fdl-1.3.md".
Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects"
Overview
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computational resources, you may not be able to run all stages of the analysis.
The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in `code.tar`.
If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want `intermediate_data.7z` or the uncompressed tab and csv files.
The data files are created in several stages. The first stage uses the program `wikiq` to create tsv files that contain edit data for each wiki. The second stage generates the `all.edits.RDS` file, which contains edit metadata derived from the MediaWiki XML dumps. This file is expensive to generate and, at 1.5GB, is fairly large. The third stage builds smaller intermediate files from `all.edits.RDS` that contain the analytical variables. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript.
A stage will only run if its outputs do not already exist: if the intermediate files exist, they will not be regenerated and only the final analysis will run. The exception is stage 4, fitting the models and generating the plots, which always runs.
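As a rough illustration (this is not the project's actual code; the function and file names are made up), the rule amounts to rebuilding an expensive output only when it is missing from the working directory:

```r
## Minimal sketch of the stage-skipping rule: rebuild only if the output is missing.
build.stage <- function(outfile, builder) {
  if (file.exists(outfile)) {
    message(outfile, " already exists; skipping this stage")
    return(readRDS(outfile))
  }
  result <- builder()
  saveRDS(result, outfile)
  result
}

## Example with a placeholder "expensive" stage cached as example.stage.RDS:
example.data <- build.stage("example.stage.RDS", function() data.frame(x = 1:3))
```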
If you only want to replicate from the second stage onward, you want `wikiq_tsvs.7z`. If you want to replicate everything, you want `wikia_mediawiki_xml_dumps.7z.001` and `wikia_mediawiki_xml_dumps.7z.002`.
These instructions work backwards: from building the manuscript using knitr, to loading the datasets and running the analysis, to building the intermediate datasets.
Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run `apt install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you can upload the necessary files to a project on Sharelatex.com or Overleaf.com.
- Download `code.tar`. This has everything you need to typeset the manuscript.
- Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
- Navigate to `code/paper_source`.
- Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
- On a unix system you should be able to run `make` to build the manuscript `generalizable_wiki.pdf` (a rough manual equivalent is sketched below). Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on ShareLatex.com.
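If `make` is not available, a rough equivalent is to knit the source and run latexmk yourself. The `.Rnw` filename below is an assumption based on the name of the output PDF; check `code/paper_source` and its Makefile for the actual source file and build steps.

```r
## Sketch of building the PDF without make (the .Rnw filename is an assumption).
library(knitr)
knit("generalizable_wiki.Rnw")                 # writes generalizable_wiki.tex
system("latexmk -pdf generalizable_wiki.tex")  # typesets generalizable_wiki.pdf
```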
Loading intermediate datasets
The intermediate datasets are found in the `intermediate_data.7z` archive. They can be extracted on a unix system using the command `7z x intermediate_data.7z`. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the `readRDS` function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.
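For instance, a quick way to load one dataset and inspect its structure is shown below; the `.tab` filename is only an assumption about how those files are named, so substitute whichever files are in the archive.

```r
## Load an intermediate dataset and inspect its columns.
newcomers <- readRDS("newcomers.RDS")
str(newcomers)

## The .tab files are plain tab-separated text, so they can also be read
## outside R. In R, read.delim works (the filename here is an assumption):
newcomers.tab <- read.delim("newcomers.tab", stringsAsFactors = FALSE)
```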
Running the analysis
Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in `lib-01-sample-datasets.R` useful to create stratified samples of data for fitting models. See line 89 of `02_model_newcomer_survival.R` for an example.
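The sketch below shows the general idea of such a stratified sample using data.table. It is not the code from `lib-01-sample-datasets.R`, and it assumes the dataset has a `wiki.name` column identifying the wiki each observation belongs to.

```r
## Illustrative stratified sample: keep a fixed fraction of rows within each wiki.
## Not the project's own code; the wiki.name column is an assumption.
library(data.table)

sample.within.wiki <- function(dt, frac = 0.1, seed = 2018) {
  set.seed(seed)
  dt[, .SD[sample(.N, max(1, ceiling(.N * frac)))], by = wiki.name]
}

## usage: newcomers.sample <- sample.within.wiki(as.data.table(newcomers), frac = 0.1)
```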
- Download `code.tar` and `intermediate_data.7z` to your working folder and extract both archives. On a unix system this can be done with the command `tar xf code.tar && 7z x intermediate_data.7z`.
- Install R dependencies: `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
- On a unix system you can simply run `regen.all.sh` to fit the models, build the plots, and create the RDS files.
Generating datasets
Building the intermediate files
The intermediate files are generated from `all.edits.RDS`. This process requires about 20GB of memory.
- Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`, and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a unix system this can be done using `tar xf code.tar && 7z x userroles_data.7z`.
- Install R dependencies. In R run `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
- Run `01_build_datasets.R`.
Building all.edits.RDS
The intermediate RDS files used in the analysis are created from `all.edits.RDS`. To replicate building `all.edits.RDS`, run `01_build_datasets.R` when neither the intermediate RDS files nor `all.edits.RDS` exist in the working directory. `all.edits.RDS` is generated from the tsv files produced by wikiq. This may take several hours. By default, building the dataset will use all available CPU cores. If you want to change this, modify line 26 of `lib-01-build_newcomer_table.R`.
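As an illustration of what such a change amounts to, the sketch below caps the worker count with `parallel::mclapply` instead of using every core. The per-file work shown is a placeholder, not the script's actual processing, and it assumes the wikiq output files end in `.tsv`.

```r
## Illustrative only: process the wikiq tsv files with a fixed number of cores
## rather than parallel::detectCores(). The per-file work here is a placeholder.
library(parallel)

n.cores <- 4
tsv.files <- list.files(pattern = "\\.tsv$")
rows.per.wiki <- mclapply(tsv.files,
                          function(f) nrow(read.delim(f)),
                          mc.cores = n.cores)
```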
- Download `selected.wikis.csv`, `userroles_data.7z`, `wikiq_tsvs.7z`, and `code.tar`. Unpack the files. On a unix system this can be done by running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf code.tar`.
- Run `01_build_datasets.R` to generate `all.edits.RDS` and the intermediate files.
Running Wikiq to generate tsv files
If you want to regenerate the datasets all the way from the xml dumps and data from the Wikia api, you will have to run the python script `wikiq`. This is a fairly computationally intensive process. It may take over a day unless you can run the computations in parallel.
- Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`, `wikia_mediawiki_xml_dumps.7z.002`, and `userroles_data.7z`. Extract the archives. On a Unix system this can be done by running `tar xf code.tar && 7z x wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
- Have python3 and python3-pip installed. Using pip3, install argparse: `pip3 install argparse`.
- Edit `runwikiq.sh` to set N_THREADS.
- Run `runwikiq.sh` to generate the tsv files.
Obtaining Bot and Admin data from the Wikia API
For the purposes of supporting an audit of our research project, this repository includes the code that we used to obtain Bot and Admin data from the Wikia API. Unfortunately, the API has changed since we ran the script, and this code no longer works.
Our research group maintains a tool for scraping the Wikia API available at https://code.communitydata.cc/wikia_userroles_scraper. This can be used to download user roles for the wikis in this dataset. Follow the instructions found in that package.