Copyright (C) 2018 Nathan TeBlunthuis. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the file entitled "fdl-1.3.md".
Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects"
Overview
This archive contains code and data for reproducing the analysis for "Replication Data for Revisiting 'The Rise and Decline' in a Population of Peer Production Projects". Depending on what you hope to do with the data, you probably do not want to download all of the files. Depending on your computational resources, you may not be able to run all stages of the analysis.
The code for all stages of the analysis, including typesetting the manuscript and running the analysis, is in `code.tar`.
If you only want to run the final analysis or to play with datasets used in the analysis of the paper, you want `intermediate_data.7z` or the uncompressed tab and csv files.
The data files are created in several stages. The first stage uses the program `wikiq` to create tsv files that contain edit data for each wiki. The second stage generates the `all.edits.RDS` file, which contains edit metadata derived from the MediaWiki XML dumps. This file is expensive to generate and, at 1.5GB, is fairly large. The third stage builds smaller intermediate files from `all.edits.RDS` that contain the analytical variables. The fourth stage uses the intermediate files to generate smaller RDS files that contain the results. Finally, knitr and latex typeset the manuscript.
A stage will only run if its outputs do not already exist: if the intermediate files exist, they will not be regenerated and only the final analysis will run. The exception is stage 4, fitting the models and generating the plots, which always runs.
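As a rough illustration (this is not the project's actual code; the function and file names are made up), the rule amounts to rebuilding an expensive output only when it is missing from the working directory:

```r
## Minimal sketch of the stage-skipping rule: rebuild only if the output is missing.
build.stage <- function(outfile, builder) {
  if (file.exists(outfile)) {
    message(outfile, " already exists; skipping this stage")
    return(readRDS(outfile))
  }
  result <- builder()
  saveRDS(result, outfile)
  result
}

## Example with a placeholder "expensive" stage cached as example.stage.RDS:
example.data <- build.stage("example.stage.RDS", function() data.frame(x = 1:3))
```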
If you only want to replicate from the second stage onward, you want `wikiq_tsvs.7z`. If you want to replicate everything, you want `wikia_mediawiki_xml_dumps.7z.001` and `wikia_mediawiki_xml_dumps.7z.002`.
These instructions work backwards: from building the manuscript using knitr, to loading the datasets and running the analysis, to building the intermediate datasets.
Building the manuscript using knitr
This requires working latex, latexmk, and knitr installations. Depending on your operating system you might install these packages in different ways. On Debian Linux you can run `apt install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you can upload the necessary files to a project on Sharelatex.com or Overleaf.com.
- Download `code.tar`. This has everything you need to typeset the manuscript.
- Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
- Navigate to `code/paper_source`.
- Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
- On a unix system you should be able to run `make` to build the manuscript `generalizable_wiki.pdf` (a rough manual equivalent is sketched below). Otherwise you should try uploading all of the files (including the tables, figure, and knitr folders) to a new project on ShareLatex.com.
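If `make` is not available, a rough equivalent is to knit the source and run latexmk yourself. The `.Rnw` filename below is an assumption based on the name of the output PDF; check `code/paper_source` and its Makefile for the actual source file and build steps.

```r
## Sketch of building the PDF without make (the .Rnw filename is an assumption).
library(knitr)
knit("generalizable_wiki.Rnw")                 # writes generalizable_wiki.tex
system("latexmk -pdf generalizable_wiki.tex")  # typesets generalizable_wiki.pdf
```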
Loading intermediate datasets
The intermediate datasets are found in the `intermediate_data.7z` archive. They can be extracted on a unix system using the command `7z x intermediate_data.7z`. The files are 95MB uncompressed. These are RDS (R data set) files and can be loaded in R using the `readRDS` function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If you wish to work with these datasets using a tool other than R, you might prefer to work with the .tab files.
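For instance, a quick way to load one dataset and inspect its structure is shown below; the `.tab` filename is only an assumption about how those files are named, so substitute whichever files are in the archive.

```r
## Load an intermediate dataset and inspect its columns.
newcomers <- readRDS("newcomers.RDS")
str(newcomers)

## The .tab files are plain tab-separated text, so they can also be read
## outside R. In R, read.delim works (the filename here is an assumption):
newcomers.tab <- read.delim("newcomers.tab", stringsAsFactors = FALSE)
```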
Running the analysis
Fitting the models may not work on machines with less than 32GB of RAM. If you have trouble, you may find the functions in `lib-01-sample-datasets.R` useful to create stratified samples of data for fitting models. See line 89 of `02_model_newcomer_survival.R` for an example.
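The sketch below shows the general idea of such a stratified sample using data.table. It is not the code from `lib-01-sample-datasets.R`, and it assumes the dataset has a `wiki.name` column identifying the wiki each observation belongs to.

```r
## Illustrative stratified sample: keep a fixed fraction of rows within each wiki.
## Not the project's own code; the wiki.name column is an assumption.
library(data.table)

sample.within.wiki <- function(dt, frac = 0.1, seed = 2018) {
  set.seed(seed)
  dt[, .SD[sample(.N, max(1, ceiling(.N * frac)))], by = wiki.name]
}

## usage: newcomers.sample <- sample.within.wiki(as.data.table(newcomers), frac = 0.1)
```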
- Download `code.tar` and `intermediate_data.7z` to your working folder and extract both archives. On a unix system this can be done with the command `tar xf code.tar && 7z x intermediate_data.7z`.
- Install R dependencies: `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
- On a unix system you can simply run `regen.all.sh` to fit the models, build the plots, and create the RDS files.
Generating datasets
Building the intermediate files
The intermediate files are generated from `all.edits.RDS`. This process requires about 20GB of memory.
- Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`, and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a unix system this can be done using `tar xf code.tar && 7z x userroles_data.7z`.
- Install R dependencies. In R run `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
- Run `01_build_datasets.R`.
Building all.edits.RDS
The intermediate RDS files used in the analysis are created from `all.edits.RDS`. To replicate building `all.edits.RDS`, run `01_build_datasets.R` when neither the intermediate RDS files nor `all.edits.RDS` exist in the working directory. `all.edits.RDS` is generated from the tsv files produced by wikiq. This may take several hours. By default, building the dataset will use all available CPU cores. If you want to change this, modify line 26 of `lib-01-build_newcomer_table.R`.
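As an illustration of what such a change amounts to, the sketch below caps the worker count with `parallel::mclapply` instead of using every core. The per-file work shown is a placeholder, not the script's actual processing, and it assumes the wikiq output files end in `.tsv`.

```r
## Illustrative only: process the wikiq tsv files with a fixed number of cores
## rather than parallel::detectCores(). The per-file work here is a placeholder.
library(parallel)

n.cores <- 4
tsv.files <- list.files(pattern = "\\.tsv$")
rows.per.wiki <- mclapply(tsv.files,
                          function(f) nrow(read.delim(f)),
                          mc.cores = n.cores)
```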
- Download `selected.wikis.csv`, `userroles_data.7z`, `wikiq_tsvs.7z`, and `code.tar`. Unpack the files. On a unix system this can be done by running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf code.tar`.
- Run `01_build_datasets.R` to generate `all.edits.RDS` and the intermediate files.
Running Wikiq to generate tsv files
If you want to regenerate the datasets all the way from the xml dumps and data from the Wikia api, you will have to run the python script `wikiq`. This is a fairly computationally intensive process. It may take over a day unless you can run the computations in parallel.
- Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`, `wikia_mediawiki_xml_dumps.7z.002`, and `userroles_data.7z`. Extract the archives. On a Unix system this can be done by running `tar xf code.tar && 7z x wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
- Have python3 and python3-pip installed. Using pip3, install argparse: `pip3 install argparse`.
- Edit `runwikiq.sh` to set N_THREADS.
- Run `runwikiq.sh` to generate the tsv files.
Obtaining Bot and Admin data from the Wikia API
For the purposes of supporting an audit of our research project, this repository includes the code that we used to obtain Bot and Admin data from the Wikia API. Unfortunately, the API has changed since we ran the script, and this code no longer works.
Our research group maintains a tool for scraping the Wikia API available at https://code.communitydata.cc/wikia_userroles_scraper. This can be used to download user roles for the wikis in this dataset. Follow the instructions found in that package.