Copyright (C) 2018 Nathan TeBlunthuis.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the file entitled "fdl-1.3.md".
# Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects" #
## Overview ##
This archive contains code and data for reproducing the analysis for
"Replication Data for Revisiting 'The Rise and Decline' in a
Population of Peer Production Projects". Depending on what you hope to
do with the data, you probably do not want to download all of the
files. Depending on your computational resources, you may not be able to
run all stages of the analysis.
The code for all stages of the analysis, including typesetting the
manuscript and running the analysis, is in code.tar.
If you only want to run the final analysis or to play with datasets
used in the analysis of the paper, you want intermediate_data.7z or
the uncompressed tab and csv files.
The data files are created in a four stage process. The first stage
uses the program "wikiq" to create tsv files that have edit data for
each wiki. The second stage generates the all.edits.RDS file, which
contains edit metadata extracted from the mediawiki xml dumps. This
file is expensive to generate and, at 1.5GB, is fairly large. The
third stage builds smaller intermediate files that contain the
analytical variables derived from all.edits.RDS. The fourth stage uses
the intermediate files to generate smaller RDS files that contain the
results. Finally, knitr and latex typeset the manuscript.
A stage will only run if its outputs do not already exist. So if the
intermediate files exist, they will not be regenerated and only the
final analysis will run. The exception is stage 4, fitting models and
generating plots, which always runs.
If you only want to replicate from the second stage onward, you want
wikiq_tsvs.7z. If you want to replicate everything, you want
wikia_mediawiki_xml_dumps.7z.001 and wikia_mediawiki_xml_dumps.7z.002.
These instructions work backwards: from building the manuscript using
knitr, to loading the datasets, to running the analysis, and finally
to building the intermediate datasets.
## Building the manuscript using knitr ##
This requires working latex, latexmk, and knitr
installations. Depending on your operating system you might install
these packages in different ways. On Debian Linux you can run `apt
install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you
can upload the necessary files to a project on Sharelatex.com or
Overleaf.com.
1. Download `code.tar`. This has everything you need to typeset the manuscript.
2. Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
3. Navigate to code/paper_source.
4. Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`
5. On a unix system you should be able to run `make` to build the
manuscript `generalizable_wiki.pdf`. Otherwise you should try
uploading all of the files (including the tables, figure, and knitr
folders) to a new project on ShareLatex.com.
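
For convenience, the steps above can be collected into a single shell
session. This is only a minimal sketch assuming a Debian-like unix
system; the paths come from the instructions above, and a CRAN mirror
is passed explicitly so the non-interactive package install does not
prompt for one.

```bash
# Sketch of steps 1-5 above (Debian-like system assumed).
sudo apt install r-cran-knitr latexmk texlive-latex-extra   # LaTeX, latexmk, and knitr
tar xf code.tar                                             # unpack the code archive
cd code/paper_source
# mirror given explicitly because Rscript runs non-interactively
Rscript -e 'install.packages(c("data.table","scales","ggplot2","lubridate","texreg"), repos="https://cloud.r-project.org")'
make   # typesets generalizable_wiki.pdf
```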
## Loading intermediate datasets ##
The intermediate datasets are found in the `intermediate_data.7z`
archive. They can be extracted on a unix system using the command `7z
x intermediate_data.7z`. The files are 95MB uncompressed. These are
RDS (R data set) files and can be loaded in R using the `readRDS`
function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If you wish to
work with these datasets using a tool other than R, you might prefer
to work with the .tab files.
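
As a quick check that the extraction worked, something like the
following prints the structure of one dataset from the command
line. `newcomers.RDS` is the file mentioned above; the `.tab` filename
in the last line is only illustrative, so substitute whichever of the
extracted `.tab` files you want to inspect.

```bash
# Extract the intermediate data and peek at one dataset.
7z x intermediate_data.7z
Rscript -e 'newcomers <- readRDS("newcomers.RDS"); str(newcomers)'   # column summary
head -n 5 newcomers.tab   # illustrative name; the .tab files are plain text
```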
## Running the analysis ##
Fitting the models may not work on machines with less than 32GB of
RAM. If you have trouble, you may find the functions in
lib-01-sample-datasets.R useful to create stratified samples of data
for fitting models. See line 89 of 02_model_newcomer_survival.R for an
example.
1. Download `code.tar` and `intermediate_data.7z` to your working
folder and extract both archives. On a unix system this can be done
with the command `tar xf code.tar && 7z x intermediate_data.7z`.
2. Install R
dependencies. `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. On a unix system you can simply run `regen.all.sh` to fit the
models, build the plots and create the RDS files.
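
Put together, a run on a unix machine with at least 32GB of RAM might
look like the sketch below. It only repeats the commands above; the
exact location of `regen.all.sh` inside the unpacked `code.tar` may
differ, so adjust the path if needed.

```bash
# Sketch of steps 1-3 above.
tar xf code.tar && 7z x intermediate_data.7z
# mirror given explicitly because Rscript runs non-interactively
Rscript -e 'install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"), repos="https://cloud.r-project.org")'
bash regen.all.sh   # fits the models, builds the plots, and writes the result RDS files
```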
## Generating datasets ##
### Building the intermediate files ###
The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory.
1. Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`,
and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a
unix system this can be done using `tar xf code.tar && 7z x
userroles_data.7z`.
2. Install R dependencies. In R run
`install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. Run `01_build_datasets.R`.
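
A sketch of those three steps on a unix system follows. It assumes
`all.edits.RDS`, `selected.wikis.csv`, and the unpacked archives all
sit in the directory from which `01_build_datasets.R` is run, and that
roughly 20GB of memory is available.

```bash
# Sketch of steps 1-3 above (about 20GB of memory needed).
tar xf code.tar && 7z x userroles_data.7z
# mirror given explicitly because Rscript runs non-interactively
Rscript -e 'install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"), repos="https://cloud.r-project.org")'
Rscript 01_build_datasets.R   # builds the intermediate RDS files from all.edits.RDS
```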
### Building all.edits.RDS ###
The intermediate RDS files used in the analysis are created from
`all.edits.RDS`. To replicate building `all.edits.RDS`, run
`01_build_datasets.R` in a working directory where neither the
intermediate RDS files nor `all.edits.RDS` already exist.
`all.edits.RDS` is generated from the tsv files produced by
wikiq. This may take several hours. By default, building the dataset
will use all available CPU cores. If you want to change this, modify
line 26 of `lib-01-build_newcomer_table.R`.
1. Download selected.wikis.csv, userroles_data.7z, wikiq_tsvs.7z, and
code.tar. Unpack the files. On a unix system this can be done by
running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf
code.tar`.
2. Run `01_build_datasets.R` to generate all.edits.RDS and the intermediate files.
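
The same script drives this stage; the only differences from the
previous section are the inputs you download and that `all.edits.RDS`
must be absent so that it gets rebuilt. A minimal sketch, assuming a
unix system:

```bash
# Sketch of steps 1-2 above; expect several hours of runtime.
7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf code.tar
# To use fewer CPU cores, first edit line 26 of lib-01-build_newcomer_table.R (see note above).
Rscript 01_build_datasets.R   # rebuilds all.edits.RDS from the wikiq tsvs, then the intermediate files
```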
### Running Wikiq to generate tsv files ###
If you want to regenerate the datasets all the way from the xml dumps
and data from the Wikia API, you will have to run the python script
`wikiq`. This is a fairly computationally intensive process. It may take
over a day unless you can run the computations in parallel.
1. Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`,
`wikia_mediawiki_xml_dumps.7z.002`, and
`userroles_data.7z`. Extract the archives. On a Unix system this
can be done by running `tar xf code.tar && 7z x
wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
2. Install python3 and python3-pip. Then use pip3 to install `argparse`: `pip3 install argparse`.
3. Edit `runwikiq.sh` to set N_THREADS.
4. Run `runwikiq.sh` to generate the tsv files.
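
A sketch of the whole wikiq stage on a unix system follows. How
`N_THREADS` is set inside `runwikiq.sh` is not shown here, so open the
script and edit it before the final step.

```bash
# Sketch of steps 1-4 above.
tar xf code.tar
7z x wikia_mediawiki_xml_dumps.7z.001   # 7z picks up the .002 volume automatically
7z x userroles_data.7z
pip3 install argparse
# Edit N_THREADS in runwikiq.sh to match your machine, then:
bash runwikiq.sh                        # writes a tsv of edit data for each wiki
```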
### Obtaining Bot and Admin data from the Wikia API ###
For the purposes of supporting an audit of our research project, this
repository includes the code that we used to obtain Bot and Admin data
from the Wikia API. Unfortunately, the API has changed since we ran
the script, and this code no longer works.
Our research group maintains a tool for scraping the Wikia API
available at https://code.communitydata.cc/wikia_userroles_scraper. This can
be used to download user roles for the wikis in this dataset. Follow
the instructions found in that package.