Copyright (C) 2018 Nathan TeBlunthuis.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the file entitled "fdl-1.3.md".

# Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects" #

## Overview ##

This archive contains code and data for reproducing the analysis for
"Replication Data for Revisiting 'The Rise and Decline' in a
Population of Peer Production Projects". Depending on what you hope to
do with the data, you probably do not want to download all of the
files. Depending on your computational resources, you may not be able
to run all stages of the analysis.

The code for all stages of the analysis, including typesetting the
manuscript and running the analysis, is in code.tar.

If you only want to run the final analysis or to play with datasets
used in the analysis of the paper, you want intermediate_data.7z or
the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage
uses the program "wikiq" to create tsv files that have edit data for
each wiki. The second stage generates the all.edits.RDS file, which
contains edit metadata from the mediawiki xml dumps. This file is
expensive to generate and, at 1.5GB, is pretty big. The third stage
builds smaller intermediate files that contain the analytical
variables from these tsv files. The fourth stage uses the intermediate
files to generate smaller RDS files that contain the results. Finally,
knitr and latex typeset the manuscript.

A stage will only run if the outputs from the previous stages do not
exist, so if the intermediate files exist they will not be regenerated
and only the final analysis will run. The exception is that stage 4,
fitting models and generating plots, always runs.

If you only want to replicate from the second stage onward, you want
wikiq_tsvs.7z. If you want to replicate everything, you want
wikia_mediawiki_xml_dumps.7z.001 and wikia_mediawiki_xml_dumps.7z.002.

These instructions work backwards: from building the manuscript using
knitr, through loading the datasets and running the analysis, to
building the intermediate datasets.

## Building the manuscript using knitr ##

This requires working latex, latexmk, and knitr
installations. Depending on your operating system you might install
these packages in different ways. On Debian Linux you can run `apt
install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you
can upload the necessary files to a project on Sharelatex.com or
Overleaf.com.

1. Download `code.tar`. This has everything you need to typeset the manuscript.
2. Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
3. Navigate to code/paper_source.
4. Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
5. On a unix system you should be able to run `make` to build the
   manuscript `generalizable_wiki.pdf`. Otherwise you should try
   uploading all of the files (including the tables, figure, and knitr
   folders) to a new project on ShareLatex.com. A condensed version of
   these steps is sketched below.
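
The same steps, condensed into a single R session. This is only a
minimal sketch for unix systems: it assumes `code.tar` is in the
current working directory and that `make`, `latexmk`, and a latex
distribution are already installed.

```r
## Steps 1-5 from R, on a unix system.
untar("code.tar")                           # step 2: unpack the code archive
setwd("code/paper_source")                  # step 3: the manuscript sources
install.packages(c("data.table", "scales", "ggplot2",
                   "lubridate", "texreg"))  # step 4: R dependencies
system("make")                              # step 5: builds generalizable_wiki.pdf
```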

## Loading intermediate datasets ##

The intermediate datasets are found in the `intermediate_data.7z`
archive. They can be extracted on a unix system using the command `7z
x intermediate_data.7z`. The files are 95MB uncompressed. These are
RDS (R data set) files and can be loaded in R using the `readRDS`
function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If
you wish to work with these datasets using a tool other than R, you
might prefer to work with the .tab files.
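
A slightly fuller example of loading and inspecting the extracted
files from R. Only `newcomers.RDS` is named in this README, so
`list.files` is used here to discover the other datasets.

```r
## Extracted RDS files can be loaded and inspected like this.
list.files(pattern = "\\.RDS$")          # list the extracted datasets
newcomers <- readRDS("newcomers.RDS")
str(newcomers)                           # column names and types
summary(newcomers)
```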

## Running the analysis ##

Fitting the models may not work on machines with less than 32GB of
RAM. If you have trouble, you may find the functions in
lib-01-sample-datasets.R useful to create stratified samples of data
for fitting models. See line 89 of 02_model_newcomer_survival.R for an
example.
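
If you do need to subsample, the sketch below shows the general idea
using data.table. It is only an illustration: the column name
`wiki.name` and the cap of 1,000 rows per wiki are placeholders, not
the names or values used by the functions in lib-01-sample-datasets.R.

```r
## Illustrative stratified sampling to reduce memory use when fitting models.
## "wiki.name" and the per-wiki cap of 1000 rows are placeholders; see
## lib-01-sample-datasets.R for the functions actually used in the paper.
library(data.table)
newcomers <- as.data.table(readRDS("newcomers.RDS"))
set.seed(2018)
newcomers.sample <- newcomers[, .SD[sample(.N, min(.N, 1000))], by = wiki.name]
```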

1. Download `code.tar` and `intermediate_data.7z` to your working
   folder and extract both archives. On a unix system this can be done
   with the command `tar xf code.tar && 7z x intermediate_data.7z`.
2. Install R dependencies: `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. On a unix system you can simply run `regen.all.sh` to fit the
   models, build the plots, and create the RDS files. These steps are
   sketched below.
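
A condensed sketch of those steps from within R, assuming a unix
system with `7z` installed and both archives downloaded to the working
directory (the exact location of `regen.all.sh` inside `code.tar` may
differ).

```r
## Extract the archives, install dependencies, and run the analysis.
untar("code.tar")
system("7z x intermediate_data.7z")
install.packages(c("data.table", "ggplot2", "urltools", "texreg", "optimx",
                   "lme4", "bootstrap", "scales", "effects", "lubridate",
                   "devtools", "roxygen2"))
system("bash regen.all.sh")   # fits the models, builds the plots, writes the RDS results
```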

## Generating datasets ##

### Building the intermediate files ###

The intermediate files are generated from all.edits.RDS. This process
requires about 20GB of memory.

1. Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`,
   and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a
   unix system this can be done using `tar xf code.tar && 7z x
   userroles_data.7z`.
2. Install R dependencies. In R run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. Run `01_build_datasets.R` (see the sketch after this list).
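
The same stage as a single R session. This sketch assumes the
downloaded files sit in the working directory and that
`01_build_datasets.R` is run from there; the script's exact location
inside `code.tar` may differ.

```r
## Build the intermediate RDS files from all.edits.RDS (needs ~20GB of memory).
untar("code.tar")
system("7z x userroles_data.7z")
install.packages(c("data.table", "ggplot2", "urltools", "texreg", "optimx",
                   "lme4", "bootstrap", "scales", "effects", "lubridate",
                   "devtools", "roxygen2"))
source("01_build_datasets.R")   # skips stages whose outputs already exist
```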

### Building all.edits.RDS ###

The intermediate RDS files used in the analysis are created from
`all.edits.RDS`. `01_build_datasets.R` will rebuild `all.edits.RDS`
only when neither the intermediate RDS files nor `all.edits.RDS` exist
in the working directory. `all.edits.RDS` is built from the tsv files
produced by wikiq. This may take several hours. By default, building
the dataset will use all available CPU cores. If you want to change
this, modify line 26 of `lib-01-build_newcomer_table.R`.

1. Download selected.wikis.csv, userroles_data.7z, wikiq_tsvs.7z, and
   code.tar. Unpack the files. On a unix system this can be done by
   running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf
   code.tar`.
2. Run `01_build_datasets.R` to generate all.edits.RDS and the
   intermediate files, as sketched below.
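
Because the build skips any stage whose outputs already exist, it is
worth confirming from R that `all.edits.RDS` is absent before starting
this multi-hour run. A minimal sketch, under the same working
directory assumptions as above:

```r
## Regenerate all.edits.RDS from the wikiq tsv files.
system("7z x userroles_data.7z && 7z x wikiq_tsvs.7z")
untar("code.tar")
file.exists("all.edits.RDS")    # should be FALSE, or the build will skip this stage
source("01_build_datasets.R")   # several hours; uses all CPU cores by default
```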

### Running Wikiq to generate tsv files ###

If you want to regenerate the datasets all the way from the xml dumps
and data from the Wikia API you will have to run the python script
`wikiq`. This is a fairly computationally intensive process. It may
take over a day unless you can run the computations in parallel.

1. Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`,
   `wikia_mediawiki_xml_dumps.7z.002`, and
   `userroles_data.7z`. Extract the archives. On a Unix system this
   can be done by running `tar xf code.tar && 7z x
   wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
2. Make sure python3 and python3-pip are installed, then install `argparse` with pip3: `pip3 install argparse`.
3. Edit `runwikiq.sh` to set N_THREADS.
4. Run `runwikiq.sh` to generate the tsv files.

### Obtaining Bot and Admin data from the Wikia API ###

For the purposes of supporting an audit of our research project, this
repository includes the code that we used to obtain Bot and Admin data
from the Wikia API. Unfortunately, the API has changed since we ran
the script, so this code no longer works.

Our research group maintains a tool for scraping the Wikia API
available at https://code.communitydata.cc/wikia_userroles_scraper. This can
be used to download user roles for the wikis in this dataset. Follow
the instructions found in that package.