Copyright (C) 2018 Nathan TeBlunthuis.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the file entitled "fdl-1.3.md".

# Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects" #

## Overview ##

This archive contains code and data for reproducing the analysis for
"Replication Data for Revisiting 'The Rise and Decline' in a
Population of Peer Production Projects". Depending on what you hope to
do with the data, you probably do not want to download all of the
files. Depending on your computational resources, you may not be able
to run all stages of the analysis.

The code for all stages of the analysis, including typesetting the
manuscript and running the analysis, is in `code.tar`.

If you only want to run the final analysis or to play with the
datasets used in the analysis of the paper, you want
`intermediate_data.7z` or the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage
uses the program `wikiq` to create tsv files of edit data for each
wiki from the MediaWiki XML dumps. The second stage generates the
`all.edits.RDS` file, which contains edit metadata built from these
tsv files. This file is expensive to generate and, at 1.5GB, is
fairly large. The third stage builds smaller intermediate files that
contain the analytical variables. The fourth stage uses the
intermediate files to generate smaller RDS files that contain the
results. Finally, knitr and latex typeset the manuscript.

A stage will only run if its outputs do not already exist. So if the
intermediate files exist, they will not be regenerated and only the
final analysis will run. The exception is stage 4, fitting models and
generating plots, which always runs.
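
The skip-if-output-exists behavior described above can be sketched in shell. The function, stage names, and file names below are stand-ins invented for illustration; the actual pipeline implements this check inside its R scripts.

```shell
# Sketch of the skip-if-output-exists logic (hypothetical names; the
# real check lives in the R scripts). Runs in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

run_stage() {
    # $1 = stage name, $2 = output file the stage would create
    if [ -e "$2" ]; then
        echo "skipping $1: $2 exists"
    else
        echo "running $1"
        touch "$2"   # stand-in for the real computation
    fi
}

run_stage stage2 all.edits.RDS   # first call runs the stage
run_stage stage2 all.edits.RDS   # second call is skipped
```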

If you only want to replicate from the second stage onward, you want
`wikiq_tsvs.7z`. If you want to replicate everything, you want
`wikia_mediawiki_xml_dumps.7z.001` and `wikia_mediawiki_xml_dumps.7z.002`.

These instructions work backwards: from building the manuscript using
knitr, to loading the datasets and running the analysis, to building
the intermediate datasets.

## Building the manuscript using knitr ##

This requires working latex, latexmk, and knitr installations.
Depending on your operating system you might install these packages in
different ways. On Debian Linux you can run `apt install r-cran-knitr
latexmk texlive-latex-extra`. Alternatively, you can upload the
necessary files to a project on ShareLatex.com or Overleaf.com.

1. Download `code.tar`. This has everything you need to typeset the manuscript.
2. Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
3. Navigate to `code/paper_source`.
4. Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
5. On a unix system you should be able to run `make` to build the
   manuscript `generalizable_wiki.pdf`. Otherwise you should try
   uploading all of the files (including the tables, figure, and knitr
   folders) to a new project on ShareLatex.com.

## Loading intermediate datasets ##

The intermediate datasets are found in the `intermediate_data.7z`
archive. They can be extracted on a unix system using the command
`7z x intermediate_data.7z`. The files are 95MB uncompressed. These
are RDS (R data set) files and can be loaded in R using the `readRDS`
function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If
you wish to work with these datasets using a tool other than R, you
might prefer to work with the .tab files.
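
Since the .tab files are plain tab-separated text, they can also be inspected with standard unix tools. The file name and columns below are invented for the example; substitute one of the real .tab files from the archive.

```shell
# Inspect a tab-separated file without R. The file and columns here
# are made up; the real .tab files have different layouts.
printf 'wiki\teditor\tedits\n'  > example.tab
printf 'wiki_a\tuser1\t10\n'   >> example.tab
printf 'wiki_b\tuser2\t3\n'    >> example.tab

head -n 1 example.tab                # show the header row
cut -f 1,3 example.tab               # select the wiki and edits columns
awk -F '\t' 'NR > 1 {n += $3} END {print n}' example.tab   # total edits
```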

## Running the analysis ##

Fitting the models may not work on machines with less than 32GB of
RAM. If you have trouble, you may find the functions in
`lib-01-sample-datasets.R` useful to create stratified samples of data
for fitting models. See line 89 of `02_model_newcomer_survival.R` for
an example.
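
The idea behind stratified sampling can be illustrated with a small shell pipeline. The actual helpers are the R functions in `lib-01-sample-datasets.R`; the file and column layout below are invented for the sketch.

```shell
# Keep the first row from each stratum (column 1) of a tab-separated
# file -- a crude stand-in for the samplers in lib-01-sample-datasets.R.
# The data below is made up.
printf 'a\t1\nb\t2\na\t3\nb\t4\n' > strata.tab
awk -F '\t' '!seen[$1]++' strata.tab   # one row per stratum
```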

1. Download `code.tar` and `intermediate_data.7z` to your working
   folder and extract both archives. On a unix system this can be done
   with the command `tar xf code.tar && 7z x intermediate_data.7z`.
2. Install R dependencies. In R, run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. On a unix system you can simply run `regen.all.sh` to fit the
   models, build the plots, and create the RDS files.

## Generating datasets ##

### Building the intermediate files ###

The intermediate files are generated from `all.edits.RDS`. This
process requires about 20GB of memory.

1. Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`,
   and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a
   unix system this can be done using `tar xf code.tar && 7z x userroles_data.7z`.
2. Install R dependencies. In R, run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. Run `01_build_datasets.R`.

### Building all.edits.RDS ###

The intermediate RDS files used in the analysis are created from
`all.edits.RDS`. To replicate building `all.edits.RDS`, run
`01_build_datasets.R` in a working directory where neither the
intermediate RDS files nor `all.edits.RDS` exist. `all.edits.RDS` is
generated from the tsv files produced by wikiq. This may take several
hours. By default, building the dataset will use all available CPU
cores. If you want to change this, modify line 26 of
`lib-01-build_newcomer_table.R`.

1. Download `selected.wikis.csv`, `userroles_data.7z`, `wikiq_tsvs.7z`,
   and `code.tar`. Unpack the files. On a unix system this can be done
   by running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf code.tar`.
2. Run `01_build_datasets.R` to generate `all.edits.RDS` and the intermediate files.
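
Before editing line 26 of `lib-01-build_newcomer_table.R`, it can help to check how many cores the machine actually has. `nproc` is part of GNU coreutils; `getconf` is a portable fallback on other unix systems.

```shell
# Report the number of available CPU cores. nproc is GNU coreutils;
# getconf _NPROCESSORS_ONLN works on most other unix systems.
nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
```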

### Running Wikiq to generate tsv files ###

If you want to regenerate the datasets all the way from the xml dumps
and data from the Wikia API, you will have to run the python script
`wikiq`. This is a fairly computationally intensive process. It may
take over a day unless you can run the computations in parallel.

1. Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`,
   `wikia_mediawiki_xml_dumps.7z.002`, and `userroles_data.7z`.
   Extract the archives. On a Unix system this can be done by running
   `tar xf code.tar && 7z x wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
2. Make sure python3 and python3-pip are installed. Then install `argparse` with `pip3 install argparse`.
3. Edit `runwikiq.sh` to set N_THREADS.
4. Run `runwikiq.sh` to generate the tsv files.
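
One generic way to process many dump files in parallel is `xargs -P`. This is only a sketch with placeholder file names and an `echo` in place of the real command; `runwikiq.sh` is the supported entry point and its internals may differ.

```shell
# Run a command over several (stand-in) dump files, up to N_THREADS at
# a time. Replace the echo with the real wikiq invocation.
N_THREADS=4
mkdir -p dumps
touch dumps/wiki_a.xml.7z dumps/wiki_b.xml.7z   # placeholder dumps

ls dumps/*.7z | xargs -P "$N_THREADS" -I {} echo "would process {}"
```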

### Obtaining Bot and Admin data from the Wikia API ###

For the purposes of supporting an audit of our research project, this
repository includes the code that we used to obtain Bot and Admin data
from the Wikia API. Unfortunately, since we ran the script, the API
has changed and this code does not work.

Our research group maintains a tool for scraping the Wikia API,
available at https://code.communitydata.cc/wikia_userroles_scraper.
This can be used to download user roles for the wikis in this dataset.
Follow the instructions found in that package.