Copyright (C) 2018 Nathan TeBlunthuis.

Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the file entitled "fdl-1.3.md".

# Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects" #

## Overview ##

This archive contains code and data for reproducing the analysis for
"Replication Data for Revisiting 'The Rise and Decline' in a
Population of Peer Production Projects". Depending on what you hope to
do with the data, you probably do not want to download all of the
files. Depending on your computational resources, you may not be able
to run all stages of the analysis.

The code for all stages of the analysis, including typesetting the
manuscript and running the analysis, is in code.tar.

If you only want to run the final analysis or to play with datasets
used in the analysis of the paper, you want intermediate_data.7z or
the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage
uses the program "wikiq" to create tsv files that have edit data for
each wiki. The second stage generates the all.edits.RDS file, which
contains edit metadata from the mediawiki xml dumps. This file is
expensive to generate and, at 1.5GB, is pretty big. The third stage
builds smaller intermediate files that contain the analytical
variables from these tsv files. The fourth stage uses the intermediate
files to generate smaller RDS files that contain the results. Finally,
knitr and latex typeset the manuscript.

A stage will only run if the outputs from the previous stages do not
exist, so if the intermediate files exist they will not be regenerated
and only the final analysis will run. The exception is that stage 4,
fitting models and generating plots, always runs.

If you only want to replicate from the second stage onward, you want
wikiq_tsvs.7z. If you want to replicate everything, you want
wikia_mediawiki_xml_dumps.7z.001 and wikia_mediawiki_xml_dumps.7z.002.

These instructions work backwards: from building the manuscript using
knitr, through loading the datasets and running the analysis, to
building the intermediate datasets.

## Building the manuscript using knitr ##

This requires working latex, latexmk, and knitr
installations. Depending on your operating system you might install
these packages in different ways. On Debian Linux you can run `apt
install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you
can upload the necessary files to a project on Sharelatex.com or
Overleaf.com.

1. Download `code.tar`. This has everything you need to typeset the manuscript.
2. Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
3. Navigate to code/paper_source.
4. Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
5. On a unix system you should be able to run `make` to build the
   manuscript `generalizable_wiki.pdf`. Otherwise you should try
   uploading all of the files (including the tables, figure, and knitr
   folders) to a new project on ShareLatex.com. A condensed version of
   these steps is sketched below.
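
The same steps, condensed into a single R session. This is only a
minimal sketch for unix systems: it assumes `code.tar` is in the
current working directory and that `make`, `latexmk`, and a latex
distribution are already installed.

```r
## Steps 1-5 from R, on a unix system.
untar("code.tar")                           # step 2: unpack the code archive
setwd("code/paper_source")                  # step 3: the manuscript sources
install.packages(c("data.table", "scales", "ggplot2",
                   "lubridate", "texreg"))  # step 4: R dependencies
system("make")                              # step 5: builds generalizable_wiki.pdf
```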

## Loading intermediate datasets ##

The intermediate datasets are found in the `intermediate_data.7z`
archive. They can be extracted on a unix system using the command `7z
x intermediate_data.7z`. The files are 95MB uncompressed. These are
RDS (R data set) files and can be loaded in R using the `readRDS`
function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If
you wish to work with these datasets using a tool other than R, you
might prefer to work with the .tab files.
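
A slightly fuller example of loading and inspecting the extracted
files from R. Only `newcomers.RDS` is named in this README, so
`list.files` is used here to discover the other datasets.

```r
## Extracted RDS files can be loaded and inspected like this.
list.files(pattern = "\\.RDS$")          # list the extracted datasets
newcomers <- readRDS("newcomers.RDS")
str(newcomers)                           # column names and types
summary(newcomers)
```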

## Running the analysis ##

Fitting the models may not work on machines with less than 32GB of
RAM. If you have trouble, you may find the functions in
lib-01-sample-datasets.R useful to create stratified samples of data
for fitting models. See line 89 of 02_model_newcomer_survival.R for an
example.
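
If you do need to subsample, the sketch below shows the general idea
using data.table. It is only an illustration: the column name
`wiki.name` and the cap of 1,000 rows per wiki are placeholders, not
the names or values used by the functions in lib-01-sample-datasets.R.

```r
## Illustrative stratified sampling to reduce memory use when fitting models.
## "wiki.name" and the per-wiki cap of 1000 rows are placeholders; see
## lib-01-sample-datasets.R for the functions actually used in the paper.
library(data.table)
newcomers <- as.data.table(readRDS("newcomers.RDS"))
set.seed(2018)
newcomers.sample <- newcomers[, .SD[sample(.N, min(.N, 1000))], by = wiki.name]
```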

1. Download `code.tar` and `intermediate_data.7z` to your working
   folder and extract both archives. On a unix system this can be done
   with the command `tar xf code.tar && 7z x intermediate_data.7z`.
2. Install R dependencies: `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. On a unix system you can simply run `regen.all.sh` to fit the
   models, build the plots, and create the RDS files. These steps are
   sketched below.
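
A condensed sketch of those steps from within R, assuming a unix
system with `7z` installed and both archives downloaded to the working
directory (the exact location of `regen.all.sh` inside `code.tar` may
differ).

```r
## Extract the archives, install dependencies, and run the analysis.
untar("code.tar")
system("7z x intermediate_data.7z")
install.packages(c("data.table", "ggplot2", "urltools", "texreg", "optimx",
                   "lme4", "bootstrap", "scales", "effects", "lubridate",
                   "devtools", "roxygen2"))
system("bash regen.all.sh")   # fits the models, builds the plots, writes the RDS results
```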

## Generating datasets ##

### Building the intermediate files ###

The intermediate files are generated from all.edits.RDS. This process
requires about 20GB of memory.

1. Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`,
   and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a
   unix system this can be done using `tar xf code.tar && 7z x
   userroles_data.7z`.
2. Install R dependencies. In R run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. Run `01_build_datasets.R` (see the sketch after this list).
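
The same stage as a single R session. This sketch assumes the
downloaded files sit in the working directory and that
`01_build_datasets.R` is run from there; the script's exact location
inside `code.tar` may differ.

```r
## Build the intermediate RDS files from all.edits.RDS (needs ~20GB of memory).
untar("code.tar")
system("7z x userroles_data.7z")
install.packages(c("data.table", "ggplot2", "urltools", "texreg", "optimx",
                   "lme4", "bootstrap", "scales", "effects", "lubridate",
                   "devtools", "roxygen2"))
source("01_build_datasets.R")   # skips stages whose outputs already exist
```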

### Building all.edits.RDS ###

The intermediate RDS files used in the analysis are created from
`all.edits.RDS`. `01_build_datasets.R` will rebuild `all.edits.RDS`
only when neither the intermediate RDS files nor `all.edits.RDS` exist
in the working directory. `all.edits.RDS` is built from the tsv files
produced by wikiq. This may take several hours. By default, building
the dataset will use all available CPU cores. If you want to change
this, modify line 26 of `lib-01-build_newcomer_table.R`.

1. Download selected.wikis.csv, userroles_data.7z, wikiq_tsvs.7z, and
   code.tar. Unpack the files. On a unix system this can be done by
   running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf
   code.tar`.
2. Run `01_build_datasets.R` to generate all.edits.RDS and the
   intermediate files, as sketched below.
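
Because the build skips any stage whose outputs already exist, it is
worth confirming from R that `all.edits.RDS` is absent before starting
this multi-hour run. A minimal sketch, under the same working
directory assumptions as above:

```r
## Regenerate all.edits.RDS from the wikiq tsv files.
system("7z x userroles_data.7z && 7z x wikiq_tsvs.7z")
untar("code.tar")
file.exists("all.edits.RDS")    # should be FALSE, or the build will skip this stage
source("01_build_datasets.R")   # several hours; uses all CPU cores by default
```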

### Running Wikiq to generate tsv files ###

If you want to regenerate the datasets all the way from the xml dumps
and data from the Wikia API you will have to run the python script
`wikiq`. This is a fairly computationally intensive process. It may
take over a day unless you can run the computations in parallel.

1. Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`,
   `wikia_mediawiki_xml_dumps.7z.002`, and
   `userroles_data.7z`. Extract the archives. On a Unix system this
   can be done by running `tar xf code.tar && 7z x
   wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
2. Make sure python3 and python3-pip are installed, then install `argparse` with pip3: `pip3 install argparse`.
3. Edit `runwikiq.sh` to set N_THREADS.
4. Run `runwikiq.sh` to generate the tsv files.

### Obtaining Bot and Admin data from the Wikia API ###

For the purposes of supporting an audit of our research project, this
repository includes the code that we used to obtain Bot and Admin data
from the Wikia API. Unfortunately, the API has changed since we ran
the script, so this code no longer works.

Our research group maintains a tool for scraping the Wikia API
available at https://code.communitydata.cc/wikia_userroles_scraper. This can
be used to download user roles for the wikis in this dataset. Follow
the instructions found in that package.