updates to the README
- updated to point to the current git repo
- fixed a series of typos and style stuff
- fixed two filename inconsistencies
- changed sharelatex to overleaf
27
README.md
@@ -1,10 +1,9 @@
---
title: Software and data for "A Computational Analysis of Social Media Scholarship"
output: html_document
css: ./simple.css
---
<link rel='stylesheet' href='./simple.css'>
> **Authors:** [Jeremy Foote](http://jeremydfoote.com/), [Aaron Shaw](http://aaronshaw.org/), [Benjamin Mako Hill](https://mako.cc/academic/)<br />
> **Archival copies of code and data:** <https://dx.doi.org/10.7910/DVN/W31PH5><br />
> **License:** see [COPYING file](COPYING): code is released under [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html) or any later version; chapter is released as [CC BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).
@@ -28,7 +27,7 @@ This document is meant to be read alongside our chapter. The rest of this docume
We will be as explicit as possible in this document, and try to make it accessible to less-technical readers. However, we do make a few assumptions:
* You have access to and basic familiarity with [a POSIX command line interface](https://en.wikipedia.org/wiki/POSIX). The instructions here are written for and tested using [Debian](https://www.debian.org/) and [Ubuntu](https://www.ubuntu.com/) GNU/Linux. That said, these instructions should work without modification on most Linux systems. Although MacOS users may need to [tweak a few things](MacInstallNotes), they should work there, too. Microsoft Windows users will likely need to tweak more things. This is particularly true for the last step—building the paper itself. If you can get a simple example like [this one](https://github.com/yihui/knitr-examples/blob/master/005-latex.Rtex) working, then there's a decent chance you can get the chapter to build.
-* You have [Python 3.x](https://www.python.org/downloads/) installed. For many users, you will already have it installed. Debian and Ubuntu users can install it with `apt install python3`. Others can download it from [the Python download page](https://www.python.org/downloads/)
+* You have [Python 3.x](https://www.python.org/downloads/) installed. Many users will already have it installed. Debian and Ubuntu users can install it with `apt install python3`. Others can download it from [the Python download page](https://www.python.org/downloads/)
* You have [GNU R 3.x](https://www.r-project.org/) installed. Debian and Ubuntu users can install it with `apt install r-base`. Others can install it from [the R homepage](https://www.r-project.org/). In our testing we used versions GNU R versions 3.3.2 and 3.4.1.
* To conduct the bibliometric network analysis, you'll need the [igraph library](http://igraph.org/). To install it on Debian or Ubuntu you can run `apt install libigraph0v5`.
* You will also need the following Python libraries:
@@ -86,7 +85,7 @@ As part of writing this paper, we did the work of downloading the metadata from
Scopus won't let us make that dataset publicly accessible on the web, but we can grant you access to it through [the Harvard Dataverse](https://dataverse.harvard.edu/). Just request it through Dataverse. We ask that you only use the data for the purpose of reproducing this chapter.
-Once you have downloaded the data, unpack it in the same directory that you unpacked the `code_and_data.tar.gz` file. Now you should have a third subdirectory: `raw_data`.
+Once you have downloaded the data, unpack it in the same directory that you unpacked the `code_and_paper.tar.gz` file. Now you should have a third subdirectory: `raw_data`.
#### 2.2. Option 2: Get the Data from Scopus Yourself
@@ -165,7 +164,7 @@ Doing this will require two final steps:
The code used for our bibliometric analysis is contained within the `code/bibliometrics/` subdirectory.
-We've included two copies of our Python code for our bibliometric analysis in the files `00_citation_network_analysis.py` and `00_citation_network_analysis.ipynb`. We will describe using the former in this section. If you have [Jupyter](https://jupyter.org/) installed you can open the file in a a notebook format used by many scientists by running `jupyter-notebook citation_network_analysis.ipynb`. If you want to try Jupyter, Debian and Ubuntu users can install it with `apt install jupyter-notebook` and other users can download it [here](https://jupyter.org/install.html).
+We've included two copies of our Python code for our bibliometric analysis in the files `00_citation_network_analysis.py` and `00_citation_network_analysis.ipynb`. We will describe using the former in this section. If you have [Jupyter](https://jupyter.org/) installed you can open the file in a notebook format used by many scientists by running `jupyter-notebook citation_network_analysis.ipynb`. If you want to try Jupyter, Debian and Ubuntu users can install it with `apt install jupyter-notebook` and other users can download it [here](https://jupyter.org/install.html).
Our bibliometric analysis code does require one additional piece of software called [Infomap](http://www.mapequation.org/) which we use to identify clusters in our citation network. There are some [instructions online](https://github.com/mapequation/infomap) but you can download and install it with the following commands run from the `code/bibliometrics` subdirectory:
@@ -174,7 +173,7 @@ Our bibliometric analysis code does require one additional piece of software cal
cd infomap
make
-Once you have Infomap installed, running our bibliometric analysis all done with a single Python command run from the root directory:
+Once you have Infomap installed, our bibliometric analysis is all done with a single Python command run from the root directory:
python3 code/bibliometrics/00_citation_network_analysis.py
@@ -204,9 +203,9 @@ Getting things just right will take some fiddling! We've included our Gephi file
### 5. Topic Modeling Analysis
-The `code/topic_modeling` directory applies [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)(LDA) topic modeling to the social media abstracts.
+The `code/topic_modeling` directory applies [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) topic modeling to the social media abstracts.
-LDA takes in a set of documents and produces a set of topics and a distribution of topics for each document. The first file takes in the abstract file, and creates two outputs: The abstracts together with their topic distribution and a set of topics and the top words associated with each.
+LDA takes in a set of documents and produces a set of topics and a distribution of topics for each document. The first file takes in the abstract file, and creates two outputs: the abstracts together with their topic distribution, and a set of topics and the top words associated with each.
Our topic modeling analysis includes the following steps:
@@ -214,20 +213,20 @@ Our topic modeling analysis includes the following steps:
python3 code/topic_modeling/00_topics_extraction.py
-> Note that this will take some time —5-10 minutes on a decent laptop.
+> Note that this will take some time—5-10 minutes on a decent laptop.
2. Run a second file which (a) makes a couple of tables of the top words for each topic and (b) generates some summary statistics for how the topics change over time. These statistics are used, e.g., to create Figure 4. You can run this with:
python3 code/topic_modeling/01_make_paper_files.py
-Topic modeling is a stochastic process, and you may notice differences—potentially large differences—between the results in our chapter and the results when you run it. If the order of topics has changed (or if you think other labels would be appropriate for some topics), then you can adjust the topic names by editing the `topic_names` list in the `01_make_papers.py` file.
+Topic modeling is a stochastic process, and you may notice differences—potentially large differences—between the results in our chapter and the results when you run it. If the order of topics has changed (or if you think other labels would be appropriate for some topics), then you can adjust the topic names by editing the `topic_names` list in the `01_make_paper_files.py` file.
### 6. Prediction Analysis
For the prediction analysis, we use features of the papers to predict whether or not a paper gets cited.
-These commands require a computer with a large amount of memory (i.e., RAM). We had trouble running these steps on our laptops which did not have enough memory. If you don't have access to such a computer, then you can change the `n_features` variable in the `00_ngram_extraction.py` file from `100000` to something like `3000`. This will change how many terms are included in the prediction analysis, but shouldn't make an important difference in the results.
+These commands require a computer with a large amount of memory (i.e., RAM). We had trouble running these steps on our laptops, which did not have enough memory. If you don't have access to such a computer, then you can change the `n_features` variable in the `00_ngram_extraction.py` file from `100000` to something like `3000`. This will change how many terms are included in the prediction analysis, but shouldn't make an important difference in the results.
We ran the following steps:
@@ -271,7 +270,7 @@ After this, you should change to the `paper` directory and simply run the comman
This will produce a quickly scrolling output to standard out, and if everything has worked, then in the end it will produce a bunch of files in the `paper` directory, one of which will be the final PDF file!
-An alternative approach that does not involve installing software is to upload the entire paper subdirectory (including the paper/data subdirectory) as a paper repository to the service [ShareLaTeX](https://www.sharelatex.com/). In order to make it work, you'll also need to rename the file ending with `.Rnw` to `.Rtex`. We spent some of the time writing our paper using ShareLaTeX so this should work.
+An alternative approach that does not involve installing software is to upload the entire paper subdirectory (including the paper/data subdirectory) as a paper repository to the service [Overleaf](https://www.overleaf.com/). In order to make it work, you'll also need to rename the file ending with `.Rnw` to `.Rtex`. We spent some of the time writing our paper using Overleaf, so this should work.
### Help us Find Errors, Improvements, and Updates
@@ -283,6 +282,4 @@ You are welcome to get in touch with us if you have questions and we have provid
* [Aaron Shaw](http://aaronshaw.org/) <<aaronshaw@northwestern.edu>>
* [Benjamin Mako Hill](https://mako.cc/academic/) <<makohill@uw.edu>>
If you can fix issues you run into, find ways to clarify our instructions, or make fixes to our code, please tell us! We'd love to add helpful improvements to these materials.
-In addition to the [archival version in the Dataverse](https://dx.doi.org/10.7910/DVN/W31PH5), we have hosted our code in a [git revision control management](https://git-scm.com/) repository here: <https://code.communitydata.cc/social-media-chapter.git> Even the page you are reading is included in our repository. If you notice any typos or errors, please send us your fix! To do so, you can follow the [instructions we have posted](https://code.communitydata.cc/) for sending updated versions of our code or documentation. Your work and improvements can help others trying to replicate and learn from this project.
+In addition to the [archival version in the Dataverse](https://dx.doi.org/10.7910/DVN/W31PH5), we have hosted our code in a [git revision control management](https://git-scm.com/) repository here: <https://gitea.communitydata.science/jdfoote/social-media-chapter>