initial commit
This commit is contained in:
154
r_lectures/w08-R_lecture.Rmd
Normal file
154
r_lectures/w08-R_lecture.Rmd
Normal file
@@ -0,0 +1,154 @@
|
|||||||
|
---
|
||||||
|
title: "Week 8 R Lecture"
|
||||||
|
author: "Aaron Shaw"
|
||||||
|
date: "May 16, 2019"
|
||||||
|
output: html_document
|
||||||
|
---
|
||||||
|
|
||||||
|
```{r setup, include=FALSE}
|
||||||
|
knitr::opts_chunk$set(echo = TRUE)
|
||||||
|
```
|
||||||
|
This week's R tutorial materials focus on the basics of correlations and linear regressions. I'll work with the `mtcars` dataset that comes built-in with R.
|
||||||
|
|
||||||
|
## Correlations
|
||||||
|
|
||||||
|
Calculating correlation coefficients is straightforward: use the `cor()` function:
|
||||||
|
```{r}
|
||||||
|
with(mtcars, cor(mpg, hp))
|
||||||
|
```
|
||||||
|
All you prius drivers out there will be shocked to learn that miles-per-gallon is negatively correlated with horsepower.
|
||||||
|
|
||||||
|
The `cor()` function works with two variables or with more—the following generates a correlation matrix for the whole dataset!
|
||||||
|
```{r}
|
||||||
|
cor(mtcars)
|
||||||
|
```
|
||||||
|
|
||||||
|
Note that if you are calculating correlations with variables that are not distributed normally you should use `cor(method="spearman")` because it calculates rank-based correlations (look it up online for more details).
|
||||||
|
|
||||||
|
## Fitting a linear model (with one variable)
|
||||||
|
|
||||||
|
Linear models are fit using the `lm()` command. As with `aov()`, the `lm()` function requires a formula as an input and is usually presented with a call to `summary()`. You can enter the formula directly in the call to `lm()` or define it separately. For this example, I'll regress `mpg` on a single predictor, `hp`:
|
||||||
|
```{r}
|
||||||
|
model1 <- lm(mpg ~ hp, data=mtcars)
|
||||||
|
|
||||||
|
summary(model1)
|
||||||
|
```
|
||||||
|
Notice how much information the output of `summary()` gives you for a linear model! You have details about the residuals, the usual information about the coefficients, standard errors, t-values, etc., little stars corresponding to conventional significance levels, $R^2$ values, degrees of freedom, F-statistics (remember those?) and p-values for the overall model fit.
|
||||||
|
|
||||||
|
There's even more under the hood. Try looking at all the different things in the model object R has created:
|
||||||
|
```{r}
|
||||||
|
names(model1)
|
||||||
|
|
||||||
|
```
|
||||||
|
You can directly inspect the residuals using `model1$residuals`. This makes plotting and other diagnostic activities pretty straightforward:
|
||||||
|
```{r}
|
||||||
|
summary(model1$residuals)
|
||||||
|
```
|
||||||
|
|
||||||
|
More on that in a moment. In the meantime, you can also use the items generated by the call to `summary()` as well:
|
||||||
|
```{r}
|
||||||
|
names(summary(model1))
|
||||||
|
summary(model1)$coefficients
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
There are also functions to help you do things with the model such as predict the fitted values for new data. For example, if I found some new cars with horsepowers ranging from 90-125, what would this model predict for the corresponding mpg for each car?
|
||||||
|
```{r}
|
||||||
|
new.data <- data.frame(hp=seq(90,125,5))
|
||||||
|
predict(model1, new.data, type="response")
|
||||||
|
```
|
||||||
|
A call to predict can also provide standard errors around these predictions (which you could use, for example, to construct a 95% confidence interval around the model-predicted values):
|
||||||
|
```{r}
|
||||||
|
predict(model1, new.data, type="response", se.fit = TRUE)
|
||||||
|
```
|
||||||
|
Linear model objects also have a built-in method for generating confidence intervals around the values of $\beta$:
|
||||||
|
```{r}
|
||||||
|
confint(model1, "hp", level=0.95) # Note that I provide the variable name in quotes
|
||||||
|
```
|
||||||
|
Feeling old-fashioned? You can always calculate residuals or confidence intervals (or anything else) "by hand":
|
||||||
|
```{r}
|
||||||
|
# Residuals
|
||||||
|
mtcars$mpg - model1$fitted.values
|
||||||
|
|
||||||
|
# 95% CI for the coefficient on horsepower
|
||||||
|
est <- model1$coefficients["hp"]
|
||||||
|
se <- summary(model1)$coefficients[2,2]
|
||||||
|
|
||||||
|
est + 1.96 * c(-1,1) * se
|
||||||
|
```
|
||||||
|
|
||||||
|
## Plotting residuals
|
||||||
|
|
||||||
|
You can generate diagnostic plots of residuals in various ways:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
hist(residuals(model1))
|
||||||
|
hist(model1$residuals)
|
||||||
|
```
|
||||||
|
|
||||||
|
Plot the residuals against the original predictor variable:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
library(ggplot2)
|
||||||
|
|
||||||
|
qplot(x=mtcars$hp, y=residuals(model1), geom="point")
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
Quantile-quantile plots can be done using `qqnorm()` on the residuals:
|
||||||
|
```{r}
|
||||||
|
qqnorm(residuals(model1))
|
||||||
|
```
|
||||||
|
The easiest way to generate a few generic diagnostic plots in ggplot is documented pretty well on StackExchange and elsewhere:
|
||||||
|
```{r}
|
||||||
|
library(ggfortify)
|
||||||
|
|
||||||
|
autoplot(model1)
|
||||||
|
```
|
||||||
|
|
||||||
|
## Adding additional variables (multiple regression—really useful next week)
|
||||||
|
|
||||||
|
You can, of course, have models with many variables. This might happen by creating a brand new formula or using a command `update.formula()` to...well, you probably guessed it:
|
||||||
|
```{r}
|
||||||
|
f1 <- formula(mpg ~ hp)
|
||||||
|
|
||||||
|
f2 <- formula(mpg ~ hp + disp + cyl + vs)
|
||||||
|
|
||||||
|
f2a <- update.formula(f1, . ~ . + disp + cyl + vs) ## Same as f2 above
|
||||||
|
|
||||||
|
model2 <- lm(f2, data=mtcars)
|
||||||
|
|
||||||
|
summary(model2)
|
||||||
|
```
|
||||||
|
Estimating linear models with predictor variables that are not continuous (numeric or integers) is no problem. Just go for it:
|
||||||
|
```{r}
|
||||||
|
mtcars$cyl <- factor(mtcars$cyl)
|
||||||
|
mtcars$vs <- as.logical(mtcars$vs)
|
||||||
|
|
||||||
|
## Refit the same model:
|
||||||
|
model2 <- lm(f2, data=mtcars)
|
||||||
|
summary(model2)
|
||||||
|
```
|
||||||
|
We'll talk more about how to interpret these results with categorical predictors next week, but for now you can see that R has no trouble handling multiple types or classes of variables in a regression model.
|
||||||
|
|
||||||
|
## Producing nice regression tables
|
||||||
|
Generating regression tables directly from your statistical software is very important for preventing mistakes and typos. There are many ways to do this and a variety of packages that may be helpful (LaTex users: see [this StackExchange post](https://stackoverflow.com/questions/5465314/tools-for-making-latex-tables-in-r) for a big list).
|
||||||
|
|
||||||
|
One especially easy-to-use package that can output text and html (both eminently paste-able into a variety of typesetting/word-processing systems) is called `stargazer`. Here I use it to generate an ASCII table summarizing the two models we've fit in this tutorial.
|
||||||
|
```{r}
|
||||||
|
library(stargazer)
|
||||||
|
|
||||||
|
stargazer(model1, model2, type="text")
|
||||||
|
```
|
||||||
|
|
||||||
|
## Back to ANOVAs for a moment
|
||||||
|
|
||||||
|
You may recall that I mentioned that R actually calls `lm()` when it estimates an ANOVA. As I said before, I'm not going to walk through the details, but an important thing to note is that the F-statistics and the p-values for those F-statistics are identical when you use `aov()` and when you use `lm()`. That means that you already know what hypothesis is being tested there and how to interpret that part of the regression model output.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
summary(aov(data=mtcars, mpg ~ factor(cyl)))
|
||||||
|
|
||||||
|
summary(lm(data=mtcars, mpg ~ factor(cyl)))
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
Reference in New Issue
Block a user