updating to w06 content outline. still no substance
@@ -1,7 +1,8 @@
 ---
-title: "Week 7 R Lecture"
+title: "Week 6 R lecture"
-author: "Jeremy Foote"
+subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
-date: "April 4, 2019"
+author: "Aaron Shaw"
+date: "May 3, 2019"
 output: html_document
 ---
 
@@ -9,97 +10,17 @@ output: html_document
 knitr::opts_chunk$set(echo = TRUE)
```
|
```
|
||||||
 
-## Categorical Data
+## T-tests
-
+You learned the theory/concepts behind t-tests last week, so here's a brief run-down on how to use built-in functions in R to conduct them and interpret the results.
 
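A minimal sketch of what such a run-down could look like. The `iris` data (which ships with R) and the choice of variables are assumptions made for illustration; they are not part of the lecture outline. With a formula, `t.test()` runs Welch's two-sample test by default.

```r
# Illustrative only: compare sepal length between two iris species.
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))

# Formula interface: outcome ~ two-level grouping variable.
# By default t.test() does not assume equal variances (var.equal = FALSE).
result <- t.test(Sepal.Length ~ Species, data = two_species)
result

# The pieces most often reported:
result$statistic  # t statistic
result$parameter  # degrees of freedom
result$p.value    # two-sided p-value
result$conf.int   # 95% confidence interval for the difference in means
```

Storing the result in an object makes it easy to pull out individual values instead of copying them from the printed output.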
-The goal of this script is to help you think about analyzing categorical data, including proportions, tables, chi-squared tests, and simulation.
+## ANOVAs
 
-### Estimating proportions
+The situation is analogous to t-tests. Here's a brief introduction to how ANOVAs work in R.
 
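A minimal sketch of a one-way ANOVA with built-in functions. As before, the `iris` data and the variables used are assumptions chosen for illustration, not something the outline specifies.

```r
# Illustrative only: does mean sepal length differ across the three iris species?
fit <- aov(Sepal.Length ~ Species, data = iris)

# summary() prints the ANOVA table: degrees of freedom, sums of squares,
# mean squares, the F statistic, and its p-value.
anova_table <- summary(fit)
anova_table

# Follow-up pairwise comparisons using Tukey's honest significant differences.
TukeyHSD(fit)
```

The F test only says that at least one group mean differs; the pairwise comparisons indicate which ones.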
-If a survey of 50 randomly sampled Chicagoans found that 45% of them thought that Giordano's made the best deep dish pizza, what would be the 95% confidence interval for the true proportion of Chicagoans who prefer Giordano's?
+## Visualizing confidence intervals
 
-Can we reject the hypothesis that 50% of Chicagoans prefer Giordano's?
+We spent a lot of time on confidence intervals in the past few weeks. Since they can be so useful, surely we should learn some approaches to incorporating them into data visualizations.
 
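One possible approach, sketched with base R graphics so that it needs no extra packages. The dataset, the t-based intervals, and the plotting style are all assumptions for illustration; the outline does not say which tools the lecture will use.

```r
# Illustrative only: mean sepal length per species with t-based 95% intervals.
means <- tapply(iris$Sepal.Length, iris$Species, mean)
sds   <- tapply(iris$Sepal.Length, iris$Species, sd)
ns    <- tapply(iris$Sepal.Length, iris$Species, length)

se     <- sds / sqrt(ns)
t_crit <- qt(0.975, df = ns - 1)
lower  <- means - t_crit * se
upper  <- means + t_crit * se

# Plot the point estimates, then add the intervals as error bars.
x <- seq_along(means)
plot(x, means, ylim = range(lower, upper), xaxt = "n", pch = 19,
     xlab = "Species", ylab = "Mean sepal length (cm)")
axis(1, at = x, labels = names(means))
# arrows() with angle = 90 and code = 3 draws flat caps at both ends.
arrows(x, lower, x, upper, angle = 90, code = 3, length = 0.05)
```

Setting `ylim` from the interval bounds keeps the error bars from being clipped at the edge of the plot.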
-
+## Date/time arithmetic
 
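A minimal sketch of date and date-time arithmetic in base R. The two dates are the lecture dates from this file's header; the other values (the 09:00 start time, the UTC time zone) are made up for illustration.

```r
# The two lecture dates that appear in the YAML header.
old_date <- as.Date("2019-04-04")
new_date <- as.Date("2019-05-03")

# Subtracting Date objects gives a difftime.
new_date - old_date
difftime(new_date, old_date, units = "weeks")

# Adding a number to a Date moves it forward by that many days.
old_date + 7

# seq() can step by calendar units.
seq(old_date, by = "week", length.out = 3)

# format() extracts components using strftime-style codes.
format(new_date, "%Y")

# Date-times (POSIXct) count seconds; the time zone is set explicitly here.
start <- as.POSIXct("2019-05-03 09:00:00", tz = "UTC")  # hypothetical start time
start + 90 * 60  # 90 minutes later
```

Dates count in days and POSIXct date-times count in seconds, which is why the last line multiplies minutes by 60.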
-```{r}
+Last, but not least, another wrinkle in time...or at least how to manage date-time objects in R.
-est = .45
-sample_size = 50
-SE = sqrt(est*(1-est)/sample_size)
-
-conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
-conf_int
-```
-
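The interval computed above can also answer the question about the 50% hypothesis. The sketch below checks it two ways, assuming the normal approximation is adequate; the names `null_value`, `SE_null`, and `z` are introduced here for illustration and do not appear in the original script.

```r
est <- 0.45
sample_size <- 50
null_value <- 0.5  # hypothesized proportion

# Check 1: is the null value inside the 95% confidence interval?
SE <- sqrt(est * (1 - est) / sample_size)
conf_int <- c(est - 1.96 * SE, est + 1.96 * SE)
null_in_interval <- null_value >= conf_int[1] & null_value <= conf_int[2]
null_in_interval

# Check 2: a z-test, with the standard error computed under the null.
SE_null <- sqrt(null_value * (1 - null_value) / sample_size)
z <- (est - null_value) / SE_null
p_value <- 2 * pnorm(-abs(z))  # two-sided p-value
p_value
```

With a sample of 50, the null value falls inside the interval and the p-value is well above 0.05, so the hypothesis that 50% prefer Giordano's cannot be rejected at the 5% level.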
-What if we had the same result but had sampled 500 people?
-
-
-```{r}
-est = .45
-sample_size = 500
-SE = sqrt(est*(1-est)/sample_size)
-
-conf_int = c(est - 1.96 * SE, est + 1.96 * SE)
-conf_int
-```
-
-### Tabular Data
-
-The Iris dataset is composed of measurements of flower dimensions. It comes packaged with R and is often used in examples. Here we make a table of how often each species in the dataset has a sepal width greater than 3.
-
-```{r}
-
-table(iris$Species, iris$Sepal.Width > 3)
-
-```
-
-
-The chi-squared test measures how much the frequencies we see in a table differ from what we would expect if there were no difference between the groups.
-
-```{r}
-
-chisq.test(table(iris$Species, iris$Sepal.Width > 3))
-```
-
-The extremely low p-value means that frequencies like these would be very unlikely if sepal width were unrelated to species, which suggests that sepal width differs by species.
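To make "what we would expect" concrete, the statistic can be rebuilt by hand. This is a sketch for illustration; the names `sepal_table`, `expected`, and `chi_sq` are introduced here and are not part of the original script.

```r
sepal_table <- table(iris$Species, iris$Sepal.Width > 3)

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(sepal_table), colSums(sepal_table)) / sum(sepal_table)
expected

# Pearson's statistic sums the squared, scaled gaps between observed and expected.
chi_sq <- sum((sepal_table - expected)^2 / expected)
df <- (nrow(sepal_table) - 1) * (ncol(sepal_table) - 1)
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
c(statistic = chi_sq, df = df, p.value = p_value)

# chisq.test() reports the same statistic and stores the expected counts.
test <- chisq.test(sepal_table)
all.equal(as.numeric(chi_sq), as.numeric(test$statistic))
```

The hand-computed value matches `chisq.test()` here because the table is 3 x 2; R applies a continuity correction by default only to 2 x 2 tables.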
-
-
-
-## Using Simulation
-
-When the assumptions of chi-squared tests aren't met, we can use simulation to approximate how likely a given result is.
-
-The book uses the example of a medical practitioner who has 3 complications out of 62 procedures, while the typical rate is 10%.
-
-The null hypothesis is that this practitioner's true rate is also 10%, so we're trying to figure out how rare it would be to have 3 or fewer complications, if the true rate is 10%.
-
-```{r}
-# We write a function that we are going to replicate
-simulation <- function(rate = .1, n = 62){
-  # Draw n random numbers from a uniform distribution from 0 to 1
-  draws = runif(n)
-  # If rate = .4, on average, .4 of the draws will be less than .4
-  # So, we consider those draws where the value is less than `rate` as complications
-  complication_count = sum(draws < rate)
-  # Then, we return the total count
-  return(complication_count)
-}
-
-# The replicate function runs a function many times
-
-simulated_complications <- replicate(5000, simulation())
-
-```
-
-We can look at the distribution of our simulated complication counts:
-
-```{r}
-
-hist(simulated_complications)
-```
-
-And determine what proportion of them are as extreme or more extreme than the value we saw. This is the p-value.
-
-```{r}
-
-sum(simulated_complications <= 3)/length(simulated_complications)
-```
-
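As a cross-check on the simulated p-value, the same one-sided probability can be computed exactly from the binomial distribution. This is a sketch under the same null hypothesis (10% rate, 62 procedures); the seed and the use of `rbinom()` in place of the `runif()` approach are choices made here for illustration.

```r
# Exact probability of 3 or fewer complications in 62 procedures at a 10% rate.
exact_p <- pbinom(3, size = 62, prob = 0.1)
exact_p

# rbinom() draws the simulated complication counts directly.
set.seed(525)  # arbitrary seed, only so that the run is reproducible
sim_counts <- rbinom(5000, size = 62, prob = 0.1)
sim_p <- mean(sim_counts <= 3)
sim_p
```

The simulated proportion should land close to the exact value of roughly 0.12, and it gets closer as the number of replications grows.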