w04 materials
This commit is contained in:
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
title: "Week 3 R lecture"
|
title: "Week 4 R lecture"
|
||||||
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
|
subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
|
||||||
author: "Aaron Shaw"
|
author: "Aaron Shaw"
|
||||||
date: "April 18, 2019"
|
date: "April 18, 2019"
|
||||||
@@ -25,35 +25,117 @@ class(a.good.time)
|
|||||||
a.good.time
|
a.good.time
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Binomial and factorial functions
|
||||||
|
|
||||||
|
In Chapter 3 (and in last week's problem set), you needed to calculate some binomial choice arithmetic and/or factorials. They weren't absolutely necessary for the problem set, but here are the corresponding functions in R.
|
||||||
|
|
||||||
|
Let's say we want to calculate how many possible pairs you can draw from a population of ten individuals, a.k.a., $10 \choose 2$ or, instead you wanted to calculate $10!$
|
||||||
|
```{r}
|
||||||
|
choose(10,2)
|
||||||
|
|
||||||
|
factorial(10)
|
||||||
|
```
|
||||||
|
|
||||||
## Distribution functions
|
## Distribution functions
|
||||||
distribution functions: lets focus on *unif(): the key is on page 222 of Verzani
|
|
||||||
The “d” functions return the p.d.f. of the distribution
|
|
||||||
dunif(x=1, min=0, max=3) # 1/3 of the area is the to the left 1
|
|
||||||
The “p” functions return the c.d.f. of the distribution.
|
|
||||||
dunif(q=2, min=0, max=3) #1/(b-a) is 2/3
|
|
||||||
The “q” functions return the quantiles.
|
|
||||||
qunif(p=0.5, min=0, max=3) # half way between 0 and 3
|
|
||||||
|
|
||||||
The “r” functions return random samples from a distribution.
|
R has a number of built-in functions to help you work with distributions in various ways that also started to come up in *OpenIntro* Chapter 3. I will introduce a couple of points about them here, but I also highly recommend you look at the relevant section of the Verzani *Using R Introductory Statistics* book (pp 222-229) for more on this (and, honestly, for more on most of the topics we're covering in R).
|
||||||
runif(n=1, min=0, max=3) # a random value in [0,3]
|
|
||||||
|
|
||||||
## Doing simple simulations with random data
|
The key to this is that R has a set of distributions (e.g. uniform, normal, binomial, log-normal, etc.) and a set of functions (`d`, `p`, `q`, and `r`) that can be run for each distribution. I'll use a uniform distribuition (a distribution between any two values (*min*, *max*) where the values may occur with uniform probability) for my example below. Verzani has others for when you need them.
|
||||||
runif()
|
|
||||||
rnorm()
|
|
||||||
|
|
||||||
## A quick simulation
|
The `d` function gets you information about the density function of the distribution. The `p` function works with the cumulative probabilities. The `q` function gets you quantiles from the distribution. The `r` function allows you to generate random samples from the distribution. As you can see, the letters corresponding to each function *almost* make sense...<*sigh*>. They also each take specific arguments that can vary a bit depending on which kind of distribution you are using them with (as always, the help pages and the internet can be helpful here).
|
||||||
|
|
||||||
In case you don't believe the central limit theorem, let's put together a quick simulation to illustrate it in R.
|
Onwards to the example code, which looks at a uniform distribution between 0 and 3:
|
||||||
|
|
||||||
Write a function to repeatedly take the mean of a sample.
|
```{r}
|
||||||
|
dunif(x=1, min=0, max=3) # What proportion of the area is the to the left of 1?
|
||||||
|
|
||||||
Experiment by changing the size of the sample
|
punif(q=1, min=0, max=3) # Same as the prior example in this case.
|
||||||
|
|
||||||
|
qunif(p=0.5, min=0, max=3) # 50th percentile
|
||||||
|
|
||||||
|
runif(n=4, min=0, max=3) # Random values in [0,3]
|
||||||
|
```
|
||||||
|
Look at the Verzani text for additional examples, including several that can solve binomial probability calculations (e.g., if you flip a fair coin 100 times, what are the odds of observing heads 60 or more times?).
|
||||||
|
|
||||||
|
### A quick simulation (using a for-loop!)
|
||||||
|
|
||||||
|
Beyond proving invaluable for rapid calculations of solutions to problem set questions, the distribution functions are very, very useful for running simulations. We won't really spend a lot of time on simulations in class, but I'll give you an example here that can generalize to more complicated problems. I also use a programming technique we haven't talked about yet called a for-loop to help repeat the sampling process multiple times.
|
||||||
|
|
||||||
|
For my simulation let's say that I want to repeatedly draw random samples from a distribution and examine the distribution of the resulting sample means. I'll start by generating a vector of 10,000 random data points drawn from a log-normal distribution where the mean and standard deviation of the log-transformed values are 0 and 1 respectively:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
d <- rlnorm(10000, meanlog=0, sdlog=1)
|
||||||
|
|
||||||
|
head(d)
|
||||||
|
mean(d)
|
||||||
|
sd(d)
|
||||||
|
hist(d)
|
||||||
|
```
|
||||||
|
|
||||||
|
Okay, now, I want to draw 500 samples of 100 observations from this population and take the mean of each sample. Time to write a function! Notice that I require two inputs into my function: the population data and the sample size.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
sample.mean <- function(pop, n){
|
||||||
|
s <- sample(pop, n)
|
||||||
|
return(mean(s))
|
||||||
|
}
|
||||||
|
|
||||||
|
## Run it once to see how it goes:
|
||||||
|
sample.mean(d, 100)
|
||||||
|
```
|
||||||
|
Next step: let's run that 500 times. Here's where the for-loop comes in handy. A couple of things about the syntax of for-loops in R: The basic syntax of a for-loop is that you call some operation to occur over some index. Here's a simple example that illustrates how they work. The loop iterates through the integers between 1-10 and prints the square of each value:
|
||||||
|
```{r}
|
||||||
|
for(x in c(1:10)){
|
||||||
|
print(x^2)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
Since I want to store the output of my sample means loop, I will first create an object `s.means` that is a numeric vector with one value (0) that will be replaced in a moment.
|
||||||
|
```{r}
|
||||||
|
s.means <- 0
|
||||||
|
```
|
||||||
|
Onwards to the loop itself. In the block of code below, you'll notice that I once again declare an index over which to iterate. That's what happens inside that first set of parentheses where I have `i in c(1:30)`. That's telling R to go through the loop for each value from 1:30 and to call the current index value `i` during each loop. Each time through the loop, the value of `i` advances through the index (in this case, it goes up by 1). The result is that each time through it will take the output of my `sample.mean` function and append it as the $i^{th}$ value of `s.means`. The `next` call at the end is optional, but can be important sometimes to help you keep track of what's going on.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
for(i in c(1:500)){
|
||||||
|
s.means[i] <- sample.mean(d, 100)
|
||||||
|
next
|
||||||
|
}
|
||||||
|
```
|
||||||
|
The `s.means` variable now contains a distribution of sample means! What are the characteristics of the distribution? You already know how to do that.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
summary(s.means)
|
||||||
|
```
|
||||||
|
Let's plot it too:
|
||||||
|
```{r}
|
||||||
|
library(ggplot2)
|
||||||
|
qplot(s.means, geom="density")
|
||||||
|
```
|
||||||
|
|
||||||
|
That looks pretty "normal."
|
||||||
|
|
||||||
|
Experiment with this example by changing the size of the sample and/or the number of samples we draw.
|
||||||
|
|
||||||
|
Now, think back to the original vector `d.` Can you explain what fundamental statistical principle is illustrated in this example? Why do the values in `s.means` fluctuate so much? What is the relationship of `s.means` to `d`?
|
||||||
|
|
||||||
## Quantile quantile plots
|
## Quantile quantile plots
|
||||||
|
|
||||||
|
Last, but not least, you might have admired the quantile-quantile plots presented in some of the examples in *OpenIntro*. The usual idea with "Q-Q- plots" is that you want to see how the observed (empirical) quantiles of some data compare against the theoretical quantiles of a normal distribution. You too can create these plots!
|
||||||
|
|
||||||
## Binomial and factorial functions
|
Here's an example that visualizes the result of our simulation (labeled "sample") against a normal distribution with the same mean and standard deviation (labeled "theoretical"). Notice that to accommodate ggplot2 I have to turn `s.means` into a data frame first.
|
||||||
Choose, factorial
|
|
||||||
|
```{r}
|
||||||
|
s.means <- data.frame(s.means)
|
||||||
|
ggplot(s.means, aes(sample=s.means)) + geom_qq() + geom_qq_line(color="red")
|
||||||
|
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
And/or (finally) we could even standardize the values of `s.means` as z-scores using the `scale()` function:
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
s.z <- data.frame(scale(s.means)); names(s.z) <- "z"
|
||||||
|
ggplot(s.z, aes(sample=z)) + geom_qq() + geom_qq_line(color="red")
|
||||||
|
```
|
||||||
|
|
||||||
|
|||||||
339
r_lectures/w04-R_lecture.html
Normal file
339
r_lectures/w04-R_lecture.html
Normal file
File diff suppressed because one or more lines are too long
Reference in New Issue
Block a user