Statistics with Resampling - the Basics

By Robin Verhoef in R Statistics

June 2, 2020

Introduction

This blogpost and others that might follow are all based on what I learned from the great book Resampling: The New Statistics by Julian L. Simon. His book taught me everything that my Stat 101 class should have but did not and in a way that has shaped my thoughts about statistics. I will try to link to relevant chapters along the way where the ideas are covered in more detail.

What is resampling?

The idea of resampling is the following: instead of using mathematical formulas to answer questions, we draw random samples from a distribution or sometimes from a sample of data in order to answer the question. While this method still relies on the mathematical foundations of statistal distribution and their randomness, you get to see the moving parts more than when using formulas. Other names for this kind of method are bootstrapping, which focuses on just taking data from samples, and Monte Carlo simulations, which use probability distributions.

Resampling from a distribution

An example of a question we can answer using a distribution is about trucks breaking down. What is the chance that, for a company with 20 trucks and a 10% chance of each of the trucks breaking down, 4 trucks are broken down at the same time? We can answer this problem using different levels of abstraction. We can draw for each truck individually, for all 20 trucks at once or for all repetitions of the experiments. I will show the middle approach to showcase the resampling approach.

library(ggplot2) # Used for nice plots

## Warning: package 'ggplot2' was built under R version 4.0.5

set.seed(03062020) # This makes the random numbers reproducible
repetitions <- 1000
trucks <- 20
breakingchance <- 0.1
result <- rep(NA, 1000)

# This is equal to rbinom(repetitions, trucks, breakingchance)
for (i in 1:1000) {
   # If a truck breaks down, this results in a 1 and the sum is the amount of broken trucks
  result[i] <- rbinom(n=1, size=trucks,prob=breakingchance)
}
ggplot(data = data.frame(brokentrucks=result), 
       mapping = aes(x=brokentrucks)) +
  geom_bar() +
  ggtitle('Most days have 1 to 3 broken trucks out of the 20') +
  labs(x='Number of broken trucks')
 
print(sum(result >= 4)/repetitions)

## [1] 0.137

print(1 - pbinom(3, trucks, breakingchance))

## [1] 0.1329533

Figure 1: Good truck or bad truck

As you can see in Figure 1, 4 broken down trucks happens at least 100/1000=10%, with more broken trucks being more unlikely. I also calculated the exact number of cases where four or more trucks are broken down and after that, I added the mathematical ideal by using the probability density function pbinom, which gives the probablity of a result of 0 - 3 happening with 20 draws and a 10% chance. By subtracting it from 1, I got the chance of 4 or more hits from 20 draws and a 10% chance. As you can see, resampling was not the most efficient way, but once the problems become more complicated, resampling is easier to expand upon. This problem is introduced by Simon in Chapter 1.

Resampling from a sample

While this is a very limited sample, I think the problem and the mechanics do help with explaining the concept (the choice could also be modeled as a draw from a bernoulli distribution with a 33.3% chance of success). The problem we want to solve is the Monty Hall, or 3 doors problem. In a game, a contestant has to choose between 3 doors, 1 of which contains a car and the other 2 contain a goat. After the choice, the gamemaster will open 1 of the two doors and show a goat. The contestant gets the choice to switch, but should they?

set.seed(03062020) # This makes the random numbers reproducible
repetitions <- 1000 # Amount of times the experiment will be repeated
switch <- rep(NA, repetitions) # Prepare a vector for storing the results
stay <- rep(NA, repetitions) 

for (i in 1:repetitions) { # i is the index of the experiment we are at
  choice <- sample(c('goat', 'goat', 'car'), 1) # Initial choice
  if (choice == 'goat'){ # Contestant choose a goat, so will switch to the car
    switch[i] <- 'Car'
    stay[i] <- 'Goat'
  }
  else { # Contestant must have chosen the car
    switch[i] <- 'Goat'
    stay[i] <- 'Car'
  }
}

df <- data.frame(Tactic = c(rep('Switch', repetitions), rep('Stay',repetitions)),
                 Outcome = c(switch, stay))

ggplot(df, aes(x=Tactic, fill=Outcome)) +
  geom_bar(position='dodge') +
  geom_text(aes(label=Outcome, y=..count../2), stat="count", position=position_dodge(1), vjust='center', hjust= 'center') +
  ggtitle('Switching doubles the chance of finding a car') +
  theme(legend.position = 'none')

Figure 2: Should I stay or should I go?

As you can see in Figure 2, if you want the car you should always switch. If you are not yet convinced, you can run trough it by hand with a partner and 3 marbles. While you can also find the answer to this question using standard statistics, I thing the visual difference and writing out all the options in the program really helps. You can find the explanation by Simon in Chapter 6 of his book.