Statistics with Resampling - the Technique
By Robin Verhoef in R Statistics
June 2, 2020
In this post, I will try to explain the three base parts of every resampling experiment and explain the relevant code I use for those parts. I hope that after this, you will be able to work from this base and create your own experiments
Setup
The first part of any experiment is the setup. In this part, I ensure repeatability, set up our variables and also the place where we want to store our results.
First, I add a seed. The function set.seed()
sets a starting point for generating random numbers and makes sure that I get the same set of random numbers every time. Secondly, I specify how many times I want to repeat an experiment. This number also determines how many outputs I will get. If I am only interested in 1 outcome, I will create a vector which is filled with NA
values to reserve space and fill these along the way. If I was interested in multiple outcomes, I would make multiple vectors. Lastly, sometimes there will be a dataset from which you will be sampeling. In this case, I will check how often you get a pair in a hand of 5 cards. As data, I will use the numbers from 1 to 13 to represent the 13 different cards and repeat them 4 times for the 4 suits in a deck.
set.seed(20200609) # Set a starting point for all random numbers
reps <- 1000 # Number of times we want to run the experiment
results <- rep(NA, times = reps) # Repeat the value NA for reps times
data <- rep(1:13, times = 4) # Repeat the numbers 1-13 4 times
Experiment
This is where the fun part starts. Here, you get to think about what you actually want to measure and how you can explain that to a computer. This is the part where you can go crazy with experimentation and complexity. To start, I added a for
loop which will run reps
times. Then, I add the draw of the hand of cards using sample()
and store the result in hand
. To count the numbers in the hand and find if there is a pair, I will use the table()
function.
for (i in 1:reps) { # i will increase from 1 to reps in steps of 1
hand <- sample(data, size=5) # Draw 5 numbers from data
numbers <- table(hand) # Table contains a name-row with numbers and a row with how often those numbers appear
results[i] <- 0 # Default is no pair
for (count in numbers){ # If you loop over the table, you will only get the counts of numbers
if (count >= 2){ # If we have a card twice or more
results[i] <- 1
break # Stop the loop once a pair is found
}
}
}
Results
Once the experiment has run without any error, you can take a look at the results of the experiment. With a simple yes-no experiment like this, the results are not very exciting since making nice graphics out of this is not very sensible. With this experiment I think that a single number probability is enough of an answer. As you can see from the small bit of code, we get a clear answer. The mathematical answer would be that the chance of a pair or more is equal to \(1 - 52/52 * 48/51 * 44/50 * 40/49 * 36/48 = 1 - (52*48*44*40*36)/(52*51*50*49*48) = 1 - (44*40*36)/(51*50*49)\)
. This is 1 - the chance of getting 5 unique numbers.
number_of_pairs <- sum(results) # Since pairs are coded as a 1, the total number of pairs is equal to the sum of results
probability_of_pair <- number_of_pairs / reps # Percentage of experiments where a pair is drawn
print(probability_of_pair)
## [1] 0.515
maths <- 1 - (44*40*36)/(51*50*49)
print(maths)
## [1] 0.4929172
As you can see from the numbers, the simulated number does not differ a lot from the theoretical one and if you increase the amount of repetitions, it should get closer to the theoretical ideal. I find the code much easier to understand than the mathematical reasoning and I hope I was able to make resampling a little clearer for you.