Hypothesis Testing

Testing for statistical significance

Introduction

Let's say you flip a coin ten times and it comes up heads six out of ten times. Would you think this coin is biased? Probably not.

Now let's say you flip a coin a thousand times and it comes up heads six hundred times. You would probably think it is biased. But why?

In both cases the coin comes up heads 60% of the time. How can we quantify our intuition here? We will turn to the world of hypothesis testing to answer this question.

Null and Alternative Hypothesis

The null hypothesis is the statement we are trying to develop evidence against, and the alternative hypothesis is its complement.

Here our null hypothesis is that the coin is fair. The alternative hypothesis is that the coin is biased.

P-Value

One useful question to ask is: what is the chance of seeing an event this extreme or more extreme purely due to random chance, assuming the coin were unbiased?

The statistical term for this probability is the p-value.

If our p-value is small, it means it is unlikely that we would see six hundred heads out of one thousand flips if the coin were fair. So a small p-value pushes us toward concluding that the coin is indeed biased.
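
To connect this back to the two scenarios from the introduction, here is a minimal sketch using the Distributions.jl package (the function name coin_p_value is my own, not from the original) that computes the exact two-sided binomial p-value for each scenario. Six heads out of ten flips should produce a large p-value, while six hundred heads out of one thousand flips should produce a vanishingly small one.

using Distributions

# Two-sided p-value for seeing `heads` or a more extreme result out of
# `flips` tosses, under the null hypothesis that the coin is fair (p = 0.5)
function coin_p_value(heads, flips)
    @assert 2 * heads >= flips "This sketch assumes at least half the flips were heads"
    null_dist = Binomial(flips, 0.5)
    upper_tail = ccdf(null_dist, heads - 1)  # P(H >= heads)
    return min(2 * upper_tail, 1.0)          # double the tail for a two-sided test
end

println(coin_p_value(6, 10))      # large: 6 heads out of 10 is unremarkable
println(coin_p_value(600, 1000))  # tiny: 600 heads out of 1000 is extremely unlikely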

Level of Significance

But how small does the p-value need to be before we conclude the coin is biased? In statistics, this threshold is called the level of significance, denoted $\alpha$, and a typical value for $\alpha$ is 0.05. But what does $\alpha$ really mean?

It is the chance that we conclude the coin is biased when in reality it isn't. This kind of error is called a type I error.

So if $p < 0.05$, we will conclude that the coin is biased, and we accept that if the coin were actually fair there would be a 5% chance of wrongly reaching this conclusion.
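
As a sanity check on this interpretation of $\alpha$, here is a small simulation sketch (the function name type_i_error_rate and the trial count are my own choices) that repeatedly flips a fair coin one thousand times and records how often the same z-test used in the Code section below rejects the null hypothesis. The observed rejection rate should hover around 0.05.

using Distributions

# Estimate the type I error rate: how often a truly fair coin gets flagged as biased
function type_i_error_rate(; flips = 1000, trials = 100_000, alpha = 0.05)
    null_dist = Binomial(flips, 0.5)
    rejections = 0
    for _ in 1:trials
        heads = rand(null_dist)                        # simulate `flips` fair tosses
        z = (heads - 0.5 * flips) / sqrt(0.25 * flips)
        p = 2 * (1 - cdf(Normal(), abs(z)))
        rejections += p < alpha
    end
    return rejections / trials
end

println(type_i_error_rate())  # should land near 0.05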

Example

Let's dig into the math of computing p-values. By definition, the p-value is the chance that we see a result at least as extreme as six hundred heads out of a thousand flips, assuming the coin were unbiased. Since the coin could be biased in either direction, "at least as extreme" covers both six hundred or more heads and four hundred or fewer heads.

Let $X$ be a Bernoulli random variable. If $X=0$ then the coin landed on tails, and if $X=1$ the coin landed on heads. Since we are assuming the coin is unbiased, we can write the probability mass function, expected value, and variance of $X$ as follows:

\[\mathbb{P}[X=0] = 0.5\] \[\mathbb{P}[X=1] = 0.5\] \[\mathbb{E}[X] = 0.5\] \[\mathbb{V}[X] = 0.25\]
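
As a quick check, the Distributions.jl package (the same one used in the code at the end of this post) agrees with these numbers; this snippet is purely illustrative.

using Distributions

X = Bernoulli(0.5)   # fair coin: heads (1) and tails (0) each with probability 0.5
println(mean(X))     # expected value: 0.5
println(var(X))      # variance: 0.25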

The total number of heads is the sum over all one thousand flips, where each flip is an independent copy of $X$. We will define this new random variable as $H = \sum_{i=1}^{1000} X_i$, where $X_i$ is the outcome of the $i$-th flip. By linearity of expectation, and because the flips are independent so their variances add, we can write its expected value and variance as follows:

\[\mathbb{E}[H] = 500\] \[\mathbb{V}[H] = 250\]
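
Equivalently, $H$ counts the heads in one thousand independent fair flips, so it follows a Binomial(1000, 0.5) distribution. A short sketch confirming the mean and variance:

using Distributions

H = Binomial(1000, 0.5)  # number of heads in 1000 fair flips
println(mean(H))         # 500
println(var(H))          # 250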

Since $H$ is the sum of a large number of independent and identically distributed random variables, we can use the central limit theorem to conclude that, in addition to having the expected value and variance written above, $H$ is approximately normally distributed. We can then calculate the z-score of observing six hundred heads out of one thousand flips as follows:

\[z = \frac{x-\mu}{\sigma} = \frac{600 - \mathbb{E}[H]}{\sqrt{\mathbb{V}[H]}} \approx 6.32\]

To get the p-value from the z-score we use the standard Gaussian cumulative distribution function $\Phi$ as follows:

\[p = 2(1 - \Phi(z)) \approx 2.5 \times 10^{-10}\]
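
For comparison, here is a short sketch (my own addition, not from the original derivation) computing both the normal-approximation p-value from the z-score and the exact two-sided binomial p-value. The two should agree on the order of magnitude, which is all we need to compare against $\alpha$.

using Distributions

n, heads = 1000, 600

# Normal approximation via the z-score, as in the derivation above
z = (heads - 0.5 * n) / sqrt(0.25 * n)
p_normal = 2 * (1 - cdf(Normal(), z))

# Exact two-sided binomial p-value: P(H >= 600), doubled by symmetry
p_exact = 2 * ccdf(Binomial(n, 0.5), heads - 1)

println("normal approximation: ", p_normal)
println("exact binomial:       ", p_exact)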

This p-value is far less than 0.05, so we reject the null hypothesis and conclude that the coin is biased.

Code

Below is some Julia code that conducts a hypothesis test for the coin flip example above. The input parameters are the number of heads and the total number of coin flips. The source file can be found here. I encourage you to play around with the number of heads and the total number of flips to get an intuition for what is statistically significant and what isn't.

using Distributions

# Input Parameters
num_heads = 60
num_flips = 100
@assert num_heads <= num_flips "Number of heads must be less than or equal to the number of flips"

alpha = 0.05 # Significance level

# Mean and variance of number of heads if the coin were unbiased
mu = 0.5 * num_flips
var = 0.25 * num_flips

# Z-Score of our observation of num_heads
z = (num_heads - mu) / sqrt(var)

# Compute p-value
p = 2 * (1 - cdf(Normal(), abs(z)))

println("p-value: ", p)
if p < alpha
    println("p < $alpha so we can reject the null hypothesis and conclude the coin is biased")
else
    println("p >= $alpha so we cannot reject the null hypothesis and we cannot conclude that the coin is biased")
end
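
With the default parameters (60 heads out of 100 flips), the z-score works out to 2 and the p-value to roughly 0.0455, just under the 0.05 threshold, so the script reports that we can reject the null hypothesis. Changing the inputs to 6 heads out of 10 flips gives a p-value of roughly 0.53, nowhere near significant, which matches the intuition from the introduction.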