Testing for statistical significance

Lets say you flip a coin ten times and it comes up heads six out of ten times. Would you think this coin is biased? Probably not.

Now lets say you flip a coin a thousand times and it comes up heads six hundred times. You would probably think it is biased. But why?

In both cases the coin comes up heads 60% of the time. How can we quantify our intuition here? We will turn to the world of hypothesis testing to answer this question.

The null hypothesis is the statement we are trying develop evidence against and the alternative hypothesis is its complement.

Here our null hypothesis is that the coin is fair. The alternative hypothesis is that the coin is biased.

One useful question for us to ask is what is the chance that we see an event this or more extreme due to random chance assuming the coin where unbiased?

The statistical term for this is p-value.

If our p-value is small it means that it is unlikely that we see six hundred heads out of one thousand flips coming up heads if the coin were fair. So a smaller p-value would make us conclude that the coin is indeed biased.

But how small of a p-value is small enough for us to conclude the coin is biased? In statistics, this value is called $\alpha$ and a typical value for $\alpha$ is 0.05. But what does this $\alpha$ really mean?

It is the chance that we conclude that the coin is biased when in reality it isn’t. This kind of error is called type I error.

So if $p < 0.05$ then we will conclude that the coin is biased, and we know there is a 5% chance that we conclude the coin is biased when in reality it is not.

Let’s dig into the math of computing p-values. By definition, the p-value is the chance that we see six hundred or more heads out of a thousand flips assuming the coin where unbiased.

Let $X$ be a Bernoulli random variable. If $X=0$ then the coin landed on tails, and if $X=1$ the coin landed on heads. Since we are assuming the coin is unbiased we can write the probability mass function, expected value, and variance of $X$ as follows

\[\mathbb{P}[X=0] = 0.5\] \[\mathbb{P}[X=1] = 0.5\] \[\mathbb{E}[X] = 0.5\] \[\mathbb{V}[X] = 0.25\]The number of heads that we get is the sum of X from 1 to a thousand. We will define this new random variable as $H$. We can write its expected value and variance as follows

\[\mathbb{E}[H] = 500\] \[\mathbb{V}[H] = 250\]Since H is the sum of a large number of independent and identically distrubuted random variables, we can use the central limit theorem to conclude that in addition to having the expected value and variance written above we know that $H$ is normally distributed. We can then calculate the z-score of flipping 6 million heads as follows:

\[z = \frac{x-\mu}{\sigma} = \frac{600 - \mathbb{E}[H]}{\sqrt{\mathbb{V}[H]}} \approx 6.32\]To get the p-value from the z-score we use the standard gaussian cumulative density function as follows:

\[p = 2(1 - \phi(z)) \approx 2 \times 10^{-10}\]This p-value is less than 0.05 so we can conclude that the coin is biased.

Below is some Julia code that conducts a hypothesis test for a coin flip example. The input parameters are the number of heads and the number of coin flips. The source file can be found here. I encourage you to play around with the number of heads and total number of flips to get an intuition for what is statistically significant and what isn’t.