This is Carl Friedrich Gauß, one of the most friendly-looking German mathematicians, on a banknote. Nowadays Germany is part of the European Union and uses the (by the way, much more secure) EUR banknotes (a different story). He earned that honor through his widely used contributions to algebra, geometry and astronomy. If you have a close look at the banknote you will even spot the "Gaussian bell curve". This is what I want to talk about in this post.
We saw in this post about the null hypothesis and p-values that in order to accept an alternative hypothesis, we have to find a way to reject a null hypothesis. I mentioned the $p$-value, but I did not yet explain how to calculate it.
For every value in the data set we get its corresponding z-score by subtracting the average and dividing by the standard deviation.
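Written as a formula, the z-score of a value $x$ in a data set with average $\bar{x}$ and standard deviation $s$ is

$$z = \frac{x - \bar{x}}{s}$$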
Instead of the original data set we can then work with the set of standardized values. You might wonder why we should do this: we know that under the null hypothesis the standardized values have an average of $0$ and a standard deviation of $1$ (given that the data set holds enough data; for fewer than 30 values a slightly different assumption, the $t$-distribution, is used instead)! This means that we would expect the standardized values to follow the standard normal distribution (see picture below). In particular, around $68\%$ of the values should lie between $-1$ and $1$, around $95\%$ of the values between $-2$ and $2$, and only very few values should lie outside $[-3, 3]$:
Actually the bell curve does not end at $-4$ or $4$, but the values approach zero very quickly.
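Standardizing a data set is a one-liner in R. Here is a minimal sketch (the data values are made up for illustration); `scale()` does the same subtraction and division we described above:

```r
# Hypothetical example data (values made up for illustration)
x <- c(4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.2, 4.7, 5.4, 5.5)

# Standardize: subtract the average, divide by the standard deviation
z <- (x - mean(x)) / sd(x)   # equivalent to as.vector(scale(x))

mean(z)  # approximately 0 (up to floating-point error)
sd(z)    # exactly 1
```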
The actual formula of the normal distribution with mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$
Here is the R code:
x <- seq(-4, 4, length = 10000)
y <- dnorm(x, 0, 1)
plot(x, y, type = "n", xlab = "x", ylab = "y", main = "Bell Curve", axes = TRUE)
lines(x, y, col = "red")
To calculate the $p$-value in R, we can use the command
pnorm(...)
What we get out of it:
For every value $x$, the $y$ value of the curve gives the density, i.e. how frequently values near $x$ occur in a normally distributed data set. We can now check the probability of a result at least as extreme as our (standardized) observation. We have to choose between a one-tailed and a two-tailed test, meaning we put our decision boundary on one side or on both sides (split up) of the bell curve (details in a different post). If this probability, the $p$-value, is lower than the significance level we specified, we reject the null hypothesis.
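As a sketch of the whole procedure, assume a made-up standardized observation of $z = 2.3$ and a significance level of $0.05$; `pnorm` gives the probability of a value below $z$, so the tail probabilities look like this:

```r
# Hypothetical example: standardized observation and significance level
z     <- 2.3
alpha <- 0.05

# One-tailed p-value: probability of a value at least as extreme as z
p_one <- 1 - pnorm(z)        # equivalently pnorm(z, lower.tail = FALSE)

# Two-tailed p-value: extreme values on both sides of the bell curve
p_two <- 2 * (1 - pnorm(abs(z)))

p_one < alpha  # TRUE here, so we would reject the null hypothesis
```

With these made-up numbers the one-tailed $p$-value is about $0.011$, well below $0.05$.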