Correlation



What is a correlation?
A correlation is a kind of relationship between two variables. It is expressed as a correlation coefficient between -1 and 1 which measures to which degree the variables have a linear relationship with each other. The higher the absolute value of the correlation coefficient, the stronger the relationship.


What types of correlation exist?
- A value of 0 means that there is no (linear) relationship between the variables (a so-called zero order correlation)
- A positive value means a positive relationship, that is, one variable grows as the other one grows
- A negative value means a negative relationship, that is, one variable falls as the other one grows.
- There is even a standard published by the Political Science Department at Quinnipiac University defining the following terms for the absolute values of correlation coefficients:

Value of the correlation coefficient (absolute)
0.00:        no relationship
0.01 - 0.19: none or negligible
0.20 - 0.29: weak
0.30 - 0.39: moderate
0.40 - 0.69: strong
0.70 - 1.00: very strong


Examples:
- Aristotle: "The more you know, the more you know you don't know" (a positive relationship between what you know and what you know you don't know).
- Walt Disney: "The more you like yourself, the less you are like anyone else, which makes you unique." (a negative relationship between how much you like yourself and how much you are similar to others (don't overdo it))
- If I wish you all the best, what is left for me? (There should be no relation between what I wish you and what is left for me (I really hope so!))
- Further examples can be found in this nice graphic by DenisBoigelot.


How is the correlation coefficient calculated?
There are different definitions of correlation. The most famous correlation coefficient is the so-called Pearson product-moment correlation coefficient, or simply Pearson's correlation coefficient. Assume we have two variables $X$ and $Y$. The correlation coefficient $\rho_{X,Y}$ or $corr(X,Y)$ is calculated by $$\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$ where $\sigma_X$ and $\sigma_Y$ are the standard deviations, $\mu_X$ and $\mu_Y$ are the means of the variables, $E$ stands for the expected value and $cov$ stands for the covariance.
If $X$ and $Y$ consist of indexed samples $x_i$ and $y_i$ for $i = 1..n$, we can rewrite the formula as $$\rho_{X,Y} = corr(X,Y) = \frac{n\sum{x_iy_i} - \sum{x_i}\sum{y_i}}{\sqrt{n\sum{x_i^2}-(\sum{x_i})^2}\sqrt{n\sum{y_i^2}-(\sum{y_i})^2}}$$

You see immediately that the correlation coefficient is symmetric, which is nice; however, it also reveals an important limitation: you cannot conclude causation from correlation! Your water consumption has a strong correlation with the outside temperature, but on a snowy day you could drink as much as you want, you probably would not raise the outside temperature (if you can, please contact me in winter).


Example:
Assume we have the following example:
> X <- c(1, 3, 4, 7, 8, 23)
> Y <- c(3, 7, 8, 13, 24, 60)

To calculate the correlation in R we use the command
> cor(X,Y)

and get the result $cor(X,Y) = 0.991259$ (the optional parameter "method" defaults to "pearson"; you can also choose "spearman" or "kendall").
To calculate it by hand we would first compute the products $x_iy_i$ and the squares $x_i^2$ and $y_i^2$ and then use the formula mentioned above (verify it with the sketch below!).
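A minimal sketch of this hand calculation in R, reusing the vectors X and Y from above:
> n <- length(X)
> num <- n*sum(X*Y) - sum(X)*sum(Y)
> den <- sqrt(n*sum(X^2) - sum(X)^2) * sqrt(n*sum(Y^2) - sum(Y)^2)
> num/den    # matches cor(X, Y) = 0.991259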

The result of the example states a very strong linear relationship between $X$ and $Y$; we see this in the diagram (including the linear regression line y = -1.191 + 2.655x in red):

In a perfect relationship with correlation coefficient 1 all the data points would lie on a straight line.

What else is good to know?
- A correlation coefficient cannot tell you whether the correlation is significantly different from $0$ (e.g. to reject a hypothesis negating any relationship between the variables). For that you need a test of significance (in R this is the command cor.test(..); see the sketch after this list).
- There are of course other methods to determine correlation, especially for non-linear relationships.
- Partial correlation measures the correlation between two quantitative variables while controlling for one or more further quantitative variables.
- The correlation coefficient is not very robust; a single outlier can change its value considerably.
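A minimal sketch of such a significance test, reusing the vectors X and Y from the example above:
> cor.test(X, Y)    # reports Pearson's r, a p-value and a confidence interval
The p-value tells you how plausible the observed correlation would be if the true correlation were 0.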


Random Forests



As mentioned in the post about decision trees, the big challenge to face is overfitting. To address this issue, a concept developed by Tin Kam Ho was introduced and later named and implemented by Leo Breiman. In this approach, distance from the training data is created (no need for a separate test or cross-validation dataset) by considering randomly chosen subsets, so-called bootstraps. The remaining part, the so-called out-of-bag data (e.g. about one third of the training data), is used to validate the classification. In addition, for the construction of each tree only a randomly chosen subset of the splitting features is taken into account (e.g. one third of the features for regression problems and the square root of the number of features for classification).
The final result of a request is then the aggregated result of all the decision trees. So the approach makes use of the fact that cumulated decisions of a group in general yield better results than individual decisions. The name "random forest" is thereby a nice and intuitive wordplay.

As the different trees are independent of each other, the evaluation of a random forest can be parallelized. It also reduces the high variance that is often created by a single decision tree. For further information about the advantages and disadvantages of a random forest I refer to Leo Breiman's site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

R has its own package "randomForest" for random forests.
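A minimal sketch using this package on R's built-in iris data set (chosen here purely as an example; any labelled data frame works):
> library(randomForest)
> set.seed(42)                                        # for reproducible bootstraps
> rf <- randomForest(Species ~ ., data = iris, ntree = 500)
> print(rf)                                           # shows the out-of-bag (OOB) error estimate
> predict(rf, iris[1:3, ])                            # classify new (here: known) datasets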

R - Simple Data Preparation

 
R covers a huge variety of functions for data manipulation, reorganization and separation. Here are some of the commands I consider the most useful when preparing data.

I will start with an example:
Imagine that we have an IoT scenario in which sensor data from the temperature sensors s1, s2 and s3 are collected in an edge unit before a reduced set of data is sent to the central unit. Each sensor sends data after a specific amount of time t in (milli)seconds.

s1:
t = 0: temp = 24.3
t = 1: temp = 24.7
t = 2: temp = 25.2
t = 3: temp = 25.0

s2:
t = 0: temp = 20.2
t = 1: temp = 20.1
t = 2: temp = 99.9
t = 3: temp = 20.1

s3:
t = 0: temp = 28.1
t = 1: temp = 28.0
t = 2: temp =
t = 3: temp = 27.7

To get the data into R we create vectors holding the values for the sensors. We see that the value at t = 2 for sensor s3 is missing, therefore we add "NA" here in order to tell R that we do not have the value:
> s1 <- c(24.3, 24.7, 25.2, 25.0)
> s2 <- c(20.2, 20.1, 99.9, 20.1)
> s3 <- c(28.1, 28.0, NA, 27.7)

We can furthermore create a dataframe holding these measurements by
> sensor.data <- data.frame(s1, s2, s3)


What we get is the following output for sensor.data:
   s1   s2   s3
1 24.3 20.2 28.1
2 24.7 20.1 28.0
3 25.2 99.9   NA
4 25.0 20.1 27.7
 
We do not like the row identifier that R sets up by default; we have our own identifier using the values for t. We just have to tell R that it should use these values, which we do with the argument row.names:
> t <- c(0, 1, 2, 3)
> sensor.data <- data.frame(t, s1, s2, s3, row.names=t)

  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2 99.9   NA
3 3 25.0 20.1 27.7

First of all we want to see the data in an easy diagram:
> plot(t, sensor.data$s1, type="b", pch=1,
     col="red", xlim=c(0, 5), ylim=c(18, 100), 
     main="Sensor Data", lwd=2, xlab="time", ylab="temperature")
> lines(t, sensor.data$s2, type="b", pch=5, lwd=2, col="green")
> lines(t, sensor.data$s3, type="b", pch=7, lwd=2, col="blue")

We see that the value 99.9 seems to be a wrong measurement (we assume this here in order to see how to manipulate the data; in practice an analysis is required to check the background of this outlying value):

> sensor.data[, 2:4][sensor.data[, 2:4] == 99.9] <- NA
 
  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2   NA   NA
3 3 25.0 20.1 27.7
 
The plot cannot handle NA values, therefore the line for s3 appears incomplete. We can identify NA values using the command is.na(v) for a vector v.
We want to replace the missing values for sensors s2 and s3 by the means of the neighbouring values (this is also an assumption that should be validated carefully):

> sensor.data$s2[3] <- (sensor.data$s2[2] + sensor.data$s2[4])/2
> sensor.data$s3[3] <- (sensor.data$s3[2] + sensor.data$s3[4])/2
 
  t   s1   s2    s3
0 0 24.3 20.2 28.10
1 1 24.7 20.1 28.00
2 2 25.2 20.1 27.85
3 3 25.0 20.1 27.70
 
Now we decide to extend our data frame to also hold the sum and the mean of the sensor values at every point in time. We do that with the command:

> sensor.data.xt <- transform(sensor.data, sumx = s1 + s2 + s3, meanx = (s1 + s2 + s3)/3)
 
  t   s1   s2    s3  sumx    meanx
0 0 24.3 20.2 28.10 72.60 24.20000
1 1 24.7 20.1 28.00 72.80 24.26667
2 2 25.2 20.1 27.85 73.15 24.38333
3 3 25.0 20.1 27.70 72.80 24.26667
  
Next we decide to classify a record as critical if the sum is at least 73.0, as abnormal if it is at least 72.7 (but below 73.0), and as normal otherwise. So we create another variable in our data frame:

> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 73] <- "critical"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 72.7 & sensor.data.xt$sumx < 73] <- "abnormal"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx < 72.7] <- "normal"

  t   s1   s2    s3  sumx    meanx riskcatg
0 0 24.3 20.2 28.10 72.60 24.20000   normal
1 1 24.7 20.1 28.00 72.80 24.26667 abnormal
2 2 25.2 20.1 27.85 73.15 24.38333 critical
3 3 25.0 20.1 27.70 72.80 24.26667 abnormal


Instead of using strings we turn riskcatg into a categorical variable (a factor) by

> sensor.data.xt$riskcatg <- factor(sensor.data.xt$riskcatg)



TIP: The statement
variable[condition] <- expression
is very powerful and useful for data manipulation.
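A tiny sketch of this pattern on an assumed vector, capping negative readings at zero:
> v <- c(3, -1, 5, -2)
> v[v < 0] <- 0
> v
[1] 3 0 5 0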

R - Packages and DataTypes



Packages
Functionality is maintained in packages. Some packages are part of the basic functionality and come predefined when you install R, others have to be installed (and loaded) before use.
To find out the directories of your packages, enter the command .libPaths(); to list the already loaded packages use search(). The commands installed.packages(), install.packages("..") and update.packages() are self-explanatory. To load a package use library(..) or require(..).
To show the content of a package use the command help(package="..").
In RStudio this is easy, as you see the packages in a separate pane.


Data Types
  • Scalar A scalar is a single-value vector (numeric, logical or character value). Example: s <- 3
  • Vector A vector is a collection of scalars of the same type; to combine them use c(..). Example: v <- c(1, 2, s, 4, 5, 6, 7, 8)  (s from above)
  • Matrix A matrix is a collection of vectors; all elements have the same type. Example: m <- matrix(v, nrow=2, ncol=4, byrow = TRUE)  (v from above; note that byrow determines whether the values of v are filled in by row or by column (the default))
  • Array An array is a collection of matrices; all elements have the same type. Example: a <- array(v, c(2,2,2))
  • DataFrame A data frame is a matrix-like structure that can hold elements of different types. As your data will usually be a mix of different types, this is the most used data type in R. Example: df <- data.frame(column1, column2, ...) where the columns are vectors of the same length that can be of different types. Set the column names with names(df) <- c("x", "y")
  • Factor A factor is a nominal (categorical) or ordinal (ordered categorical) variable. Example: Yes/No are categorical values, Small/Medium/Large/XLarge are ordinal values
  • List A list is an arbitrary collection of other data types. Example: l <- list(s, v, m, a, df)  (variables from above). To name the elements use list("scalar"=s, ...). To access an element use double brackets with an index [[1]] or a name [["scalar"]]
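Putting the examples from this list together into one runnable snippet (the data-frame columns and factor levels are assumed example values):
> s  <- 3                                             # scalar (a vector of length 1)
> v  <- c(1, 2, s, 4, 5, 6, 7, 8)                     # vector
> m  <- matrix(v, nrow = 2, ncol = 4, byrow = TRUE)   # matrix, filled by row
> a  <- array(v, c(2, 2, 2))                          # array
> df <- data.frame(x = c(1, 2), y = c("a", "b"))      # data frame with mixed types
> f  <- factor(c("Small", "Large", "Medium"),
               levels = c("Small", "Medium", "Large"), ordered = TRUE)   # ordinal factor
> l  <- list(scalar = s, vector = v, matrix = m)      # list; access via l[["scalar"]]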

Null-Hypothesis, Z-Scores and Normal Distribution


This is Carl Friedrich Gauß, one of the most friendly looking German mathematicians, on a banknote. Nowadays Germany forms part of the European Union and uses the (by the way much more secure) euro banknotes (a different story). He earned that honor because of his widely used contributions to algebra, geometry and astronomy. If you have a close look at the banknote you will even spot the "Gaussian bell curve". This is what I want to talk about in this post.
We saw in this post about the null hypothesis and p-values that in order to accept an alternative hypothesis, we have to find a way to reject a null hypothesis. I just mentioned the $p$-value, however I did not yet explain how to calculate it.

For every value $x$ in the data set we get its corresponding z-score by subtracting the average and dividing by the standard deviation: $$z = \frac{x - \bar{x}}{s}$$ where $\bar{x}$ is the average and $s$ the standard deviation of the data set.

Instead of the original data set we can then work with the set of standardized values. You might wonder why we should do this: under the null hypothesis the standardized values have an average of $0$ and a standard deviation of $1$ (given that the data set holds enough data; for fewer than 30 values a slightly different assumption is used)! This means that we would expect the observed values to follow the normal distribution (see picture below). In particular, around $68\%$ of the values should lie between $-1$ and $1$, around $95\%$ of the values between $-2$ and $2$, and only very few values should lie outside $-3$ and $3$:
Actually the bell curve does not end at $-4$ or $4$, but the values get very close to zero very fast.
The actual formula of the normal distribution for mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Here is the R code:
x <- seq(-4, 4, length = 10000)
y <- dnorm(x, 0, 1)
plot(x, y, type="n", xlab = "x", ylab = "y", main = "Bell Curve", axes = TRUE)
lines(x, y, col="red")

To calculate the $p$-value in R, we can use the command
pnorm(...)

What we get out of it:
For every value $x$, the $y$ value gives the density, i.e. how frequently this value occurs in a normally distributed data set. We can now check the probability of a result at least as extreme as our (standardized) observation. We have to choose between a one-tailed and a two-tailed test, meaning we put our decision boundary on one side or on both sides (split up) of the bell curve (details in a different post). If this probability is lower than the significance level we specified, we decide to reject the null hypothesis.
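A minimal sketch of this calculation, assuming an observed z-score of 2.1 and a two-tailed test:
> z <- 2.1
> p <- 2 * pnorm(-abs(z))   # two-tailed p-value under the standard normal distribution
> p                         # ~0.036, below alpha = 0.05, so we would reject the null hypothesis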

Null Hypothesis, P-Value and Significance Level




What is the null hypothesis?
That's easy: it is the assumption that an observation is simply due to chance. The contrary assumption, that an observation is NOT due to chance, is called the alternative hypothesis.

What is it used for?
It is used to formulate a data science problem in its contrary form: Could it be that there is something beyond chance in my observation? Can I build up confidence that there is "something special" in the data (the alternative hypothesis)?
To answer the question, we use a classical mathematical trick: we assume that the null hypothesis is true, so we work on purely random observations. Knowing the rules of chance, we can then calculate the probability of a result at least as extreme as the observed value, the so-called $p$-value, given that the null hypothesis is true. If this probability $p$ is high, then it is not a good idea to reject the null hypothesis. On the other hand, if $p$ is low, then we would say that the assumption is not plausible and we have found a way to reject the null hypothesis - we have gained confidence that the alternative hypothesis is valid.

When do we reject the null hypothesis?
If the $p$-value falls below a critical level, the so-called significance level $\alpha$. Usual values are $\alpha = 0.05$ or $\alpha = 0.025$. In this case the probability is too low to retain the null hypothesis; the deviation from it is statistically significant.
Assume that the probability of the observation under the null hypothesis is only around 1%; in this case we are pretty confident (99%) that we should reject the null hypothesis. Therefore the value $q = 1 - p$ (here 99%) is also called the confidence.

If you are interested in how to calculate the $p$-value using the $z$-values, check out this post about the z-scores.



R - How to Start

 

Welcome to a new challenge - an empty green meadow is waiting to be explored and developed!

R is a pretty strange programming language; it does not follow the usual conventions and is not intuitive, at least not in the beginning. I am familiar with quite a few programming languages, however R is nothing like them (in which other programming language do you prefer the assignment operator "<-" over "="?). However, as R is currently considered the reference program for statistics (even more popular than SPSS) and in addition is open source, it is worth looking at it.

Here are some tips if you think about starting to learn R:

- Install R first and play around with the console. You will find out that it is not very handy. E.g. R does not require line delimiters (";", ".") after statements, R only permits a maximum line length of 80 characters, ...
- After you have found out that R is quite strange, get RStudio (e.g. here). RStudio is really helpful: it provides shortcuts for the most used commands, a simple structure and a pretty nice user interface.
- Make yourself familiar with Google's R Style Guide, so you do not make a fool of yourself chatting with experts. You will also learn a lot about R's specialities (so far I considered "." a bad choice for a character in identifiers as you expect it to point to a sub-attribute, however in R it is accepted and "_" is the bad choice...)
- Remember the command "rm(list = ls())", which clears the current workspace (yes, there is a current workspace in which locally created variables live!)
- Look for free online courses (there are tons of them)
- Get familiar with the shortcuts "ALT + -" and "Ctrl + L"
- Find information on available packages on the CRAN sites; have a look especially at the crantastic page that allows you to search for (popular) packages
- Use the predefined data sets in R (see "data()" for an overview); they are often used in examples and it feels good to already know them
- Use command "View(..)" regularily on your data to get a clean picture of it
- When you use "require("packageName")", remember to use "detach("package:packageName", unload = TRUE)" at the end (use those commands together in RStudio)
- Use the command demo() to get an overview of the demos included in R. To execute a demo, use the same command with one of the given arguments (e.g. demo(colors)).
- Use fix() on a data frame to correct values manually
- Use transform() and the powerful (s)apply() on your data frames

This post is updated regularly.
 



Lift

 
As its name indicates, the lift is a measure of how much your binary classification model lifts the predictions. In other words, it measures how much better the new model's choice is compared to an old model's choice or a random selection. A lift plot is an alternative to a ROC curve when you want to compare two classifiers; it provides insights into the model and can help to determine a cutoff.

Let's start with the definition of lift:
The lift measures the change in concentration of a target value when the model is applied to select a subgroup of the test set. Note that the lift is always relative to the concentration of the target value in the whole test set: the lower that concentration, the higher the possible lift of the model. Therefore there are no general restrictions on the values of the lift.

In an example:
Imagine that you want to improve a marketing campaign which addresses customers in order to sell new products or services. From the past campaign you build a predictive model trying to identify the customers who are likely to respond. In the past campaign 4% of the addressed customers responded. If you now choose a random subset of 10% of the addressed customers, you would expect 4% responders in there.
With your new model you can identify probable responders, so you choose the 10% of customers most likely to respond to the campaign. If 16% of these customers respond, then your classifier has a lift of 16 / 4 = 4 at this point.
Now you can calculate the expected responses when addressing 20%, 30%, ... and you can plot the data in a so-called lift chart, which could look like this (the yellow line corresponds to the random classifier, the red one to the fictitious model):



These lift charts can be built for different classifiers in order to compare them. Also, from the shape of the curve a sensible maximum number of addressed customers can be chosen in order to optimize the costs of the campaigns.
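A minimal sketch of how such a lift chart can be computed from predicted scores (the data below is simulated purely for illustration, with roughly 4% responders overall):
> set.seed(1)
> score    <- runif(1000)                             # predicted response probabilities
> response <- rbinom(1000, 1, prob = score * 0.08)    # actual responses, ~4% overall
> ord      <- order(score, decreasing = TRUE)         # address the most likely customers first
> frac     <- seq(0.1, 1, by = 0.1)                   # top 10%, 20%, ...
> lift     <- sapply(frac, function(f) {
    top <- ord[1:floor(f * length(ord))]
    mean(response[top]) / mean(response)              # concentration in subgroup vs. overall
  })
> plot(frac, lift, type = "b", xlab = "fraction of customers addressed", ylab = "lift")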





The ROC curve


 The ROC curve (receiver operating characteristic curve) is a graphical illustration that can be used to visualize and compare the quality of binary classifiers.

Recall that for a classification experiment we can build the confusion matrix. In it we see the values of the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). For the ROC curve we need the TP rate on the y-axis and the FP rate on the x-axis. As the TP rate is also called sensitivity and the FP rate equals 1-specificity, the ROC curve is also called the sensitivity vs. (1-specificity) plot.

We define the ROC space to be the unit square spanned by the FP rate (x-axis) and the TP rate (y-axis):
Now run your classifier, calculate sensitivity and specificity, and place the data point corresponding to your test into the ROC space. The closer it is to the upper left corner, the better your classifier is. If you have a 100% correctly predicting classifier - congratulations, you will find your classifier directly at (0,1).


Now to create the ROC curve, begin with the examples that are easiest to classify as positive, then stepwise include more examples. Every time the classifier correctly classifies a dataset as positive, the line goes up; every time the classifier incorrectly classifies a dataset as positive, the curve moves to the right (for small datasets you will actually not get a smooth curve but a zigzag line).

What can you read from the ROC curve?
 
A random classifier would create equal proportions of true positives and false positives (independent of the number of actual positives/negatives). So a random classifier would yield the yellow line.
The perfect classifier would not create any false positives, so we get sensitivity 1 already at a false positive rate of 0 (specificity 1); this would be the green line in the ROC space.
A non-trivial classifier would lie somewhere in between (red line). Note that a bad classifier would lie below the random line and could be turned into a better classifier by simply inverting its predictions.

What else can be said about the ROC curve:
- Unlike the cumulative gains chart, the ROC curve does not depend on the density of the positives in the test dataset.
- Sometimes the area under the ROC curve ("Area Under the Curve", AUC) is calculated in order to give a classifier a single comparable number. A bad classifier has an AUC close to 0.5 (random classifier), while a good classifier has a value close to 1.
Note that this approach to comparing classifiers has been questioned in machine learning research, among other reasons because the AUC seems to be a quite noisy measure.
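A minimal sketch of how the points of an ROC curve can be computed from scores (the scores and labels below are assumed example data):
> scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)   # predicted probabilities
> labels <- c(1, 1, 0, 1, 0, 1, 0, 0)                    # actual classes
> ord <- order(scores, decreasing = TRUE)                # start with the "easiest" positives
> tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)     # sensitivity
> fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)     # 1 - specificity
> plot(c(0, fpr), c(0, tpr), type = "b",
       xlab = "1 - specificity", ylab = "sensitivity", main = "ROC curve")
> abline(0, 1, col = "yellow")                           # the random classifier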

Classical Decomposition of a time series

A time series is a data set that has a time component. Yes, it is just what you think: in the optimal case you have one value per fixed time interval. It is a pretty current topic to create massive datasets from sensors in an Internet of Things scenario, and you usually get too many values (>100 measurements per second), so creating a useful time series requires some preprocessing (filtering or averaging). Missing values can be a problem, although it is usually not difficult to find a good estimate.

You can therefore build a time line which could look like this:


Here I used R's default dataset "AirPassengers" which reflects the monthly international airline passenger numbers in the years 1949 to 1960.

Natural questions are: 
  • Can you see which of the years were good, which of them were bad?
  • Can you split up the time series into more homogeneous components?
  • Can you predict the values for future months? And how good are these predictions?
Time series analysis does not seem too complex, however in reality combinations of models are often required to make good predictions. In this post I want to show you how the standard approaches work. In reality, especially in retail, the timelines are usually not that easy to handle, as they are widely influenced by customer reviews.

The central idea is that a time series $Y = (Y_t)$ is a combination of three independent sub-time-series:
- A trend component T: a long-term tendency in the data; it does not have to be linear.
- A seasonal component S: a pattern that recurs regularly after a fixed period (like every summer, every January or every day at 10:30).
- A random component I, also called irregular or noise.

We want to find these three time series in the example mentioned above. First we have to decide on the type of decomposition; we can choose between additive and multiplicative.
In an additive model we add up the 3 sub time series to get the original time series: $$Y_t = T_t + S_t + I_t$$ You should use it when the seasonal variance does not change much.

In a multiplicative model we multiply the 3 sub time series: $$Y_t= T_t * S_t * I_t$$ Use it when you see the peaks growing with time, as in the earlier mentioned example of airplane passengers. Here we should go for a multiplicative model.



Tip: A multiplicative model can often be changed into an additive model using the log function.
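A minimal sketch of this trick with the AirPassengers series: the logarithm of a multiplicative series behaves additively.
> plot(decompose(log(AirPassengers), type = "additive"))   # the seasonal swings are now roughly constant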

How would we get the values for the trend, seasonal and random sub time series? We will go step by step; as motivation, here is the result calculated by R with the function decompose:


Here is the corresponding code in R:
> plot(decompose(ts(AirPassengers, frequency = 12, start = 1949), type = "mult"))

To go on:

1. Here is how to determine the trend component
2. Here is how to determine the seasonal and random component
3. Here is a summary on the classical decomposition of time series

Seasonality and Random Determination

In this post we saw how a time series decomposes into the three components trend, seasonality and random. How to extract the trend was shown there; now we focus on how the seasonal component and the random component are determined.
Assume we have a detrended time series (we take the AirPassengers time series and remove the trend). We assume a seasonality with a fixed period. In reality the assumption of a fixed seasonality is often too strict, as the period could shorten or change its structure over time. But under this assumption the determination of the seasonality is easy: to get the seasonal value for January, we take all January values and average them. This gives the pattern we use for all periods.
The last step is to determine the random component $I$; we get it by simply removing the trend $T$ and the seasonal component $S$ from the original time series $Y$: in an additive model this is $I_t = Y_t - T_t - S_t$ and in a multiplicative model $I_t = Y_t / (T_t * S_t)$.
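A minimal sketch of these steps for the (multiplicative) AirPassengers example; note that decompose() additionally normalizes the seasonal indices, which is omitted here:
> library(forecast)                                       # for ma()
> trend     <- ma(AirPassengers, order = 12, centre = TRUE)
> detrended <- AirPassengers / trend                      # remove the trend
> seasonal  <- tapply(detrended, cycle(AirPassengers), mean, na.rm = TRUE)    # average per month
> random    <- AirPassengers / (trend * seasonal[cycle(AirPassengers)])       # what remains is the noise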
In our example, this is how the random component looks:
What can we get out of it?
The random component shows the noise in the data, the values that do not fit the model. It helps to get a feeling for how well the data is explained by the assumption of having a trend and a seasonality. The classical decomposition can also help to find outliers, which show up as high peaks.

For completeness here again the whole picture holding all the steps discussed:



 

Trend determination with moving averages

We already saw in the previous post that you can decompose a trend from a time series. Here is a classical approach to determine an underlying trend.
 
A trend $T$ is actually a smoothed version of a time series; it helps to capture global tendencies. To get a trend line, here is what you have to do:
  •  Determine the seasonal periodicity of the time series (if there is one). These periodic patterns are usually visible, but if you cannot see them from the plotted chart, there are also methods using Fourier transform algorithms to determine them. In our example we see a yearly periodicity; as the values for the airplane passengers come in monthly, the period is m=12.
  • With this number m, use methods like the moving average of order m to determine the values of the trend

Now what is the moving average?
Once you understand the concept, it is easy to remember: imagine that your dataset consists of 5 values $y_1$, ..., $y_5$. To determine the value of the trend of order $m = 3$ at a point, you take the original value, the value of the predecessor and the value of the successor and average them. In the simplest approach you simply take the sum of the values divided by the number of values (the so-called simple moving average, SMA).
In the example of 5 values this would look like:
           
$y_1$      3
$y_2$      5     -> (3+5+4)/3 = 4
$y_3$      4     -> (5+4+1)/3 = 3.33
$y_4$      1     -> (4+1+3)/3 = 2.67
$y_5$      3



You already see the problem here: at the beginning and the end there are no trend values available; the length of the missing tail depends on $m$.

What if we choose $m=4$? We would have to decide whether to take more points from the past than from the future. In these cases the algorithm is not symmetric anymore; usually you therefore either change to the next odd number (e.g. $m=13$ instead of $m=12$) or you choose a so-called centered moving average. In the centered moving average you first use a simple moving average of order 2 to determine values like $y_{1.5}$, and then you apply the SMA of order $m$ to those.

$y_1$      3
                           -> $y_{1.5}$ = (3+5)/2 = 4
$y_2$      5
                           -> $y_{2.5}$ = (5+4)/2 = 4.5
$y_3$      4                                                                           ->  (4+4.5+2.5+2) / 4 = 3.25
                           -> $y_{3.5}$ = (4+1)/2 = 2.5
$y_4$      1
                           -> $y_{4.5}$ = (1+3)/2 = 2
$y_5$      3

In our earlier example of air passengers we determined $m=12$ (even) and the trend values in yellow are determined by a centered moving average of order 12.

Here is the command in R (ma stands for moving average and is provided by the forecast package):
> library(forecast)
> lines(ma(AirPassengers, order = 12, centre = TRUE), col = "yellow")

After successfully determining the trend, we can remove it from the original data. In an additive model we get the detrended time series by subtracting it ($Y_t - T_t$), in a multiplicative model by dividing ($Y_t / T_t$). The detrended time series of our AirPassengers example looks like this:
The next step is to extract the seasonal component and the random component from the detrended time series.





Classical Decomposition - Summary


The classical decomposition of a time series can help to get an overview of the tendencies (trend component), periodic patterns (seasonal component) and the quality of the model (random component). In addition it helps to identify outliers in a time series.

To forecast a time series it is often useful to have a decomposition and to forecast each of the components separately. A seasonal component would just be repeated constantly (naive forecast), while exponential smoothing methods could be used to forecast the trend and random components.

On the other hand, the classical decomposition has some disadvantages: we saw in this post that the trend, and therefore also the random component, cannot be determined at the beginning and at the end of a time series. We also saw in that post that it relies on the assumption of a stable period with a pretty constant pattern. In reality this is often not the case: e.g. 100 years ago energy consumption was high in winter due to heating; now it is equally high in summer due to air conditioning.

To overcome these limitations, other decomposition methods have been developed, see for instance Seasonal and Trend decomposition using Loess (STL, 1990). I will describe it in a new post.





I hope you liked this and got a picture of the classical decomposition. I really enjoyed building up this example and encourage you to comment and extend it.

Sensitivity and Specificity


Apart from accuracy and precision there are other measures for classification models. Today we will focus on another pair of measures, called sensitivity and specificity. Like accuracy and precision they are numbers between 0 and 1, and the higher the values, the better.
A perfect classification model would have 100% sensitivity and 100% specificity.

Before defining these values, we recall the confusion matrix:



Now

Sensitivity = True Positives / Actual Positives 

In other words sensitivity describes the probability that a positive is recognized as such by the model, therefore sensitivity is also often called true positive rate.

Analogously

Specificity = True Negatives / Actual Negatives

In other words, specificity describes the probability that a negative is recognized as such by the model; therefore specificity is also often called the true negative rate.



As an example, let's assume that we have a binary classifier for cat and dog pictures. We test it with 100 pictures, 50 cat pictures and 50 dog pictures. Our classifier, however, erroneously classifies 6 cats as dogs and 2 dogs as cats.
We would have the following confusion matrix:
             
                   predicted Cats   predicted Dogs   total
actual Cats              44                 6           50
actual Dogs               2                48           50

The sensitivity would be 44/50 = 88%, the specificity 48/50 = 96%.
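A minimal sketch of this calculation in R, using the confusion matrix from the example:
> conf <- matrix(c(44, 6, 2, 48), nrow = 2, byrow = TRUE,
                 dimnames = list(actual = c("cat", "dog"), predicted = c("cat", "dog")))
> conf["cat", "cat"] / sum(conf["cat", ])   # sensitivity: 44/50 = 0.88
> conf["dog", "dog"] / sum(conf["dog", ])   # specificity: 48/50 = 0.96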


Where are these values used?
A common way to compare the quality of different classifiers is to use a receiver operating characteristic curve or ROC curve, in which the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity). It owes its quite complicated name to its first use during World War II to detect enemy objects on the battlefield. Every test result or confusion matrix represents one point in the ROC curve. But this will be a separate post...





Averages

 

Averages are a bad idea. Averages hide characteristics, individual properties and specialities behind one value. Nobody wants to be average; in fact people tend to feel bad if their individual properties are not considered (no vegetarian meal in a restaurant). On the other hand, spotting individual strengths and addressing them is a reliable approach to increase response, e.g. in a marketing campaign. Why would anyone even consider writing about averages?

Because averages are useful!

Averages CAN hide characteristics and individual properties behind one value. In the right places they can contribute to massive simplifications and effective classifications and help to compare results. Averages help in regression and prediction and in the construction of clusters and decision trees, the most widely used data mining methods. They can be used for smaller subsets and applied at bigger scales. Therefore they are essential in data science.

From a mathematical point of view there are a lot of different ways to define averages. The most common average ("the average of the averages") is the arithmetic average, which sums up all of the values and divides the sum by the number of values (we only focus here on averages used in practice; infinite series are nice, but a different field!). Another average is the geometric average of n numbers, for which all n values are multiplied and the nth root is taken, and there is the harmonic average, which however is not used that often.

In statistics the weighted average (or weighted mean) is of greater importance; in it the different values are given a certain weight (usually the sum of all weights is 1). It does not treat all the values in the same way: some values are more important (higher weight) than others.
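A minimal sketch of these averages in R (the values and weights are assumed examples):
> x <- c(2, 4, 8, 16)
> mean(x)                                      # arithmetic average
> exp(mean(log(x)))                            # geometric average (nth root of the product)
> length(x) / sum(1 / x)                       # harmonic average
> weighted.mean(x, w = c(0.4, 0.3, 0.2, 0.1))  # weighted average, weights sum to 1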

What does this have to do with the small town Haßloch in the Bad Dürkheim district in Germany?
Haßloch is an average. The distribution of its inhabitants considering age, income, education and home size is representative for Germany; in other words, the averages of these values for the whole of Germany are pretty close to the values in Haßloch. Therefore Haßloch is the ideal place to test and analyze new product launches. Products that fail in Haßloch will not get released onto the German market, and lots of successful products and packagings have made their way through the supermarkets of Haßloch.


By the way, the average German prefers the color blue, and the average German man is 1.78 m tall, weighs 82.4 kg and has blond or dark blond hair. You want to know if you can recognize the average German girl (and those of other nations)? Find out here.



Decision Trees


Decision trees are one of the most important techniques for data analysis. They are easy to understand, can illustrate complex rule sets and are an effective method to classify new datasets.

Are there any complications about decision trees? Sure there are!

There are quite a lot of algorithms for building decision trees. The common understanding of building them by following known rules is not the principal way decision trees are used in data science. In data science the rules are usually unknown, meaning the split of the data into the branches and the determination of the leaves is not known from the start. A good algorithm increases the purity with every split, meaning that the disjoint data subsets after a split are purer regarding a target variable than the data set before the split. In addition to determining and calculating purity, the different algorithms also concentrate on questions around the minimal size of leaves and the number of splits, as one major problem of decision trees is overfitting.

So how do you measure purity?


The simplest way to measure the purity of a split in a classification decision tree would be to measure the proportion of the target variable in the created subsets and choose the split that generates the minimal proportion in one of the subsets.


I found further splitting criteria in Linoff/Berry's book on Data Mining Techniques (by the way a very good and understandable reference book for data mining):

The Gini measure simply sums up the squares of the percentages of the target variable's values in a subset and assigns this value to the node. A bad split would not considerably change the proportions of the values and would therefore have a Gini measure around 1/n, if n is the number of different values of the target variable. A good split, however, would have a Gini score close to 1 for the subsets. To get the score of the split, the Gini scores of the subsets are added up, weighted by the sizes of the subsets.
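A minimal sketch of this Gini score as just described (the function names are made up for illustration):
> gini.node <- function(target) sum((table(target) / length(target))^2)   # sum of squared proportions
> gini.split <- function(target, left) {          # 'left' is a logical vector defining the split
    n <- length(target)
    (sum(left) / n) * gini.node(target[left]) +
    (sum(!left) / n) * gini.node(target[!left])   # node scores weighted by subset size
  }
> gini.split(iris$Species, iris$Petal.Length < 2.5)   # example on the built-in iris data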

The Chi-square test can be used to measure how good a split is. It measures how likely the split is due to chance, calculating the expected and observed counts of each target value in every subset after the split. For each subset the value is calculated as the sum over all target values of (observed - expected)^2 / expected, and the subset values are summed up to get a measure for the split. It is used in the Chi-Square Automatic Interaction Detector (CHAID) algorithm for decision trees, where it chooses the best split in every step of the decision tree algorithm.

Accuracy and Precision - How good is your classifier?


After successfully deciding on a classifier and after adjusting and optimizing the parameters on the training data, it is time to evaluate the model. In the context of model evaluation, a few statistical terms are important, most importantly:

Accuracy and Precision

Colloquially both have pretty much the same meaning; however, in statistics they describe different properties. They are independent of each other: a model can be highly accurate but have low precision, or be highly precise but low in accuracy. Ideally you have a model that is both highly accurate and highly precise.

To understand the definitions, we start with the so-called confusion matrix, which can be created for any classifier or supervised learning model. After building your model you compare the actual results to the results predicted by your model. You then classify the compared results via the following matrix (in case you have more classes than true/false, you can create a confusion matrix for each class):


The true positives (TP) and the true negatives (TN) are the datasets that your model correctly predicted; the higher the values in these boxes, the better your model. Errors in the predicted classifications of your model can be divided into:
- False positives (FP) (also called type 1 errors) are the cases in which your model incorrectly classified a dataset as positive,
- False negatives (FN) (also called type 2 errors) are the ones incorrectly classified as negative.

Accuracy is the ratio of correctly classified datasets to the whole dataset, so
Accuracy = ( True Positives + True Negatives ) / All classified examples.

A high accuracy seems to be a reasonable choice to evaluate your model, however it is not sufficient, as the following example shows: imagine you have a classifier for breast cancer, which in most cases will give a negative result. In fact, for 2017 the US expects a rate of roughly 0.123% of new cases. Imagine that your classifier always predicts a negative result; then your classifier has TP = 0 and TN ≈ 0.9988 (as a fraction of all cases), and therefore an accuracy of about 99.9%, yet we agree that the classifier is not at all useful.
We somehow need to consider the overall positives (P) and the overall negatives (N).

Here is where precision comes into play. Precision is calculated as
Precision =  True Positives / ( True Positives + False Positives )

Now for our dummy classifier the precision would be 0 (it never even predicts a positive), confirming that it is not useful at all.
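A minimal sketch of both measures, given the counts of a confusion matrix (the numbers are assumed purely for illustration):
> TP <- 85; FP <- 10; FN <- 5; TN <- 100
> (TP + TN) / (TP + TN + FP + FN)   # accuracy
> TP / (TP + FP)                    # precision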

There are more relevant key figures, but they will be covered in a later post. For now the confusion matrix is a first hint to determine whether your classifier is useful.







How is the weather in the Black Forest? - Nearest Neighbour Approaches


Planning a city or hiking trip on the weekend, a barbecue or a romantic picnic in the mountains requires a pretty good weather forecast. What would you do if you wanted to know the weather? Probably you would just check the weather forecast for the desired city; these weather forecasts are available online. But what if your route lies in the mountains, without any reference city or village? You would then choose a location near your route for which a weather forecast is available and assume you will get the same weather on your route.
Assume you have a sunny weather forecast for one side of your route and a rainy weather forecast for the other side; you would probably rely more on the forecast closer to the route, but also consider the forecast farther away.
This approach is quite reasonable, assuming that the weather is similar at geographically similar locations. The same reasoning is also applied in the business and research world to forecast key figures like preferences, behaviours and prices. One approach is the nearest neighbour approach, which starts with the assumption that objects "close" to each other behave in the "same" way, so if you have to predict an unknown value, just ask the "nearest" neighbours and use their results.

However, there are some problems with this approach: it might be easy to find neighbours considering numeric values like age, income or location, but how would you find neighbours for, let's say, preferred music?

As so often, there is no general answer, and different ways to find neighbours exist. Usually finding a good neighbour consists of two main tasks:
1. Find a measure for distance between two datasets
2. Combine the datasets for the closest neighbours to make a prediction

Usually you have a dataset consisting of a collection of features; for a customer this could be a so-called Customer ID holding all the relevant data that could possibly influence the target variable. Often age, income and location are part of it. For these features it is easy to determine a distance; usually the difference (direct difference or Euclidean distance) is used. For better comparison, normalization is recommended, as the features have different ranges.

You could then go for the k nearest neighbours, check the known values of their target variables and then combine these results into a target value. Here business knowledge is required to determine to what extent e.g. the age is important or whether location is more relevant than income.

The advantages of this approach are that it is usually pretty easy to understand, often shows additional insights and adapts whenever the data changes.
However, it can be time-consuming, as finding good neighbours means checking every single known dataset and measuring the distance. Also the prediction can depend considerably on the value of k, so further analysis is required. And the algorithm is discontinuous, meaning that a new dataset could have a huge impact on the existing predictions.

To overcome these limitations, different approaches have been established (e.g. reducing the number of datasets by choosing "important" neighbours), and nearest neighbour approaches are widely used in different classification and regression problems (e.g. supporting breast cancer detection, estimating house prices).
And by defining the right distance measure you can even find neighbours of music files or detect song titles.

As a closing example, here is a simplified exercise. Consider the following data:

Target: Money spent online for hobbies per year
Gender   Age   Income   Nof Children   Target
Male      35   65,000        1          6,500
Female    24   25,000        0          2,000
Female    60   60,000        4            380
Male      48   45,000        2          2,100
Female    39   60,000        0          6,000
Male      49   75,000        2          7,500
Male      18      800        0            670

Who are your closest neighbours? And how well does the estimate of the 1-nearest-neighbour model fit you?
For the distance, use the absolute differences in age, income and number of children, with the weights age_weight = 7, income_weight = 10, nof_children_weight = 1.
(Note that this is a very simplified example; in reality the money spent on hobbies would depend on many more factors, and the variables are not independent of each other.)
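A minimal sketch of this exercise in R (the query values in 'me' are assumed; plug in your own):
> data <- data.frame(
    gender   = c("Male", "Female", "Female", "Male", "Female", "Male", "Male"),
    age      = c(35, 24, 60, 48, 39, 49, 18),
    income   = c(65000, 25000, 60000, 45000, 60000, 75000, 800),
    children = c(1, 0, 4, 2, 0, 2, 0),
    target   = c(6500, 2000, 380, 2100, 6000, 7500, 670))
> me <- list(age = 30, income = 50000, children = 1)      # assumed query point
> dist <- 7 * abs(data$age - me$age) +
          10 * abs(data$income - me$income) +
          1 * abs(data$children - me$children)            # weighted absolute differences
> data$target[which.min(dist)]                            # prediction of the 1-nearest neighbour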

Hopefully you get a good approximation, so your next romantic outdoor picnic in the Black Forest does not look like this:

What does a Data Scientist do? - Facts



From the post about CRISP-DM we know what phases a data science project consists of. This post analyzes the time spent on the different tasks. I refer to the "Data Science Survey" of Rexer Analytics from 2015.
  • The top application areas are Marketing, Academics, Finance, Technology, Medical, Retail, Internet-based, Government and Manufacturing.
  • Data scientists work above all as corporate employees, consultants, academics or vendors.
  • Data scientists have a high job satisfaction level.
  • The biggest challenges are the rising data analysis complexity and data visualization.
  • The most frequently used algorithms are regression algorithms, decision trees and cluster analysis, the main tool is R.
  • There is also an impressive list of alternative job titles provided, so apart from data scientist you can also call yourself data analyst, researcher, business analyst, data miner, statistician, predictive modeler, computer scientist, engineer or software developer.