Correlation



What is a correlation?
A correlation is a kind of relationship between two variables. It is expressed as a correlation coefficient between -1 and 1, which measures to which degree the variables have a linear relationship with each other. The higher the absolute value of the correlation coefficient, the stronger the relationship.


What types of correlation exist?
- A value of 0 means that there is no (linear) relationship between the variables (a so-called zero order correlation)
- A positive value means a positive relationship: one variable grows as the other one grows
- A negative value means a negative relationship: one variable falls as the other one grows
- There is even a standard, published by the Political Science Department at Quinnipiac University, defining the following terms for the absolute values of correlation coefficients:

Value of the correlation coefficient:
0.00:         no relationship
0.01 - 0.19:  no or negligible relationship
0.20 - 0.29:  weak relationship
0.30 - 0.39:  moderate relationship
0.40 - 0.69:  strong relationship
>= 0.70:      very strong relationship


Examples:
- Aristotle: "The more you know, the more you know you don't know" (a positive relationship between what you know and what you know you don't know).
- Walt Disney: "The more you like yourself, the less you are like anyone else, which makes you unique." (a negative relationship between how much you like yourself and how similar you are to others (don't overdo it)).
- If I wish you all the best, what is left for me? (There should be no relation between what I wish you and what is left for me (I really hope so!))
- Further examples can be found in this nice graphic by DenisBoigelot.


How is the correlation coefficient calculated?
There are different definitions of correlation. The most famous correlation coefficient is the so-called Pearson product-moment correlation coefficient, or simply Pearson's correlation coefficient. Assume we have two variables $X$ and $Y$. The correlation coefficient $\rho_{X,Y}$ or $corr(X,Y)$ is calculated by $$\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$ where $\sigma_X$ and $\sigma_Y$ are the standard deviations, $\mu_X$ and $\mu_Y$ are the means of the variables, $E$ stands for the expected value and $cov$ stands for the covariance.
If $X$ and $Y$ consist of indexed samples $x_i$ and $y_i$ for $i = 1..n$, we can rewrite the formula as $$\rho_{X,Y} = corr(X,Y) = \frac{n\sum{x_iy_i} - \sum{x_i}\sum{y_i}}{\sqrt{n\sum{x_i^2}-(\sum{x_i})^2}\sqrt{n\sum{y_i^2}-(\sum{y_i})^2}}$$
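To see the definition in action, here is a minimal sketch with two made-up vectors a and b (not the example used below): since R's cov() and sd() use the same sample convention, their ratio reproduces cor().

> a <- c(2, 4, 6, 8)
> b <- c(1, 3, 2, 5)
> cov(a, b) / (sd(a) * sd(b))  # covariance divided by the product of the standard deviations
> cor(a, b)                    # same value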

You see immediately that the correlation coefficient is symmetric, which is nice; however, it also reveals an important limitation: you cannot conclude causation from correlation! Your water consumption has a strong correlation with the outside temperature; however, on a snowy day you could drink as much as you want, you probably would not raise the outside temperature (if you can, please contact me in winter).


Example:
Assume we have the following example:
> X <- c(1, 3, 4, 7, 8, 23)
> Y <- c(3, 7, 8, 13, 24, 60)

To calculate the correlation in R we use the command
> cor(X,Y)

and get the result $cor(X,Y) = 0.991259$ (the optional parameter "method" is "pearson" by default; you can also choose "spearman" and "kendall").
To calculate it by hand we would first compute the products $x_iy_i$ and the squares $x_i^2$ and $y_i^2$ and use the formula mentioned above (verify!).
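A minimal sketch of this hand calculation in R, using the sums from the sample formula above:

> n <- length(X)
> num <- n * sum(X * Y) - sum(X) * sum(Y)                               # numerator
> den <- sqrt(n * sum(X^2) - sum(X)^2) * sqrt(n * sum(Y^2) - sum(Y)^2)  # denominator
> num / den                                                             # matches cor(X, Y)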

The result of the example indicates a very strong linear relationship between $X$ and $Y$; we see this in the diagram (including the linear regression line y = -1.191 + 2.655x in red):
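A minimal sketch to produce such a diagram (the red line is the regression line fitted with lm()):

> plot(X, Y, pch=19, main="X vs. Y", xlab="X", ylab="Y")
> abline(lm(Y ~ X), col="red")  # regression line y = -1.191 + 2.655x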

In a perfect relationship with correlation coefficient 1 all the data points would lie on a straight line.

What else is good to know?
- A correlation coefficient alone cannot tell you whether the correlation is significantly different from $0$ (e.g. to reject a hypothesis negating any relation between the variables). For that you need a test of significance (in R the command cor.test(..); see the sketch after this list).
- There are of course other methods to determine correlation, especially for non-linear relationships.
- Partial correlation is the correlation between two quantitative variables while controlling for selected further quantitative variables.
- The correlation coefficient is not very robust: a single outlier can change its value considerably.
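A minimal sketch of such a significance test for the X and Y from the example above; cor.test() reports the estimated coefficient together with a p-value and a confidence interval:

> cor.test(X, Y)                       # Pearson's correlation test (default)
> cor.test(X, Y, method = "spearman")  # rank-based alternative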


Random Forests



As mentioned in the post about decision trees, the big challenge to face is overfitting. To address the issue, a concept developed by Tin Kam Ho was introduced and later named and implemented by Leo Breiman. In this approach one creates distance from the training data (no need for a test or cross-validation dataset) by considering randomly chosen subsets, so-called bootstrap samples. The remaining part, the so-called out-of-bag data (e.g. one third of the training data), is used to validate the classification. In addition, for the construction of each tree only a randomly chosen subset of the splitting features is taken into account (e.g. one third of the features for regression problems and the square root of the number of features for classification).
The final result of a request is then the aggregated result of all the decision trees. The approach thus makes use of the fact that cumulated decisions of a group in general yield better results than individual decisions. The name random forest is thereby a nice and intuitive wordplay.

As the different trees are independent of each other, the evaluation of a random forest can be parallelized. It also reduces the high variance that is often created by a single decision tree. For further information about the advantages and disadvantages of a random forest I refer to Leo Breiman's site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

R has its own package "randomForest" for random forests.
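A minimal sketch, assuming the randomForest package is installed, using R's built-in iris data set; the concrete arguments are illustrative, not prescriptive:

library(randomForest)

set.seed(42)                                            # make the bootstrap samples reproducible
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500,                         # number of trees in the forest
                   mtry = floor(sqrt(ncol(iris) - 1)))  # features considered per split (sqrt for classification)
print(rf)        # includes the out-of-bag (OOB) error estimate
importance(rf)   # variable importance aggregated over all trees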

R - Simple Data Preparation

 
R covers a huge variety of functions for data manipulation, reorganization and separation. Here are some of the commands I consider the most useful when preparing data.

I will start with an example:
Imagine that we have an IoT scenario in which data from the temperature sensors s1, s2 and s3 are collected in an edge unit before a reduced set of data is sent to the central unit. Each sensor sends data at specific points in time t, measured in (milli)seconds.

s1:
t = 0: temp = 24.3
t = 1: temp = 24.7
t = 2: temp = 25.2
t = 3: temp = 25.0

s2:
t = 0: temp = 20.2
t = 1: temp = 20.1
t = 2: temp = 99.9
t = 3: temp = 20.1

s3:
t = 0: temp = 28.1
t = 1: temp = 28.0
t = 2: temp =
t = 3: temp = 27.7

To get the data into R we create vectors holding the values for the sensors. We see that the value at t = 2 for sensor s3 is missing, therefore we add "NA" there in order to tell R that we do not have the value:
> s1 <- c(24.3, 24.7, 25.2, 25.0)
> s2 <- c(20.2, 20.1, 99.9, 20.1)
> s3 <- c(28.1, 28.0, NA, 27.7)

We can furthermore create a dataframe holding these measurements by
> sensor.data <- data.frame(s1, s2, s3)


What we get is the following output for sensor.data:
   s1   s2   s3
1 24.3 20.2 28.1
2 24.7 20.1 28.0
3 25.2 99.9   NA
4 25.0 20.1 27.7
 
We do not like the row identifier that R sets up by default; we have our own identifier using the values for t. We just have to tell R to use these values, which we do with the argument row.names:
> t <- c(0, 1, 2, 3)
> sensor.data <- data.frame(t, s1, s2, s3, row.names=t)

  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2 99.9   NA
3 3 25.0 20.1 27.7

First of all we want to see the data in an easy diagram:
> plot(t, sensor.data$s1, type="b", pch=1,
     col="red", xlim=c(0, 5), ylim=c(18, 100), 
     main="Sensor Data", lwd=2, xlab="time", ylab="temperature")
> lines(t, sensor.data$s2, type="b", pch=5, lwd=2, col="green")
> lines(t, sensor.data$s3, type="b", pch=7, lwd=2, col="blue")

We see that the value 99.9 seems to be a wrong measurement (we assume this here in order to see how to manipulate the data; in practice an analysis is required to check the background of this outlying value):

> sensor.data[, 2:4][sensor.data[, 2:4] == 99.9] <- NA
 
  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2   NA   NA
3 3 25.0 20.1 27.7
 
The plot commands skip NA values, therefore the plotted lines appear incomplete. We can identify NA values using the command is.na(v) for a vector v.
We want to replace the missing values for sensors s2 and s3 by the means of the neighbouring values (this is also an assumption that should be validated carefully):

> sensor.data$s2[3] <- (sensor.data$s2[2] + sensor.data$s2[4])/2
> sensor.data$s3[3] <- (sensor.data$s3[2] + sensor.data$s3[4])/2
 
  t   s1   s2    s3
0 0 24.3 20.2 28.10
1 1 24.7 20.1 28.00
2 2 25.2 20.1 27.85
3 3 25.0 20.1 27.70
 
Now we decide that we want to extend our data frame so that it also holds the sum and the mean of the sensor values at every point in time. We do that with the command:

> sensor.data.xt <- transform(sensor.data, sumx = s1 + s2 + s3, meanx = (s1 + s2 + s3)/3)
 
  t   s1   s2    s3  sumx    meanx
0 0 24.3 20.2 28.10 72.60 24.20000
1 1 24.7 20.1 28.00 72.80 24.26667
2 2 25.2 20.1 27.85 73.15 24.38333
3 3 25.0 20.1 27.70 72.80 24.26667
  
Next we decide to classify a dataset as critical if the sum is at least 73.0, as anormal if it is at least 72.7 (but below 73.0), and as normal otherwise. So we create another variable in our data frame by

> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 73] <- "critical"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 72.7 & sensor.data.xt$sumx < 73] <- "anormal"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx < 72.7] <- "normal"

  t   s1   s2    s3  sumx    meanx riskcatg
0 0 24.3 20.2 28.10 72.60 24.20000   normal
1 1 24.7 20.1 28.00 72.80 24.26667  anormal
2 2 25.2 20.1 27.85 73.15 24.38333 critical
3 3 25.0 20.1 27.70 72.80 24.26667  anormal


Instead of using plain strings we turn riskcatg into a categorical variable (a factor) by

> sensor.data.xt$riskcatg <- factor(sensor.data.xt$riskcatg)
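A quick check that the conversion worked, using the base R functions str() and levels():

> str(sensor.data.xt$riskcatg)     # now a factor with 3 levels
> levels(sensor.data.xt$riskcatg)  # "anormal" "critical" "normal"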



TIP: The statement
variable[condition] <- expression
is very powerful and useful for data manipulation.
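For example, for a hypothetical vector of measurements in which the sentinel value -1 marks broken readings:

> v <- c(2.5, -1, 3.1, -1, 2.9)
> v[v == -1] <- NA  # replace every -1 by NA in a single statement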

R - Packages and DataTypes



Packages
Functionality in R is maintained in packages. Some packages are part of the basic functionality and come predefined when you install R, others have to be installed (and loaded) before they can be used.
To find out the directories of your packages, enter the command .libPaths(); to see the already loaded packages use search(). The commands installed.packages(), install.packages("..") and update.packages() are self-explanatory. To load a package use library(..) or require(..).
To show the content of a package use the command help(package="..").
In RStudio this is easy, as you see the packages in a separate pane.


Data Types
  • Scalar A scalar is a single value vector (numeric, logical or character value). Example: s <- 3
  • Vector A vector is a collection of scalars of the same type; to combine them use c(..). Example: v <- c(1, 2, s, 4, 5, 6, 7, 8) (s from above)
  • Matrix A matrix is a collection of vectors; all elements have the same type. Example: m <- matrix(v, nrow=2, ncol=4, byrow = TRUE) (v from above; note that byrow determines whether the values of v are filled in by row or by column (the default))
  • Array An array is a collection of matrices; all elements have the same type. Example: a <- array(v, c(2,2,2))
  • DataFrame A data frame is a matrix-like structure that can hold elements of different types mixed. As your data will usually be a mix of different types, this is the most used datatype in R. Example: df <- data.frame(column1, column2, ...) where the columns are vectors of the same length that can be of different types. Set the column names with names(df) <- c("x", "y")
  • Factor A factor is a nominal (categorical) or ordinal (ordered categorical) variable. Example: Yes/No is categorical, Small/Medium/Large/XLarge is ordinal
  • List A list is a wild collection of other data types. Example: l <- list(s, v, m, a, df) (variables from above). To name the elements use list("scalar"=s, ...). To access an element use double brackets with an index, [[1]], or a name, [["scalar"]]. A short sketch of all these types follows below.
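A minimal sketch putting the types listed above together (all values are arbitrary):

s  <- 3                                           # scalar (a vector of length 1)
v  <- c(1, 2, s, 4, 5, 6, 7, 8)                   # vector
m  <- matrix(v, nrow = 2, ncol = 4, byrow = TRUE) # matrix, filled row by row
a  <- array(v, c(2, 2, 2))                        # array
df <- data.frame(x = c(1, 2), y = c("a", "b"))    # data frame with mixed column types
f  <- factor(c("Small", "Large", "Medium"),
             levels = c("Small", "Medium", "Large"), ordered = TRUE)  # ordinal factor
l  <- list(scalar = s, vec = v, mat = m)          # list of different objects
l[["scalar"]]                                     # access a named list element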

Null-Hypothesis, Z-Scores and Normal Distribution


This is Carl Friedrich Gauß, one of the most friendly looking German mathematicians, on a banknote. Nowadays Germany forms part of the European Union and uses the (by the way much more secure) euro banknotes (a different story). He earned that honor because of his widely used contributions to algebra, geometry and astronomy. If you have a close look at the banknote you will even spot the "Gaussian bell curve". This is what I want to talk about in this post.
We saw in this post about the null hypothesis and p-values that in order to accept an alternative hypothesis, we have to find a way to reject a null hypothesis. I just mentioned the $p$-value, however I did not explain yet how to calculate it.

For every value in the data set we get its corresponding z-score by subtracting the average and dividing by the standard deviation: $$z = \frac{x - \mu}{\sigma}$$

Instead of the original dataset we can then work with the set of standardized values. You might wonder why we should do this: we know that under the null hypothesis the standardized values have an average of $0$ and a standard deviation of $1$ (given that the data set holds enough data; for fewer than 30 values there is in fact a slightly different assumption)! This means that we would expect the observed values to follow the standard normal distribution (see picture below). In particular, around $68\%$ of the values should lie between $-1$ and $1$, around $95\%$ of the values between $-2$ and $2$, and only very few data points should lie outside $-3$ and $3$:
Actually the bell curve does not end at $-4$ or $4$, but the values are getting close to zero very fast.
The actual formula of the normal distribution for a mean $\mu$ and a variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Here is the R code:
x <- seq(-4, 4, length = 10000)
y <- dnorm(x, 0, 1)
plot(x, y, type="n", xlab = "x", ylab = "y", main = "Bell Curve", axes = TRUE)
lines(x, y, col="red")

To calculate the $p$-value in R, we can use the command
pnorm(...)

What we get out of it:
For every value $x$, the $y$ value determines the relative frequency of this value in a normally distributed data set. We can now check the probability of a result at least as extreme as our (standardized) observation. We have to choose between a one-tailed and a two-tailed test, meaning we put our decision boundary on one or on both sides (split up) of the bell curve (details in a different post). Now if this probability is lower than the value we specified for our significance level, we decide to reject the null hypothesis.
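A minimal sketch of this last step with made-up numbers (observation, mean and standard deviation are assumptions for illustration only):

x  <- 105; mu <- 100; s <- 2.5   # made-up observation and null-hypothesis parameters
z  <- (x - mu) / s               # z-score of the observation
pnorm(-abs(z))                   # one-tailed p-value
2 * pnorm(-abs(z))               # two-tailed p-value, here about 0.046
2 * pnorm(-abs(z)) < 0.05        # TRUE: reject the null hypothesis at alpha = 0.05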

Null Hypothesis, P-Value and Significance Level




What is the null hypothesis?
That's easy: it is the assumption that an observation is simply due to chance. The contrary assumption, that an observation is NOT due to chance, is called the alternative hypothesis.

What is it used for?
It is used to formulate a data science problem in its contrary form: could it be that there is something beyond chance in my observation? Can I build up confidence that there is "something special" in the data (the alternative hypothesis)?
To answer the question, we use a classical mathematical trick: we assume that the null hypothesis is true, so we work on purely random observations. Knowing the rules of chance, we can then calculate the probability of a value at least as extreme as the observed value, the so-called $p$-value, given that the null hypothesis is true. If this probability $p$ is high, then it is not a good idea to reject the null hypothesis. On the other hand, if $p$ is low, then we would say that the assumption is not plausible and we have found a way to reject the null hypothesis - we gained confidence that the alternative hypothesis is valid.

When do we reject the null hypothesis?
If the $p$-value falls below a critical level, the so-called significance level $\alpha$. Usual values are $\alpha = 0.05$ or $\alpha = 0.025$. In this case the probability is too low to keep the null hypothesis: the deviation from it is statistically significant.
Assume that the probability of the observation under the null hypothesis is only around 1%; in this case we are pretty confident (99%) that we should reject the null hypothesis. Therefore the value $q = 1 - p$ (here 99%) is also called the confidence.

If you are interested in how to calculate the $p$-value using the $z$-values, check out this post about the z-scores.



R - How to Start

 

Welcome to a new challenge - an empty green meadow is waiting to be explored and developed!

R is a pretty strange programming language; it does not follow the usual conventions and is not intuitive, at least not in the beginning. I am familiar with quite a few programming languages, however R is nothing like them (in which other programming language do you prefer the assignment operator "<-" over "="??). However, as R is currently considered the reference program for statistics (even more popular than SPSS) and in addition is open source, it is worth looking at it.

Here are some tips if you think about starting to learn R:

- Install R first and play around with the console. You will find out that it is not very handy. E.g. R does not require a line delimiter (like ";") after statements, the recommended maximum line length is only 80 characters, ...
- After you have found out that R is quite strange, get RStudio (e.g. here). RStudio is really helpful: it provides shortcuts for the most used commands, a simple structure and a pretty nice user interface.
- Make yourself familiar with Google's R Style Guide, so you do not make a fool of yourself chatting with experts. You will also learn a lot about R's specialities (so far I considered "." a bad choice for a character in identifiers as you expect it to point to a subattribute, however in R it is accepted and "_" is the bad choice...)
- Remember the command "rm(list = ls())" which is used to clear the current workspace (yes, there is a current workspace in which locally created variables live!)
- Look for free online courses (there are tons of them)
- Get familiar with the shortcuts "ALT + -" and "Ctrl + L"
- Find information on available packages on the CRAN sites; have a look especially at this crantastic page that allows you to search for (popular) packages
- Use the predefined data sets in R (see "data()" for an overview), they will often be used in examples and it feels good to already know them
- Use command "View(..)" regularily on your data to get a clean picture of it
- When you use "require("packageName")", remember to use "detach("package:packageName", unload = TRUE)" at the end (use those commands together in RStudio)
- Use the command demo() to get an overview of the demos included in R. To execute a demo use the same command with one of the given arguments (e.g. demo(colors)).
- Use fix() on a data frame to correct values manually
- Use transform() and the powerful (s)apply() on your data frames (see the sketch after this list)
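A minimal sketch of the last two commands on the built-in mtcars data set (the new column name lper100km is just an example):

> data(mtcars)
> sapply(mtcars, mean)                               # apply mean() to every column
> head(transform(mtcars, lper100km = 235.21 / mpg))  # add a derived column (mpg converted to litres per 100 km)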

This post is being updated regularly.