Correlation



What is a correlation?
A correlation is a kind of relationship between two variables. It is expressed as a correlation coefficient between -1 and 1, which measures the degree to which the variables have a linear relationship with each other. The higher the absolute value of the correlation coefficient, the stronger the relation.


What types of correlation exist?
- A value of 0 means that there is no (linear) relation between the variables (a so-called zero correlation)
- A positive value means a positive relationship, that is, one variable grows as the other one grows
- A negative value means a negative relationship, that is, one variable falls as the other one grows.
- There is even a standard published by the Political Science Department at Quinnipiac University defining the following terms for the absolute values of correlation coefficients:

Value of the correlation coefficient:
0.00:         no relationship
0.01 - 0.19:  no or negligible relationship
0.20 - 0.29:  weak relationship
0.30 - 0.39:  moderate relationship
0.40 - 0.69:  strong relationship
>= 0.70:      very strong relationship
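
As a small illustration, this scale could be encoded in R; strength_label is a hypothetical helper name (not from any package), and the first two rows of the scale are folded into one label:

> strength_label <- function(r) {
+   cut(abs(r), breaks = c(0, 0.20, 0.30, 0.40, 0.70, 1),
+       labels = c("no or negligible", "weak", "moderate", "strong", "very strong"),
+       include.lowest = TRUE, right = FALSE)
+ }
> strength_label(0.35)   # "moderate"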


Examples:
- Aristotle: "The more you know, the more you know you don't know" (positive relationship between what you know and what you know you don't know).
- Walt Disney: "The more you like yourself, the less you are like anyone else, which makes you unique." (negative relationship between the amount by which you like yourself and the amount by which you are similar to others (don't overdo it)).
- If I wish you all the best, what is left for me? (There should be no relation between what I wish you and what is left for me (I really hope so!))
- Further examples can be found in this nice graphic by DenisBoigelot.


How is the correlation coefficient calculated?
There are different definitions of correlation. The most famous correlation coefficient is the so-called Pearson product-moment correlation coefficient, or simply Pearson's correlation coefficient. Assume we have two variables $X$ and $Y$. The correlation coefficient $\rho_{X,Y}$ or $corr(X,Y)$ is calculated by $$\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$ where $\sigma_X$ and $\sigma_Y$ are the standard deviations, $\mu_X$ and $\mu_Y$ are the means of the variables, $E$ stands for the expected value and $cov$ stands for the covariance.
If $X$ and $Y$ consist of indexed samples $x_i$ and $y_i$ for $i = 1..n$, we can rewrite the formula as $$\rho_{X,Y} = corr(X,Y) = \frac{n\sum{x_iy_i} - \sum{x_i}\sum{y_i}}{\sqrt{n\sum{x_i^2}-(\sum{x_i})^2}\sqrt{n\sum{y_i^2}-(\sum{y_i})^2}}$$
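
As a minimal sketch, this sample formula can be translated directly into R (pearson_by_hand is a hypothetical name, not a built-in function):

> pearson_by_hand <- function(x, y) {
+   n <- length(x)
+   (n * sum(x * y) - sum(x) * sum(y)) /
+     (sqrt(n * sum(x^2) - sum(x)^2) * sqrt(n * sum(y^2) - sum(y)^2))
+ }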

You see immediately that the correlation coefficient is symmetric, which is nice, but this also points to an important limitation: you cannot conclude causation from correlation! Your water consumption has a strong correlation with the outside temperature; however, on a snowy day you could drink as much as you want, you probably would not raise the outside temperature (if you can, please contact me in winter).


Example:
Assume we have the following data:
> X <- c(1, 3, 4, 7, 8, 23)
> Y <- c(3, 7, 8, 13, 24, 60)

To calculate the correlation in R we use the command
> cor(X,Y)

and get the result $cor(X,Y) = 0.991259$ (the optional parameter "method" is "pearson" by default; you can also choose "spearman" or "kendall").
To calculate it by hand we would first compute the products $x_iy_i$ and the squares $x_i^2$ and $y_i^2$ and then use the formula mentioned above (verify, e.g. with the sketch below!).
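
With the hypothetical helper pearson_by_hand from above, this verification could look as follows:

> pearson_by_hand(X, Y)                        # same value as reported above
> all.equal(pearson_by_hand(X, Y), cor(X, Y))
[1] TRUE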

The result of the example states a very strong linear relationship between $X$ and $Y$; we see this in the diagram (including the linear regression line y = -1.191 + 2.655x in red):
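
The diagram can be reproduced with a few lines of R (a sketch; point symbols and axis labels are assumptions):

> plot(X, Y, pch = 19, xlab = "X", ylab = "Y")
> abline(lm(Y ~ X), col = "red")   # fitted regression line y = -1.191 + 2.655x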

In a perfect relationship with correlation coefficient 1 (or -1) all the data points would lie on a straight line.

What else is good to know?
- A correlation coefficient cannot tell you whether the correlation is significantly different from $0$ (e.g. to reject a hypothesis negating any relation between the variables). For that you need a test of significance (in R this is the command cor.test, see the sketch below).
- There are of course other methods to determine correlation, especially for non-linear relationships.
- Partial correlation is the correlation between two quantitative variables after removing the influence of selected further quantitative variables.
- The correlation coefficient is not very robust; a single outlier can change its value considerably.
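
A minimal sketch of such a significance test, using the example data $X$ and $Y$ from above:

> cor.test(X, Y)   # Pearson's product-moment correlation test

The output contains the t statistic, a confidence interval and a p-value; a p-value below the chosen significance level (e.g. 0.05) lets us reject the hypothesis that the true correlation is 0 (the optional parameter "method" works as in cor).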


Random Forests



As mentioned in the post about decision trees, the big challenge to face is overfitting. To address this issue, a concept developed by Tin Kam Ho was introduced and later named and implemented by Leo Breiman. In his approach he creates distance from the training data (no need for a test or cross-validation dataset) by considering randomly chosen subsets of it, so-called bootstrap samples. The remaining part, the so-called out-of-bag data (e.g. one third of the training data), is used to validate the classification. In addition, for the construction of each tree only a randomly chosen subset of the splitting features is taken into account (e.g. one third of the features for regression problems and the square root of the number of features for classification).
The final result of a request is then the aggregated result of all the decision trees. The approach thus makes use of the fact that cumulated decisions of a group in general yield better results than individual decisions. The name "random forest" is thereby a nice and intuitive wordplay.

As the different trees are independent of each other, the evaluation of a random forest can be parallelized. It also reduces the high variance that is often produced by a single decision tree. For further information about the advantages and disadvantages of random forests I refer to Leo Breiman's site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

R has its own package "randomForest" for random forests.
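
A minimal usage sketch with the well-known iris dataset (the parameter values are illustrative; mtry defaults to the feature-subset rules mentioned above):

> library(randomForest)
> set.seed(42)                        # make the bootstrap samples reproducible
> rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE)
> print(rf)                           # confusion matrix and OOB error estimate
> importance(rf)                      # per-feature variable importance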

R - Simple Data Preparation

 
R covers a huge variety of functions for data manipulation, reorganization and separation. Here are some of the commands I consider the most useful when preparing data.

I will start with an example:
Imagine that we have an IoT scenario in which data from temperature sensors s1, s2 and s3 are collected in an edge unit before a reduced set of data is sent to the central unit. Each sensor sends data at specific points in time t, measured in (milli)seconds.

s1:
t = 0: temp = 24.3
t = 1: temp = 24.7
t = 2: temp = 25.2
t = 3: temp = 25.0

s2:
t = 0: temp = 20.2
t = 1: temp = 20.1
t = 2: temp = 99.9
t = 3: temp = 20.1

s3:
t = 0: temp = 28.1
t = 1: temp = 28.0
t = 2: temp =
t = 3: temp = 27.7

To get the data into R we create vectors holding the values of the sensors. We see that the value at t = 2 for sensor s3 is missing, therefore we insert "NA" there in order to tell R that we do not have this value:
> s1 <- c(24.3, 24.7, 25.2, 25.0)
> s2 <- c(20.2, 20.1, 99.9, 20.1)
> s3 <- c(28.1, 28.0, NA, 27.7)

We can furthermore create a dataframe holding these measurements by
> sensor.data <- data.frame(s1, s2, s3)


What we get is the following output for sensor.data:
   s1   s2   s3
1 24.3 20.2 28.1
2 24.7 20.1 28.0
3 25.2 99.9   NA
4 25.0 20.1 27.7
 
We do not like the row identifier that R sets up by default; we have our own identifier, the values of t. We just have to tell R to use these values, which we do with the parameter row.names:
> t <- c(0, 1, 2, 3)
> sensor.data <- data.frame(t, s1, s2, s3, row.names=t)

  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2 99.9   NA
3 3 25.0 20.1 27.7

First of all we want to look at the data in a simple diagram:
> plot(t, sensor.data$s1, type="b", pch=1,
+      col="red", xlim=c(0, 5), ylim=c(18, 100),
+      main="Sensor Data", lwd=2, xlab="time", ylab="temperature")
> lines(t, sensor.data$s2, type="b", pch=5, lwd=2, col="green")
> lines(t, sensor.data$s3, type="b", pch=7, lwd=2, col="blue")

We see that the value 99.9 seems to be a wrong measurement (we assume this here in order to demonstrate how to manipulate the data; in practice an analysis is required to check the background of this outlying value):

> sensor.data[!is.na(sensor.data) & sensor.data == 99.9] <- NA

(The !is.na() part is necessary because s3 already contains an NA, and missing values are not allowed in subscripted assignments of data frames.)
 
  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2   NA   NA
3 3 25.0 20.1 27.7
 
Note that plot and lines cannot display NA values, so the corresponding lines appear incomplete. We can identify NA values using the command is.na(v) for a vector v.
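For our dataframe this looks as follows:

> is.na(sensor.data$s2)
[1] FALSE FALSE  TRUE FALSE
> which(is.na(sensor.data$s3))
[1] 3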
We want to replace the missing values for sensors s2 and s3 by the mean of the neighbouring values (this is also an assumption that should be validated carefully):

> sensor.data$s2[3] <- (sensor.data$s2[2] + sensor.data$s2[4])/2
> sensor.data$s3[3] <- (sensor.data$s3[2] + sensor.data$s3[4])/2
 
  t   s1   s2    s3
0 0 24.3 20.2 28.10
1 1 24.7 20.1 28.00
2 2 25.2 20.1 27.85
3 3 25.0 20.1 27.70
 
Now we decide to extend our dataframe so that it also holds the sum and the mean of the sensor values at every point in time. We do that with the command transform:

> sensor.data.xt <- transform(sensor.data, sumx = s1 + s2 + s3, meanx = (s1 + s2 + s3)/3)
 
  t   s1   s2    s3  sumx    meanx
0 0 24.3 20.2 28.10 72.60 24.20000
1 1 24.7 20.1 28.00 72.80 24.26667
2 2 25.2 20.1 27.85 73.15 24.38333
3 3 25.0 20.1 27.70 72.80 24.26667
  
Next we decide to classify a dataset as critical if the sum is at least 73.0, as abnormal if it is at least 72.7 (but below 73.0), and as normal otherwise. So we create another variable in our data frame, initialized with NA so that it has the right length:

> sensor.data.xt$riskcatg <- NA
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 73] <- "critical"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 72.7 & sensor.data.xt$sumx < 73] <- "abnormal"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx < 72.7] <- "normal"

  t   s1   s2    s3  sumx    meanx riskcatg
0 0 24.3 20.2 28.10 72.60 24.20000   normal
1 1 24.7 20.1 28.00 72.80 24.26667 abnormal
2 2 25.2 20.1 27.85 73.15 24.38333 critical
3 3 25.0 20.1 27.70 72.80 24.26667 abnormal


Instead of using strings we turn riskcatg into a categorical variable, a so-called factor:

> sensor.data.xt$riskcatg <- factor(sensor.data.xt$riskcatg)
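
We can check the levels that R derived from the strings:

> levels(sensor.data.xt$riskcatg)
[1] "abnormal" "critical" "normal"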



TIP: The statement
variable[condition] <- expression
is very powerful and useful for data manipulation.
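
For example, to clip the negative values of a vector to zero:

> v <- c(-2, 5, -1, 7)
> v[v < 0] <- 0
> v
[1] 0 5 0 7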