Correlation



What is a correlation?
A correlation is a kind of relationship between two variables. It is expressed as a correlation coefficient between -1 and 1 which measures to which degree the variables have a linear relationship with each other. The higher the absolute value of the correlation coefficient, the stronger the relationship.


What types of correlation exist?
- A value of 0 means that there is no (linear) relationship between the variables (a so-called zero order correlation)
- A positive value means a positive relationship, that is, one variable grows as the other one grows
- A negative value means a negative relationship, that is, one variable falls as the other one grows.
- There is even a standard published by the Political Science Department at Quinnipiac University defining the following terms for the absolute values of correlation coefficients:

Value of the correlation coefficient (absolute)
0.00:        no relationship
0.01 - 0.19: none or negligible
0.20 - 0.29: weak
0.30 - 0.39: moderate
0.40 - 0.69: strong
0.70 - 1.00: very strong


Examples:
- Aristotle: "The more you know, the more you know you don't know" (a positive relationship between what you know and what you know you don't know).
- Walt Disney: "The more you like yourself, the less you are like anyone else, which makes you unique." (a negative relationship between how much you like yourself and how much you are similar to others (don't overdo it))
- If I wish you all the best, what is left for me? (There should be no relation between what I wish you and what is left for me (I really hope so!))
- Further examples can be found in this nice graphic by DenisBoigelot.


How is the correlation coefficient calculated?
There are different definitions of correlation. The most famous correlation coefficient is the so-called Pearson product-moment correlation coefficient, or simply Pearson's correlation coefficient. Assume we have two variables $X$ and $Y$. The correlation coefficient $\rho_{X,Y}$ or $corr(X,Y)$ is calculated by $$\rho_{X,Y} = corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$ where $\sigma_X$ and $\sigma_Y$ are the standard deviations, $\mu_X$ and $\mu_Y$ are the means of the variables, $E$ stands for the expected value and $cov$ stands for the covariance.
If $X$ and $Y$ consist of indexed samples $x_i$ and $y_i$ for $i = 1..n$, we can rewrite the formula as $$\rho_{X,Y} = corr(X,Y) = \frac{n\sum{x_iy_i} - \sum{x_i}\sum{y_i}}{\sqrt{n\sum{x_i^2}-(\sum{x_i})^2}\sqrt{n\sum{y_i^2}-(\sum{y_i})^2}}$$

You see immediately that the correlation coefficient is symmetric, which is nice; however, it also reveals an important limitation: you cannot conclude causation from correlation! Your water consumption has a strong correlation with the outside temperature, but on a snowy day you could drink as much as you want, you probably would not raise the outside temperature (if you can, please contact me in winter).


Example:
Assume we have the following example:
> X <- c(1, 3, 4, 7, 8, 23)
> Y <- c(3, 7, 8, 13, 24, 60)

To calculate the correlation in R we use the command
> cor(X,Y)

and get the result $cor(X,Y) = 0.991259$ (the optional parameter "method" defaults to "pearson"; you can also choose "spearman" or "kendall").
To calculate it by hand we would first compute the products $x_iy_i$ and the squares $x_i^2$ and $y_i^2$ and then use the formula mentioned above (verify it with the sketch below!).
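A minimal sketch of this hand calculation in R, reusing the vectors X and Y from above:
> n <- length(X)
> num <- n*sum(X*Y) - sum(X)*sum(Y)
> den <- sqrt(n*sum(X^2) - sum(X)^2) * sqrt(n*sum(Y^2) - sum(Y)^2)
> num/den    # matches cor(X, Y) = 0.991259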

The result of the example states a very strong linear relationship between $X$ and $Y$; we see this in the diagram (including the linear regression line y = -1.191 + 2.655x in red):

In a perfect relationship with correlation coefficient 1 all the data points would lie on a straight line.

What else is good to know?
- A correlation coefficient cannot tell you whether the correlation is significantly different from $0$ (e.g. to reject a hypothesis negating any relationship between the variables). For that you need a test of significance (in R this is the command cor.test(..); see the sketch after this list).
- There are of course other methods to determine correlation, especially for non-linear relationships.
- Partial correlation measures the correlation between two quantitative variables while controlling for one or more further quantitative variables.
- The correlation coefficient is not very robust; a single outlier can change its value considerably.
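A minimal sketch of such a significance test, reusing the vectors X and Y from the example above:
> cor.test(X, Y)    # reports Pearson's r, a p-value and a confidence interval
The p-value tells you how plausible the observed correlation would be if the true correlation were 0.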


Random Forests



As mentioned in the post about decision trees, the big challenge to face is overfitting. To address this issue, a concept developed by Tin Kam Ho was introduced and later named and implemented by Leo Breiman. In this approach, distance from the training data is created (no need for a separate test or cross-validation dataset) by considering randomly chosen subsets, so-called bootstraps. The remaining part, the so-called out-of-bag data (e.g. about one third of the training data), is used to validate the classification. In addition, for the construction of each tree only a randomly chosen subset of the splitting features is taken into account (e.g. one third of the features for regression problems and the square root of the number of features for classification).
The final result of a request is then the aggregated result of all the decision trees. So the approach makes use of the fact that cumulated decisions of a group in general yield better results than individual decisions. The name "random forest" is thereby a nice and intuitive wordplay.

As the different trees are independent of each other, the evaluation of a random forest can be parallelized. It also reduces the high variance that is often created by a single decision tree. For further information about the advantages and disadvantages of a random forest I refer to Leo Breiman's site https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm.

R has its own package "randomForest" for random forests.
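A minimal sketch using this package on R's built-in iris data set (chosen here purely as an example; any labelled data frame works):
> library(randomForest)
> set.seed(42)                                        # for reproducible bootstraps
> rf <- randomForest(Species ~ ., data = iris, ntree = 500)
> print(rf)                                           # shows the out-of-bag (OOB) error estimate
> predict(rf, iris[1:3, ])                            # classify new (here: known) datasets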

R - Simple Data Preparation

 
R covers a huge variety of functions for data manipulation, reorganization and separation. Here are some of the commands I consider the most useful when preparing data.

I will start with an example:
Imagine that we have an IoT scenario in which sensor data from the temperature sensors s1, s2 and s3 are collected in an edge unit before a reduced set of data is sent to the central unit. Each sensor sends data after a specific amount of time t in (milli)seconds.

s1:
t = 0: temp = 24.3
t = 1: temp = 24.7
t = 2: temp = 25.2
t = 3: temp = 25.0

s2:
t = 0: temp = 20.2
t = 1: temp = 20.1
t = 2: temp = 99.9
t = 3: temp = 20.1

s3:
t = 0: temp = 28.1
t = 1: temp = 28.0
t = 2: temp =
t = 3: temp = 27.7

To get the data into R we create vectors holding the values for the sensors. We see that the value at t = 2 for sensor s3 is missing, therefore we add "NA" here in order to tell R that we do not have the value:
> s1 <- c(24.3, 24.7, 25.2, 25.0)
> s2 <- c(20.2, 20.1, 99.9, 20.1)
> s3 <- c(28.1, 28.0, NA, 27.7)

We can furthermore create a dataframe holding these measurements by
> sensor.data <- data.frame(s1, s2, s3)


What we get is the following output for sensor.data:
   s1   s2   s3
1 24.3 20.2 28.1
2 24.7 20.1 28.0
3 25.2 99.9   NA
4 25.0 20.1 27.7
 
We do not like the row identifier that R sets up by default; we have our own identifier using the values for t. We just have to tell R that it should use these values, which we do with the argument row.names:
> t <- c(0, 1, 2, 3)
> sensor.data <- data.frame(t, s1, s2, s3, row.names=t)

  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2 99.9   NA
3 3 25.0 20.1 27.7

First of all we want to see the data in an easy diagram:
> plot(t, sensor.data$s1, type="b", pch=1,
     col="red", xlim=c(0, 5), ylim=c(18, 100), 
     main="Sensor Data", lwd=2, xlab="time", ylab="temperature")
> lines(t, sensor.data$s2, type="b", pch=5, lwd=2, col="green")
> lines(t, sensor.data$s3, type="b", pch=7, lwd=2, col="blue")

We see that the value 99.9 seems to be a wrong measurement (we assume this here in order to see how to manipulate the data; in practice an analysis is required to check the background of this outlying value):

> sensor.data[, 2:4][sensor.data[, 2:4] == 99.9] <- NA
 
  t   s1   s2   s3
0 0 24.3 20.2 28.1
1 1 24.7 20.1 28.0
2 2 25.2   NA   NA
3 3 25.0 20.1 27.7
 
The plot cannot handle NA values, therefore the line for s3 appears incomplete. We can identify NA values using the command is.na(v) for a vector v.
We want to replace the missing values for sensors s2 and s3 by the means of the neighbouring values (this is also an assumption that should be validated carefully):

> sensor.data$s2[3] <- (sensor.data$s2[2] + sensor.data$s2[4])/2
> sensor.data$s3[3] <- (sensor.data$s3[2] + sensor.data$s3[4])/2
 
  t   s1   s2    s3
0 0 24.3 20.2 28.10
1 1 24.7 20.1 28.00
2 2 25.2 20.1 27.85
3 3 25.0 20.1 27.70
 
Now we decide to extend our data frame to also hold the sum and the mean of the sensor values at every point in time. We do that with the command:

> sensor.data.xt <- transform(sensor.data, sumx = s1 + s2 + s3, meanx = (s1 + s2 + s3)/3)
 
  t   s1   s2    s3  sumx    meanx
0 0 24.3 20.2 28.10 72.60 24.20000
1 1 24.7 20.1 28.00 72.80 24.26667
2 2 25.2 20.1 27.85 73.15 24.38333
3 3 25.0 20.1 27.70 72.80 24.26667
  
Next we decide to classify a record as critical if the sum is at least 73.0, as abnormal if it is at least 72.7 (but below 73.0), and as normal otherwise. So we create another variable in our data frame:

> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 73] <- "critical"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx >= 72.7 & sensor.data.xt$sumx < 73] <- "abnormal"
> sensor.data.xt$riskcatg[sensor.data.xt$sumx < 72.7] <- "normal"

  t   s1   s2    s3  sumx    meanx riskcatg
0 0 24.3 20.2 28.10 72.60 24.20000   normal
1 1 24.7 20.1 28.00 72.80 24.26667 abnormal
2 2 25.2 20.1 27.85 73.15 24.38333 critical
3 3 25.0 20.1 27.70 72.80 24.26667 abnormal


Instead of using strings we turn riskcatg into a categorical variable (a factor) by

> sensor.data.xt$riskcatg <- factor(sensor.data.xt$riskcatg)



TIP: The statement
variable[condition] <- expression
is very powerful and useful for data manipulation.
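A tiny sketch of this pattern on an assumed vector, capping negative readings at zero:
> v <- c(3, -1, 5, -2)
> v[v < 0] <- 0
> v
[1] 3 0 5 0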

R - Packages and DataTypes



Packages
Functionality is maintained in packages. Some packages are part of the basic functionality and come predefined when you install R, others have to be installed (and loaded) before use.
To find out the directories of your packages, enter the command .libPaths(); to list the already loaded packages use search(). The commands installed.packages(), install.packages("..") and update.packages() are self-explanatory. To load a package use library(..) or require(..).
To show the content of a package use the command help(package="..").
In RStudio this is easy, as you see the packages in a separate pane.


Data Types
  • Scalar A scalar is a single-value vector (numeric, logical or character value). Example: s <- 3
  • Vector A vector is a collection of scalars of the same type; to combine them use c(..). Example: v <- c(1, 2, s, 4, 5, 6, 7, 8)  (s from above)
  • Matrix A matrix is a collection of vectors; all elements have the same type. Example: m <- matrix(v, nrow=2, ncol=4, byrow = TRUE)  (v from above; note that byrow determines whether the values of v are filled in by row or by column (the default))
  • Array An array is a collection of matrices; all elements have the same type. Example: a <- array(v, c(2,2,2))
  • DataFrame A data frame is a matrix-like structure that can hold elements of different types. As your data will usually be a mix of different types, this is the most used data type in R. Example: df <- data.frame(column1, column2, ...) where the columns are vectors of the same length that can be of different types. Set the column names with names(df) <- c("x", "y")
  • Factor A factor is a nominal (categorical) or ordinal (ordered categorical) variable. Example: Yes/No are categorical values, Small/Medium/Large/XLarge are ordinal values
  • List A list is an arbitrary collection of other data types. Example: l <- list(s, v, m, a, df)  (variables from above). To name the elements use list("scalar"=s, ...). To access an element use double brackets with an index [[1]] or a name [["scalar"]]
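Putting the examples from this list together into one runnable snippet (the data-frame columns and factor levels are assumed example values):
> s  <- 3                                             # scalar (a vector of length 1)
> v  <- c(1, 2, s, 4, 5, 6, 7, 8)                     # vector
> m  <- matrix(v, nrow = 2, ncol = 4, byrow = TRUE)   # matrix, filled by row
> a  <- array(v, c(2, 2, 2))                          # array
> df <- data.frame(x = c(1, 2), y = c("a", "b"))      # data frame with mixed types
> f  <- factor(c("Small", "Large", "Medium"),
               levels = c("Small", "Medium", "Large"), ordered = TRUE)   # ordinal factor
> l  <- list(scalar = s, vector = v, matrix = m)      # list; access via l[["scalar"]]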

Null-Hypothesis, Z-Scores and Normal Distribution


This is Carl Friedrich Gauß, one of the most friendly looking German mathematicians, on a banknote. Nowadays Germany forms part of the European Union and uses the (by the way much more secure) euro banknotes (a different story). He earned that honor because of his widely used contributions to algebra, geometry and astronomy. If you have a close look at the banknote you will even spot the "Gaussian bell curve". This is what I want to talk about in this post.
We saw in this post about the null hypothesis and p-values that in order to accept an alternative hypothesis, we have to find a way to reject a null hypothesis. I just mentioned the $p$-value, however I did not yet explain how to calculate it.

For every value $x$ in the data set we get its corresponding z-score by subtracting the average and dividing by the standard deviation: $$z = \frac{x - \bar{x}}{s}$$ where $\bar{x}$ is the average and $s$ the standard deviation of the data set.

Instead of the original data set we can then work with the set of standardized values. You might wonder why we should do this: under the null hypothesis the standardized values have an average of $0$ and a standard deviation of $1$ (given that the data set holds enough data; for fewer than 30 values a slightly different assumption is used)! This means that we would expect the observed values to follow the normal distribution (see picture below). In particular, around $68\%$ of the values should lie between $-1$ and $1$, around $95\%$ of the values between $-2$ and $2$, and only very few values should lie outside $-3$ and $3$:
Actually the bell curve does not end at $-4$ or $4$, but the values get very close to zero very fast.
The actual formula of the normal distribution for mean $\mu$ and variance $\sigma^2$ is
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Here is the R code:
x <- seq(-4, 4, length = 10000)
y <- dnorm(x, 0, 1)
plot(x, y, type="n", xlab = "x", ylab = "y", main = "Bell Curve", axes = TRUE)
lines(x, y, col="red")

To calculate the $p$-value in R, we can use the command
pnorm(...)

What we get out of it:
For every value $x$, the $y$ value gives the density, i.e. how frequently this value occurs in a normally distributed data set. We can now check the probability of a result at least as extreme as our (standardized) observation. We have to choose between a one-tailed and a two-tailed test, meaning we put our decision boundary on one side or on both sides (split up) of the bell curve (details in a different post). If this probability is lower than the significance level we specified, we decide to reject the null hypothesis.
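A minimal sketch of this calculation, assuming an observed z-score of 2.1 and a two-tailed test:
> z <- 2.1
> p <- 2 * pnorm(-abs(z))   # two-tailed p-value under the standard normal distribution
> p                         # ~0.036, below alpha = 0.05, so we would reject the null hypothesis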

Null Hypothesis, P-Value and Significance Level




What is the null hypothesis?
That's easy: it is the assumption that an observation is simply due to chance. The contrary assumption, that an observation is NOT due to chance, is called the alternative hypothesis.

What is it used for?
It is used to formulate a data science problem in its contrary form: Could it be that there is something beyond chance in my observation? Can I build up confidence that there is "something special" in the data (the alternative hypothesis)?
To answer the question, we use a classical mathematical trick: we assume that the null hypothesis is true, so we work on purely random observations. Knowing the rules of chance, we can then calculate the probability of a result at least as extreme as the observed value, the so-called $p$-value, given that the null hypothesis is true. If this probability $p$ is high, then it is not a good idea to reject the null hypothesis. On the other hand, if $p$ is low, then we would say that the assumption is not plausible and we have found a way to reject the null hypothesis - we have gained confidence that the alternative hypothesis is valid.

When do we reject the null hypothesis?
If the $p$-value falls below a critical level, the so-called significance level $\alpha$. Usual values are $\alpha = 0.05$ or $\alpha = 0.025$. In this case the probability is too low to retain the null hypothesis; the deviation from it is statistically significant.
Assume that the probability of the observation under the null hypothesis is only around 1%; in this case we are pretty confident (99%) that we should reject the null hypothesis. Therefore the value $q = 1 - p$ (here 99%) is also called the confidence.

If you are interested in how to calculate the $p$-value using the $z$-values, check out this post about the z-scores.



R - How to Start

 

Welcome to a new challenge - an empty green meadow is waiting to be explored and developed!

R is a pretty strange programming language; it does not follow the usual conventions and is not intuitive, at least not in the beginning. I am familiar with quite a few programming languages, however R is nothing like them (in which other programming language do you prefer the assignment operator "<-" over "="?). However, as R is currently considered the reference program for statistics (even more popular than SPSS) and in addition is open source, it is worth looking at it.

Here are some tips if you think about starting to learn R:

- Install R first and play around with the console. You will find out that it is not very handy. E.g. R does not require line delimiters (";", ".") after statements, R only permits a maximum line length of 80 characters, ...
- After you have found out that R is quite strange, get RStudio (e.g. here). RStudio is really helpful: it provides shortcuts for the most used commands, a simple structure and a pretty nice user interface.
- Make yourself familiar with Google's R Style Guide, so you do not make a fool of yourself chatting with experts. You will also learn a lot about R's specialities (so far I considered "." a bad choice for a character in identifiers as you expect it to point to a sub-attribute, however in R it is accepted and "_" is the bad choice...)
- Remember the command "rm(list = ls())", which clears the current workspace (yes, there is a current workspace in which locally created variables live!)
- Look for free online courses (there are tons of them)
- Get familiar with the shortcuts "ALT + -" and "Ctrl + L"
- Find information on available packages on the CRAN sites; have a look especially at the crantastic page that allows you to search for (popular) packages
- Use the predefined data sets in R (see "data()" for an overview); they are often used in examples and it feels good to already know them
- Use command "View(..)" regularily on your data to get a clean picture of it
- When you use "require("packageName")", remember to use "detach("package:packageName", unload = TRUE)" at the end (use those commands together in RStudio)
- Use the command demo() to get an overview of the demos included in R. To execute a demo, use the same command with one of the given arguments (e.g. demo(colors)).
- Use fix() on a data frame to correct values manually
- Use transform() and the powerful (s)apply() on your data frames

This post is updated regularly.
 



Lift

 
As its name indicates, the lift is a measure of how much your binary classification model lifts the predictions. In other words, it measures how much better the new model's choice is compared to an old model's choice or a random selection. A lift plot is an alternative to a ROC curve when you want to compare two classifiers; it provides insights into the model and can help to determine a cutoff.

Let's start with the definition of lift:
The lift measures the change in concentration of a target value when the model is applied to select a subgroup of the test set. Note that the lift is always relative to the concentration of the target value in the whole test set: the lower that concentration, the higher the possible lift of the model. Therefore there are no general restrictions on the values of the lift.

In an example:
Imagine that you want to improve a marketing campaign which addresses customers in order to sell new products or services. From the past campaign you build a predictive model trying to identify the customers who are likely to respond. In the past campaign 4% of the addressed customers responded. If you now choose a random subset of 10% of the addressed customers, you would expect 4% responders in there.
With your new model you can identify probable responders, so you choose the 10% of customers most likely to respond to the campaign. If 16% of these customers respond, then your classifier has a lift of 16 / 4 = 4 at this point.
Now you can calculate the expected responses when addressing 20%, 30%, ... and you can plot the data in a so-called lift chart, which could look like this (the yellow line corresponds to the random classifier, the red one to the fictitious model):



These lift charts can be built for different classifiers in order to compare them. Also, from the shape of the curve a sensible maximum number of addressed customers can be chosen in order to optimize the costs of the campaigns.
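A minimal sketch of how such a lift chart can be computed from predicted scores (the data below is simulated purely for illustration, with roughly 4% responders overall):
> set.seed(1)
> score    <- runif(1000)                             # predicted response probabilities
> response <- rbinom(1000, 1, prob = score * 0.08)    # actual responses, ~4% overall
> ord      <- order(score, decreasing = TRUE)         # address the most likely customers first
> frac     <- seq(0.1, 1, by = 0.1)                   # top 10%, 20%, ...
> lift     <- sapply(frac, function(f) {
    top <- ord[1:floor(f * length(ord))]
    mean(response[top]) / mean(response)              # concentration in subgroup vs. overall
  })
> plot(frac, lift, type = "b", xlab = "fraction of customers addressed", ylab = "lift")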





The ROC curve


 The ROC curve (receiver operating characteristic curve) is a graphical illustration that can be used to visualize and compare the quality of binary classifiers.

Recall that for a classification experiment we can build the confusion matrix. In it we see the values of the true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). For the ROC curve we need the TP rate on the y-axis and the FP rate on the x-axis. As the TP rate is also called sensitivity and the FP rate equals 1-specificity, the ROC curve is also called the sensitivity vs. (1-specificity) plot.

We define the ROC space to be the unit square spanned by the FP rate (x-axis) and the TP rate (y-axis):
Now run your classifier, calculate sensitivity and specificity, and place the data point corresponding to your test into the ROC space. The closer it is to the upper left corner, the better your classifier is. If you have a 100% correctly predicting classifier - congratulations, you will find your classifier directly at (0,1).


Now to create the ROC curve, begin with the examples that are easiest to classify as positive, then stepwise include more examples. Every time the classifier correctly classifies a dataset as positive, the line goes up; every time the classifier incorrectly classifies a dataset as positive, the curve moves to the right (for small datasets you will actually not get a smooth curve but a zigzag line).

What can you read from the ROC curve?
 
A random classifier would create equal proportions of true positives and false positives (independent of the number of actual positives/negatives). So a random classifier would yield the yellow line.
The perfect classifier would not create any false positives, so we get sensitivity 1 already at a false positive rate of 0 (specificity 1); this would be the green line in the ROC space.
A non-trivial classifier would lie somewhere in between (red line). Note that a bad classifier would lie below the random line and could be turned into a better classifier by simply inverting its predictions.

What else can be said about the ROC curve:
- Unlike the cumulative gains chart, the ROC curve does not depend on the density of the positives in the test dataset.
- Sometimes the area under the ROC curve ("Area Under the Curve", AUC) is calculated in order to give a classifier a single comparable number. A bad classifier has an AUC close to 0.5 (random classifier), while a good classifier has a value close to 1.
Note that this approach to comparing classifiers has been questioned in machine learning research, among other reasons because the AUC seems to be a quite noisy measure.
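A minimal sketch of how the points of an ROC curve can be computed from scores (the scores and labels below are assumed example data):
> scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)   # predicted probabilities
> labels <- c(1, 1, 0, 1, 0, 1, 0, 0)                    # actual classes
> ord <- order(scores, decreasing = TRUE)                # start with the "easiest" positives
> tpr <- cumsum(labels[ord] == 1) / sum(labels == 1)     # sensitivity
> fpr <- cumsum(labels[ord] == 0) / sum(labels == 0)     # 1 - specificity
> plot(c(0, fpr), c(0, tpr), type = "b",
       xlab = "1 - specificity", ylab = "sensitivity", main = "ROC curve")
> abline(0, 1, col = "yellow")                           # the random classifier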

Classical Decomposition of a time series

A time series is a data set that has a time component. Yes, it is just what you think: in the optimal case you have one value per fixed time interval. It is a pretty current topic to create massive datasets from sensors in an Internet of Things scenario, and you usually get too many values (>100 measurements per second), so creating a useful time series requires some preprocessing (filtering or averaging). Missing values can be a problem, although it is usually not difficult to find a good estimate.

You can therefore build a time line which could look like this:


Here I used R's default dataset "AirPassengers" which reflects the monthly international airline passenger numbers in the years 1949 to 1960.

Natural questions are: 
  • Can you see which of the years were good, which of them were bad?
  • Can you split up the time series into more homogeneous components?
  • Can you predict the values for future months? And how good are these predictions?
Time series analysis does not seem too complex, however in reality combinations of models are often required to make good predictions. In this post I want to show you how the standard approaches work. In reality, especially in retail, the timelines are usually not that easy to handle, as they are widely influenced by customer reviews.

The central idea is that a time series $Y = (Y_t)$ is a combination of three independent sub-time-series:
- A trend component T: a long-term tendency in the data; it does not have to be linear.
- A seasonal component S: a pattern that recurs regularly after a fixed period (like every summer, every January or every day at 10:30).
- A random component I, also called irregular or noise.

We want to find these three time series in the example mentioned above. First we have to decide on the type of decomposition; we can choose between additive and multiplicative.
In an additive model we add up the 3 sub time series to get the original time series: $$Y_t = T_t + S_t + I_t$$ You should use it when the seasonal variance does not change much.

In a multiplicative model we multiply the 3 sub time series: $$Y_t= T_t * S_t * I_t$$ Use it when you see the peaks growing with time, as in the earlier mentioned example of airplane passengers. Here we should go for a multiplicative model.



Tip: A multiplicative model can often be changed into an additive model using the log function.
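A minimal sketch of this trick with the AirPassengers series: the logarithm of a multiplicative series behaves additively.
> plot(decompose(log(AirPassengers), type = "additive"))   # the seasonal swings are now roughly constant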

How would we get the values for the trend, seasonal and random sub time series? We will go step by step; as motivation, here is the result calculated by R with the function decompose:


Here is the corresponding code in R:
> plot(decompose(ts(AirPassengers, frequency = 12, start = 1949), type = "mult"))

To go on:

1. Here is how to determine the trend component
2. Here is how to determine the seasonal and random component
3. Here is a summary on the classical decomposition of time series

Seasonality and Random Determination

In this post we saw how a time series decomposes into the three components trend, seasonality and random. How to extract the trend was shown there; now we focus on how the seasonal component and the random component are determined.
Assume we have a detrended time series (we take the AirPassengers time series and remove the trend). We assume a seasonality with a fixed period. In reality the assumption of a fixed seasonality is often too strict, as the period could shorten or change its structure over time. But under this assumption the determination of the seasonality is easy: to get the seasonal value for January, we take all January values and average them. This gives the pattern we use for all periods.
The last step is to determine the random component $I$; we get it by simply removing the trend $T$ and the seasonal component $S$ from the original time series $Y$: in an additive model this is $I_t = Y_t - T_t - S_t$ and in a multiplicative model $I_t = Y_t / (T_t * S_t)$.
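A minimal sketch of these steps for the (multiplicative) AirPassengers example; note that decompose() additionally normalizes the seasonal indices, which is omitted here:
> library(forecast)                                       # for ma()
> trend     <- ma(AirPassengers, order = 12, centre = TRUE)
> detrended <- AirPassengers / trend                      # remove the trend
> seasonal  <- tapply(detrended, cycle(AirPassengers), mean, na.rm = TRUE)    # average per month
> random    <- AirPassengers / (trend * seasonal[cycle(AirPassengers)])       # what remains is the noise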
In our example, this is how the random component looks:
What can we get out of it?
The random component shows the noise in the data, the values that do not fit the model. It helps to get a feeling for how well the data is explained by the assumption of having a trend and a seasonality. The classical decomposition can also help to find outliers, which show up as high peaks.

For completeness here again the whole picture holding all the steps discussed:



 

Trend determination with moving averages

We already saw in the previous post that you can decompose a trend from a time series. Here is a classical approach to determine an underlying trend.
 
A trend $T$ is actually a smoothed version of a time series; it helps to capture global tendencies. To get a trend line, here is what you have to do:
  •  Determine the seasonal periodicity of the time series (if there is one). These periodic patterns are usually visible, but if you cannot see them from the plotted chart, there are also methods using Fourier transform algorithms to determine them. In our example we see a yearly periodicity; as the values for the airplane passengers come in monthly, the period is m=12.
  • With this number m, use methods like the moving average of order m to determine the values of the trend

Now what is the moving average?
Once you understand the concept, it is easy to remember: imagine that your dataset consists of 5 values $y_1$, ..., $y_5$. To determine the value of the trend of order $m = 3$ at a point, you take the original value, the value of the predecessor and the value of the successor and average them. In the simplest approach you simply take the sum of the values divided by the number of values (the so-called simple moving average, SMA).
In the example of 5 values this would look like:
           
$y_1$      3
$y_2$      5     -> (3+5+4)/3 = 4
$y_3$      4     -> (5+4+1)/3 = 3.33
$y_4$      1     -> (4+1+3)/3 = 2.67
$y_5$      3



You already see the problem here: at the beginning and the end there are no trend values available; the length of the missing tail depends on $m$.

What if we choose $m=4$? We would have to decide whether to take more points from the past than from the future. In these cases the algorithm is not symmetric anymore; usually you therefore either change to the next odd number (e.g. $m=13$ instead of $m=12$) or you choose a so-called centered moving average. In the centered moving average you first use a simple moving average of order 2 to determine values like $y_{1.5}$, and then you apply the SMA of order $m$ to those.

$y_1$      3
                           -> $y_{1.5}$ = (3+5)/2 = 4
$y_2$      5
                           -> $y_{2.5}$ = (5+4)/2 = 4.5
$y_3$      4                                                                           ->  (4+4.5+2.5+2) / 4 = 3.25
                           -> $y_{3.5}$ = (4+1)/2 = 2.5
$y_4$      1
                           -> $y_{4.5}$ = (1+3)/2 = 2
$y_5$      3

In our earlier example of air passengers we determined $m=12$ (even) and the trend values in yellow are determined by a centered moving average of order 12.

Here is the command in R (ma stands for moving average and is provided by the forecast package):
> library(forecast)
> lines(ma(AirPassengers, order = 12, centre = TRUE), col = "yellow")

After successfully determining the trend, we can remove it from the original data. In an additive model we get the detrended time series by subtracting it ($Y_t - T_t$), in a multiplicative model by dividing ($Y_t / T_t$). The detrended time series of our AirPassengers example looks like this:
The next step is to extract the seasonal component and the random component from the detrended time series.





Classical Decomposition - Summary


The classical decomposition of a time series can help to get an overview of the tendencies (trend component), periodic patterns (seasonal component) and the quality of the model (random component). In addition it helps to identify outliers in a time series.

To forecast a time series it is often useful to have a decomposition and to forecast each of the components separately. A seasonal component would just be repeated constantly (naive forecast), while exponential smoothing methods could be used to forecast the trend and random components.

On the other hand, the classical decomposition has some disadvantages: we saw in this post that the trend, and therefore also the random component, cannot be determined at the beginning and at the end of a time series. We also saw in that post that it relies on the assumption of a stable period with a pretty constant pattern. In reality this is often not the case: e.g. 100 years ago energy consumption was high in winter due to heating; now it is equally high in summer due to air conditioning.

To overcome these limitations, other decomposition methods have been developed, see for instance Seasonal and Trend decomposition using Loess (STL, 1990). I will describe it in a new post.





I hope you liked this and got a picture of the classical decomposition. I really enjoyed building up this example and encourage you to comment and extend it.

Sensitivity and Specificity


Apart from accuracy and precision there are other measures for classification models. Today we will focus on another pair of measures, called sensitivity and specificity. Like accuracy and precision they are numbers between 0 and 1, and the higher the values, the better.
A perfect classification model would have 100% sensitivity and 100% specificity.

Before defining these values, we recall the confusion matrix:



Now

Sensitivity = True Positives / Actual Positives 

In other words sensitivity describes the probability that a positive is recognized as such by the model, therefore sensitivity is also often called true positive rate.

Analogously

Specificity = True Negatives / Actual Negatives

In other words, specificity describes the probability that a negative is recognized as such by the model; therefore specificity is also often called the true negative rate.



As an example, let's assume that we have a binary classifier for cat and dog pictures. We test it with 100 pictures, 50 cat pictures and 50 dog pictures. Our classifier, however, erroneously classifies 6 cats as dogs and 2 dogs as cats.
We would have the following confusion matrix:
             
                   predicted Cats   predicted Dogs   total
actual Cats              44                 6           50
actual Dogs               2                48           50

The sensitivity would be 44/50 = 88%, the specificity 48/50 = 96%.
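A minimal sketch of this calculation in R, using the confusion matrix from the example:
> conf <- matrix(c(44, 6, 2, 48), nrow = 2, byrow = TRUE,
                 dimnames = list(actual = c("cat", "dog"), predicted = c("cat", "dog")))
> conf["cat", "cat"] / sum(conf["cat", ])   # sensitivity: 44/50 = 0.88
> conf["dog", "dog"] / sum(conf["dog", ])   # specificity: 48/50 = 0.96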


Where are these values used?
A common way to compare the quality of different classifiers is to use a receiver operating characteristic curve or ROC curve, in which the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity). It owes its quite complicated name to its first use during World War II to detect enemy objects on the battlefield. Every test result or confusion matrix represents one point in the ROC curve. But this will be a separate post...





Averages

 

Averages are a bad idea. Averages hide characteristics, individual properties and specialities behind one value. Nobody wants to be average; in fact people tend to feel bad if their individual properties are not considered (no vegetarian meal in a restaurant). On the other hand, spotting individual strengths and addressing them is a reliable approach to increase response, e.g. in a marketing campaign. Why would anyone even consider writing about averages?

Because averages are useful!

Averages CAN hide characteristics and individual properties behind one value. In the right places they can contribute to massive simplifications and effective classifications and help to compare results. Averages help in regression and prediction and in the construction of clusters and decision trees, the most widely used data mining methods. They can be used for smaller subsets and applied at bigger scales. Therefore they are essential in data science.

From a mathematical point of view there are a lot of different ways to define averages. The most common average ("the average of the averages") is the arithmetic average, which sums up all of the values and divides the sum by the number of values (we only focus here on averages used in practice; infinite series are nice, but a different field!). Another average is the geometric average of n numbers, for which all n values are multiplied and the nth root is taken, and there is the harmonic average, which however is not used that often.

In statistics the weighted average (or weighted mean) is of greater importance; in it the different values are given a certain weight (usually the sum of all weights is 1). It does not treat all the values in the same way: some values are more important (higher weight) than others.
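A minimal sketch of these averages in R (the values and weights are assumed examples):
> x <- c(2, 4, 8, 16)
> mean(x)                                      # arithmetic average
> exp(mean(log(x)))                            # geometric average (nth root of the product)
> length(x) / sum(1 / x)                       # harmonic average
> weighted.mean(x, w = c(0.4, 0.3, 0.2, 0.1))  # weighted average, weights sum to 1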

What does this have to do with the small town Haßloch in the Bad Dürkheim district in Germany?
Haßloch is an average. The distribution of its inhabitants considering age, income, education and home size is representative for Germany; in other words, the averages of these values for the whole of Germany are pretty close to the values in Haßloch. Therefore Haßloch is the ideal place to test and analyze new product launches. Products that fail in Haßloch will not get released onto the German market, and lots of successful products and packagings have made their way through the supermarkets of Haßloch.


By the way, the average German prefers the color blue, and the average German man is 1.78 m tall, weighs 82.4 kg and has blond or dark blond hair. You want to know if you can recognize the average German girl (and those of other nations)? Find out here.



Decision Trees


Decision trees are one of the most important techniques for data analysis. They are easy to understand, can illustrate complex rule sets and are an effective method to classify new datasets.

Are there any complications about decision trees? Sure there are!

There are quite a lot of algorithms for building decision trees. The common understanding of building them by following known rules is not the principal way decision trees are used in data science. In data science the rules are usually unknown, meaning the split of the data into the branches and the determination of the leaves is not known from the start. A good algorithm increases the purity with every split, meaning that the disjoint data subsets after a split are purer regarding a target variable than the data set before the split. In addition to determining and calculating purity, the different algorithms also concentrate on questions around the minimal size of leaves and the number of splits, as one major problem of decision trees is overfitting.

So how do you measure purity?


The simplest way to measure the purity of a split in a classification decision tree would be to measure the proportion of the target variable in the created subsets and choose the split that generates the minimal proportion in one of the subsets.


I found further splitting criteria in Linoff/Berry's book on Data Mining Techniques (by the way a very good and understandable reference book for data mining):

The Gini measure simply sums up the squares of the percentages of the target variable's values in a subset and assigns this value to the node. A bad split would not considerably change the proportions of the values and would therefore have a Gini measure around 1/n, if n is the number of different values of the target variable. A good split, however, would have a Gini score close to 1 for the subsets. To get the score of the split, the Gini scores of the subsets are added up, weighted by the sizes of the subsets.
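A minimal sketch of this Gini score as just described (the function names are made up for illustration):
> gini.node <- function(target) sum((table(target) / length(target))^2)   # sum of squared proportions
> gini.split <- function(target, left) {          # 'left' is a logical vector defining the split
    n <- length(target)
    (sum(left) / n) * gini.node(target[left]) +
    (sum(!left) / n) * gini.node(target[!left])   # node scores weighted by subset size
  }
> gini.split(iris$Species, iris$Petal.Length < 2.5)   # example on the built-in iris data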

The Chi-square test can be used to measure how good a split is. It measures how likely the split is due to chance, calculating the expected and observed counts of each target value in every subset after the split. For each subset the value is calculated as the sum over all target values of (observed - expected)^2 / expected, and the subset values are summed up to get a measure for the split. It is used in the Chi-Square Automatic Interaction Detector (CHAID) algorithm for decision trees, where it chooses the best split in every step of the decision tree algorithm.

Accuracy and Precision - How good is your classifier?


After successfully deciding on a classifier and after adjusting and optimizing the parameters on the training data, it is time to evaluate the model. In the context of model evaluation, a few statistical terms are important, most importantly:

Accuracy and Precision

Colloquially both have pretty much the same meaning; however, in statistics they describe different properties. They are independent of each other: a model can be highly accurate but have low precision, or be highly precise but low in accuracy. Ideally you have a model that is both highly accurate and highly precise.

To understand the definitions, we start with the so-called confusion matrix, which can be created for any classifier or supervised learning model. After building your model you compare the actual results to the results predicted by your model. You then classify the compared results via the following matrix (in case you have more classes than true/false, you can create a confusion matrix for each class):


The true positives (TP) and the true negatives (TN) are the datasets that your model correctly predicted; the higher the values in these boxes, the better your model. Errors in the predicted classifications of your model can be divided into:
- False positives (FP) (also called type 1 errors) are the cases in which your model incorrectly classified a dataset as positive,
- False negatives (FN) (also called type 2 errors) are the ones incorrectly classified as negative.

Accuracy is the ratio of correctly classified datasets to the whole dataset, so
Accuracy = ( True Positives + True Negatives ) / All classified examples.

A high accuracy seems to be a reasonable choice to evaluate your model, however it is not sufficient, as the following example shows: imagine you have a classifier for breast cancer, which in most cases will give a negative result. In fact, for 2017 the US expects a rate of roughly 0.123% of new cases. Imagine that your classifier always predicts a negative result; then your classifier has TP = 0 and TN ≈ 0.9988 (as a fraction of all cases), and therefore an accuracy of about 99.9%, yet we agree that the classifier is not at all useful.
We somehow need to consider the overall positives (P) and the overall negatives (N).

Here is where precision comes into play. Precision is calculated as
Precision =  True Positives / ( True Positives + False Positives )

Now for our dummy classifier the precision would be 0 (it never even predicts a positive), confirming that it is not useful at all.
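A minimal sketch of both measures, given the counts of a confusion matrix (the numbers are assumed purely for illustration):
> TP <- 85; FP <- 10; FN <- 5; TN <- 100
> (TP + TN) / (TP + TN + FP + FN)   # accuracy
> TP / (TP + FP)                    # precision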

There are more relevant key figures, but they will be covered in a later post. For now the confusion matrix is a first hint to determine whether your classifier is useful.







How is the weather in the Black Forest? - Nearest Neighbour Approaches


Planning a city or hiking trip on the weekend, a barbecue or a romantic picnic in the mountains requires a pretty good weather forecast. What would you do if you wanted to know the weather? Probably you would just check the weather forecast for the desired city; these weather forecasts are available online. But what if your route lies in the mountains, without any reference city or village? You would then choose a location near your route for which a weather forecast is available and assume you will get the same weather on your route.
Assume you have a sunny weather forecast for one side of your route and a rainy weather forecast for the other side; you would probably rely more on the forecast closer to the route, but also consider the forecast farther away.
This approach is quite reasonable, assuming that the weather is similar at geographically similar locations. The same reasoning is also applied in the business and research world to forecast key figures like preferences, behaviours and prices. One approach is the nearest neighbour approach, which starts with the assumption that objects "close" to each other behave in the "same" way, so if you have to predict an unknown value, just ask the "nearest" neighbours and use their results.

However, there are some problems with this approach: it might be easy to find neighbours considering numeric values like age, income or location, but how would you find neighbours for, let's say, preferred music?

As so often, there is no general answer, and different ways to find neighbours exist. Usually finding a good neighbour consists of two main tasks:
1. Find a measure for distance between two datasets
2. Combine the datasets for the closest neighbours to make a prediction

Usually you have a dataset consisting of a collection of features; for a customer this could be a so-called Customer ID holding all the relevant data that could possibly influence the target variable. Often age, income and location are part of it. For these features it is easy to determine a distance; usually the difference (direct difference or Euclidean distance) is used. For better comparison, normalization is recommended, as the features have different ranges.

You could then go for the k nearest neighbours, check the known values of their target variables and then combine these results into a target value. Here business knowledge is required to determine to what extent e.g. the age is important or whether location is more relevant than income.

The advantages of this approach are that it is usually pretty easy to understand, often shows additional insights and adapts whenever the data changes.
However, it can be time-consuming, as finding good neighbours means checking every single known dataset and measuring the distance. Also the prediction can depend considerably on the value of k, so further analysis is required. And the algorithm is discontinuous, meaning that a new dataset could have a huge impact on the existing predictions.

To overcome these limitations, different approaches have been established (e.g. reducing the number of datasets by choosing "important" neighbours), and nearest neighbour approaches are widely used in different classification and regression problems (e.g. supporting breast cancer detection, estimating house prices).
And by defining the right distance measure you can even find neighbours of music files or detect song titles.

As a closing example, here is a simplified exercise. Consider the following data:

Target: Money spent online for hobbies per year
Gender   Age   Income   Nof Children   Target
Male      35   65,000        1          6,500
Female    24   25,000        0          2,000
Female    60   60,000        4            380
Male      48   45,000        2          2,100
Female    39   60,000        0          6,000
Male      49   75,000        2          7,500
Male      18      800        0            670

Who are your closest neighbours? And how well does the estimate of the 1-nearest-neighbour model fit you?
For the distance, use the absolute differences in age, income and number of children, with the weights age_weight = 7, income_weight = 10, nof_children_weight = 1.
(Note that this is a very simplified example; in reality the money spent on hobbies would depend on many more factors, and the variables are not independent of each other.)
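A minimal sketch of this exercise in R (the query values in 'me' are assumed; plug in your own):
> data <- data.frame(
    gender   = c("Male", "Female", "Female", "Male", "Female", "Male", "Male"),
    age      = c(35, 24, 60, 48, 39, 49, 18),
    income   = c(65000, 25000, 60000, 45000, 60000, 75000, 800),
    children = c(1, 0, 4, 2, 0, 2, 0),
    target   = c(6500, 2000, 380, 2100, 6000, 7500, 670))
> me <- list(age = 30, income = 50000, children = 1)      # assumed query point
> dist <- 7 * abs(data$age - me$age) +
          10 * abs(data$income - me$income) +
          1 * abs(data$children - me$children)            # weighted absolute differences
> data$target[which.min(dist)]                            # prediction of the 1-nearest neighbour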

Hopefully you get a good approximation, so your next romantic outdoor picnic in the Black Forest does not look like this:

What does a Data Scientist do? - Facts



From the post about CRISP-DM we know what phases a data science project consists of. This post analyzes the time spent on the different tasks. I refer to the "Data Science Survey" of Rexer Analytics from 2015.
  • The top application areas are Marketing, Academics, Finance, Technology, Medical, Retail, Internet-based, Government and Manufacturing.
  • Data scientists work above all as corporate employees, consultants, academics or vendors.
  • Data scientists have a high job satisfaction level.
  • The biggest challenges are the rising data analysis complexity and data visualization.
  • The most frequently used algorithms are regression algorithms, decision trees and cluster analysis, the main tool is R.
  • There is also an impressive list of alternative job titles provided, so apart from data scientist you can also call yourself data analyst, researcher, business analyst, data miner, statistician, predictive modeler, computer scientist, engineer or software developer.