The ROC curve


 The ROC curve (receiver operating characteristic curve) is a graphical illustration that can be used to visualize and compare the quality of binary classifiers.

Recall that for a classification experiment we can build the confusion matrix. In it we see the numbers of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). For the ROC curve we need the TP rate (TP / (TP + FN)) on the y-axis and the FP rate (FP / (FP + TN)) on the x-axis. As the TP rate is also called sensitivity and the FP rate is equal to 1 - specificity, the ROC curve is also called the sensitivity vs. (1 - specificity) plot.

We define the ROC space to be the following area, with the FP rate on the x-axis and the TP rate on the y-axis:
Now run your classifier, calculate sensitivity and specificity and place the data point corresponding to your test into the ROC space. The closer it is to the upper left corner, the better your classifier. If you have a 100% correctly predicting classifier - congratulations, you will find your classifier directly at (0,1).


Now, to create the ROC curve, begin with the examples the classifier is most confident are positive, then stepwise include more examples. Every time the classifier correctly classifies a dataset as positive, the line goes up; every time it incorrectly classifies a dataset as positive, the curve moves to the right (for small datasets you will actually not get a curve, but a zigzag line).
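To make this concrete, here is a minimal sketch in R of how such a zigzag line can be computed; the vectors scores and labels are made-up example inputs, not data from this post:

# Hypothetical classifier scores and actual labels (TRUE = positive)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
ord <- order(scores, decreasing = TRUE)        # start with the most confident positives
tpr <- cumsum(labels[ord]) / sum(labels)       # curve goes up on a correct positive
fpr <- cumsum(!labels[ord]) / sum(!labels)     # curve moves right on an incorrect positive
plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "FP rate (1 - specificity)", ylab = "TP rate (sensitivity)")
abline(0, 1, lty = 2)                          # diagonal of the random classifier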

What can you read from the ROC curve?
 
A random classifier would create equal proportions of true positives and false positives (independent of the number of actual positives/negatives), so a random classifier would yield the yellow diagonal line.
The perfect classifier would not create any false positives, so we get sensitivity 1 already at specificity 1 (false positive rate = 0); this is the green line in the ROC space.
A non-trivial classifier would lie somewhere in between (red line). Note that a bad classifier would lie below the random line and could be turned into a better classifier by simply inverting its predictions.

What else can be said about the ROC curve:
- Unlike the cumulative gains chart, the ROC curve does not depend on the density of positives in the test dataset.
- Sometimes the area under the ROC curve ("Area Under the Curve", AUC) is computed in order to give a classifier a single comparable number. A bad classifier has an AUC close to 0.5 (random classifier), while a good classifier has a value close to 1.
Note that this approach to comparing classifiers has been questioned in recent machine learning research, among other reasons because AUC seems to be a quite noisy measure.
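As a rough illustration, the AUC of the zigzag line from the sketch above can be approximated with the trapezoidal rule, reusing the fpr and tpr vectors:

x <- c(0, fpr); y <- c(0, tpr)
auc <- sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)   # trapezoidal rule
auc   # close to 1 for a good classifier, around 0.5 for a random one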

Classical Decomposition of a Time Series

A time series is a data set that has a time component. Yes, it is just what you think: in the optimal case you have one value per fixed time interval. Creating massive datasets from sensors in an Internet of Things scenario is a very current topic, and you usually get too many values (>100 measurements per second), so creating a useful time series requires some preprocessing (filtering or averaging). Missing values can be a problem, but it is usually not difficult to find a good estimate.

You can then plot it as a time line, which could look like this:


Here I used R's built-in dataset "AirPassengers", which reflects the monthly international airline passenger numbers for the years 1949 to 1960.
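If you want to reproduce the plot yourself, this should be all that is needed (AirPassengers ships with base R):

data("AirPassengers")                          # monthly totals of international airline passengers
plot(AirPassengers, ylab = "Passengers (in 1000s)")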

Natural questions are: 
  • Can you see which of the years were good, which of them were bad?
  • Can you split up the time series into more homogeneous components?
  • Can you predict the values for future months? And how good are these predictions?
Time series analysis does not seem too complex; in reality, however, combinations of models are often required to make good predictions. In this post I want to show you how the standard approaches work. In practice, especially in retail, the time series are usually not that easy to handle, as they are heavily influenced by customer reviews.

The central idea is that a time series $Y = (Y_t)$ is a combination of three independent sub-time series:
- A trend component T: a long-term tendency in the data; it does not have to be linear.
- A seasonal component S: a pattern that reoccurs regularly after a fixed period (like every summer, every January or every day at 10:30).
- A random component I, also called irregular or noise.

We now want to find these three time series in the example above. First we have to decide on the type of decomposition; we can choose between additive and multiplicative.
In an additive model we add up the 3 sub-time series to get the original time series: $$Y_t = T_t + S_t + I_t$$ You should use it when the seasonal variation does not change much over time.

In a multiplicative model we multiply the 3 sub-time series: $$Y_t = T_t * S_t * I_t$$ Use it when you see the peaks growing with time, like in the earlier mentioned example of airplane passengers. Here we should go for a multiplicative model.



Tip: A multiplicative model can often be turned into an additive model by applying the log function.
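In formulas, taking the logarithm of the multiplicative model gives an additive model for the log values: $$\log Y_t = \log T_t + \log S_t + \log I_t$$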

How do we get the values for the trend, seasonal and random sub-time series? We will go step by step; as motivation, here is the result calculated by R with the function decompose:


Here is the corresponding code in R:
plot(decompose(ts(AirPassengers, frequency = 12, start = 1949), type = "mult"))

To go on:

1. Here is how to determine the trend component
2. Here is how to determine the seasonal and random component
3. Here is a summary on the classical decomposition of time series

Seasonality and Random Determination

In this post we saw how a time series can be decomposed into the three components trend, seasonality and random, and how to extract the trend was shown there. Now we focus on how the seasonal component and the random component are determined.
Assume we have a detrended time series (here we take the AirPassengers time series and remove the trend). We assume a seasonality with a fixed period. In reality the assumption of a fixed seasonality is too strict, as the period could shorten or change its structure over time. But under this assumption the determination of the seasonality is easy: to get the seasonal value of January, we take all January values and average them. This is the pattern we use for all periods.
The last step is to determine the random component $I$; we get it by simply removing the trend $T$ and the seasonal component $S$ from the original time series $Y$. In an additive model this is $I_t = Y_t - T_t - S_t$, and in a multiplicative model $I_t = Y_t / (T_t * S_t)$.
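As a minimal sketch of these two steps in R (assuming the multiplicative model for AirPassengers and using the order-12 centered moving average from the trend post as the trend):

library(forecast)                                     # for ma()
trend     <- ma(AirPassengers, order = 12, centre = TRUE)
detrended <- AirPassengers / trend                    # multiplicative model
monthly   <- tapply(detrended, cycle(detrended), mean, na.rm = TRUE)   # average per month
seasonal  <- monthly / mean(monthly)                  # normalize the factors to average 1
random    <- AirPassengers / (trend * rep(seasonal, length.out = length(AirPassengers)))

This is essentially what decompose() does under the hood for a multiplicative model.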
In our example, this is what the random component looks like:
What can we get out of it?
The random component shows the noise in the data, the values that do not fit the model. It helps to get a feeling for how well the data is explained by the assumption of a trend plus a seasonality. The classical decomposition can also help to find outliers, which show up as high peaks.

For completeness here again the whole picture holding all the steps discussed:



 

Trend determination with moving averages

We already saw in the previous post that you can decompose a trend from a time series. Here is a classical approach to determining the underlying trend.
 
A trend $T$ is actually a smoothed version of a time series; it helps to capture global tendencies. To get a trend line, here is what you have to do:
  • Determine the seasonal periodicity of the time series (if there is one). These periodic patterns are usually visible, but if you cannot see them from the plotted chart, there are also methods using Fourier transform algorithms to determine them. In our example we see a yearly periodicity; as the values for the airplane passengers come in monthly, the periodic value is m=12.
  • With this number m, use methods like the moving average of order m to determine the values of the trend.

Now what is the moving average?
Once you understand the concept, it is easy to remember: imagine that your dataset consists of 5 values $y_1$, ..., $y_5$. To determine the value of the trend of order $m = 3$ at a point, you take the original value, the value of its predecessor and the value of its successor and average them. In the simplest approach you simply take the sum of the values divided by the number of values (the so-called simple moving average, SMA).
In the example of 5 values this would look like:
           
$y_1$      3
$y_2$      5     -> (3+5+4)/3 = 4
$y_3$      4     -> (5+4+1)/3 = 3.33
$y_4$      1     -> (4+1+3)/3 = 2.67
$y_5$      3
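In R, one way to reproduce this (a sketch using the base function stats::filter) is:

y <- c(3, 5, 4, 1, 3)
stats::filter(y, rep(1/3, 3), sides = 2)   # NA 4.000 3.333 2.667 NA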



You already see the problem here: at the beginning and the end there are no values available; how many values are missing depends on $m$.

What if we choose an even order like $m=4$? We would have to decide whether to take more points from the past or from the future. In these cases the algorithm is not symmetric anymore, so usually you either change to the next odd number (e.g. $m=13$ instead of $m=12$) or you choose a so-called centered moving average. In the centered moving average you first use a simple moving average of order 2 to determine values like $y_{1.5}$, and then apply the SMA of order $m$ to these.

$y_1$      3
                 -> $y_{1.5}$ = (3+5)/2 = 4
$y_2$      5
                 -> $y_{2.5}$ = (5+4)/2 = 4.5
$y_3$      4                                  -> centered MA at $y_3$: (4+4.5+2.5+2)/4 = 3.25
                 -> $y_{3.5}$ = (4+1)/2 = 2.5
$y_4$      1
                 -> $y_{4.5}$ = (1+3)/2 = 2
$y_5$      3
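The same centered moving average can be sketched in R as a single weighted filter, where the two chained averages collapse into the weights 1/8, 1/4, 1/4, 1/4, 1/8:

y <- c(3, 5, 4, 1, 3)
stats::filter(y, c(1/8, 1/4, 1/4, 1/4, 1/8), sides = 2)   # NA NA 3.25 NA NA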

In our earlier example of air passengers we determined $m=12$ (even) and the trend values in yellow are determined by a centered moving average of order 12.

Here is the command in R (ma stands for moving average and comes from the forecast package); to match the text above, the order should be the seasonal period:
library(forecast)
lines(ma(AirPassengers, order = 12, centre = TRUE))

After successfully determining the trend, we can remove it from the original data. In an additive model we get the de-trended time series by subtracting it ($Y_t - T_t$), in a multiplicative model by dividing ($Y_t / T_t$). The de-trended time series of our AirPassengers example looks like this:
The next step is to get the seasonal component and the random component from the de-trended time series.





Classical Decomposition - Summary


The classical decomposition of a time series can help to get an overview of the tendencies (trend component), periodic patterns (seasonal component) and the quality of the model (random component). In addition it helps to identify outliers in a time series.

To forecast a time series it is often useful to decompose it and to forecast each of the components separately. A seasonal component would just be repeated unchanged (naive forecast), while you could use exponential smoothing methods to forecast the trend and random components.

On the other hand the classical decomposition has some disadvantages: we saw in this post that the trend, and therefore also the random component, cannot be determined at the beginning and at the end of a time series. We also saw in that post that it relies on the assumption of a stable period with a fairly constant pattern. In reality this is often not the case: e.g. 100 years ago energy consumption was high in winter due to heating, while now it is equally high in summer due to air conditioning.

To overcome these limitations other decomposition methods have been developed, see for instance Seasonal and Trend decomposition using Loess (STL, 1990). I will describe it in a future post.





I hope you liked this and got a picture of the classical decomposition. I really enjoyed building up this example and encourage you to comment and extend it.

Sensitivity and Specificity


Apart from accuracy and precision there are other measures for classification models. Today we will focus on another pair of measures, called sensitivity and specificity. Like accuracy and precision they are numbers between 0 and 1, and the higher the values, the better.
A perfect classification model would have 100% sensitivity and 100% specificity.

Before defining these values, we recall the confusion matrix:



Now

Sensitivity = True Positives / Actual Positives 

In other words, sensitivity describes the probability that a positive is recognized as such by the model; therefore sensitivity is also often called the true positive rate.

Analogously

Specificity = True Negatives / Actual Negatives

In other words, specificity describes the probability that a negative is recognized as such by the model; therefore specificity is also often called the true negative rate.



As an example, let's assume that we have a binary classifier for cat and dog pictures. We test it with 100 pictures, of which 50 are cat pictures and 50 are dog pictures. Our classifier, however, erroneously classifies 6 cats as dogs and 2 dogs as cats.
We would have the following confusion matrix:
             
                    Predicted Cats    Predicted Dogs    Total
Actual Cats              44                 6             50
Actual Dogs               2                48             50

The sensitivity would be 44/50 = 88%, the specificity 48/50 = 96%.
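A quick way to check these numbers in R (with cats taken as the positive class):

cm <- matrix(c(44, 6,
                2, 48), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("cat", "dog"), predicted = c("cat", "dog")))
sensitivity <- cm["cat", "cat"] / sum(cm["cat", ])   # 44/50 = 0.88
specificity <- cm["dog", "dog"] / sum(cm["dog", ])   # 48/50 = 0.96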


Where are these values used?
A common way to compare the quality of different classifiers is the receiver operating characteristic curve, or ROC curve, where the true positive rate (sensitivity) is plotted against the false positive rate (1 - specificity). It owes its rather complicated name to its first use during World War II to detect enemy objects on the battlefield. Every test result or confusion matrix represents one point in ROC space. But this will be a separate post...





Averages

 

Averages are a bad idea. Averages hide characteristics, individual properties and special features behind one value. Nobody wants to be average; in fact, people tend to feel bad if their individual properties are not considered (no vegetarian meal in a restaurant). On the other hand, spotting individual strengths and addressing them is a reliable approach to increase response, e.g. in a marketing campaign. Why would anyone even consider writing about averages?

Because averages are useful!

Averages CAN hide characteristics and individual properties behind one value. But in the right places they can contribute to massive simplifications and effective classifications, and help to compare results. Averages help in regression and prediction and in the construction of clusters and decision trees, the most widely used data mining methods. They can be used for small subsets and applied at larger scales. Therefore they are essential in data science.

From a mathematical point of view there are a lot of different ways to define averages. The most common average ("the average of the averages") is the arithmetic average, which sums up all of the values and divides the sum by the number of values (we only focus here on averages used in practice; infinite series are nice, but a different field!). Another average is the geometric average of n numbers, in which all n values are multiplied and the nth root is taken. And there is the harmonic average, which however is not used that often.

In statistics the weighted average (or weighted mean) is of greater importance; in it the different values are given a certain weight (usually the weights sum to 1). It does not treat all values in the same way: some values are more important (higher weight) than others.
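As a small illustration, here are the four averages mentioned above computed in R for a made-up vector of three values:

x <- c(2, 4, 8)
mean(x)                                   # arithmetic average: 4.67
prod(x)^(1 / length(x))                   # geometric average: 4
length(x) / sum(1 / x)                    # harmonic average: 3.43
weighted.mean(x, w = c(0.5, 0.3, 0.2))    # weighted average: 3.8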

What does this have to do with the small town of Haßloch in the Bad Dürkheim district in Germany?
Haßloch is an average. The distribution of its inhabitants with respect to age, income, education and home size is representative of Germany; in other words, the averages of these values for Germany as a whole are pretty close to the values in Haßloch. Therefore Haßloch is the ideal place to test and analyze new product launches. Products that fail in Haßloch do not get released into the German market, and many successful products and packagings have made their way through the supermarkets of Haßloch.


By the way, the average German prefers the color blue, and the average German man is 1.78 m tall, weighs 82.4 kg and has blond or dark blond hair. You want to know if you can recognize the average German girl (and those of other nations)? Find out here.



Decision Trees


Decision trees are one of the most important techniques for data analysis. They are easy to understand, can illustrate complex rule sets and are an effective method to classify new datasets.

Are there any complications about decision trees? Sure there are!

There are quite a lot of algorithms for building decision trees. The common understanding of building them by following known rules is not the principal way decision trees are used in data science. In data science the rules are usually unknown, meaning the split of the data into the branches and the determination of the leaves is not known from the start. A good algorithm increases the purity on every split, meaning that the disjoint data subsets after a split are purer regarding a target variable than the data set before the split. In addition to determining and calculating purity, the different algorithms also concentrate on questions around the minimal size of leaves and the number of splits, as one major problem of decision trees is over-fitting.

So how do you measure purity?


The simplest way to measure the purity of a split in a classification decision tree would be to look at the proportions of the target variable's values in the created subsets and choose the split that generates the smallest minority proportion in the subsets.


I found further splitting criteria in Linoff/Berry's book on Data Mining Techniques (btw a very good  and understandable reference book for data mining):

The Gini measure simply sums up the squares of the proportions of the target variable's values in a subset and assigns this value to the node. A bad split would not considerably change the proportions and would therefore have a Gini measure around 1/n, where n is the number of different values of the target variable; a good split would have a Gini score close to 1 on the subsets. To get the score of the split, the Gini scores of the subsets are added up, weighted by the size of the subsets.
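A small sketch of this score in R, with a made-up split of six records into two subsets (the class labels and subset ids are illustrative, not from the book):

gini_subset <- function(classes) sum(prop.table(table(classes))^2)   # sum of squared proportions
gini_split <- function(classes, subset_id) {
  sizes <- table(subset_id)
  sum(sapply(split(classes, subset_id), gini_subset) * sizes / sum(sizes))   # size-weighted sum
}
gini_split(c("A", "A", "A", "B", "B", "B"), subset_id = c(1, 1, 1, 1, 2, 2))   # 0.75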

The Chi-square test can be used to measure how good a split is. It measures how likely the split is due to chance, by calculating the expected and observed counts of each target value in every subset after the split. For each subset the value is the sum over the target values of (observed - expected)^2 / expected; these subset values are then summed up to get a measure for the split. It is used in the Chi-Square Automatic Interaction Detector (CHAID) algorithm for decision trees, where it chooses the best split in every step of the decision tree algorithm.
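A sketch of this idea in R, using the contingency table of subset membership versus target value for a made-up split (the counts are illustrative):

split_table <- matrix(c(40, 10,    # subset 1: 40 records of class A, 10 of class B
                        15, 35),   # subset 2: 15 records of class A, 35 of class B
                      nrow = 2, byrow = TRUE)
chisq.test(split_table)   # a large statistic (tiny p-value) means the split is unlikely to be due to chance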

Accuracy and Precision - How good is your classifier?


After successfully deciding on a classifier and adjusting and optimizing its parameters on the training data, it is time to evaluate the model. In the context of model evaluation a few statistical terms are important, most importantly:

Accuracy and Precision

Colloquially both have pretty much the same meaning; in statistics, however, they describe different properties. They are independent of each other: a model can be highly accurate but have low precision, or be highly precise but low in accuracy. Ideally you have a model that is both highly accurate and highly precise.

To understand the definitions, we start with the so-called confusion matrix, which can be created for any classifier or supervised learning model. After building your model you compare the actual results to the results predicted by your model. You then classify the compared results via the following matrix (in case you have more than two possible results, you can create a confusion matrix for each class):


The true positives (TP) and the true negatives (TN) are the datasets that your model predicted correctly; the higher the values in these boxes, the better your model. Errors in the predicted classifications of your model can be divided into:
- False positives (FP) (also called Type 1 errors) are the cases in which your model classified a dataset incorrectly as positive,
- False negatives (FN) (also called Type 2 errors) are the ones incorrectly classified as negative.

Accuracy is the ratio of correctly classified datasets to all classified datasets, so
Accuracy = ( True Positives + True Negatives ) / All classified examples.

A high accuracy seems to be a reasonable way to evaluate your model, however it is not sufficient, as the following example shows: imagine you have a classifier for breast cancer, which in most cases will give a negative result. In fact, for 2017 the US expects a rate of only ~0.123% new cases. Imagine that your classifier always predicts a negative result; then TP = 0 and TN ≈ 99.88% of all cases, giving an accuracy of about 99.9%. However, we surely agree that this classifier is not at all useful.
We somehow need to consider the overall positives (P) and the overall negatives (N).

Here is where precision comes into play. Precision is calculated as
Precision =  True Positives / ( True Positives + False Positives )

Now for our dummy classifier the precision would be 0 (strictly speaking 0/0, as it never predicts a positive at all), confirming that it is not useful at all.
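A small numeric check in R, assuming the ~0.123% rate from above applied to 100,000 people and the always-negative classifier:

TP <- 0; FP <- 0; FN <- 123; TN <- 99877
accuracy  <- (TP + TN) / (TP + TN + FP + FN)   # 0.99877
precision <- TP / (TP + FP)                    # 0/0 -> NaN: no useful positive prediction at all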

There are more relevant key figures, but they will be covered in a later post. For now, the confusion matrix is a first hint to determine whether your classifier is useful.







How is the weather in the Black Forest? - Nearest Neighbour Approaches


Planning a city or hiking trip for the weekend, a barbecue or a romantic picnic in the mountains requires a pretty good weather forecast. What would you do if you wanted to know the weather? Probably you would just check the weather forecast for the desired city; these weather forecasts are available online. But what if your route lies in the mountains, without any reference city or village? You would then choose a location near your route for which a weather forecast is available and assume you will get the same weather on your route.
If you have a sunny forecast for one side of your route and a rainy forecast for the other side, you would probably rely more on the forecast closer to the route, but also consider the forecast farther away.
This approach is quite reasonable, assuming that the weather is similar at geographically similar locations. The same reasoning is also applied in the business and research world to forecast key figures like preferences, behaviours and prices. One approach is the nearest neighbour approach, which starts with the assumption that objects "close" to each other behave in the "same" way: if you have to predict an unknown value, just ask the "nearest" neighbours and use their results.

However, there are some problems with this approach: it might be easy to find neighbours considering numeric values like age, income or location, but how would you find neighbours for, let's say, preferred music?

As so often, there is no general answer, and different ways to find neighbours exist. Usually finding a good neighbour consists of two main tasks:
1. Find a measure for distance between two datasets
2. Combine the datasets for the closest neighbours to make a prediction

Usually you have a dataset consisting of a collection of features; for a customer this could be a so-called customer ID holding all the relevant data that could possibly influence the target variable. Often age, income and location are part of it. For these features it is easy to determine a distance; usually the difference (direct difference or Euclidean distance) is used. For better comparison, normalization is recommended, as the features have different ranges.
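In R, such a normalization could be sketched with scale(), which centers and rescales each numeric feature (the example values are made up):

features <- data.frame(age = c(35, 24, 60), income = c(65000, 25000, 60000))
scale(features)   # each column now has mean 0 and standard deviation 1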

You could then go for the k nearest neighbours, check the known values of their target variables and then combine these results into a target value. Here business knowledge is required to determine to what extent e.g. the age is important or whether location is more relevant than income.

The advantages of this approach are that it is usually pretty easy to understand, often reveals additional insights and adapts whenever the data changes.
However, it can be time consuming, as for finding good neighbours you have to check every single known dataset and measure the distance. Also, the prediction can depend considerably on the value of k, so further analysis is required. And the algorithm is discontinuous, meaning that a new dataset can have a huge impact on the existing predictions.

To overcome these limitations different approaches have been established (e.g. reducing the number of datasets by choosing "important" neighbours), and nearest neighbour approaches are widely used in different classification and regression problems (e.g. supporting breast cancer detection, estimating house prices).
And by defining the right distance measure you can even find neighbours to music files or detect song titles.

As a closing example, here is a simplified exercise. Consider the following data:

Target: Money spent online for hobbies per year
Gender    Age    Income    No. of children    Target
Male       35    65,000           1            6,500
Female     24    25,000           0            2,000
Female     60    60,000           4              380
Male       48    45,000           2            2,100
Female     39    60,000           0            6,000
Male       49    75,000           2            7,500
Male       18       800           0              670

Which are your closest neighbours? And how well does the estimate of the 1-nearest-neighbour model fit you?
For the distance, take the absolute difference in age, income and number of children, weighted with age_weight = 7, income_weight = 10, nof_children_weight = 1.
(Note that this is a very simplified example; in reality the money spent on hobbies would depend on far more factors, and the variables are not independent of each other.)
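Here is a sketch of the 1-nearest-neighbour estimate for this exercise, with a hypothetical query person (age 30, income 50,000, 1 child):

data <- data.frame(
  age      = c(35, 24, 60, 48, 39, 49, 18),
  income   = c(65000, 25000, 60000, 45000, 60000, 75000, 800),
  children = c(1, 0, 4, 2, 0, 2, 0),
  target   = c(6500, 2000, 380, 2100, 6000, 7500, 670))
query <- list(age = 30, income = 50000, children = 1)
dist  <- 7 * abs(data$age - query$age) +
        10 * abs(data$income - query$income) +
         1 * abs(data$children - query$children)
data$target[which.min(dist)]   # the target value of the single nearest neighbour

With these weights the income difference dominates the distance, which is a good reminder of why normalization matters.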

Hopefully you get a good approximation, so your next romantic outdoor picnic in the Black Forest does not look like this: