The ROC curve


The ROC curve (receiver operating characteristic curve) is a graphical tool to visualize and compare the quality of binary classifiers.

Recall that for a classification experiment we can build the confusion matrix, which contains the numbers of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). For the ROC curve we plot the TP rate on the y-axis against the FP rate on the x-axis. Since the TP rate is also called sensitivity and the FP rate equals 1-specificity, the ROC curve is also called the sensitivity vs. (1-specificity) plot.

We define the ROC space to be the unit square with the false positive rate (1-specificity) on the x-axis and the true positive rate (sensitivity) on the y-axis:
Now run your classifier, calculate sensitivity and specificity, and place the data point corresponding to your test into the ROC space. The closer it is to the upper left corner, the better your classifier. If you have a 100% correctly predicting classifier - congratulations, you will find your classifier directly at (0,1).


Now, to create the ROC curve, begin with the examples the classifier is most confident about classifying as positive, and then stepwise include more examples. Every time the classifier correctly classifies an example as positive, the line goes up; every time it incorrectly classifies an example as positive, the curve moves to the right (for small datasets you will actually not get a smooth curve, but a zigzag line).
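As a minimal sketch, here is how such a curve could be built by hand in R; the labels and scores below are made-up toy data, not taken from a real classifier:

labels <- c(1, 1, 0, 1, 0, 0, 1, 0)                    # 1 = actual positive, 0 = actual negative
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)   # classifier scores (higher = "more positive")
ord <- order(scores, decreasing = TRUE)                # start with the easiest-to-classify-positive examples
lab <- labels[ord]
tpr <- cumsum(lab) / sum(lab)                          # curve goes up for every correct positive
fpr <- cumsum(1 - lab) / sum(1 - lab)                  # curve moves right for every false positive
plot(c(0, fpr), c(0, tpr), type = "s",
     xlab = "1 - specificity (FP rate)", ylab = "sensitivity (TP rate)")
abline(0, 1, lty = 2)                                  # diagonal of the random classifier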

What can you read from the ROC curve?
 
A random classifier would create equal proportions of true positives and false positives (independent of the actual class proportions). So a random classifier would yield the diagonal yellow line.
The perfect classifier would not create any false positives, so we get sensitivity 1 already at a false positive rate of 0; this would be the green line in the ROC space.
A non-trivial classifier would lie somewhere in between (red line). Note that a bad classifier would lie below the random line and could be turned into a better classifier by simply inverting its predictions.

What else can be said about the ROC curve:
- Unlike the cumulative gains chart, the ROC curve does not depend on the density of positives in the test dataset.
- Sometimes the area under the ROC curve ("Area Under the Curve", AUC) is computed in order to give a classifier a single comparable number. A bad classifier has an AUC close to 0.5 (random classifier), while a good classifier has a value close to 1 (see the sketch after this list).
Note that this approach to comparing classifiers has been questioned in recent machine learning research, among other reasons because the AUC appears to be a rather noisy measure.
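If you do not want to compute the AUC by hand (e.g. with the trapezoidal rule on the points from the sketch above), a short sketch with the third-party pROC package (assuming it is installed) could look like this:

library(pROC)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0)                    # same toy data as above
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1)
roc_obj <- roc(labels, scores)
auc(roc_obj)                                           # 0.75 for this toy data; ~0.5 random, ~1 perfect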

Classical Decomposition of a time series

A time series is a data set that has a time component. Yes, it is just what you think: in the optimal case you have one value per fixed time interval. It is currently a popular topic to create massive datasets from sensors in an Internet of Things scenario, where you usually get too many values (>100 measurements per second), so creating a useful time series requires some preprocessing (filtering or averaging). Missing values can be a problem, although it is usually not difficult to find a good estimate for them.

From such data you can build a timeline, which could look like this:


Here I used R's built-in dataset "AirPassengers", which contains the monthly international airline passenger numbers for the years 1949 to 1960.
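For reference, the plot can be reproduced with one line of R (AirPassengers is already a monthly ts object):

plot(AirPassengers, ylab = "passengers (in thousands)")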

Natural questions are: 
  • Can you see which of the years were good, which of them were bad?
  • Can you split up the time series into more homogeneous components?
  • Can you predict the values for future months? And how good are these predictions?
Time series analysis may not seem too complex; however, in reality combinations of models are often required to make good predictions. In this post I want to show you how the standard approaches work. In practice, especially in retail, the timelines are usually not that easy to handle, as they are heavily influenced by customer reviews.

The central idea is that a time series $Y = (Y_t)$ is a combination of three independent sub-time series:
- A trend component $T$: a long-term tendency in the data; it does not have to be linear.
- A seasonal component $S$: a pattern that reoccurs regularly after a fixed period (like every summer, every January or every day at 10:30).
- A random component $I$, also called irregular or noise.

We now want to find these three time series in the example mentioned above. First we have to decide on the type of decomposition; we can choose between additive and multiplicative.
In an additive model we add the 3 sub-time series up to get the original time series: $$Y_t = T_t + S_t + I_t$$ You should use it when the seasonal variation does not change much over time.

In a multiplicative model we multiply the 3 sub-time series: $$Y_t = T_t * S_t * I_t$$ Use it when you see the peaks growing with time, like in the earlier mentioned example of airplane passengers. Here we should go for a multiplicative model.



Tip: A multiplicative model can often be turned into an additive model by applying the log function.
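As a quick sketch of this tip in R: taking the log of the AirPassengers series lets us use an additive decomposition:

plot(decompose(log(AirPassengers)))    # additive decomposition (the default) of the log-transformed series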

How do we get the values for the trend, seasonal and random sub-time series? We will go through it step by step; as motivation, here is the result calculated by R with the function decompose:


Here is the corresponding code in R:
plot(decompose(ts(AirPassengers, frequency = 12, start = 1949), type = "mult"))

To go on:

1. Here is how to determine the trend component
2. Here is how to determine the seasonal and random component
3. Here is a summary on the classical decomposition of time series

Seasonality and Random Determination

In this post we saw how a time series can be split into the three components trend, seasonality and random. How to extract the trend was shown there; now we focus on how the seasonal component and the random component are determined.
Assume we have a detrended time series (here we take the AirPassengers time series and remove the trend). We assume a seasonality with a fixed period. In reality the assumption of a fixed seasonality is often too strict, as the period could shorten or change its structure over time. But under this assumption the determination of the seasonality is easy: to get the seasonal value for January, we take all January values and average them. This pattern is then used for all periods.
The last step is to determine the random component $I$. We get it by simply removing the trend $T$ and the seasonal component $S$ from the original time series $Y$: in an additive model this is $I_t = Y_t - T_t - S_t$, and in a multiplicative model $I_t = Y_t / (T_t * S_t)$.
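As a do-it-yourself sketch in R (decompose does essentially the same internally, and additionally normalizes the seasonal pattern), assuming the forecast package for ma:

library(forecast)                                       # for ma()
trend     <- ma(AirPassengers, order = 12, centre = TRUE)
detrended <- AirPassengers / trend                      # multiplicative model
# seasonal pattern: average the detrended values month by month
pattern  <- tapply(detrended, cycle(detrended), mean, na.rm = TRUE)
seasonal <- ts(rep(pattern, times = length(AirPassengers) / 12),
               start = start(AirPassengers), frequency = 12)
random   <- AirPassengers / (trend * seasonal)
plot(random)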
In our example, this is what the random component looks like:
What can we get out of it?
The random component shows the noise in the data, the values that do not fit the model. It helps to get a feeling for how well the data is explained by the assumption of a trend and a seasonality. The classical decomposition can also help to find outliers, which show up as high peaks.

For completeness, here again is the whole picture containing all the steps discussed:



 

Trend determination with moving averages

We already saw in the previous post that a trend can be extracted from a time series. Here is a classical approach for determining this underlying trend.
 
A trend $T$ is basically a smoothed version of the time series; it helps to capture global tendencies. To get a trend line, here is what you have to do:
  • Determine the seasonal periodicity of the time series (if there is one). These periodic patterns are usually visible, but if you cannot see them from the plotted chart, there are also methods based on the Fourier transform to determine them. In our example we see a yearly periodicity; as the values for the airplane passengers come in monthly, the period is m = 12 (see the quick check below).
  • With this number m, use methods like the moving average of order m to determine the values of the trend.
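As a quick check in R (this only works if the ts object already carries its period as metadata, as AirPassengers does):

frequency(AirPassengers)    # 12, i.e. a yearly pattern in monthly data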

Now what is the moving average?
Once you understand the concept, it is easy to remember: imagine that your dataset consists of 5 values $y_1$, ..., $y_5$. To determine the value of the trend of order $m = 3$, you take the original value, the value of its predecessor and the value of its successor and average them. In the simplest approach you simply take the sum of the values divided by the number of values (the so-called simple moving average, SMA).
In the example of 5 values this would look like:
           
$y_1$      3
$y_2$      5     -> (3+5+4)/3 = 4
$y_3$      4     -> (5+4+1)/3 = 3.33
$y_4$      1     -> (4+1+3)/3 = 2.67
$y_5$      3



You already see the problem here: at the beginning and at the end there are no values available; the length of the missing tails depends on $m$.
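Here is the toy example computed with R's built-in filter function (a short sketch; y holds the five values from the table above):

y <- c(3, 5, 4, 1, 3)
sma3 <- stats::filter(y, rep(1/3, 3), sides = 2)   # centered simple moving average of order 3
sma3                                               # NA 4.0000 3.3333 2.6667 NA (note the NAs at the ends)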

What if we choose an even order like $m=4$? We would have to decide whether to take more points from the past or from the future; in either case the algorithm is no longer symmetric. Usually you therefore either switch to the next odd number (e.g. $m=13$ instead of $12$) or you use a so-called centered moving average. In the centered moving average you first use a simple moving average of order 2 to determine intermediate values like $y_{1.5}$, and then apply the SMA of order $m$ to these.

$y_1$      3
           -> $y_{1.5}$ = (3+5)/2 = 4
$y_2$      5
           -> $y_{2.5}$ = (5+4)/2 = 4.5
$y_3$      4           -> (4 + 4.5 + 2.5 + 2) / 4 = 3.25
           -> $y_{3.5}$ = (4+1)/2 = 2.5
$y_4$      1
           -> $y_{4.5}$ = (1+3)/2 = 2
$y_5$      3
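Written as a single weighted filter, the centered moving average of order 4 from this example uses the weights (1, 2, 2, 2, 1)/8; a short sketch in R:

y <- c(3, 5, 4, 1, 3)
cma4 <- stats::filter(y, c(1, 2, 2, 2, 1) / 8, sides = 2)
cma4                                               # NA NA 3.25 NA NA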

In our earlier example of air passengers we determined $m=12$ (even) and the trend values in yellow are determined by a centered moving average of order 12.

Here is the command in R (ma, from the forecast package, stands for moving average):
library(forecast)
lines(ma(AirPassengers, order = 12, centre = TRUE))

After successfully determining the trend, we can remove it from the original data. In an additive model we get the de-trended time series by subtracting it ($Y_t - T_t$), in a multiplicative model by dividing ($Y_t / T_t$). The detrended time series of our AirPassengers example looks like this:
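As a sketch, the de-trending step in R (again assuming the forecast package for ma):

library(forecast)
trend     <- ma(AirPassengers, order = 12, centre = TRUE)
detrended <- AirPassengers / trend                 # multiplicative model: divide by the trend
plot(detrended)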
The next step is now to get the seasonality component and the random component from the de-trended time series.





Classical Decomposition - Summary


The classical decomposition of a time series can help to get an overview of the tendencies (trend component), periodic patterns (seasonal component) and the quality of the model (random component). In addition it helps to identify outliers in a time series.

To forecast a time series it is often useful to have a decomposition and to forecast each of the components separately. A seasonal component would just be repeated (naive forecast), while you could use exponential smoothing methods to forecast the trend and random components.
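A rough sketch of this idea in R, assuming the forecast package: the seasonal pattern is repeated naively, and the seasonally adjusted series is forecast with exponential smoothing (ets):

library(forecast)
dec  <- decompose(AirPassengers, type = "multiplicative")
adj  <- AirPassengers / dec$seasonal               # seasonally adjusted series
fc   <- forecast(ets(adj), h = 12)                 # exponential smoothing forecast, 12 months ahead
naive_seasonal <- rep(tail(dec$seasonal, 12), length.out = 12)   # repeat last year's pattern
fc$mean * naive_seasonal                           # combined point forecast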

On the other hand, the classical decomposition has some disadvantages. We saw in this post that the trend, and therefore also the random component, cannot be determined at the beginning and at the end of a time series. We also saw that it relies on the assumption of a stable period with a fairly constant pattern. In reality this is often not the case: e.g. 100 years ago energy consumption was high in winter due to heating, while now it is equally high in summer due to air conditioning.

To overcome these limitations other decomposition methods have been developed, see for instance the Seasonal and Trend decomposition using LOESS (STL, 1990). I will describe it in a future post.





I hope you liked this and got a picture of the classical decomposition. I really enjoyed building up this example and encourage you to comment and extend it.

Sensitivity and Specificity


Apart from accuracy and precision there are other measures for classification models. Today we will focus on another pair of measures, called sensitivity and specificity. Like accuracy and precision they are numbers between 0 and 1, and the higher the value, the better.
A perfect classification model would have 100% sensitivity and 100% specificity.

Before defining these values, we recall the confusion matrix:



Now

Sensitivity = True Positives / Actual Positives = TP / (TP + FN)

In other words sensitivity describes the probability that a positive is recognized as such by the model; therefore sensitivity is also often called the true positive rate.

Analogously

Specificity = True Negatives / Actual Negatives = TN / (TN + FP)

In other words specificity describes the probability that a negative is recognized as such by the model; therefore specificity is also often called the true negative rate.



As an example, let's assume that we have a binary classifier for cat and dog pictures. We test it with 100 pictures, of which 50 are cat pictures and 50 are dog pictures. Our classifier, however, erroneously classifies 6 cats as dogs and 2 dogs as cats.
We would have the following confusion matrix:
             
                 predicted: Cats    predicted: Dogs    total
actual: Cats            44                  6            50
actual: Dogs             2                 48            50

The sensitivity would be 44/50 = 88%, the specificity 48/50 = 96%.
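The same computation as a short R sketch, treating cats as the positive class:

conf <- matrix(c(44, 6,
                  2, 48),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("cat", "dog"), predicted = c("cat", "dog")))
sensitivity <- conf["cat", "cat"] / sum(conf["cat", ])   # 44 / 50 = 0.88
specificity <- conf["dog", "dog"] / sum(conf["dog", ])   # 48 / 50 = 0.96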


Where are these values used?
A common way to compare the quality of different classifiers is to use a receiver operating characteristic curve, or ROC curve, where the true positive rate (sensitivity) is plotted against the false positive rate (1-specificity). It owes its rather complicated name to its first use during World War II to detect enemy objects on the battlefield. Every test result or confusion matrix represents one point in the ROC space. But this will be a separate post...





Averages

 

Averages are a bad idea. Averages hide characteristics, individual properties and special cases behind one value. Nobody wants to be average; in fact, people tend to feel bad if their individual properties are not considered (no vegetarian meal in a restaurant). On the other hand, spotting individual strengths and addressing them is a reliable approach to increase response, e.g. in a marketing campaign. Why would anyone even consider writing about averages?

Because averages are useful!

Averages CAN hide characteristics and individual properties behind one value. In the right places, however, they contribute to massive simplifications and effective classifications, and they help to compare results. Averages help in regression and prediction and in the construction of clusters and decision trees, the most widely used data mining methods. They can be used for small subsets and applied at larger scales. Therefore they are essential in data science.

From a mathematical point of view there are many different ways to define averages. The most common average ("the average of the averages") is the arithmetic average, which sums up all of the values and divides the sum by the number of values (we only focus here on averages used in practice; infinite series are a nice, but different field!). Another average is the geometric average of n numbers, in which all n values are multiplied and the nth root is taken. And there is the harmonic average, which however is not used that often.

In statistics the weighted average (or weighted mean) is of greater importance; in it the different values are given a certain weight (usually the sum of all weights is 1). It does not treat all values in the same way: some values are more important (higher weight) than others.
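A small sketch of these averages in R, using three made-up values:

x <- c(2, 4, 8)
mean(x)                                  # arithmetic mean: (2 + 4 + 8) / 3 ≈ 4.67
prod(x)^(1 / length(x))                  # geometric mean: (2 * 4 * 8)^(1/3) = 4
length(x) / sum(1 / x)                   # harmonic mean: 3 / (1/2 + 1/4 + 1/8) ≈ 3.43
weighted.mean(x, w = c(0.5, 0.3, 0.2))   # weighted mean: 1.0 + 1.2 + 1.6 = 3.8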

What does this have to do with the small town Haßloch in the Bad Dürkheim district in Germany?
Haßloch is an average. The distribution of its inhabitants with respect to age, income, education and home size is representative of Germany; in other words, the averages of these values for Germany as a whole are pretty close to the values in Haßloch. Therefore Haßloch is an ideal place to test and analyze new product launches. Products that fail in Haßloch will not get released to the German market, and many successful products and packagings have made their way through the supermarkets of Haßloch.


By the way, the average German prefers the color blue, and the average German man is 1.78 m tall, weighs 82.4 kg and has blond or dark blond hair. Do you want to know whether you would recognize the average German girl (and those of other nations)? Find out here.