A classical Data Science problem is to identify outliers, meaning anomalous behaviours or unexpected high or low values. These unusual values should always be analyzed, they could mean errors in the test data (to be corrected or removed from the data), they could occur naturally or they could actually be the target of an analysis. Typical applications for those analysis are e.g.
fraud detection , in which a company wants to detect misuses of their products,
fault detection, in which quality or security problems can be identified, but also
monitoring of server and computer landscapes in order to reduce or even avoid downtimes. In these problems, the challenge is to identifying the outliers. Characteristics of such problems are, that there are only few negativ examples, but large sets of positive examples.
There are several approaches to adress this kind of challenges. Apart from the a recommended visual analysis there are a lot of algorithms that adress this problem:
A simple algorithm called
Interquartile Range Algorithm calculates the Interqartile Range (IQR or also called midspread) on a set of values to find anomalous data points. It splits the number of values into 4 (equal) parts and takes the three borders between them as variables $Q_1$, $Q_2$ and $Q_3$ (ascending). The $IQR$ is then defined as
$IQR = Q_3-Q_1$.
Outliers can then be defined in different ways, e.g. via a definition of the american statist
John Wilder Tukey, there are two kinds of outliers:
-
suspected outliers that lie
$1.5 * IQR$ or more above $Q_3$ or below $Q_1$
-
outliers that lie
$3*IQR$ or more above $Q_3$ or below $Q_1$.
Visually they are represented by a so called "Boxplot", that is easy to understand:
The "whisker" represent the still accepted data sets, what lies outside is considered an outlier. The length is not symmetrical, it is defined by the smallest or largest values that are not yet considered an outlier. Suspected outliers are drawn in transparent circles, real outliers in filled circles.
So the middle 50% lie inside the box, the median is just another word for $Q2$.
But there are other definitions for outliers as well...
This is a simple algorithm that works on one-dimensional values. There are other algorithms that focus on distances to neigbours to find anomalous data points.
A more sophisticated approach is the following
anomaly detection algorithm using the Gaussian Distribution:
To adress the issue, the idea is to start choosing "normal" behaviour for all the features that might be indicators of anomalous examples. Use those normal examples as training data for an anomaly detection algorithm:
Assume that all the features $x_1, ... x_n$ are normally distributed (Gaussian), therefore the means $\mu_i$ and the variances $\sigma_i$ of all the features in the training data is needed.
For new examples use than $p(x) = \prod_{i=1}^n (p(x_i; \mu_i, \sigma_i))$ to calculate the probability to be "normal", choose a decision boundary $\epsilon$ (e.g. $\epsilon = 0,02$) and predict anomaly examples to be the the ones with $p(x) <= \epsilon$. Here $$p(x; \mu, \sigma) = \frac{1}{\sqrt{(2\pi\sigma)}} \exp(-\frac{(x-\mu)}{2\sigma^2})$$ is the Gaussian Distribution for an $n$-dimensional value $x$ under the means $\mu = \frac{1}{m}\sum_{i = 1}^{m}x^{(i)}$ and standard deviation $\sigma = \frac{1}{m}\sum_{i = 1}^{m}(x^{(i)}-\mu)^2$.
How to choose $\epsilon$?
You can use the cross validation set which should also hold examples of anomalies to find an appropriate value. Please also make sure, that you have anomalies left for your test data.
How to measure the algorithm?
As we work here with skewed classes (there are much more positive than negative examples), accuracy is not the right way to measure the quality of the algorithm. Instead use the number of True/False Positives/Negatives or other statistical values like Precision, Recall or the $F_1$-Score.
How to come up with features:
If anomaly is not clearly distinguishable from the normal examples, try to find a new property that in compination with the existing features distinguish the normal and the anomalous examples. Start features that take on very small or very large values.
A generalization of the Gaussean Distribution Algorithm is the
Multiple Gaussian Distribution Algorithm. As the upper mentioned algorithm usually yields to concentric acceptance areas (which is usually not wanted), they cannot not detect all anomalies sufficiently well. Here multiple Gaussian distribution is a useful tool to further improve the upper mentioned algorithm. However the price is paid with mor expensive calsulation:
$$p(x; \mu, \sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{(1/2)}} * \exp(-\frac{1}{2} \frac{(x-\mu)^T }{\Sigma^{-1} (x-\mu)} )$$ where $|\Sigma|$ is the determinante of the matrix $\Sigma$
Here the acceptance areas depend on the values of the covariant matrix $\Sigma$ and are in general not concentric and not axis aligned.