What does a Data Scientist do? - Facts



From the post about CRISP-DM we know what phases a data science project consists of. This post analyzes the time spent on the different tasks. I refer to the "Data Science Survey" by Rexer Analytics from 2015.
  • The top application areas are Marketing, Academics, Finance, Technology, Medical, Retail, Internet-based, Government and Manufacturing.
  • Data scientists work above all as corporate employees, consultants, academics or vendors.
  • Data scientists have a high job satisfaction level.
  • The biggest challenges are the rising data analysis complexity and data visualization.
  • The most frequently used algorithms are regression algorithms, decision trees and cluster analysis, the main tool is R.
  • There is also an impressive list of alternative job titles provided, so apart from data scientist, you can also call yourself data analyst, researcher, business analyst, data miner, statistician, predictive modeler, computer scientist, engineer or software developer.

Association Analysis


The typical question behind Association Analysis, often also called Basket Analysis, is: Which products are bought together? This question is important because, based on the result, different measures can be taken: You could place those products together, increase the price of one of the products and lower the price of the other one, advertise only one of them or create combo offers.


To find dependent products, create rules R like

R: If product A is bought, then also product B is bought

Here A is called the antecedent and B the consequent. To determine the importance of such a rule, three statistical key figures are defined:

$SUPPORT(R) := \frac{\text{number of baskets that support the rule}}{\text{number of overall baskets}}$

$CONFIDENCE(A, B) := \frac{\text{number of baskets that support the rule}}{\text{number of baskets that contain A}}$

In many examples both of these key figures can be high without the rule being very useful (e.g. in case product B is bought by 95% of the customers anyway). Therefore the lift, also called improvement, is introduced:
$$LIFT(A,B) := \frac{CONFIDENCE(A, B)}{SUPPORT(B)}$$
While the support and the lift are symmetric with respect to A and B, the confidence is not.

Now the lift decides, if our rule is valid:


If the lift is < 1, the rule does not describe an association. For a lift of 1 the antecedent and consequent are independent of each other, and a lift > 1 describes to which degree the products depend on each other.
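To make the three key figures concrete, here is a minimal sketch in Python that computes support, confidence and lift for a rule on a small made-up list of baskets (the baskets and product names are purely illustrative):

```python
# Toy baskets, invented for illustration only.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread", "butter", "jam"},
    {"bread"},
]

def support(itemset):
    """Fraction of baskets that contain all items of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(A, B):
    """Fraction of baskets containing A that also contain B."""
    return support(A | B) / support(A)

def lift(A, B):
    """Observed co-occurrence relative to what independence would predict."""
    return confidence(A, B) / support(B)

A, B = {"bread"}, {"butter"}
print(support(A | B), confidence(A, B), lift(A, B))   # lift > 1 -> association
```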


A typical example of an association algorithm is the so called Apriori algorithm, which creates rules for all possible subsets having a minimal support. Its big advantage is that it produces clear, easily understandable results that can be used directly; however, its runtime grows exponentially with the number of products, and very rare items are not included in the analysis.

Anomaly detection




A classical Data Science problem is to identify outliers, meaning anomalous behaviour or unexpectedly high or low values. These unusual values should always be analyzed: they could be errors in the test data (to be corrected or removed from the data), they could occur naturally or they could actually be the target of an analysis. Typical applications for such analyses are e.g. fraud detection, in which a company wants to detect misuses of its products, fault detection, in which quality or security problems can be identified, but also monitoring of server and computer landscapes in order to reduce or even avoid downtimes. In these problems, the challenge is to identify the outliers. A characteristic of such problems is that there are only few negative (anomalous) examples, but large sets of positive (normal) examples.

There are several approaches to address this kind of challenge. Apart from a recommended visual analysis, there are a lot of algorithms that address this problem:

A simple algorithm called the Interquartile Range Algorithm calculates the interquartile range (IQR, also called midspread) on a set of values to find anomalous data points. It splits the sorted values into four (equally sized) parts and takes the three boundaries between them as the quartiles $Q_1$, $Q_2$ and $Q_3$ (in ascending order). The $IQR$ is then defined as $IQR = Q_3-Q_1$.

Outliers can then be defined in different ways; e.g. according to a definition of the American statistician John Wilder Tukey, there are two kinds of outliers:
- suspected outliers that lie $1.5 * IQR$ or more above $Q_3$ or below $Q_1$
- outliers that lie $3*IQR$ or more above $Q_3$ or below $Q_1$.
Visually they are represented by a so called "Boxplot", which is easy to understand:



The "whisker" represent the still accepted data sets, what lies outside is considered an outlier. The length is not symmetrical, it is defined by the smallest or largest values that are not yet considered an outlier. Suspected outliers are drawn in transparent circles, real outliers in filled circles.
So the middle 50% lie inside the box, the median is just another word for $Q2$.
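As a quick illustration, here is a minimal sketch of Tukey's fences in Python on a small made-up one-dimensional sample (numpy's percentile function is used to obtain $Q_1$ and $Q_3$):

```python
import numpy as np

# Toy one-dimensional sample with two obvious outliers at the end.
values = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 3.4, 9.5, 15.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Suspected outliers: beyond 1.5 * IQR (real outliers also pass this fence).
suspected = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
# Real outliers: beyond 3 * IQR.
extreme = values[(values < q1 - 3.0 * iqr) | (values > q3 + 3.0 * iqr)]

print("IQR:", iqr)
print("suspected outliers:", suspected)
print("outliers:", extreme)
```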

But there are other definitions for outliers as well...


This is a simple algorithm that works on one-dimensional values. There are other algorithms that focus on distances to neighbours to find anomalous data points.

A more sophisticated approach is the following anomaly detection algorithm using the Gaussian Distribution:
To address the issue, the idea is to start by choosing "normal" behaviour for all the features that might be indicators of anomalous examples. Use those normal examples as training data for an anomaly detection algorithm:
Assume that all the features $x_1, ..., x_n$ are normally distributed (Gaussian); therefore the means $\mu_i$ and the variances $\sigma_i^2$ of all the features in the training data are needed.

For new examples then use $p(x) = \prod_{i=1}^n p(x_i; \mu_i, \sigma_i)$ to calculate the probability of being "normal", choose a decision boundary $\epsilon$ (e.g. $\epsilon = 0.02$) and predict as anomalous the examples with $p(x) \le \epsilon$. Here $$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$ is the Gaussian distribution for a single feature value $x$, with mean $\mu = \frac{1}{m}\sum_{i = 1}^{m}x^{(i)}$ and variance $\sigma^2 = \frac{1}{m}\sum_{i = 1}^{m}(x^{(i)}-\mu)^2$ estimated per feature from the training data.
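A minimal sketch of this per-feature Gaussian model in Python, with made-up training data and a purely hypothetical decision boundary:

```python
import numpy as np

# Toy training data: assumed to contain only "normal" examples (m x n).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, 10.0], scale=[1.0, 2.0], size=(500, 2))

mu = X_train.mean(axis=0)        # per-feature means
sigma2 = X_train.var(axis=0)     # per-feature variances

def p(x):
    """Product of the univariate Gaussian densities over all features."""
    return np.prod(
        np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2),
        axis=-1,
    )

epsilon = 0.02                    # decision boundary, to be tuned on a CV set
x_new = np.array([5.2, 25.0])     # second feature is far from the training data
print("p(x) =", p(x_new), "-> anomaly" if p(x_new) <= epsilon else "-> normal")
```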

How to choose $\epsilon$?
You can use the cross validation set, which should also hold examples of anomalies, to find an appropriate value. Please also make sure that you have some anomalies left for your test data.

How to measure the algorithm?
As we work here with skewed classes (there are many more positive, i.e. normal, examples than negative ones), accuracy is not the right way to measure the quality of the algorithm. Instead use the number of True/False Positives/Negatives or derived statistical measures like Precision, Recall or the $F_1$-Score.
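For illustration, a minimal sketch that computes precision, recall and the $F_1$-Score from hypothetical confusion counts, where the rare anomaly class is treated as the class of interest:

```python
# Hypothetical counts: tp = anomalies correctly flagged, fp = normal examples
# wrongly flagged, fn = anomalies that were missed.
tp, fp, fn = 8, 4, 2

precision = tp / (tp + fp)   # how many flagged examples are real anomalies
recall = tp / (tp + fn)      # how many real anomalies were flagged
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
```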

How to come up with features?
If an anomaly is not clearly distinguishable from the normal examples, try to find a new property that, in combination with the existing features, distinguishes the normal from the anomalous examples. Prefer features that take on very small or very large values in the event of an anomaly.


A generalization of the Gaussian Distribution Algorithm is the Multivariate Gaussian Distribution Algorithm. As the algorithm mentioned above yields axis-aligned acceptance areas and cannot capture correlations between features, it does not detect all anomalies sufficiently well. Here the multivariate Gaussian distribution is a useful tool to further improve the algorithm above. However, the price is a more expensive calculation:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$ where $|\Sigma|$ is the determinant of the covariance matrix $\Sigma$.
Here the acceptance areas depend on the values of the covariance matrix $\Sigma$ and are in general not axis aligned, so correlations between features can be captured.
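A minimal sketch of the multivariate variant, using numpy and scipy on made-up, correlated training data (the value of $\epsilon$ is again just a hypothetical choice):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy training data with strongly correlated features; the per-feature model
# above could not capture this correlation.
rng = np.random.default_rng(1)
X_train = rng.multivariate_normal(mean=[0.0, 0.0],
                                  cov=[[2.0, 1.8], [1.8, 2.0]],
                                  size=500)

mu = X_train.mean(axis=0)
Sigma = np.cov(X_train, rowvar=False)    # n x n covariance matrix

model = multivariate_normal(mean=mu, cov=Sigma)
epsilon = 0.01
# Each value alone looks normal, but the combination breaks the correlation.
x_new = np.array([1.5, -1.5])
print(model.pdf(x_new), "-> anomaly" if model.pdf(x_new) <= epsilon else "-> normal")
```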

Principal Component Analysis


Typical Data Science problems have a huge amount of input features; they are often a deciding factor for model performance and make it difficult to visualize and structure the data. Therefore a reduction of the input features is often helpful or even necessary. The central questions here are: Which are the important features? Which ones can be dropped without losing too much information?
There are quite a few techniques to address these questions.

Classical techniques addressed the problem in two ways:
Backward elimination creates a model with all features and then repeatedly deletes the feature whose removal raises the error the least. Forward selection goes the other way round: it adds features stepwise, always choosing the feature that improves the model the most.
As a combination of the two techniques above there is stepwise regression: after adding a feature, a check is executed to decide which features can be deleted without increasing the error.
In all mentioned approaches the main problem is to decide when the algorithm should stop. Another problem is that all variables are considered independent, which in general they are not.


More modern techniques select subsets of features by filters or wrappers. These techniques allow insights into relations between variables, but they are quite computationally intensive (note that the number of subsets is substantially bigger than the number of features!).


I want to present another algorithm, the Principal Component Analysis, using the Singular Value Decomposition of the covariance matrix:
Assume that you have an $n$-dimensional training data set of size $m$ that is mean normalized and feature scaled; every data point $x = (x_1, ..., x_n)$ has $n$ components.
In the first step you calculate the covariance matrix $$\Sigma =  \frac{1}{m} \sum_{i=1}^{m} x^{(i)} (x^{(i)})^T $$ which is an ($n$ x $n$)-matrix. Then determine the decomposition $$\Sigma = U S V^T$$ with unitary matrices $U$ and $V$ and a diagonal matrix $S$ (use a predefined algorithm for that).
From the resulting matrix $U$ choose only the first $k < n$ columns to form a matrix $U_{k}$.

Then use $$z = U_{k}^T x$$ to obtain $k < n$ new features $z = (z_1, ..., z_k)$ instead of the existing $n$ ones.

To get back from $z$ to the original parameters $x$ use the following approximation:
$$x \approx x_{approx} = U_{k} z$$ which only works well if $k$ is properly chosen.

How should $k$ be chosen?
You can choose $k$ as the smallest value such that the average squared projection error divided by the total variation in the data is below a boundary $\epsilon$, e.g. $\epsilon = 0.01$ to say that "99% of variance is retained":

$$\frac{\frac{1}{m} \sum_{i=1}^{m} || x^{(i)} - x_{approx}^{(i)} ||^2}{\frac{1}{m} \sum_{i=1}^{m} ||x^{(i)}||^2 } \le \epsilon$$
There is an even faster, direct way to determine $k$, using the diagonal matrix $S$:
Choose the smallest value for $k$ so that
$$1 -  \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^n S_{ii}} \le \epsilon$$ or equivalently $$ \frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^n S_{ii}} \ge 1-\epsilon$$
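Here is a minimal sketch of the whole procedure in Python on made-up data, using numpy's SVD and picking $k$ via the diagonal of $S$:

```python
import numpy as np

# Toy data: 5 features generated from only 2 underlying factors plus a little
# noise, so a small k should retain almost all of the variance.
rng = np.random.default_rng(2)
m = 200
latent = rng.normal(size=(m, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(m, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # mean normalization + feature scaling

Sigma = (X.T @ X) / m                      # n x n covariance matrix
U, S, Vt = np.linalg.svd(Sigma)            # S holds the diagonal entries S_ii

# smallest k retaining at least 99% of the variance
retained = np.cumsum(S) / np.sum(S)
k = int(np.argmax(retained >= 0.99)) + 1

U_k = U[:, :k]                             # first k columns of U
Z = X @ U_k                                # compressed representation (m x k)
X_approx = Z @ U_k.T                       # approximate reconstruction (m x n)
print(k, Z.shape, X_approx.shape)
```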


Notes: 
- PCA should be applied to the training set to determine the parameters of the mapping from $x$ to $z$. However the reduced matrix $U_{k}$ can later still be used for the data in the cross validation and test sets.
- PCA should not be used to address overfitting (information is thrown away without using the existing labels); the better solution is regularization with a parameter $\lambda$.
- PCA is often not needed, so the recommended approach is to first try to run an algorithm without PCA and only apply it, if it is really needed (visualization, performance).

Data Scientists - Skills


It is obvious that a data scientist should have an interest in data, a strong analytical and mathematical background and a good intuition for numbers. A profound basis in statistics (the more the better), insight into the tools (Excel is a very powerful and useful tool!) and a good deal of endurance to handle complex processes and work on large data sets are needed as well.
However these skills are not the only essential competencies of a data scientist; it is also important that a data scientist understands the details of the applied algorithms in order to decide how to further optimize models, performance and processes. Experience with performance optimization algorithms, distributed computation and programming is of great help. A good data scientist is able to structure and prioritize, to distinguish between useful and useless information and to work with people from different departments. Further indispensable abilities are good communication and presentation skills towards a wide audience; the task of helping the business with decisions on complex business processes is only fulfilled if the business actually understands and gets to "see" the results.
Last but not least, a data scientist should always be open to new ideas, as the models and the requirements change regularly.

The Data Mining Process


The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process model that describes the different steps data scientists use for analyzing data. Here are some more details on the different steps:

In the Business Understanding Phase, you describe the current situation and the requirements from a business point of view; here the business defines what it wants to accomplish with the data mining process. This includes not only a business project plan, a list of risks and measures, costs and benefits, limitations and terminologies, but also business success criteria.
In the Data Understanding Phase, the business requirements are translated into a data science problem. Here you describe the problem in technical terms, define inputs and outputs, detect dependencies and create a dynamic technical project plan. Here you also describe the data mining success criteria in a simple and easily understandable way. This could be a goal like the creation of BI-reports to learn from the historical data, cluster models and segmentations for unsupervised/undirected learning, or a prediction, risk estimation or classification for supervised/directed learning.
In both of these steps, the requirements should be described as concretely as possible.

The next phase is the Data Preparation Phase, in which an initial data set is collected from documented tables in documented locations, described (missing fields, amounts of data, relations, ...), explored (first findings, simple associations, initial hypotheses, ...) and verified (is the collected data correct/useful/intuitive?). Here you also transform class categories into numerical values, aggregate suitable datasets, and clean and format data (e.g. shuffle data for neural networks).

The Modeling Phase stands in close contact with the Data Preparation Phase. In it the available data is split into training, test and cross validation sets. Starting with the initially retrieved data you define new fields (e.g. an area from width and length), and datasets from different tables are combined to create an analytical record (or DNA) of a target object (also called entity, e.g. a customer) that holds all the data needed. Here you also identify outliers, select and sample data to be used in a model and analyze groupings of data. You decide on the model to use and on the features (delete irrelevant features, include new features).

In the Evaluation Phase you analyze the results of the model, its accuracy and robustness, the error and optimization opportunities. With these results a discussion round with the business is needed in order to verify that the results from the evaluated data meet the business requirements and that all important insights and goals have been included in the analysis. Usually more than one round is needed.

In the Deployment Phase the model is deployed, so that the functionality can be handed over to and taken care of by the business. This could be in the form of a BI-report or the implementation of programs that allow a regular update of the data.

As the models are created for a certain point in time, a certain business goal and on historic data, the data mining process has to be adapted and updated with new data.

Neural Networks - Basics

What are neural networks?

Neural networks are algorithms that take input data and adapt the parameters of an internal (invisible) model so that it works well on the known data and (hopefully) delivers good results on new data.
Neural networks are not new; the first approaches were introduced as early as 1943 by Warren McCulloch and Walter Pitts under the name of "Threshold Logic" models. In the '80s the effective backpropagation algorithm was introduced, which spread the usage of neural networks. Still, large-scale applications were not practical for a long time, as they required huge datasets and storage. This is why neural networks only got popular in the last 10 years, when those requirements no longer presented any difficulties.


What do neural networks look like?

If you see pictures of neural networks, they will usually look like this:


The round shapes are called nodes; they are arranged in different columns called layers.
- the first layer (on the left) is called the input layer. It determines the number of input parameters.
- the last layer (on the right) is called the output layer, which holds the calculated results of the algorithm.
- the columns in the middle are called hidden layers (invisible from outside).

For a neural network you have the input x (here x is a vector with two components) that is passed to the input layer, mapped through the network and creates an output (here a vector with two components) that in the ideal case is a good approximation of the known output y. On the way through the network, every input component is multiplied with parameters (weights) and enriched with a constant bias; the combination is then passed to a node, in which the activation function is applied.
From the error of the prediction with the chosen parameters the algorithm can calculate deltas and adapt each parameter, so that in the next run the prediction will get closer to the actual known result.
So the algorithm learns by minimizing the error.
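To make this concrete, here is a minimal sketch in Python of a single forward pass through a small network with two inputs, one hidden layer and two outputs; the weights are random toy values and the sigmoid activation is just one common choice:

```python
import numpy as np

rng = np.random.default_rng(3)

x = np.array([0.5, -1.2])                      # input vector with two components

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)  # weights and bias of the hidden layer
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)  # weights and bias of the output layer

def sigmoid(z):
    """A common activation function applied inside each node."""
    return 1.0 / (1.0 + np.exp(-z))

a1 = sigmoid(W1 @ x + b1)                      # hidden layer activations
y_hat = sigmoid(W2 @ a1 + b2)                  # output of the network

# Training (backpropagation) would now compare y_hat to the known output y,
# compute the error and adapt W1, b1, W2, b2 to reduce it.
print(y_hat)
```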

An example of supervised learning is the recognition of handwriting: by feeding the system with lots of letter sets in different handwriting styles and providing the correct result, the system can find substructures and compare them to a new dataset. Then it can calculate probabilities that the input data matches one of the known letters and propose a solution. This often cited example is used as an introductory program in Google's Machine Learning platform TensorFlow, which is a very powerful tool for the interested community.

k-Means - how to choose k

After you have understood how the k-Means-Algorithm works, you will be wondering:
Into how many clusters should I divide the data, or in other words, how should I choose k?

Unfortunately there is no general approach to choose k.
Consider the following data:
If I choose k=2 the result could look like this:
If I choose k=3, it could look like that:
The higher you choose k, the smaller the distortion will be (except when the algorithm gets stuck in some local minimum of the distortion). So the lowest distortion is achieved by creating a cluster for every point. But this would obviously not solve any problem. So when should I stop?
This depends a lot on the business scenario. In general you will be able to estimate bounds for k from the description of the usage of your analysis; typical examples are market segmentations in which the requirement is e.g. to find different customer profiles, so k is usually a small number (2-7).
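For illustration, here is a minimal sketch using scikit-learn on made-up data: it runs k-Means for several values of k and prints the distortion (called inertia_ in scikit-learn), which keeps shrinking as k grows:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with three roughly separated groups.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # distortion for this choice of k

# The distortion always decreases with k; the business context (e.g. how many
# customer profiles are useful) decides where to stop.
```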


k-Means - How to initialize

When you study the k-Means-Algorithm and understand how it works, a natural question arises:

How can you choose the starting points in order to get the best fit into k clusters?

In order to answer the question, it is important to define, what a "good fit" means. In the context of k-Means the best fit is the distribution of the data into k different clusters so that the sum of the distances to the corresponding cluster centers (the so called "distortion") is minimal.

For up to 10 clusters the common way to find a good fit is the following:
Choose k different points of the dataset randomly as initial cluster centers. Then run the algorithm and calculate the distortion mentioned above.

Run the k-Means-Algorithm often (100-1000 times, provided that you have enough data points) and choose the run with the lowest distortion.


With this approach you can also handle the situation in which no point is assigned to a cluster center (in this case you could randomly initialize the center again or simply remove it).
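This restart strategy is also what scikit-learn's KMeans implements: with init="random" it picks k observations as initial centers, repeats the whole run n_init times and keeps the fit with the lowest distortion (inertia_). A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data with two well separated groups.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2))
               for c in ([0, 0], [3, 3])])

# 100 random restarts; the best run (lowest distortion) is kept.
best = KMeans(n_clusters=2, init="random", n_init=100, random_state=0).fit(X)
print(round(best.inertia_, 1))
print(best.cluster_centers_)
```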



The k-Means Algorithm - Basics

Unsupervised Learning tries to find structures in datasets; one method to do so is clustering the data. The most popular and widely used algorithm for this purpose is the k-Means-Algorithm.
k thereby stands for the number of clusters into which the data to be analyzed is to be divided. Let's start with a 2-dimensional example and imagine that we have the following set of data:
Suppose we want to cluster this data into different groups. At first sight we see two separate clusters:
So this is what we want to achieve for a general set of data; therefore we use the k-Means-Algorithm with k = 2. Now how does it work?

First of all, initialize the cluster centers: choose two random points in the data area (for a better choice see the post on initialization above), say the two points indicated by an "X":
Now run the following two steps in a loop:
1. Assign each point to the cross that lies nearest to it (in general this is not unique, in such cases just select one)

2. Move each cross to the mean (average) of the points assigned to it
Going on with step 1...
... and step 2 ...
...brings us into this situation:
Now further steps will not change the assignments or the cluster centers anymore => the k-Means-Algorithm has converged.

Note that the algorithm continuously improves the sum of the distances between the centers and the data points by choosing the assignments with the smallest distance (step 1) and moving the centers (step 2). However, the result of a converged k-Means run does not have to be the same for every run (it depends on the starting points).
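The two-step loop is easy to write down; here is a minimal sketch in Python/numpy on made-up data (it assumes every center keeps at least one assigned point, see the initialization post above for how to handle the empty case):

```python
import numpy as np

# Toy data with two groups, and K = 2 as in the example above.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2))
               for c in ([0, 0], [3, 3])])

K = 2
centers = X[rng.choice(len(X), size=K, replace=False)]   # random initialization

for _ in range(20):
    # Step 1: assign each point to the nearest center.
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 2: move each center to the mean of its assigned points.
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centers, centers):                 # nothing changes anymore
        break
    centers = new_centers

print(centers)
```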


Two questions arise, which are addressed in the posts above: how should k be chosen, and how should the cluster centers be initialized?




Machine Learning

What does "machine learning" mean, how can a computer "learn"?



The term "learning" is used in this context to refer to the fact that a machine uses new data in order to better up previous results. For sure the machine will not develop a brain or a physically similar organ. But for optimization or minimizing tasks, especially tasks including complex calculations and transformations, a machine controlled adaption is a useful and often indispensable tool. The basics for such tasks are often machine learning algorithms like artificial neural networks or clustering algorithms which run iterative trying to minimize errors in each calculation step.

Machine learning algorithms can usually be divided into two main groups:
  • Supervised learning algorithms, in which the output is known and the rules are trained by using the input and trying to minimize the error between the predicted output and the given result. Neural networks are here the most promising examples; they are called like that as they reflect the way our brains work: given some input, the human brain learns by trying out and correcting until it finds a good rule to explain the result.
  • Unsupervised learning algorithms, in which data is given to an algorithm which then tries to find patterns in the data. As an example think about astronomical data: if you can cluster the stellar data you could find a structure in it and learn about the past and future.
In addition, modern algorithms can also be active learning algorithms, in which the input can be completed by additional requests for input while trying to minimize the number of these additional requests.

What are these algorithms needed for and why are they considered promising?

Machines have the capacity to calculate fast and storage is cheap nowadays; also in precision they are unbeatable and they can often parallelize their tasks. As nearly unlimited data is available and more data often leads to better results (not always: an intelligent way of sorting out data is one of the reasons for having a data scientist!), machine-supported analysis and computer visualizations have become essential. In addition, cloud computing and distributed systems improve the way data is collected, loaded and analyzed.
Internet of Things scenarios are considered the modern way to improve business processes and drive Industry 4.0 by collecting huge amounts of (sensor) data. To analyze those datasets machines are not only helpful, but necessary.