The Data Mining Process


The Cross Industry Standard Process for Data Mining (CRISP-DM) is a process model that describes the steps data scientists follow when analyzing data. Here are some more details on the individual steps:

In the Business Understanding Phase, you describe the current situation and the requirements from a business point of view. Here the business defines what it wants to accomplish with the data mining process. This includes not only a business project plan, a list of risks and measures, costs and benefits, limitations and terminology, but also the business success criteria.
In the Data Understanding Phase, the business requirements are translated into a data science problem. Here you describe the problem in technical terms, define inputs and outputs, detect dependencies and create a dynamic technical project plan. You also describe the data mining success criteria in a simple and easily understandable way. This could be a goal like the creation of BI reports to learn from historical data; cluster models or segmentations for unsupervised (undirected) learning; or a prediction, risk estimation or classification for supervised (directed) learning.
In both of the steps mentioned above, the requirements should be described as concretely as possible.

The next phase is the Data Preparation Phase, in which an initial data set is collected from documented tables in documented locations, described (missing fields, amounts of data, relations, ...), explored (first findings, simple associations, initial hypotheses, ...) and verified (is the collected data correct, useful and intuitive?). Here you also transform class categories into numerical values, aggregate suitable datasets, and clean and format the data (e.g. shuffle the records so that their order carries no information for a neural network).
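
To make the preparation steps a bit more tangible, here is a minimal sketch in plain Python. The records, the field names (`customer`, `segment`, `revenue`) and the helper functions are made up for illustration and are not part of CRISP-DM. The sketch counts missing fields, maps class categories to integer codes and shuffles the records.

```python
import random

# Hypothetical records; field names and values are invented for illustration.
records = [
    {"customer": "A", "segment": "bronze", "revenue": 120.0},
    {"customer": "B", "segment": "gold", "revenue": 980.0},
    {"customer": "C", "segment": "silver", "revenue": 310.0},
    {"customer": "D", "segment": "gold", "revenue": None},  # missing field
]


def describe_missing(rows):
    """Count missing (None) values per field."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (value is None)
    return counts


def encode_categories(rows, field):
    """Map each distinct category of `field` to an integer code."""
    categories = sorted({row[field] for row in rows})  # sorted for reproducibility
    mapping = {category: code for code, category in enumerate(categories)}
    encoded = [{**row, field: mapping[row[field]]} for row in rows]
    return encoded, mapping


def shuffle_rows(rows, seed=0):
    """Return a shuffled copy so the original order carries no signal."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


missing = describe_missing(records)
encoded, mapping = encode_categories(records, "segment")
prepared = shuffle_rows(encoded, seed=42)

print(missing)  # {'customer': 0, 'segment': 0, 'revenue': 1}
print(mapping)  # {'bronze': 0, 'gold': 1, 'silver': 2}
```

Note that plain integer codes suggest an order between the categories. Whether that is acceptable depends on the model chosen later; one-hot encoding is a common alternative.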

The Modeling Phase is closely linked to the Data Preparation Phase. In it, the available data is split into training, test and cross-validation sets. Starting from the initially retrieved data, you derive new fields (e.g. areas from width and length), and datasets from different tables are combined to create an analytical record (or DNA) of a target object (also called an entity, e.g. a customer) that holds all the data needed. Here you also identify outliers, select and sample the data to be used in a model, and analyze groupings of data. You decide on the model to use and on the features (delete irrelevant features, include new ones).
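
The following sketch illustrates three of these steps in plain Python: deriving an area from width and length, flagging outliers, and splitting the data into training, validation and test sets. The sample values, the 1.5 x IQR outlier rule and the 60/20/20 split are assumptions chosen for the example; the text above does not prescribe any of them.

```python
import random
import statistics

# Hypothetical measurements; the last row is an obvious outlier.
pairs = [(2, 3), (4, 5), (3, 3), (5, 2), (4, 4),
         (3, 6), (2, 2), (6, 3), (5, 5), (40, 50)]
rows = [{"width": w, "length": l} for w, l in pairs]

# Derive a new field from existing ones.
for row in rows:
    row["area"] = row["width"] * row["length"]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (an assumed rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [value < low or value > high for value in values]


def split(data, train=0.6, validation=0.2, seed=0):
    """Shuffle, then cut into training, validation and test sets."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


flags = iqr_outliers([row["area"] for row in rows])
clean = [row for row, is_outlier in zip(rows, flags) if not is_outlier]
train_set, validation_set, test_set = split(clean, seed=42)
```

Fixing the seed keeps the split reproducible, which makes it easier to compare models across the several rounds that are usually needed.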

In the Evaluation Phase you analyze the results of the model: its accuracy and robustness, its error, and the opportunities for optimization. With these results, a round of discussion with the business is needed to verify that the results meet the business requirements and that all important insights and goals have been included in the analysis. Usually more than one round is needed.
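
As a small illustration of the evaluation step, the sketch below computes accuracy and the confusion counts for a binary classifier and compares the result with a success criterion. The labels, the predictions and the threshold `required_accuracy` are hypothetical; in practice the criterion would come from the earlier phases.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

required_accuracy = 0.8  # assumed success criterion, for illustration only

acc = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
meets_requirement = acc >= required_accuracy

print(acc)                # 0.75
print(counts)             # {'tp': 3, 'fp': 1, 'tn': 3, 'fn': 1}
print(meets_requirement)  # False
```

Here the model misses the assumed threshold, which is the kind of result that would trigger another round with the business.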

In the Deployment Phase the model is deployed, so that the functionality can be handed over to and maintained by the business. This could take the form of a BI report or of programs that allow the data to be updated regularly.

As the models are created for a certain point in time, for a certain business goal and on the basis of historical data, the data mining process has to be adapted and updated as new data arrives.