k-Means - how to choose k

Once you have understood how the basics of the k-means algorithm work, you will be wondering:
Into how many clusters should I divide the data, or in other words, how should I choose k?

Unfortunately, there is no general approach for choosing k.
Consider the following data:

If I choose k=2 the result could look like this:

If I choose k=3, it could look like this:

The higher you choose k, the smaller the distortion will be (unless the algorithm gets stuck in a local minimum of the distortion). So the lowest distortion is reached by creating a cluster for every point. But this would obviously not solve any problem. So when should I stop?
This depends a lot on the business scenario. In general, you will be able to estimate bounds for k from the description of how your analysis will be used. Typical examples are market segmentations in which the requirement is, e.g., to find different customer profiles, so k is usually a small number (2-7).
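To see the effect of k on the distortion for yourself, here is a minimal sketch in plain Python. It is not taken from any particular library. I assume that "distortion" means the sum of squared distances from each point to its nearest centroid, and the data set (three made-up blobs of 10 points each) as well as the helper names kmeans and best_distortion are only illustrative. Because a single run can get stuck in a local minimum, the sketch keeps the best of a few restarts.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50, seed=0):
    """Basic k-means (Lloyd's algorithm). Returns (centroids, distortion)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    for _ in range(iterations):
        # Assignment step: put every point into the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                mean = tuple(sum(coord) / len(members) for coord in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = new_centroids
    distortion = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, distortion


def best_distortion(points, k, restarts=10):
    """Lowest distortion over several random starts (guards against local minima)."""
    return min(kmeans(points, k, seed=s)[1] for s in range(restarts))


# Illustrative data (an assumption, not the data from the figures): three blobs.
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
    for cx, cy in blob_centers
    for _ in range(10)
]

distortions = {k: best_distortion(points, k) for k in (1, 2, 3, len(points))}
for k, d in distortions.items():
    print(f"k={k:2d}  distortion={d:.3f}")
```

With k equal to the number of points, every point becomes its own centroid and the distortion drops to zero, which is exactly the useless extreme described above. The interesting values of k are the ones in between, and that is where the business requirement has to narrow things down.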

Modern Data Analysis


Mirko
