Into how many clusters should I divide the data, or in other words, how should I choose k?
Unfortunately, there is no general approach for choosing k.
Consider the following data:
If I choose k=2 the result could look like this:
If I choose k=3, it could look like this:
The higher you choose k, the smaller the distortion will be (except when the algorithm gets stuck in a local minimum of the distortion). The lowest possible distortion is therefore reached by creating a cluster for every point. But this would obviously not solve any problem. So when should I stop?
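To make this effect visible, here is a minimal sketch in plain Python. It is not a reference implementation: the helper names (`sq_dist`, `kmeans_distortion`), the three made-up blobs of sample data, and the choice of 20 random restarts are all my own assumptions for illustration. Distortion is taken here to mean the sum of squared distances from each point to its nearest centroid.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_distortion(points, k, restarts=20, iters=50, seed=0):
    """Run a basic k-means several times from random starting centroids
    and return the lowest distortion found (hypothetical helper)."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(restarts):
        # Start from k distinct data points as initial centroids.
        centroids = rng.sample(points, k)
        for _ in range(iters):
            # Assignment step: attach every point to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                clusters[nearest].append(p)
            # Update step: move each centroid to the mean of its members.
            # An empty cluster keeps its previous centroid.
            new_centroids = [
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[i]
                for i, members in enumerate(clusters)
            ]
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
        distortion = sum(min(sq_dist(p, c) for c in centroids) for p in points)
        best = min(best, distortion)
    return best

# Made-up sample data: three blobs of 10 points each (an assumption).
data_rng = random.Random(42)
blob_centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
points = [
    (data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
    for cx, cy in blob_centers
    for _ in range(10)
]

# Distortion for a few values of k, up to one cluster per point.
distortions = {k: kmeans_distortion(points, k) for k in (1, 2, 3, len(points))}
for k, d in distortions.items():
    print(f"k={k:2d}  distortion={d:.3f}")
```

With one cluster per point, every point is its own centroid, so the distortion drops to zero even though nothing useful has been learned about the data. The restarts are only there to reduce the chance of reporting a poor local minimum; they do not guarantee the best possible result for a given k.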
This depends a lot on the business scenario. In general, you will be able to estimate bounds for k from the description of how your analysis will be used. A typical example is market segmentation, where the requirement is, for example, to find different customer profiles, so k is usually a small number (2-7).