k-Means - How to initialize

When you study the k-Means algorithm and understand how it works, a natural question arises:

How can you choose the starting points in order to get the best fit into k clusters?

To answer this question, we first have to define what a "good fit" means. In the context of k-Means, the best fit is the partition of the data into k clusters for which the sum of the squared distances from each point to its cluster center (the so-called "distortion") is minimal.
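
As a sketch, the distortion can be written as a formula. The symbols are assumptions made here for illustration: x_i is the i-th of n data points, c(i) is the index of the cluster it is assigned to, and mu_j is the center of cluster j. The distance is taken to be the squared Euclidean distance, which is the usual choice for k-Means.

```latex
J = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c(i)} \right\rVert^{2}
```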

For up to 10 clusters, the common way to find the best fit is the following:
Randomly choose k different points of the dataset as initial cluster centers. Then run the algorithm and calculate the distortion defined above.

Repeat this many times (100 to 1000 runs, provided that you have enough data points), each time with a new random initialization, and keep the run with the lowest distortion.
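
The procedure described above could be sketched in plain Python roughly as follows. This is a minimal illustration, not a reference implementation: the function names (`kmeans_once`, `best_of_restarts`), the parameters `n_runs`, `max_iter` and `seed`, and the use of squared Euclidean distance are all assumptions made for this example. Points are assumed to be tuples of numbers.

```python
import random


def squared_distance(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    # For each point, the index of its nearest center.
    return [
        min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
        for p in points
    ]


def distortion(points, centers, labels):
    # Sum of squared distances of all points to their assigned centers.
    return sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))


def kmeans_once(points, k, rng, max_iter=100):
    # Initialization: k different data points, chosen at random.
    centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # Simplification of this sketch: an empty cluster keeps its old center.
                new_centers.append(centers[j])
        centers = new_centers
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
    return centers, labels, distortion(points, centers, labels)


def best_of_restarts(points, k, n_runs=100, seed=0):
    # Run k-Means n_runs times and keep the run with the lowest distortion.
    rng = random.Random(seed)
    best = None
    for _ in range(n_runs):
        result = kmeans_once(points, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

A call such as `centers, labels, d = best_of_restarts(points, k=3, n_runs=200)` would then return the centers, the cluster index of each point, and the distortion of the best run.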

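Choosing the initial centers from the data points also largely avoids the situation in which no point is assigned to a cluster center, because every center starts on an actual data point. If a cluster ends up empty anyway, you can either randomly re-initialize its center or remove it.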
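
The two options for an empty cluster (re-initialize or remove) might look like this in Python. The helper `fix_empty_clusters` and its `strategy` parameter are hypothetical names introduced for this sketch. It assumes `labels[i]` holds the index of the center assigned to `points[i]`.

```python
import random


def fix_empty_clusters(points, centers, labels, rng, strategy="reinit"):
    # Indices of centers that have at least one point assigned.
    used = set(labels)
    if strategy == "reinit":
        # Replace every empty center by a randomly chosen data point.
        return [c if j in used else rng.choice(points) for j, c in enumerate(centers)]
    if strategy == "remove":
        # Drop empty centers; the result has fewer than k clusters, and the
        # labels must be recomputed afterwards because the indices shift.
        return [c for j, c in enumerate(centers) if j in used]
    raise ValueError("strategy must be 'reinit' or 'remove'")
```

Re-initializing keeps the number of clusters at k, while removing accepts a solution with fewer clusters. Which one is preferable depends on whether exactly k clusters are required.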

Modern Data Analysis


Mirko
