The k-Means Algorithm - Basics

Unsupervised Learning tries to find structures in datasets, one method to do so is by clustering the data. The most popular and widely used algorithm therefore is the k-Means-Algorithm.
k thereby stands for the number of clusters in which the data to be analyzed is to be devided. Lets start with an 2-dimensional example, imagine that we have the following set of data:

Suppose we want to cluster this data into different groups. On the first sight we see two separate clusters:

So this is what we want to achieve for general set of data, therefore we use the k-Means-Algorithm with K = 2. Now how does it work:

First of all initialize the data, therefore we choose two random points in the test area (for a better choice see this post), lets say those two points indicated by a "X":

Now run the following two steps in a loop:

1. Assign each point to the cross that lies nearest to it (in general this is not unique, in such cases just select one)

2. Move each cross to the center of the average (means) of the assigned points

Going on with step 1...

... and step 2 ...

...brings us into this situation:

Now further steps will not change the assignment of the cluster centers anymore => the k-Means-Algorithm converged.

Note that the algorithm is continuously improving the sum of the distances between the center and the data points by choosing the assignments with the smallest distance (step 1) and moving the centers (step 2). However the assignments on a converged k-Means-Algorithms do not have to be the same for every run (depending on the starting points).

Two questions arise (click to see the answer):

How do I initialize the data?
How should I choose k?

Modern Data Analysis

The k-Means Algorithm - Basics

Mirko

0 Comment