K-Means

Vegeedog
Algorithm notes
Published in
2 min readMay 7, 2022

--

Intro: Given a data set with N data points, for which we would like to find K >0 clusters. K-Means is an algorithm to find reasonable cluster centers and even an assignment of each data point to its cluster.

Algorithm: Define rₙₖ = 1 if data point n is supposed to belong to cluster k, and 0 otherwise. The goal is to minimize the following objective function over μₖ and rₙₖ.

Objective function

To do this, basically two steps.

  1. Minimize J for fixed μₖ. It can clearly be done by simply setting

2. Minimize J for fixed assignments over µk. This can be done by setting each µₖ to the center (i.e., the center of mass or the average) of all the input points belonging to it according to the current assignment.

In short, first step: assign each data point to its cluster (of which center is closest to the data point), and second step: re-calculate the center of cluster (based on the current assignment)

The K-Means algorithm starts with given initial centers and repeats these two steps until a stopping criterion is fulfilled.

Code (Python):

Implement ‘my_kmeans’ function, taking the

  1. original dataset
  2. initial cluster centers (would be initialized by k_means++)
  3. maximum number of iterations

as input arguments, and then output K cluster centers.

Sine K-Means is sensitive to initialization, we normally use k-means++ algorithm to give a good initialization of K cluster centers, and then proceed regular k-means.

Here we directly use the “kmeans_plusplus” function provided by sklearn.cluster

Run on toy dataset:

Test the code above to a toy dataset, which is plotted below.

Run with different number of clusters K = 2,3,4,5. Result as below.

The complete code, including creating dataset & running k-means.

--

--