What is K-Means Clustering?

A friendly introduction to one of the most widely used unsupervised machine learning algorithms.

Harsha S
3 min read · Jan 4, 2022
Image by the author

Before we look into K-Means Clustering and how it can be implemented in Python, let’s brush up on the basics.

What is Unsupervised Machine Learning, and how does it differ from Supervised Machine Learning?

Unsupervised Learning is where we have only input data (X) and no corresponding target variable (y). In contrast, in Supervised Learning we have input data (X) and a corresponding target variable (y).

In Unsupervised Learning, we look at the distribution of the data points and try to group similar ones. There is no labelled outcome to predict from the data.
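
To make the difference concrete, here is a minimal sketch using NumPy. The values in X and y are invented purely for illustration. A supervised model would be given both arrays, while an unsupervised method only gets X and has to work from how the points are spread out, for example from the distances between them.

```python
import numpy as np

# Toy input data: 6 points with 2 features each (values invented for illustration).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]])

# Supervised learning would also receive a target value for every row of X.
y = np.array([0, 0, 0, 1, 1, 1])

# Unsupervised learning only sees X. Here we simply look at how far apart
# the points are: entry [i, j] is the Euclidean distance between point i and point j.
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
print(pairwise.round(1))
```

In the printed matrix, the first three points are close to each other and far from the last three, which is the kind of structure a clustering algorithm tries to pick up without ever looking at y.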

Some common types of unsupervised learning include:

  • Clustering
  • Dimensionality Reduction
  • Association

Here is what you need to know about Clustering:

Clustering is the process of grouping a set of data points in such a way that points in the same group, called a cluster, are more similar to each other than to those in other groups.

In simple words, data points with similar properties are placed in the same cluster. The other clusters are formed in the same way.

Photo by Greyson Joralemon on Unsplash

How does K-Means Clustering work?

K-Means Clustering is an unsupervised machine learning algorithm that uses a proximity (distance) measure to group unlabeled data into different clusters.

‘K’: the number of clusters

‘Means’: the centroid of a cluster, i.e. the mean of its data points
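
As a small illustration of the ‘Means’ part, the sketch below computes a centroid with NumPy. The three points are made up, and we simply assume they already belong to one cluster.

```python
import numpy as np

# Three points assumed to belong to the same cluster (made-up values).
cluster_points = np.array([[1.0, 2.0],
                           [2.0, 4.0],
                           [3.0, 6.0]])

# The centroid is the feature-wise mean of the points in the cluster.
centroid = cluster_points.mean(axis=0)
print(centroid)  # [2. 4.]
```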

It is an iterative algorithm that divides the data points into ‘K’ different clusters.

The main objective of this algorithm is to:

  • Increase variation between the clusters, i.e. the inter-cluster distance (the distance between the centroids of the clusters) should be as large as possible.
  • Reduce variation within each cluster, i.e. the intra-cluster distance (the distance between the data points of the same cluster) should be as small as possible.
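
One common way to put a number on the variation within clusters is the within-cluster sum of squares: the squared distance from each point to the centroid of its own cluster, summed over all points. The sketch below is only an illustration. The helper name within_cluster_ss and the toy data are assumptions, not part of any library.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to the centroid of its own cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Four toy points: two on the left, two far away on the right.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good = np.array([0, 0, 1, 1])  # nearby points share a cluster
bad = np.array([0, 1, 0, 1])   # far-apart points share a cluster

print(within_cluster_ss(X, good))  # 4.0
print(within_cluster_ss(X, bad))   # 100.0
```

The grouping that keeps nearby points together gets the much smaller score, which is what the algorithm is trying to achieve.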

Steps involved in K-Means Clustering:

  • Step 1:

Choose a distance measure and arbitrarily choose the value of ‘K’.

  • Step 2:

Randomly pick ‘K’ points as cluster centroids.

  • Step 3:

Assign every data point to its closest centroid. This forms the initial clusters.

  • Step 4:

Take each data point in sequence and compute its distance from the centroid of each of the clusters.

  • Step 5:

If a data point is not assigned to its nearest centroid, reassign it to the cluster of the closest one.

  • Step 6:

After each pass, recompute every centroid as the mean of the data points assigned to it. Centroids stop moving once they sit at the center of the data points in their cluster. Repeat steps 4 and 5 until the centroids converge.
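
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration and not a production implementation: it assumes Euclidean distance, picks the initial centroids from the data points themselves, and the function name kmeans and its parameters max_iters and seed are chosen here just for the example.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)

    for _ in range(max_iters):
        # Steps 3 to 5: compute the distance from every point to every centroid
        # and assign each point to the nearest one.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Move each centroid to the mean of its assigned points
        # (a centroid with no points keeps its old position).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels

# Usage on two well-separated toy groups of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(10.0, 0.5, size=(50, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids.round(1))
```

Because the starting centroids are random, different seeds can give different results on harder data, which is why K-Means is often run several times with different initializations.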

Visual illustration of the steps:

Distribution of the data points
Centroids are placed arbitrarily among the data points
Centroids are updated such that they lie at the center of the cluster
After a few iterations, centroids now lie at the center of each cluster

Try it yourself:

Implementing K-Means from scratch using make_blobs:

Import the necessary libraries:
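
A plausible set of imports for this walkthrough, assuming NumPy for the arithmetic, Matplotlib for the plots and scikit-learn for make_blobs:

```python
import numpy as np                           # array maths
import matplotlib.pyplot as plt              # plotting
from sklearn.datasets import make_blobs      # synthetic cluster data
```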

Generate the data using make_blobs:
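
A minimal sketch of this step. The specific values (300 samples, 3 centers, 2 features, random_state=42) are assumptions picked for illustration, so feel free to change them.

```python
from sklearn.datasets import make_blobs

# X holds the coordinates of the points, y holds the blob each point was drawn from.
# All parameter values below are illustrative assumptions.
X, y = make_blobs(n_samples=300, centers=3, n_features=2,
                  cluster_std=1.0, random_state=42)

print(X.shape, y.shape)  # (300, 2) (300,)
```

Note that y is only available because the data is synthetic. K-Means itself never uses it.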

Plot the values of X:
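
One way to do this with Matplotlib is sketched below. It regenerates the data so that it runs on its own, using the same assumed make_blobs settings, and saves the figure to a file. The file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Illustrative data (parameter values are assumptions).
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Scatter plot of the raw, unlabeled points: first feature vs. second feature.
fig, ax = plt.subplots()
scatter = ax.scatter(X[:, 0], X[:, 1], s=20)
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("Unlabeled data points")
fig.savefig("blobs_unlabeled.png")
```

In a notebook you can drop the matplotlib.use("Agg") line and the savefig call and let the figure render inline.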

Plot the values of X against the y values:
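
A sketch of the same plot with each point colored by its y value, i.e. the blob it was generated from. As before, the make_blobs settings and the output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Illustrative data (parameter values are assumptions).
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Same scatter plot as before, but colored by the blob index in y.
fig, ax = plt.subplots()
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap="viridis", s=20)
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("Data points colored by y")
fig.savefig("blobs_colored.png")
```

This picture is a useful reference: a good K-Means result on this data should recover groups that look much like these colors.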

Thanks for reading this article. For more such content follow me on Medium.
