What is K-Means Clustering?

A friendly introduction to one of the most widely used unsupervised machine learning algorithms.

3 min read · Jan 4, 2022

Before we look into K-Means Clustering and how it can be implemented in Python, let’s brush up on the basics.

What is Unsupervised Machine Learning and how does it differ from Supervised Machine Learning?

Unsupervised Learning is where we have only input data (X) and no corresponding target variable (y). In contrast, in Supervised Learning we have both input data (X) and a corresponding target variable (y).

In Unsupervised Learning, we look at how the data points are distributed and try to group similar ones together. There is no known outcome to predict from the data.
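To make the distinction concrete, here is a tiny sketch. The data, the feature meanings, and the variable names are all made up for illustration; the only point is that supervised data pairs each row of X with a value in y, while unsupervised data has X alone.

```python
# Hypothetical toy data: each row of X is [height_cm, weight_kg].
X = [[150, 50], [160, 60], [180, 85], [185, 90]]

# Supervised learning: every row of X comes with a known target value in y.
y = ["small", "small", "large", "large"]
labeled_pairs = list(zip(X, y))

# Unsupervised learning: only X is available, so any grouping has to be
# discovered from the rows themselves.
unlabeled_rows = X

print(labeled_pairs[0])   # one row together with its label
print(unlabeled_rows[0])  # the same row, with no label attached
```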

Some of the common types of unsupervised learning include:

Clustering
Dimensionality Reduction
Association

Here is what you need to know about Clustering:

Clustering is the process of grouping a set of data points in such a way that the points in the same group, called a cluster, are more similar to each other than to those in other groups.

In simple words, data points with strikingly similar properties are placed in the same cluster, and the remaining clusters are formed in the same way.

How does K-Means Clustering work?

K-Means Clustering is an unsupervised machine learning algorithm that uses a proximity (distance) measure to group unlabeled data into different clusters.

‘K’: the number of clusters
‘Means’: the centroid of a cluster, i.e. the average of the data points in it
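As a small illustration of the ‘Means’ part, the sketch below computes a centroid as the coordinate-wise average of a cluster's points, and then measures how close another point is to it. The helper name `centroid` and the sample points are hypothetical, and Euclidean distance is assumed as the proximity measure.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


# Hypothetical cluster of three 2-D points.
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
center = centroid(cluster)
print(center)  # [3.0, 4.0]

# Assumed proximity measure: Euclidean distance from a point to the centroid.
print(math.dist([0.0, 0.0], center))  # 5.0
```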

It is an iterative algorithm that divides the data points into ‘K’ different clusters.
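One common way to carry out this division is the standard assign-then-update loop (often called Lloyd's algorithm). The version below is a minimal sketch in plain Python, not a production implementation. The function name `k_means`, the `max_iters` and `seed` parameters, the random choice of starting centroids, and the sample points are all assumptions made for illustration.

```python
import math
import random


def k_means(points, k, max_iters=100, seed=0):
    """Cluster `points` (a list of equal-length numeric lists) into k groups."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None

    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]

    return centroids, labels


# Hypothetical data: two visibly separate groups of 2-D points.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting centroids, so the label numbers themselves carry no meaning.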

The main objectives of this algorithm are to:

Increase the variation between clusters, i.e. the inter-cluster distance (the distance between the centroids of different clusters) should be as large as possible.
Reduce the variation within each cluster, i.e. the intra-cluster distance (the distance between the data points of the same cluster) should be as small as possible.
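Both quantities can be measured on a toy example. In the sketch below, the clusters are hypothetical and hand-picked. Intra-cluster variation is measured here as the sum of squared Euclidean distances from each point to its own centroid (one common convention, assumed for this example), and inter-cluster distance as the Euclidean distance between the two centroids.

```python
import math


def centroid(points):
    """Return the coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


# Hypothetical result of a clustering run: two clusters of 2-D points.
clusters = [
    [[1.0, 1.0], [1.0, 3.0]],
    [[7.0, 1.0], [7.0, 3.0]],
]
centroids = [centroid(c) for c in clusters]  # [[1.0, 2.0], [7.0, 2.0]]

# Intra-cluster variation: squared distance from every point to its own
# centroid, summed over all clusters. Smaller means tighter clusters.
intra = sum(
    math.dist(p, c) ** 2
    for members, c in zip(clusters, centroids)
    for p in members
)

# Inter-cluster distance: how far apart the two centroids are.
# Larger means better separated clusters.
inter = math.dist(centroids[0], centroids[1])

print(intra)  # 4.0
print(inter)  # 6.0
```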