"""Clustering & Clustering Algorithms"""

February 06, 2021

What is Clustering?

Clustering is the task of identifying groups of similar data points in a dataset. It is one of the most popular techniques in data analysis, and it is an unsupervised learning technique: it works without labeled examples.

A simple case where clustering can be useful:

Imagine you own a shop and want to understand your customers' preferences in order to improve your profit margin. It is not feasible to study the details, buying habits, and patterns of each customer and plan a separate business strategy for each of them. Instead, you can group your customers into, say, 5 groups based on their purchasing habits and use a separate strategy for the customers in each of these 5 groups. This is what is called clustering.

There are many clustering algorithms. Here we will look at two of the most popular ones.

K-means Clustering:

The following diagram shows K means clustering operation on mixed data points.

K-means clustering partitions the observations into k clusters. It requires

a) defined distance metric

b) number of clusters

c) initial guess as to cluster centroids

It is an unsupervised model.

k-means clustering is a method of vector quantization.

K-means is not deterministic, because it starts from an initial guess that is often random. It is also iterative: it repeats its steps over a number of iterations until the cluster assignments stop changing.

K-means can be used for clustering problems and for feature learning.

K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

The goal of K-means is to minimize a squared error cost function, where the error is the distance between each data point and the centroid of its cluster.
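As a small illustration of this cost, the sketch below sums the squared Euclidean distances from each point to its assigned centroid. The function name squared_error_cost and the sample data are hypothetical, chosen only for this example.

```python
def squared_error_cost(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Hypothetical data: two clusters of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
centroids = [(1.0, 2.0), (9.0, 8.0)]
labels = [0, 0, 1, 1]  # index of the centroid each point is assigned to

cost = squared_error_cost(points, centroids, labels)
print(cost)
```

A lower cost means the points sit closer to their centroids, which is what each K-means iteration tries to achieve.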

The K-means squared error function is based on Euclidean distance.

In each iteration of K-means, we need a way to find the nearest centroid to each item in the dataset. One of the simplest ways to calculate the distance between two feature vectors is to use Euclidean distance. The Euclidean distance between two vectors like [p1, q1] and [p2, q2] is equal to:

d = sqrt((p1 - p2)^2 + (q1 - q2)^2)
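A minimal sketch of this step in Python might look like the following. The helper names euclidean_distance and nearest_centroid are assumptions made for this example, not part of any particular library.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: euclidean_distance(point, centroids[i]))


distance = euclidean_distance((0.0, 0.0), (3.0, 4.0))
closest = nearest_centroid((2.0, 2.0), [(0.0, 0.0), (10.0, 10.0)])
print(distance, closest)
```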

K-means clustering is a non-hierarchical clustering method.
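Putting the pieces together, here is a rough sketch of the standard iterative procedure (often called Lloyd's algorithm) in plain Python. It assumes squared Euclidean distance as the metric, takes the number of clusters k as an argument, and uses k randomly sampled data points as the initial guess for the centroids. The function name kmeans, its parameters, and the sample data are illustrative choices only.

```python
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial guess: k data points picked at random
    labels = None
    for _ in range(max_iters):
        # Assignment step: give each point the index of its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Hypothetical data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Changing the seed changes the initial centroids. On harder data than this toy example, that can lead to a different final clustering, which is why K-means is usually run several times.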

Hierarchical clustering:

Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into one cluster, and this step is repeated. The algorithm terminates when only a single cluster is left.
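The merging procedure can be sketched in a few lines of Python. This is a naive version for illustration only: it assumes Euclidean distance between points and measures the closeness of two clusters by their two closest members (single linkage), which is just one of several possible choices. The function name agglomerative and the num_clusters parameter are hypothetical.

```python
import math


def agglomerative(points, num_clusters=1):
    """Merge the two nearest clusters until num_clusters remain.

    Returns (clusters, merges), where clusters holds lists of point indices
    and merges records each merge as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Hypothetical data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
full_clusters, merges = agglomerative(points)              # run to a single cluster
two_clusters, _ = agglomerative(points, num_clusters=2)    # stop early at 2 clusters
print(two_clusters)
```

The recorded merges, together with their distances, are the information a dendrogram displays. Stopping early at a chosen number of clusters corresponds to cutting that tree.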

Hierarchical clustering is useful because we do not have to define the number of clusters in advance, as we do with K-means.

Hierarchical clustering groups data over a variety of scales by creating a cluster tree, or dendrogram, typically using Euclidean distance.

The algorithm described above follows a bottom-up approach. It is also possible to follow a top-down approach.

The decision to merge two clusters is based on how close they are. There are multiple metrics for measuring the closeness of two clusters:

Euclidean distance
Squared Euclidean distance
Manhattan distance
Maximum distance
Mahalanobis distance
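
For reference, the metrics listed above can be written as short Python functions. This is a sketch with hypothetical function names. The Mahalanobis version assumes the caller supplies the inverse covariance matrix of the data as a nested list (inv_cov), since estimating it is outside the scope of this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum(a, b):
    # Also known as Chebyshev distance: the largest coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))


def mahalanobis(a, b, inv_cov):
    # sqrt(diff^T * inv_cov * diff), with inv_cov given by the caller.
    diff = [x - y for x, y in zip(a, b)]
    n = len(diff)
    total = sum(diff[i] * inv_cov[i][j] * diff[j]
                for i in range(n) for j in range(n))
    return math.sqrt(total)


a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]  # with this matrix, Mahalanobis equals Euclidean
print(euclidean(a, b), squared_euclidean(a, b), manhattan(a, b),
      maximum(a, b), mahalanobis(a, b, identity))
```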

Difference between K Means and Hierarchical clustering:

Hierarchical clustering can’t handle big data well, but K-means clustering can. This is because the time complexity of K-means is linear in the number of data points, i.e. O(n) for a fixed number of clusters and iterations, while that of hierarchical clustering is at least quadratic, i.e. O(n^2).
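To get a feel for the difference, the back-of-the-envelope sketch below counts distance computations. It assumes K-means evaluates one distance per point, per centroid, per iteration, and that hierarchical clustering needs at least the distance between every pair of points. The function names and the chosen values of n, k, and the iteration count are hypothetical.

```python
def kmeans_distance_evals(n, k, iterations):
    """Distance computations for K-means: one per point, centroid, and iteration."""
    return n * k * iterations


def pairwise_distance_count(n):
    """Number of distinct point pairs, the minimum needed for hierarchical clustering."""
    return n * (n - 1) // 2


n = 10_000
kmeans_cost = kmeans_distance_evals(n, k=5, iterations=20)
hierarchical_cost = pairwise_distance_count(n)
print(kmeans_cost, hierarchical_cost)
```

Doubling n doubles the first count but roughly quadruples the second.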

In K-means clustering, since we start with a random choice of cluster centroids, the results produced by running the algorithm multiple times might differ, while the results of hierarchical clustering are reproducible.

K-means is found to work well when the clusters are hyperspherical (like a circle in 2D or a sphere in 3D).

K-means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, however, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram.

Applications of Clustering:

Recommendation engines
Market segmentation
Social network analysis
Search result grouping
Medical imaging
Image segmentation
Anomaly detection
