"""Clustering & Clustering Algorithms"""
What is Clustering?
Clustering is the task of identifying groups of similar data points in a dataset. It is one of the most popular techniques in data analysis, and it is an unsupervised learning technique: clustering algorithms discover structure without labelled data.
A simple case where clustering can be useful:
Imagine you own a shop and want to understand the preferences of your customers to improve your profit margin. It is not feasible to look at the details, buying habits, and patterns of each customer and plan a separate business strategy for each of them. Instead, you can cluster all of your customers into, say, 5 groups depending on their purchasing habits and use a separate strategy for the customers in each of these 5 groups. This is what is called clustering.
There are many clustering algorithms; here we will look at the most popular ones.
K-means Clustering:
[Diagram: K-means clustering applied to a set of mixed data points]
K-means follows a partitioning approach, dividing the observations into k clusters. It requires:
a) a defined distance metric
b) the number of clusters, k
c) an initial guess for the cluster centroids
It is an unsupervised model.
k-means clustering is a method of vector quantization.
K-means is not deterministic: its result depends on the initial choice of centroids, and it runs over a number of iterations.
The K-means algorithm can be used both for clustering problems and for feature learning.
K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (centroid).
The K-means cost function is a sum of squared errors, where each error is the distance between a data point and its cluster centroid; the goal is to minimize this function.
The K-means squared error function is based on the Euclidean distance.
In each iteration of K-means, we need a way to find the nearest centroid to each item in the dataset. One of the simplest ways to calculate the distance between two feature vectors is the Euclidean distance. The Euclidean distance between two vectors [p1, q1] and [p2, q2] is: sqrt((p1 - p2)^2 + (q1 - q2)^2).
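This distance is straightforward to compute in Python; the following is a minimal sketch (the function name is illustrative):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# For [1, 2] and [4, 6]: sqrt((1-4)^2 + (2-6)^2) = sqrt(9 + 16) = 5.0
print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```

The same function works for vectors of any dimension, since it sums the squared differences coordinate by coordinate.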
K-means clustering is also called non-hierarchical clustering.
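Putting the pieces together, here is a minimal self-contained sketch of the standard K-means iteration (Lloyd's algorithm): assign each point to its nearest centroid, then recompute each centroid as the mean of its cluster. The names and toy data are illustrative. Note that real K-means implementations usually pick the initial centroids at random, which is why the algorithm is not deterministic; to keep this sketch reproducible, it simply uses the first k points as the initial guess.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm sketch: assign, then update, until centroids stop moving."""
    # Initial guess: first k points (real implementations typically choose at random).
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of 2-D points:
data = [(1.0, 1.0), (1.5, 2.0), (1.2, 1.1), (8.0, 8.0), (8.5, 9.0), (7.9, 8.2)]
centroids, clusters = kmeans(data, k=2)
# Each centroid ends up near the mean of one group.
```

On this toy data the algorithm converges in a few iterations, with one centroid near each group of points.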
Hierarchical clustering:
Hierarchical clustering is useful when we do not want to fix the number of clusters in advance, as K-means requires.
Hierarchical clustering groups data over a variety of scales by building a cluster tree, or dendrogram, typically using Euclidean distance.
The algorithm can follow a bottom-up (agglomerative) approach, repeatedly merging the closest clusters starting from individual points, or a top-down (divisive) approach, repeatedly splitting clusters.
The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters:
- Euclidean distance
- Squared Euclidean distance
- Manhattan distance
- Maximum distance
- Mahalanobis distance
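As an illustrative sketch of the bottom-up (agglomerative) approach, the following pure-Python code repeatedly merges the two closest clusters, measuring closeness with single linkage (the Euclidean distance between the closest members of the two clusters). All names and the toy data are made up for the example.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Closeness of two clusters = distance between their closest members."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerative(points, target_clusters=1):
    """Bottom-up clustering: start with singletons, repeatedly merge the closest pair."""
    clusters = [[p] for p in points]
    merges = []  # the sequence of merges encodes the dendrogram's structure
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

data = [(1.0, 1.0), (1.2, 1.1), (8.0, 8.0), (8.5, 9.0)]
clusters, merges = agglomerative(data, target_clusters=2)
# The two nearby pairs merge first, leaving one cluster per group.
```

Stopping at a target number of clusters corresponds to cutting the dendrogram at a chosen level; running until one cluster remains records the full tree in `merges`.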
Difference between K-means and Hierarchical clustering:
- K-means requires the number of clusters, k, to be chosen in advance; hierarchical clustering does not.
- K-means is not deterministic, since it depends on the initial centroids; hierarchical clustering with a fixed linkage metric is deterministic.
- K-means produces a flat partition of the data; hierarchical clustering produces a cluster tree (dendrogram) that can be cut at any level.
Applications of Clustering:
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection


