Introduction to K-means Clustering

digital particle mesh technology background design

What is clustering? What is k-means? This article will answer these questions. Apart from all this, we will also learn more about K-means clustering and its implementation by defining K-means fit function.

What is clustering?

Clustering is an unsupervised learning technique. It is used to group different data points based on similar features or characteristics.

For example, A company wants to know to whom they should display a particular ad such the chances of clicking it increases. Now suppose you have all the user’s clusters with the ads each group mostly clicks. With the help of these clusters, you can examine the group of the new user and check the probability of clicking that particular ad.

K-means Clustering (Flat clustering):

As the name suggests, K-means is something to do with the mean values, and k here represents the number of clusters. What k-means do is that if we have the final k clusters, then the distance of the data point from the cluster’s mean to which it belongs is minimum.

source: https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning

The above graph has 3 clusters; now, you encounter a new data point and assign a group to it. How will you do this? The answer is you will see all the clusters(means) and choose the one closest to this data point.

K-means Algorithm:

Use the steps below to implement your own k-means classifier:

  • Let’s randomly pick k mean values
  • Assign each data point a cluster by simply seeing which means is closest to the data point.
  • Find the new mean values of each cluster.
  • Now repeat this from step 2, and you can stop after a number of iterations, or there is no change in assigning the clusters.

Now let’s write the fit function using simple data in order to understand what is happening inside the fit function. The scatter plot of the data after standardization is given below:

Implementing fit function:

def fit(data,k = 2,max_iter = 100):
    means = []
    for i in range(k):
        means.append(data[i])   #initializing the mean with the first k data points
    for i in range(max_iter):
        clusters = []
        for j in range(k):
            clusters.append([])
        for point in data:
            distance = [((point - m)**2).sum() for m in means] #distance of the point with every cluster's mean
            min_dis = min(distance) #fining the minmum if them
            l = distance.index(min_dis) #cluster's index
            clusters[l].append(point)   #appending that point to the cluster 
        change = False   #for the stopping criteria
        for j in range(k):
            new_m = np.average(clusters[j],axis = 0) #calculating the new mean of each cluster
            if not np.array_equal(new_m,means[j]):
                change = True
            means[j] = new_m
        if not change:
            break
    return means

Defining own predict function:

def predict(data,means):
    predictions = []
    for point in data:
        distance = [((point - m)**2).sum() for m in means]
        min_dis = min(distance)
        l = distance.index(min_dis)
        predictions.append(l)
    return predictions

Output after applying the above algorithm:

Yellow and purple color represents the two clusters, and the light blue color represents the mean value of both groups.

Conclusion:

This was one of the basic algorithms for clustering. But still, there is a question, how will we determine the optimal number of clusters for a particular data, i.e., the optimal value of k? For this, we have something named K-means Silhouette analysis.  The upcoming article will have a detailed explanation of this analysis.

Thank you for the read! I hope it was helpful.