# Introduction to K-means Clustering

What is clustering?

Clustering is an unsupervised learning technique. It is used to group different data points based on similar features or characteristics.

For example, a company wants to know which users it should show a particular ad to so that the chances of it being clicked increase. Suppose you have clustered all existing users along with the ads each group mostly clicks. With the help of these clusters, you can find which group a new user falls into and estimate the probability that they will click that particular ad.

K-means Clustering (Flat clustering):

As the name suggests, K-means has something to do with mean values, and the k represents the number of clusters. K-means finds k clusters such that each data point is closer to the mean of its own cluster than to the mean of any other cluster.

The above graph has 3 clusters; now suppose you encounter a new data point and need to assign it to a group. How will you do this? You look at all the cluster means and choose the one closest to the data point.
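The nearest-mean rule can be sketched in a few lines; the means and the new point below are made-up values for illustration, not the article's data:

```python
import numpy as np

# hypothetical cluster means and a new data point (illustrative values)
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0]), np.array([0.0, 5.0])]
new_point = np.array([4.5, 4.0])

# squared Euclidean distance from the new point to each mean;
# the point joins the cluster whose mean is closest
distances = [((new_point - m) ** 2).sum() for m in means]
cluster = distances.index(min(distances))
```

Here the point lands in the cluster whose mean is nearest under squared Euclidean distance, which is exactly the rule the predict function below implements.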

K-means Algorithm:

Use the steps below to implement your own k-means clustering:

• Randomly pick k initial mean values.
• Assign each data point to a cluster by seeing which mean is closest to it.
• Compute the new mean of each cluster.
• Repeat from step 2; stop after a fixed number of iterations or when the cluster assignments no longer change.

Now let's write the fit function on simple data in order to understand what is happening inside it. The scatter plot of the data after standardization is given below:
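As an aside, the standardization mentioned above is typically a z-score transform (zero mean, unit variance per feature). A minimal sketch on made-up 2-D data, not the article's actual dataset:

```python
import numpy as np

# made-up 2-D data with features on very different scales (illustrative only)
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 250.0], [4.0, 400.0]])

# z-score standardization: subtract the per-feature mean, divide by the per-feature std
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
```

Without this step, the feature with the larger numeric range would dominate the squared-distance computations used throughout k-means.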

Implementing the fit function:

```python
import numpy as np

def fit(data, k=2, max_iter=100):
    means = []
    for i in range(k):
        means.append(data[i])  # initializing the means with the first k data points
    for i in range(max_iter):
        clusters = []
        for j in range(k):
            clusters.append([])
        for point in data:
            # squared distance of the point to every cluster's mean
            distance = [((point - m) ** 2).sum() for m in means]
            min_dis = min(distance)  # finding the minimum of them
            l = distance.index(min_dis)  # index of the closest cluster
            clusters[l].append(point)  # appending the point to that cluster
        change = False  # for the stopping criterion
        for j in range(k):
            new_m = np.average(clusters[j], axis=0)  # new mean of each cluster
            if not np.array_equal(new_m, means[j]):
                change = True
            means[j] = new_m
        if not change:  # no mean moved, so the assignments have converged
            break
    return means
```

Defining our own predict function:

```python
def predict(data, means):
    predictions = []
    for point in data:
        distance = [((point - m) ** 2).sum() for m in means]
        min_dis = min(distance)
        l = distance.index(min_dis)
        predictions.append(l)
    return predictions
```

Output after applying the above algorithm:

The yellow and purple colors represent the two clusters, and the light blue markers represent the mean of each group.

Conclusion:

This was one of the basic algorithms for clustering. Still, one question remains: how do we determine the optimal number of clusters for particular data, i.e., the optimal value of k? For this, we have silhouette analysis. The upcoming article will explain this analysis in detail.