How To Use The K-Means Algorithm To Find Groups In Data With Python

By Mr. Data Science

Let’s say we are data scientists working for a retail company and our boss wants to run a targeted marketing campaign. To focus the campaign, we have to divide the set of customers into smaller subsets based on the features in our customer dataset. Features are just the columns in the dataset, and each row represents a unique customer. As the data science team, our job is to find those groups.

This task is different from many other machine learning tasks in that we don’t have any labelled data, so we can’t use supervised learning techniques (supervised learning algorithms such as support vector machines and naive Bayes classifiers require labelled training data). Instead, we can turn to unsupervised algorithms, like the K-means clustering algorithm.

Imagine a simple dataset with just three features (three dimensions). We can plot each point, each customer, on x, y and z axes. K-means works like this: we first decide on a value for K, say K=5. The algorithm places 5 centroids (points within our three-dimensional space) at random, calculates the Euclidean distance from each centroid to each data point, and assigns every point to its nearest centroid; all the points closest to centroid 1 belong to group 1, and so on. It is an iterative process: starting from the random centroids, each centroid is moved to the mean of the data points assigned to its cluster, the distances are recalculated and the assignments updated, repeating until convergence, that is, until nothing changes after a recalculation. K-means can of course handle data with many more than three features.
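To make the assign-and-update loop concrete, here is a minimal from-scratch sketch using NumPy. It is only meant to illustrate the idea (it doesn’t handle empty clusters or multiple restarts); in the examples below we use scikit-learn’s far more robust implementation.

import numpy as np

def simple_kmeans(X, k, n_iters=100, seed=0):
    """Bare-bones K-means: assign each point to its nearest centroid,
    then move each centroid to the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    # start from k randomly chosen data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)  # index of the nearest centroid for each point
        # move each centroid to the mean of its assigned points
        # (a production implementation would also handle clusters that end up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids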

You may have noticed a potential problem with this approach: the K-means algorithm requires us to input a value for K, but we don’t know how many groups there should be, so how can we decide on a value for K? If K is too low, the groups will lack the focus required for the marketing campaign; if K is too high, we may end up with overlapping groups and some groups that are almost empty. To determine a value for K we will use two different approaches, the elbow method and the silhouette method. These are not the only ways to choose K, but they are commonly used [1].

Another potential issue is the data itself: different numerical features can have very different ranges. Standardizing the features can improve the algorithm’s effectiveness [2], so when we come to a real-world dataset in the third example we will also look at a process called feature scaling.

In order to sub-divide a dataset using K-means, you will need to install the following Python libraries: NumPy, pandas, Matplotlib, scikit-learn and Yellowbrick.
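If they are not already installed, a single pip command covers them all (assuming a standard pip setup; adjust as needed if you use conda or a virtual environment):

pip install numpy pandas matplotlib scikit-learn yellowbrick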

To explore the K-means algorithm, let’s look at a few examples:

In this first example we’ll investigate what happens if we choose an inappropriate value for K. To do this we’ll generate some artificial data so that we know what K should be, then plot the results and see how the algorithm behaves when we choose a value that’s too low or too high.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
%matplotlib inline

The make_blobs function in the scikit-learn library will generate some data for us:

X, y = make_blobs(n_samples=200, centers=6, random_state=1)

We can plot this data and check that it has generated the clusters:

plt.rcParams.update({'figure.figsize':(10,7.5), 'figure.dpi':100})
plt.scatter(X[:, 0], X[:, 1]);
[Figure: scatter plot of the generated data points]

So the make_blobs function has created the clusters. Now let’s fit the K-means algorithm to the data using K=6 since we told the function to create 6 clusters.

kmeans = KMeans(n_clusters=6)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)

We can plot the groups together with the calculated centroids:

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);
[Figure: points coloured by K-means assignment with K=6; centroids shown as red circles]

The red transparent circles are the centroids, and they seem reasonable. Let’s try this again but set K=3:

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);
[Figure: points coloured by K-means assignment with K=3; centroids shown as red circles]

The group near the top of the plot now includes about 50% of the data points. When we choose a value for K that is too low we fail to capture all of the possible groups; in other words, we lose the focus required for the marketing campaign.

Next let’s see what happens when we try to divide the data into 12 groups.

kmeans = KMeans(n_clusters=12) 
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);
[Figure: points coloured by K-means assignment with K=12; centroids shown as red circles]

When we choose a value for K that is too high we end up with overlapping groups and some groups that are almost empty. This example illustrates visually why it is important to choose a well-balanced value for K. While there is no single correct value for K, we do need one that strikes a balance between being too general and too specific. In the next example we will look at a couple of methods we can use to generate candidate values for K.

We’ll use a library called Yellowbrick to find a value for K and visualize the process, using two approaches: the elbow method and the silhouette method.

Once we have our groups we can measure the sum of the squared distances from each point to its group centroid and think of it as a kind of error: we want each group’s points to be tightly packed around its centroid, so we want this error to be as low as possible. If we plot this error against K we get a curve that generally decays as K increases, but there is usually a point beyond which the decrease in error for each additional cluster becomes small. That point is our best value for K. It is called the elbow method because the curve bends sharply there, like an elbow.
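Yellowbrick automates this, but the underlying calculation is easy to do by hand: fit K-means for a range of K values and record the within-cluster sum of squared distances, which scikit-learn exposes as the inertia_ attribute. A rough sketch (the random_state is fixed only for repeatability):

inertias = []
k_values = range(2, 12)
for k in k_values:
    km = KMeans(n_clusters=k, random_state=1)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(list(k_values), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Within-cluster sum of squares (inertia)')
plt.show()

Yellowbrick’s KElbowVisualizer produces the same kind of plot and marks the elbow for us: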

model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X)
visualizer.poof();
[Figure: elbow plot from KElbowVisualizer, distortion score against K]

In this case the elbow method is giving us a value of K=5. There are alternatives: we can also pass a different metric to the visualizer; see the Yellowbrick documentation for the options available.

model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12), metric='calinski_harabasz', timings=False)

visualizer.fit(X)
visualizer.poof();
[Figure: elbow plot from KElbowVisualizer using the calinski_harabasz metric]

This also suggests a value of K=5. Next, let’s try the silhouette method:

The silhouette value is a measure of cohesion and separation, and it ranges from -1 to +1. For each point it compares the mean distance to the other points in its own group (a) with the mean distance to the points in the nearest other group (b): s = (b - a) / max(a, b). A value near +1 indicates the point is well matched to its own group and poorly matched to neighboring groups, a value near 0 means it sits on the boundary between two groups, and a negative value suggests it may have been assigned to the wrong group.

We can use a for-loop to calculate the average silhouette score for a range of K values; the K value that gives the highest score is a strong candidate.

range_n_clusters = list(range(3, 9))
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(X)
    centers = clusterer.cluster_centers_

    score = silhouette_score(X, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 3, silhouette score is 0.6090456488686238
For n_clusters = 4, silhouette score is 0.6229686987765262
For n_clusters = 5, silhouette score is 0.5503874661966188
For n_clusters = 6, silhouette score is 0.47317820875716576
For n_clusters = 7, silhouette score is 0.42455127231234846
For n_clusters = 8, silhouette score is 0.42719717893585374

In this case we are looking for the highest silhouette score, and this approach suggests a value of K=4. It is not uncommon for the different methods to give slightly different results; remember there is no single correct value for K. In the next example we will apply all of this to a real-world dataset.

I am using a dataset from Kaggle that has data on 167 countries, though you can of course try K-means on any dataset you have access to. In this case let’s say we are working for an NGO that has money it wants to use to improve people’s lives in some countries. The money available is limited, so the NGO wants to focus on just a few countries. In real life we would have to do a full analysis of the dataset to identify which countries have the most need, but for this tutorial we will do just the first part of that analysis: dividing the countries into groups.

We can import the data into a Pandas dataframe and look at the first few rows of the dataset:

df = pd.read_csv('Country-data.csv')
df.head()
[Table: first five rows of the country dataset]

The dataset has ten columns and 167 rows and includes data on things like life expectancy, GDP per capita …

The dataset does not contain any nulls; you can check this with the following line of code:

df.isna().sum()
country       0
child_mort    0
exports       0
health        0
imports       0
income        0
inflation     0
life_expec    0
total_fer     0
gdpp          0
dtype: int64

Most machine learning algorithms don’t handle nulls well, so it is always worth checking before passing data into any algorithm.
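If there had been missing values we would need to deal with them before clustering. Two common quick fixes are sketched below; which one is appropriate depends on the data, and neither is needed here since there are no nulls:

# hypothetical examples only; this dataset has no missing values
df_dropped = df.dropna()                           # drop any row with a missing value
df_filled = df.fillna(df.mean(numeric_only=True))  # or fill numeric gaps with the column means

Next let’s look at a summary of the data: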

df.describe()
[Table: summary statistics from df.describe()]

Note the min, max and mean values for the different features (columns). Income goes up to hundreds of thousands whereas life expectancy is less than 100. If we use the data as it currently is, the clustering process may be dominated by one or two features. We can fix this with a technique called feature scaling, which rescales the different features into numerically similar ranges. Because K-means relies on distances, leaving the variances unequal is equivalent to putting more weight on certain features; in this case, for example, income would count for far more than life expectancy.

The country name is a string and is not relevant to the clustering process, so I will drop that feature and then perform the feature scaling.

df_no_country = df.drop(['country'],axis=1)
scaler = StandardScaler()
X_std = scaler.fit_transform(df_no_country)
X_std[0:2] #look at the first two 'rows'
array([[ 1.29153238, -1.13827979, 0.27908825, -0.08245496, -0.8082454 ,
0.15733622, -1.61909203, 1.90288227, -0.67917961],
[-0.5389489 , -0.47965843, -0.09701618, 0.07083669, -0.3753689 ,
-0.31234747, 0.64786643, -0.85997281, -0.48562324]])

The scaler returns a NumPy array; you can see the first two rows of this array in the output above.
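If you want to confirm what StandardScaler is doing, the same numbers can be reproduced by hand: each value is the original minus the column mean, divided by the column’s population standard deviation. A quick sanity check, reusing the df_no_country dataframe from above:

manual = (df_no_country - df_no_country.mean()) / df_no_country.std(ddof=0)
print(np.allclose(manual.values, X_std))  # True: StandardScaler standardizes with the population std (ddof=0)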

Next we need to estimate a possible value for K. This time we are using real data, so we have no idea in advance what K should be. We’ll start by calculating the silhouette score for a range of possible K values.

range_n_clusters = list(range(2, 9))
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(X_std)  # fit to the scaled values rather than the raw values
    centers = clusterer.cluster_centers_

    score = silhouette_score(X_std, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
For n_clusters = 2, silhouette score is 0.2873566892140671
For n_clusters = 3, silhouette score is 0.28329575683463126
For n_clusters = 4, silhouette score is 0.29595170577528157
For n_clusters = 5, silhouette score is 0.30088229124112015
For n_clusters = 6, silhouette score is 0.22799367131601095
For n_clusters = 7, silhouette score is 0.24807915771876124
For n_clusters = 8, silhouette score is 0.20409970851973838

The silhouette method suggests a value of K=5, so let’s also get a K value using the elbow method.

model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,8), metric='calinski_harabasz', timings=False)

visualizer.fit(X_std)  # fit to the scaled country data, not the synthetic X from the first example
visualizer.poof();
[Figure: elbow plot for the scaled country data]

The elbow method also suggests K=5, so let’s fit the algorithm to the scaled data with K=5.

kmeans = KMeans(n_clusters=5, random_state=111)
kmeans.fit(X_std)
pd.Series(kmeans.labels_).value_counts()
1    85
2    46
3    30
4     3
0     3
dtype: int64

So using a value of K=5 produces two almost empty groups. Just out of curiosity, what happens if we use K=4?

kmeans = KMeans(n_clusters=4, random_state=111)
kmeans.fit(X_std)
pd.Series(kmeans.labels_).value_counts()
1    88
2    46
3    30
0     3
dtype: int64

We now have one almost empty group, and group 1 is slightly larger. If we really were employed by an NGO we would have to do a lot more investigating at this stage to decide exactly which grouping works best, but that is beyond the scope of this tutorial.

There is one more thing we have to do. We have sub-divided the dataset into smaller groups but which countries are in which groups? We need to create a new column in our original dataset showing the group labels:

preds = kmeans.labels_
kmeans_df = pd.DataFrame(df)
kmeans_df['Labels'] = preds
kmeans_df.head(15)
[Table: first 15 rows of the dataframe with the new Labels column]

So now we can see which country is in which group. The K-means algorithm is easy to use, and with just a few lines of code it divides the data into natural groups that help us gain a much deeper understanding of our dataset.
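If we want to list the actual country names in each group, a simple filter on the labelled dataframe does the job. A quick sketch using the Labels column we just added (only the first few names per group are printed here):

for label in sorted(kmeans_df['Labels'].unique()):
    countries = kmeans_df.loc[kmeans_df['Labels'] == label, 'country'].tolist()
    print("Group {}: {} countries, e.g. {}".format(label, len(countries), countries[:5]))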

To sum up, the K-means algorithm enabled us to take some data and sub-divide it based on nine numerical variables. This process has real-world applications in business, the life sciences and many other fields. I highly encourage you to try it out for yourself.

  1. Kodinariya, T., Review on determining number of Clusters in K-Means Clustering, date retrieved: 02/22/2021. https://www.researchgate.net/profile/Biniam_Kefiyalew/post/In-cluster-sampling-method-On-what-basis-we-calculate-the-number-of-clusters-to-be-selected/attachment/5f8726bc7600090001ee620b/AS%3A946576348442624%401602692796307/download/Review_on_determining_number_of_Cluster_2.pdf
  2. Usman, D., Standardization and Its Effects on K-Means Clustering Algorithm, date retrieved: 02/22/2021. https://www.researchgate.net/profile/Dauda_Usman2/publication/288044597_Standardization_and_Its_Effects_on_K-Means_Clustering_Algorithm/links/56b5f9b908aebbde1a79bce7/Standardization-and-Its-Effects-on-K-Means-Clustering-Algorithm.pdf

MrDataScience.com, GitHub, Medium

I’m just a nerdy engineer that has too much time on his hands and I’ve decided to help people around the world learn about data science!
