Machine Learning with Python: K Means Clustering

Introduction

K Means clustering is a popular machine learning algorithm used for grouping data points into distinct clusters based on their similarities. This powerful technique is widely used in fields such as finance, marketing, and biology. K Means clustering is an unsupervised learning algorithm, which means it doesn’t require the input data to be labelled or pre-assigned with predefined output classes. Instead, it tries to partition the dataset into groups with minimal intra-cluster variance and maximal inter-cluster distance. In this blog post, we will explore the basic concepts of K Means clustering and understand how it works under the hood using Python and the Scikit-Learn library. We’ll also delve deeper into its practical implementation, illustrating different application scenarios and best practices that can help you solve real-world problems effectively. So let’s get started!

Use Cases for K Means Clustering

Let’s discuss some use cases for K-Means clustering:

• Customer Segmentation: K Means clustering can be used to segment customers based on their online behavior, preferences, and orders. By analyzing customer data such as purchase history and demographic information, businesses can use K Means clustering to group customers with similar characteristics into clusters. This allows companies to tailor their marketing efforts towards each cluster, which leads to more effective customer acquisition and retention.

• Image Compression: The K Means algorithm can be applied to image compression by clustering similar pixel colors together and replacing each pixel with its cluster’s centroid color. Reducing the number of unique colors in an image shrinks its size while retaining most of its visual quality (see the sketch after this list).

• Anomaly Detection: K Means Clustering can also detect anomalies in a given dataset. Data points that stray considerably from every centroid (the point representing a cluster’s center), or that fail to fall within a minimum distance threshold of any cluster, can be flagged as outliers.

• Text Clustering: K Means Clustering can group text documents into meaningful clusters, which is useful for tasks like document summarization, topic identification across large text corpora, and news sentiment analysis.

• Recommender Systems: With the help of K-Means Clustering, recommender systems can identify users with similar profiles and serve each group personalized content or product recommendations, often alongside collaborative filtering techniques that predict what you might like based on your previously preferred items.
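As an illustration of the image compression use case, here is a minimal color-quantization sketch. It assumes the image is an RGB NumPy array, and the function name `quantize_colors` and its `n_colors` parameter are our own choices, not a standard API:

import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(img, n_colors=16):
    # Flatten the (height, width, 3) image into a list of RGB pixels
    pixels = img.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    # Replace every pixel with the centroid color of its cluster
    quantized = km.cluster_centers_[km.labels_]
    return quantized.reshape(img.shape).astype(img.dtype)

Since every pixel now takes one of only `n_colors` values, the image can be stored far more compactly.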

How K Means Clustering Works

K Means Clustering is a popular unsupervised machine learning algorithm used for grouping similar data points together based on their proximity to each other. It helps identify underlying patterns or structures in unlabeled datasets.

Here’s how the K Means Clustering algorithm works:

1. Initialization: The first step is to select a value of ‘K’ (number of clusters) and randomly initialize ‘K’ centroids (a centroid is the center point of a cluster).

2. Assigning Data Points: Next, each data point in the dataset is assigned to its nearest centroid based on Euclidean distance calculation.

3. Updating Centroids: After all data points have been assigned, the mean (average) of each cluster’s data points is calculated and this will be the new position of the centroid.

4. Repeat Until Convergence: Steps 2 and 3 are repeated until there is no change in the assignment of data points to centroids or until some predefined threshold has been met. This means that we have achieved convergence – i.e., our clusters have stabilized.

5. Output Results: Once convergence has been reached, we can now use the model for prediction or classification of new data points based on which cluster’s centroid they are closest to.
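Putting these steps together, here is a minimal from-scratch sketch in NumPy. This is our own illustrative implementation rather than Scikit-Learn’s, and it glosses over details such as empty clusters and smarter initialization:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids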

To illustrate this further, here’s a simple example:

Assume we want to group emails into categories such as spam or non-spam, or work-related or personal. Emails can be represented as vectors with features derived from the sender, the subject line, the body content, and so on.

We could apply K Means clustering by specifying ‘K=2’ for two clusters – one representing spam email and another for non-spam email. We would then initialize two random centroids and assign each email to its closest centroid until convergence has been reached.

Once convergence has been achieved, all emails close enough to each other will belong in the same group and we can label them appropriately (spam vs non-spam). If desired, we could also visualize these groups using scatter plots or heat maps.
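To make the email example concrete, here is a small sketch that turns texts into TF-IDF vectors and clusters them with Scikit-Learn. The sample emails below are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

emails = [
    "WIN a FREE prize, claim your reward now!!!",
    "Meeting moved to 3pm, agenda attached",
    "Limited offer: FREE reward, click here",
    "Quarterly report draft attached for review",
]

# Convert each email into a TF-IDF feature vector
X_text = TfidfVectorizer().fit_transform(emails)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_text)
print(labels)  # two groups; inspecting them tells us which one looks like spam

Note that K Means only produces unlabeled groups; deciding which cluster is “spam” is still up to us.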

In summary, K Means Clustering works by iteratively assigning data points to their nearest centroids and updating those centroids until convergence is reached. It’s an effective algorithm for unsupervised learning tasks such as pattern recognition and anomaly detection, but it’s important to carefully consider the choice of ‘K’ before applying it to any given problem.

K Means Clustering Python Example

Here’s an example of how to perform K-Means Clustering in Python using the Scikit-Learn library, and how to visualize the results using Matplotlib.

First, let’s import the necessary libraries:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

Next, let’s generate some artificial data that we can use for our clustering analysis. We will use the `make_blobs` function from Scikit-Learn to generate random clusters:

# Generate random data with 3 clusters
X, y_true = make_blobs(n_samples=300, centers=3,
                       cluster_std=0.60, random_state=0)

We now have a 2D array `X` of shape (300, 2) containing our data points, with two features per point. We also have a matching array `y_true` which contains the true labels (i.e., which cluster each point in `X` belongs to).

Let’s take a quick look at our data by plotting it using Matplotlib:

plt.scatter(X[:, 0], X[:, 1], s=50)
plt.title('Data Before Clustering')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

This should give us a scatter plot of our data:

[Figure: scatter plot of the data before clustering]

Now we’re ready to run K-Means clustering on our dataset. We’ll create an instance of the `KMeans` class from Scikit-Learn and fit it to our data:

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

We’ve selected three clusters (`n_clusters=3`) because we know that is how many clusters we generated above using `make_blobs`. In practice, you would choose the number of clusters based on insight into your specific problem or by trying different values, as shown below.
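One common heuristic is the elbow method: fit K-Means for a range of `k` values and plot the inertia (the within-cluster sum of squared distances, exposed by Scikit-Learn as the `inertia_` attribute). The “elbow” where the curve stops dropping sharply suggests a reasonable `k`:

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 10), inertias, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()

For the blobs generated above, the curve should bend noticeably around k=3.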

Now that we have clustered our data points, let’s extract the fitted centroids and overlay them on a plot of the clusters. This takes just a few lines of code:

centers = kmeans.cluster_centers_

# Plot our clustered data with the centroids overlaid
fig, ax = plt.subplots(figsize=(9, 6))
scatter = ax.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='rainbow',
                     alpha=0.7)

ax.set_xlabel('$x$')
ax.set_ylabel('$y$')

# Draw the centroids larger and on top of the data points
centers_draws = ax.scatter(centers[:, 0], centers[:, 1], s=200, linewidths=4,
                           color='darkmagenta', edgecolor='black', zorder=10)

ax.legend((scatter, centers_draws), ('Original Data', 'Centroids'))

plt.title('Clustered Data with Centroids Overlaid')
plt.show()

[Figure: scatter plot of the clustered data with centroids overlaid]

Each color corresponds to one cluster. Looking closer, you can see several “misclassified” points; this happens where the generated blobs overlap and also depends on the initial centroid locations. Keep in mind, though, that this is an unsupervised learning algorithm, not a classifier trained on historically known labels.
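Step 5 of the algorithm description mentioned assigning new points; with the fitted model this is a one-liner. The coordinates below are arbitrary values chosen for illustration:

import numpy as np

new_points = np.array([[0.0, 4.0], [-1.5, 3.0]])
print(kmeans.predict(new_points))  # index of the nearest centroid for each point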

That’s all for this example! You now know how to perform K-Means Clustering in Python and visualize your results using Matplotlib.

Pros and Cons of K Means Clustering

Pros of K Means Clustering

Here are some advantages of using K Means Clustering:

• K Means Clustering is a simple and efficient algorithm that can handle large datasets. Each iteration runs in time roughly linear in the number of samples, which makes it faster than most hierarchical clustering algorithms and well suited to big data analysis. It works by partitioning data points into clusters based on their similarity, measured with the Euclidean distance metric.

• The algorithm can be applied to a wide range of applications such as image segmentation, recommendation systems, customer segmentation, anomaly detection, and more. For example, in image segmentation, K Means groups pixels with similar gray-scale intensities into the same cluster.

• One great advantage of K Means Clustering is that it is an unsupervised learning technique, meaning it does not need labeled data in order to learn. You’ll still need to choose a value for “k”, the number of clusters desired, but there are many ways to make an informed decision about which k value will produce good results.

• K Means Clustering is easy to implement, since most programming languages provide libraries with built-in implementations. Libraries like Scikit-Learn and SciPy offer pre-built functions that let users run K Means on any dataset without worrying too much about the technical details (a short SciPy example follows this list).

• Another significant advantage of K Means Clustering is that its cluster labels make datasets easier to visualize: coloring points by cluster in a 2D or 3D plot (or in a low-dimensional projection of higher-dimensional data) helps users grasp patterns and trends, which is useful when presenting results or sharing information with stakeholders.
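For instance, SciPy’s clustering module provides `kmeans2`; here is a minimal sketch (without explicit seeding, results may vary between runs):

from scipy.cluster.vq import kmeans2

# minit='++' requests k-means++-style initialization
centroids, labels = kmeans2(X, 3, minit='++')
print(centroids.shape)  # (3, 2): one centroid per cluster, one column per feature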

Cons of K Means Clustering

Now let’s quickly touch upon some of the drawbacks of K-Means Clustering:

• One of the major disadvantages of K-Means clustering is that it requires prior specification of the number of clusters in the data set to be analyzed. This can be a challenging task for large datasets, as it requires significant domain expertise and intuition.

• Another disadvantage of K-Means clustering is that it assumes clusters are spherical and have equal variances. This means that if clusters overlap or if their shapes are irregular, K-Means may not be the best option.

• K-Means is very sensitive to outliers, which can significantly affect cluster formation or cause them to form around an outlier. It does not perform well when there are non-linear or non-spherical boundaries between data points.

• As for initialization, Scikit-Learn’s implementation does not rely on purely random starting centroids by default; it uses the k-means++ heuristic to improve convergence speed, and re-runs the algorithm several times (the `n_init` parameter) to avoid bad local optima. Even so, problematic initializations (for instance, ones where some clusters receive very few samples) can make convergence slow, and results can differ from run to run (see the example after this list).

• Lastly, since the algorithm relies on Euclidean distance, it tends to lose effectiveness on high-dimensional data such as raw images or text documents: distances become less meaningful as dimensionality grows (the “curse of dimensionality”), so it is common to apply dimensionality reduction first.
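As a quick illustration of the initialization point above, Scikit-Learn exposes the scheme through the `init` parameter (`'k-means++'` is the default, `'random'` is the naive scheme) and the number of restarts through `n_init`. Comparing final inertias shows how much a good start can matter:

km_pp = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(X)
print(km_pp.inertia_, km_rand.inertia_)  # lower inertia means tighter clusters

On the easy blob data from our example the two will often tie; the gap tends to grow on messier, real-world data.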

Conclusion

In conclusion, K Means Clustering is a powerful tool for discovering patterns and structures within data that can inform decision-making across a wide range of applications. With the help of Python and libraries like Scikit-Learn and Matplotlib, it is easier than ever to apply K Means Clustering to real-world problems. However, it is important to carefully consider the specific nuances of each problem before selecting a number of clusters, and to assess the quality of the resulting clusters using suitable evaluation metrics. By understanding these essential components of K Means Clustering, you will be well-equipped to harness its full potential in your own work.
