Machine Learning with Python: K Nearest Neighbors

Introduction

K Nearest Neighbors (KNN) is a popular supervised machine learning algorithm that has been widely used in a variety of fields, including marketing, healthcare, and image recognition. It is a simple yet powerful algorithm that belongs to the category of instance-based learning or lazy learning. The KNN algorithm can be applied both for classification and regression problems with ease. It works by finding the K nearest data points to the test point in feature space and then classifying or predicting the value based on their labels or values.

In this blog post, we will explore the basics of KNN algorithm along with its implementation using Python’s Scikit-Learn library, which provides an efficient implementation of KNN that can handle large datasets easily. We’ll also take a look at some practical applications of KNN along with its strengths and weaknesses. So, let’s dive right into it!


Use Cases for K Nearest Neighbors

Let’s discuss some potential use cases for K-Nearest Neighbors:

• Recommender Systems: One of the most common use cases for KNN is building a recommender system, in which we recommend items to users based on their similarity to other users. In this scenario, the algorithm identifies the “k” nearest neighbors (other users) who have rated or consumed similar items to the target user and then recommends new items based on those neighbors’ preferences (see the sketch after this list).

• Image Recognition: The K-Nearest Neighbors algorithm can be used in image processing tasks such as facial recognition, where it compares images of faces to ascertain identities. Here, each pixel in an image serves as a feature, and distances are computed between images so that the most similar known faces determine the label. This technique helps identify patterns that distinguish one person from another.

• Anomaly Detection: Another interesting way to use the KNN algorithm is for anomaly detection. Here, instances that deviate significantly from the norm can be located by their neighbor distances: a point whose distance to its k nearest neighbors is unusually large, relative to the typical distances in the dataset, is a likely anomaly.

• Text Classification: KNN is also useful for text classification applications such as spam filtering or sentiment analysis, where documents are represented by vectors representing word frequencies. These vectors are then compared using distance metrics between them, allowing us to classify new documents into labeled categories accurately.

• Healthcare Diagnosis: Medical practitioners use machine learning algorithms to diagnose diseases and health conditions based on patient data collected over time. The K-Nearest Neighbors model analyzes various parameters like pulse rate and blood pressure levels, among others, before recommending diagnoses.

• Credit Risk Analysis: Financial institutions use machine learning algorithms such as k-nearest neighbors (KNN) when analyzing credit risk factors like payment history or previous defaults. By comparing a new applicant to the most similar past borrowers, banks and finance companies can improve customer service while minimizing risk exposure.
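
To make the recommender idea from the first bullet concrete, here is a minimal sketch using Scikit-Learn’s `NearestNeighbors`. The ratings matrix is made-up toy data, and the scoring rule (average neighbor rating for unrated items) is just one simple choice among many:

# A minimal user-based recommendation sketch with NearestNeighbors.
# The ratings matrix is toy data: rows are users, columns are items, 0 = not rated.
import numpy as np
from sklearn.neighbors import NearestNeighbors

ratings = np.array([
    [5, 4, 0, 0, 1],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 0],
    [1, 0, 4, 5, 3],
])

# Find the users most similar to user 0 (the closest "neighbor" is user 0 itself).
nn = NearestNeighbors(n_neighbors=3, metric='cosine')
nn.fit(ratings)
distances, indices = nn.kneighbors(ratings[0].reshape(1, -1))
neighbors = indices[0][1:]  # drop user 0 itself

# Score items user 0 has not rated by the neighbors' average rating.
candidate_items = np.where(ratings[0] == 0)[0]
scores = ratings[neighbors][:, candidate_items].mean(axis=0)
print("Recommend item", candidate_items[np.argmax(scores)])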

How K Nearest Neighbors Works

K Nearest Neighbors (KNN) is a classification algorithm that works by identifying the k number of nearest neighbors to a given point in the feature space. The class label of the new data point is then determined by analyzing the class labels of these nearest neighbors.

Here’s how KNN works step-by-step:

1. Load the data: First, you need to load and preprocess your data. This includes cleaning the data, removing missing values or outliers, and splitting your dataset into training and testing sets.

2. Choose K: You need to choose a value for k, which represents the number of nearest neighbors you want to consider when making predictions.

3. Define distance metric: A distance metric measures how far apart two points are in feature space. Common choices are Euclidean distance, Manhattan distance, and the more general Minkowski distance.

4. Calculate distances: Using the selected distance metric, calculate the distances between each point in your training set and your new data point.

5. Identify K nearest neighbors: Select the k closest points based on the calculated distances.

6. Assign classes: Collect the class labels of these k closest neighbors to the test point.

7. Make predictions: Treat the neighbors’ labels as votes for each category (or convert them to probabilities) and return the class with the highest vote count or probability.

Fortunately, KNN is very easy to understand and implement for both classification and regression tasks, although it does take some effort to choose good parameters such as the value of k and the distance metric.
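
To make steps 4–7 concrete, here is a minimal from-scratch sketch of KNN classification using Euclidean distance and a simple majority vote; the toy training arrays are made up for illustration:

# Minimal from-scratch KNN classifier: Euclidean distance + majority vote.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 4: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 5: indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Steps 6 and 7: majority vote over the neighbors' class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two 2-D clusters labeled 0 and 1.
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2])))  # -> 0
print(knn_predict(X_train, y_train, np.array([9, 9])))  # -> 1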

K Nearest Neighbors Python Example

Let’s work through an example code for K Nearest Neighbors (KNN) algorithm in Python using the Scikit-Learn and Matplotlib libraries:

# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load iris dataset
iris = load_iris()

# Split features and target variable
X = iris.data[:, :2] # only first two features for visualization
y = iris.target

# Create a KNN classifier with k=3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)

# Train the model on the data
knn.fit(X, y)

# Visualize decision boundaries of trained classifier
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01), np.arange(y_min, y_max, 0.01))
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(figsize=(10,7))
plt.contourf(xx, yy, Z)

# Plot the training points onto the figure with different colors for each class label.
scatter_x = X[:,0]
scatter_y = X[:,1]
group = y  # keep labels as a NumPy array so the element-wise comparison below works
cdict = {0: 'purple', 1: 'lightgreen', 2: 'orange'}
for g in np.unique(group):
    ix = np.where(group == g)
    plt.scatter(scatter_x[ix], scatter_y[ix], c=cdict[g], label=g, s=100)

plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.title('KNN Decision Boundaries')
plt.legend(title='Classes')
plt.show()

Running this code produces the KNN decision boundaries plot.

Let’s step through this code to understand how it works:

– First we import the required libraries: `load_iris` from Scikit-Learn’s datasets module, which loads the built-in Iris dataset we will train on; `KNeighborsClassifier` to build our classifier model; `matplotlib.pyplot` for visualizing decision boundaries; and `numpy` for the grid calculations.

– We then load up the Iris dataset using Scikit-Learn’s `load_iris()` function.

– After that, we take the first two feature variables (in our case Sepal Length and Sepal Width) as our feature array (`X`) and assign the class labels to a separate array (`y`).

– Then we create a new instance of `KNeighborsClassifier`, setting the `n_neighbors` parameter to three because we want our model to use the three closest neighbours when predicting classes.

– Now that our classifier is defined, we train it on labeled examples. This is done via the `.fit()` method, passing the feature data (`X`) along with the corresponding labels (`y`).

– To visualize the results after training, the remaining code builds a mesh grid covering the prediction surface, colours each grid point by its predicted class using `contourf`, and overlays scatter plots of the three classes on top of the contours.

We can run this sample code block, which should show a graph titled “KNN Decision Boundaries” illustrating how K-Nearest Neighbors splits the input space into classification regions based on the training data provided.
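
Note that the example above trains and visualizes on the full dataset. In practice you would hold out a test set to estimate accuracy; here is one minimal sketch, reusing `X`, `y`, and `KNeighborsClassifier` from above (the split ratio and random seed are arbitrary choices):

# Evaluate on held-out data (sketch; reuses X, y and KNeighborsClassifier from above).
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, knn.predict(X_test)))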

Pros and Cons of K Nearest Neighbors

Let’s discuss the pros and cons of using KNN.

Pros of KNN

• One of the main advantages of K Nearest Neighbors (KNN) is that it can be used for both classification and regression problems. For classification tasks, it identifies which class a particular data point belongs to based on its proximity to other classified data points. Meanwhile, in regression tasks, KNN estimates values for continuous variables by averaging the values of its nearest neighbors (a regression sketch follows this list).

• Another advantage of KNN is its simplicity and interpretability. The algorithm is easy to understand and implement, as well as being highly intuitive. Furthermore, it allows users to inspect how each data point was classified or predicted by looking at its nearest neighbors.

• KNN requires no prior knowledge about the distribution or parameters of the underlying data set since it relies entirely on similarity between instances in order to make predictions. This makes it a powerful tool when dealing with complex data structures where feature distributions are not necessarily known beforehand.

• The non-parametric nature of KNN means that no assumptions or constraints are placed on the shape of the decision boundaries between different classes, making this approach very flexible and suitable for many different types of real-world datasets. Additionally, because new training examples can easily be incorporated into existing models without any retraining or model adjustments (hence the low training time), KNN can adapt easily to changes in input patterns.

• Lastly, unlike some machine learning algorithms that need large amounts of data to fit many parameters, KNN can produce reasonable results on small datasets, since it predicts directly from stored examples rather than estimating a complex model. This makes the method useful in problem domains where relevant data is scarce, e.g. some medical applications.
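
As a quick illustration of the regression case from the first point above, here is a minimal `KNeighborsRegressor` sketch on synthetic data (the sine curve and noise level are arbitrary choices):

# Minimal KNN regression sketch on synthetic 1-D data.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 40)

# The prediction at a new point is the average of its 5 nearest neighbors' targets.
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # roughly sin(2.5) ≈ 0.6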

Cons of K Nearest Neighbors

• The first disadvantage of K Nearest Neighbors algorithm is that it requires the entire dataset to be loaded into memory. As a result, this limits its scalability and makes it unsuitable for large datasets with high dimensions. Since all data points of the dataset are compared to classify each new instance, the algorithm can become very slow when dealing with larger datasets.

• Another disadvantage of the KNN model is that it builds no compact model representation, which means every prediction must be computed from scratch against the training data. This results in high computational cost at prediction time, which grows with the size of the training set (and, to a lesser extent, with the number of neighbors considered).

• KNN is highly sensitive to irrelevant or redundant input features, which can lead to poor performance if the data is not properly preprocessed beforehand. Fitting such models requires careful selection and scaling of relevant features, as noisy or irrelevant variables may dilute the impact of important predictors.

• A crucial challenge in implementing a KNN model is determining the optimal value of k (the number of nearest neighbors). In practice this is tuned empirically, and cross-validation is the standard technique, although it adds extra computational cost and makes building the model slightly more complex (a tuning sketch follows this list).

• Lastly, class imbalance can cause difficulties for KNN when dealing with rare events: when most training instances belong to one dominant class, the neighbors of a new point tend to come from that class, so minority-class instances rarely win the vote even where they are locally relevant. To avoid such issues, a resampling strategy or distance-weighted voting should be considered before fitting the model.
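
Picking up the two points above about feature preprocessing and choosing k, here is a minimal sketch that standardizes the features and cross-validates over a range of k values; the grid of 1–20 neighbors is an arbitrary choice:

# Sketch: choosing k by cross-validation, with feature scaling in a pipeline.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ('scale', StandardScaler()),   # KNN distances are sensitive to feature scale
    ('knn', KNeighborsClassifier()),
])
grid = GridSearchCV(pipe, {'knn__n_neighbors': range(1, 21)}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))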

Conclusion

In conclusion, K Nearest Neighbors is a simple yet powerful algorithm that can be used for both classification and regression tasks. It works by finding the closest K data points to a new data point and uses their labels or values to classify or predict the value of the new point. Despite its simplicity, KNN can achieve high accuracy and is widely used in real-world applications such as image recognition, medical diagnosis, and recommendation systems. However, choosing the right value of K and selecting meaningful features are crucial for the performance of KNN. Additionally, preprocessing the data and handling imbalanced datasets can also improve its accuracy. Finally, as with any machine learning algorithm, it’s important to avoid overfitting by using cross-validation and testing on unseen data.

If you’re interested in learning more about KNN and Python for Machine Learning, be sure to check out our Python for Machine Learning course!
