Full Guide to Feature Scaling in Scikit-Learn

Introduction

Feature scaling is an essential preprocessing step in machine learning that involves transforming the numerical features of a dataset to a common scale. The goal of feature scaling is to improve the performance and accuracy of machine learning models by ensuring that each feature contributes equally to the learning process.

In many real-world datasets, features have different scales, ranges, and units of measurement. For instance, consider a dataset that contains information about houses such as the number of bedrooms (1-10), the price in dollars (50,000-5,000,000), and the area in square feet (500-10,000). If we use this dataset to train a machine learning model without feature scaling, some features are likely to dominate others in terms of their influence on the model's output. For example, the price feature may appear far more influential than the number of bedrooms or the area, simply because it has a larger range and magnitude.

To avoid such bias and ensure that all features are treated equally during training, we can apply various scaling techniques to normalize or standardize their values. In this guide, we will explore the most popular feature scaling methods in Python's Scikit-Learn library and discuss their advantages and disadvantages. We will also provide code examples to demonstrate how to implement these methods on different datasets.

What is Feature Scaling?

Machine learning algorithms usually work by identifying patterns and relationships between different features of the data. However, not all features have the same scale or range of values. This can cause problems for some machine learning algorithms, especially those that are distance-based or gradient descent-based. Feature scaling is a technique used to standardize the range of independent variables or features so that the machine learning algorithm can compare them on a common scale.

Feature scaling is a process where we transform our input data to fit within a specific scale. In other words, it is a method to rescale the input features in such a way that they have similar ranges and units. There are two main types of feature scaling techniques: normalization and standardization.

Normalization scales all feature values to fall within a range of 0 and 1. This is done by subtracting the feature's minimum value from each value and dividing the result by the difference between the maximum and minimum values:

x_normalized = (x - x_min) / (x_max - x_min)

Standardization scales each feature so that its mean becomes zero and its variance becomes one. Standardization maintains the relative relationships between different feature values but changes their absolute values:

x_standardized = (x - mean(x)) / std(x)
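
Here is a minimal NumPy sketch of both formulas, using a small, made-up one-dimensional feature purely for illustration:

import numpy as np

# A hypothetical one-dimensional feature
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): values end up in [0, 1]
x_normalized = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit variance
x_standardized = (x - x.mean()) / x.std()

print(x_normalized)    # [0.   0.25 0.5  0.75 1.  ]
print(x_standardized)  # mean is ~0, standard deviation is ~1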

Real World Use Cases for Feature Scaling

There are several use cases and reasons to conduct feature scaling with real world data. Let’s explore a few of them!

  1. Gradient Descent Optimization
    Gradient descent optimization is a widely used technique in machine learning algorithms such as linear regression, logistic regression, support vector machines (SVM), and neural networks. In gradient descent, the algorithm updates the parameters of the model iteratively by minimizing a cost function based on the error between predicted and actual values. Feature scaling can help accelerate convergence by facilitating efficient weight updates during each iteration and reducing oscillations in the gradient descent path.
  2. Distance-Based Algorithms
    Distance-based algorithms such as k-nearest neighbors (KNN) and cluster analysis rely heavily on measuring distances between data points to identify patterns or groups within datasets. Since these algorithms are sensitive to differences in scale among features, it is essential to scale the features before applying distance metrics such as Euclidean or Manhattan distance (see the pipeline sketch after this list).
  3. Principal Component Analysis (PCA)
    PCA is a dimensionality reduction technique that helps identify patterns and structure within high-dimensional datasets by projecting them onto lower-dimensional subspaces while preserving as much variance as possible. As part of the PCA process, feature scaling is necessary to ensure that each variable contributes equally to the principal components’ calculations.
  4. Regularization Techniques
    Regularization techniques such as L1 regularization (Lasso) and L2 regularization (Ridge) are commonly used in linear regression models to reduce overfitting by adding penalty terms proportional to either absolute values of coefficients (L1) or squared values of coefficients (L2). Feature scaling can help stabilize these methods by ensuring that all variables have similar scales and magnitudes.
  5. Neural Networks
    Neural networks are powerful machine learning models capable of learning complex relationships within data. However, they can be sensitive to differences in feature scale, which can lead to unstable training and suboptimal performance. Scaling the inputs keeps them in a range where the activation functions and their gradients behave well, which helps training converge reliably.
  6. Image Processing
    Image processing involves analyzing digital images to extract meaningful information or enhance their visual appearance. Feature scaling is essential in this domain as it helps normalize pixel values across different channels (RGB, grayscale) and image resolutions. This normalization improves the models’ accuracy in tasks such as image classification, segmentation, and object detection.
  7. Natural Language Processing (NLP)
    NLP deals with processing human language data such as text and speech to derive insights or perform specific tasks such as sentiment analysis, text classification, and machine translation. Feature scaling is necessary in NLP because text data can have varying lengths and distributions of word frequencies across different documents or corpora.
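
To make the distance-based case concrete (item 2 above), here is a minimal sketch that puts a scaler and a KNN classifier into a single Scikit-Learn Pipeline, so the scaling parameters are learned from the training data only. The Iris dataset is used purely as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Load a small example dataset and split it
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain the scaler and the distance-based model so that scaling
# parameters are learned from the training data only
knn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])

knn_pipeline.fit(X_train, y_train)
print(knn_pipeline.score(X_test, y_test))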

Feature Scaling in Scikit-Learn and Python

Let’s explore the different ways to perform feature scaling with Scikit-Learn!

StandardScaler

One of the most commonly used feature scaling techniques is StandardScaler. It scales the data such that the mean is 0 and the standard deviation is 1. Note that this does not make the data normally distributed; it only shifts each feature to a mean of 0 and rescales it to a standard deviation of 1, leaving the shape of the distribution unchanged.

To use StandardScaler in Scikit-Learn, we first need to import the library and create an instance of the StandardScaler class. We can then fit this scaler on our training data and transform both the training and test sets.

Here’s an example code snippet that demonstrates how to use StandardScaler in Scikit-Learn:


from sklearn.preprocessing import StandardScaler

# Create an instance of the scaler
scaler = StandardScaler()

# Fit on training data
scaler.fit(X_train)

# Transform both training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In this example, we first imported the StandardScaler class from Scikit-Learn’s preprocessing module. We then created an instance of the scaler by calling `StandardScaler()`.

Next, we fit the scaler on our training data by calling `fit()` on our training set `X_train`. This step calculates the mean and standard deviation of each feature in our training set.

Finally, we transformed both our training and test sets using `transform()`. This step applies the scaling transformation to each feature based on the previously calculated mean and standard deviation.
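
As a quick sanity check (assuming `X_train_scaled` from the snippet above), you can verify that every scaled training column now has a mean close to 0 and a standard deviation close to 1:

import numpy as np

# Each column of the scaled training data should have mean ~0 and std ~1
print(np.round(X_train_scaled.mean(axis=0), 6))
print(np.round(X_train_scaled.std(axis=0), 6))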

It’s important to note that when using feature scaling techniques like StandardScaler, we should fit our scaler on our training data only. This ensures that there is no information leakage from our test set into our model during training.

In summary, StandardScaler is a powerful tool for scaling features in machine learning models. With just a few lines of code, we can easily apply this technique to our datasets and improve the performance of our models.

MinMaxScaler

In Scikit-Learn, the MinMaxScaler class scales the data to a fixed range, 0 to 1 by default. This is done by subtracting the minimum value of the feature and then dividing by the range of the feature. The formula for MinMaxScaler is given as:

X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

where X is the feature matrix.

Let’s take an example to understand how to use MinMaxScaler in Scikit-Learn.


from sklearn.preprocessing import MinMaxScaler
import numpy as np

# create a sample feature matrix
X = np.array([[10, 20, 30],
              [15, 25, 35],
              [25, 30, 40]])

# create a MinMaxScaler object
scaler = MinMaxScaler()

# fit and transform the feature matrix
X_scaled = scaler.fit_transform(X)

print(X_scaled)

Output:

[[0.         0.         0.        ]
 [0.33333333 0.5        0.5       ]
 [1.         1.         1.        ]]

In this example, we first import the MinMaxScaler class from Scikit-Learn and create a sample feature matrix `X`. We then create a MinMaxScaler object `scaler` and fit and transform `X` using the `fit_transform()` method.

The output shows that each value in `X` has been scaled between 0 and 1 based on its minimum and maximum values across all samples in each feature column.
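
If you need a range other than the default 0 to 1, MinMaxScaler accepts a `feature_range` parameter. For example, to scale the same matrix to the range -1 to 1:

from sklearn.preprocessing import MinMaxScaler

# Scale to [-1, 1] instead of the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
X_scaled = scaler.fit_transform(X)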

MinMaxScaler can be useful when we have features with varying scales and we want to bring them all to a common scale for comparison or modeling purposes. However, it may not be suitable for features with outliers as it can compress the range of the majority of the data. In such cases, we can consider using other scaling techniques like RobustScaler or StandardScaler.

RobustScaler

Another useful feature scaling technique in Scikit-Learn is RobustScaler. Like StandardScaler and MinMaxScaler, RobustScaler is used to scale numerical features before feeding them into a machine learning algorithm. However, unlike StandardScaler and MinMaxScaler, RobustScaler uses the median and quartiles instead of the mean and standard deviation.

RobustScaler scales the feature values based on the interquartile range (IQR) which is defined as the difference between the third quartile (75th percentile) and the first quartile (25th percentile). It then applies the following formula to each feature:

(x - Q2) / (Q3 - Q1)

where x is the feature value, Q2 is the median, Q3 is the third quartile, and Q1 is the first quartile.

The advantage of using RobustScaler over StandardScaler and MinMaxScaler is that it is more robust to outliers. Since it uses the median and quartiles instead of the mean and standard deviation, it can handle features containing outliers without being strongly influenced by them.

Here’s an example of how to use RobustScaler in Python:


from sklearn.preprocessing import RobustScaler
from sklearn.datasets import load_diabetes

# Load the Diabetes dataset
# (the Boston Housing dataset used in older tutorials was removed in scikit-learn 1.2)
diabetes = load_diabetes()

# Initialize RobustScaler
scaler = RobustScaler()

# Fit and transform data
X_scaled = scaler.fit_transform(diabetes.data)

print(X_scaled)

In this example, we first load the Diabetes dataset using the `load_diabetes()` function. We then initialize a RobustScaler object and fit-transform the data using the `fit_transform()` method. Finally, we print the transformed data.

Overall, if your dataset contains many outliers or if you want to be more conservative in your feature scaling approach, then RobustScaler can be a good choice for you.
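
To see the effect on an outlier-heavy feature, here is a small illustrative sketch (made-up data) comparing RobustScaler with StandardScaler on a column containing one extreme value:

import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# One feature with a single extreme outlier (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# The outlier inflates the standard deviation, squashing the typical values together
print(StandardScaler().fit_transform(X).ravel())

# The median and IQR are barely affected, so the typical values keep their spread
print(RobustScaler().fit_transform(X).ravel())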

MaxAbsScaler

MaxAbsScaler is a scaling technique that scales the data in such a way that the absolute maximum value of each feature is 1.0. This scaler is particularly useful for sparse data, where other scaling techniques may not work as expected.

The formula used by MaxAbsScaler to scale the data is given by:

X_scaled = X / max(abs(X), axis=0)

where X is the original feature matrix, X_scaled is the scaled feature matrix, and the maximum absolute value is computed separately for each feature (column).

Let’s see an example of how to use MaxAbsScaler in Scikit-Learn:


from sklearn.preprocessing import MaxAbsScaler
import numpy as np

# Create sample data
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Create MaxAbsScaler object
scaler = MaxAbsScaler()

# Fit and transform the data
X_scaled = scaler.fit_transform(X)

print(X_scaled)

Output:

[[0.14285714 0.25       0.33333333]
 [0.57142857 0.625      0.66666667]
 [1.         1.         1.        ]]

As we can see from the output, each feature has been scaled such that its absolute maximum value is equal to 1.0.

MaxAbsScaler can be particularly useful when dealing with sparse matrices, where other scaling techniques may not work as expected due to the presence of many zero values. In such cases, using MaxAbsScaler can help preserve the sparsity of the data while still ensuring that the features are properly scaled.
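
As a short sketch of the sparse case, MaxAbsScaler accepts SciPy sparse matrices directly and, because it does not shift the data, the zero entries stay zero:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# A sparse matrix with many zero entries
X_sparse = csr_matrix([[0, 2, 0],
                       [4, 0, 6],
                       [0, 8, 0]])

# Fit and transform without densifying the matrix; zeros are preserved
X_sparse_scaled = MaxAbsScaler().fit_transform(X_sparse)
print(X_sparse_scaled.toarray())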

In summary, MaxAbsScaler is a simple yet effective scaling technique that can be used for both dense and sparse data. It scales each feature in such a way that its absolute maximum value is equal to 1.0, making it particularly useful for ensuring that all features are on the same scale.

Difference between .fit() and .fit_transform()

When it comes to feature scaling in Python and Scikit-Learn, there are two important methods that you need to know about: `.fit()` and `.fit_transform()`.

In simple terms, the `.fit()` method is used to calculate the parameters of the scaler based on the data, while the `.fit_transform()` method is used to actually transform the data using those parameters.

Let’s take a closer look at each of these methods.

The `.fit()` Method

The `.fit()` method is used to calculate the parameters of the scaler based on the data. These parameters include things like the mean and standard deviation of each feature in the dataset.

Here’s an example:


from sklearn.preprocessing import StandardScaler

# create a StandardScaler object
scaler = StandardScaler()

# fit the scaler to the data
scaler.fit(X_train)

In this example, we create a `StandardScaler` object and then use the `.fit()` method to fit it to our training data `X_train`. This calculates the mean and standard deviation of each feature in `X_train`.

The `.fit_transform()` Method

The `.fit_transform()` method is used to actually transform the data using the parameters calculated by the `.fit()` method. This means that it both fits the scaler to the data and transforms it in one step.

Here’s an example:


from sklearn.preprocessing import StandardScaler

# create a StandardScaler object
scaler = StandardScaler()

# fit and transform the scaler on X_train
X_train_scaled = scaler.fit_transform(X_train)

In this example, we create a `StandardScaler` object and then use the `.fit_transform()` method to fit it to our training data `X_train` and transform it at once. This calculates the mean and standard deviation of each feature in `X_train` and applies them to transform `X_train`.

It's important to note that you should call `.fit_transform()` on your training data only, and then use `.transform()` (without refitting) on your test data. This ensures that your test data is transformed in exactly the same way as your training data, using parameters learned from the training set alone.
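
For example, continuing with the `X_train` and `X_test` arrays assumed earlier in this guide:

from sklearn.preprocessing import StandardScaler

# Fit on the training data only, then reuse the fitted scaler on the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # no refitting on the test set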

In summary, the `.fit()` method is used to calculate the parameters of the scaler based on the data, while the `.fit_transform()` method is used to both fit the scaler to the data and transform it in one step.

Visualizing Feature Scaling

When it comes to feature scaling, it’s essential to understand the impact of scaling techniques on the data distribution. Matplotlib is a popular visualization library in Python that allows us to plot various types of graphs and charts.

Let’s consider an example dataset that contains two features: age and income. The age feature ranges from 18 to 60, while the income feature ranges from 20,000 to 200,000.


import matplotlib.pyplot as plt
import numpy as np

# Create sample data
np.random.seed(42)
age = np.random.randint(low=18, high=61, size=(50,))            # high is exclusive, so ages span 18-60
income = np.random.randint(low=20000, high=200001, size=(50,))  # incomes span 20,000-200,000

# Plot the data
plt.scatter(age, income)
plt.xlabel("Age")
plt.ylabel("Income")
plt.title("Example Dataset")
plt.show()

This code generates a scatter plot of our example dataset.

As the plot shows, the age and income features are not on the same scale: age spans roughly 18 to 60, while income spans 20,000 to 200,000. This difference in scale can cause problems for some machine learning algorithms.

We can use Matplotlib to visualize how different scaling techniques affect the distribution of our data. Let’s consider two scaling techniques: Min-Max scaling and Standardization.


from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Create scaler objects
minmax_scaler = MinMaxScaler()
standard_scaler = StandardScaler()

# Scale the data using different techniques
scaled_data_minmax = minmax_scaler.fit_transform(np.array([age,income]).T)
scaled_data_standard = standard_scaler.fit_transform(np.array([age,income]).T)

# Plot the scaled data
fig, axs = plt.subplots(2, 2, figsize=(10, 8))
axs[0, 0].scatter(age, income)
axs[0, 0].set_title("Original Data")
axs[0, 1].scatter(scaled_data_minmax[:,0], scaled_data_minmax[:,1])
axs[0, 1].set_title("Min-Max Scaling")
axs[1, 0].scatter(scaled_data_standard[:,0], scaled_data_standard[:,1])
axs[1, 0].set_title("Standardization")
axs[1, 1].set_visible(False)  # hide the unused fourth panel
plt.tight_layout()
plt.show()

This code will generate a plot that compares the original data with the data scaled using Min-Max scaling and Standardization techniques.

As we can see from the plot, both techniques change the scale of the data without changing its shape: Min-Max scaling maps the values into the range 0 to 1, while Standardization centers the data around a mean of zero with unit variance.

Visualizing feature scaling can help us understand how different techniques affect our data distribution. It’s important to choose a technique that best suits our machine learning algorithm and dataset.

Best Practices for Feature Scaling

When it comes to feature scaling, there are some best practices that you should keep in mind to ensure that your data is properly scaled and that your machine learning models are accurate.

1. Scale all Features: It is important to scale all of the features in your dataset, not just a subset of them. This is because many machine learning algorithms assume that all features are on the same scale, and failure to do so can result in poor model performance.

2. Don't Scale the Target Variable: In most cases you should not scale the variable you are trying to predict; classification targets are categorical, and regression targets usually don't need it. If you do scale a regression target, remember to inverse-transform the predictions back to the original units.

3. Choose Appropriate Scaling Method: There are several methods for feature scaling such as StandardScaler, MinMaxScaler, RobustScaler, and MaxAbsScaler. You should choose the appropriate method based on the distribution of your data and the requirements of your machine learning algorithm.

4. Be Careful with Outliers: Outliers can significantly affect the scaling process and may cause issues with some scaling methods. It’s important to handle outliers before scaling or use robust scaling methods that can handle them.

5. Consider Normalization for Distance-Based Algorithms: If you're using distance-based algorithms like k-Nearest Neighbors (k-NN) or Support Vector Machines (SVM), min-max normalization is often a good choice because it bounds every feature to the same range, so no single feature dominates the distance calculation.

Conclusion

In conclusion, feature scaling is a crucial step in preparing data for machine learning models. It helps to ensure that all features are on the same scale and have equal importance in the model’s decision-making process.

In this article, we covered several methods of feature scaling, including standardization, min-max scaling, robust scaling, and max-abs scaling. We also discussed when to use each method and their pros and cons.

When working with Scikit-Learn, it is easy to apply feature scaling to your data using the `StandardScaler`, `MinMaxScaler`, `RobustScaler`, or `MaxAbsScaler` classes. These classes can be incorporated into your machine learning pipeline alongside other preprocessing steps such as one-hot encoding or dimensionality reduction.
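
As a sketch of what that can look like (using hypothetical column names for a mixed numeric/categorical DataFrame), a scaler can be combined with one-hot encoding inside a ColumnTransformer and a Pipeline:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

# Hypothetical column names; adjust them to your own DataFrame
numeric_features = ["age", "income"]
categorical_features = ["city"]

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# model.fit(X_train, y_train) would now scale and encode inside the pipeline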

Remember that feature scaling is not always necessary, but it can significantly improve the performance of your models in certain cases. Always consider the nature of your data and the requirements of your model before deciding whether to apply feature scaling or not.

We hope that this guide has provided you with a comprehensive understanding of feature scaling in Python and Scikit-Learn. Happy coding!
Interested in learning more? Check out our Introduction to Python course!

