Machine Learning in Python: Principal Component Analysis (PCA)

Introduction

PCA, or Principal Component Analysis, is a widely used technique in machine learning and data analysis. It is a statistical method that helps identify patterns in data by reducing its dimensionality. In other words, it simplifies complex data by transforming it into a lower-dimensional space while retaining most of the information.

PCA is particularly useful when dealing with high-dimensional datasets where the number of features is large. It can be used for various applications such as image processing, speech recognition, and data compression.

The main idea behind PCA is to find a new set of variables, called principal components, which are linear combinations of the original features. These principal components are chosen in such a way that they explain the maximum amount of variance in the data. The first principal component captures the most significant variation in the data, followed by the second one, and so on.

Let’s first understand what Principal Component Analysis is and how it works!

What is Principal Component Analysis (PCA)?

Principal Component Analysis (PCA) is an unsupervised learning technique that reduces the dimensionality of data while preserving as much of the original variance as possible. In simpler terms, PCA is a mathematical method that helps identify patterns in data by reducing the number of variables without losing too much information.

PCA works by creating new variables, called principal components, which are linear combinations of the original variables. These principal components are ordered so that the first component captures the most variation in the data, followed by the second component, and so on. The number of principal components is at most equal to the number of original variables.

The goal of PCA is to find a low-dimensional representation of the data that explains most of its variability. This can be useful for visualizing high-dimensional data or for reducing noise in the data. For example, if you have a dataset with many features, some of which may be correlated with each other, PCA can help you identify which features are most important for explaining the variability in the dataset.

To perform PCA in Python, we can use the scikit-learn library. The first step is to import the necessary modules:


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

Next, we can load a sample dataset from scikit-learn:


iris = load_iris()
X = iris.data
y = iris.target

Here, `X` contains the features (or independent variables) and `y` contains the target variable (or dependent variable). We can then perform PCA on `X` using `PCA()` from scikit-learn:


pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

Here, we have specified that we want to reduce `X` to two dimensions (`n_components=2`). The `fit_transform()` method fits the PCA model to the data and then transforms it to the reduced dimensionality.
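
As a quick check (not shown in the original snippet), we can also inspect the fitted model’s `explained_variance_ratio_` attribute to see how much of the original variance the two components retain; for the Iris data, the two components together capture close to 98% of it.


# Proportion of the total variance captured by each retained component
print(pca.explained_variance_ratio_)

# Total variance retained by the two components
print(pca.explained_variance_ratio_.sum())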

We can then plot the reduced data using matplotlib:


plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()

This will create a scatter plot of the reduced data, where each point is colored according to its target class. The first principal component is shown on the x-axis and the second principal component on the y-axis. In the resulting plot, the setosa class separates cleanly from the other two species, which overlap slightly along the first principal component.

Principal Component Analysis is a useful technique for reducing the dimensionality of high-dimensional data while preserving as much of its original variability as possible. It can be implemented in Python using scikit-learn and can be helpful for visualizing and analyzing complex datasets.

Why use PCA?

PCA, or Principal Component Analysis, is widely used in machine learning for feature extraction and dimensionality reduction. The idea behind PCA is to transform a high-dimensional dataset into a lower-dimensional space while retaining as much of the original variance as possible. This is especially useful for datasets with a large number of features, because it can reduce the computational cost of training machine learning models on such data.

One of the main reasons to use PCA is to remove correlated features from the dataset. Correlated features are those that are highly dependent on each other, meaning that they contain similar information. By removing these redundant features, we can simplify the dataset and improve the performance of our machine learning models.
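
To illustrate this, here is a minimal sketch on synthetic data (the variable names are just for illustration): two strongly correlated features go in, and the principal components that come out are uncorrelated, which shows up as a near-diagonal covariance matrix.


import numpy as np
from sklearn.decomposition import PCA

# Two correlated features: the second is a noisy copy of the first
rng = np.random.default_rng(42)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)
X = np.column_stack([x1, x2])

# Transform into the principal-component space
X_pca = PCA(n_components=2).fit_transform(X)

print(np.cov(X.T))      # large off-diagonal entries: features are correlated
print(np.cov(X_pca.T))  # off-diagonal entries are ~0: components are uncorrelated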

Another reason to use PCA is to visualize high-dimensional data in a lower-dimensional space. For example, if we have a dataset with 100 features, it can be difficult to visualize patterns and relationships between data points in this high-dimensional space. However, by using PCA to reduce the dataset to just 2 or 3 dimensions, we can easily plot and explore the data.

Overall, PCA is a powerful tool for reducing the dimensionality of high-dimensional datasets while retaining important information about the underlying structure of the data. In the next section, we’ll dive into how to implement PCA using scikit-learn in Python.


# Example code for implementing PCA using scikit-learn

from sklearn.decomposition import PCA
import numpy as np

# Create a random dataset with 1000 samples and 50 features
X = np.random.rand(1000, 50)

# Create a PCA object with 2 components
pca = PCA(n_components=2)

# Fit the PCA model to our dataset
pca.fit(X)

# Transform our dataset into the new lower-dimensional space
X_pca = pca.transform(X)

How does PCA work?

Principal Component Analysis (PCA) is a popular technique used for dimensionality reduction in machine learning. Dimensionality reduction is the process of reducing the number of features or variables in a dataset while retaining as much information as possible. PCA transforms high-dimensional data into a lower-dimensional space by identifying the most important features or principal components.

PCA works by computing the covariance matrix of the data and then finding the eigenvectors and eigenvalues of this matrix. The eigenvectors represent the directions in which the data varies the most, while the eigenvalues represent the amount of variance explained by each eigenvector.

The first principal component is the direction with the highest variance, and each subsequent principal component is orthogonal to the previous ones and captures as much of the remaining variance as possible. By projecting the data onto these principal components, we can reduce its dimensionality while retaining most of its variability.
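
As a compact sketch of that idea (a full step-by-step walkthrough follows later in this post), the covariance-and-eigenvector view can be written directly in NumPy:


import numpy as np
from sklearn.datasets import load_iris

X = load_iris().data
X_centered = X - X.mean(axis=0)           # center each feature

cov = np.cov(X_centered.T)                # covariance matrix of the features
eig_vals, eig_vecs = np.linalg.eigh(cov)  # eigh is suited to symmetric matrices

# Keep the two eigenvectors with the largest eigenvalues and project onto them
order = np.argsort(eig_vals)[::-1]
W = eig_vecs[:, order[:2]]
X_projected = X_centered @ W

print(X_projected.shape)                  # (150, 2)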

Let’s see an example of how to perform PCA using the Scikit-Learn library in Python:


from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load iris dataset
iris = load_iris()
X = iris.data

# Perform PCA with two components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original shape: ", X.shape)
print("Transformed shape: ", X_pca.shape)

In this example, we loaded the Iris dataset and performed PCA with two components using Scikit-Learn’s `PCA` class. We then transformed the original data into a lower-dimensional space using the `fit_transform()` method.

Finally, we printed the shapes of the original and transformed data to see how the dimensionality changed. As we can see, our original dataset had four features, but after performing PCA it has been reduced to just two principal components.
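
To make the "linear combination of the original features" idea concrete (an optional extra, not part of the original snippet), each row of `pca.components_` holds the weights applied to the four Iris features to form one principal component:


# One row of feature weights (loadings) per principal component
print(pca.components_.shape)  # (2, 4)
print(pca.components_)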

PCA is a powerful technique that can help us reduce the dimensionality of our data while retaining most of its variability. It has many applications in machine learning, including image recognition, natural language processing, and data compression.

Step-by-step PCA with Python and Scikit-Learn

In this section, we will go through a step-by-step implementation of PCA using Python and Scikit-Learn. The steps involved are:

  • Step 1: Import Libraries and Load Data
  • Step 2: Standardize the Data
  • Step 3: Compute Covariance Matrix
  • Step 4: Compute Eigenvectors and Eigenvalues
  • Step 5: Sort Eigenvalues in Descending Order
  • Step 6: Choose Principal Components
  • Step 7: Project Data Onto Lower-Dimensional Linear Subspace

Let’s dive into each step in detail:

Step 1: Import Libraries and Load Data

The first step is to import the necessary libraries and load the data that you want to perform PCA on. For this example, we will use the Iris dataset, which is a commonly used dataset in machine learning.


import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target

Here, we import NumPy for numerical computing and `load_iris` from Scikit-Learn’s datasets module to get the Iris dataset. We then assign the feature values to `X` and the target values to `y`.

Step 2: Standardize the Data

PCA is sensitive to the scale of the input data, so it’s important to standardize the data before performing PCA. Standardization involves rescaling the data so that each feature has a mean of 0 and a standard deviation of 1.


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Here, we import `StandardScaler` from Scikit-Learn’s preprocessing module and create an instance of it. We then fit and transform the feature values in `X` to get `X_scaled`.

Step 3: Compute Covariance Matrix

The next step is to compute the covariance matrix of the standardized data. The covariance matrix represents the relationships between the different features in the data.


cov_matrix = np.cov(X_scaled.T)

Here, we use NumPy’s `cov` function to compute the covariance matrix of `X_scaled`. We transpose the array first because `np.cov` expects each row to represent a variable (feature) rather than an observation.

Step 4: Compute Eigenvectors and Eigenvalues

The eigenvectors and eigenvalues of the covariance matrix are used to determine the principal components of the data. The eigenvectors represent the directions of maximum variance in the data, while the corresponding eigenvalues represent the amount of variance explained by each eigenvector.


eigen_values, eigen_vectors = np.linalg.eig(cov_matrix)

Here, we use NumPy’s `linalg.eig` function to compute both the eigenvalues and eigenvectors of `cov_matrix`.

Step 5: Sort Eigenvalues in Descending Order

The next step is to sort the eigenvalues in descending order. This will allow us to choose the principal components that explain the most variance in the data.


sorted_index = np.argsort(eigen_values)[::-1]
sorted_eigenvalue = eigen_values[sorted_index]
sorted_eigenvectors = eigen_vectors[:,sorted_index]

Here, we use NumPy’s `argsort` function to get the indices that would sort the eigenvalues in ascending order. We then reverse that order with `[::-1]` and use it to sort both the eigenvalues and the columns of the eigenvector matrix.

Step 6: Choose Principal Components

The next step is to choose the principal components that we want to keep. We can do this by selecting the top k eigenvectors that correspond to the k largest eigenvalues.


k = 2
principal_components = sorted_eigenvectors[:,:k]

Here, we choose `k=2` as an example and select the first two columns of `sorted_eigenvectors` to be our principal components.

Step 7: Project Data Onto Lower-Dimensional Linear Subspace

The final step is to project the data onto the lower-dimensional linear subspace defined by the principal components. This will transform the data from its original high-dimensional space into a lower-dimensional space while retaining as much of the original information as possible.


X_new = np.dot(X_scaled, principal_components)

Here, we use NumPy’s `dot` function to compute the dot product of `X_scaled` and `principal_components`, giving `X_new`, our transformed data.
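
As an optional sanity check (not part of the original walkthrough), we can compare this manual result against scikit-learn’s `PCA`. The two should match up to the sign of each column, because the sign of an eigenvector is arbitrary:


from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_sklearn = pca.fit_transform(X_scaled)

# Compare absolute values to ignore possible sign flips per component
print(np.allclose(np.abs(X_new), np.abs(X_sklearn)))  # expected: True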

That’s it! We have successfully performed PCA on our dataset using Python and Scikit-Learn.

Evaluating the Results of PCA

After performing Principal Component Analysis (PCA), it is important to evaluate the results to ensure that it has achieved its purpose. There are several ways to evaluate the results of PCA, including variance explained, cumulative variance explained, and scree plot.

Variance Explained: This metric tells us the proportion of variance in the original data that is explained by each principal component. It is calculated by dividing the eigenvalue of each principal component by the sum of all eigenvalues. The higher the proportion, the more important the principal component is in explaining the data.

Cumulative Variance Explained: This metric tells us the proportion of total variance in the original data that is explained by a certain number of principal components. It is calculated by adding up the variances explained by each principal component up to a certain number and dividing by the total variance. This metric can help us decide how many principal components to keep for our analysis.

Scree Plot: This is a graphical representation of the eigenvalues plotted against their component number. The curve starts high and gradually flattens out, and the point where it flattens (the "elbow") suggests how many principal components to retain. A common rule of thumb, the Kaiser criterion, is to keep all components with eigenvalues greater than 1.

Let’s look at the example code snippet below, which demonstrates how to calculate the variance explained and the cumulative variance explained, and how to plot a scree plot using the Scikit-Learn library:


from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

# Create an array with random values
X = np.random.rand(100,5)

# Fit PCA on X
pca = PCA().fit(X)

# Calculate Variance Explained
var_exp = pca.explained_variance_ratio_

# Calculate Cumulative Variance Explained
cum_var_exp = np.cumsum(var_exp)

# Plot Scree Plot
plt.plot(range(1, len(var_exp) + 1), var_exp, 'ro-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Variance Explained')
plt.show()

In the code above, we first create an array with random values and fit PCA on it. We then calculate the variance explained by each principal component using the `explained_variance_ratio_` attribute of PCA. Next, we calculate the cumulative variance explained up to each principal component using NumPy’s `cumsum()` function.

Finally, we plot a scree plot using matplotlib’s `plot()` function. The x-axis represents the principal components and the y-axis represents the variance explained by each. The scree plot helps us decide how many principal components to retain; because the data here is random, the curve declines only gradually, whereas real data with correlated features usually shows a clearer elbow.
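
Building on the cumulative variance computed above, a common follow-up (a small sketch, not part of the original snippet) is to pick the smallest number of components that reaches a target such as 95% of the variance. Scikit-learn can also do this selection for us by passing a float between 0 and 1 as `n_components`:


# Smallest number of components whose cumulative explained variance reaches 95%
k = int(np.argmax(cum_var_exp >= 0.95)) + 1
print(f"Keep {k} components to explain at least 95% of the variance")

# Equivalent shortcut: let PCA choose the number of components for a target ratio
pca_95 = PCA(n_components=0.95).fit(X)
print(pca_95.n_components_)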

Conclusion

In this blog post, we have explored the concept of Principal Component Analysis (PCA) and how it can be used for dimensionality reduction in machine learning. We started by discussing the need for dimensionality reduction and how PCA helps us achieve it. We then went on to explain the mathematical concepts behind PCA, including eigenvectors and eigenvalues.

Next, we implemented PCA using Python and the Scikit-Learn library. We used the Iris dataset to illustrate how PCA can reduce the dimensions of a dataset while preserving most of its variance. We also visualized the results of our analysis using scatter plots.

Finally, we looked at how to evaluate the results of PCA using the variance explained, the cumulative variance explained, and a scree plot.

In conclusion, Principal Component Analysis is an essential technique in machine learning that helps us reduce the dimensions of our datasets while preserving important features. It is a powerful tool that can help us gain insights into our data and make better predictions.

