Pearson Correlation Coefficient with Scipy Pearsonr

Introduction

When analyzing data, it is often useful to measure the strength of the relationship between two variables. One common method for doing this is by calculating the Pearson correlation coefficient. The Pearson correlation coefficient measures the linear relationship between two variables and is a value between -1 and 1.

What is Pearson Correlation Coefficient?

The Pearson correlation coefficient is a measure of the strength and direction of the linear relationship between two variables. It is denoted by r and ranges from -1 to 1. A value of -1 indicates a perfectly negative correlation, 0 indicates no correlation, and 1 indicates a perfectly positive correlation.

In other words, if two variables have a high positive correlation, it means that when one variable increases, the other variable also tends to increase. On the other hand, if they have a high negative correlation, it means that when one variable increases, the other variable tends to decrease.

For example, let’s say we have data on the number of hours studied and the exam scores of a group of students. We can use the Pearson correlation coefficient to determine whether there is a linear relationship between these two variables. A positive correlation (r > 0) suggests that students who study more tend to score higher on exams. A negative correlation (r < 0) suggests that students who study more tend to score lower on exams.

In Python, we can calculate the Pearson correlation coefficient using the `pearsonr` function from the `scipy.stats` module. Here’s an example:


from scipy.stats import pearsonr

# Example data
hours_studied = [5, 10, 15, 20, 25]
exam_scores = [60, 70, 80, 90, 100]

# Calculate Pearson correlation coefficient
r, p_value = pearsonr(hours_studied, exam_scores)

print("Pearson correlation coefficient:", r)

Output:

Pearson correlation coefficient: 0.9999999999999999

In this example, we have a perfect positive correlation between hours studied and exam scores: every additional 5 hours of study corresponds to exactly 10 more points, so r = 1. The printed value 0.9999999999999999 differs from 1 only because of floating-point rounding. Note that the `pearsonr` function also returns a p-value, which is a measure of the statistical significance of the correlation coefficient. We won’t go into detail about p-values here, but in general, a lower p-value indicates stronger evidence against the null hypothesis (i.e., no correlation).

How does Pearson Correlation Coefficient work?

The Pearson Correlation Coefficient is a measure of the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfectly negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfectly positive linear correlation.
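For reference, the usual textbook definition of the sample coefficient can be written as follows. The symbols here are not from the text above and are introduced only for illustration: $x_i$ and $y_i$ are the paired observations, $\bar{x}$ and $\bar{y}$ are their means, and $n$ is the number of pairs.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is large and positive when the two variables tend to sit above or below their means together, and the denominator rescales the result so that it always falls between -1 and 1.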

To calculate the Pearson Correlation Coefficient with Scipy’s `pearsonr` function, we need two arrays of data that represent the two variables we want to compare. The function returns two values: the correlation coefficient and the p-value.

The correlation coefficient tells us how strong the linear relationship is between the two variables. A value closer to -1 or 1 indicates a stronger linear relationship, while a value closer to 0 indicates a weaker linear relationship.
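To make the coefficient less of a black box, here is a minimal sketch that computes it by hand with NumPy and compares the result with `pearsonr`. The exam scores are made-up illustrative values, and the variable names `x_dev`, `y_dev` and `r_manual` are chosen here for the example only.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: hours studied and (imperfectly related) exam scores
hours_studied = np.array([5, 10, 15, 20, 25], dtype=float)
exam_scores = np.array([62, 68, 83, 88, 99], dtype=float)

# Deviations of each observation from its variable's mean
x_dev = hours_studied - hours_studied.mean()
y_dev = exam_scores - exam_scores.mean()

# Sum of products of deviations, scaled by the product of their norms
r_manual = np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))

# The same quantity from SciPy, for comparison
r_scipy, _ = pearsonr(hours_studied, exam_scores)

print("Manual r:", r_manual)
print("SciPy r: ", r_scipy)
```

Both values should agree up to floating-point rounding. They come out close to 1 but not exactly 1, because the scores do not lie on a perfectly straight line.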

The p-value tells us whether the correlation coefficient is statistically significant or not. If the p-value is less than our chosen significance level (typically 0.05), we can conclude that there is a significant linear relationship between the two variables.
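The decision rule described above could be sketched as follows. The data are hypothetical, and `alpha` and `is_significant` are names picked for this illustration; 0.05 is only the conventional threshold, not a requirement.

```python
from scipy.stats import pearsonr

# Chosen significance level (a common convention, not a fixed rule)
alpha = 0.05

# Hypothetical measurements that follow a nearly straight line
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r, p_value = pearsonr(x, y)

# Compare the p-value against the chosen significance level
is_significant = p_value < alpha

print("r:", r)
print("p-value:", p_value)
print("Significant at alpha =", alpha, ":", is_significant)
```

With data this close to a straight line, the p-value falls far below 0.05, so the linear relationship would be reported as statistically significant.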

Scipy’s Pearsonr function

Scipy is a powerful Python library that provides various tools for scientific computing. One of the many functions available in Scipy is `pearsonr`, which is used to calculate the Pearson correlation coefficient between two arrays of data.

The Pearson correlation coefficient, also known as Pearson’s r, is a measure of the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.

To use the `pearsonr` function in Scipy, we need to import it from the `scipy.stats` module. Here’s an example:


from scipy.stats import pearsonr

# Two arrays of data
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

# Calculate Pearson's r
corr_coef, p_value = pearsonr(x, y)

print("Pearson correlation coefficient:", corr_coef)
print("p-value:", p_value)

In this example, we have two arrays of data `x` and `y`. We then pass these arrays to the `pearsonr` function and store the result in `corr_coef` and `p_value`. The `corr_coef` variable contains the Pearson correlation coefficient while the `p_value` variable contains the two-tailed p-value.

It’s important to note that Pearson’s r only measures linear association and can be distorted by outliers, and that the p-value returned by `pearsonr` assumes both variables are normally distributed. If these conditions are not met, other correlation coefficients such as Spearman’s rank correlation coefficient may be more appropriate.
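As a rough illustration of when Spearman’s coefficient can be the better fit, the sketch below uses made-up data in which `y` always increases with `x`, but not along a straight line. It uses `spearmanr`, which also lives in `scipy.stats`.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y grows with x, but along a curve rather than a line
x = np.arange(1, 11)
y = x ** 3

# Pearson's r measures how well a straight line describes the relationship
r, _ = pearsonr(x, y)

# Spearman's rho works on ranks, so it only asks whether y rises whenever x rises
rho, _ = spearmanr(x, y)

print("Pearson r:   ", r)
print("Spearman rho:", rho)
```

Here Spearman’s rho is 1 (up to rounding) because the ordering of the two variables matches exactly, while Pearson’s r is noticeably below 1 because the relationship is curved.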

In summary, Scipy’s `pearsonr` function provides an easy and efficient way to calculate the Pearson correlation coefficient between two arrays of data in Python.

Example 1: Finding Pearson Correlation Coefficient between two variables

To find the Pearson Correlation Coefficient between two variables, we can use the `pearsonr` function from the `scipy.stats` module. This function takes in two arrays or lists of data points as its arguments and returns two values – the first value is the correlation coefficient and the second value is the p-value.

Let’s say we have two arrays `x` and `y` with some data points that we want to find the correlation coefficient for. We can do this as follows:


from scipy.stats import pearsonr

# Two arrays of data points
x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

# Finding Pearson Correlation Coefficient
corr_coef, p_value = pearsonr(x, y)
print("Correlation Coefficient:", corr_coef)

In this example, we have taken two arrays `x` and `y` with five data points each. We then passed these arrays as arguments to the `pearsonr` function and stored the returned values in two variables – `corr_coef` and `p_value`. Finally, we printed out the value of `corr_coef`, which gives us the Pearson Correlation Coefficient between `x` and `y`.

The output of this code will be:


Correlation Coefficient: -1.0

Here, we get a perfect negative correlation (-1.0) between `x` and `y`, which means that when one variable increases, the other variable decreases in a perfectly linear fashion.

Example 2: Finding Pearson Correlation Coefficient between multiple variables

In the previous example, we calculated the Pearson correlation coefficient between two variables. Scipy’s `pearsonr` function works on one pair of arrays at a time, so when we have more than two variables we call it once for each pair.

Let’s say we have a dataset with three variables: x, y, and z. We can calculate the Pearson correlation coefficient between all possible pairs of variables by storing the arrays in a dictionary and looping over the pairs with `itertools.combinations`.


from itertools import combinations

import numpy as np
from scipy.stats import pearsonr

# create dataset with three variables, keyed by variable name
data = {
    'x': np.array([1, 2, 3, 4, 5]),
    'y': np.array([6, 7, 8, 9, 10]),
    'z': np.array([11, 12, 13, 14, 15]),
}

# loop through all unique pairs of variable names
for name_a, name_b in combinations(data, 2):
    # calculate Pearson correlation coefficient and p-value
    corr_coef, p_value = pearsonr(data[name_a], data[name_b])
    # print result
    print("Pearson Correlation Coefficient between", name_a, "and", name_b, "is", corr_coef)

In this example code block, we created a dataset with three variables (x, y and z) and stored them as numpy arrays in a dictionary keyed by variable name. We then used `itertools.combinations` to loop through every unique pair of names: x and y, x and z, and y and z.

Inside the loop, we looked up the two arrays by name in the dictionary. We then called the `pearsonr()` function on each pair of variables and stored the results in `corr_coef` and `p_value`.

Finally, we printed out the Pearson correlation coefficient between each pair of variables. The output of this code block will be:


Pearson Correlation Coefficient between x and y is 1.0
Pearson Correlation Coefficient between x and z is 1.0
Pearson Correlation Coefficient between y and z is 1.0

Since all three variables have a perfect positive correlation, the Pearson correlation coefficient between all pairs of variables is 1.0.
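If only the coefficients are needed, one alternative worth knowing is NumPy’s `np.corrcoef`, which returns the whole correlation matrix in a single call. This is a sketch of a different technique from the one used above, not a replacement for it: unlike `pearsonr`, it does not return p-values.

```python
import numpy as np

# Same three variables as in the example above
x = np.array([1, 2, 3, 4, 5])
y = np.array([6, 7, 8, 9, 10])
z = np.array([11, 12, 13, 14, 15])

# Each row is treated as one variable; entry [i, j] is the
# Pearson correlation coefficient between variable i and variable j
corr_matrix = np.corrcoef([x, y, z])

print(corr_matrix)
```

The result is a 3 x 3 symmetric matrix whose diagonal holds each variable’s correlation with itself. For this dataset every entry is 1 (up to rounding), matching the pairwise results.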

Conclusion

In conclusion, the Pearson correlation coefficient is a powerful tool for measuring the strength and direction of the linear relationship between two variables. With the help of Scipy’s pearsonr function, we can easily calculate this coefficient in Python.

It’s important to keep in mind that correlation does not imply causation, and a high correlation coefficient does not necessarily mean that one variable causes the other. However, it can provide valuable insights into the relationship between variables and inform further analysis.

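When interpreting the results of a Pearson correlation coefficient calculation, it’s important to consider both the magnitude of the coefficient and its p-value. A high coefficient with a low p-value indicates a strong linear relationship that would be unlikely to appear by chance if the two variables were actually uncorrelated. A high coefficient on its own is not enough, because with very few data points even a large r can easily arise by chance.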
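To illustrate why the coefficient and the p-value should be read together, here is a small sketch with made-up data. The three data points are hypothetical and deliberately few.

```python
from scipy.stats import pearsonr

# Hypothetical data with only three observations
x = [1, 2, 3]
y = [2, 4, 5]

r, p_value = pearsonr(x, y)

print("Pearson correlation coefficient:", r)
print("p-value:", p_value)
```

The coefficient comes out at roughly 0.98, yet the p-value is roughly 0.12, which is above the usual 0.05 threshold. With so few observations, a strong-looking correlation is not enough evidence of a real linear relationship.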

In addition to understanding how to calculate and interpret Pearson correlation coefficients, it’s also important to check that your data suits the method. Pearson’s r only captures linear relationships and is sensitive to outliers, and the p-value reported by `pearsonr` assumes that both variables are normally distributed. If these conditions are not met, alternative methods such as Spearman’s rank correlation coefficient may be more appropriate.

Overall, the Pearson correlation coefficient is a valuable tool for analyzing relationships between variables in many different fields, from finance to social science to biology. By using Scipy’s pearsonr function in Python, we can easily calculate this powerful statistic and gain insights into our data.

Pierian Training
Pierian Training is a leading provider of high-quality technology training, with a focus on data science and cloud computing. Pierian Training offers live instructor-led training, self-paced online video courses, and private group and cohort training programs to support enterprises looking to upskill their employees.
