In everyday life, have you ever wondered whether there is a relationship between your height and your weight, your income and your expenditure, or your age and your working ability? The concept of correlation deals with such questions. In this article, we will learn about the correlation coefficient and how to interpret it in data science.
What is the correlation coefficient and how is it different from the regression coefficient?
The correlation coefficient between two variables X (say height) and Y (say weight) is a numerical measure of the linear relationship between them, denoted by ‘r’. It always lies between −1 and +1:
−1 ≤ r(X, Y) ≤ +1
Example: Suppose we have data on the number of study hours (X) and the number of sleeping hours (Y) of different students, and we want to check to what extent the two are related.
r(X, Y) = E[(X − E(X))(Y − E(Y))] / (√V(X) · √V(Y))
A negative value shows that the two variables are associated in opposite directions: if the value of one increases, the other decreases, i.e. as the number of sleeping hours increases, the study hours of a student decrease.
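The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study and sleep figures are made-up values, not data from any real survey, and `pearson_r` is a helper name chosen here.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # E[(X - E(X)) (Y - E(Y))]
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # V(X) and V(Y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / (sqrt(var_x) * sqrt(var_y))

# Hypothetical data: study hours (X) and sleeping hours (Y) of six students.
study_hours = [2, 3, 4, 5, 6, 7]
sleep_hours = [9, 8, 8, 7, 6, 5]

r = pearson_r(study_hours, sleep_hours)
print(round(r, 3))  # a strongly negative value for this made-up sample
```

Because sleeping hours fall as study hours rise in this sample, the computed r is negative and close to −1.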
Some Correlation Values And Their Interpretation:
It should be clearly understood that correlation analysis tells us about the linear association, or the absence of a linear relationship, between two variables ‘x’ and ‘y’. The following table explains the difference between the correlation coefficient and the regression coefficient.
Why is correlation a useful metric?
In almost any business problem, it is useful to express one quantity in terms of its relationship with others. For example, sales might increase when the marketing department spends more on TV advertisements, or a customer’s average purchase amount on an e-commerce website might depend on a number of factors related to that customer.
Often, correlation is the first step to understanding these relationships and subsequently building better business and statistical models.
As another example, consumer spending and GDP are two metrics that maintain a positive relationship with one another. When spending increases, GDP also rises as firms produce more goods and services to meet consumer demand. Conversely, firms slow production amid a slowdown in consumer spending to bring production costs in line with revenues and limit excess supply.
Statistical importance (helpful for analysis of correlated data)
Statistical inference based on Pearson’s correlation coefficient often focuses on one of the following two aims:
- One aim is to test the null hypothesis that the true correlation coefficient ρ is equal to 0, based on the value of the sample correlation coefficient r:
If r is the observed correlation coefficient in a sample of n pairs of observations from a bivariate normal population, then under
H0: ρ = 0, i.e. the population correlation coefficient is zero, the statistic
t = r√(n − 2) / √(1 − r²) follows a t distribution with (n − 2) degrees of freedom.
R code: cor.test(X, Y, method = "pearson") # X and Y are the collected sample values
- The other aim is to derive a confidence interval that, on repeated sampling, has a given probability of containing ρ.
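The test statistic for the first aim, t = r√(n − 2) / √(1 − r²), can be computed by hand as in the following minimal Python sketch. The function name `correlation_t_statistic` and the sample values (r = 0.6, n = 27) are assumptions made for illustration only.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """Return (t, degrees of freedom) for testing H0: rho = 0."""
    if n <= 2:
        raise ValueError("need more than two pairs of observations")
    if abs(r) >= 1:
        raise ValueError("the statistic is undefined when |r| = 1")
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return t, n - 2

# Hypothetical sample: r = 0.6 observed on n = 27 pairs.
t_stat, dof = correlation_t_statistic(0.6, 27)
print(t_stat, dof)  # t is about 3.75 with 25 degrees of freedom
```

The resulting t would then be compared with the critical value of the t distribution with n − 2 degrees of freedom, taken from a table or a statistics package.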
Do we always need the exact values of the variables to calculate the correlation between them?
Often, instead of quantitative data, we come across qualitative data given in a specific order or rank. In such situations, we can use a special type of correlation called Spearman’s rank correlation.
Spearman’s rank correlation coefficient can be defined as a special case of Pearson’s coefficient applied to the ranks of the variables. The formula for Spearman’s coefficient looks very similar to that of Pearson, with the distinction of being computed on ranks instead of raw scores:
ρ(rank.x, rank.y) = Cov(rank.x, rank.y) / (SD[rank.x] · SD[rank.y])
If all ranks are unique (i.e. there are no ties in ranks), you can also use a simplified version:
ρ = 1 − 6 Σ dᵢ² / (n(n² − 1))
where dᵢ is the difference between the two ranks of the i-th unit and n is the number of units ranked in the data.
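As a small illustration of the simplified no-ties version, 1 − 6 Σ d² / (n(n² − 1)) with d the difference between the two ranks of each unit, here is a Python sketch. The judges' scores are invented for the example, and the helper names `ranks` and `spearman_no_ties` are assumptions; the code is only valid when there are no tied values.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_no_ties(x, y):
    """Simplified Spearman formula: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    rank_x, rank_y = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical scores given by two judges to five contestants (no ties).
judge_a = [86, 72, 91, 65, 78]
judge_b = [80, 75, 95, 60, 70]

rho = spearman_no_ties(judge_a, judge_b)
print(rho)  # about 0.9: the two rankings differ only by one swap
```

Only the ranks enter the calculation, so any rescaling of the scores that preserves their order leaves the result unchanged.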
It should be noted that Pearson’s coefficient measures the strength of the linear relationship, whereas Spearman’s coefficient measures the strength of the monotonic relationship, so the two can differ on the same data. For example, consider the following cases:
1. The Pearson correlation coefficient is +1 when, as one variable increases, the other variable increases by a consistent amount. This relationship forms a perfect line. The Spearman correlation coefficient is also +1 in this case:
Pearson = +1, Spearman = +1
2. If the relationship is that one variable increases when the other increases, but the amount is not consistent, the Pearson correlation coefficient is positive but less than +1. The Spearman coefficient still equals +1 in this case.
Pearson = +0.851, Spearman = +1
3. When a relationship is random or non-existent, then both correlation coefficients are nearly zero.
Pearson = −0.093, Spearman = −0.093
4. If the relationship is a perfect line for a decreasing relationship, then both correlation coefficients are −1.
Pearson = −1, Spearman = −1
5. If the relationship is that one variable decreases when the other increases, but the amount is not consistent, then the Pearson correlation coefficient is negative but greater than −1. The Spearman coefficient still equals −1 in this case.
Pearson = −0.799, Spearman = −1
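The pattern in cases 1, 2 and 5 can be reproduced with a short Python sketch. The data here are synthetic (a straight line and a cubic curve chosen for illustration), so the numbers will not match the values quoted in the cases above, and Spearman’s coefficient is computed as Pearson’s coefficient on ranks, assuming no ties.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation computed from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman correlation as Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y_linear = [2 * v + 1 for v in x]       # increases by a consistent amount
y_curved = [v ** 3 for v in x]          # increases, but not consistently
y_falling = [-(v ** 3) for v in x]      # decreases, but not consistently

for label, y in [("linear", y_linear), ("curved", y_curved), ("falling", y_falling)]:
    print(label, round(pearson_r(x, y), 3), round(spearman(x, y), 3))
```

For the curved series Pearson’s coefficient stays below +1 while Spearman’s is +1, because the ordering of the values is preserved even though the increments are not constant.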
Thus, we have seen several settings in which the correlation coefficient is used to measure the association between variables. Keep in mind that Pearson’s correlation measures only linear relationships. Consequently, if your data contain a perfect curvilinear relationship, the correlation coefficient may fail to detect it; for a symmetric curve such as y = x² with x values centred on zero, it is exactly zero.
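As a quick check of the curvilinear case, the following Python sketch uses y = x² on x values placed symmetrically around zero. The data are synthetic and chosen specifically so that the linear association cancels out; other curved relationships can give a non-zero coefficient.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation computed from sums of deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# y is completely determined by x, yet the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r_curve = pearson_r(x, y)
print(r_curve)  # zero: the positive and negative halves cancel
```

Here y depends perfectly on x, but the coefficient gives no hint of that, which is why a scatter plot is worth inspecting alongside the number.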