Correlation Analysis Using R

Do you come across questions like “Is X related to Y?” Such questions are common in almost every sphere, from the relation between income and education to the relation between the number of hours studied and the marks scored in an exam. Correlation describes the relationship between the variables under study. In this article, we’ll look at the linear relationship.

Scatterplot

Scatterplots are used to visualize the data and get a rough idea of the existence and degree of correlation between two variables. The plot shows the scatter of the points, hence the name scatterplot.

Here, for the birthweight data, we have a look at the scatterplot of Gestation period (in weeks) v/s Birthweight.
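A scatterplot like this can be drawn with base R's plot() function. The sketch below is only illustrative: the data frame `birthweight`, its column names and all its values are invented stand-ins, not the article's dataset, and the unit of birthweight is left unspecified.

```r
# Hypothetical stand-in for the birthweight data (values invented for illustration).
birthweight <- data.frame(
  Gestation   = c(34, 36, 37, 38, 38, 39, 40, 40, 41, 42),          # weeks
  Birthweight = c(2.1, 2.6, 2.9, 3.0, 3.3, 3.2, 3.5, 3.4, 3.8, 3.9) # unit assumed
)

# Gestation period on the x-axis, birthweight on the y-axis.
plot(birthweight$Gestation, birthweight$Birthweight,
     xlab = "Gestation period (weeks)",
     ylab = "Birthweight",
     main = "Gestation period v/s Birthweight",
     pch  = 19)
```

Points that rise from the lower left to the upper right suggest a positive relationship; a downward drift suggests a negative one.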

The scatterplot indicates a positive correlation between the Gestation period (in weeks) and Birthweight. Now, to quantify this relationship, we use the coefficient of correlation.

Coefficient of Correlation

The coefficient of linear correlation measures the strength and direction of the linear relationship between two variables. Put simply, it quantifies the relationship between the variables under study.

The degree of association between the x and y values is summarised by the value of an appropriate correlation coefficient, each of which takes values from -1 to +1. A negative value indicates that the two variables move in opposite directions, e.g. the speed of a train and the time taken to reach the destination exhibit a negative correlation. A positive value indicates that the two variables move in the same direction, e.g. the height and weight of a human being.

In this section we look at three correlation coefficients: Pearson, Spearman’s rank and Kendall’s rank.

Pearson’s correlation coefficient

Pearson correlation coefficient (also called Pearson’s product-moment correlation coefficient) measures the strength of the linear relationship between two variables.

The correlation between two variables is calculated in R using the cor() function.

For the Birthweight data, we observe a moderate positive correlation between Gestation period (in weeks) and Birthweight.
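As a minimal sketch of the cor() call, the snippet below computes Pearson's coefficient and checks it against the textbook formula. The vectors `gestation` and `weight` are invented stand-ins for the Birthweight data, so the value they give will not match the article's result.

```r
# Hypothetical values (not the article's dataset).
gestation <- c(34, 36, 37, 38, 38, 39, 40, 40, 41, 42)
weight    <- c(2.1, 2.6, 2.9, 3.0, 3.3, 3.2, 3.5, 3.4, 3.8, 3.9)

# method = "pearson" is the default of cor(); it is spelled out for clarity.
r <- cor(gestation, weight, method = "pearson")

# Same quantity from the definition:
# sum of cross-deviations divided by the square root of the
# product of the two sums of squared deviations.
dx <- gestation - mean(gestation)
dy <- weight - mean(weight)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r)
print(r_manual)
```

Both values should agree, which shows that cor() is computing the product-moment coefficient described above.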

Spearman’s rank correlation

Spearman’s rank correlation measures correlation based on the ranks of the observations. If the data are quantitative, it is less precise than Pearson’s correlation coefficient, because Pearson’s coefficient uses the actual observations, which carry more information than their ranks.

Q. Find the Spearman’s rank correlation between Mathematics and Statistics marks scored by 2nd-year college students.

There is a moderate positive correlation between Mathematics and Statistics marks scored by 2nd-year college students.
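A calculation of this kind can be sketched with cor() and method = "spearman". The marks in `maths` and `stats` below are invented for illustration and are not the article's data. Because these invented marks contain no ties, the coefficient can also be checked with the classical formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between the two ranks of each student.

```r
# Hypothetical marks for 10 students (no tied values).
maths <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
stats <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)

# Spearman's rank correlation via cor().
rho <- cor(maths, stats, method = "spearman")

# Equivalent view 1: Pearson's correlation applied to the ranks.
rho_ranks <- cor(rank(maths), rank(stats))

# Equivalent view 2: rank-difference formula (valid when there are no ties).
n <- length(maths)
d <- rank(maths) - rank(stats)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

print(c(rho = rho, rho_ranks = rho_ranks, rho_formula = rho_formula))
```

All three numbers should coincide, which makes explicit that Spearman's coefficient depends on the data only through their ranks.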

Kendall’s rank correlation

Kendall’s rank correlation coefficient τ measures the strength of the rank-based dependence between two variables. Any pair of observations (Xi, Yi) and (Xj, Yj), where i ≠ j, is said to be concordant if the ranks for both elements agree, i.e. Xi > Xj and Yi > Yj, or Xi < Xj and Yi < Yj; otherwise it is said to be discordant. τ is based on the difference between the number of concordant and discordant pairs.

Q. Two judges ranked 10 contestants in a fancy dress competition. The ranks are from most favorite to least favorite. Calculate Kendall’s rank correlation.

There is a high positive correlation between the ranks given by the two judges.
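To make the concordant/discordant idea concrete, the sketch below counts the pairs by hand and compares the result with cor() using method = "kendall". The rankings in `judge1` and `judge2` are invented for illustration (they are not the article's data) and contain no ties, so the simple formula (concordant - discordant) / (number of pairs) applies.

```r
# Hypothetical rankings of 10 contestants by two judges (no ties).
judge1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
judge2 <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# Kendall's tau via cor().
tau <- cor(judge1, judge2, method = "kendall")

# Manual count over all pairs (i, j) with i < j.
n     <- length(judge1)
pairs <- combn(n, 2)                      # 2 x choose(n, 2) matrix of index pairs
s <- sign(judge1[pairs[1, ]] - judge1[pairs[2, ]]) *
     sign(judge2[pairs[1, ]] - judge2[pairs[2, ]])
concordant <- sum(s > 0)                  # both judges order the pair the same way
discordant <- sum(s < 0)                  # the judges order the pair oppositely
tau_manual <- (concordant - discordant) / choose(n, 2)

print(c(concordant = concordant, discordant = discordant))
print(c(tau = tau, tau_manual = tau_manual))
```

For these invented rankings, 40 of the 45 pairs are concordant and 5 are discordant, giving τ = 35/45 ≈ 0.78.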

Inference Procedures For Correlation Coefficient

The sample correlation coefficient (studied so far) measures the extent of the linear relationship between the two variables (X, Y) for the sample data. The population parameter, ρ, measures the extent of the linear relationship between the variables (X, Y) in the population.

We are usually interested in testing whether the population correlation coefficient is significant or not. The hypothesis is stated as,

H0: ρ = 0, which is tested against one of the following alternatives:
H1: ρ ≠ 0
H1: ρ > 0
H1: ρ < 0
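In R, these three alternatives correspond to the `alternative` argument of cor.test(), which accepts "two.sided", "greater" and "less". The sketch below uses invented vectors `x` and `y` purely for illustration.

```r
# Hypothetical paired observations (invented for illustration).
x <- c(34, 36, 37, 38, 38, 39, 40, 40, 41, 42)
y <- c(2.1, 2.6, 2.9, 3.0, 3.3, 3.2, 3.5, 3.4, 3.8, 3.9)

two_sided <- cor.test(x, y, alternative = "two.sided")  # H1: rho != 0
greater   <- cor.test(x, y, alternative = "greater")    # H1: rho > 0
less      <- cor.test(x, y, alternative = "less")       # H1: rho < 0

# Compare the p-values of the three tests.
print(c(two_sided = two_sided$p.value,
        greater   = greater$p.value,
        less      = less$p.value))
```

For Pearson's test with a positive sample correlation, the two-sided p-value is twice the "greater" p-value, and the two one-sided p-values sum to 1.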

Inference under Pearson correlation

For the Birthweight data, we test the hypothesis:

H0: The population correlation coefficient between birthweight and gestation period is equal to 0
v/s
H1: The population correlation coefficient between birthweight and gestation period is not equal to 0

The test is carried out in R using the cor.test() function.

As the p-value is less than 0.05, reject H0 and conclude that the population correlation coefficient between birthweight and gestation period is not equal to 0.
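A test of this form can be sketched as follows. The vectors `gestation` and `weight` are invented stand-ins for the Birthweight data, so the printed numbers will differ from the article's output; only the shape of the call and the way the result is read are the point here.

```r
# Hypothetical values (not the article's dataset).
gestation <- c(34, 36, 37, 38, 38, 39, 40, 40, 41, 42)
weight    <- c(2.1, 2.6, 2.9, 3.0, 3.3, 3.2, 3.5, 3.4, 3.8, 3.9)

# Two-sided test of H0: rho = 0 using Pearson's coefficient (the default method).
res <- cor.test(gestation, weight, method = "pearson", alternative = "two.sided")

print(res$estimate)   # sample correlation coefficient
print(res$p.value)    # p-value of the test
print(res$conf.int)   # 95% confidence interval for rho (default conf.level)

# Decision at the 5% level of significance.
reject_h0 <- res$p.value < 0.05
print(reject_h0)
```

cor.test() returns an object of class "htest", so printing `res` directly also gives a formatted summary of the test.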

Inference under Spearman’s Rank Correlation

Since we are using ranks rather than the actual data, no assumption is needed about the distribution of X, Y or (X, Y), i.e. it is a non-parametric test. For the data giving Mathematics and Statistics marks scored by 2nd-year college students, test the following hypothesis:

H0: The population correlation coefficient between Mathematics and Statistics marks is equal to 0
v/s
H1: The population correlation coefficient between Mathematics and Statistics marks is not equal to 0

As the p-value < 0.05, reject H0 and conclude that the population correlation coefficient between Mathematics and Statistics marks is not equal to 0.
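The corresponding call is cor.test() with method = "spearman". The marks below are invented for illustration, so the p-value they produce need not lead to the same decision as the article's data. They contain no ties; with tied values R cannot compute an exact p-value and issues a warning.

```r
# Hypothetical marks for 10 students (no tied values).
maths <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
stats <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 63)

# Two-sided test of H0: rho = 0 based on Spearman's rank correlation.
res <- cor.test(maths, stats, method = "spearman", alternative = "two.sided")

print(res$estimate)  # Spearman's rho
print(res$p.value)   # compare with the chosen level of significance, e.g. 0.05
```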

Inference under Kendall Rank correlation

For the data giving ranks of 10 contestants in a fancy dress competition by 2 judges, test the following hypothesis:

H0: The population correlation coefficient between rank given by Judge 1 and Judge 2 is equal to 0
v/s
H1: The population correlation coefficient between rank given by Judge 1 and Judge 2 is not equal to 0

As the p-value < 0.05, reject H0 and conclude that the population correlation coefficient between rank given by Judge 1 and Judge 2 is not equal to 0.
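For Kendall's coefficient the call is cor.test() with method = "kendall". The rankings below are invented for illustration and are not the article's data.

```r
# Hypothetical rankings of 10 contestants by two judges (no ties).
judge1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
judge2 <- c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9)

# Two-sided test of H0: tau = 0 based on Kendall's rank correlation.
res <- cor.test(judge1, judge2, method = "kendall", alternative = "two.sided")

print(res$estimate)  # Kendall's tau
print(res$p.value)

# Decision at the 5% level of significance.
reject_h0 <- res$p.value < 0.05
print(reject_h0)
```

With these invented rankings the judges agree closely, so H0 is rejected at the 5% level.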

The data used in the above examples can be downloaded from here.
