# A Refresher on Regression Analysis

In statistics, data can build a model, but it is equally true that a model can generate data. For example, the yield of a crop can be predicted by using the amount of fertilizer, the irrigation technique, and the fertility of the soil as predictor (or independent) variables (data builds the model).

Conversely, if we are given the optimum combination of these predictor variables, we can use the model to predict the corresponding crop yield (model builds the data).

Similarly, while examining a patient, a doctor sets the dosage keeping in mind the patient's other illnesses and previous medical records such as blood sugar level, cholesterol, eyesight, etc. Here the dosage can be considered the dependent variable, and the other illnesses and medical records the independent variables.

Such a relationship, in which a dependent variable is measured in terms of other independent variables, is expressed through terms like correlation and regression.

In simple terms, regression helps us predict or analyze relationships between two or more variables. The factor being predicted is known as the dependent variable, and the factors used to predict its values are called independent variables.

Regression analysis is used to do exactly this. For example, you might guess that there is a connection between how much you eat and how much you weigh; regression analysis can help you quantify that connection. Regression analysis gives us an equation for a graph so that we can make predictions about our data.

## Introduction

In statistics, random numbers lying in a table make little sense on their own. To make sense of them, we can use regression and draw inferences about the future behavior of the given random variable.

Suppose you’re a sales manager trying to predict next month’s numbers. You know that dozens, perhaps even hundreds, of factors, from the weather to a competitor’s promotion to the rumor of a new and improved model, can impact the number.

Perhaps people in your organization even have a theory about what will have the biggest effect on sales. “Trust me. The more rain we have, the more we sell.” “Six weeks after the competitor’s promotion, sales jump.”

Regression analysis is a way of mathematically sorting out which of those variables does indeed have an impact. It answers the questions: Which factors matter most? Which can we ignore? How do those factors interact with each other? And, perhaps most importantly, how certain are we about all of these factors?

## Simple Linear Regression

The best way to understand linear regression is to relive a childhood experience. If you ask a fifth-grade child to arrange the people in his class in increasing order of weight, without asking their weights, the child would likely look at (visually analyze) the height and build of his classmates and arrange them using a combination of these visible parameters.

The child has actually figured out that height and build are correlated with weight by a linear relationship. This is linear regression in real life!

In simple terms, simple linear regression is predicting the value of a variable Y (the dependent variable) based on some variable X (the independent variable), provided there is a linear relationship between the variables X and Y.

If there is more than one independent variable, then we can predict the value of Y using multiple linear regression. For example, when we predict rent based on square feet alone, we can use simple linear regression; but when we predict rent based on square feet and the age of the building, we use multiple linear regression.

The linear relationship between the two variables can be represented by a straight line, called the **regression line**.

Now, to determine whether there is a linear relationship between two variables, we can simply draw a scatter plot (plotting the coordinates (x, y) on a graph) of variable Y against variable X. If the plotted points are randomly scattered, then it can be inferred that the variables are not related.

If the points lie roughly along a straight line, then there is a linear relationship between the variables.
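A numeric complement to the scatter plot is the Pearson correlation coefficient, which is close to ±1 when the points lie near a straight line and close to 0 when they are randomly scattered. The article's worked examples use R; the short Python sketch below is only an illustration, applied to the height/weight data used later in this article:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return num / den

# Heights (x) and weights (y) from the R example later in this article.
x = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]
y = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]

r = pearson_r(x, y)
print(r)  # close to 1, so a linear fit is reasonable here
```

A value this close to 1 is what the visual impression of a near-straight-line scatter plot corresponds to numerically.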

After drawing a straight line through the plotted points, we will find that not all the points lie on the line. This happens because the line we have drawn may not be the best fit, and the plotted points are probabilistic, i.e., our observations are approximate.

But when there exists a linear relationship between X and Y, we can draw more than one line through these points. How do we know which one is the best fit?

To help us choose the line of best fit, we use the method of least squares.

**Least Squares**

This is the mathematical relationship between the variables X and Y:

Y = b₀ + b₁X + e

where:

- X is the independent variable
- Y is the dependent variable
- b₀ is the intercept of the regression line
- b₁ is the slope of the regression line
- e is the error, i.e., the deviation of the observed value of Y from the value predicted by the line

Here, eᵢ = yᵢ − ŷᵢ is the difference between the ith observed value and the ith calculated (fitted) value. This error could be positive or negative. We have to minimize the error sum of squares, Σeᵢ², to get the line of best fit. On minimizing it, we obtain the values of b₀ and b₁ from the two normal equations

Σyᵢ = n·b₀ + b₁·Σxᵢ

and

Σxᵢyᵢ = b₀·Σxᵢ + b₁·Σxᵢ²

Then, we find the fitted values ŷᵢ for the given values of xᵢ and plot the line of best fit.
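The normal equations can be solved by hand. The following Python sketch (used here only as a calculator, since the worked examples in this article are in R) does so for the height/weight data that appears in the R examples below:

```python
# Solve the two normal equations for b0 and b1 by hand.
# Data: the same heights (x) and weights (y) used in the R examples.
x = [151, 174, 138, 186, 128, 136, 179, 163, 152, 131]
y = [63, 81, 56, 91, 47, 57, 76, 72, 62, 48]

n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi * xi for xi in x)

# From the normal equations:
#   sum(y)  = n*b0  + b1*sum(x)
#   sum(xy) = b0*sum(x) + b1*sum(x^2)
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = (sum_y - b1 * sum_x) / n

print(f"Y = {b0:.4f} + {b1:.4f} * X")  # roughly Y = -38.4551 + 0.6746 * X
```

These values agree with the intercept and slope that R's `lm()` reports for the same data.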

**R Code for Simple Linear Regression**

We use the *lm()* function to create a relationship model between the predictor and the response variable. The basic syntax of the *lm()* function is:

```r
lm(formula, data)
```

where:

- formula – a symbolic description of the relationship between x and y, e.g. `y ~ x`
- data – the data frame on which the formula will be applied

**For example:**

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)
print(relation)
```

So, as a general template, the code becomes (the placeholder names stand for your own datasets):

```r
# Load train and test datasets.
# Identify feature and response variable(s); values must be numeric.
x_train <- input_variables_values_training_datasets
y_train <- target_variables_values_training_datasets
x_test <- input_variables_values_test_datasets
x <- cbind(x_train, y_train)

# Train the model using the training sets and check the fit.
linear <- lm(y_train ~ ., data = x)
summary(linear)

# Predict output.
predicted <- predict(linear, x_test)
```

**For example:**

```r
# The predictor vector.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)

# The response vector.
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Apply the lm() function.
relation <- lm(y ~ x)

# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
```

## Multiple Regression Analysis

Multiple regression analysis is almost the same as simple linear regression. The only difference between simple linear regression and multiple linear regression is the number of independent (or predictor) variables used in the regression.

- Simple linear regression analysis uses a single X variable for each dependent variable Y, i.e., pairs (xᵢ, yᵢ).
- Multiple regression uses multiple X variables for each dependent variable Y, i.e., tuples (x₁, x₂, x₃, yᵢ).

For example, if we want to find out whether the weight, height, and age of people explain the variance in their cholesterol levels, then multiple regression will come to our rescue. We may take weight, height, and age as independent variables x₁, x₂, and x₃, and cholesterol as our dependent variable.
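The computation behind multiple regression is the same least-squares idea as before, just with more coefficients: we solve the normal equations (XᵀX)b = Xᵀy. The article's worked examples use R's `lm()`; the pure-Python sketch below only illustrates the underlying computation, on hypothetical weight/height/age/cholesterol data generated without noise from known coefficients, so the fit recovers them exactly:

```python
# A from-scratch sketch of multiple linear regression (ordinary least squares).
# Solves the normal equations (X'X) b = X'y for b = (b0, b1, ..., bm).

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for i in range(n):
        # Partial pivoting for numerical stability.
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [v - f * w for v, w in zip(m[r], m[i])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... + bm*xm; rows is a list of predictor tuples."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(Xt[i][k] * y[k] for k in range(len(y))) for i in range(len(Xt))]
    return solve(XtX, Xty)

# Hypothetical data: (weight kg, height cm, age yr) -> cholesterol,
# generated without noise from known coefficients so the fit recovers them.
rows = [(70, 170, 40), (80, 180, 50), (60, 160, 35),
        (90, 175, 60), (75, 165, 42), (65, 185, 30)]
true_b = [10.0, 1.5, 0.2, 0.8]  # b0, b_weight, b_height, b_age
y = [true_b[0] + true_b[1]*w + true_b[2]*h + true_b[3]*a for w, h, a in rows]

b = ols(rows, y)
print(b)  # approximately [10.0, 1.5, 0.2, 0.8]
```

In practice you would of course call `lm()` (or an equivalent library routine) rather than writing the elimination by hand; the sketch only shows what the library does for you.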

**Assumptions:**

- Regression residuals (or error term) must be normally distributed.
- A linear relationship is assumed between the dependent and the independent variable.
- The error terms are homoscedastic (constant variance); a plot of the residuals against the fitted values should look approximately rectangular.
- The independent variables are not too highly correlated with each other.

There are three major uses of multiple regression analysis. First, it can be used to forecast the effects or impacts of changes in the future, i.e., it helps us understand how much the dependent variable will change when we change the independent variables. For example, we can use multiple regression to find how much GPA is expected to increase (or decrease) for every one point increase (or decrease) in IQ.

Also, it can be used to identify the strength of the effect that the independent variables have on the dependent variable.

Lastly, multiple linear regression analysis predicts future values. It can be used to get point estimates. For example, to know what factors affect the crop yield the most, multiple regression analysis can be used.

**Conducting a Multiple Linear Regression Analysis**

First, we plot scatter plots of every independent variable against the dependent variable. These scatter plots help us understand the direction and strength of the correlation among the variables.

If a scatter plot shows the points rising roughly along a line, there is a positive correlation between the dependent and the independent variable. If instead it shows an arch-like curve, a regression line might not be the best way to explain the data, even if the correlation between the variables is positive.

The second step of multiple linear regression is to formulate the model, i.e., to hypothesize that the variables X1, X2, and X3 have a causal influence on the variable Y and that their relationship is linear.

The last step is to fit the regression line.

The multiple linear regression equation is given as:

Y = b₀ + b₁X₁ + b₂X₂ + … + bₘXₘ + e

Proceeding in the same way as above, we find the constants b₀, b₁, …, bₘ and then obtain the fitted values ŷᵢ for the given values of the independent variables.

Then, we plot the corresponding coordinates and draw the lines of best fit for each combination of independent and dependent variables.

Here, b₀ is the intercept and b₁, …, bₘ are the regression coefficients. They can be interpreted the same way as the slope in simple linear regression. Thus, if bᵢ = 2.5, it would indicate that Y will increase by 2.5 units if Xᵢ increases by 1 unit while the other variables are held fixed.

The larger the magnitude of bᵢ (when the variables are measured on comparable scales), the more strongly Y is related to Xᵢ; a coefficient near zero indicates a weak relationship.
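The "increase by bᵢ units" reading of a coefficient can be made concrete with a tiny Python sketch (the coefficient values here are hypothetical, chosen only for illustration):

```python
# Hypothetical fitted coefficients: intercept b0 and slopes b1, b2.
b0, b1, b2 = 4.0, 2.5, -1.2

def predict(x1, x2):
    """Prediction from the fitted multiple regression equation."""
    return b0 + b1 * x1 + b2 * x2

# Increasing x1 by 1 unit while holding x2 fixed changes the
# prediction by exactly b1.
delta = predict(6, 3) - predict(5, 3)
print(delta)  # ≈ 2.5
```

This is what "holding the other variables fixed" means operationally: the coefficient is the change in the prediction per unit change in its own variable.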

**R Code for Multiple Linear Regression**

The code for multiple regression is similar to that of simple linear regression. We consider the following example to understand the code.

Consider the data set "mtcars" available in the R environment. It gives a comparison between different car models in terms of miles per gallon ("mpg"), cylinder displacement ("disp"), horsepower ("hp"), weight of the car ("wt"), and some more parameters.

The goal of the model is to establish the relationship between "mpg" as the response variable and "disp", "hp", and "wt" as predictor variables. We create a subset of these variables from the mtcars dataset for this purpose.

```r
# Create a subset of the relevant columns.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Create the relationship model.
model <- lm(mpg ~ disp + hp + wt, data = input)

# Show the model.
print(model)

# Get the intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # #", "\n")

a <- coef(model)[1]
print(a)

Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]

print(Xdisp)
print(Xhp)
print(Xwt)
```

Using the coefficient values, we create the mathematical equation:

```
Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
```

We can use the regression equation created above to predict the mileage when a new set of values for displacement, horsepower, and weight is provided.

For a car with disp = 221, hp = 102, and wt = 2.91, the predicted mileage is:

```
Y = 37.15 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.7104
```
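The arithmetic can be checked in a couple of lines (Python used here purely as a calculator; the rounded coefficient values are the ones quoted above from the R model):

```python
# Coefficients from the fitted mtcars model (rounded, as quoted above).
a, Xdisp, Xhp, Xwt = 37.15, -0.000937, -0.0311, -3.8008

# Predicted mileage for disp = 221, hp = 102, wt = 2.91.
mpg = a + Xdisp * 221 + Xhp * 102 + Xwt * 2.91
print(round(mpg, 4))  # 22.7104
```

In R itself you would simply call `predict(model, data.frame(disp = 221, hp = 102, wt = 2.91))`, which uses the unrounded coefficients.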