Beginner to Advanced Level: Steps to Build a Regression Model

Part 1

You must have heard about regression models many times, but you might not have seen the step-by-step technique for building one.
First, we will talk about Simple Linear Regression: a model in which a single regressor (x) has a linear relationship with a response variable (y).
We all (who have an idea about regression) know the linear regression equation:

Y = β(0) + β(1)*x + ε

For a given x, the corresponding observation Y consists of the value β(0) + β(1)*x plus the random error ε.
We need to make some assumptions about the model.

  • ε(i) is a random variable with mean zero & variance σ^2 [σ^2 is unknown]
    • i.e. E(ε(i)) = 0 & V(ε(i)) = σ^2 {E denotes expectation}
  • ε(i) and ε(j) are uncorrelated for i ≠ j, so cov(ε(i), ε(j)) = 0. (Under the normality assumption below, uncorrelated errors are also independent.)
  • ε(i) is a normally distributed random variable, with mean zero and variance σ^2 
    • ε(i) ~ N(0, σ^2) {N denotes the normal distribution, here with mean zero and variance σ^2}
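These assumptions can be made concrete with a small simulation. Everything below (the choices β(0) = 2, β(1) = 0.5, σ = 1, and the x values) is an illustrative sketch, not part of the article's derivation:

```python
import numpy as np

# Simulate Y = beta0 + beta1*x + eps with eps ~ N(0, sigma^2).
# All parameter values here are illustrative choices.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0

x = np.linspace(0.0, 10.0, 200)
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)  # mean 0, variance sigma^2
y = beta0 + beta1 * x + eps

# The sample mean of the errors should be near 0 and their variance near sigma^2.
print(eps.mean(), eps.var())
```

With the seed fixed, the printed values land close to 0 and σ² = 1, matching assumptions 1 and 3.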

Assumptions in terms of Y:- I am not going to write out the equations in detail here; simply restate the assumptions above in terms of Y(i) instead of ε(i).
The mean becomes E(Y(i)) = β(0) + β(1)*x(i), and the variance stays the same, equal to σ^2.
Next, we discuss the least squares estimation of these parameters.
Least Squares Estimation (LSE):- 

  • The parameters β(0), β(1) are unknown and must be estimated using sample data:

(x1, y1), (x2, y2), …, (xn, yn)

  • The line fitted by least squares is the one that makes the sum of squares of all the vertical discrepancies as small as possible. 

  • We estimate β(0), β(1) so that the sum of the squares of all the differences between the observations Y(i) and the fitted line is minimized; this minimized sum is SS(Res).
  • The least square estimators of β(0) & β(1), (β_0ˆ, β_1ˆ), must satisfy the following two equations. Equations 1 and 2 are called the normal equations, and they have a unique solution (as long as the x(i) are not all equal).

So, the estimators β_0ˆ, β_1ˆ are the solution of the equations:

∑(Y(i) − β_0ˆ − β_1ˆ*x(i)) = 0 ——(1),

∑(Y(i) − β_0ˆ − β_1ˆ*x(i))*x(i) = 0 ——(2)

Solving these two normal equations (I can't typeset the full derivation here, so I am giving the result directly) yields:

β_1ˆ = S(xy)/S(xx) = [∑(x(i) − mean(x))*(y(i) − mean(y))]/[∑(x(i) − mean(x))^2]

β_0ˆ = mean(y) − β_1ˆ*mean(x)
Above, we have calculated the parameters using the least squares estimator.
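As a quick numerical check of β_1ˆ = S(xy)/S(xx) and β_0ˆ = mean(y) − β_1ˆ*mean(x), the sketch below compares them against NumPy's built-in degree-1 fit; the data points are made up for illustration:

```python
import numpy as np

# Made-up sample data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.1])

# Closed-form least squares estimates derived from the normal equations.
S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
beta1_hat = S_xy / S_xx
beta0_hat = y.mean() - beta1_hat * x.mean()

# Cross-check against NumPy's own least-squares polynomial fit.
slope, intercept = np.polyfit(x, y, 1)
```

Both routes give the same slope and intercept, since `np.polyfit` with degree 1 solves exactly this least-squares problem.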
We have not discussed the benefits of using LSE yet; in one line, the most important one is that, under the assumptions above, the least squares estimators are the best linear unbiased estimators of the parameters (this is the Gauss–Markov theorem).
Properties of Least Square Estimator:-

  • The sum of the residuals in any regression model that contains an intercept β_0 is always 0:

∑e(i) = ∑(y(i) − y_iˆ) = 0

  • ∑y(i) = ∑(y_iˆ), i.e. the observed and fitted values always have the same sum; this follows directly from the residuals summing to zero.
  • ∑x(i)*e(i) = 0, i.e. the residuals are orthogonal to the regressor
  • ∑y_iˆ*e(i) = 0, i.e. the residuals are orthogonal to the fitted values
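These residual properties are easy to verify numerically; the data below are made up, and the fit uses the closed-form least squares estimates:

```python
import numpy as np

# Made-up data; fit a line with an intercept, then check the residual properties:
# sum(e) = 0, sum(x*e) = 0, and sum(y_hat*e) = 0.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.8, 3.1, 2.9, 4.2, 5.0])

beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x  # fitted values
e = y - y_hat                      # residuals

# All three sums should be zero up to floating-point rounding.
print(e.sum(), (x * e).sum(), (y_hat * e).sum())
```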

Statistical properties of LS estimation:- 

  • Both β_0ˆ and β_1ˆ are unbiased estimators of β_0 and β_1 respectively, which means E(β_0ˆ) = β_0 and E(β_1ˆ) = β_1: they equal the true parameters on average, not in every sample.
  • β_0ˆ and β_1ˆ are linear combinations of the observations y(i)

β_1ˆ = [∑(x(i) − mean(x))*(y(i) − mean(y))]/[∑(x(i) − mean(x))^2]

     = [∑(x(i) − mean(x))*y(i)]/[∑(x(i) − mean(x))^2]

β_1ˆ = ∑c(i)*y(i),   where c(i) = (x(i) − mean(x))/S(xx)

Similarly, we can do the same for β_0ˆ:

β_0ˆ = mean(y)−β_1ˆ*mean(x)

  = (1/n)∑y(i) − β_1ˆ*mean(x)

Take the value of β_1ˆ from the equation above.
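The claim that β_1ˆ is a linear combination of the y(i) can be checked directly: with weights c(i) = (x(i) − mean(x))/S(xx), the weighted sum ∑c(i)*y(i) reproduces the usual estimate. The data below are made up for illustration:

```python
import numpy as np

# Made-up data for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.5, 2.4, 3.1, 4.0, 4.6])

S_xx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / S_xx  # fixed weights that depend only on x

# Usual formula vs. the linear-combination form beta1_hat = sum(c_i * y_i).
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / S_xx
beta1_lin = np.sum(c * y)
```

Because the weights c(i) depend only on the x values, this form is what makes the distributional results for β_1ˆ (unbiasedness, variance σ²/S(xx)) straightforward to derive.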
Note:- I am not going to prove these results here; if you need the proofs, please message me and I will personally send my documents.
Similarly, we can calculate the variances of β_0ˆ and β_1ˆ:

V(β_1ˆ) = σ^2/S(xx)             :where S(xx) = ∑[x(i) − mean(x)]^2

V(β_0ˆ) = σ^2[(1/n) + mean(x)^2/S(xx)]

NOTE:- MS(res) = SS(res)/(n − 2) is an unbiased estimator of σ^2.
Estimation of σ^2:- it is obtained from the residual sum of squares:

SS(res) = S(yy) − β_1ˆ*S(xy)   [equivalently S(yy) − (β_1ˆ)^2*S(xx)]

σˆ^2 = MS(res) = SS(res)/(n − 2)
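A sketch of the σ^2 estimate on made-up data, checking the shortcut SS(res) = S(yy) − β_1ˆ*S(xy) against the residual sum of squares computed directly:

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 2.7, 3.8, 4.1, 5.2, 5.8])
n = x.size

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_yy = np.sum((y - y.mean()) ** 2)

beta1_hat = S_xy / S_xx
beta0_hat = y.mean() - beta1_hat * x.mean()

SS_res = S_yy - beta1_hat * S_xy                              # shortcut formula
SS_res_direct = np.sum((y - beta0_hat - beta1_hat * x) ** 2)  # direct residual SS
sigma2_hat = SS_res / (n - 2)                                 # MS_res, unbiased for sigma^2
```

The two routes to SS(res) agree, and dividing by n − 2 (not n) accounts for the two estimated parameters.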

We now have the coefficients and the sums of squares (residual, regression and total); using these we can test the null hypothesis about the slope, based on the t and z tests.
The t and z test statistics are calculated with the formulas below.
Usually the variance σ² is unknown; when σ² is unknown we use the t-test:
t = (β_1ˆ − β_1)/√[MS(res)/S(xx)]
if |t| > t[(α/2), (n−2)] we reject the null hypothesis. 
Here |t| is the calculated value and t[(α/2), (n−2)] is the tabulated value.
And when σ² is known, we use the z-test:
z = (β_1ˆ − β_1)/√[σ²/S(xx)], which follows the standard normal distribution N(0, 1)
if |z| > z(α/2) we reject the null hypothesis. 
Here |z| is the calculated value and z(α/2) is the tabulated value.
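A sketch of the t-statistic for testing H0: β(1) = 0 on made-up data. As an internal consistency check it uses the identity t² = SS(reg)/MS(res), which is the one-regressor ANOVA F statistic:

```python
import numpy as np

# Made-up, nearly linear data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([1.2, 2.1, 2.8, 4.2, 4.9, 6.1, 6.8, 8.2])
n = x.size

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_yy = np.sum((y - y.mean()) ** 2)

beta1_hat = S_xy / S_xx
SS_res = S_yy - beta1_hat * S_xy
MS_res = SS_res / (n - 2)

t = beta1_hat / np.sqrt(MS_res / S_xx)  # t statistic under H0: beta1 = 0
F = (beta1_hat * S_xy) / MS_res         # SS_reg / MS_res, 1 numerator df
```

For an actual decision you would compare |t| against the tabulated t[(α/2), (n−2)] critical value, e.g. from statistical tables or `scipy.stats.t.ppf`.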
Now we have all the values needed to build the ANOVA table, which I will describe in the next article, so stay tuned.
For queries or related docs/notes, please shoot me an email on

Part 2 Click Here
