# Beginner to Advanced Level: Steps to Build a Regression Model

**Part 1**

You have probably heard about regression models many times, but you may not have seen the technique of building a regression model worked out step by step.

First, we will talk about **Simple Linear Regression**: a model in which a single regressor (x) has a linear relationship with a response variable (y).

All of us (who have some idea about regression) know the linear regression equation:

**Y = β(0) + β(1)*x + ε**

For a given x, the corresponding observation Y consists of the value **β(0) + β(1)*x** plus a random error **ε**.

We need to make some assumptions about the model.

**Assumptions:-**

- **ε(i)** is a random variable with **mean** zero and **variance σ^2** (σ^2 is unknown), i.e. **E(ε(i)) = 0 & V(ε(i)) = σ^2** {E denotes expectation}.
- **ε(i)** and **ε(j)** are uncorrelated for **i ≠ j**, so **cov(ε(i), ε(j)) = 0**. (Because the errors are also assumed normal, uncorrelated here implies independent.)
- **ε(i)** is a normally distributed random variable with mean zero and variance **σ^2**, i.e. **ε(i) ~ N(0, σ^2)** {N indicates the normal distribution}.

**Assumptions in terms of Y:-** Here I am not going to write out the equations in detail; the same assumptions hold with **ε(i)** replaced by **Y(i)**. The mean becomes **E(Y(i)) = β(0) + β(1)*x(i)**, and the variance stays the same and equal to **σ^2**.
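The model under these assumptions can be simulated in a few lines. This is a minimal sketch using NumPy; the parameter values (beta0, beta1, sigma) are hypothetical, chosen only for illustration and not taken from the article.

```python
import numpy as np

# A minimal simulation of the model under the assumptions above.
rng = np.random.default_rng(0)

beta0, beta1, sigma = 2.0, 0.5, 1.0   # assumed true values (illustrative only)
n = 200

x = rng.uniform(0.0, 10.0, size=n)
eps = rng.normal(0.0, sigma, size=n)  # epsilon(i) ~ N(0, sigma^2), mean zero
y = beta0 + beta1 * x + eps           # E(Y(i)) = beta0 + beta1*x(i), V(Y(i)) = sigma^2
```

Each Y(i) is the straight-line value plus an independent normal error, which is exactly what the assumptions state.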

Next, we discuss the **Least Squares Estimation of the Parameters**.

**Least Squares Estimation(LSE):- **

- The parameters **β(0), β(1)** are unknown and must be estimated using sample data:

(x1,y1), (x2,y2), – – – – , (xn,yn)

- The line fitted by **LSE** is the one that makes the sum of squares of all **vertical discrepancies** as **small as possible**.

- We estimate **β(0), β(1)** so that the sum of the squared differences between the observations Y(i) and the fitted line is **minimum**; this sum is called **SS(Res)**.

- The least squares estimators of **β(0) & β(1)**, written **(β_0ˆ, β_1ˆ)**, must satisfy the following two equations. Equations 1 and 2 are called the normal equations, and they have a unique solution.

So, the estimators **β_0ˆ, β_1ˆ** are the solution of the equations

**∑(Y(i) − β_0ˆ − β_1ˆ*x(i)) = 0** —–(1),

**∑(Y(i) − β_0ˆ − β_1ˆ*x(i))*x(i) = 0** —-(2)

Solving these two equations gives the least squares estimates:

**β_1ˆ = S(xy)/S(xx)**, where **S(xy) = ∑(x(i) − mean(x))*(y(i) − mean(y))** and **S(xx) = ∑(x(i) − mean(x))^2**

**β_0ˆ = mean(y) − β_1ˆ*mean(x)**

**Above we have calculated the parameters using the least squares estimator.**
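The closed-form estimates can be computed directly from the formulas. This is a minimal sketch using NumPy; the function name `fit_simple_ols` and the small data set are illustrative, not from the article.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates for simple linear regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sxx = np.sum((x - x.mean()) ** 2)               # S(xx)
    sxy = np.sum((x - x.mean()) * (y - y.mean()))   # S(xy)
    b1 = sxy / sxx                                  # slope beta_1_hat = S(xy)/S(xx)
    b0 = y.mean() - b1 * x.mean()                   # intercept beta_0_hat
    return b0, b1

# Tiny illustrative data set
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
b0, b1 = fit_simple_ols(x, y)   # b1 = 1.95, b0 = 0.15
```

The fitted line is then ŷ = b0 + b1*x, and the vertical discrepancies y(i) − ŷ(i) have the smallest possible sum of squares.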

We have not discussed the benefits of using **LSE**; in one line, the most important benefit is that, by the Gauss–Markov theorem, the least squares estimators have the **smallest variance among all linear unbiased estimators**.

**Properties of Least Square Estimator:-**

- The sum of residuals in any regression model that contains an intercept β_0 is always 0:

**∑e(i) = ∑(y(i) − y_iˆ) = 0**

- **∑y(i) = ∑(y_iˆ)**, which means the sums of the observations and of the fitted values are equal (for a perfect regression line the two graphs lie on each other).
- **∑x(i)*e(i) = 0**
- **∑y_iˆ*e(i) = 0**
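These residual properties can be checked numerically. A minimal sketch using NumPy on simulated data (the parameter values are illustrative):

```python
import numpy as np

# Check the residual properties of an intercept model:
# sum(e) = 0, sum(x*e) = 0, sum(yhat*e) = 0 (up to floating-point rounding).
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 50)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

yhat = b0 + b1 * x   # fitted values
e = y - yhat         # residuals

# All three sums are zero except for rounding error.
sums = (e.sum(), (x * e).sum(), (yhat * e).sum())
```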

**Statistical properties of LS estimation:- **

- Both (β_0)ˆ and (β_1)ˆ are unbiased estimators of (β_0) and (β_1) respectively, i.e. **E(β_0ˆ) = β_0** and **E(β_1ˆ) = β_1**: on average, the estimates equal the true values.
- (β_0)ˆ and (β_1)ˆ are linear combinations of the observations y(i):

**β_1ˆ = [∑(x(i)−mean(x))*(y(i)−mean(y))]/∑(x(i)−mean(x))^2**

**= [∑(x(i)−mean(x))*y(i)]/∑(x(i)−mean(x))^2**

**β_1ˆ = ∑c(i)*y(i)**, where **c(i) = (x(i)−mean(x))/S(xx)**

Similarly, we can do the same for **β_0ˆ**:

**β_0ˆ = mean(y) − β_1ˆ*mean(x)**

**= (1/n)∑y(i) − β_1ˆ*mean(x)**

Substituting the value of β_1ˆ from the equation above shows that β_0ˆ is also a linear combination of the y(i).
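The linear-combination property can be verified directly: computing β_1ˆ as ∑c(i)*y(i) gives the same value as the ratio formula. A minimal sketch using NumPy (simulated data, illustrative parameters):

```python
import numpy as np

# Verify that beta_1_hat is a linear combination of the y(i):
# beta_1_hat = sum(c_i * y_i) with c_i = (x_i - mean(x)) / S(xx).
rng = np.random.default_rng(4)
x = rng.uniform(0.0, 5.0, 20)
y = 3.0 - 1.2 * x + rng.normal(0.0, 0.5, 20)

sxx = np.sum((x - x.mean()) ** 2)
c = (x - x.mean()) / sxx                 # weights depend on x only

b1_ratio = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b1_linear = np.sum(c * y)                # same value, written as sum(c_i * y_i)
```

The two agree because the weights sum to zero (∑c(i) = 0), so the mean(y) term drops out.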

**Note:- I am not going to prove this here; if you need the proof, please message me at irrfankhann29@gmail.com and I will personally send you my documents.**

Similarly, we can calculate the **variance** of **β_0ˆ and β_1ˆ**:

**v(β_1ˆ) = σ^2/S(xx)**, where **S(xx) = ∑(x(i) − mean(x))^2**

**v(β_0ˆ) = σ^2[(1/n) + {mean(x)^2}/S(xx)]**

**NOTE:- β_0ˆ & β_1ˆ are unbiased estimators of β_0 & β_1; σ^2 itself needs a separate unbiased estimator, obtained below.**

**Estimation of σ^2:-** It is obtained from the residual sum of squares:

**SS(res) = S(yy) − (β_1ˆ)^2*S(xx)**, where **S(yy) = ∑(y(i) − mean(y))^2**

**σˆ^2 = MS(res) = SS(res)/(n − 2)**
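The estimate MS(res) can be computed from the sums of squares above. A minimal sketch using NumPy on simulated data with a known error variance (sigma_true = 2, so sigma^2 = 4; all values are illustrative):

```python
import numpy as np

# Estimate sigma^2 via MS(res) = SS(res) / (n - 2),
# with SS(res) = S(yy) - beta_1_hat^2 * S(xx).
rng = np.random.default_rng(2)
n = 500
sigma_true = 2.0
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, sigma_true, n)

sxx = np.sum((x - x.mean()) ** 2)
sxy = np.sum((x - x.mean()) * (y - y.mean()))
syy = np.sum((y - y.mean()) ** 2)
b1 = sxy / sxx

ss_res = syy - b1 ** 2 * sxx
ms_res = ss_res / (n - 2)   # estimate of sigma^2; close to sigma_true**2 = 4.0
```

Dividing by n − 2 (not n) is what makes the estimator unbiased: two degrees of freedom are used up estimating β_0 and β_1.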

We now have the values of the coefficients and the sums of squares (residual, regression and total); using these we can test the **null hypothesis**, which is based on the **t** and **z tests**.

The t and z test statistics can be calculated using the formulas below.

Usually the variance (σ²) is unknown; when σ² is unknown, we use the t-test:

**t = (β_1ˆ − β_1)/√(MS(res)/S(xx))**

If |t| > t[(α/2), (n−2)], we reject the null hypothesis.

∴ |t| is the calculated value and t[(α/2), (n−2)] is the tabulated value.

And when (σ²) is known, we use the z-test:

**z = (β_1ˆ − β_1)/√(σ²/S(xx))**, which follows the standard normal distribution N(0, 1).

If |z| > z(α/2), we reject the null hypothesis.

∴ |z| is the calculated value and z(α/2) is the tabulated value.
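The t-test on the slope can be carried out end to end with the quantities derived above. A minimal sketch using NumPy on simulated data; the parameters and the hardcoded tabulated value t(0.025, 28) ≈ 2.048 are illustrative:

```python
import math
import numpy as np

# t-test of H0: beta_1 = 0 for simple linear regression.
rng = np.random.default_rng(3)
n = 30
x = rng.uniform(0.0, 10.0, n)
y = 1.0 + 0.8 * x + rng.normal(0.0, 1.0, n)   # true slope 0.8 (illustrative)

sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()

ss_res = np.sum((y - b0 - b1 * x) ** 2)
ms_res = ss_res / (n - 2)                      # MS(res), estimate of sigma^2

t_stat = (b1 - 0.0) / math.sqrt(ms_res / sxx)  # calculated value
t_crit = 2.048                                 # tabulated t(alpha/2 = 0.025, n - 2 = 28)

reject = abs(t_stat) > t_crit                  # reject H0 when |t| exceeds the tabulated value
```

Since the simulated slope is far from zero relative to its standard error, the test rejects H0 here.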

Now we have all the values needed to build the ANOVA table, which will be described in the next article, so stay tuned.

For queries or docs/notes, please shoot me an email at khanirfan.khan21@gmail.com.