When you start your journey in Machine Learning and Data Science, Linear Regression is likely to be the first algorithm you come across. As the name suggests, it addresses regression problems (predicting a numeric value), and the relationship it models is assumed to be linear. In this article, I will take you through applying Linear Regression with Python's sklearn library.
Without wasting any more time, let's dive in!
(If you are still confused between Regression and Classification tasks and need StepUp Analytics to do an article on it, just give us the orders 😉 We’ll be happy to help you out!).
Table Of Contents
- Introduction to Linear Regression
- The main purpose and its use case (problem domains)
- Basic hypothesis/mathematics behind the design of the algorithm
- Application of Linear Regression on a dataset via Python’s sklearn library
Introduction To Linear Regression
Linear regression is an approach for modeling a linear relationship between a scalar output variable (the dependent variable) and one or more input variables (the independent variables). In some terminologies, the dependent variable is also referred to as the Target Variable and the independent variables as the Predictor variables.
The number of input variables gives the approach its name: when the output is computed from a single input variable, it is known as Simple Linear Regression, and when two or more input variables account for the output, it is known as Multiple Linear Regression. The latter follows the same approach as Simple Linear Regression, so it makes sense to start with the simpler case.
In this article, we will cover Simple Linear Regression to get the gist of what is actually happening behind the name.
Main Purpose and Use Case of Simple Linear Regression
As you may already know, a supervised Machine Learning task is either classification (predicting a class/label) or regression (predicting a real-valued quantity). So, whenever you come across a regression task and see a roughly linear pattern between the input and output variables, a Linear Regression model is a natural first choice for your problem statement.
The Core Purpose In Linear Regression Is To Obtain A Line That Best Fits The Data.
Visualize the data as a graph like the one shown alongside, with the input variable on one axis and the output variable on the other. Each point (in blue) in the graph is known as a data point, and the fitted line (in red) is called the Regression Line.
So the task of the algorithm is to find the best-fit line, the one to which the actual values lie as close as possible. In other words, the total prediction error should be as small as possible, where the error for each data point is the vertical distance between that point and the regression line.
Typical problem statements covered by this model include:
- Predicting the weight of a person from their height (Simple Linear Regression)
- Predicting the price of a house on the basis of its age, number of bedrooms, square ft area, etc. (Multiple Linear Regression)
- Predicting the height of a child on the basis of the father's height (Simple Linear Regression)
- Predicting the market stock prices of a company based on its past performance, and so on
Keep in mind that the model works on numbers, so categorical input variables cannot be fed into it directly. If you are dealing with any, convert them into numerical form (for example, with one-hot encoding) before feeding them into the model.
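To make the conversion of categorical inputs concrete, here is a minimal sketch of one-hot encoding in plain Python. The function name `one_hot_encode` and the city values are hypothetical, chosen only for illustration.

```python
def one_hot_encode(values):
    """Turn a list of categorical values into 0/1 indicator columns."""
    # Sort the distinct categories so the column order is reproducible.
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical categorical feature: the city a house is located in.
cities = ["Delhi", "Mumbai", "Delhi", "Pune"]
categories, encoded = one_hot_encode(cities)

print(categories)  # one column per distinct city
print(encoded)     # one row of 0/1 indicators per house
```

In practice you would more likely reach for `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder`, which perform the same kind of transformation.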
Mathematical Idea Behind Linear Regression
Since the beginning of this article, we have been using the word "Linear", which implies a straight line. With a little thought, the basic idea behind Linear Regression follows naturally: it must be built on the equation of a straight line, right? Let us see.
Y = b0 + b1X1 + b2X2 + b3X3 + ….
The above equation is used for Multiple Linear Regression problems. For Simple Linear Regression, it reduces to
Y = b0 + b1X1
where
Y = output/target variable/response
X1 = input (predictor) variable
b0 = bias coefficient (the intercept, which lets the line shift up or down)
b1 = coefficient/parameter related to X1
b = coefficients/parameters learned and updated while training/fitting the model
If you compare this with the equation of a straight line, y = mx + c, you can see the similarity between the two: b1 plays the role of the slope m and b0 the role of the intercept c.
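The hypothesis above translates almost directly into code. The following is a small sketch, not a trained model: the function names and the coefficient values are made up purely to show how a prediction is computed once b0 and b1 are known.

```python
def predict_simple(x, b0, b1):
    """Simple Linear Regression hypothesis: Y = b0 + b1 * X1."""
    return b0 + b1 * x


def predict_multiple(xs, b0, coefficients):
    """Multiple Linear Regression hypothesis: Y = b0 + b1*X1 + b2*X2 + ..."""
    return b0 + sum(b * x for b, x in zip(coefficients, xs))


# Hypothetical coefficients: weight (kg) predicted from height (cm).
weight = predict_simple(170, b0=-100.0, b1=1.0)
print(weight)

# Hypothetical two-feature example.
value = predict_multiple([2.0, 3.0], b0=1.0, coefficients=[0.5, 2.0])
print(value)
```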
Until now, we have discussed the hypothesis used in linear regression tasks, but once a regression line is fitted and an output is produced, we also need to evaluate it by comparing the actual outcomes with the outputs the model predicted. This is where evaluation and optimization techniques come in.
Out of the many techniques currently in use, "least squares" is the easiest for beginners and is commonly used with Simple Linear Regression. For each data point, it computes the distance between the actual value (plotted on the graph) and the value predicted by the model (i.e. the value on the regression line) and squares it. The aim of the model is then to make the total of these squared distances as small as it can. There are other optimization algorithms, such as gradient descent, which will be covered explicitly in further articles.
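To see the least squares idea in action, the sketch below scores two candidate lines on a tiny made-up dataset. The data values and the function name `sum_squared_errors` are assumptions for illustration only; the line with the smaller total is the better fit.

```python
def sum_squared_errors(xs, ys, b0, b1):
    """Total of squared vertical distances between the points and the line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up data that roughly follows y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]

error_a = sum_squared_errors(xs, ys, b0=1, b1=2)  # candidate line A
error_b = sum_squared_errors(xs, ys, b0=0, b1=2)  # candidate line B

print(error_a, error_b)
print("Line A fits better" if error_a < error_b else "Line B fits better")
```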
Mathematically, the least squares cost is defined in terms of the following quantities:
n = number of instances (or number of examples)
Pred_i = predicted value for the i-th instance
Y_i = actual value for the i-th instance
J = cost function (least squares in this case)
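Written out with the symbols defined above, one common form of the least squares cost is shown here. This assumes the squared errors are averaged over the n instances; some texts use the plain sum or divide by 2n instead, which rescales J but does not change which line fits best.

```latex
J = \frac{1}{n} \sum_{i=1}^{n} \left( \mathrm{Pred}_i - Y_i \right)^2
```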
Deploying The Linear Regression Model using Python’s SKLEARN
The code to the notebook can be accessed via
Sklearn (scikit-learn) is a Python library that implements most of the standard machine learning models. It also ships with some classic datasets that you can play around with.
Sklearn makes implementing a machine learning model much easier; however, I'd suggest you also try to code the model from scratch. That way you get a clearer picture of how it works.
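As an illustration of how a linear model is typically fitted with sklearn, here is a minimal sketch. It is not the notebook referred to above: the height/weight data is synthetic, and the coefficients used to generate it (-100 and 1.0) are assumptions chosen only so the result is easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption): weight = -100 + 1.0 * height + noise.
rng = np.random.default_rng(seed=0)
X = rng.uniform(150, 200, size=(100, 1))  # heights in cm, one feature column
y = -100.0 + 1.0 * X[:, 0] + rng.normal(0, 2.0, size=100)

# Hold out 20% of the data to evaluate the fitted line.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)      # learns b0 and b1 by least squares
y_pred = model.predict(X_test)

print("b0 (intercept):", model.intercept_)
print("b1 (slope):", model.coef_[0])
print("Mean squared error on test data:", mean_squared_error(y_test, y_pred))
```

Note that sklearn expects the inputs as a 2D array with one column per feature, which is why X has shape (100, 1) even though there is only one input variable.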
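If you want to try the from-scratch route, one possible starting point is the closed-form least squares solution for Simple Linear Regression: b1 is the covariance of X and Y divided by the variance of X, and b0 follows from the means. The sketch below uses only plain Python; the function name and the toy data are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) minimizing the sum of squared errors for Y = b0 + b1*X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Numerator and denominator of the least squares slope.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("All x values are identical; the slope is undefined.")

    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Toy data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)
```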
To get a detailed understanding of what actually is happening in the code, feel free to reach out to us in the comments below!
In this article, you read about what a Linear Model actually is and how it is divided into Simple and Multiple Linear Regression depending on the number of input variables involved.
We also looked at the problem domains where this algorithm might stand out, and I tried to give you a basic understanding of the mathematics behind the name Linear Regression. We then saw why optimization algorithms are needed and went through the Least Squares Method. Finally, the code implementation was covered using the SKLEARN library. (Feel free to check out the Linear Regression Code from Sklearn).
Remember that Linear Regression is one of the most basic algorithms and may not perform well on every dataset. Look for a linear relationship between the inputs and outputs, and move ahead with this model only if you think one exists.
At the end of the day, it all comes down to the accuracy of the model, so if you think Linear Regression isn't performing well enough on your dataset, remember that it's just the beginning and there are a lot of models still waiting to be discovered by you! Stay tuned with us and get to know them all.