# Gradient Boosting

Before going to the depth of gradient boosting let me tell a funny story. Once upon a time, there was a famous engineering college in India. There was an angry teacher named Viru Sahastrabuddhe. One day, Viru sir had sent a message to his four favorite students named Chatur, Rancho, Raju and Farhan to prepare a script for the lecture for 5th September. Chatur thought that he himself would prepare the script for teacher’s day program and he wouldn’t take any help from three others and then he would become the most favorite student of viru sir. He didn’t know Hindi properly. As a result, he did some serious mistakes in words without realizing the importance of mistyped words. As a result on that day he delivered the speech and spoiled the respect of the college. If they work together that problem didn’t take place and then the future of college could be more glorious. Then Chatur realized “United we stand divided we fall”. That was the story but that incident can take place in Data Science also. In data science, we try fit a model to the data. There is obviously some error. Then we can again try to improvise that model to minimize the error. Thus we can reach to the best model by minimizing the error of each model, sequentially. In this way we can generate the key concept of gradient boosting.

**What is ensemble modeling****?**

Ensemble means a group of items viewed as a whole rather than individually. So, ensemble modeling means combining a group of models. When we are creating a model we want to reduce error. Now, an error contains bias, variance and irreducible error. Now we can minimize the error by minimizing bias and variance simultaneously. With the increase of model complexity, bias can be reduced but variance decreases up to a certain point and after that if we increase the model complexity the variance increases.

So, we have to choose a model where we can have minimum bias and variance which increase the prediction power. In machine learning ensemble learning (modeling) is the way to minimize the bias and variance simultaneously. Here we combine different models in a specific manner and can make a model which have more prediction power and less error than the previous models.

There are some commonly used techniques for ensemble modeling such as * bagging*,

*.*

**boosting****Bagging (reduce variance)****:**

Bagging stands for Bootstrap Aggregating. With this technique we try to reduce the variance of the estimate. Here we generate additional data (by bootstrap sampling) from the training data sets which results in small variance as if we increase the size it can reduce the variance. For aggregating the outputs bagging uses voting for classification and averaging for regression.

**Boosting (improve predictions)****:**

Boosting is an iterative technique which allows us to covert weak learner to strong learner. It is a two-step technique. First, we use the subsets of the data and make some models. After that we boost the performance by combining them using a cost function. Unlike bagging, in boosting the subsets are not random rather than it depends upon the previous model performance.

Boosting is one of the most promising methods in data analysis. The hypothesis boosting problem is first introduced to turn the weak learner to a strong learner and the algorithm used in that problem known as boosting algorithm. Any weak learner can be iteratively improved to become a strong learner and to provide the evidence for this **Schapire** and **Freund** first developed the boosting algorithms. Boosting does not manipulate the weak (base) learner itself, but manipulating the underlying training data by iteratively re-weighting the observations, it improves the performance of the model. We get a learner in each step of iterations and take it as base learner for the next step. Finally we combine all the learners with iteration specific coefficients and get a strong learner which is more accurate than the weak learner. So boosting is a sequential ensemble modeling.

There are many types of boosting. I would like to explain different types of boosting like AdaBoost, Gradient Boosting and Extreme Gradient Boosting (most popular).

# AdaBoost:

Adaboost refers Adaptive Boosting algorithm. It is mostly used with simple classification problem (decision tree with one level). It works on improving the areas where the base learner fails. Any machine learning algorithms that accept weights on training data can be used as a base learner. AdaBoost constructs a strong model sequentially (stage-wise) by combining multiple weak (base) learners. Each model tries to correct the errors of previous model. We get better precision by AdaBoost compared to random chance in simple classification.

**Gradient Boosting****:**

Gradient boosting is a ML technique for both regression and classification problems. The idea originated by Leo Breiman that boosting can be interpreted as an optimization algorithm on a suitable cost function. Then regression gradient boosting algorithms were developed by J. H. Friedman. Like other boosting methods, gradient boosting also develops a strong learner by combining weak learners in iterative way.

Let us start with a general example:

We have a model M as, with model accuracy 79%.

Now we see that there a correlation of this error with Y. So, what we do here is build a model for this error 1 based on our observations x. Suppose, we got the model as, with model accuracy 84%.

Now if I want more accuracy, then proceeding this way we may get a model with accuracy 90%.

Now combining the models we have,

Now we can see that assigning some weights at each stage we get model,

With the above idea of modeling we can get best model but we have to keep in mind there will be an over-fitting after some step and we have to stop at that step.

**Algorithm:**

Generally, the optimization problem for estimating the regression function f(·) of a statistical model, relating the predictor variables X with the outcome Y , can be expressed as,

ρ(·) denotes a loss function. If the response variable is binary, i.e. y ∈ {0, 1}, we can consider the binomial loss function. If the response variable is continuous, i.e., y ∈ R, we can use squared loss function. In practice, with a learning sample of observations (y1, x1), …,(yn, xn) we minimize the empirical risk,

The fundamental idea of gradient boosting is to fit the base-learner not to re-weighted observations, as in AdaBoost, but to the negative gradient vector u(m) of the loss function ρ(y,) evaluated at the previous iteration m − 1,

Now for mth iteration the base-learner is the error of the previous iteration i.e. y-(m-1)(x). We fit the negative gradient u(m) separately with the base-learner at every mth iteration. Let the best fitted model is . Now we update the model as,

where is the mth iteration specific coefficient (weight). So, this is the algorithm of gradient boosting.

**Visualization of Gradient Boosting**:

Now let’s visualize what is happening in gradient boosting. Consider the following plots:

Here in the 1^{st} iteration we fit a simple model which gives us many different prediction values compared to actual values. So using the loss function we are trying to reduce the error such that the model gives the values close to the actual values. We perform the 2^{nd} iteration and try to model the errors such that the combined model gives us better values instead of poor predictions in the 1^{st} iteration. Proceeding similar way, in the 3^{rd} iteration we can notice our model predictions are approaching towards the actual values and the residuals are moving towards zero.

Now proceeding this way, we can see after 20^{th} iteration the new ensemble model gives predictions which are very close to the actual values and the residuals are near about zero.

Now you may think if we proceed further then it may give more good predictions. But no, as I said before after a certain time the model start over-fitting.

See the results from iteration 50. It looks like very good fitting but this will lead us to over-fitting. So, after 20^{th} iteration the predictions are close to the actual values and the errors are near about zero. We should stop and combining the weak-learners which finally give a strong-learner.

So, this is the visualization of gradient boosting.

**Extreme gradient boosting****:**

The most popular and frequently used boosting method is extreme gradient boosting. It is nothing but an improvement over gradient boosting. Its algorithm is as same as the normal gradient boosting but it is a more regularized model to control the over-fitting and present it as a prediction model with a higher accuracy. This feature makes extreme gradient boosting most popular in data science.

Gradient boosting divides the optimization problem into two parts by first determining the direction of the step and then optimizing the step length. But extreme gradient boosting determines the step directly by solving for each x.

In gradient boosting, the weight is the average of the gradients; while in extreme gradient boosting the weight is the sum of gradients scaled by the sum of hessian

Why extreme gradient boosting (XGBoost) is so popular rather than gradient boosting model (GBM)? Its execution is faster than gradient boosting. The main part is the regularization which doesn’t have in GBM. This enables the model to avoid over-fitting.

Now we have to know how we can perform gradient boosting in R studio. Taking an inbuilt dataset in R named “** OrchardSprays**”, we want to predict the variable “

**” depending on other variables. The R code is as follows:**

*decrease*
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
library(caTools) library(xgboost) library(Matrix) df <- OrchardSprays str(df) set.seed(65265) split <- sample.split(rownames(df),SplitRatio=3/4) train <- df[split==T , ]; str(train) test <- df[split == F , ]; str(test) sparse.train <- sparse.model.matrix(decrease~.-1,data = train) sparse.test <- sparse.model.matrix(decrease~.-1,data = test) model <- xgboost(data= sparse.train, label = train$decrease, nround = 83) a <- predict(model,sparse.test) par(mfrow=c(1,2)); plot(test$decrease,type="l",col="black",main="Absolute vs Predicted "); lines(a,type="l",col="green");plot(abs(train$decrease-a),type="p") |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 |
> library(caTools) > library(xgboost) > library(Matrix) > df <- OrchardSprays > str(df) 'data.frame': 64 obs. of 4 variables: $ decrease : num 57 95 8 69 92 90 15 2 84 6 ... $ rowpos : num 1 2 3 4 5 6 7 8 1 2 ... $ colpos : num 1 1 1 1 1 1 1 1 2 2 ... $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 1 3 2 ... > set.seed(65265) > split <- sample.split(rownames(df),SplitRatio=3/4) > train <- df[split==T , ];str(train) 'data.frame': 48 obs. of 4 variables: $ decrease : num 57 95 8 69 92 90 15 84 6 36 ... $ rowpos : num 1 2 3 4 5 6 7 1 2 4 ... $ colpos : num 1 1 1 1 1 1 1 2 2 2 ... $ treatment: Factor w/ 8 levels "A","B","C","D",..: 4 5 2 8 7 6 3 3 2 4 ... > test <- df[split == F , ];str(test) 'data.frame': 16 obs. of 4 variables: $ decrease : num 2 127 5 72 9 60 71 27 76 13 ... $ rowpos : num 8 3 3 7 4 3 8 6 8 3 ... $ colpos : num 1 2 3 3 4 5 5 6 6 7 ... $ treatment: Factor w/ 8 levels "A","B","C","D",..: 1 8 1 7 3 7 6 4 8 3 ... > sparse.train <- sparse.model.matrix(decrease~.-1,data = train) > sparse.test <- sparse.model.matrix(decrease~.-1,data = test) > model <- xgboost(data= sparse.train, + label = train$decrease, + nround = 83) [1] train-rmse:44.224010 [2] train-rmse:34.871674 [3] train-rmse:27.506603 [4] train-rmse:22.271288 [5] train-rmse:18.391102 [6] train-rmse:15.546820 [7] train-rmse:13.209415 [8] train-rmse:11.418989 [9] train-rmse:9.899110 [10] train-rmse:8.801632 [11] train-rmse:7.977108 [12] train-rmse:7.058861 [13] train-rmse:6.199394 [14] train-rmse:5.635022 [15] train-rmse:4.629980 [16] train-rmse:3.988866 [17] train-rmse:3.333488 [18] train-rmse:2.784358 [19] train-rmse:2.352587 [20] train-rmse:2.027802 [21] train-rmse:1.716642 [22] train-rmse:1.504118 [23] train-rmse:1.364217 [24] train-rmse:1.273572 [25] train-rmse:1.095709 [26] train-rmse:0.921850 [27] train-rmse:0.784287 [28] train-rmse:0.705741 [29] train-rmse:0.621458 [30] train-rmse:0.556999 [31] train-rmse:0.508760 [32] train-rmse:0.444837 [33] train-rmse:0.393593 [34] train-rmse:0.362841 [35] train-rmse:0.340297 [36] train-rmse:0.296963 [37] train-rmse:0.266101 [38] train-rmse:0.243270 [39] train-rmse:0.217793 [40] train-rmse:0.197203 [41] train-rmse:0.177590 [42] train-rmse:0.160360 [43] train-rmse:0.149259 [44] train-rmse:0.131243 [45] train-rmse:0.124694 [46] train-rmse:0.115595 [47] train-rmse:0.105399 [48] train-rmse:0.097588 [49] train-rmse:0.089501 [50] train-rmse:0.083174 [51] train-rmse:0.070194 [52] train-rmse:0.059420 [53] train-rmse:0.050683 [54] train-rmse:0.043452 [55] train-rmse:0.038619 [56] train-rmse:0.033028 [57] train-rmse:0.031354 [58] train-rmse:0.028085 [59] train-rmse:0.026243 [60] train-rmse:0.023337 [61] train-rmse:0.021366 [62] train-rmse:0.019485 [63] train-rmse:0.016923 [64] train-rmse:0.014856 [65] train-rmse:0.013360 [66] train-rmse:0.012067 [67] train-rmse:0.011203 [68] train-rmse:0.009941 [69] train-rmse:0.008738 [70] train-rmse:0.007786 [71] train-rmse:0.006910 [72] train-rmse:0.006253 [73] train-rmse:0.005585 [74] train-rmse:0.005109 [75] train-rmse:0.004692 [76] train-rmse:0.004443 [77] train-rmse:0.004021 [78] train-rmse:0.003649 [79] train-rmse:0.003215 [80] train-rmse:0.002843 [81] train-rmse:0.002533 [82] train-rmse:0.002533 [83] train-rmse:0.002533 > a<- predict(model,sparse.test) > par(mfrow=c(1,2)); > plot(test$decrease,type="l",col="black", main="Absolute vs Predicted"); > lines(a,type="l",col="green");plot(abs(train$decrease-a),type="p") |

From the above fitting we can see how good the model is

**Why XGBoost is so popular in ML**?

Now let’s see the advantages for which it is the most frequently used in kaggle competitions or other data science methods.

- It is enabled with parallel processing (using OpenMP). So, while running XGBoost, it would use all the cores of your machine.
- XGBoost has provision for regularization. Regularization over GBM is very important which enables to control the over-fitting.
- We use external packages in R such as caret and mlr to obtain cross-validation results. But, XGBoost is enabled with internal cross-validation function.
- XGBoost is designed to deal with missing values internally.
- Unlike GBM, where tree pruning stops once a negative loss is encountered, XGBoost grows the tree up to max_depth and then prune backward until the improvement in loss function is below a threshold.
- It has a feature to allow us to save our data and model for future reference.
- It is available for programming languages such as R, Python, Java etc.

**Comparison between gradient boost and random forest****:**

Both are ensemble learning methods and predict (regression or classification) by combining the outputs from individual trees. They differ in the way the trees are built in order and the way the results are combined.

Random Forests train each tree independently, using a random sample of the data. This randomness helps to make the model more robust than a single decision tree and less likely to over fit on the training data. There are typically two parameters in Random Forest – number of trees and no. of features to be selected at each node.

Gradient Boosting Techniques build trees one at a time, where each new tree helps to correct errors made by previously trained tree. With each tree added, the model becomes even more expressive. There are typically three parameters in gradient boosting – number of trees, depth of trees and learning rate. Each tree built is generally shallow.

GBDT (Gradient Boosting Descent Technique) training generally takes longer because of the fact that trees are built sequentially. However benchmark results have shown GBDT are better learners than Random Forests.

Author: **Rahul Roy**, **Debjoy Thakur**