Bagging Vs Boosting

Saksham Gulati
10 min read · May 5, 2022


Let's have a quick recap of Bagging and Boosting, starting from the foundations.

Bootstrapping

Before we get to Bagging, let’s take a quick look at an important foundation technique called the bootstrap.

The bootstrap is a powerful statistical method for estimating a quantity from a data sample. This is easiest to understand if the quantity is a descriptive statistic such as a mean or a standard deviation.

Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.

We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has an error in it. We can improve the estimate of our mean using the bootstrap procedure:

1. Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).

2. Calculate the mean of each sub-sample.

3. Calculate the average of all of our collected means and use that as our estimated mean for the data.

For example, let’s say we used 3 resamples and got the mean values 2.3, 4.5, and 3.3. Taking the average of these, we would estimate the mean of the data to be 3.367.
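Here is a minimal sketch of that procedure in Python; the 100 sample values are synthetic stand-ins, so the exact numbers will differ from the toy example above.

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical sample of 100 values standing in for x.
x = rng.normal(loc=50, scale=10, size=100)

# Bootstrap: draw many sub-samples with replacement and record each mean.
n_resamples = 1000
boot_means = [rng.choice(x, size=len(x), replace=True).mean()
              for _ in range(n_resamples)]

print("Plain sample mean:      ", x.mean())
print("Bootstrap mean estimate:", np.mean(boot_means))
```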

Bagging

The motivation for using ensemble models is to reduce the generalization error of the prediction. As long as the base models are diverse and independent, the prediction error of the model decreases when the ensemble approach is used. The approach seeks the wisdom of crowds in making a prediction.

The Need for Bagging

Decision trees are sensitive to the specific data on which they are trained. If the training data is changed (e.g. a tree is trained on a subset of the training data) the resulting decision tree can be quite different and in turn, the predictions can be quite different.

Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees.

Let’s assume we have a sample dataset of 1000 instances (x) and we are using the CART algorithm. Bagging of the CART algorithm would work as follows.

1. Create many (e.g. 100) random sub-samples of our dataset with replacement.

2. Train a CART model on each sample.

3. Given a new dataset, calculate the average prediction from each model.

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue, and red, we would take the most frequent class and predict blue.
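As a rough illustration of those steps (not the exact setup above), here is how bagged CART-style trees might look in scikit-learn. The dataset is synthetic, and depending on your scikit-learn version the parameter may be called base_estimator instead of estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 1000-instance dataset in the example.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 CART-style trees, each trained on a bootstrap sample of the training data;
# the individual trees are left unpruned (grown deep), as described above.
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,
    random_state=0,
)
bag.fit(X_train, y_train)

# Class predictions are aggregated across the 100 trees.
print("Test accuracy:", bag.score(X_test, y_test))
```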

When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging.

When you are using Bagging, each bootstrap sample typically contains about 63% of the unique rows of the original data.

It's important to understand that the chance of a given row not being selected in any of n draws with replacement from n samples is (1 − 1/n)^n, which works out to roughly a 1/3 chance (about 1/e ≈ 0.37) of not being selected.
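You can check this number for yourself with a one-liner (n = 1000 is an arbitrary choice):

```python
n = 1000
p_never_selected = (1 - 1 / n) ** n   # chance a given row is never drawn
print(p_never_selected)               # ~0.368, close to 1/e, so ~63% of rows are selected
```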

To summarize:

  1. Bagging uses bootstrapping, which means picking data values with replacement.
  2. Bagging doesn't prune the decision trees; in fact, it allows them to grow fully and takes the majority class as its output.
  3. Each bootstrap sample typically contains only about 63% of the unique values.
  4. Bagging also uses all the features/predictors in the data.

Since bagging uses all the features that are present, it inadvertently introduces an element of correlation among trees. Hence, one can say that the outputs of these bagging models are correlated. Random Forest to the rescue!

Random Forest

As such, even with Bagging, the decision trees can have a lot of structural similarities and in turn, have a high correlation in their predictions.

Combining predictions from multiple models in ensembles works better if the predictions from the sub-models are uncorrelated or at best weakly correlated.

The random forest changes the algorithm for the way that the sub-trees are learned so that the resulting predictions from all of the subtrees have less correlation.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the most optimal split point. The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of the features when searching for a split.

The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune them using cross-validation.

  • For classification a good default is: m = sqrt(p)
  • For regression, a good default is: m = p/3

Each tree in Random Forest is grown as follows:

1. A bootstrap sample is drawn from the dataset, with replacement, and used for growing the tree.

2. If there are M input features in total, m < M features are selected at random at each node.

3. The best split on these ‘m’ features is used to split the node.

4. Each tree is grown to the largest extent possible or to a pre-defined depth.

5. Predictions for new data are made by aggregating the predictions of all the trees (majority vote for classification, average for regression), as sketched below.
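A minimal scikit-learn sketch of these steps on synthetic data; max_features="sqrt" implements the m = sqrt(p) default for classification, and oob_score makes use of the ~37% of rows left out of each bootstrap sample.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each tree sees a bootstrap sample, and each split considers only a random
# subset of the features, which decorrelates the trees.
rf = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # m = sqrt(p) for classification
    oob_score=True,       # evaluate each tree on the rows it never saw
    random_state=0,
)
rf.fit(X, y)

print("Out-of-bag score:", rf.oob_score_)
```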

Bagging is a great ensemble technique, but it mainly reduces variance and does little to correct any bias the base learners share. Moreover, each learner gets an equal weight in predicting the outcome, which is seldom ideal in real life. For these reasons, we turn to Boosting.

Boosting

Here are some of the key concepts of Boosting.

● It builds/trains weak learners sequentially.

● The weights of misclassified points are increased and the weights of correctly classified points are decreased before training the subsequent learner.

● In other words, the points the previous model misclassified are given more weightage by the next model.

● The same process is repeated until the stopping criteria are reached.

● The final prediction is made by taking a weighted average or a weighted vote of the learners.

Gradient boosting is a machine learning technique for regression and classification problems which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

Think of residuals, the differences between observed and predicted values, as the mistakes committed by our prediction model. Although tree-based models (we consider decision trees as the base models for gradient boosting here) are not built on assumptions of zero mean and constant standard deviation, if we think logically (not statistically) about it, we might argue that if we can see some pattern in the residuals around 0, we can leverage that pattern to fit a model.

So, the intuition behind the gradient boosting algorithm is to repeatedly leverage the patterns in the residuals to strengthen a model with weak predictions and make it better. Once we reach a stage where the residuals do not have any pattern that can be modeled, we stop modeling the residuals (otherwise it might lead to overfitting).
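As a toy illustration of this "fit the residuals" idea (synthetic data and shallow trees chosen arbitrarily), two rounds are enough to see the residuals shrink:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.1, size=200)

# Round 1: a weak (shallow) tree fit directly to the targets.
tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)

# Round 2: a second weak tree fit to the residuals of the first.
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
combined = tree1.predict(X) + tree2.predict(X)

print("MSE after round 1:", np.mean(residuals ** 2))
print("MSE after round 2:", np.mean((y - combined) ** 2))
```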

Adaptive Boosting

A single classifier may not be able to accurately predict the class of an object, but when we group multiple weak classifiers, with each one progressively learning from the previous ones' wrongly classified objects, we can build one such strong model. AdaBoost (Adaptive Boosting) is a very popular boosting technique that aims at combining multiple weak classifiers to build one strong classifier.

Now we may ask, what is a “weak” classifier? A weak classifier is one that performs better than random guessing but still performs poorly at designating classes to objects. For example, a weak classifier may predict that everyone above the age of 40 could not run a marathon but people falling below that age could. Now, you might get above 60% accuracy, but you would still be misclassifying a lot of data points!

Let’s try to understand how AdaBoost works with Decision Stumps. Decision Stumps are like trees in a Random Forest, but not “fully grown.” They have one node and two leaves. AdaBoost uses a forest of such stumps rather than trees.

Stumps alone are not a good way to make decisions. A full-grown tree combines the decisions from all variables to predict the target value. A stump, on the other hand, can only use one variable to make a decision.

Step 1: A weak classifier (e.g. a decision stump) is made on top of the training data based on the weighted samples. Here, the weights of each sample indicate how important it is to be correctly classified. Initially, for the first stump, we give all the samples equal weights.

Step 2: We create a decision stump for each variable and see how well each stump classifies samples to their target classes. Coming back to the marathon example, we check for Age, Eating Junk Food, and Exercise. We’d look at how many samples are correctly or incorrectly classified as Fit or Unfit for each individual stump.

Step 3: More weight is assigned to the incorrectly classified samples so that they’re classified correctly in the next decision stump. Weight is also assigned to each classifier based on the accuracy of the classifier, which means high accuracy = high weight!

Step 4: Reiterate from Step 2 until all the data points have been correctly classified, or the maximum iteration level has been reached.
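A minimal scikit-learn sketch of AdaBoost with decision stumps (the data is synthetic; in older scikit-learn versions the parameter is base_estimator rather than estimator):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# max_depth=1 gives decision stumps: one node, two leaves.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=0,
)
ada.fit(X, y)

print("Training accuracy:", ada.score(X, y))
```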

A tree with just one node and two leaves is called a stump. Stumps are not great at making accurate classifications because they use only one variable to make a decision. Technically, they are called weak learners, and that's exactly the way AdaBoost likes it.

In contrast to random forests, where every tree gets an equal vote, some stumps get more say in the final classification than others.

Order is important in AdaBoost: the errors of each stump influence the stumps that follow.

The three main ideas behind AdaBoost are:

1. Stumps

2. Some stumps have more say

3. They take previous mistakes into account

We calculate the Gini index for each candidate stump to decide which variable the stump should split on; the stump's say in the final classification is then determined by its total error.

Amount of say = ½ * log((1 − total error) / total error)

Total error = the sum of the weights of the incorrectly classified samples. Let's say there were 8 samples and the stump got 3 of them incorrect. Hence,

Total error = 3 × 1/8 = 3/8 (since each sample was equally weighted)

With a small total error, the amount of say is large.

Now we modify the weights: we increase the sample weight of every sample that was incorrectly classified and decrease the weight of every sample that was correctly classified.

New sample weight = sample weight * e^(amount of say) for misclassified samples, and sample weight * e^(−amount of say) for correctly classified samples. The weights are then normalized so they sum to 1.

A new collection of samples is then picked by bootstrapping: samples are drawn at random, with replacement, until the new collection is the same size as the original dataset.

Because of its larger weight, the wrongly classified sample has a higher chance of being drawn into this bootstrapped dataset, so it tends to occur multiple times, reflecting its large weight.

All samples in the new dataset are then given equal weights again, and we repeat the whole process with the next stump.
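Putting the 8-sample example into numbers (the misclassified indices below are made up for illustration; the weights are renormalized to sum to 1 at the end):

```python
import math

n_samples = 8
weights = [1 / n_samples] * n_samples   # every sample starts with equal weight
misclassified = [0, 3, 5]               # hypothetical indices the stump got wrong

total_error = sum(weights[i] for i in misclassified)             # 3/8
amount_of_say = 0.5 * math.log((1 - total_error) / total_error)  # ~0.255

# Increase the weights of misclassified samples, decrease the rest, renormalize.
new_weights = [w * math.exp(amount_of_say) if i in misclassified
               else w * math.exp(-amount_of_say)
               for i, w in enumerate(weights)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print("Total error:  ", total_error)
print("Amount of say:", amount_of_say)
print("New weights:  ", new_weights)
```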

Gradient Boosting

Using Gradient Boost for regression is different from using linear regression. It can be used for classification too. I won't be getting into the math of this algorithm; it looks complicated because it was designed to be used in a variety of ways.

As opposed to AdaBoost, which builds stumps, Gradient Boost starts with a single leaf that serves as the initial prediction for the algorithm.

It then builds a fixed-size tree based on the previous tree's errors, but unlike AdaBoost, each tree can be larger than a stump. Gradient Boost also scales each tree's contribution. It continues to build trees in this fashion until it has built the specified number of trees.

The steps are as follows:

(Figure credit: StatQuest)

1. A weak learner is built on a subset of the original data. For instance, let's say we want to predict the weights of individuals based on several factors such as height, gender, etc. To predict a person's weight, the algorithm takes the average weight as its starting point.

2. Errors are calculated by comparing the weak learner's predictions with the actual values. The next thing we do is build a tree based on these errors, i.e. observed weight − predicted weight (the average). This difference is called a pseudo residual and is saved in a new column. The "pseudo" part is a reminder that we are doing gradient boosting, not linear regression.

3. Next, we build a tree using the other features, such as Height and Gender. However, the size of the tree is fixed: we usually restrict the number of leaves to keep the tree small. Here, the new trees are trying to predict the residuals as opposed to the weight of the individual.

4. Now we can combine the original leaf with the new tree: the prediction becomes the original prediction (the average weight) + the residual predicted from the other features. Without any scaling, however, the model fits the training data too well (low bias, high variance).

5. Gradient Boost handles this high variance with a learning rate, a value between 0 and 1 that scales each tree's contribution. With the learning rate applied, we achieve lower variance. This is called scaling the tree.

6. We keep recalculating the pseudo residuals and creating new trees to predict them. The scaled trees ensure that we take small incremental steps in the right direction; the trees can be different each time, and as we keep adding trees, the residuals keep getting smaller.

(Figure credit: StatQuest)

In practice, the maximum number of leaves is usually set between 8 and 32. A minimal sketch of the whole procedure is shown below.
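This sketch follows the steps above for regression with squared-error loss. The data, learning rate, and tree size are arbitrary choices for illustration, not a faithful reimplementation of any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Hypothetical data: predict weight from height and a 0/1 gender indicator.
X = np.column_stack([rng.normal(170, 10, size=300), rng.integers(0, 2, size=300)])
y = 0.6 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 5, size=300)

learning_rate = 0.1
n_trees = 50

# Step 1: start from a single leaf, the average target value.
prediction = np.full_like(y, y.mean())
trees = []

for _ in range(n_trees):
    # Steps 2-3: fit a small, fixed-size tree to the pseudo residuals.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_leaf_nodes=8).fit(X, residuals)
    trees.append(tree)
    # Steps 4-5: add the scaled tree output to the running prediction.
    prediction += learning_rate * tree.predict(X)

print("Training MSE after boosting:", np.mean((y - prediction) ** 2))
```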

The loss function is something that evaluates how well the algorithm predicts the weight (for this particular example). For gradient boosting on a regression problem, a common loss function is

Loss = (observed − predicted)² / 2

This loss function is easy to differentiate with respect to the predicted value.

For the initial leaf, we need to find the single predicted value that minimizes the total loss over all samples. That value is the initial predicted weight, and it turns out to be simply the average of the observed weights.
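To see why, set the derivative of the total loss with respect to a single prediction p to zero:

d/dp Σ (observed_i − p)² / 2 = −Σ (observed_i − p) = 0, which gives p = (1/n) Σ observed_i, i.e. the average of the observed weights.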

In my next blog, I will be talking about Extreme Gradient Boosting.
