XGBoost
--
XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
It is a fairly complex algorithm and has been touted as a game changer for many ML applications.
I won't be getting into the mathematics, the technical details, or the code implementation of the algorithm. This article is meant to give you an intuitive understanding of how it works.
The very first step is to make an initial prediction. By default, this prediction is 0.5, irrespective of whether you are using XGBoost for regression or classification.
Now, just like Gradient Boosting, XGBoost fits a tree to predict the residuals (the observed values minus the current prediction). You can learn more about how Gradient Boosting works from my previous article.
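As a minimal sketch of that first step (the observed drug-effectiveness values below are made up for illustration), the residuals are simply the observed values minus the initial prediction of 0.5:

```python
# Hypothetical observed values (drug effectiveness) -- not real data
observed = [-10.5, 6.5, 7.5, -7.5]
initial_prediction = 0.5   # XGBoost's default starting prediction

# Residual = observed value - current prediction
residuals = [y - initial_prediction for y in observed]
print(residuals)  # [-11.0, 6.0, 7.0, -8.0]
```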
However, unlike Gradient Boosting, which uses regular regression trees (CART), XGBoost uses its own kind of regression tree. There are several ways to build an XGBoost tree; we will focus on the most common one.
Each tree starts out as a single leaf, and we calculate a similarity score for that leaf.
Note that lambda is a regularization parameter; we will discuss it later.
To get the base similarity score, we take the sum of the residuals in the leaf and square it for the numerator, and we put the number of residuals (plus lambda) in the denominator. Because we do not square the residuals before adding them, residuals of opposite signs have a good chance of cancelling each other out.
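As a small sketch of the formula just described (not the library's internal code), the similarity score of a leaf is the squared sum of its residuals divided by the number of residuals plus lambda:

```python
def similarity_score(residuals, lam=1.0):
    """Similarity = (sum of residuals)^2 / (number of residuals + lambda)."""
    return sum(residuals) ** 2 / (len(residuals) + lam)

# Because we sum first and square afterwards, residuals of opposite signs
# cancel out and a leaf with mixed residuals gets a small score.
print(similarity_score([-11.0, 6.0, 7.0, -8.0], lam=0))  # (-6)^2 / 4 = 9.0
```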
Let's take an example where we are trying to model drug effectiveness versus drug dosage.
As per our earlier discussion, we calculate the baseline residuals against the initial prediction of 0.5 (the black line parallel to the x-axis, drug dosage). Once we have those residuals, we calculate the base similarity score.
The question now is whether we can do a better job than the base similarity score at clustering similar residuals together by splitting them into two groups.
Let's say we take the average of the two lowest dosages and place a split at a dosage of 15.
At this point we calculate a similarity score for each of the two resulting leaves.
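Continuing with the hypothetical residuals from the earlier sketch (and the similarity_score helper defined above), a split at a dosage of 15 might look like this:

```python
# Hypothetical residuals, ordered by drug dosage -- for illustration only
left  = [-11.0]            # samples with dosage < 15
right = [6.0, 7.0, -8.0]   # samples with dosage >= 15

lam = 0
sim_left  = similarity_score(left,  lam)          # (-11)^2 / 1 = 121.0
sim_right = similarity_score(right, lam)          # 5^2 / 3 ≈ 8.33
sim_root  = similarity_score(left + right, lam)   # (-6)^2 / 4 = 9.0
```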
Once we have the similarity scores for both leaves, we calculate the gain. The gain can be thought of as the improvement we get from grouping similar residuals together:
Gain = Leaf(Left) + Leaf(Right) - Root(Base)
We keep shifting the threshold and measuring the gain for each new threshold, and we go with the split that has the highest gain.
Once we have decided which feature and threshold give us the highest gain, we keep splitting the resulting nodes further. For instance, let's say that after several iterations of shifting thresholds, our initial threshold of 15 still gives the best gain. In that case, we go further down the right node and try to split its residuals again to get a better gain, as sketched below.
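A bare-bones sketch of that threshold search, reusing the hypothetical dosages and residuals from earlier and the similarity_score helper from above (XGBoost's real implementation is far more optimized):

```python
def best_split(dosages, residuals, lam=1.0):
    """Try the midpoint between each pair of adjacent dosages and
    return the (threshold, gain) pair with the highest gain."""
    pairs = sorted(zip(dosages, residuals))
    best = None
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left  = [r for d, r in pairs if d <  threshold]
        right = [r for d, r in pairs if d >= threshold]
        gain = (similarity_score(left, lam) + similarity_score(right, lam)
                - similarity_score(residuals, lam))
        if best is None or gain > best[1]:
            best = (threshold, gain)
    return best

print(best_split([10, 20, 25, 35], [-11.0, 6.0, 7.0, -8.0], lam=0))
# (15.0, 120.33...) -- the split at dosage 15 wins
```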
By default, the tree is built up to 6 levels deep.
Pruning
We start by picking a number called gamma. Gamma is an arbitrary value that can be tuned as a hyperparameter. The idea is to remove a branch if the difference between its gain and gamma is negative.
At times, entire trees are removed as well.
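As a tiny sketch of that check, using the hypothetical gain from the split above and an arbitrary gamma:

```python
gamma = 130     # arbitrary, user-tuned pruning threshold
gain  = 120.33  # hypothetical gain of the lowest branch, from the sketch above

# If gain - gamma is negative, the branch is pruned.
if gain - gamma < 0:
    print("prune this branch")
else:
    print("keep this branch")
```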
Regularization
The trees are regularized with lambda, which serves as the regularization parameter in our similarity score equation. The higher the value of lambda, the lower the similarity scores.
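For reference, the knobs discussed so far map onto XGBoost's own parameters roughly as follows; this is a sketch using the scikit-learn wrapper, and the values shown are the library's usual defaults rather than tuned settings:

```python
import xgboost as xgb

model = xgb.XGBRegressor(
    max_depth=6,        # trees are grown up to 6 levels deep by default
    gamma=0,            # pruning threshold: minimum gain required to keep a split
    reg_lambda=1,       # lambda in the similarity-score denominator
    learning_rate=0.3,  # eta, scales each tree's contribution
    base_score=0.5,     # the initial prediction discussed at the start
)
```

Raising reg_lambda shrinks the similarity scores and output values, which makes each individual tree more conservative.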
Calculating the Output values
Output values are calculated in a fashion similar to the similarity scores. However, we do not square the sum of the residuals; we simply take the sum and divide it by (#residuals + lambda).
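Placed next to the similarity-score sketch above, it is a one-line change:

```python
def output_value(residuals, lam=1.0):
    """Output = sum of residuals / (number of residuals + lambda) -- no squaring."""
    return sum(residuals) / (len(residuals) + lam)

print(output_value([6.0, 7.0], lam=1))  # 13 / 3 ≈ 4.33
```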
Predictions with XGBoost
Just as with Gradient Boosting trees, the new predictions are the initial predictions plus the learning rate multiplied by the output (the predicted residuals) of each tree.
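As a final sketch (the leaf outputs below are hypothetical, and 0.3 is the learning rate XGBoost typically uses by default, often called eta):

```python
initial_prediction = 0.5
learning_rate = 0.3                      # eta
tree_output = [-5.5, 4.33, 4.33, -4.0]   # hypothetical leaf output per sample

new_predictions = [initial_prediction + learning_rate * out
                   for out in tree_output]
# The next tree is then fit to the residuals of these updated predictions,
# and its scaled output is added on top in exactly the same way.
print(new_predictions)
```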