On this page we discuss the fundamentals of building gradient boosting trees for a continuous target variable and a binary target variable. This is a study note from StatQuest with Josh Starmer.
Steps for Gradient Boosting for a continuous target variable (gradient boost for regression):
The idea of the learning is that with each tree we build, we take a small step (hopefully) toward the true value.
When making the first prediction, the predicted value is the sample average of the target variable, as can be shown from the formula below.
Loss Function
\[L(y_i, F(x_i)) = \frac{1}{2} (y_i-F(x_i))^2\]
Step 1: initialize the model with constant value \[F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i,\gamma)\]
where \(\gamma\) here is the predicted value. Taking the first derivative of the loss function with respect to \(\gamma\) and setting it equal to 0, we find that \(\gamma\) is the sample average of the target variable.
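Concretely, for our loss function:
\[ \frac{d}{d\gamma}\sum_{i=1}^{n}\frac{1}{2}(y_i-\gamma)^2 = -\sum_{i=1}^{n}(y_i-\gamma) = 0 \quad \Rightarrow \quad \gamma = \frac{1}{n}\sum_{i=1}^{n} y_i \]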
Step 2: for m = 1 to M: (M trees)
2.1 Compute Pseudo Residuals \[r_{i,m} = -[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}]_{F(x)=F_{m-1}(x)}\] for \(i = 1,...,n\). Note that this derivative \([\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}]\) is the gradient that gradient boosting is named after. For our squared-error loss, \(r_{i,m}\) works out to the ordinary residual \(y_i - F_{m-1}(x_i)\).
2.2 Fit a regression tree to the Pseudo Residuals \(r_{i,m}\) and create terminal regions (leaves) \(R_{j,m}\) for \(j= 1...J_m\)
2.3 Again, find the \(\gamma\) (output value) for each leaf that minimizes the loss function
\[\gamma_{j,m} = \underset{\gamma}{\operatorname{argmin}} \sum_{x_i\in{R_{j,m}}}L(y_i, F_{m-1}(x_i)+\gamma)\] for \(j=1...J_m\). This can be solved analytically. Given our choice of loss function, the \(\gamma\) is always the average of the (pseudo) residuals in the same leaf.
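Concretely, setting the derivative with respect to \(\gamma\) to 0 for the samples in leaf \(R_{j,m}\) (with \(|R_{j,m}|\) denoting the number of samples in that leaf):
\[ \frac{d}{d\gamma}\sum_{x_i\in R_{j,m}}\frac{1}{2}(y_i-F_{m-1}(x_i)-\gamma)^2 = -\sum_{x_i\in R_{j,m}}(y_i-F_{m-1}(x_i)-\gamma) = 0 \quad \Rightarrow \quad \gamma_{j,m} = \frac{1}{|R_{j,m}|}\sum_{x_i\in R_{j,m}} r_{i,m} \]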
2.4 Make new predictions for each sample: \[F_m(x) = F_{m-1}(x)+ \nu \sum_{j=1}^{J_m} \gamma_{j,m}I(x\in R_{j,m})\]
where \(\nu\) is the Learning Rate and is a value between 0 and 1. A small learning rate reduces the effect of each tree on the final prediction and improves accuracy in the long run.
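To make the regression procedure above concrete, here is a minimal Python sketch of Steps 1 through 2.4. It is not from the StatQuest video; the function names (gradient_boost_regression, predict) and the default values for M, nu, and max_depth are my own illustrative choices, and it assumes numpy and scikit-learn are available.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_regression(X, y, M=100, nu=0.1, max_depth=2):
    """Fit M small regression trees to pseudo residuals (squared-error loss)."""
    y = np.asarray(y, dtype=float)
    F0 = y.mean()                      # Step 1: F_0(x) is the sample mean of the target
    F = np.full(len(y), F0)            # current predictions F_{m-1}(x_i)
    trees = []
    for m in range(M):
        residuals = y - F              # 2.1 pseudo residuals for squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)         # 2.2 fit a regression tree to the residuals
        # 2.3 for squared-error loss each leaf already predicts the mean of its
        #     residuals, so the fitted tree's leaf outputs are the gamma_{j,m}
        F = F + nu * tree.predict(X)   # 2.4 update predictions with learning rate nu
        trees.append(tree)
    return F0, trees

def predict(F0, trees, X, nu=0.1):
    """F_M(x) = F_0 + nu * sum of the M trees' outputs."""
    F = np.full(np.asarray(X).shape[0], F0)
    for tree in trees:
        F = F + nu * tree.predict(X)
    return F
```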
Gradient Boosting for a binary target variable is similar to that for a continuous target variable, but the loss function is now the negative log-likelihood:
\[-\sum_{i=1}^{n} [y_i \cdot log(\hat{p_i})+(1-y_i) \cdot log(1-\hat{p_i})] \]
Now we want to rewrite it as a function of the predicted \(log(odds)\), and then simplify it.
\[ \begin{aligned} -[y_i \cdot log(\hat{p_i})+(1-y_i) \cdot log(1-\hat{p_i})] &= -y_i \cdot log(\hat{p_i})-(1-y_i) \cdot log(1-\hat{p_i})\\ &= -y_i \cdot log(\hat{p_i})-log(1-\hat{p_i})+y_i \cdot log(1-\hat{p_i}) \\ &= -y_i \cdot [log(\hat{p_i})-log(1-\hat{p_i})]- log(1-\hat{p_i}) \\ &= -y_i \cdot log(\frac{\hat{p_i}}{1-\hat{p_i}})- log(1-\hat{p_i}) \\ &= -y_i \cdot log(odds)- log(1-\hat{p_i}) \\ &= -y_i \cdot log(odds)- log(1-\frac{e^{log(odds)}}{1+e^{log(odds)}}) \\ &= -y_i \cdot log(odds)- log(\frac{1}{1+e^{log(odds)}}) \\ &= -y_i \cdot log(odds)- [log(1)-log({1+e^{log(odds)}})] \\ &= -y_i \cdot log(odds)+log({1+e^{log(odds)}}) \\ \end{aligned} \]
Now take the first derivative of the loss function with respect to \(log(odds)\) \[ \begin{aligned} \frac{d}{d log(odds)} L &= -y_i+ \frac{e^{log(odds)}}{1+e^{log(odds)}} \\ &= -y_i + \hat{p_i} \end{aligned} \]
Now we can go through the same steps as for the continuous case.
Step 1: Initialize the model with constant value \[F_0(x) = \underset{\gamma}{\operatorname{argmin}} \sum_{i=1}^{n} L(y_i,\gamma)\] where \(\gamma\) is now a \(log(odds)\). The initial \(\hat{p_i}\) for all the samples \(i= 1,...,n\) is the sum of \(y_i\) divided by the number of observations, and \(F_0(x)\) is the \(log(odds)\) of that proportion.
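For example, with hypothetical data in which 4 of 6 samples have \(y_i = 1\):
\[ \hat{p_i} = \frac{4}{6}, \qquad F_0(x) = log(odds) = log(\frac{4/6}{2/6}) = log(2) \approx 0.69 \]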
Step 2: for m = 1 to M: (M trees)
2.1 Compute Pseudo Residuals \[r_{i,m} = -[\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)}]_{F(x)=F_{m-1}(x)}\] for \(i = 1,...,n\). We know from above that \(\frac{\partial L(y_i,F(x_i))}{\partial F(x_i)} = -y_i + \hat{p_i}\), so \(r_{i,m}\) is just the residual \(y_i- \hat{p_i}\).
2.2 Fit a regression tree to the Pseudo Residuals \(r_{i,m}\) and create terminal regions (leaves) \(R_{j,m}\) for \(j= 1...J_m\)
2.3 Again, find the \(\gamma\) (a \(log(odds)\) value) for each terminal leaf that minimizes the loss function
\[ \begin{aligned} \gamma_{j,m} &= \underset{\gamma}{\operatorname{argmin}} \sum_{x_i\in{R_{j,m}}}L(y_i, F_{m-1}(x_i)+\gamma) \\ &= \underset{\gamma}{\operatorname{argmin}} \sum_{x_i\in{R_{j,m}}} -y_i \cdot [F_{m-1}(x_i)+ \gamma]+log({1+e^{F_{m-1}(x_i)+\gamma}}) \end{aligned} \]
for \(j=1...J_m\). Solving this analytically is difficult, so we approximate the loss function with a second-order Taylor expansion:
\[ L(y_i, F_{m-1}(x_i)+\gamma) \approx L(y_i, F_{m-1}(x_i)) + \frac{d}{d\gamma}L(y_i, F_{m-1}(x_i))\gamma + \frac{1}{2} \frac{d^2}{d\gamma^2}L(y_i, F_{m-1}(x_i))\gamma^2 \]
Now we take the derivative:
\[ \frac{d}{d\gamma}L(y_i, F_{m-1}(x_i)+\gamma) \approx \frac{d}{d\gamma}L(y_i, F_{m-1}(x_i)) + \frac{d^2}{d\gamma^2}L(y_i, F_{m-1}(x_i))\gamma \] Now set it to 0 and solve for \(\gamma\), summing over the samples in leaf \(R_{j,m}\):
\[ \begin{aligned} \gamma_{j,m} &= \frac{-\sum_{x_i \in R_{j,m}}\frac{d}{d\gamma}L(y_i, F_{m-1}(x_i))}{\sum_{x_i \in R_{j,m}}\frac{d^2}{d\gamma^2}L(y_i, F_{m-1}(x_i))} \\ &= \frac{\sum_{x_i \in R_{j,m}} (y_i-\hat{p_i})}{\sum_{x_i \in R_{j,m}}[\hat{p_i} \cdot (1-\hat{p_i})]} \end{aligned} \]
where \(\hat{p_i}\) is the estimated probability from the previous tree.
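The denominator comes from differentiating the gradient \(\hat{p_i} - y_i\) once more with respect to the \(log(odds)\):
\[ \frac{d^2}{d log(odds)^2} L = \frac{d}{d log(odds)}(\frac{e^{log(odds)}}{1+e^{log(odds)}} - y_i) = \frac{e^{log(odds)}}{(1+e^{log(odds)})^2} = \hat{p_i} \cdot (1-\hat{p_i}) \]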
2.4 Update the prediction: \[F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m}\gamma_{j,m} I(x \in R_{j,m})\]
Step 3: Output \(F_{M}(x)\)
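As a rough illustration of the classification algorithm above, here is a minimal Python sketch. Again, the function name gradient_boost_classification and the parameter defaults are my own choices, not from the video; it assumes a 0/1 target and that numpy and scikit-learn are installed.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_classification(X, y, M=100, nu=0.1, max_depth=2):
    """Boost log(odds) predictions for a 0/1 target (negative log-likelihood loss)."""
    y = np.asarray(y, dtype=float)
    p = y.mean()
    F0 = np.log(p / (1 - p))              # Step 1: initial log(odds)
    F = np.full(len(y), F0)               # current log(odds) predictions
    trees, leaf_gammas = [], []
    for m in range(M):
        p_hat = 1.0 / (1.0 + np.exp(-F))  # convert log(odds) to probabilities
        residuals = y - p_hat             # 2.1 pseudo residuals
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)            # 2.2 fit a regression tree to the residuals
        # 2.3 per-leaf output value: sum(residuals) / sum(p_hat * (1 - p_hat))
        leaf_ids = tree.apply(X)
        gamma = {}
        for leaf in np.unique(leaf_ids):
            in_leaf = leaf_ids == leaf
            gamma[leaf] = residuals[in_leaf].sum() / (p_hat[in_leaf] * (1 - p_hat[in_leaf])).sum()
        F = F + nu * np.array([gamma[leaf] for leaf in leaf_ids])  # 2.4 update log(odds)
        trees.append(tree)
        leaf_gammas.append(gamma)
    # Step 3: the final probability for a training sample is 1 / (1 + exp(-F))
    return F0, trees, leaf_gammas
```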
XGBoost stands for extreme gradient boosting. It differs from the original gradient boost in that, when choosing each leaf's output value \(O_{value}\), it minimizes a regularized loss function:
\[\underset{O_{value}}{\operatorname{argmin}}[\sum_{i=1}^{n}L(y_i, F_{m-1}(x_i)+O_{value})+\gamma T+\frac{1}{2}\lambda O_{value}^2]\] The term \(\gamma T\) penalizes trees with many terminal leaves (\(T\) is the number of leaves), and the term \(\frac{1}{2}\lambda O_{value}^2\) shrinks the output values toward 0.
XGBoost uses a 2nd order Taylor approximation of the loss for both regression & classification: \[ L(y_i, F_{m-1}(x_i)+O_{value}) \approx L(y_i, F_{m-1}(x_i)) + \frac{d}{dO_{value}}L(y_i, F_{m-1}(x_i)) O_{value} + \frac{1}{2}\frac{d^2}{dO_{value}^2}L(y_i, F_{m-1}(x_i)) O_{value}^2 \]
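Sketching where this leads (the \(\gamma T\) term does not depend on \(O_{value}\) once the tree structure is fixed, so it drops out when we differentiate): plugging the approximation into the objective, setting the derivative with respect to \(O_{value}\) to 0, and solving, just as in step 2.3 above, gives
\[ O_{value} = \frac{-\sum_{x_i \in leaf}\frac{d}{dO_{value}}L(y_i, F_{m-1}(x_i))}{\sum_{x_i \in leaf}\frac{d^2}{dO_{value}^2}L(y_i, F_{m-1}(x_i)) + \lambda} \]
where the sums run over the samples in the leaf whose output value is being computed. For the binary loss above this is \(\frac{\sum (y_i - \hat{p_i})}{\sum \hat{p_i}(1-\hat{p_i}) + \lambda}\).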