> [!Information]
> - **Authors/Presenters**: Andreas Mayr, Harald Binder, Olaf Gefeller, Matthias Schmid
> - **Journal/Seminar Name**: Methods of Information in Medicine
> - **Date**: August 12, 2014
> - **Related Topics**: Boosting Algorithm
> - **Link**: [The evolution of boosting algorithms. From machine learning to statistical modelling - PubMed](https://pubmed.ncbi.nlm.nih.gov/25112367/)
### One-line Summary
Boosting was reinterpreted as an algorithm for fitting additive models and thereby became an interpretable statistical modelling tool.
### Introduction
- Boosting is considered one of the most important methods for supervised learning.
- **This review has two purposes:**
- One is to highlight the evolution of boosting from a black-box prediction machine into an interpretable statistical modelling tool.
- The other is to show that the two classes of statistical boosting algorithms, gradient boosting and likelihood-based boosting, share the same historical roots.
### Boosting in Machine Learning
- **The basic idea of boosting is to improve prediction accuracy by combining the predictions of many weak classifiers into a single strong one.**
```python
# Generic boosting algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input H     : class of weak classifiers,  Loss : loss function
# Input V     : voting rule for combining the weak classifiers
w_1 <- initial weight vector for the training data I
for m = 1 to M do:
    h_m   <- the classifier h in H that minimizes Loss(w_m, I)
    a_m   <- performance score of h_m based on the Loss value
    w_m+1 <- new weight vector with increased weights for the
             observations that are difficult to fit
return V({h_m(I_new)}, {a_m})
```
- **The most important example of a boosting algorithm is *AdaBoost*.**
```python
# AdaBoost (Adaptive Boosting) algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input H     : class of weak binary classifiers,  Loss : loss function
w_1 <- initial weight vector for the training data I
for m = 1 to M do:
    h_m   <- the classifier h in H that minimizes Loss(w_m, I)
    a_m   <- performance score of h_m based on the Loss value
    w_m+1 <- new weight vector with increased weights for the
             observations that are difficult to fit
{z_m} <- the weighted votes a_m * h_m(I_new) for m = 1, ..., M
return sign(sum(z_m))
```
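- A minimal runnable sketch of the AdaBoost scheme above, using decision stumps as weak classifiers; the simulated dataset, the number of iterations, and the use of scikit-learn are illustrative assumptions rather than details from the paper:
```python
# Hand-rolled AdaBoost sketch (illustrative); labels are coded as -1/+1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                      # recode labels from {0, 1} to {-1, +1}

M = 50                             # number of boosting iterations
n = len(y)
w = np.full(n, 1.0 / n)            # w_1: uniform initial weights
stumps, alphas = [], []

for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)               # weak classifier h_m
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)                     # a_m: vote of h_m
    w = w * np.exp(-alpha * y * pred)                         # upweight misclassified points
    w = w / w.sum()                                           # w_(m+1): renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted majority vote over all weak classifiers
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```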
- In practice, simple classification trees or stumps are often used as weak classifiers.
- Breiman, a leading expert in machine learning, stated, “Boosting is the best off-the-shelf classifier in the world.”
- *AdaBoost* seems to resist overfitting: the test error often does not increase even when the complexity of the final solution keeps growing.
- Two explanations have been proposed for this surprising behaviour: the margin interpretation (the ensemble continues to increase the classification margins even after the training error reaches zero) and the argument that whether overfitting occurs depends on the performance criterion that is considered.
### Statistical Boosting
- **Boosting was reinterpreted as an algorithm for fitting additive models and thereby became an interpretable statistical modelling tool.**
- As an additive model, the final solution of the boosting algorithm is expressed by $\hat{f}_M(\cdot) = \beta_0 + \sum_{j=1}^{p}h_j(x_j)$.
- Each base learner $h_j(\cdot)$ reflects the partial effect of the feature $x_j$. Researchers can **interpret** how $x_j$ affects the prediction by analyzing the shape of $h_j$.
- **A typical example of statistical boosting is Gradient Boosting.**
- **Gradient Boosting**
- Theoretical goal:
find $\hat{f}(\cdot) = argmin_{f(\cdot)}E_{Y, X}[Loss(Y,~f(X))]$
- Practical goal:
find $\hat{f}(\cdot) = argmin_{f(\cdot)}\sum_{i=1}^{n}Loss(y_i,~f(x_i))$
- Fundamental Idea:
1. In iteration $m$, compute the negative gradient vector $u$ of the loss function, evaluated at the current fit:
$u^{[m]} = (-\frac{\partial}{\partial f}Loss(y_i, f(X_{i\cdot}))|_{f=\hat{f}^{[m-1]}})_{i= 1,~...,~n}$
2. Select the base learner $\hat{h}_j$ which optimally fits $u$:
$\hat{h}^{[m]} = argmin_{j, h}~||u^{[m]} -h_j^{[m]}(X_{\cdot j})||$
```python
# Gradient boosting algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input {h_j} : set of base learners,  Loss : loss function
# Input sl    : small step length, 0 < sl << 1
f^[0] <- initial predictor, e.g. f^[0] = 0
for m = 1 to M do:
    u^[m] <- negative gradient vector of the loss function,
             evaluated at the fit of the previous iteration f^[m-1]
    h^[m] <- the base learner that best fits u^[m]
    f^[m] <- f^[m-1] + sl * h^[m]
return f^[M](I_new)
```
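- A minimal runnable sketch of component-wise gradient boosting with squared-error loss and simple linear base learners (one per feature); the simulated data, step length, and number of iterations are illustrative assumptions rather than details from the paper:
```python
# Component-wise gradient boosting sketch (illustrative).
# With squared-error loss, the negative gradient u is simply the residual y - f.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)  # only x_1 and x_3 matter

M, sl = 500, 0.1                   # iterations and small step length, 0 < sl << 1
f = np.zeros(n)                    # f^[0] = 0
coef = np.zeros(p)                 # accumulated partial effect of each feature

for m in range(M):
    u = y - f                                      # u^[m]: negative gradient of the loss
    betas = X.T @ u / np.sum(X**2, axis=0)         # univariate least-squares fit per feature
    rss = [np.sum((u - X[:, j] * betas[j]) ** 2) for j in range(p)]
    j_star = int(np.argmin(rss))                   # base learner that best fits u^[m]
    f = f + sl * betas[j_star] * X[:, j_star]      # f^[m] = f^[m-1] + sl * h^[m]
    coef[j_star] += sl * betas[j_star]

# Non-informative features keep coefficients near zero (intrinsic variable selection)
print("estimated coefficients:", np.round(coef, 2))
```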
- Remarks:
- In each iteration $m$, only the best-fitting base learner (and hence only one feature $x_j$) is added to the final predictor. As a result, the effective number of selected features is intentionally reduced, which yields intrinsic variable selection similar to lasso regularization.
- Virtually any function can serve as the loss function of gradient boosting, as long as it is (at least) convex and differentiable (two examples of the resulting negative gradients are given below).
- The outcome does not have to follow a distribution from the exponential family; gradient boosting can be applied whenever a suitable loss function is available.
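- Two standard examples of the negative gradient $u$ (not taken from the paper, added for illustration):
- For the squared-error loss $\frac{1}{2}(y_i - f(x_i))^2$, the negative gradient is $u_i = y_i - \hat{f}(x_i)$, i.e. the ordinary residuals.
- For the binomial log-likelihood loss $-[\,y_i\log\pi_i + (1 - y_i)\log(1 - \pi_i)\,]$ with $y_i \in \{0, 1\}$ and $\pi_i = 1/(1 + \exp(-\hat{f}(x_i)))$, the negative gradient is $u_i = y_i - \pi_i$.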