> [!Information]
> - **Authors/Presenters**: Andreas Mayr, Harald Binder, Olaf Gefeller, Matthias Schmid
> - **Journal/Seminar Name**: Methods of Information in Medicine
> - **Date**: August 12, 2014
> - **Related Topics**: Boosting Algorithm
> - **Link**: [The evolution of boosting algorithms. From machine learning to statistical modelling - PubMed](https://pubmed.ncbi.nlm.nih.gov/25112367/)
### One-line Summary
Boosting was reinterpreted as an algorithm for fitting additive models and thereby became an interpretable statistical modelling tool.
### Introduction
- Boosting is considered one of the most important methods for supervised learning.
- **This review has two purposes:**
- One is to highlight the evolution of boosting from a black-box prediction machine into an interpretable statistical modelling tool.
- The other is to show that the two classes of statistical boosting algorithms, gradient boosting and likelihood-based boosting, share the same historical roots.
### Boosting in Machine Learning
- **The basic idea of boosting is to improve prediction accuracy by combining the predictions of many weak classifiers into a single strong one.**
```python
# Generic boosting algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input H     : class of weak classifiers,  Loss : loss function
# Input V     : voting rule for combining the weak classifiers
w_1 <- initial weight vector for the training data I
for m = 1 to M do:
    h_m   <- the classifier h in H that minimizes Loss(w_m, I)
    a_m   <- performance score of h_m based on the Loss value
    w_m+1 <- new weight vector with increased weights for the
             observations that are difficult to fit
return V({h_m(I_new)}, {a_m})
```
- **The most important example of a boosting algorithm is *AdaBoost*.**
```python
# AdaBoost (Adaptive Boosting) algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input H     : class of weak binary classifiers,  Loss : loss function
w_1 <- initial weight vector for the training data I
for m = 1 to M do:
    h_m   <- the classifier h in H that minimizes Loss(w_m, I)
    a_m   <- performance score of h_m based on the Loss value
    w_m+1 <- new weight vector with increased weights for the
             observations that are difficult to fit
{z_m} <- the weighted votes a_m * h_m(I_new) for m = 1, ..., M
return sign(sum(z_m))
```
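- A minimal runnable sketch of the AdaBoost scheme above, using decision stumps as weak classifiers; the simulated dataset, the number of iterations, and the use of scikit-learn are illustrative assumptions rather than details from the paper:
```python
# Hand-rolled AdaBoost sketch (illustrative); labels are coded as -1/+1.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = 2 * y - 1                      # recode labels from {0, 1} to {-1, +1}

M = 50                             # number of boosting iterations
n = len(y)
w = np.full(n, 1.0 / n)            # w_1: uniform initial weights
stumps, alphas = [], []

for m in range(M):
    stump = DecisionTreeClassifier(max_depth=1)               # weak classifier h_m
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted training error
    alpha = 0.5 * np.log((1 - err) / err)                     # a_m: vote of h_m
    w = w * np.exp(-alpha * y * pred)                         # upweight misclassified points
    w = w / w.sum()                                           # w_(m+1): renormalize
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted majority vote over all weak classifiers
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))
```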
- In practice, simple classification trees or stumps are often used as weak classifiers.
- Breiman, a leading expert in machine learning, stated, “Boosting is the best off-the-shelf classifier in the world.”
- *AdaBoost* seems to resist overfitting: the test error often does not increase even when the complexity of the final solution keeps growing.
- Two explanations have been proposed for this surprising behaviour: the margin interpretation (the ensemble continues to increase the classification margins even after the training error reaches zero) and the argument that whether overfitting occurs depends on the performance criterion that is considered.
### Statistical Boosting
- **Boosting was reinterpreted as an algorithm for fitting additive models and thereby became an interpretable statistical modelling tool.**
- As an additive model, the final solution of the boosting algorithm is expressed by $\hat{f}_M(\cdot) = \beta_0 + \sum_{j=1}^{p}h_j(x_j)$.
- Each base learner $h_j(\cdot)$ reflects the partial effect of the feature $x_j$. Researchers can **interpret** how $x_j$ affects the prediction by analyzing the shape of $h_j$.
- **A typical example of statistical boosting is Gradient Boosting.**
- **Gradient Boosting**
- Theoretical goal:
find $\hat{f}(\cdot) = argmin_{f(\cdot)}E_{Y, X}[Loss(Y,~f(X))]$
- Practical goal:
find $\hat{f}(\cdot) = argmin_{f(\cdot)}\sum_{i=1}^{n}Loss(y_i,~f(x_i))$
- Fundamental Idea:
1. In iteration $m$, compute the negative gradient vector $u$ of the loss function, evaluated at the current fit:
$u^{[m]} = (-\frac{\partial}{\partial f}Loss(y_i, f(X_{i\cdot}))|_{f=\hat{f}^{[m-1]}})_{i= 1,~...,~n}$
2. Select the base learner $\hat{h}_j$ which optimally fits $u$:
$\hat{h}^{[m]} = argmin_{j, h}~||u^{[m]} -h_j^{[m]}(X_{\cdot j})||$
```python
# Gradient boosting algorithm (pseudocode):
# Input I     : training data,  I_new : test data
# Input M     : maximum number of boosting iterations
# Input {h_j} : set of base learners,  Loss : loss function
# Input sl    : small step length, 0 < sl << 1
f^[0] <- initial predictor, e.g. f^[0] = 0
for m = 1 to M do:
    u^[m] <- negative gradient vector of the loss function,
             evaluated at the fit of the previous iteration f^[m-1]
    h^[m] <- the base learner that best fits u^[m]
    f^[m] <- f^[m-1] + sl * h^[m]
return f^[M](I_new)
```
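- A minimal runnable sketch of component-wise gradient boosting with squared-error loss and simple linear base learners (one per feature); the simulated data, step length, and number of iterations are illustrative assumptions rather than details from the paper:
```python
# Component-wise gradient boosting sketch (illustrative).
# With squared-error loss, the negative gradient u is simply the residual y - f.
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)  # only x_1 and x_3 matter

M, sl = 500, 0.1                   # iterations and small step length, 0 < sl << 1
f = np.zeros(n)                    # f^[0] = 0
coef = np.zeros(p)                 # accumulated partial effect of each feature

for m in range(M):
    u = y - f                                      # u^[m]: negative gradient of the loss
    betas = X.T @ u / np.sum(X**2, axis=0)         # univariate least-squares fit per feature
    rss = [np.sum((u - X[:, j] * betas[j]) ** 2) for j in range(p)]
    j_star = int(np.argmin(rss))                   # base learner that best fits u^[m]
    f = f + sl * betas[j_star] * X[:, j_star]      # f^[m] = f^[m-1] + sl * h^[m]
    coef[j_star] += sl * betas[j_star]

# Non-informative features keep coefficients near zero (intrinsic variable selection)
print("estimated coefficients:", np.round(coef, 2))
```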
- Remarks:
- In each iteration $m$, only the best-fitting base learner (and hence only one feature $x_j$) is added to the final predictor. As a result, the effective number of selected features is intentionally reduced, which yields intrinsic variable selection similar to lasso regularization.
- Virtually any function can serve as the loss function of gradient boosting, as long as it is (at least) convex and differentiable (two examples of the resulting negative gradients are given below).
- The outcome does not have to follow a distribution from the exponential family; gradient boosting can be applied whenever a suitable loss function is available.
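- Two standard examples of the negative gradient $u$ (not taken from the paper, added for illustration):
- For the squared-error loss $\frac{1}{2}(y_i - f(x_i))^2$, the negative gradient is $u_i = y_i - \hat{f}(x_i)$, i.e. the ordinary residuals.
- For the binomial log-likelihood loss $-[\,y_i\log\pi_i + (1 - y_i)\log(1 - \pi_i)\,]$ with $y_i \in \{0, 1\}$ and $\pi_i = 1/(1 + \exp(-\hat{f}(x_i)))$, the negative gradient is $u_i = y_i - \pi_i$.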