> [!Information]
> - **Authors/Presenters**: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
> - **Journal/Seminar Name**: 31st Conference on Neural Information Processing Systems
> - **Date**: December 4, 2017
> - **Related Topics**: Boosting Algorithm
> - **Link**: [LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)
### One-line Summary
LightGBM is an efficient gradient boosting algorithm based on two novel techniques: *Gradient-based One-Side Sampling* and *Exclusive Feature Bundling*.
### Introduction
- **The more features and instances the data has, the more time a conventional GBDT algorithm spends on training,**
- because conventional GBDT scans every data instance for every feature to estimate the information gain of all possible split points.
- **To reduce training time, the authors propose two novel techniques: *GOSS* and *EFB***
- **GOSS: Gradient-based One-Side Sampling**
- “Data instances with different gradients play different roles in the computation of information gain.”
- GOSS is a sampling method that keeps all instances with large gradients and randomly samples only a fraction of the instances with small gradients.
- **EFB: Exclusive Feature Bundling**
- In many practical cases, many features are (almost) exclusive, i.e., they rarely take nonzero values at the same time. Therefore, the number of effective features is smaller than the original number.
- EFB is a bundling method that replaces a set of exclusive features with a single bundled feature.
**➕ This is possible because tree-based algorithms split on bin boundaries within a feature rather than on the exact magnitudes of its values, so exclusive features can be mapped into disjoint bin ranges of one bundled feature without changing the available splits.**
- Example:
| F1 | F2 | F3 | Bundle F |
| -- | -- | -- | -- |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 2 |
| 0 | 1 | 0 | 3 |
| 0 | 0 | 1 | 4 |
### Preliminaries
- **GBDT and Its Complexity Analysis**
- Existing GBDT algorithms spend most of the time finding the best split point to train a decision tree.
- There are two popular algorithms for finding the best split point: the pre-sorted algorithm and the histogram-based algorithm.
- “Histogram-based algorithm is more efficient in both memory consumption and training speed.” (A sketch of histogram-based split finding follows this list.)
- **Related Work**
- **Most GBDT algorithms do not use any method to reduce the number of instances**, because GBDT has no native instance weights to guide such sampling, unlike AdaBoost. SGB (Stochastic Gradient Boosting) randomly drops a fraction of instances, but it hurts accuracy, so “it is not a desirable choice.”
- Common methods for reducing the number of features, such as PCA, assume that the features contain significant redundancy, which does not always hold in practice.
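The quoted efficiency claim is easier to see in code. Below is a minimal sketch, not LightGBM's actual implementation, of histogram-based split finding for one feature: gradients are accumulated into per-bin histograms in O(#data), and split points are then scanned in O(#bins). The names (`find_best_split`, `num_bins`) and the simple variance-gain score are illustrative assumptions.
```python
# Minimal sketch of histogram-based split finding for a single pre-binned feature.
import numpy as np

def find_best_split(bins: np.ndarray, g: np.ndarray, num_bins: int):
    """Return (best_bin_threshold, best_gain) using gradient histograms."""
    # Accumulate gradient sum and instance count per bin: O(#data).
    grad_hist = np.bincount(bins, weights=g, minlength=num_bins)
    cnt_hist = np.bincount(bins, minlength=num_bins)

    total_grad, total_cnt = grad_hist.sum(), cnt_hist.sum()
    best_gain, best_bin = -np.inf, None

    # Scan candidate split points over bins, not over raw values: O(#bins).
    left_grad, left_cnt = 0.0, 0
    for b in range(num_bins - 1):
        left_grad += grad_hist[b]
        left_cnt += cnt_hist[b]
        right_grad, right_cnt = total_grad - left_grad, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # Variance-gain style score (larger is better).
        gain = left_grad**2 / left_cnt + right_grad**2 / right_cnt - total_grad**2 / total_cnt
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Tiny usage example: 8 instances pre-binned into 4 bins.
bins = np.array([0, 0, 1, 1, 2, 2, 3, 3])
g = np.array([-1.0, -0.8, -0.6, -0.5, 0.4, 0.5, 0.9, 1.0])
print(find_best_split(bins, g, num_bins=4))  # best split is after bin 1 for this toy data
```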
### Gradient-based One-Side Sampling (GOSS)
```python
# Gradient-based One-Side Sampling (GOSS)
# Input I : training data, d : number of iterations
# Input a : sampling ratio of large-gradient data
# Input b : sampling ratio of small-gradient data
# Input loss : loss function, L : weak learner
models = []
fact = (1 - a) / b              # weight multiplier for the sampled small-gradient instances
topN = a * len(I)               # number of large-gradient instances to keep
randN = b * len(I)              # number of small-gradient instances to sample
for i = 1 to d:
    preds = models.predict(I)
    g = loss(I, preds)          # per-instance gradients
    w = [1] * len(I)            # per-instance weights
    sortedIdx = GetSortedIndices(abs(g))               # indices sorted by |gradient|, descending
    topSet = sortedIdx[1:topN]                          # keep all large-gradient instances
    randSet = RandomPick(sortedIdx[topN:len(I)], randN) # sample from the rest
    usedSet = topSet + randSet
    w[randSet] *= fact          # compensate for the changed data distribution
    newModel = L(I[usedSet], -g[usedSet], w[usedSet])
    models.append(newModel)
```
- **GOSS keeps all instances with large gradients and randomly samples only a fraction of the instances with small gradients.**
- The magnitude of an instance's gradient indicates how well that instance has been fit so far.
- Small gradient → well-trained, Large gradient → under-trained
- GOSS therefore keeps every instance with a large gradient, so training focuses on the “under-trained” cases.
- In addition, GOSS randomly samples a fraction of the small-gradient instances and re-weights them by $(1-a)/b$, so that the data distribution is not changed too much (see the numeric sketch below).
- **Let $\hat{d}$ be the best split point found on the instances sampled by GOSS. $\hat{d}$ approaches $d$, the best split point on the original data, as the number of instances grows (proof in the supplementary material linked below).**
![[Untitled.png]]
[supplementary-nips.pdf](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Supplementary.pdf)
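As a sanity check on the re-weighting factor $(1-a)/b$, here is a minimal numeric sketch (not from the paper): with synthetic gradients, the weighted gradient sum over the GOSS subset stays close to the gradient sum over the full data. The data, seed, and variable names are assumptions for illustration.
```python
# Minimal sketch: GOSS re-weighting keeps the gradient sum approximately unbiased.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=0.5, scale=1.0, size=100_000)   # synthetic per-instance gradients

a, b = 0.2, 0.1                 # keep top 20% by |g|, sample 10% of the data from the rest
fact = (1 - a) / b              # = 8.0, the compensation weight for small-gradient samples

order = np.argsort(-np.abs(g))  # indices sorted by |gradient|, descending
topN, randN = int(a * len(g)), int(b * len(g))
top_set = order[:topN]                                          # all large-gradient instances
rand_set = rng.choice(order[topN:], size=randN, replace=False)  # random small-gradient sample

full_sum = g.sum()                                          # gradient sum on the full data
goss_sum = g[top_set].sum() + fact * g[rand_set].sum()      # weighted sum on the GOSS subset
print(f"full data: {full_sum:.1f}, GOSS estimate: {goss_sum:.1f}")
```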
### Exclusive Feature Bundling (EFB)
```python
# Greedy Bundling
# Input F : features, K : max conflict count allowed per bundle
Construct graph G   # vertices = features; edge weights = number of conflicts between two features
searchOrder = G.sortByDegree()      # consider the most "conflicting" features first
bundles = {}, bundlesConflict = {}  # bundlesConflict[j]: conflicts already accumulated in bundle j
for i in searchOrder:
    needNew = True
    for j = 1 to len(bundles):
        cnt = ConflictCnt(bundles[j], F[i])
        if cnt + bundlesConflict[j] <= K:   # bundle j can still absorb F[i]
            bundles[j].add(F[i])
            bundlesConflict[j] += cnt
            needNew = False
            break
    if needNew:
        Add F[i] as a new bundle to bundles (with conflict count 0)
# Output: bundles
```
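For intuition, here is a runnable toy version of the greedy bundling idea (not the authors' code): `conflict_count` counts the rows where a candidate feature and a bundle are simultaneously nonzero, and sorting by nonzero count stands in for sorting by graph degree. All names and the example matrix are assumptions.
```python
# Toy greedy bundling on a dense feature matrix X (rows = instances, columns = features).
import numpy as np

def conflict_count(bundle_cols: list[int], col: int, X: np.ndarray) -> int:
    """Rows where feature `col` and any feature already in the bundle are both nonzero."""
    bundle_nonzero = (X[:, bundle_cols] != 0).any(axis=1)
    return int(np.sum(bundle_nonzero & (X[:, col] != 0)))

def greedy_bundling(X: np.ndarray, K: int) -> list[list[int]]:
    # Order features by total nonzero count (a simple stand-in for sorting by graph degree).
    order = np.argsort(-(X != 0).sum(axis=0))
    bundles, bundle_conflicts = [], []
    for i in order:
        for j, bundle in enumerate(bundles):
            cnt = conflict_count(bundle, i, X)
            if cnt + bundle_conflicts[j] <= K:   # bundle j can still absorb feature i
                bundle.append(int(i))
                bundle_conflicts[j] += cnt
                break
        else:                                    # no existing bundle fits: open a new one
            bundles.append([int(i)])
            bundle_conflicts.append(0)
    return bundles

# Usage: F1/F2/F3 from the running example are mutually exclusive, so K=0 puts them in one bundle.
X = np.array([[1, 0, 0],
              [2, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
print(greedy_bundling(X, K=0))   # [[0, 1, 2]]: one bundle containing all three features
```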
```python
# Merge Exclusive Features
# Input numData : number of data instances
# Input F : one bundle of (almost) exclusive features, already binned
binRanges = [0]      # binRanges[j] is the offset added to feature j's nonzero bins
totalBin = 0
for f in F:
    totalBin += f.numBin
    binRanges.append(totalBin)
newBin = Bin(numData)        # the merged bundle feature
for i = 1 to numData:
    newBin[i] = 0            # bin 0 means "every feature in the bundle is zero here"
    for j = 1 to len(F):
        if F[j].bin[i] != 0:
            # shift feature j's nonzero bins into their own range of the bundle
            newBin[i] = F[j].bin[i] + binRanges[j]
# Output: newBin, binRanges
```
- **EFB is a bundling method that replaces a set of (almost) exclusive features with a single bundled feature.**
- **The problems are (1) how to choose a set of features to be bundled together and (2) how to assign the value of the bundle.**
- **How to choose a set of features to be bundled together**
- The authors prove that partitioning features into the smallest number of exclusive bundles is NP-hard (by reduction from the graph coloring problem), so computing the optimal bundling is impractical.
![[Untitled 1.png]]
- Alternatively, the authors use a greedy approach:
- For each feature, add it to an existing bundle if doing so keeps that bundle's conflict count within the threshold K.
- Otherwise, create a new bundle for the feature.
- **How to assign the value of the bundle**
- “The key is to ensure that the values of the original features can be identified from the feature bundles.”
- Example:
| | F1 | F2 | F3 | Bundle F |
|---|----|----|----|----------|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 0 | 0 | 2 |
| 2 | 0 | 1 | 0 | 3 |
| 3 | 0 | 0 | 1 | 4 |
```
F1.numBin = 2, F2.numBin = 1, F3.numBin = 1   (counting only each feature's nonzero bins)
⇒ binRanges = [0, 2, 3, 4]
row 0 : F1.bin[0] = 1 ≠ 0 → Bundle F[0] = 1 + 0 = 1
row 1 : F1.bin[1] = 2 ≠ 0 → Bundle F[1] = 2 + 0 = 2
row 2 : F1.bin[2] = 0 == 0 → F2.bin[2] = 1 ≠ 0 → Bundle F[2] = 1 + 2 = 3
row 3 : F1.bin[3] = 0 == 0 → F2.bin[3] = 0 == 0 →
        F3.bin[3] = 1 ≠ 0 → Bundle F[3] = 1 + 3 = 4
```
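To confirm the trace, here is a runnable sketch (not the authors' code) of Merge Exclusive Features that reproduces the table above. It assumes `numBin` counts only each feature's nonzero bins (bin values 1..max), as in the trace, and all names are illustrative.
```python
# Merge pre-binned exclusive features into one bundle feature (0 = "all features zero").
import numpy as np

def merge_exclusive_features(feature_bins: list[np.ndarray]) -> tuple[np.ndarray, list[int]]:
    bin_ranges, total_bin = [0], 0
    for bins in feature_bins:
        total_bin += int(bins.max())      # nonzero bins of this feature, assuming values 1..max
        bin_ranges.append(total_bin)

    num_data = len(feature_bins[0])
    new_bin = np.zeros(num_data, dtype=int)
    for i in range(num_data):
        for j, bins in enumerate(feature_bins):
            if bins[i] != 0:              # exclusive, so at most one feature is nonzero per row
                new_bin[i] = bins[i] + bin_ranges[j]
    return new_bin, bin_ranges

# Usage: the running example with F1, F2, F3.
F1 = np.array([1, 2, 0, 0])
F2 = np.array([0, 0, 1, 0])
F3 = np.array([0, 0, 0, 1])
bundle, bin_ranges = merge_exclusive_features([F1, F2, F3])
print(bundle)      # [1 2 3 4]
print(bin_ranges)  # [0, 2, 3, 4]
```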
### My Thought
**Efficiency improvements are crucial.**
- LightGBM became a widely used model largely because of its efficiency improvements.
→ This suggests that similar improvements could be possible in causal discovery algorithms, as evidenced by the existence of the fast PC algorithm.
- Improving efficiency requires a deep understanding of both the model and the data, as LightGBM demonstrates.