> [!Information]
> - **Authors/Presenters**: Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
> - **Journal/Seminar Name**: 31st Conference on Neural Information Processing Systems
> - **Date**: December 4, 2017
> - **Related Topics**: Boosting Algorithm
> - **Link**: [LightGBM: A Highly Efficient Gradient Boosting Decision Tree](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf)
### One-line Summary
LightGBM is an efficient gradient boosting algorithm based on two novel techniques: *Gradient-based One-Side Sampling* and *Exclusive Feature Bundling*.
### Introduction
- **The more features and instances the data has, the more time a conventional GBDT algorithm spends on training,**
- because conventional GBDT scans every data instance for every feature to estimate the information gain of all possible split points.
- **To reduce training time, the authors propose two novel techniques: *GOSS* and *EFB***
- **GOSS: Gradient-based One-Side Sampling**
- “Data instances with different gradients play different roles in the computation of information gain.”
- GOSS is a sampling method that keeps all instances with large gradients and randomly samples only a fraction of the instances with small gradients.
- **EFB: Exclusive Feature Bundling**
- In many practical cases, many features are (almost) exclusive, i.e., they rarely take nonzero values at the same time. Therefore, the number of effective features is smaller than the original number.
- EFB is a bundling method that replaces a set of exclusive features with a single bundled feature.
**➕ This is possible because tree-based algorithms split on bin boundaries within a feature rather than on the exact magnitudes of its values, so exclusive features can be mapped into disjoint bin ranges of one bundled feature without changing the available splits.**
- Example:
| F1 | F2 | F3 | Bundle F |
| -- | -- | -- | -- |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 2 |
| 0 | 1 | 0 | 3 |
| 0 | 0 | 1 | 4 |
### Preliminaries
- **GBDT and Its Complexity Analysis**
- Existing GBDT algorithms spend most of the time finding the best split point to train a decision tree.
- There are two popular algorithms for finding the best split point: the pre-sorted algorithm and the histogram-based algorithm.
- “Histogram-based algorithm is more efficient in both memory consumption and training speed.” (A sketch of histogram-based split finding follows this list.)
- **Related Work**
- **Most GBDT algorithms do not use any method to reduce the number of instances**, because GBDT has no native instance weights to guide such sampling, unlike AdaBoost. SGB (Stochastic Gradient Boosting) randomly drops a fraction of instances, but it hurts accuracy, so “it is not a desirable choice.”
- Common methods for reducing the number of features, such as PCA, assume that the features contain significant redundancy, which does not always hold in practice.
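The quoted efficiency claim is easier to see in code. Below is a minimal sketch, not LightGBM's actual implementation, of histogram-based split finding for one feature: gradients are accumulated into per-bin histograms in O(#data), and split points are then scanned in O(#bins). The names (`find_best_split`, `num_bins`) and the simple variance-gain score are illustrative assumptions.
```python
# Minimal sketch of histogram-based split finding for a single pre-binned feature.
import numpy as np

def find_best_split(bins: np.ndarray, g: np.ndarray, num_bins: int):
    """Return (best_bin_threshold, best_gain) using gradient histograms."""
    # Accumulate gradient sum and instance count per bin: O(#data).
    grad_hist = np.bincount(bins, weights=g, minlength=num_bins)
    cnt_hist = np.bincount(bins, minlength=num_bins)

    total_grad, total_cnt = grad_hist.sum(), cnt_hist.sum()
    best_gain, best_bin = -np.inf, None

    # Scan candidate split points over bins, not over raw values: O(#bins).
    left_grad, left_cnt = 0.0, 0
    for b in range(num_bins - 1):
        left_grad += grad_hist[b]
        left_cnt += cnt_hist[b]
        right_grad, right_cnt = total_grad - left_grad, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        # Variance-gain style score (larger is better).
        gain = left_grad**2 / left_cnt + right_grad**2 / right_cnt - total_grad**2 / total_cnt
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

# Tiny usage example: 8 instances pre-binned into 4 bins.
bins = np.array([0, 0, 1, 1, 2, 2, 3, 3])
g = np.array([-1.0, -0.8, -0.6, -0.5, 0.4, 0.5, 0.9, 1.0])
print(find_best_split(bins, g, num_bins=4))  # best split is after bin 1 for this toy data
```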
### Gradient-based One-Side Sampling (GOSS)
```python
# Gradient-based One-Side Sampling (GOSS)
# Input I : training data, d : number of iterations
# Input a : sampling ratio of large-gradient data
# Input b : sampling ratio of small-gradient data
# Input loss : loss function, L : weak learner
models = []
fact = (1 - a) / b              # weight multiplier for the sampled small-gradient instances
topN = a * len(I)               # number of large-gradient instances to keep
randN = b * len(I)              # number of small-gradient instances to sample
for i = 1 to d:
    preds = models.predict(I)
    g = loss(I, preds)          # per-instance gradients
    w = [1] * len(I)            # per-instance weights
    sortedIdx = GetSortedIndices(abs(g))               # indices sorted by |gradient|, descending
    topSet = sortedIdx[1:topN]                          # keep all large-gradient instances
    randSet = RandomPick(sortedIdx[topN:len(I)], randN) # sample from the rest
    usedSet = topSet + randSet
    w[randSet] *= fact          # compensate for the changed data distribution
    newModel = L(I[usedSet], -g[usedSet], w[usedSet])
    models.append(newModel)
```
- **GOSS keeps all instances with large gradients and randomly samples only a fraction of the instances with small gradients.**
- The magnitude of an instance's gradient indicates how well that instance has been fit so far.
- Small gradient → well-trained, Large gradient → under-trained
- GOSS therefore keeps every instance with a large gradient, so training focuses on the “under-trained” cases.
- In addition, GOSS randomly samples a fraction of the small-gradient instances and re-weights them by $(1-a)/b$, so that the data distribution is not changed too much (see the numeric sketch below).
- **Let $\hat{d}$ be the best split point found on the instances sampled by GOSS. $\hat{d}$ approaches $d$, the best split point on the original data, as the number of instances grows (proof in the supplementary material linked below).**
![[Untitled.png]]
[supplementary-nips.pdf](https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Supplementary.pdf)
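As a sanity check on the re-weighting factor $(1-a)/b$, here is a minimal numeric sketch (not from the paper): with synthetic gradients, the weighted gradient sum over the GOSS subset stays close to the gradient sum over the full data. The data, seed, and variable names are assumptions for illustration.
```python
# Minimal sketch: GOSS re-weighting keeps the gradient sum approximately unbiased.
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(loc=0.5, scale=1.0, size=100_000)   # synthetic per-instance gradients

a, b = 0.2, 0.1                 # keep top 20% by |g|, sample 10% of the data from the rest
fact = (1 - a) / b              # = 8.0, the compensation weight for small-gradient samples

order = np.argsort(-np.abs(g))  # indices sorted by |gradient|, descending
topN, randN = int(a * len(g)), int(b * len(g))
top_set = order[:topN]                                          # all large-gradient instances
rand_set = rng.choice(order[topN:], size=randN, replace=False)  # random small-gradient sample

full_sum = g.sum()                                          # gradient sum on the full data
goss_sum = g[top_set].sum() + fact * g[rand_set].sum()      # weighted sum on the GOSS subset
print(f"full data: {full_sum:.1f}, GOSS estimate: {goss_sum:.1f}")
```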
### Exclusive Feature Bundling (EFB)
```python
# Greedy Bundling
# Input F : features, K : max conflict count allowed per bundle
Construct graph G   # vertices = features; edge weights = number of conflicts between two features
searchOrder = G.sortByDegree()      # consider the most "conflicting" features first
bundles = {}, bundlesConflict = {}  # bundlesConflict[j]: conflicts already accumulated in bundle j
for i in searchOrder:
    needNew = True
    for j = 1 to len(bundles):
        cnt = ConflictCnt(bundles[j], F[i])
        if cnt + bundlesConflict[j] <= K:   # bundle j can still absorb F[i]
            bundles[j].add(F[i])
            bundlesConflict[j] += cnt
            needNew = False
            break
    if needNew:
        Add F[i] as a new bundle to bundles (with conflict count 0)
# Output: bundles
```
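For intuition, here is a runnable toy version of the greedy bundling idea (not the authors' code): `conflict_count` counts the rows where a candidate feature and a bundle are simultaneously nonzero, and sorting by nonzero count stands in for sorting by graph degree. All names and the example matrix are assumptions.
```python
# Toy greedy bundling on a dense feature matrix X (rows = instances, columns = features).
import numpy as np

def conflict_count(bundle_cols: list[int], col: int, X: np.ndarray) -> int:
    """Rows where feature `col` and any feature already in the bundle are both nonzero."""
    bundle_nonzero = (X[:, bundle_cols] != 0).any(axis=1)
    return int(np.sum(bundle_nonzero & (X[:, col] != 0)))

def greedy_bundling(X: np.ndarray, K: int) -> list[list[int]]:
    # Order features by total nonzero count (a simple stand-in for sorting by graph degree).
    order = np.argsort(-(X != 0).sum(axis=0))
    bundles, bundle_conflicts = [], []
    for i in order:
        for j, bundle in enumerate(bundles):
            cnt = conflict_count(bundle, i, X)
            if cnt + bundle_conflicts[j] <= K:   # bundle j can still absorb feature i
                bundle.append(int(i))
                bundle_conflicts[j] += cnt
                break
        else:                                    # no existing bundle fits: open a new one
            bundles.append([int(i)])
            bundle_conflicts.append(0)
    return bundles

# Usage: F1/F2/F3 from the running example are mutually exclusive, so K=0 puts them in one bundle.
X = np.array([[1, 0, 0],
              [2, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
print(greedy_bundling(X, K=0))   # [[0, 1, 2]]: one bundle containing all three features
```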
```python
# Merge Exclusive Features
# Input numData : number of data instances
# Input F : one bundle of (almost) exclusive features, already binned
binRanges = [0]      # binRanges[j] is the offset added to feature j's nonzero bins
totalBin = 0
for f in F:
    totalBin += f.numBin
    binRanges.append(totalBin)
newBin = Bin(numData)        # the merged bundle feature
for i = 1 to numData:
    newBin[i] = 0            # bin 0 means "every feature in the bundle is zero here"
    for j = 1 to len(F):
        if F[j].bin[i] != 0:
            # shift feature j's nonzero bins into their own range of the bundle
            newBin[i] = F[j].bin[i] + binRanges[j]
# Output: newBin, binRanges
```
- **EFB is a bundling method that replaces a set of (almost) exclusive features with a single bundled feature.**
- **The problems are (1) how to choose a set of features to be bundled together and (2) how to assign the value of the bundle.**
- **How to choose a set of features to be bundled together**
- The authors prove that partitioning features into the smallest number of exclusive bundles is NP-hard (by reduction from the graph coloring problem), so computing the optimal bundling is impractical.
![[Untitled 1.png]]
- Alternatively, the authors use a greedy approach:
- For each feature, add it to an existing bundle if doing so keeps that bundle's conflict count within the threshold K.
- Otherwise, create a new bundle for the feature.
- **How to assign the value of the bundle**
- “The key is to ensure that the values of the original features can be identified from the feature bundles.”
- Example:
| | F1 | F2 | F3 | Bundle F |
|---|----|----|----|----------|
| 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 0 | 0 | 2 |
| 2 | 0 | 1 | 0 | 3 |
| 3 | 0 | 0 | 1 | 4 |
```
F1.numBin = 2, F2.numBin = 1, F3.numBin = 1   (counting only each feature's nonzero bins)
⇒ binRanges = [0, 2, 3, 4]
row 0 : F1.bin[0] = 1 ≠ 0 → Bundle F[0] = 1 + 0 = 1
row 1 : F1.bin[1] = 2 ≠ 0 → Bundle F[1] = 2 + 0 = 2
row 2 : F1.bin[2] = 0 == 0 → F2.bin[2] = 1 ≠ 0 → Bundle F[2] = 1 + 2 = 3
row 3 : F1.bin[3] = 0 == 0 → F2.bin[3] = 0 == 0 →
        F3.bin[3] = 1 ≠ 0 → Bundle F[3] = 1 + 3 = 4
```
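To confirm the trace, here is a runnable sketch (not the authors' code) of Merge Exclusive Features that reproduces the table above. It assumes `numBin` counts only each feature's nonzero bins (bin values 1..max), as in the trace, and all names are illustrative.
```python
# Merge pre-binned exclusive features into one bundle feature (0 = "all features zero").
import numpy as np

def merge_exclusive_features(feature_bins: list[np.ndarray]) -> tuple[np.ndarray, list[int]]:
    bin_ranges, total_bin = [0], 0
    for bins in feature_bins:
        total_bin += int(bins.max())      # nonzero bins of this feature, assuming values 1..max
        bin_ranges.append(total_bin)

    num_data = len(feature_bins[0])
    new_bin = np.zeros(num_data, dtype=int)
    for i in range(num_data):
        for j, bins in enumerate(feature_bins):
            if bins[i] != 0:              # exclusive, so at most one feature is nonzero per row
                new_bin[i] = bins[i] + bin_ranges[j]
    return new_bin, bin_ranges

# Usage: the running example with F1, F2, F3.
F1 = np.array([1, 2, 0, 0])
F2 = np.array([0, 0, 1, 0])
F3 = np.array([0, 0, 0, 1])
bundle, bin_ranges = merge_exclusive_features([F1, F2, F3])
print(bundle)      # [1 2 3 4]
print(bin_ranges)  # [0, 2, 3, 4]
```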
### My Thought
**Efficiency improvements are crucial.**
- LightGBM became a widely used model largely because of its efficiency improvements.
→ This suggests that similar improvements could be possible in causal discovery algorithms, as evidenced by the existence of the fast PC algorithm.
- Improving efficiency requires a deep understanding of both the model and the data, as LightGBM demonstrates.