Gradient Boosting (XGBoost, LightGBM, CatBoost)
The state-of-the-art for tabular data. Modern boosting variants trade off speed, accuracy, and categorical handling.
Theory & Concept
Gradient Boosting is currently the dominant technique for structured/tabular data (winning most Kaggle competitions). Unlike Random Forest, which builds trees independently of one another (in parallel), Gradient Boosting builds them sequentially.
The Process
- Train a weak model (e.g., a shallow tree).
- Calculate the errors (residuals) of this model.
- Train a new model to predict these errors.
- Add this new model to the ensemble (weighted by a learning rate).
- Repeat.
Essentially, each new tree is an "expert" on the mistakes of the previous trees.
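The loop above can be sketched in plain Python. This is a minimal, illustrative implementation for regression with squared loss, using one-split "stumps" as the weak learners; the helper names (`fit_stump`, `boost`) and the toy dataset are invented for this sketch, not part of any library.

```python
# Minimal gradient boosting sketch: each round fits a stump to the
# residuals of the current ensemble, then adds it scaled by a learning rate.

def fit_stump(X, residuals):
    """Find the single threshold split that best fits the residuals (min SSE)."""
    best = None
    for t in sorted(set(X)):
        left = [r for x, r in zip(X, residuals) if x <= t]
        right = [r for x, r in zip(X, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(X, y, n_rounds=50, lr=0.1):
    base = sum(y) / len(y)                  # start from the mean prediction
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # for squared loss, the negative gradient IS the residual
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, X)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Toy data: two clusters of targets around 1.1 and 3.1
X = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 1.1, 3.0, 3.2, 3.1]
model = boost(X, y)
print(round(model(2), 2), round(model(5), 2))
```

After 50 rounds the ensemble's predictions settle near the two cluster means, showing how many weak stumps combine into an accurate model.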
XGBoost (eXtreme Gradient Boosting) is an optimized implementation that includes:
- Regularization (L1/L2) to prevent overfitting.
- Handling Missing Values automatically.
- Parallel Processing for tree construction.
Boosting Families
AdaBoost is the classic boosting algorithm. It increases the weight of misclassified samples each round so the next weak learner focuses on hard cases. It is simple and interpretable, but can struggle with noisy labels and large-scale data.
LightGBM is a fast, modern gradient boosting framework. It uses histogram-based feature binning and leaf-wise tree growth, which makes it very efficient on large datasets while often matching or exceeding XGBoost accuracy.
CatBoost is designed for categorical features. It uses ordered target statistics to avoid target leakage and delivers strong performance with minimal preprocessing.
Comparison (Quick Take)
| Model | Strengths | Weaknesses | Best Use |
|---|---|---|---|
| AdaBoost | Simple, strong baseline, works well with weak learners | Sensitive to noise and outliers | Small-to-medium clean datasets |
| XGBoost | Strong accuracy, robust regularization, great defaults | Heavier tuning, slower than LightGBM | General-purpose tabular ML |
| LightGBM | Very fast training, scales to large data | Can overfit small data, needs tuning | Large datasets, speed-critical |
| CatBoost | Native categorical handling, minimal preprocessing | Slower than LightGBM, heavier model | Mixed numeric/categorical data |
When to Choose Which (Top-Down)
- Many categorical features? Start with CatBoost.
- Large dataset or speed-critical? Start with LightGBM.
- Otherwise, XGBoost is a robust general-purpose default; AdaBoost is mainly a simple baseline.
Mathematical Intuition
We want to find a function $F(x)$ that minimizes a loss function $L(y, F(x))$.
At step $m$, we add a new tree $h_m$:

$$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$$

Where:
- $\eta$ is the Learning Rate (shrinkage).
- $h_m$ fits the negative gradient of the loss (pseudo-residuals).
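For squared loss $L = \tfrac{1}{2}(y - F(x))^2$, the negative gradient with respect to $F(x)$ is exactly the residual $y - F(x)$, so "fitting the errors" and "fitting the negative gradient" coincide. A tiny illustration (the toy numbers are invented for this sketch):

```python
# Pseudo-residuals under squared loss are just the plain residuals.
y = [3.0, 5.0, 8.0]
F = [2.5, 5.5, 7.0]          # current ensemble predictions

pseudo_residuals = [yi - fi for yi, fi in zip(y, F)]
print(pseudo_residuals)       # these become the targets for the next tree
```

For other losses (e.g. log loss), the pseudo-residuals differ from the raw errors, which is what makes the gradient view more general.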
Quick Readiness Check
Is this method a fit for your use case?
Best For
Winning Competitions. Anything categorical/tabular where accuracy is paramount (Fraud, Risk).
Prerequisites
No scaling needed. Encoding categorical variables is required (unless using CatBoost).
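To make the encoding prerequisite concrete, here is a minimal one-hot encoding sketch in pure Python (in practice you would use `pandas.get_dummies` or scikit-learn's `OneHotEncoder`; the `one_hot` helper here is invented for illustration):

```python
# Boosting libraries other than CatBoost expect numeric inputs,
# so string categories must be encoded first.

def one_hot(values):
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

colors = ["red", "green", "red", "blue"]
print(one_hot(colors))   # columns ordered: blue, green, red
```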
Strengths
Highest accuracy. Handles missing data. Flexible loss functions.
Weaknesses
Harder to tune than Random Forest. Prone to overfitting. Training is sequential, so it is slower and harder to parallelize.
Pro Tip
Major trade-off: speed vs. accuracy. AdaBoost is the classic baseline, LightGBM is the fastest, CatBoost handles categorical features best, and XGBoost remains the robust general-purpose default.
Code Snippet
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# 1. Initialize
# learning_rate (eta): step size. Lower is better but requires more n_estimators.
# Note: early_stopping_rounds moved to the constructor in XGBoost >= 2.0.
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42
)

# 2. Train with Early Stopping
# Stops if the score on the last eval_set doesn't improve for 50 rounds
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(
    X_train, y_train,
    eval_set=eval_set,
    verbose=False
)

# 3. Predict
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
print(f"Best Iteration: {model.best_iteration}")

LightGBM (Fast on Large Tabular Data)
from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import accuracy_score

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# LightGBM >= 4.0 passes early stopping via callbacks, not fit() kwargs
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="logloss",
    callbacks=[early_stopping(50), log_evaluation(0)]
)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

CatBoost (Categorical-Friendly)
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# cat_features: list of column indices (or names) for categorical features
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    loss_function="Logloss",
    verbose=False,
    random_state=42
)

model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50
)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

AdaBoost (Classic Baseline)
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

model = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    random_state=42
)

model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

Parameter Tuning Cheat Sheet
| Parameter | Options / Range | Effect & Best Practice |
|---|---|---|
| learning_rate | 0.01 - 0.3 | Most important. Lower learning rate + more n_estimators = better (but slower). |
| max_depth | 3 - 10 | Keep small! (3-6). Deep trees overfit quickly in boosting. |
| subsample | 0.5 - 1.0 | Fraction of rows to use per tree. 0.8 is a good default. |
| scale_pos_weight | 1 - 100 | Use this for Imbalanced Data (creates weighted loss). |
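A common heuristic for `scale_pos_weight` is the ratio of negative to positive samples, so the minority positive class contributes comparably to the loss. A quick sketch of the computation on invented toy labels:

```python
# Compute scale_pos_weight = (negative count) / (positive count)
y = [0] * 950 + [1] * 50      # toy labels: 5% positive class

neg, pos = y.count(0), y.count(1)
scale_pos_weight = neg / pos
print(scale_pos_weight)        # 19.0 -> pass to XGBClassifier(scale_pos_weight=...)
```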