
Gradient Boosting (XGBoost, LightGBM, CatBoost)

The state-of-the-art for tabular data. Modern boosting variants trade off speed, accuracy, and categorical handling.


Theory & Concept

Gradient Boosting is currently the dominant technique for structured/tabular data and a frequent winner of tabular Kaggle competitions. Unlike Random Forest, which builds trees independently of each other (in parallel), Gradient Boosting builds them sequentially, with each tree correcting its predecessors.

The Process

  1. Train a weak model (e.g., a shallow tree).
  2. Calculate the errors (residuals) of this model.
  3. Train a new model to predict these errors.
  4. Add this new model to the ensemble (weighted by a learning rate).
  5. Repeat.

Essentially, each new tree is an "expert" on the mistakes of the previous trees.
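As a sketch, the five steps above can be written in a few lines for squared-error loss, using shallow scikit-learn trees as the weak learners (illustrative only; real libraries add regularization and far more machinery):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

eta = 0.1                        # learning rate (shrinkage)
n_rounds = 50
F = np.full_like(y, y.mean())    # start from a constant (mean) prediction
trees = []

for _ in range(n_rounds):
    residuals = y - F                          # step 2: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)  # steps 1 & 3: weak learner on the errors
    tree.fit(X, residuals)
    F += eta * tree.predict(X)                 # step 4: add it, scaled by the learning rate
    trees.append(tree)                         # step 5: repeat

print(f"MSE after boosting: {np.mean((y - F) ** 2):.4f}")
```

Each iteration fits the *residuals* of the running ensemble, which is exactly the "expert on the mistakes of the previous trees" idea.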

XGBoost (eXtreme Gradient Boosting) is an optimized implementation that includes:

  • Regularization (L1/L2) to prevent overfitting.
  • Automatic handling of missing values.
  • Parallelized tree construction.
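The L2 regularization enters through XGBoost's optimal leaf-weight formula, w* = -G / (H + λ), where G and H are the sums of loss gradients and Hessians for the samples in a leaf and λ is the L2 penalty (`reg_lambda`). A quick numeric sketch (toy numbers, squared-error loss) of how λ shrinks leaf weights:

```python
import numpy as np

# Toy residuals for the samples falling into one leaf
residuals = np.array([2.0, 1.5, 2.5, 1.0])

# For squared-error loss: gradient g_i = -r_i, hessian h_i = 1,
# so G = -sum(r_i), H = n, and the optimal leaf weight is w* = -G / (H + lambda)
G = -residuals.sum()
H = float(len(residuals))

for lam in [0.0, 1.0, 10.0]:
    w = -G / (H + lam)   # larger lambda -> smaller (more conservative) leaf weight
    print(f"lambda={lam:>4}: leaf weight = {w:.3f}")
```

With λ = 0 the leaf simply predicts the mean residual; increasing λ pulls the prediction toward zero, which is what tames overfitting.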

Boosting Families

AdaBoost is the classic boosting algorithm. It increases the weight of misclassified samples each round so the next weak learner focuses on hard cases. It is simple and interpretable, but can struggle with noisy labels and large-scale data.
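A minimal sketch of one AdaBoost reweighting round (toy labels and predictions; the real algorithm repeats this with a freshly fitted weak learner each round):

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])     # one weak learner's predictions
w = np.full(len(y_true), 1 / len(y_true))  # sample weights start uniform

miss = y_pred != y_true
err = w[miss].sum()                        # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)      # this learner's vote weight

# Misclassified samples get heavier, correct ones lighter, then renormalize
w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
w /= w.sum()
print(np.round(w, 3))
```

After normalization the misclassified samples carry half the total weight, so the next weak learner is forced to focus on them.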

LightGBM is a fast, modern gradient boosting framework. It uses histogram-based feature binning and leaf-wise tree growth, which makes it very efficient on large datasets while often matching or exceeding XGBoost accuracy.

CatBoost is designed for categorical features. It uses ordered target statistics to avoid target leakage and delivers strong performance with minimal preprocessing.
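A toy sketch of the ordered target statistic idea (heavily simplified; actual CatBoost uses multiple random permutations and further refinements):

```python
import numpy as np

cats = np.array(["a", "b", "a", "a", "b"])
y = np.array([1, 0, 1, 0, 1])
prior, strength = y.mean(), 1.0   # smoothing toward the global mean

# Ordered target statistic: each row is encoded using only *earlier* rows'
# targets for the same category, so the encoding never sees the row's own label.
enc = np.empty(len(y))
sums, counts = {}, {}
for i, c in enumerate(cats):      # assume rows were already shuffled
    s, n = sums.get(c, 0.0), counts.get(c, 0)
    enc[i] = (s + strength * prior) / (n + strength)
    sums[c] = s + y[i]
    counts[c] = n + 1
print(np.round(enc, 3))
```

Because row i never contributes to its own encoding, the model cannot memorize the target through the category column, which is the leakage that naive target encoding suffers from.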

Comparison (Quick Take)

| Model | Strengths | Weaknesses | Best Use |
| --- | --- | --- | --- |
| AdaBoost | Simple, strong baseline, works well with weak learners | Sensitive to noise and outliers | Small-to-medium clean datasets |
| XGBoost | Strong accuracy, robust regularization, great defaults | Heavier tuning, slower than LightGBM | General-purpose tabular ML |
| LightGBM | Very fast training, scales to large data | Can overfit small data, needs tuning | Large datasets, speed-critical |
| CatBoost | Native categorical handling, minimal preprocessing | Slower than LightGBM, heavier model | Mixed numeric/categorical data |

When to Choose Which (Top-Down)

  1. Mostly categorical features? Start with CatBoost.
  2. Very large dataset, or training speed matters? Start with LightGBM.
  3. General-purpose tabular problem? XGBoost is a safe default.
  4. Need a simple, interpretable baseline? Use AdaBoost.

Mathematical Intuition

We want to find a function F(x) that minimizes a loss function L(y, F(x)).

At step m, we add a new tree h_m(x):

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

Where:

  • \eta is the Learning Rate (shrinkage).
  • h_m(x) fits the negative gradient of the loss (the pseudo-residuals):

h_m(x) \approx -\frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}
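For squared-error loss L = ½(y − F)², that negative gradient is just the ordinary residual y − F, which a quick finite-difference check confirms:

```python
import numpy as np

# For L = 0.5 * (y - F)^2, the negative gradient with respect to F
# should equal the ordinary residual y - F.
y, F = 3.0, 2.2
eps = 1e-6
grad = (0.5 * (y - (F + eps)) ** 2 - 0.5 * (y - (F - eps)) ** 2) / (2 * eps)
print(f"negative gradient: {-grad:.6f}, residual: {y - F:.6f}")
```

This is why "fit the residuals" (the intuitive description) and "fit the negative gradient" (the general algorithm) coincide for squared error; other losses give different pseudo-residuals.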

Quick Readiness Check

Is this method a fit for your use case?

Best For

Winning Competitions. Anything categorical/tabular where accuracy is paramount (Fraud, Risk).

Prerequisites

No scaling needed. Encoding categorical variables is required (unless using CatBoost).

Strengths

Highest accuracy. Handles missing data. Flexible loss functions.

Weaknesses

Harder to tune than RF. Prone to overfitting. Training is sequential, so it parallelizes less well than Random Forest.

Pro Tip

Major trade-off: Speed vs. Accuracy. AdaBoost is the historical baseline, LightGBM is the fastest, CatBoost handles categories best, and XGBoost remains the general-purpose default.


Code Snippet

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# 1. Initialize
# learning_rate (eta): step size. Lower is better but requires more n_estimators.
# early_stopping_rounds: stop if the validation score doesn't improve for 50 rounds.
#   (Constructor argument since XGBoost 2.0; older versions passed it to .fit().)
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42
)

# 2. Train with early stopping on a held-out validation set
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# 3. Predict
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
print(f"Best Iteration: {model.best_iteration}")

LightGBM (Fast on Large Tabular Data)

from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import accuracy_score

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,          # -1 = no depth limit; control complexity via num_leaves
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# early_stopping_rounds / verbose were removed from fit() in LightGBM 4.x;
# use callbacks instead
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="logloss",
    callbacks=[early_stopping(50), log_evaluation(0)]
)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

CatBoost (Categorical-Friendly)

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
 
# cat_features: list of column indices (or names) for categorical features
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    loss_function="Logloss",
    verbose=False,
    random_state=42
)
 
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50
)
 
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

AdaBoost (Classic Baseline)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
 
model = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    random_state=42
)
 
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

Parameter Tuning Cheat Sheet

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| learning_rate | 0.01 - 0.3 | Most important. Lower LR + higher n_estimators = better (but slower). |
| max_depth | 3 - 10 | Keep small (3-6). Deep trees overfit quickly in boosting. |
| subsample | 0.5 - 1.0 | Fraction of rows used per tree. 0.8 is a good default. |
| scale_pos_weight | 1 - 100 | Use for imbalanced data (weights the positive class in the loss). |
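A common heuristic is to set scale_pos_weight to the ratio of negative to positive examples, computed directly from the training labels (toy class counts here for illustration):

```python
import numpy as np

# Imbalanced binary labels: 5% positive class
y = np.array([0] * 950 + [1] * 50)

neg, pos = np.sum(y == 0), np.sum(y == 1)
scale_pos_weight = neg / pos   # weight positives so each class contributes equally
print(f"scale_pos_weight = {scale_pos_weight:.1f}")
```

You would then pass this value to the classifier, e.g. `XGBClassifier(scale_pos_weight=scale_pos_weight, ...)`, and tune it from there if precision/recall trade-offs demand it.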