
Gradient Boosting (XGBoost, LightGBM, CatBoost)

The state-of-the-art for tabular data. Modern boosting variants trade off speed, accuracy, and categorical handling.


Theory & Concept

Gradient Boosting is currently the dominant technique for structured/tabular data and a frequent winner of tabular Kaggle competitions. Unlike Random Forest, which builds trees independently of each other (in parallel), Gradient Boosting builds them sequentially, with each tree correcting its predecessors.

The Process

  1. Train a weak model (e.g., a shallow tree).
  2. Calculate the errors (residuals) of this model.
  3. Train a new model to predict these errors.
  4. Add this new model to the ensemble (weighted by a learning rate).
  5. Repeat.

Essentially, each new tree is an "expert" on the mistakes of the previous trees.
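As a sketch, the five steps above can be written in a few lines for squared-error loss, using shallow scikit-learn trees as the weak learners (illustrative only; real libraries add regularization and far more machinery):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data: y = x^2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=200)

eta = 0.1                        # learning rate (shrinkage)
n_rounds = 50
F = np.full_like(y, y.mean())    # start from a constant (mean) prediction
trees = []

for _ in range(n_rounds):
    residuals = y - F                          # step 2: errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)  # steps 1 & 3: weak learner on the errors
    tree.fit(X, residuals)
    F += eta * tree.predict(X)                 # step 4: add it, scaled by the learning rate
    trees.append(tree)                         # step 5: repeat

print(f"MSE after boosting: {np.mean((y - F) ** 2):.4f}")
```

Each iteration fits the *residuals* of the running ensemble, which is exactly the "expert on the mistakes of the previous trees" idea.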

XGBoost (eXtreme Gradient Boosting) is an optimized implementation that includes:

  • Regularization (L1/L2) to prevent overfitting.
  • Automatic handling of missing values.
  • Parallelized tree construction.
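The L2 regularization enters through XGBoost's optimal leaf-weight formula, w* = -G / (H + λ), where G and H are the sums of loss gradients and Hessians for the samples in a leaf and λ is the L2 penalty (`reg_lambda`). A quick numeric sketch (toy numbers, squared-error loss) of how λ shrinks leaf weights:

```python
import numpy as np

# Toy residuals for the samples falling into one leaf
residuals = np.array([2.0, 1.5, 2.5, 1.0])

# For squared-error loss: gradient g_i = -r_i, hessian h_i = 1,
# so G = -sum(r_i), H = n, and the optimal leaf weight is w* = -G / (H + lambda)
G = -residuals.sum()
H = float(len(residuals))

for lam in [0.0, 1.0, 10.0]:
    w = -G / (H + lam)   # larger lambda -> smaller (more conservative) leaf weight
    print(f"lambda={lam:>4}: leaf weight = {w:.3f}")
```

With λ = 0 the leaf simply predicts the mean residual; increasing λ pulls the prediction toward zero, which is what tames overfitting.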

Boosting Families

AdaBoost is the classic boosting algorithm. It increases the weight of misclassified samples each round so the next weak learner focuses on hard cases. It is simple and interpretable, but can struggle with noisy labels and large-scale data.
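A minimal sketch of one AdaBoost reweighting round (toy labels and predictions; the real algorithm repeats this with a freshly fitted weak learner each round):

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])     # one weak learner's predictions
w = np.full(len(y_true), 1 / len(y_true))  # sample weights start uniform

miss = y_pred != y_true
err = w[miss].sum()                        # weighted error rate
alpha = 0.5 * np.log((1 - err) / err)      # this learner's vote weight

# Misclassified samples get heavier, correct ones lighter, then renormalize
w *= np.exp(alpha * np.where(miss, 1.0, -1.0))
w /= w.sum()
print(np.round(w, 3))
```

After normalization the misclassified samples carry half the total weight, so the next weak learner is forced to focus on them.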

LightGBM is a fast, modern gradient boosting framework. It uses histogram-based feature binning and leaf-wise tree growth, which makes it very efficient on large datasets while often matching or exceeding XGBoost accuracy.

CatBoost is designed for categorical features. It uses ordered target statistics to avoid target leakage and delivers strong performance with minimal preprocessing.
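A toy sketch of the ordered target statistic idea (heavily simplified; actual CatBoost uses multiple random permutations and further refinements):

```python
import numpy as np

cats = np.array(["a", "b", "a", "a", "b"])
y = np.array([1, 0, 1, 0, 1])
prior, strength = y.mean(), 1.0   # smoothing toward the global mean

# Ordered target statistic: each row is encoded using only *earlier* rows'
# targets for the same category, so the encoding never sees the row's own label.
enc = np.empty(len(y))
sums, counts = {}, {}
for i, c in enumerate(cats):      # assume rows were already shuffled
    s, n = sums.get(c, 0.0), counts.get(c, 0)
    enc[i] = (s + strength * prior) / (n + strength)
    sums[c] = s + y[i]
    counts[c] = n + 1
print(np.round(enc, 3))
```

Because row i never contributes to its own encoding, the model cannot memorize the target through the category column, which is the leakage that naive target encoding suffers from.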

Comparison (Quick Take)

| Model | Strengths | Weaknesses | Best Use |
| --- | --- | --- | --- |
| AdaBoost | Simple, strong baseline, works well with weak learners | Sensitive to noise and outliers | Small-to-medium clean datasets |
| XGBoost | Strong accuracy, robust regularization, great defaults | Heavier tuning, slower than LightGBM | General-purpose tabular ML |
| LightGBM | Very fast training, scales to large data | Can overfit small data, needs tuning | Large datasets, speed-critical |
| CatBoost | Native categorical handling, minimal preprocessing | Slower than LightGBM, heavier model | Mixed numeric/categorical data |

When to Choose Which (Top-Down)

  1. Mostly categorical features? Start with CatBoost.
  2. Very large dataset, or training speed matters? Start with LightGBM.
  3. General-purpose tabular problem? XGBoost is a safe default.
  4. Need a simple, interpretable baseline? Use AdaBoost.

Mathematical Intuition

We want to find a function F(x) that minimizes a loss function L(y, F(x)).

At step m, we add a new tree h_m(x):

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

Where:

  • \eta is the Learning Rate (shrinkage).
  • h_m(x) fits the negative gradient of the loss (the pseudo-residuals):

h_m(x) \approx -\frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)}
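For squared-error loss L = ½(y − F)², that negative gradient is just the ordinary residual y − F, which a quick finite-difference check confirms:

```python
import numpy as np

# For L = 0.5 * (y - F)^2, the negative gradient with respect to F
# should equal the ordinary residual y - F.
y, F = 3.0, 2.2
eps = 1e-6
grad = (0.5 * (y - (F + eps)) ** 2 - 0.5 * (y - (F - eps)) ** 2) / (2 * eps)
print(f"negative gradient: {-grad:.6f}, residual: {y - F:.6f}")
```

This is why "fit the residuals" (the intuitive description) and "fit the negative gradient" (the general algorithm) coincide for squared error; other losses give different pseudo-residuals.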

Quick Readiness Check

Is this method a fit for your use case?

Best For

Winning Competitions. Anything categorical/tabular where accuracy is paramount (Fraud, Risk).

Prerequisites

No scaling needed. Encoding categorical variables is required (unless using CatBoost).

Strengths

Highest accuracy. Handles missing data. Flexible loss functions.

Weaknesses

Harder to tune than RF. Prone to overfitting. Training is sequential, so it parallelizes less well than Random Forest.

Pro Tip

Major trade-off: Speed vs. Accuracy. AdaBoost is the historical baseline, LightGBM is the fastest, CatBoost handles categories best, and XGBoost remains the general-purpose default.


Code Snippet

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# 1. Initialize
# learning_rate (eta): step size. Lower is better but requires more n_estimators.
# early_stopping_rounds: stop if the validation score doesn't improve for 50 rounds.
#   (Constructor argument since XGBoost 2.0; older versions passed it to .fit().)
model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    early_stopping_rounds=50,
    n_jobs=-1,
    random_state=42
)

# 2. Train with early stopping on a held-out validation set
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    verbose=False
)

# 3. Predict
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
print(f"Best Iteration: {model.best_iteration}")

LightGBM (Fast on Large Tabular Data)

from lightgbm import LGBMClassifier, early_stopping, log_evaluation
from sklearn.metrics import accuracy_score

model = LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=-1,          # -1 = no depth limit; control complexity via num_leaves
    num_leaves=63,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# early_stopping_rounds / verbose were removed from fit() in LightGBM 4.x;
# use callbacks instead
model.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    eval_metric="logloss",
    callbacks=[early_stopping(50), log_evaluation(0)]
)

preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

CatBoost (Categorical-Friendly)

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
 
# cat_features: list of column indices (or names) for categorical features
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    loss_function="Logloss",
    verbose=False,
    random_state=42
)
 
model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    early_stopping_rounds=50
)
 
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

AdaBoost (Classic Baseline)

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
 
model = AdaBoostClassifier(
    n_estimators=200,
    learning_rate=0.5,
    random_state=42
)
 
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")

Parameter Tuning Cheat Sheet

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| learning_rate | 0.01 - 0.3 | Most important. Lower LR + higher n_estimators = better (but slower). |
| max_depth | 3 - 10 | Keep small (3-6). Deep trees overfit quickly in boosting. |
| subsample | 0.5 - 1.0 | Fraction of rows used per tree. 0.8 is a good default. |
| scale_pos_weight | 1 - 100 | Use for imbalanced data (weights the positive class in the loss). |
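A common heuristic is to set scale_pos_weight to the ratio of negative to positive examples, computed directly from the training labels (toy class counts here for illustration):

```python
import numpy as np

# Imbalanced binary labels: 5% positive class
y = np.array([0] * 950 + [1] * 50)

neg, pos = np.sum(y == 0), np.sum(y == 1)
scale_pos_weight = neg / pos   # weight positives so each class contributes equally
print(f"scale_pos_weight = {scale_pos_weight:.1f}")
```

You would then pass this value to the classifier, e.g. `XGBClassifier(scale_pos_weight=scale_pos_weight, ...)`, and tune it from there if precision/recall trade-offs demand it.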