
Random Forest

An ensemble learning method that operates by constructing a multitude of decision trees. Mitigates the overfitting tendency of individual trees.


Theory & Concept

Random Forest is an ensemble method that combines multiple Decision Trees to create a more robust and accurate model. It relies on the "Wisdom of Crowds"—while individual trees might be noisy or prone to overfitting, their average is often very stable.

It uses a technique called Bagging (Bootstrap Aggregating):

  1. Bootstrapping: Train each tree on a random sample of the data (with replacement).
  2. Feature Randomness: At each split, consider only a random subset of features (not all).
  3. Aggregating: Average the predictions (Regression) or take the majority vote (Classification).
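The three steps above can be sketched by hand with plain Decision Trees. This is a minimal illustration of what `RandomForestClassifier` automates, using a synthetic dataset purely for demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data (synthetic, for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
trees = []

for _ in range(n_trees):
    # 1. Bootstrapping: each tree sees a random sample drawn WITH replacement
    idx = rng.integers(0, len(X), size=len(X))
    # 2. Feature randomness: max_features='sqrt' restricts each split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# 3. Aggregating: majority vote across all trees (binary labels 0/1)
votes = np.stack([t.predict(X) for t in trees])     # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)  # one vote per sample
print("Ensemble training accuracy:", (majority == y).mean())
```

In practice you would never write this loop yourself; `RandomForestClassifier` does all three steps internally (and in parallel).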

Why "Random"?

The randomness comes from two places:

  • Random sampling of data rows (Bagging).
  • Random sampling of features at each split.

This de-correlates the trees, ensuring they don't all make the same mistakes.
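You can observe this de-correlation directly: the individual trees inside a fitted forest (exposed via the `estimators_` attribute) disagree with each other on some inputs, while the aggregated forest stays accurate. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data (synthetic, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Predictions of each individual tree on the same inputs
per_tree = np.stack([t.predict(X) for t in rf.estimators_])

# Fraction of samples on which two trees disagree with each other
disagreement = (per_tree[0] != per_tree[1]).mean()
print(f"Tree 0 vs tree 1 disagree on {disagreement:.0%} of samples")
print("Forest training accuracy:", rf.score(X, y))
```

Because each tree saw a different bootstrap sample and different feature subsets, their individual errors differ, so the majority vote cancels much of the noise.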


Mathematical Intuition

If we have $T$ independent trees, each with variance $\sigma^2$, the variance of their average is:

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{T}$$

By increasing $T$ (the number of trees), we reduce the variance of the final model without increasing the bias. This is why Random Forest is harder to overfit than a single Decision Tree.
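The $\sigma^2 / T$ rule is easy to verify numerically. Here each "tree" is simulated as an independent noisy estimate with variance $\sigma^2 = 4$, and we check the variance of the average for a few values of $T$:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2 = 4.0          # variance of a single simulated "tree" estimate
n_trials = 100_000

for T in (1, 10, 100):
    # T independent estimates per trial, each with variance sigma2
    estimates = rng.normal(0.0, np.sqrt(sigma2), size=(n_trials, T))
    avg = estimates.mean(axis=1)
    print(f"T={T:>3}: empirical Var(mean) = {avg.var():.3f}  (theory: {sigma2 / T:.3f})")
```

Real trees are only partially independent (they share training data), so the reduction in practice is smaller than $1/T$, but the direction of the effect is the same.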


Quick Readiness Check

Is this method a fit for your use case?

Best For

'Set it and forget it' models. Often the best out-of-the-box algorithm for tabular data.

Prerequisites

No scaling needed. Works well with raw data.

Strengths

Accurate, robust to outliers, handles missing values (in some implementations), provides Feature Importance.

Weaknesses

Slow to predict (must run 100+ trees). Not interpretable (Black Box). Large model size.

Pro Tip

Key difference vs. Decision Tree: Variance Reduction. Key difference vs. GBM: Parallel Training (RF is parallel, GBM is sequential).


Code Snippet

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Initialize
# n_estimators=100 is standard. n_jobs=-1 uses all CPU cores.
rf = RandomForestClassifier(n_estimators=100,
                            max_depth=10,
                            max_features='sqrt',
                            n_jobs=-1,
                            random_state=42)

# 2. Train (X_train / y_train: your pre-split training data)
rf.fit(X_train, y_train)

# 3. Predict
preds = rf.predict(X_test)
print(classification_report(y_test, preds))

# 4. Feature Importance (critical for analysis)
# feature_names: the column names of X_train (e.g. X_train.columns for a DataFrame)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(5))

Parameter Tuning Cheat Sheet

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| n_estimators | 100 - 1000 | Number of trees. More is usually better (more stable), but slower. Diminishing returns after a few hundred. |
| max_features | sqrt, log2, None | Critical. sqrt is standard for classification. Controls the diversity of the trees. |
| max_depth | None, int | Limits tree depth. Unlike single trees, RF is robust to full depth, but limiting it saves memory/time. |
| bootstrap | True (default), False | Keep True for the Bagging effect. |
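One way to search this grid systematically is a cross-validated sweep. A minimal sketch with `GridSearchCV`, using a synthetic dataset in place of your own:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data (synthetic, for illustration only)
X, y = make_classification(n_samples=400, n_features=12, random_state=0)

# Small grid over the parameters from the cheat sheet
param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [None, 10],
}
search = GridSearchCV(RandomForestClassifier(random_state=0, n_jobs=-1),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because Random Forest is fairly insensitive to its hyperparameters, a coarse grid like this is usually enough; most of the gain comes from `max_features`.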