Linear Regression
The foundation of predictive modeling. Fits a linear relationship between input features and a continuous target.
Theory & Concept
Linear Regression is the "Hello World" of machine learning, but don't let its simplicity fool you—it remains a workhorse in production for forecasting and causality analysis.
At its core, it attempts to model the relationship between a dependent variable (y) and one or more independent variables (x₁, …, xₙ) by fitting a linear equation to observed data. The goal is to find the "line of best fit" that minimizes the difference between the predicted values and the actual values.
Decision Path (Regularization)
- Go Ridge (L2) when features are correlated or you want stable weights and best predictive performance.
- Go Lasso (L1) when you want automatic feature selection and a compact, interpretable model.
Key Assumptions
For Linear Regression to be valid, your data must meet specific criteria (remember the acronym LINE):
- Linearity: The relationship between X and Y must be linear.
- Independence: Observations must be independent of each other (no autocorrelation).
- Normality: Residuals should be approximately normal if you want reliable confidence intervals or hypothesis tests (not strictly required for prediction).
- Equal Variance (Homoscedasticity): The variance of residuals should be constant across all levels of X.
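These assumptions can be checked numerically from the residuals. Below is a minimal sketch on synthetic data (the data, the Durbin-Watson threshold, and the two-bucket variance comparison are illustrative choices, not part of the original text): the Durbin-Watson statistic probes independence, and comparing residual spread across the feature range probes homoscedasticity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a true linear relationship: y = 3x + 2 + noise
x = rng.uniform(0, 10, 200)
y = 3 * x + 2 + rng.normal(0, 1, 200)

# Fit a simple least-squares line and compute residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Independence: Durbin-Watson statistic (values near 2 suggest no autocorrelation)
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Equal variance: compare residual spread in the lower vs upper half of x
low_std = residuals[x < 5].std()
high_std = residuals[x >= 5].std()

print(f"Durbin-Watson: {dw:.2f}")
print(f"Residual std (low x): {low_std:.2f}, (high x): {high_std:.2f}")
```

In practice you would also plot residuals against fitted values; a funnel shape signals heteroscedasticity, and any curve signals a violated linearity assumption.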
Mathematical Intuition (Simplified)
Imagine you're trying to draw a line through a cloud of points. The equation for this line in simple 2D space is:
y = mx + b
In Machine Learning terms, we expand this to multiple features (x₁, x₂, …, xₙ):
ŷ = w₀ + w₁x₁ + w₂x₂ + … + wₙxₙ
Where:
- ŷ is the predicted value.
- w₀ is the bias (or intercept).
- w₁ … wₙ are the weights (or coefficients).
How do we find the best weights?
We define a Cost Function (usually Mean Squared Error - MSE) that measures how "wrong" our line is:
MSE = (1/n) Σᵢ (yᵢ - ŷᵢ)²
We then use an optimization algorithm like Gradient Descent (or the Closed-Form solution OLS) to adjust the weights (w) until this error is minimized. Think of it as rolling a ball down a hill until it reaches the lowest point (minimum error).
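The "rolling a ball downhill" picture can be sketched in a few lines of NumPy for the single-feature case (the synthetic data, learning rate, and iteration count here are illustrative choices): compute the MSE gradient with respect to each parameter, then step against it.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data with true relationship y = 4x + 1 plus noise
X = rng.uniform(-1, 1, 100)
y = 4 * X + 1 + rng.normal(0, 0.1, 100)

w0, w1 = 0.0, 0.0  # start the "ball" at the origin
lr = 0.1           # learning rate (step size down the hill)

for _ in range(1000):
    error = (w0 + w1 * X) - y
    # Gradients of MSE with respect to the bias and the weight
    grad_w0 = 2 * error.mean()
    grad_w1 = 2 * (error * X).mean()
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1

print(f"Learned line: y = {w1:.2f}x + {w0:.2f}")
```

With enough iterations this converges to the same solution the closed-form OLS formula gives directly; on large datasets the iterative route is often the practical one.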
Quick Readiness Check
Is this method a fit for your use case?
Best For
Predicting continuous values (Pricing, Forecasting), understanding relationship strength.
Prerequisites
Feature scaling is required when using Gradient Descent–based solvers, but not strictly necessary for OLS. Scaling is still recommended for numerical stability and interpretability.
Strengths
Highly interpretable, very fast training, minimal tuning required.
Weaknesses
Cannot model complex non-linear relationships. Highly sensitive to outliers.
Pro Tip
If asked 'How do you improve Linear Regression?', suggest Regularization (Ridge/Lasso) or feature engineering (Polynomial features).
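Both improvement routes can be combined in one pipeline. A minimal sketch (the synthetic quadratic data, the degree, and the alpha value are illustrative choices): `PolynomialFeatures` handles the non-linearity by expanding x into [x, x²], and Ridge regularization keeps the extra weights tame.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Non-linear target: a plain straight line would underfit this curve
X = rng.uniform(-3, 3, (200, 1))
y = 0.5 * X[:, 0] ** 2 - X[:, 0] + rng.normal(0, 0.2, 200)

# Expand x into [x, x^2] so the relationship becomes linear
# in the new feature space, then fit a regularized linear model
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    Ridge(alpha=1.0),
)
model.fit(X, y)
print(f"Train R²: {model.score(X, y):.3f}")
```

The model is still linear in its weights, which is why it keeps Linear Regression's speed and interpretability while capturing the curve.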
Code Snippet
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# 1. Prepare Data
# Assume X (features) and y (target) are loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 2. Preprocessing (scaling is optional for OLS, but aids stability and makes coefficients comparable)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 3. Initialize and Train
# fit_intercept=True is default, but good to be explicit
model = LinearRegression(fit_intercept=True)
model.fit(X_train_scaled, y_train)
# 4. Evaluate
preds = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
print(f"Coefficients: {model.coef_}")
Parameter Tuning
Linear Regression has very few parameters to tune. If you need more control, you typically move to Ridge (L2) or Lasso (L1) regression.
| Parameter | Options / Range | Effect & Best Practice |
|---|---|---|
| fit_intercept | True (default) / False | Calculates the y-intercept. Only set to False if data is already centered. |
| positive | False (default) / True | Forces coefficients to be non-negative (useful for pricing models where negative effects don't make sense). |
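The effect of the positive constraint is easy to see on synthetic data (the data below is an illustrative construction, not from the original text): one feature has a genuinely negative effect, and the constrained fit clamps its coefficient at zero instead of going negative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# One positive effect (coef 3) and one negative effect (coef -2)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, 200)

unconstrained = LinearRegression().fit(X, y)
constrained = LinearRegression(positive=True).fit(X, y)

print(f"Unconstrained: {np.round(unconstrained.coef_, 2)}")
print(f"positive=True: {np.round(constrained.coef_, 2)}")
```

Use the constraint only when domain knowledge genuinely rules out negative effects; otherwise it simply hides them.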
Regularization: Ridge vs Lasso
When plain Linear Regression overfits or suffers from multicollinearity, add regularization. Both methods add a penalty to the loss function to keep coefficients in check:
- Ridge (L2) shrinks all coefficients toward zero, keeps all features, and is usually better for prediction.
- Lasso (L1) drives some coefficients to zero, enabling automatic feature selection and a sparse model.
Quick Usage (scikit-learn)
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)
ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)
Parameter Tuning (Ridge/Lasso)
| Parameter | Options / Range | Effect & Best Practice |
|---|---|---|
| alpha | Ridge: 0.1, 1.0, 10 / Lasso: 0.01, 0.1, 1.0 | Regularization strength. Higher = stronger shrinkage. Lasso is more sensitive to alpha. |
| solver (Ridge) | auto, svd, cholesky, lsqr, sag, saga | auto works well in most cases. sag/saga are faster on large datasets. |
| max_iter (Lasso) | 1000 (default) to 10000 | Increase if convergence warnings appear. |
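Rather than trying alpha values by hand, scikit-learn's RidgeCV and LassoCV select alpha by cross-validation. A minimal sketch on synthetic data (the sparse ground truth and alpha grids below are illustrative choices): only 3 of 20 features actually matter, so a well-chosen Lasso alpha should suppress most of the rest.

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Sparse ground truth: only the first 3 of 20 features carry signal
X = rng.normal(size=(300, 20))
true_w = np.zeros(20)
true_w[:3] = [5.0, -3.0, 2.0]
y = X @ true_w + rng.normal(0, 0.5, 300)

# Scale first so the penalty treats all features consistently
X_scaled = StandardScaler().fit_transform(X)

ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X_scaled, y)
lasso = LassoCV(alphas=[0.01, 0.1, 1.0], max_iter=10000).fit(X_scaled, y)

print(f"Best Ridge alpha: {ridge.alpha_}")
print(f"Best Lasso alpha: {lasso.alpha_}")
print(f"Near-zero Lasso coefficients: {np.sum(np.abs(lasso.coef_) < 0.1)} of 20")
```

For a finer search, pass a denser (often log-spaced) alpha grid; LassoCV can also generate its own path of alphas if none are supplied.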
When to Choose Which
- Ridge: many correlated features, want stable coefficients, prioritize predictive performance.
- Lasso: want feature selection, interpretability, or a compact model.
Common Pitfalls
- Scale features before fitting (especially for Lasso) so penalties behave consistently.
- Highly correlated features can make Lasso unstable; prefer Ridge or ElasticNet in that case.
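The instability with correlated features can be demonstrated directly (the near-duplicate feature construction below is an illustrative setup): Lasso tends to put all the weight on one of two nearly identical features, while ElasticNet's L2 component spreads it between them (the "grouping effect").

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)

# Two nearly identical (highly correlated) features plus one noise feature
base = rng.normal(size=300)
X = np.column_stack([base, base + rng.normal(0, 0.01, 300), rng.normal(size=300)])
y = 2 * base + rng.normal(0, 0.1, 300)

# Lasso tends to pick one of the twins arbitrarily;
# ElasticNet's L2 term splits the weight between them
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

print(f"Lasso coefs:      {np.round(lasso.coef_, 2)}")
print(f"ElasticNet coefs: {np.round(enet.coef_, 2)}")
```

Which twin Lasso selects can flip with small perturbations of the data, which is exactly the instability the pitfall warns about; ElasticNet's shared weights are far more stable.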