Supervised Learning

Linear Regression

The foundation of predictive modeling. Fits a linear relationship between input features and a continuous target.


Theory & Concept

Linear Regression is the "Hello World" of machine learning, but don't let its simplicity fool you—it remains a workhorse in production for forecasting and causal analysis.

At its core, it attempts to model the relationship between a dependent variable (Y) and one or more independent variables (X) by fitting a linear equation to observed data. The goal is to find the "line of best fit" that minimizes the difference between the predicted values and the actual values.

Decision Path (Regularization)

  • Go Ridge (L2) when features are correlated or you want stable weights and best predictive performance.
  • Go Lasso (L1) when you want automatic feature selection and a compact, interpretable model.

Key Assumptions

For Linear Regression to be valid, your data must meet specific criteria (remember the acronym LINE):

  1. Linearity: The relationship between X and Y must be linear.
  2. Independence: Observations must be independent of each other (no autocorrelation).
  3. Normality: Residuals should be approximately normal if you want reliable confidence intervals or hypothesis tests (not strictly required for prediction).
  4. Equal Variance (Homoscedasticity): The variance of residuals should be constant across all levels of X.
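A quick way to sanity-check assumptions 1 and 4 is to inspect the residuals after a fit: they should center on zero and show no trend against the inputs. A minimal sketch with synthetic data (the coefficients 3.0 and 5.0 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a true linear relationship plus noise
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 5.0 + rng.normal(0, 1.0, 200)

# Fit a simple line via least squares
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Residuals should center on zero (Linearity) and show
# no correlation with x (a crude Equal Variance / trend check)
print(f"Mean residual: {residuals.mean():.4f}")
print(f"Corr(residuals, x): {np.corrcoef(residuals, x)[0, 1]:.4f}")
```

In practice you would also plot residuals against fitted values; a funnel shape signals heteroscedasticity.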

Mathematical Intuition (Simplified)

Imagine you're trying to draw a line through a cloud of points. The equation for this line in simple 2D space is:

y = mx + b

In Machine Learning terms, we expand this to multiple features (x₁, x₂, ...):

ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₙxₙ

Where:

  • ŷ is the predicted value.
  • w₀ is the bias (or intercept).
  • w₁, ..., wₙ are the weights (or coefficients).
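The prediction equation is just a dot product plus a bias term. A tiny sketch with made-up weights and features (all values here are illustrative, not learned):

```python
import numpy as np

# Hypothetical learned parameters for a 3-feature model
w0 = 2.0                         # bias / intercept
w = np.array([0.5, -1.0, 3.0])   # weights w1..w3
x = np.array([4.0, 2.0, 1.0])    # one sample's features x1..x3

# y_hat = w0 + w1*x1 + w2*x2 + w3*x3
y_hat = w0 + np.dot(w, x)
print(y_hat)  # 2.0 + 2.0 - 2.0 + 3.0 = 5.0
```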

How do we find the best weights?

We define a Cost Function (usually Mean Squared Error - MSE) that measures how "wrong" our line is:

MSE = (1/N) · Σᵢ₌₁ᴺ (yᵢ − ŷᵢ)²

We then use an optimization algorithm like Gradient Descent (or the closed-form OLS solution) to adjust the weights (w) until this error is minimized. Think of it as rolling a ball down a hill until it reaches the lowest point (minimum error).
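The "rolling a ball downhill" idea can be sketched in a few lines for a single feature: repeatedly nudge w and b against the gradient of the MSE. The learning rate, iteration count, and the true parameters (2 and 1) are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 2x + 1 plus a little noise
X = rng.uniform(-1, 1, 100)
y = 2.0 * X + 1.0 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0   # start anywhere on the "hill"
lr = 0.1          # learning rate (step size)

for _ in range(1000):
    error = (w * X + b) - y
    # Partial derivatives of MSE with respect to w and b
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")  # should approach 2 and 1
```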


Quick Readiness Check

Is this method a fit for your use case?

Best For

Predicting continuous values (Pricing, Forecasting), understanding relationship strength.

Prerequisites

Feature scaling is required when using Gradient Descent–based solvers, but not strictly necessary for OLS. Scaling is still recommended for numerical stability and interpretability.

Strengths

Highly interpretable, very fast training, minimal tuning required.

Weaknesses

Cannot model complex non-linear relationships. Highly sensitive to outliers.

Pro Tip

If asked 'How do you improve Linear Regression?', suggest Regularization (Ridge/Lasso) or feature engineering (Polynomial features).


Code Snippet

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
 
# 1. Prepare Data
# Assume X (features) and y (target) are loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
# 2. Preprocessing (required for Gradient Descent solvers; recommended for OLS)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# 3. Initialize and Train
# fit_intercept=True is default, but good to be explicit
model = LinearRegression(fit_intercept=True)
model.fit(X_train_scaled, y_train)
 
# 4. Evaluate
preds = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, preds)
r2 = r2_score(y_test, preds)
 
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")
print(f"Coefficients: {model.coef_}")

Parameter Tuning

Linear Regression has very few parameters to tune. If you need more control, you typically move to Ridge (L2) or Lasso (L1) regression.

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| fit_intercept | True (default) \| False | Calculates the y-intercept. Only set to False if data is already centered. |
| positive | False (default) \| True | Forces coefficients to be positive (useful for pricing models where negative prices don't make sense). |

Regularization: Ridge vs Lasso

When plain Linear Regression overfits or suffers from multicollinearity, add regularization. Both methods add a penalty to the loss function to keep coefficients in check:

  • Ridge (L2) shrinks all coefficients toward zero, keeps all features, and is usually better for prediction.
  • Lasso (L1) drives some coefficients to zero, enabling automatic feature selection and a sparse model.

Quick Usage (scikit-learn)

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
 
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
ridge = Ridge(alpha=1.0)
lasso = Lasso(alpha=0.1, max_iter=10000)
 
ridge.fit(X_train_scaled, y_train)
lasso.fit(X_train_scaled, y_train)

Parameter Tuning (Ridge/Lasso)

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| alpha | Ridge: 0.1, 1.0, 10 \| Lasso: 0.01, 0.1, 1.0 | Regularization strength. Higher = stronger shrinkage. Lasso is more sensitive to alpha. |
| solver (Ridge) | auto, svd, cholesky, lsqr, sag, saga | auto works well in most cases. sag/saga are faster on large datasets. |
| max_iter (Lasso) | 1000 (default) to 10000 | Increase if convergence warnings appear. |
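Rather than hand-picking alpha from the ranges above, scikit-learn's RidgeCV and LassoCV select it by cross-validation. A sketch on synthetic data (the alpha grids mirror the table; the dataset parameters are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Cross-validated alpha selection over a small grid
ridge = RidgeCV(alphas=[0.1, 1.0, 10.0]).fit(X, y)
lasso = LassoCV(alphas=[0.01, 0.1, 1.0], cv=5, max_iter=10000).fit(X, y)

print(f"Best Ridge alpha: {ridge.alpha_}")
print(f"Best Lasso alpha: {lasso.alpha_}")
```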

When to Choose Which

  • Ridge: many correlated features, want stable coefficients, prioritize predictive performance.
  • Lasso: want feature selection, interpretability, or a compact model.

Common Pitfalls

  • Scale features before fitting (especially for Lasso) so penalties behave consistently.
  • Highly correlated features can make Lasso unstable; prefer Ridge or ElasticNet in that case.
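ElasticNet blends both penalties via l1_ratio (1.0 = pure Lasso, 0.0 = pure Ridge), which is why it handles correlated features more gracefully than Lasso alone. A minimal sketch with deliberately duplicated features (alpha and l1_ratio values are illustrative starting points):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Build a correlated design: duplicate each feature with tiny noise
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=1)
X = np.hstack([X, X + rng.normal(0, 0.01, X.shape)])
X = StandardScaler().fit_transform(X)

# l1_ratio=0.5 mixes the L1 and L2 penalties equally
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000).fit(X, y)
print(f"Non-zero coefficients: {np.sum(enet.coef_ != 0)} of {len(enet.coef_)}")
```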