Supervised Learning

Logistic Regression

The go-to algorithm for binary classification. Estimates the probability of an instance belonging to a specific class.


Theory & Concept

Despite its name, Logistic Regression is used for classification, not regression. It predicts the probability that a given input point belongs to a certain class (e.g., "spam" vs. "not spam", "churn" vs. "stay").

While Linear Regression fits a straight line whose output can range from -∞ to +∞, Logistic Regression squashes the output between 0 and 1 using the Sigmoid Function. This makes it perfect for probability estimation.

Key Assumptions

  1. Binary Outcome: The target variable is binary (0 or 1).
  2. Linearity of Log-Odds: The independent variables are linearly related to the log-odds of the target.
  3. No Multicollinearity: Independent variables should not be too highly correlated with each other.
  4. Large Sample Size: Maximum likelihood estimation needs a reasonably large sample to produce stable coefficient estimates.
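Assumption 3 (no multicollinearity) is easy to screen for with a pairwise-correlation check. A minimal sketch on synthetic features (the 0.9 threshold and the toy data are illustrative choices, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=200)  # near-duplicate of x1
x3 = rng.normal(size=200)                          # independent feature
X = np.column_stack([x1, x2, x3])

# correlation matrix across features (columns)
corr = np.corrcoef(X, rowvar=False)
flagged = [(i, j) for i in range(3) for j in range(i + 1, 3)
           if abs(corr[i, j]) > 0.9]
print(flagged)  # [(0, 1)]: only the engineered near-duplicate pair
```

For a more rigorous diagnostic, variance inflation factors (VIF) catch multicollinearity that pairwise correlations miss.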

Mathematical Intuition (Simplified)

We start with the linear equation from Linear Regression: z = w^T x + b.

Instead of outputting z directly, we pass it through the Sigmoid Function (σ):

σ(z) = 1 / (1 + e^(-z))

This ensures the output is always between 0 and 1.
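A few lines of NumPy make the squashing behavior concrete (the function name and test values here are illustrative):

```python
import numpy as np

def sigmoid(z):
    # maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # exactly 0.5: right on the decision boundary
print(sigmoid(50))   # ~1.0: large positive z -> confident class 1
print(sigmoid(-50))  # ~0.0: large negative z -> confident class 0
```

Note that σ(0) = 0.5, which is why the default classification threshold sits exactly at z = 0.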

The training process tries to minimize the Log Loss (or Binary Cross-Entropy), which penalizes confident wrong predictions heavily:

J(w) = -(1/N) Σ_{i=1..N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
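To see why Log Loss punishes confident mistakes so heavily, compare confident-right against confident-wrong predictions. A small sketch (the clipping epsilon and the toy arrays are assumptions for illustration):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-15):
    # clip probabilities so log(0) never occurs
    p = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1])
confident_right = np.array([0.9, 0.1, 0.9])  # loss ~0.105
confident_wrong = np.array([0.1, 0.9, 0.1])  # loss ~2.303, over 20x worse
print(binary_cross_entropy(y, confident_right))
print(binary_cross_entropy(y, confident_wrong))
```

Being wrong at 0.9 confidence costs far more than being right at 0.9 is rewarded, which is exactly the pressure that drives the weights toward calibrated probabilities.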

Quick Readiness Check

Is this method a fit for your use case?

Best For

Binary classification baselines. When you need calibrated probabilities (e.g., credit scoring).

Prerequisites

Scale your data! Regularization (L1/L2) depends on feature scale.

Strengths

Simple, fast, outputs probabilities, highly interpretable weights.

Weaknesses

Linearity assumption (can't solve XOR without feature engineering). Performance drops with complex data.

Pro Tip

If asked 'Does Logistic Regression handle non-linear decision boundaries?', the answer is NO (unless you add features).
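That interview answer can be verified directly on the XOR truth table, using PolynomialFeatures to supply the interaction term (the specific C value here is an arbitrary illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR labels

# raw features: no line separates XOR, so accuracy cannot exceed 0.75
plain = LogisticRegression().fit(X, y)
print(plain.score(X, y))

# add x1*x2 (and squares): the classes become linearly separable
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
engineered = LogisticRegression(C=10).fit(X_poly, y)
print(engineered.score(X_poly, y))  # 1.0
```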


Regularization: Ridge vs Lasso

Logistic Regression almost always uses regularization. In scikit-learn, C is the inverse of regularization strength (smaller C = stronger regularization):

  • Ridge (L2): shrinks all weights, keeps all features, usually more stable.
  • Lasso (L1): can zero out weights, acts as feature selection.
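The sparsity difference is easy to demonstrate on synthetic data where only a few features matter (the dataset shape and C value are illustrative; note that in scikit-learn, penalty='l1' requires the liblinear or saga solver):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 10 features, only 3 actually informative
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

l2 = LogisticRegression(penalty='l2', C=0.1).fit(X, y)
l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X, y)

print("L2 zeroed weights:", int((l2.coef_ == 0).sum()))  # 0: shrinks, never kills
print("L1 zeroed weights:", int((l1.coef_ == 0).sum()))  # several exact zeros
```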

Common Pitfalls

  • Scale features so regularization doesn't overweight large-magnitude inputs.
  • Correlated features can cause L1 to pick arbitrary winners; use L2 if stability matters.
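One way to honor the scaling pitfall is to bundle the scaler and the model into a Pipeline, so standardization is re-fit inside every cross-validation fold instead of leaking test-fold statistics (the dataset here is just a convenient stand-in):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# the scaler is fit on each training fold only, preventing leakage
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```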

Code Snippet

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report
 
# 0. Load an example binary dataset (swap in your own X, y)
X, y = load_breast_cancer(return_X_y=True)
 
# 1. Prepare Data (stratify preserves class proportions in both splits)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
 
# 2. Preprocessing (standardization is crucial for regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
 
# 3. Initialize and Train
# C=1.0 is the default inverse regularization strength (smaller = stronger reg)
model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs', max_iter=1000)
model.fit(X_train_scaled, y_train)
 
# 4. Evaluate
preds = model.predict(X_test_scaled)
probs = model.predict_proba(X_test_scaled)[:, 1]  # probability of class 1
 
print(f"Accuracy: {accuracy_score(y_test, preds):.4f}")
print("Classification Report:")
print(classification_report(y_test, preds))

Parameter Tuning Cheat Sheet

  • C (0.001 to 100): Inverse of regularization strength. Low C = strong regularization; high C = weak regularization.
  • penalty (l1, l2): L1 (Lasso) can zero out weights for feature selection; L2 (Ridge) shrinks all weights and is the default. Note that l1 requires the liblinear or saga solver.
  • solver (liblinear, lbfgs): Use liblinear for small datasets; lbfgs (the default) scales better to larger ones.
  • class_weight (None, balanced): Use balanced if your classes are imbalanced (e.g., 95% negative, 5% positive).
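The C range in the cheat sheet maps naturally onto a log-scale grid search; a sketch (the grid values and dataset are illustrative choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# scale inside the pipeline so each CV fold is standardized independently
pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])
# sweep C on a log scale across the cheat sheet's 0.001-100 range
grid = {"clf__C": [0.001, 0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(pipe, grid, cv=5).fit(X, y)
print(search.best_params_)
print(round(search.best_score_, 3))
```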