# K-Nearest Neighbors (KNN)
A simple, instance-based learning algorithm. Classifies data based on the majority vote of its neighbors.
## Theory & Concept
KNN is a "lazy learner"—it doesn't actually learn a model during training. Instead, it memorizes the training data.

When you ask it to predict a new point:
- It calculates the distance from the new point to all stored training points.
- It finds the closest points ("neighbors").
- Classification: It takes a majority vote of the neighbors' classes.
- Regression: It takes the average of the neighbors' values.
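The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the classification case, not scikit-learn's implementation; the function name `knn_predict` and the toy data are ours:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # 1. Distance from the new point to every stored training point
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Indices of the k closest points ("neighbors")
    nearest = np.argsort(dists)[:k]
    # 3. Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0], [1.0], [10.0], [11.0]])
y_train = np.array([0, 0, 1, 1])
# Two of the three nearest points to 0.5 have label 0
print(knn_predict(X_train, y_train, np.array([0.5])))  # 0
```

For regression, step 3 would become `y_train[nearest].mean()` instead of a vote.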
## Mathematical Intuition
The core of KNN is the distance metric. The most common choice is Euclidean distance:

$$d(x, x') = \sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$$

For categorical variables, Hamming distance (the number of positions at which two vectors differ) is used instead.

The prediction for classification is the majority vote:

$$\hat{y} = \arg\max_{c} \sum_{i \in \mathcal{N}_k(x)} \mathbb{1}(y_i = c)$$

where $\mathcal{N}_k(x)$ is the set of the $k$ nearest neighbors of $x$.
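For concreteness, both metrics can be computed directly. These small helpers are our own, written to mirror the formulas above (Hamming here is the raw count of mismatching positions):

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance between two numeric vectors."""
    return float(np.sqrt(np.sum((np.asarray(u) - np.asarray(v)) ** 2)))

def hamming(u, v):
    """Number of positions where two categorical vectors disagree."""
    return int(np.sum(np.asarray(u) != np.asarray(v)))

print(euclidean([0, 0], [3, 4]))                      # 5.0 (the classic 3-4-5 triangle)
print(hamming(["red", "small"], ["red", "large"]))    # 1 (only the second position differs)
```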
## Quick Readiness Check
Is this method a fit for your use case?
**Best For:** Small datasets, baselines, and systems where similarity is intuitive (e.g., simple recommender systems).
**Prerequisites:** Feature scaling is mandatory. Otherwise, distances are dominated by whichever feature has the largest numeric range.
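To see why scaling matters, compare nearest neighbors before and after z-scoring on a toy age/income dataset (all values invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (range ~20-70), income (range ~30k-100k)
X = np.array([[20, 30000.0], [30, 45000.0], [40, 60000.0],
              [50, 75000.0], [60, 90000.0], [70, 100000.0]])
query = np.array([[25, 50000.0]])
a = np.array([[26, 80000.0]])   # similar age, very different income
b = np.array([[60, 51000.0]])   # very different age, similar income

def nearest(q, p1, p2):
    return "a" if np.linalg.norm(q - p1) < np.linalg.norm(q - p2) else "b"

# Raw features: income's large scale dominates, so b looks closer
print(nearest(query, a, b))   # b

# After z-scoring, age and income contribute comparably, and a is closer
scaler = StandardScaler().fit(X)
print(nearest(scaler.transform(query), scaler.transform(a), scaler.transform(b)))  # a
```

The nearest neighbor flips purely because of the units the features happen to be measured in, which is exactly what scaling prevents.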
**Strengths:** Extremely simple. Zero training time. Makes no assumptions about the data distribution.
**Weaknesses:** Slow inference: O(N) per prediction with brute force, since every query must scan all training points. Suffers from the curse of dimensionality: in high dimensions, all points become roughly equidistant and "nearest" loses meaning.
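One common mitigation (a sketch, assuming scikit-learn): in low-to-moderate dimensions, the brute-force scan can be replaced with a spatial index via the `algorithm` parameter, which still returns exact neighbors but avoids touching every training point per query:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # synthetic 3-D data
y = (X[:, 0] > 0).astype(int)           # label = sign of the first feature

# 'kd_tree' builds a spatial index at fit time; each query then descends
# the tree instead of scanning all N stored points
model = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree").fit(X, y)
print(model.predict([[2.0, 0.0, 0.0]]))  # a clearly positive point on feature 0
```

Tree indexes degrade back toward brute-force behavior as dimensionality grows, which is why the curse of dimensionality remains a fundamental limit.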
**Pro Tip:** If asked "What happens if K is too small or too large?": small K (e.g., K = 1) gives high variance (overfitting); large K gives high bias (underfitting).
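The K = 1 overfitting case is easy to demonstrate: each training point's nearest neighbor is itself, so training accuracy is a perfect 1.0 even on pure-noise labels (toy data below is invented):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))        # distinct random points
y = rng.integers(0, 2, size=200)     # labels are pure noise

# K=1: every training point is its own nearest neighbor -> memorization
k1 = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(k1.score(X, y))    # 1.0 on the training set, despite random labels

# Large K: predictions collapse toward the overall majority class (high bias)
k101 = KNeighborsClassifier(n_neighbors=101).fit(X, y)
print(k101.score(X, y))  # roughly the majority-class rate
```

The noise labels make the point sharply: the 1.0 score at K = 1 reflects memorization, not learning.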
## Code Snippet
```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 1. Pipeline: scale first, then classify
model = make_pipeline(StandardScaler(),
                      KNeighborsClassifier(n_neighbors=5))

# 2. Train (just stores the scaled data)
model.fit(X_train, y_train)

# 3. Predict (calculates distances to the stored points)
preds = model.predict(X_test)
```

## Parameter Tuning Cheat Sheet
| Parameter | Options / Range | Effect & Best Practice |
|---|---|---|
| n_neighbors (K) | 1 - 50 | Start with sqrt(N) or default 5. Always pick an odd number for binary classification to avoid ties. |
| weights | uniform, distance | uniform: All neighbors count equally. distance: Closer neighbors have more say (usually better). |
| metric | euclidean, manhattan | euclidean is standard. manhattan (taxicab) often better for high-dimensional data. |
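The three parameters in the table are usually tuned together with cross-validation rather than one at a time. A minimal sketch using `GridSearchCV` on the iris dataset (the grid values simply mirror the cheat sheet):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so it is re-fit on each CV fold
pipe = Pipeline([("scale", StandardScaler()),
                 ("knn", KNeighborsClassifier())])

# Grid mirrors the cheat sheet: K, vote weighting, distance metric
grid = GridSearchCV(pipe, {
    "knn__n_neighbors": [1, 3, 5, 7, 11, 15, 21],
    "knn__weights": ["uniform", "distance"],
    "knn__metric": ["euclidean", "manhattan"],
}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```

Putting the scaler inside the pipeline matters: scaling outside cross-validation would leak statistics from the validation folds into training.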