Supervised Learning

Naive Bayes

A probabilistic classifier based on Bayes' Theorem with the 'naive' assumption of feature independence. Fast, simple, and surprisingly effective for text.


Theory & Concept

Naive Bayes calculates the probability of a data point belonging to a class given its features: P(Class | Features).

It is called "Naive" because it assumes that all features are independent of each other given the class.

  • Example: In spam filtering, it assumes the presence of the word "buy" is unrelated to the word "viagra", which is obviously false.
  • Reality: Despite this wrong assumption, it works remarkably well, especially for text classification (Spam, Sentiment Analysis).

Bayes' Theorem

P(y|X) = P(X|y) · P(y) / P(X)
  • Posterior P(y|X): Probability of class y given features X.
  • Likelihood P(X|y): Probability of features X appearing in class y.
  • Prior P(y): Probability of class y occurring generally.
  • Evidence P(X): Probability of features X occurring generally; a normalizing constant that is the same for every class, so it can be ignored when comparing classes.

Types of Naive Bayes

  1. Multinomial NB: For discrete counts (e.g., word counts in text).
  2. Bernoulli NB: For binary/boolean features (e.g., word present/absent).
  3. Gaussian NB: For continuous features (assumes normal distribution).
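As a sketch of the third variant, Gaussian NB fits a normal distribution per feature per class. The toy height/weight data below is an assumption for illustration (requires scikit-learn):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy continuous features: [height_cm, weight_kg] (made-up values)
X = np.array([[180, 80], [175, 75], [160, 55], [155, 50]])
y = np.array(["tall", "tall", "short", "short"])

clf = GaussianNB()  # fits a per-class mean and variance for each feature
clf.fit(X, y)
print(clf.predict([[170, 70]]))  # classified by which class's Gaussians fit best
```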

Quick Readiness Check

Is this method a fit for your use case?

Best For

Text Classification (Spam, News categorization), Sentiment Analysis, Real-time systems.

Prerequisites

Text must be tokenized/vectorized (Bag of Words or TF-IDF). Continuous data may need binning.

Strengths

Blazing fast (training is linear, O(N)). Needs very little training data. Handles high-dimensional data well.

Weaknesses

The 'Zero Frequency' problem: if a word never appears in the training data for a class, its likelihood is 0 (solved by Laplace Smoothing). The independence assumption is often violated in practice.

Pro Tip

'What is the Zero Frequency problem?' When a feature value isn't in the training set, its likelihood is 0, which wipes out the whole product of probabilities. Fix: Add 1 to all counts (Laplace Smoothing).
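The fix can be sketched by hand. The counts below are made-up; alpha=1 mirrors MultinomialNB's default:

```python
# Hand-rolled sketch of Laplace smoothing for one word's likelihood (assumed counts)
count_word_in_spam = 0      # word never seen in the spam class
total_words_in_spam = 100
vocab_size = 50
alpha = 1.0                 # Laplace smoothing constant

# Without smoothing: 0/100 = 0, which zeroes out the whole product
unsmoothed = count_word_in_spam / total_words_in_spam

# With smoothing: add alpha to every word's count
smoothed = (count_word_in_spam + alpha) / (total_words_in_spam + alpha * vocab_size)

print(unsmoothed)  # 0.0
print(smoothed)    # 1/150 ≈ 0.0067
```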


Code Snippet

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
 
# Text Data (Spam vs Ham)
X = ["offer is secret", "click secret link", "secret sports link", "play sports"]
y = ["spam", "spam", "ham", "ham"]
 
# 1. Pipeline: CountVectorizer converts text to counts, then MultinomialNB trains on them
# alpha=1.0 is Laplace Smoothing
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
 
# 2. Train
model.fit(X, y)
 
# 3. Predict
print(model.predict(["secret sports game"]))
print(model.predict_proba(["secret sports game"]))

Parameter Tuning Cheat Sheet

| Parameter | Options / Range | Effect & Best Practice |
| --- | --- | --- |
| alpha | 0.0 – 1.0+ | Smoothing parameter (Laplace/Lidstone). 1.0 is standard; 0.0 = no smoothing (risk of zero probabilities). |
| fit_prior | True (default), False | Whether to learn class prior probabilities or assume a uniform prior. Usually keep True. |
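In practice, alpha is usually tuned with cross-validation rather than set by hand. A minimal sketch using the article's toy data (the step name "multinomialnb" is auto-generated by make_pipeline):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

X = ["offer is secret", "click secret link", "secret sports link", "play sports"]
y = ["spam", "spam", "ham", "ham"]

pipe = make_pipeline(CountVectorizer(), MultinomialNB())
# Search over several smoothing strengths; cv=2 because the toy set is tiny
grid = GridSearchCV(pipe, {"multinomialnb__alpha": [0.01, 0.1, 0.5, 1.0]}, cv=2)
grid.fit(X, y)
print(grid.best_params_)
```

On real data, use a larger corpus and more folds; with only four documents the "best" alpha here is not meaningful.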