🧠 AI with Python – ✉️🔍 Spam Detection with CountVectorizer + MultinomialNB


Description:

Detecting spam is a classic text-classification task. A reliable baseline uses bag-of-words features with a Multinomial Naive Bayes classifier. It’s fast, effective on short messages, and easy to implement.


Why this approach?

  • Bag-of-words converts text into word count features.
  • Multinomial Naive Bayes handles word count distributions well.
  • Together, they provide a strong baseline for spam filtering.

Bag-of-Words Explained:

The CountVectorizer in scikit-learn implements the Bag-of-Words method. Here’s how it works:

Vocabulary building → It scans all messages in the training set and creates a list of unique words (the vocabulary). For example, from:

"Win a free iPhone"
"Your OTP is 12345"

With CountVectorizer's default settings (lowercasing on, plus a token pattern that drops single-character words like "a"), the vocabulary comes out lowercased and alphabetically sorted:

['12345', 'free', 'iphone', 'is', 'otp', 'win', 'your']

Vectorization → Each message is converted into a numeric vector that counts how often each vocabulary word appears in it.

Example:

  • "Win a free iPhone" → [0, 1, 1, 0, 0, 1, 0]
  • "Your OTP is 12345" → [1, 0, 0, 1, 1, 0, 1]

This results in a sparse matrix (most entries are zero), which is efficient for large text datasets.
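
A quick check in scikit-learn confirms this (fit_transform returns a SciPy sparse matrix; toarray() is used here only for display):

from sklearn.feature_extraction.text import CountVectorizer

messages = ["Win a free iPhone", "Your OTP is 12345"]

vec = CountVectorizer()  # defaults: lowercase=True, single-char tokens dropped
counts = vec.fit_transform(messages)

print(vec.get_feature_names_out())
# ['12345' 'free' 'iphone' 'is' 'otp' 'win' 'your']
print(counts.toarray())
# [[0 1 1 0 0 1 0]
#  [1 0 0 1 1 0 1]]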


Why Multinomial Naive Bayes?

Once messages are converted into word counts, we need a classifier to separate spam from ham (non-spam).

Multinomial Naive Bayes is well-suited for this because:

  1. Models word frequency distributions
    • It assumes each class (spam/ham) has a certain distribution of words.
    • Example: spam messages might contain words like “win”, “free”, “claim” more often, while ham might contain “meeting”, “schedule”, “project”.
  2. Prediction = combine word probabilities
    • For a new message, Naive Bayes scores each class by multiplying its prior probability with the probability of each observed word under that class.
    • The class with the higher score wins.

Although it assumes independence between words (which isn’t fully true in language), this simplification works surprisingly well for spam detection.
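
Here is a minimal sketch of that decision rule with made-up numbers: a toy two-word vocabulary and hypothetical per-class word probabilities (MultinomialNB estimates these from smoothed training counts, and works with log-probabilities to avoid numerical underflow):

import numpy as np

# Hypothetical P(word | class) for the toy vocabulary ["free", "meeting"]
p_word_spam = np.array([0.08, 0.01])
p_word_ham = np.array([0.01, 0.07])
log_prior = np.log(0.5)  # assume equal class priors

counts = np.array([2, 0])  # new message mentions "free" twice

# Score = log P(class) + sum of count * log P(word | class)
log_spam = log_prior + counts @ np.log(p_word_spam)
log_ham = log_prior + counts @ np.log(p_word_ham)
print("spam" if log_spam > log_ham else "ham")  # -> spam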


Minimal Implementation:

Create a simple pipeline that vectorizes the text and trains the classifier:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

model = Pipeline([
    ("vec", CountVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB())
])

Train, evaluate, and predict:

model.fit(X_train, y_train)
y_pred = model.predict(X_test)
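
Score the predictions with scikit-learn's built-in metrics to get the report shown below:

from sklearn.metrics import accuracy_score, classification_report

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))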

Sample Output:

On a demo dataset, you might see:

Accuracy: 0.750

Classification Report:
              precision    recall  f1-score   support

         ham       0.67      1.00      0.80         2
        spam       1.00      0.50      0.67         2

    accuracy                           0.75         4
   macro avg       0.83      0.75      0.73         4
weighted avg       0.83      0.75      0.73         4

Using a Larger Dataset:

Our demo used only a handful of messages, which is useful for illustration but not for serious evaluation. With just a few samples, accuracy and precision swing wildly depending on the split.

For a real project, try the SMS Spam Collection Dataset from UCI (also available on Kaggle). It contains 5,500+ SMS messages labeled as spam or ham.

Using this dataset will:

  • Provide enough data for reliable train/test splits.
  • Let you try cross-validation for stable performance metrics.
  • Show the true power of CountVectorizer + MultinomialNB.

On this dataset, you can expect this simple baseline to reach well over 90% accuracy.
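
A loading-and-evaluation sketch, assuming the UCI distribution's tab-separated SMSSpamCollection file (one "label<TAB>message" pair per line; the Kaggle CSV uses slightly different column names):

import pandas as pd
from sklearn.model_selection import cross_val_score

df = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                 names=["label", "message"])

# 5-fold cross-validation on the full pipeline defined earlier
scores = cross_val_score(model, df["message"], df["label"], cv=5)
print(f"Mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")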


Key Takeaways:

  • CountVectorizer (BoW): Turns text into sparse word count vectors.
  • MultinomialNB: Efficient for word frequency data and robust for spam detection.
  • Great baseline before trying more advanced methods like TF-IDF (a drop-in swap, sketched below) or Transformers.
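
Swapping in TF-IDF is a one-line change, since TfidfVectorizer is a drop-in replacement for CountVectorizer in the same pipeline (a sketch; MultinomialNB works here because tf-idf values are nonnegative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

tfidf_model = Pipeline([
    ("vec", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB())
])
# Train and evaluate exactly as before: tfidf_model.fit(X_train, y_train)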

Code Snippet:

# Text features, model, split, and metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report


# Sample messages (X) and labels (y): 'spam' or 'ham'
X = [
    "Win a FREE iPhone now!!! Click here",
    "Your OTP is 483920. Do not share it.",
    "Limited time offer! Claim your reward today",
    "Are we still on for lunch tomorrow?",
    "Meeting rescheduled to 3pm. See you!",
    "You won a lottery. Send your bank details",
    "Project update: pushed commits to main",
    "Get cheap meds without prescription. Order now",
]
y = ["spam", "ham", "spam", "ham", "ham", "spam", "ham", "spam"]


# 50/50 split so the tiny 8-message demo keeps two of each class in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)


# Combine vectorizer and classifier for a clean workflow
model = Pipeline([
    ("vec", CountVectorizer(lowercase=True, stop_words="english")),
    ("nb", MultinomialNB())
])


# Learn vocabulary + train NB on word counts
model.fit(X_train, y_train)


# Predictions and metrics
y_pred = model.predict(X_test)
print("Accuracy:", round(accuracy_score(y_test, y_pred), 3))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


# A few example messages to classify
new_msgs = [
    "Exclusive deal just for you! Click to claim your prize",
    "Can we move our call to 5 pm?",
    "Your package is out for delivery"
]
print("\nPredictions on new messages:")
for m, p in zip(new_msgs, model.predict(new_msgs)):
    print(f"- {m}  ->  {p}")
