Step 2.3 – Hands-On: Spam Detection

🔍 What Is Spam Detection?

Spam detection is a binary classification problem where we train a model to distinguish between spam and ham (legitimate) messages.

🧠 Concepts Covered

📘 Machine Learning Concepts Explained

As a binary classification task, the goal is to assign each message exactly one of two labels: spam or ham. The concepts below cover each stage of that process.

1. Supervised Learning

We use labeled data to train the model. Each message is tagged as spam or ham, and the model learns patterns from these examples.

2. Text Preprocessing

Raw messages must be normalized and split into tokens before they can be turned into features. In the script below, TfidfVectorizer handles this step: it lowercases the text, splits it into word tokens, and, with stop_words="english", drops common English stop words.

3. Feature Extraction with TF-IDF

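TF-IDF (term frequency-inverse document frequency) converts text into numerical features that reflect word importance. It down-weights words that appear in many messages and gives more weight to words that are frequent in one message but rare across the dataset. Learn more about TF-IDF.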

4. Model Training

We train a classifier (e.g., Naive Bayes) using the TF-IDF vectors. The model learns which word patterns are more likely to appear in spam vs ham.

5. Evaluation Metrics

We judge the model by precision, recall, and F1-score for each class, along with overall accuracy. The script prints all of these with classification_report.

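📦 Dataset

We use a labeled dataset of SMS messages. Each message is tagged as either ham or spam. The script expects the data in a file named spam.csv, with the label in column v1 and the message text in column v2.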

🧑‍💻 Python Script

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

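# Load the dataset, keeping only the label (v1) and message text (v2) columns.
# Note: some widely shared copies of spam.csv are not valid UTF-8; if this
# line raises UnicodeDecodeError, try encoding="latin-1".
df = pd.read_csv("spam.csv", encoding="utf-8")[["v1", "v2"]].dropna()
df.columns = ["label", "message"]
# Drop any rows whose label is not exactly "ham" or "spam"
df = df[df["label"].isin(["ham", "spam"])]

# Map labels to integers: ham -> 0, spam -> 1
y = df["label"].map({"ham": 0, "spam": 1})
X = df["message"]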

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature extraction
vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train and evaluate
model = MultinomialNB()
model.fit(X_train_vec, y_train)
y_pred = model.predict(X_test_vec)
print(classification_report(y_test, y_pred))

📊 Output Example

              precision    recall  f1-score   support
         ham       0.98      0.99      0.99       965
        spam       0.94      0.90      0.92       150
    accuracy                           0.98      1115
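
To read the report, it helps to see how the columns relate. The sketch below recomputes the spam F1-score from the precision and recall shown above, then checks the definitions against scikit-learn on a small hand-made set of labels (the toy labels are hypothetical and are not taken from the dataset).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Spam row of the report above: precision 0.94, recall 0.90
spam_f1 = f1_from(0.94, 0.90)
print(f"spam F1 from the report: {spam_f1:.2f}")

# Hypothetical labels (1 = spam): 2 true positives, 1 false negative, 1 false positive
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]

toy_precision = precision_score(y_true_toy, y_pred_toy)  # TP / (TP + FP) = 2 / 3
toy_recall = recall_score(y_true_toy, y_pred_toy)        # TP / (TP + FN) = 2 / 3
toy_f1 = f1_score(y_true_toy, y_pred_toy)
print(f"toy precision={toy_precision:.2f} recall={toy_recall:.2f} f1={toy_f1:.2f}")
```

In spam filtering, precision on the spam class tells you how often a flagged message really is spam, while recall tells you how much of the spam was caught. The support column is simply the number of test messages in each class.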

📘 Learn More

Want to understand how TF-IDF works? Check out this deep dive on TF-IDF.