🔍 TF-IDF Explained – Spam Detection

TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a technique used to convert text into numbers that reflect how important a word is in a document relative to the entire dataset. It’s the backbone of many spam filters, search engines, and recommendation systems.

📈 1. Term Frequency (TF)

TF measures how often a word appears in a single document. The more frequent the word, the higher its TF score.

Formula:

\( \text{TF}(w, d) = \frac{\text{Number of times } w \text{ appears in } d}{\text{Total words in } d} \)
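
As a quick illustration, here's a minimal plain-Python sketch of the formula (the message text is a made-up example):

# Count how often each word occurs, then divide by the message length
message = "win a free iphone win now"
words = message.split()

tf = {w: words.count(w) / len(words) for w in set(words)}
print(tf["win"])  # 2 occurrences / 6 words ≈ 0.33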

📉 2. Inverse Document Frequency (IDF)

IDF measures how rare a word is across all documents. Common words like “the” or “and” get low scores, while rare words like “iPhone” or “urgent” get high scores.

Formula:

\( \text{IDF}(w) = \log\left( \frac{N}{1 + \text{DF}(w)} \right) \)

where \( N \) is the total number of documents and \( \text{DF}(w) \) is the number of documents containing \( w \); the +1 in the denominator is a common smoothing term that prevents division by zero.

🔍 Example: If “free” appears in 800 out of 1000 messages, its IDF is low. If “iPhone” appears in only 10 messages, its IDF is high, so when it does show up it carries much more weight as a feature for the spam classifier.
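
To put numbers on this, here's a minimal sketch using the counts from the example above (illustrative figures, not real data):

import math

N = 1000                           # total number of messages
df = {"free": 800, "iphone": 10}   # messages containing each word

idf = {w: math.log(N / (1 + count)) for w, count in df.items()}
print(idf)  # "free" ≈ 0.22, "iphone" ≈ 4.51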

⚙️ 3. TF-IDF Score

The final score is the product of TF and IDF. It highlights words that are frequent in a specific document but rare across the dataset.

Formula:

\( \text{TFIDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w) \)
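
Putting the two parts together, here's a minimal sketch with illustrative counts (the same figures as the IDF example above):

import math

N = 1000           # total number of messages in the dataset
tf_free = 1 / 5    # "free" appears once in a 5-word message
tf_iphone = 1 / 5  # "iphone" also appears once in the same message

tfidf_free = tf_free * math.log(N / (1 + 800))     # common word -> low score
tfidf_iphone = tf_iphone * math.log(N / (1 + 10))  # rare word -> high score

print(round(tfidf_free, 3), round(tfidf_iphone, 3))  # 0.044 0.902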

🧪 4. Real-World Example

Let’s say you have 3 messages:

“Win a free iPhone now”
“Meeting rescheduled to 3 PM”
“Free offer just for you”

Words like “free” appear in multiple messages → lower IDF.
Words like “iPhone” or “meeting” appear in only one message → higher IDF.
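
The short sketch below counts these document frequencies directly (plain Python, using the same three messages as the scikit-learn example in the next section):

# Count how many of the three messages contain each word
messages = [
    "win a free iphone now",
    "meeting rescheduled to 3 pm",
    "free offer just for you",
]

df = {}
for msg in messages:
    for word in set(msg.lower().split()):
        df[word] = df.get(word, 0) + 1

print(df["free"])     # 2 -> appears in two messages, lower IDF
print(df["iphone"])   # 1 -> appears in one message, higher IDF
print(df["meeting"])  # 1 -> appears in one message, higher IDF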

🧑‍💻 5. TF-IDF in Python

Here’s how you can apply TF-IDF using scikit-learn in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample list of messages
messages = [
    "Win a free iPhone now",
    "Meeting rescheduled to 3 PM",
    "Free offer just for you"
]

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english")

# Fit and transform the messages
X_vec = vectorizer.fit_transform(messages)
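
To inspect the result, you can print the learned vocabulary and the dense TF-IDF matrix (get_feature_names_out is available in scikit-learn 1.0 and later; note that scikit-learn uses a smoothed, normalized IDF variant, so the exact numbers differ slightly from the formula above):

# Inspect the learned vocabulary and the resulting TF-IDF weights
print(vectorizer.get_feature_names_out())  # one column per word
print(X_vec.toarray().round(2))            # one row per message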

📘 Related Topics

Want to see TF-IDF in action? Check out Step 2.3 – Spam Detection for a full implementation.