TF-IDF stands for Term Frequency–Inverse Document Frequency. It’s a technique used to convert text into numbers that reflect how important a word is in a document relative to the entire dataset. It’s the backbone of many spam filters, search engines, and recommendation systems.
TF measures how often a word appears in a single document. The more frequent the word, the higher its TF score.
Formula:
\( \text{TF}(w, d) = \frac{\text{Number of times } w \text{ appears in } d}{\text{Total words in } d} \)
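The TF formula can be sketched as a small Python function. Whitespace tokenization and lowercasing are simplifying assumptions here; real pipelines use proper tokenizers:

```python
def term_frequency(word, document):
    # TF = occurrences of the word / total words in the document
    words = document.lower().split()
    return words.count(word.lower()) / len(words)

# "free" appears 2 times out of 6 words -> TF = 2/6
tf = term_frequency("free", "win a free iPhone now free")
```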
IDF measures how rare a word is across all documents. Common words like “the” or “and” get low scores, while rare words like “iPhone” or “urgent” get high scores.
Formula:
\( \text{IDF}(w) = \log\left( \frac{N}{1 + \text{DF}(w)} \right) \)
🔍 Example: If “free” appears in 800 out of 1000 messages, its IDF is low. If “iPhone” appears in only 10 messages, its IDF is high, making it a strong spam signal.
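Plugging the example numbers into the IDF formula above confirms the intuition:

```python
import math

def idf(n_docs, doc_freq):
    # IDF with the +1 in the denominator, as in the formula above
    return math.log(n_docs / (1 + doc_freq))

idf_free = idf(1000, 800)    # common word -> low IDF (~0.22)
idf_iphone = idf(1000, 10)   # rare word -> high IDF (~4.51)
```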
The final score is the product of TF and IDF. It highlights words that are frequent in a specific document but rare across the dataset.
Formula:
\( \text{TFIDF}(w, d) = \text{TF}(w, d) \times \text{IDF}(w) \)
Let’s say you have 3 messages: “Win a free iPhone now”, “Meeting rescheduled to 3 PM”, and “Free offer just for you”.
Words like “free” appear in multiple messages → lower IDF.
Words like “iPhone” or “meeting” appear only once → higher IDF.
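Putting TF and IDF together on these three messages, here is a minimal sketch (whitespace tokenization, no stop-word removal; both are simplifying assumptions):

```python
import math

docs = [
    "win a free iphone now",
    "meeting rescheduled to 3 pm",
    "free offer just for you",
]

def tfidf(word, doc, corpus):
    words = doc.split()
    tf = words.count(word) / len(words)
    df = sum(1 for d in corpus if word in d.split())
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

# "free" appears in 2 of 3 docs, so log(3 / (1 + 2)) = 0 -> score 0
score_free = tfidf("free", docs[0], docs)
# "iphone" appears in only 1 doc -> positive score
score_iphone = tfidf("iphone", docs[0], docs)
```

Note that with only 3 documents, the +1 smoothing drives the IDF of “free” all the way to zero; on a realistic corpus the effect is a gentler down-weighting.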
Here’s how you can apply TF-IDF using scikit-learn in Python:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample list of messages
messages = [
    "Win a free iPhone now",
    "Meeting rescheduled to 3 PM",
    "Free offer just for you"
]
# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words="english")
# Fit and transform the messages
X_vec = vectorizer.fit_transform(messages)
Want to see TF-IDF in action? Check out Step 2.3 – Spam Detection for a full implementation.