Of course! Let's build a TF-IDF (Term Frequency-Inverse Document Frequency) calculator from scratch using only Python and NumPy. This is a fantastic exercise to understand the underlying mechanics of this fundamental NLP technique.
We will break this down into three main parts:
- Understanding the Concepts: A quick refresher on TF and IDF.
- The NumPy Implementation: A step-by-step guide to building the functions.
- Full Code & Example: A complete, runnable script with an example.
Understanding the Concepts
TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It's the product of two metrics:
a) Term Frequency (TF)
TF measures how frequently a term appears in a document.
A simple way to calculate it is: $$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$
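For example, in the first document of the corpus used later ("the quick brown fox jumps over the lazy dog", 9 words), the word "fox" appears once, so:
$$ \text{TF}(\text{fox}, d_1) = \frac{1}{9} \approx 0.111 $$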
b) Inverse Document Frequency (IDF)
IDF measures how important a term is across the entire corpus. It down-weights common words (like "the", "a", "is") that appear in many documents and gives more weight to rare, meaningful words.
The standard formula is: $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in corpus } N}{\text{Number of documents containing term } t}\right) $$
Note: In practice the formula is usually smoothed to avoid division by zero and extreme values. A common variant (used by scikit-learn, and by the implementation below) adds 1 to both the numerator and the denominator and then adds 1 to the result. The logarithm is typically the natural log, though base 10 is also used.
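For example, in the 3-document corpus used later, "fox" appears in 2 of the 3 documents, so the basic (unsmoothed) formula with the natural log gives:
$$ \text{IDF}(\text{fox}, D) = \ln\left(\frac{3}{2}\right) \approx 0.405 $$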
c) TF-IDF Score
The final score is simply the product of TF and IDF for each term in each document.
$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
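Continuing the running example with the unsmoothed IDF, the score of "fox" in the first document is roughly:
$$ \text{TF-IDF}(\text{fox}, d_1, D) \approx 0.111 \times 0.405 \approx 0.045 $$
(The implementation below uses the smoothed IDF variant, so its exact numbers differ slightly.)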
The NumPy Implementation
We will create a class TfidfCalculator to encapsulate the logic. The core idea is to first convert our text documents into a matrix of word counts, and then apply the TF-IDF formulas using NumPy's efficient array operations.
Step 1: Setup and Vocabulary Creation
First, we need to process the text and create a vocabulary of all unique words.
import numpy as np
from collections import defaultdict
class TfidfCalculator:
def __init__(self):
self.vocabulary = {}
self.idf_vector = None
def _build_vocabulary(self, documents):
"""Builds a vocabulary of unique words from the corpus."""
vocab = set()
for doc in documents:
words = doc.lower().split()
vocab.update(words)
# Assign a unique index to each word
self.vocabulary = {word: i for i, word in enumerate(sorted(list(vocab)))}
print(f"Vocabulary created with {len(self.vocabulary)} words.")
return self.vocabulary
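As a quick sanity check (illustrative only, using a made-up two-document corpus and the assembled class from the full listing below), this is the kind of mapping _build_vocabulary produces:
calc = TfidfCalculator()
vocab = calc._build_vocabulary(["the cat sat", "the dog barked"])
print(vocab)
# Vocabulary created with 5 words.
# {'barked': 0, 'cat': 1, 'dog': 2, 'sat': 3, 'the': 4}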
Step 2: Document-Term Matrix (Count Matrix)
Next, we'll create a matrix where each row represents a document and each column represents a word from our vocabulary. The value at (doc_idx, word_idx) will be the count of that word in that document.
def _create_count_matrix(self, documents):
"""Creates a document-term matrix of word counts."""
num_docs = len(documents)
num_words = len(self.vocabulary)
# Initialize a matrix of zeros
count_matrix = np.zeros((num_docs, num_words), dtype=np.float32)
for doc_idx, doc in enumerate(documents):
words = doc.lower().split()
# Count word frequencies in the current document
word_counts = defaultdict(int)
for word in words:
if word in self.vocabulary:
word_counts[word] += 1
# Update the count matrix
for word, count in word_counts.items():
word_idx = self.vocabulary[word]
count_matrix[doc_idx, word_idx] = count
return count_matrix
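For intuition, here is what the count matrix looks like for the same made-up two-document corpus (illustrative only):
calc = TfidfCalculator()
docs = ["the cat sat", "the dog barked"]
calc._build_vocabulary(docs)
counts = calc._create_count_matrix(docs)
print(counts)
# [[0. 1. 0. 1. 1.]   <- "the cat sat":    barked=0, cat=1, dog=0, sat=1, the=1
#  [1. 0. 1. 0. 1.]]  <- "the dog barked": barked=1, cat=0, dog=1, sat=0, the=1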
Step 3: Calculate TF
Term Frequency is calculated by dividing each row of the count matrix by the sum of that row (the total number of words in the document). count_matrix.sum(axis=1) computes these row sums.
def _calculate_tf(self, count_matrix):
"""Calculates the Term Frequency matrix."""
# Sum of counts for each document (each row)
sums_per_doc = count_matrix.sum(axis=1)
    # Reshape sums_per_doc into a column vector so broadcasting divides each row
    # by its own total. A document with no words would produce 0/0 = NaN, handled below.
tf_matrix = count_matrix / sums_per_doc[:, np.newaxis]
# Handle documents with no words to avoid NaNs
tf_matrix[np.isnan(tf_matrix)] = 0
return tf_matrix
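The reshaping step is just NumPy broadcasting. A minimal stand-alone illustration with a hand-made array:
import numpy as np
counts = np.array([[2., 1., 0.],
                   [0., 0., 3.]])
row_sums = counts.sum(axis=1)            # [3. 3.]
tf = counts / row_sums[:, np.newaxis]    # each row divided by its own total
print(tf)
# [[0.66666667 0.33333333 0.        ]
#  [0.         0.         1.        ]]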
Step 4: Calculate IDF
Inverse Document Frequency is calculated based on the entire corpus. We need to find, for each word, how many documents contain it. We can do this by checking if the count for a word in a document is greater than zero.
def _calculate_idf(self, count_matrix):
"""Calculates the Inverse Document Frequency vector."""
num_docs = count_matrix.shape[0]
# For each word (column), count the number of documents where it appears (count > 0)
docs_containing_word = (count_matrix > 0).sum(axis=0)
    # Apply the smoothed IDF formula: log((N + 1) / (df + 1)) + 1. The +1 inside
    # the fraction avoids division by zero, and the trailing +1 keeps IDF positive
    # even for words that appear in every document. np.log is the natural logarithm.
    idf_vector = np.log((num_docs + 1) / (docs_containing_word + 1)) + 1
self.idf_vector = idf_vector
return idf_vector
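To see the down-weighting in action, here is a small stand-alone check of the same smoothed formula on a toy count matrix of 3 documents, where the first word appears in every document and the second in only one:
import numpy as np
counts = np.array([[1., 0.],
                   [2., 1.],
                   [1., 0.]])
num_docs = counts.shape[0]                   # 3
df = (counts > 0).sum(axis=0)                # [3 1]
idf = np.log((num_docs + 1) / (df + 1)) + 1
print(idf)
# [1.         1.69314718]  -> the ubiquitous word gets the minimum IDF of 1.0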
Step 5: Calculate TF-IDF
Finally, we multiply the TF matrix by the IDF vector. NumPy's broadcasting will handle this perfectly: the IDF vector will be multiplied element-wise across each row of the TF matrix.
def fit_transform(self, documents):
"""Computes TF-IDF for the given corpus of documents."""
# 1. Build vocabulary
self._build_vocabulary(documents)
# 2. Create count matrix
count_matrix = self._create_count_matrix(documents)
# 3. Calculate TF and IDF
tf_matrix = self._calculate_tf(count_matrix)
idf_vector = self._calculate_idf(count_matrix)
# 4. Calculate TF-IDF matrix
tfidf_matrix = tf_matrix * idf_vector
return tfidf_matrix
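The multiplication relies on the same broadcasting rule: a (documents x words) matrix times a length-words vector scales each column by that word's IDF. A tiny hand-check:
import numpy as np
tf = np.array([[0.5, 0.5],
               [1.0, 0.0]])
idf = np.array([1.0, 1.6931472])
print(tf * idf)
# [[0.5       0.8465736]
#  [1.        0.       ]]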
Full Code & Example
Here is the complete class and a usage example.
import numpy as np
from collections import defaultdict
class TfidfCalculator:
"""
A TF-IDF calculator from scratch using Python and NumPy.
"""
def __init__(self):
self.vocabulary = {}
self.idf_vector = None
def _build_vocabulary(self, documents):
"""Builds a vocabulary of unique words from the corpus."""
vocab = set()
for doc in documents:
words = doc.lower().split()
vocab.update(words)
self.vocabulary = {word: i for i, word in enumerate(sorted(list(vocab)))}
print(f"Vocabulary created with {len(self.vocabulary)} words.")
return self.vocabulary
def _create_count_matrix(self, documents):
"""Creates a document-term matrix of word counts."""
num_docs = len(documents)
num_words = len(self.vocabulary)
count_matrix = np.zeros((num_docs, num_words), dtype=np.float32)
for doc_idx, doc in enumerate(documents):
words = doc.lower().split()
word_counts = defaultdict(int)
for word in words:
if word in self.vocabulary:
word_counts[word] += 1
for word, count in word_counts.items():
word_idx = self.vocabulary[word]
count_matrix[doc_idx, word_idx] = count
return count_matrix
def _calculate_tf(self, count_matrix):
"""Calculates the Term Frequency matrix."""
sums_per_doc = count_matrix.sum(axis=1)
# Reshape for broadcasting and handle division by zero
tf_matrix = count_matrix / (sums_per_doc[:, np.newaxis] + 1e-9) # Add small epsilon
tf_matrix[np.isnan(tf_matrix)] = 0
return tf_matrix
def _calculate_idf(self, count_matrix):
"""Calculates the Inverse Document Frequency vector."""
num_docs = count_matrix.shape[0]
# For each word, count how many documents it appears in
docs_containing_word = (count_matrix > 0).sum(axis=0)
# Apply IDF formula
idf_vector = np.log((num_docs + 1) / (docs_containing_word + 1)) + 1
self.idf_vector = idf_vector
return idf_vector
def fit_transform(self, documents):
"""
Computes TF-IDF for the given corpus of documents.
Args:
documents (list): A list of strings, where each string is a document.
Returns:
np.ndarray: A TF-IDF matrix where rows are documents and columns are words.
"""
if not documents:
return np.array([])
self._build_vocabulary(documents)
count_matrix = self._create_count_matrix(documents)
tf_matrix = self._calculate_tf(count_matrix)
idf_vector = self._calculate_idf(count_matrix)
tfidf_matrix = tf_matrix * idf_vector
return tfidf_matrix
# --- Example Usage ---
if __name__ == "__main__":
# Sample corpus of documents
corpus = [
"the quick brown fox jumps over the lazy dog",
"never jump over a lazy dog quickly",
"a quick brown fox is quick and smart"
]
# Create an instance of our calculator
tfidf_calc = TfidfCalculator()
# Calculate the TF-IDF matrix
tfidf_matrix = tfidf_calc.fit_transform(corpus)
# Print the results
print("\n--- TF-IDF Matrix ---")
print(tfidf_matrix)
print("\n--- Vocabulary ---")
print(tfidf_calc.vocabulary)
print("\n--- IDF Vector ---")
print(tfidf_calc.idf_vector)
# Let's interpret the result for the first document
print("\n--- Interpretation for Document 1 ---")
doc1_tfidf = tfidf_matrix[0]
print(f"TF-IDF scores for Document 1: {doc1_tfidf}")
# Find the word with the highest score in the first document
max_score_index = np.argmax(doc1_tfidf)
    index_to_word = {idx: word for word, idx in tfidf_calc.vocabulary.items()}
    most_important_word = index_to_word[max_score_index]
print(f"The most important word in Document 1 is: '{most_important_word}' with a score of {doc1_tfidf[max_score_index]:.4f}")
Output of the Example:
Vocabulary created with 15 words.
--- TF-IDF Matrix ---
(values rounded to four decimals)
[[0.     0.     0.1431 0.1431 0.1431 0.     0.     0.1881 0.1431 0.
  0.1431 0.1431 0.     0.     0.3763]
 [0.1840 0.     0.     0.1840 0.     0.     0.2419 0.     0.1840 0.2419
  0.1840 0.     0.2419 0.     0.    ]
 [0.1610 0.2116 0.1610 0.     0.1610 0.2116 0.     0.     0.     0.
  0.     0.3219 0.     0.2116 0.    ]]
--- Vocabulary ---
{'a': 0, 'and': 1, 'brown': 2, 'dog': 3, 'fox': 4, 'is': 5, 'jump': 6, 'jumps': 7, 'lazy': 8, 'never': 9, 'over': 10, 'quick': 11, 'quickly': 12, 'smart': 13, 'the': 14}
--- IDF Vector ---
(values rounded to four decimals)
[1.2877 1.6931 1.2877 1.2877 1.2877 1.6931 1.6931 1.6931 1.2877 1.6931
 1.2877 1.2877 1.6931 1.6931 1.6931]
--- Interpretation for Document 1 ---
TF-IDF scores for Document 1: [0.     0.     0.1431 0.1431 0.1431 0.     0.     0.1881 0.1431 0.
 0.1431 0.1431 0.     0.     0.3763]
The most important word in Document 1 is: 'the' with a score of 0.3763
(Note: words that appear in only one document, such as "the", "jumps", or "never", get the larger IDF value log(4/2) + 1 ≈ 1.693, while words shared by two documents get log(4/3) + 1 ≈ 1.288. "the" scores highest in Document 1 because it occurs twice there and nowhere else in this tiny corpus; with a realistic corpus and stop-word removal, such common words would be heavily down-weighted.)
Why Use scikit-learn Instead?
While building TF-IDF from scratch is an excellent learning exercise, for any real-world application, you should use the highly optimized and feature-rich TfidfVectorizer from scikit-learn.
Advantages of scikit-learn:
- Optimized: Built on SciPy sparse matrices, it scales efficiently to large corpora and vocabularies.
- More Features: Includes built-in tokenization, stop-word removal, n-gram support, and hooks for custom analyzers (e.g., to plug in stemming or lemmatization).
- Robust: Handles edge cases and returns a SciPy sparse matrix by default, which is far more memory-efficient than a dense array for large vocabularies.
- Standard API: It follows the standard .fit(), .transform(), .fit_transform() API, making it easy to integrate into machine learning pipelines.
Here's how you'd do the same example with scikit-learn. Note that its numbers won't exactly match the from-scratch values above: TfidfVectorizer uses raw counts for TF rather than counts divided by document length, and it L2-normalizes each row by default.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
"the quick brown fox jumps over the lazy dog",
"never jump over a lazy dog quickly",
"a quick brown fox is quick and smart"
]
# Create the TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the corpus
tfidf_matrix_sklearn = vectorizer.fit_transform(corpus)
# Get the results
print("--- scikit-learn TF-IDF Matrix ---")
print(tfidf_matrix_sklearn.toarray()) # Convert sparse matrix to dense array for printing
print("\n--- scikit-learn Vocabulary ---")
print(vectorizer.get_feature_names_out()) 
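As an optional follow-up (not part of the original snippet), you could pair each feature name with its score for a given document, which makes the output easier to read:
feature_names = vectorizer.get_feature_names_out()
doc1_scores = tfidf_matrix_sklearn.toarray()[0]
# Show the five highest-scoring words in the first document
top_words = sorted(zip(feature_names, doc1_scores), key=lambda pair: pair[1], reverse=True)[:5]
for word, score in top_words:
    print(f"{word}: {score:.4f}")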