
How can TF-IDF be implemented efficiently with Python and NumPy?

Of course! Let's build a TF-IDF (Term Frequency-Inverse Document Frequency) calculator from scratch using only Python and NumPy. This is a fantastic exercise to understand the underlying mechanics of this fundamental NLP technique.

We will break this down into three main parts:

  1. Understanding the Concepts: A quick refresher on TF and IDF.
  2. The NumPy Implementation: A step-by-step guide to building the functions.
  3. Full Code & Example: A complete, runnable script with an example.

Understanding the Concepts

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It's the product of two metrics:

a) Term Frequency (TF)

TF measures how frequently a term appears in a document.

A simple way to calculate it is: $$ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$
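
For example, in a 9-word document such as the first document of the example corpus below, a word like "fox" that appears once has TF = 1/9 ≈ 0.11, while "the", which appears twice, has TF = 2/9 ≈ 0.22.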

b) Inverse Document Frequency (IDF)

IDF measures how important a term is across the entire corpus. It down-weights common words (like "the", "a", "is") that appear in many documents and gives more weight to rare, meaningful words.

The standard formula is: $$ \text{IDF}(t, D) = \log\left(\frac{\text{Total number of documents in corpus } N}{\text{Number of documents containing term } t}\right) $$
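
For example, with N = 3 documents, a term that appears in only one of them gets IDF = log(3/1) ≈ 1.10 (natural log), whereas a term that appears in all three gets log(3/3) = 0 and carries no discriminative weight.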

Note: A small constant (e.g., 1) is often added to the denominator to avoid division by zero for terms that never appear, and the log is usually natural (base e) or base 10. The implementation below uses the smoothed variant log((N + 1) / (df + 1)) + 1, the same formula scikit-learn applies when smooth_idf=True.

c) TF-IDF Score

The final score is simply the product of TF and IDF for each term in each document.

$$ \text{TF-IDF}(t, d, D) = \text{TF}(t, d) \times \text{IDF}(t, D) $$
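
Continuing the example above: a term with TF = 1/9 ≈ 0.11 and IDF ≈ 1.10 scores roughly 0.11 × 1.10 ≈ 0.12, while a term that appears in every document (IDF = 0 under the plain formula) scores 0 no matter how often it occurs.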


The NumPy Implementation

We will create a class TfidfCalculator to encapsulate the logic. The core idea is to first convert our text documents into a matrix of word counts, and then apply the TF-IDF formulas using NumPy's efficient array operations.

Step 1: Setup and Vocabulary Creation

First, we need to process the text and create a vocabulary of all unique words.

import numpy as np
from collections import defaultdict
class TfidfCalculator:
    def __init__(self):
        self.vocabulary = {}
        self.idf_vector = None
    def _build_vocabulary(self, documents):
        """Builds a vocabulary of unique words from the corpus."""
        vocab = set()
        for doc in documents:
            words = doc.lower().split()
            vocab.update(words)
        # Assign a unique index to each word
        self.vocabulary = {word: i for i, word in enumerate(sorted(list(vocab)))}
        print(f"Vocabulary created with {len(self.vocabulary)} words.")
        return self.vocabulary
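
As a quick sanity check, the vocabulary step can already be exercised on its own. The two-sentence corpus here is made up purely for illustration:

calc = TfidfCalculator()
vocab = calc._build_vocabulary(["the cat sat", "the dog sat down"])
print(vocab)
# {'cat': 0, 'dog': 1, 'down': 2, 'sat': 3, 'the': 4}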

Step 2: Document-Term Matrix (Count Matrix)

Next, we'll create a matrix where each row represents a document and each column represents a word from our vocabulary. The value at (doc_idx, word_idx) will be the count of that word in that document.

    def _create_count_matrix(self, documents):
        """Creates a document-term matrix of word counts."""
        num_docs = len(documents)
        num_words = len(self.vocabulary)
        # Initialize a matrix of zeros
        count_matrix = np.zeros((num_docs, num_words), dtype=np.float32)
        for doc_idx, doc in enumerate(documents):
            words = doc.lower().split()
            # Count word frequencies in the current document
            word_counts = defaultdict(int)
            for word in words:
                if word in self.vocabulary:
                    word_counts[word] += 1
            # Update the count matrix
            for word, count in word_counts.items():
                word_idx = self.vocabulary[word]
                count_matrix[doc_idx, word_idx] = count
        return count_matrix
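
To see what this step produces, here is the count matrix for another small made-up corpus; the columns follow the alphabetically sorted vocabulary:

calc = TfidfCalculator()
docs = ["the cat sat on the mat", "the dog sat"]
calc._build_vocabulary(docs)  # {'cat': 0, 'dog': 1, 'mat': 2, 'on': 3, 'sat': 4, 'the': 5}
print(calc._create_count_matrix(docs))
# [[1. 0. 1. 1. 1. 2.]
#  [0. 1. 0. 0. 1. 1.]]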

Step 3: Calculate TF

Term Frequency is calculated by dividing each row of the count matrix by the sum of that row (the total number of words in the document). Summing along axis=1 with count_matrix.sum(axis=1) gives that per-document total.

    def _calculate_tf(self, count_matrix):
        """Calculates the Term Frequency matrix."""
        # Sum of counts for each document (each row)
        sums_per_doc = count_matrix.sum(axis=1)
        # Reshape sums_per_doc to a column so it broadcasts across each row, and add
        # a tiny epsilon so an empty document (row sum of 0) cannot cause division by zero
        tf_matrix = count_matrix / (sums_per_doc[:, np.newaxis] + 1e-9)
        # Defensive: replace any NaNs (cannot occur with the epsilon, but harmless to keep)
        tf_matrix[np.isnan(tf_matrix)] = 0
        return tf_matrix
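
The [:, np.newaxis] reshape is the key trick here, so it is worth seeing in isolation with made-up numbers: it turns the shape-(2,) vector of row sums into a shape-(2, 1) column, and broadcasting then divides each row by its own total.

counts = np.array([[1., 2., 1.],
                   [0., 3., 1.]])
row_sums = counts.sum(axis=1)            # array([4., 4.])
print(counts / row_sums[:, np.newaxis])  # each row now sums to 1
# [[0.25 0.5  0.25]
#  [0.   0.75 0.25]]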

Step 4: Calculate IDF

Inverse Document Frequency is calculated based on the entire corpus. We need to find, for each word, how many documents contain it. We can do this by checking if the count for a word in a document is greater than zero.

    def _calculate_idf(self, count_matrix):
        """Calculates the Inverse Document Frequency vector."""
        num_docs = count_matrix.shape[0]
        # For each word (column), count the number of documents where it appears (count > 0)
        docs_containing_word = (count_matrix > 0).sum(axis=0)
        # Apply the smoothed IDF formula: log((N + 1) / (df + 1)) + 1 (np.log is the natural log).
        # The +1 inside the fraction avoids division by zero; the +1 outside keeps terms
        # that occur in every document from being weighted to exactly zero.
        idf_vector = np.log((num_docs + 1) / (docs_containing_word + 1)) + 1
        self.idf_vector = idf_vector
        return idf_vector
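
The document-frequency count works the same way along the other axis. A tiny illustration with made-up numbers (2 documents, 3 words):

counts = np.array([[2., 0., 1.],
                   [1., 1., 0.]])
df = (counts > 0).sum(axis=0)
print(df)                              # [2 1 1] -> word 0 is in both docs, words 1 and 2 in one each
print(np.log((2 + 1) / (df + 1)) + 1)  # smoothed IDF for a 2-document corpus
# [1.         1.40546511 1.40546511]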

Step 5: Calculate TF-IDF

Finally, we multiply the TF matrix by the IDF vector. NumPy's broadcasting will handle this perfectly: the IDF vector will be multiplied element-wise across each row of the TF matrix.

    def fit_transform(self, documents):
        """Computes TF-IDF for the given corpus of documents."""
        # 1. Build vocabulary
        self._build_vocabulary(documents)
        # 2. Create count matrix
        count_matrix = self._create_count_matrix(documents)
        # 3. Calculate TF and IDF
        tf_matrix = self._calculate_tf(count_matrix)
        idf_vector = self._calculate_idf(count_matrix)
        # 4. Calculate TF-IDF matrix
        tfidf_matrix = tf_matrix * idf_vector
        return tfidf_matrix
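
The last line relies on broadcasting in the other direction: multiplying a shape-(num_docs, num_words) matrix by a shape-(num_words,) vector scales every row element-wise by that vector. With made-up numbers:

tf = np.array([[0.5, 0.5],
               [0.25, 0.75]])
idf = np.array([1.0, 2.0])
print(tf * idf)  # row 0 -> [0.5, 1.0], row 1 -> [0.25, 1.5]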

Full Code & Example

Here is the complete class and a usage example.

import numpy as np
from collections import defaultdict
class TfidfCalculator:
    """
    A TF-IDF calculator from scratch using Python and NumPy.
    """
    def __init__(self):
        self.vocabulary = {}
        self.idf_vector = None
    def _build_vocabulary(self, documents):
        """Builds a vocabulary of unique words from the corpus."""
        vocab = set()
        for doc in documents:
            words = doc.lower().split()
            vocab.update(words)
        self.vocabulary = {word: i for i, word in enumerate(sorted(list(vocab)))}
        print(f"Vocabulary created with {len(self.vocabulary)} words.")
        return self.vocabulary
    def _create_count_matrix(self, documents):
        """Creates a document-term matrix of word counts."""
        num_docs = len(documents)
        num_words = len(self.vocabulary)
        count_matrix = np.zeros((num_docs, num_words), dtype=np.float32)
        for doc_idx, doc in enumerate(documents):
            words = doc.lower().split()
            word_counts = defaultdict(int)
            for word in words:
                if word in self.vocabulary:
                    word_counts[word] += 1
            for word, count in word_counts.items():
                word_idx = self.vocabulary[word]
                count_matrix[doc_idx, word_idx] = count
        return count_matrix
    def _calculate_tf(self, count_matrix):
        """Calculates the Term Frequency matrix."""
        sums_per_doc = count_matrix.sum(axis=1)
        # Reshape for broadcasting and handle division by zero
        tf_matrix = count_matrix / (sums_per_doc[:, np.newaxis] + 1e-9) # Add small epsilon
        tf_matrix[np.isnan(tf_matrix)] = 0
        return tf_matrix
    def _calculate_idf(self, count_matrix):
        """Calculates the Inverse Document Frequency vector."""
        num_docs = count_matrix.shape[0]
        # For each word, count how many documents it appears in
        docs_containing_word = (count_matrix > 0).sum(axis=0)
        # Apply IDF formula
        idf_vector = np.log((num_docs + 1) / (docs_containing_word + 1)) + 1
        self.idf_vector = idf_vector
        return idf_vector
    def fit_transform(self, documents):
        """
        Computes TF-IDF for the given corpus of documents.
        Args:
            documents (list): A list of strings, where each string is a document.
        Returns:
            np.ndarray: A TF-IDF matrix where rows are documents and columns are words.
        """
        if not documents:
            return np.array([])
        self._build_vocabulary(documents)
        count_matrix = self._create_count_matrix(documents)
        tf_matrix = self._calculate_tf(count_matrix)
        idf_vector = self._calculate_idf(count_matrix)
        tfidf_matrix = tf_matrix * idf_vector
        return tfidf_matrix
# --- Example Usage ---
if __name__ == "__main__":
    # Sample corpus of documents
    corpus = [
        "the quick brown fox jumps over the lazy dog",
        "never jump over a lazy dog quickly",
        "a quick brown fox is quick and smart"
    ]
    # Create an instance of our calculator
    tfidf_calc = TfidfCalculator()
    # Calculate the TF-IDF matrix
    tfidf_matrix = tfidf_calc.fit_transform(corpus)
    # Print the results
    print("\n--- TF-IDF Matrix ---")
    print(tfidf_matrix)
    print("\n--- Vocabulary ---")
    print(tfidf_calc.vocabulary)
    print("\n--- IDF Vector ---")
    print(tfidf_calc.idf_vector)
    # Let's interpret the result for the first document
    print("\n--- Interpretation for Document 1 ---")
    doc1_tfidf = tfidf_matrix[0]
    print(f"TF-IDF scores for Document 1: {doc1_tfidf}")
    # Find the word with the highest score in the first document
    max_score_index = np.argmax(doc1_tfidf)
    index_to_word = {idx: word for word, idx in tfidf_calc.vocabulary.items()}
    most_important_word = index_to_word[max_score_index]
    print(f"The most important word in Document 1 is: '{most_important_word}' with a score of {doc1_tfidf[max_score_index]:.4f}")

Output of the Example:

Vocabulary created with 15 words.
--- TF-IDF Matrix ---
[[0.         0.         0.14307579 0.14307579 0.14307579 0.
  0.         0.18812746 0.14307579 0.         0.14307579 0.14307579
  0.         0.         0.37625493]
 [0.18395458 0.         0.         0.18395458 0.         0.
  0.24187817 0.         0.18395458 0.24187817 0.18395458 0.
  0.24187817 0.         0.        ]
 [0.16096026 0.2116434  0.16096026 0.         0.16096026 0.2116434
  0.         0.         0.         0.         0.         0.32192052
  0.         0.2116434  0.        ]]
--- Vocabulary ---
{'a': 0, 'and': 1, 'brown': 2, 'dog': 3, 'fox': 4, 'is': 5, 'jump': 6, 'jumps': 7, 'lazy': 8, 'never': 9, 'over': 10, 'quick': 11, 'quickly': 12, 'smart': 13, 'the': 14}
--- IDF Vector ---
[1.28768207 1.69314718 1.28768207 1.28768207 1.28768207 1.69314718
 1.69314718 1.69314718 1.28768207 1.69314718 1.28768207 1.28768207
 1.69314718 1.69314718 1.69314718]
--- Interpretation for Document 1 ---
TF-IDF scores for Document 1: [0.         0.         0.14307579 0.14307579 0.14307579 0.
 0.         0.18812746 0.14307579 0.         0.14307579 0.14307579
 0.         0.         0.37625493]
The most important word in Document 1 is: 'the' with a score of 0.3763

(Note: The IDF vector takes only two distinct values here: log((3 + 1)/(1 + 1)) + 1 ≈ 1.693 for words that occur in a single document and log((3 + 1)/(2 + 1)) + 1 ≈ 1.288 for words that occur in two of the three documents. The TF-IDF scores then differ further with term frequency, which is why 'the', appearing twice in the first document, gets the highest score there.)

Why Use scikit-learn Instead?

While building TF-IDF from scratch is an excellent learning exercise, for any real-world application, you should use the highly optimized and feature-rich TfidfVectorizer from scikit-learn.

Advantages of scikit-learn:

  • Optimized: It is far faster and more memory-efficient on large corpora than a hand-rolled dense NumPy implementation.
  • More Features: Includes built-in tokenization, stop-word removal, stemming/lemmatization (via custom analyzers), and n-gram support.
  • Robust: Handles edge cases gracefully and returns a SciPy sparse matrix, which is much more memory-efficient than a dense array for large vocabularies.
  • Standard API: It follows the standard .fit(), .transform(), .fit_transform() API, making it easy to integrate into machine learning pipelines.

Here's how you'd do the same example with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "never jump over a lazy dog quickly",
    "a quick brown fox is quick and smart"
]
# Create the TfidfVectorizer
vectorizer = TfidfVectorizer()
# Fit and transform the corpus
tfidf_matrix_sklearn = vectorizer.fit_transform(corpus)
# Get the results
print("--- scikit-learn TF-IDF Matrix ---")
print(tfidf_matrix_sklearn.toarray()) # Convert sparse matrix to dense array for printing
print("\n--- scikit-learn Vocabulary ---")
print(vectorizer.get_feature_names_out())
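
Note that the numbers will not match the scratch implementation above exactly. With its default settings, TfidfVectorizer uses raw term counts for TF (rather than counts divided by document length), applies the smoothed IDF log((1 + N) / (1 + df)) + 1, and then L2-normalizes each row (norm='l2'). Its default tokenizer also drops single-character tokens, so the word "a" from this corpus will not appear in its vocabulary. The relative ordering of words within a document is still comparable, but the absolute scores live on a different scale.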