
How do you implement a concordance in Python?

Of course! Let's break down "concordance" in Python, covering everything from the fundamental concept to practical implementation and popular libraries.


What is a Concordance?

In linguistics and data analysis, a concordance is a listing of the occurrences of a word in a text, each shown with the context in which it appears. Think of it as an advanced form of a word index, or a KWIC (Key Word In Context) display.

The most common format shows the target word centered, with a fixed number of words on either side. This helps you understand how a word is used.

Example:

Let's take the sentence: "The quick brown fox jumps over the lazy dog. The dog was not amused."


A concordance for the word "dog" might look like this:

quick brown fox jumps over the lazy **dog**
The **dog** was not amused

This shows that "dog" appears twice. In the first instance, it's the object of the preposition "over" ("jumps over the lazy dog"). In the second, it's the subject of a sentence ("The dog was..."). This context is invaluable for understanding meaning and usage.
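Before writing a full implementation, you can get a quick character-level KWIC view of any word with nothing but the standard-library re module. A minimal sketch (the 20-character window is an arbitrary choice):

```python
import re

text = "The quick brown fox jumps over the lazy dog. The dog was not amused."

# Show up to 20 characters of raw context on each side of every match.
for match in re.finditer(r"\bdog\b", text, flags=re.IGNORECASE):
    start = max(0, match.start() - 20)
    end = min(len(text), match.end() + 20)
    print(f"...{text[start:end]}...")
```

This works on characters rather than tokens, so it is crude, but it is often enough for a first look at a small text.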


Creating a Concordance from Scratch (Pure Python)

You can easily build a concordance using basic Python data structures like dictionaries and lists. This is a great way to understand the underlying logic.

The Logic:

  1. Tokenize: Split the text into individual words (tokens).
  2. Create an Index: Use a dictionary where the keys are the unique words you want to find. The values will be a list of integers representing the index (position) of each occurrence of that word in the tokenized list.
  3. Format the Output: For a given word, look up its list of positions. For each position, grab a "window" of words around it and print them in a formatted way.

Code Implementation:

import re

def build_concordance(text):
    """
    Builds a concordance index from a text string.
    Returns a dictionary where keys are words and values are lists of token indices.
    """
    # Tokenize the text, keeping only words and converting to lowercase
    tokens = re.findall(r'\b\w+\b', text.lower())
    concordance_index = {}
    for index, token in enumerate(tokens):
        if token not in concordance_index:
            concordance_index[token] = []
        concordance_index[token].append(index)
    return tokens, concordance_index
def print_concordance(tokens, concordance_index, word, window_size=5):
    """
    Prints the concordance for a specific word.
    """
    word = word.lower()
    if word not in concordance_index:
        print(f"The word '{word}' was not found in the text.")
        return
    print(f"Concordance for '{word}':\n")
    for position in concordance_index[word]:
        # Define the start and end of the window
        start = max(0, position - window_size)
        end = min(len(tokens), position + window_size + 1)
        # Get the context words
        context = tokens[start:end]
        # Center the target word for better visualization
        # Find where our target word is within the context slice
        word_in_context_pos = position - start
        context[word_in_context_pos] = f"**{tokens[position]}**"
        # Join and print
        print(" ".join(context))
# --- Example Usage ---
sample_text = """
The quick brown fox jumps over the lazy dog. The dog was not amused.
It was a quiet day. The fox, however, was very quick.
"""
# 1. Build the concordance index
tokens, index = build_concordance(sample_text)
# 2. Print the concordance for a specific word
print_concordance(tokens, index, "the")
print("-" * 20)
print_concordance(tokens, index, "fox")
print("-" * 20)
print_concordance(tokens, index, "quick")

Output of the Example:

Concordance for 'the':

**the** quick brown fox jumps over
quick brown fox jumps over **the** lazy dog the dog was
jumps over the lazy dog **the** dog was not amused it
it was a quiet day **the** fox however was very quick
--------------------
Concordance for 'fox':

the quick brown **fox** jumps over the lazy dog
was a quiet day the **fox** however was very quick
--------------------
Concordance for 'quick':

the **quick** brown fox jumps over the
the fox however was very **quick**
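A common refinement of the approach above is to right-align the left context so that every keyword lines up in a single column, which is the classic KWIC presentation. A standalone sketch (the gutter width of 30 characters is an arbitrary choice):

```python
import re

def kwic(text, word, window=4, gutter=30):
    """Print each occurrence of `word` with the keyword aligned in one column."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    word = word.lower()
    for i, token in enumerate(tokens):
        if token != word:
            continue
        left = " ".join(tokens[max(0, i - window):i])
        right = " ".join(tokens[i + 1:i + 1 + window])
        # Right-align the left context so every keyword starts at the same column.
        print(f"{left:>{gutter}}  {token}  {right}")

kwic("The quick brown fox jumps over the lazy dog. The dog was not amused.", "dog")
```

The same index built by build_concordance could drive this display; the sketch re-tokenizes only to stay self-contained.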

Using the NLTK Library (The Easiest & Most Powerful Way)

For any serious natural language processing, the Natural Language Toolkit (NLTK) is the standard library in Python. It has a built-in, highly optimized ConcordanceIndex class.


First, you need to install NLTK and download some data:

pip install nltk

Then, in a Python interpreter:

import nltk
nltk.download('punkt') # For tokenization
nltk.download('gutenberg') # For some sample texts

NLTK Implementation:

NLTK's approach is more object-oriented and handles tokenization for you.

import nltk
from nltk.text import Text
# Load a sample text from NLTK's corpus
# Let's use the full text of Moby Dick
moby_dick_text = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
# Tokenize properly, then wrap the tokens in a Text object
moby_dick = Text(nltk.word_tokenize(moby_dick_text))
# --- Use the built-in concordance method ---
print("Concordance for 'whale' using NLTK:")
moby_dick.concordance("whale", width=80, lines=5)
print("\n" + "-"*40 + "\n")
print("Concordance for 'sea':")
moby_dick.concordance("sea", width=60, lines=8)

Output of the Example:

Each concordance() call prints a header such as "Displaying 5 of 1222 matches:", followed by one line per match, with the keyword aligned in a central column and roughly width characters of context around it. (The exact match count depends on how the text was tokenized.)

Key NLTK concordance() parameters:

  • word: The word to find.
  • width: The width of each line of output (number of characters).
  • lines: The maximum number of lines to display.
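Under the hood, Text.concordance() is backed by nltk.text.ConcordanceIndex, which you can use directly when you want raw token offsets instead of printed lines. A minimal sketch (no corpus download needed; the key function for case-insensitive lookup is our own choice):

```python
from nltk.text import ConcordanceIndex

tokens = "The quick brown fox jumps over the lazy dog".split()

# `key` normalizes tokens before indexing, making lookups case-insensitive.
index = ConcordanceIndex(tokens, key=lambda s: s.lower())

# Token positions of every occurrence of "the" (the query is normalized too).
print(index.offsets("The"))
```

The offsets can then be fed into a custom windowing routine like the one in the pure-Python section.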

Using spaCy (A Modern, Industrial-Strength Approach)

spaCy is another powerful NLP library, known for its speed and production-ready capabilities. While it doesn't have a one-line concordance function like NLTK, it's very easy to build one using its powerful tokenization and iteration features.

First, install spaCy and download a model:

pip install spacy
python -m spacy download en_core_web_sm

spaCy Implementation:

spaCy processes text into a Doc object, where you can iterate over tokens.

import spacy
# Load the English language model
nlp = spacy.load("en_core_web_sm")
# Process the text
text = "The quick brown fox jumps over the lazy dog. The dog was not amused. A quick thought."
doc = nlp(text)
def print_spacy_concordance(doc, target_word, window_size=2):
    """
    Prints a concordance using spaCy's Doc object.
    """
    target_word = target_word.lower()
    print(f"Concordance for '{target_word}' using spaCy:\n")
    for i, token in enumerate(doc):
        if token.text.lower() == target_word:
            # Get the start and end of the window
            start = max(0, i - window_size)
            end = min(len(doc), i + window_size + 1)
            # Get the context tokens as plain strings (Token.text is read-only)
            words = [t.text for t in doc[start:end]]
            # Highlight the target word at its position within the window
            words[i - start] = f"**{words[i - start]}**"
            # Join and print
            print(" ".join(words))
# --- Example Usage ---
print_spacy_concordance(doc, "the")
print("-" * 20)
print_spacy_concordance(doc, "quick")

Output of the Example:

Concordance for 'the' using spaCy:

**The** quick brown
jumps over **the** lazy dog
dog . **The** dog was
--------------------
Concordance for 'quick' using spaCy:

The **quick** brown fox
. A **quick** thought .
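If all you need is tokenization, you can also skip the model download entirely: spacy.blank("en") builds a bare pipeline containing only the tokenizer, which is sufficient for a concordance. A minimal sketch using a Span slice for the context:

```python
import spacy

# A blank pipeline has only the tokenizer -- no trained model required.
nlp = spacy.blank("en")
doc = nlp("The quick brown fox jumps over the lazy dog. The dog was not amused.")

window = 2
for i, token in enumerate(doc):
    if token.text.lower() == "dog":
        # Slicing a Doc yields a Span; its .text preserves the original spacing.
        span = doc[max(0, i - window):i + window + 1]
        print(span.text)
```

Using Span.text avoids the whitespace-joining step entirely, at the cost of not highlighting the keyword.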

Summary: Which Method to Use?

Pure Python
  • Pros: No external dependencies; excellent for learning the logic; lightweight and fast for small tasks.
  • Cons: Requires manual implementation; less robust tokenization (e.g., doesn't handle punctuation perfectly).
  • Best for: Learning, small scripts, or situations where you can't install libraries.

NLTK
  • Pros: Very easy to use (text.concordance()); excellent tokenization; rich with corpora and other NLP tools; great for research and education.
  • Cons: Can be slower than spaCy; the API can feel a bit dated.
  • Best for: Students, researchers, and anyone needing a quick and powerful NLP toolkit.

spaCy
  • Pros: Extremely fast and efficient; state-of-the-art NLP features (lemmatization, POS tagging, etc.); modern, well-designed API.
  • Cons: Requires more code for a custom concordance than NLTK; the models can be large.
  • Best for: Production applications, large-scale data processing, and modern NLP pipelines.