杰瑞科技汇

Python如何实现Levenshtein距离算法?

Of course! The Levenshtein distance is a measure of the difference between two sequences. It's the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.

Python如何实现Levenshtein距离算法?-图1
(图片来源网络,侵删)

For example:

  • The Levenshtein distance between "kitten" and "sitting" is 3.
    • k -> s (substitution)
    • e -> i (substitution)
    • Insert g at the end

Here's a complete guide on how to use and implement the Levenshtein distance in Python.


The Easy Way: Using the python-Levenshtein Library

For most applications, the best approach is to use a highly optimized, third-party library. The most common one is python-Levenshtein.

Installation

First, you need to install it. It's a C-optimized library, so it's very fast.

Python如何实现Levenshtein距离算法?-图2
(图片来源网络,侵删)
pip install python-Levenshtein

Usage

The library provides a simple function distance().

import Levenshtein
# --- Basic Usage ---
str1 = "kitten"
str2 = "sitting"
distance = Levenshtein.distance(str1, str2)
print(f"The Levenshtein distance between '{str1}' and '{str2}' is: {distance}")
# Output: The Levenshtein distance between 'kitten' and 'sitting' is: 3
# --- Other Examples ---
print(Levenshtein.distance("book", "back"))  # 2 (substitute 'o' for 'a', 'o' for 'c')
print(Levenshtein.distance("Saturday", "Sunday")) # 3 (delete 'a', 't', 'u')
print(Levenshtein.distance("", "abc")) # 3 (insert 'a', 'b', 'c')
print(Levenshtein.distance("abc", "")) # 3 (delete 'a', 'b', 'c')
print(Levenshtein.distance("abc", "abc")) # 0 (no changes needed)

This library is also very fast, making it suitable for processing large lists of strings.


The Manual Way: Implementing the Algorithm (Dynamic Programming)

If you can't use external libraries or want to understand the underlying algorithm, you can implement it yourself using dynamic programming. The core idea is to build a matrix that stores the distances between all prefixes of the two strings.

Let's use "kitten" and "sitting" as an example.

Python如何实现Levenshtein距离算法?-图3
(图片来源网络,侵删)

The Algorithm

  1. Create a matrix with (len(s1) + 1) rows and (len(s2) + 1) columns.
  2. Initialize the first row and column with their index values (0, 1, 2, ...). This represents the cost of converting an empty string to the target prefix.
  3. Iterate through the matrix. For each cell (i, j):
    • If the characters s1[i-1] and s2[j-1] are the same, the cost is 0. Otherwise, it's 1.
    • The value of the cell is the minimum of three possible operations:
      • Deletion: matrix[i-1][j] + 1
      • Insertion: matrix[i][j-1] + 1
      • Substitution: matrix[i-1][j-1] + cost
  4. The final Levenshtein distance is the value in the bottom-right cell of the matrix.

Python Implementation

Here is a clean and well-commented implementation.

def levenshtein_distance(s1, s2):
    """
    Calculates the Levenshtein distance between two strings.
    """
    # Get the lengths of the strings
    len_s1 = len(s1)
    len_s2 = len(s2)
    # Create a matrix to store distances
    # The matrix has len_s1 + 1 rows and len_s2 + 1 columns
    dp = [[0] * (len_s2 + 1) for _ in range(len_s1 + 1)]
    # Initialize the first row and column
    # The cost of converting an empty string to s2[0..j] is j insertions
    for i in range(len_s1 + 1):
        dp[i][0] = i
    # The cost of converting s1[0..i] to an empty string is i deletions
    for j in range(len_s2 + 1):
        dp[0][j] = j
    # Fill the rest of the matrix
    for i in range(1, len_s1 + 1):
        for j in range(1, len_s2 + 1):
            # Cost is 0 if characters are the same, 1 otherwise
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            # Find the minimum of deletion, insertion, and substitution
            dp[i][j] = min(
                dp[i - 1][j] + 1,      # Deletion
                dp[i][j - 1] + 1,      # Insertion
                dp[i - 1][j - 1] + cost # Substitution
            )
    # The final distance is in the bottom-right cell
    return dp[len_s1][len_s2]
# --- Usage ---
str1 = "kitten"
str2 = "sitting"
distance = levenshtein_distance(str1, str2)
print(f"The Levenshtein distance between '{str1}' and '{str2}' is: {distance}")
# Output: The Levenshtein distance between 'kitten' and 'sitting' is: 3

Practical Application: Fuzzy String Matching

The most common use case for Levenshtein distance is fuzzy string matching, where you want to find strings in a list that are "close" to a target string.

Here's how you can use the python-Levenshtein library to find the best match.

import Levenshtein
# A list of possible correct spellings or options
correct_options = [
    "apple", "banana", "cherry", "date", "elderberry",
    "fig", "grape", "honeydew", "kiwi", "lemon"
]
# A user input with a typo
user_input = "aple" # Missing 'p'
# Another example
user_input_2 = "bannanna" # Extra 'n'
# Another example
user_input_3 = "stawbery" # Wrong word entirely
def find_best_match(input_str, options):
    """
    Finds the best match for an input string from a list of options.
    """
    if not input_str or not options:
        return None
    # Calculate the distance for each option
    # The key=lambda item: item[1] sorts by the distance value
    # min() will find the tuple with the smallest second element (the distance)
    best_match = min(
        [(option, Levenshtein.distance(input_str, option)) for option in options],
        key=lambda item: item[1]
    )
    return best_match
# --- Find matches ---
match1 = find_best_match(user_input, correct_options)
match2 = find_best_match(user_input_2, correct_options)
match3 = find_best_match(user_input_3, correct_options)
print(f"Input: '{user_input}', Best Match: '{match1[0]}' (Distance: {match1[1]})")
print(f"Input: '{user_input_2}', Best Match: '{match2[0]}' (Distance: {match2[1]})")
print(f"Input: '{user_input_3}', Best Match: '{match3[0]}' (Distance: {match3[1]})")
# --- More advanced: Find all matches within a certain threshold ---
def find_all_matches_within_threshold(input_str, options, threshold):
    """
    Finds all options that are within a certain Levenshtein distance.
    """
    matches = []
    for option in options:
        distance = Levenshtein.distance(input_str, option)
        if distance <= threshold:
            matches.append((option, distance))
    # Sort by distance for better results
    matches.sort(key=lambda item: item[1])
    return matches
# Find all matches for "aple" with a distance of 1 or less
close_matches = find_all_matches_within_threshold(user_input, correct_options, 1)
print(f"\nClose matches for '{user_input}' (threshold <= 1):")
for match, dist in close_matches:
    print(f"- {match} (Distance: {dist})")

Summary: Which one should you use?

Method Pros Cons Best for...
python-Levenshtein Extremely fast, easy to use, standard industry choice. Requires an external installation (pip install). Almost all production use cases. Performance-critical applications.
Manual Implementation No external dependencies, great for learning the algorithm. Slower for large strings, more code to write and maintain. Academic purposes, coding interviews, or environments where you cannot install packages.
分享:
扫描分享到社交APP
上一篇
下一篇