Of course! pyahocorasick (imported as ahocorasick) is a fast Python library for finding many patterns (a "dictionary" of strings) in a single text. It implements the Aho-Corasick algorithm, which is designed for exactly this purpose.

Let's break down what it is, why it's useful, and how to use it with clear examples.
What is the Aho-Corasick Algorithm?
Imagine you want to find a list of words ("cat", "dog", "hat") in a long text ("The cat in the hat saw the dog."). The naive way is to loop through your list of words and search for each one in the text. This is slow if you have many words.
The Aho-Corasick algorithm is much faster. It builds a special data structure (a "trie" with extra links) from your list of patterns first. Then, it scans the text only once to find all the matches. This makes it incredibly efficient for large texts and large pattern sets.
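To make the "trie with extra links" idea concrete, here is a minimal pure-Python sketch of the algorithm. It is only an illustration and not how the library is implemented internally; the names build_automaton and find_all are hypothetical helpers invented for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists for the patterns."""
    goto = [{}]    # goto[state][char] -> next state
    fail = [0]     # fail[state] -> state of the longest proper suffix that is also a trie prefix
    output = [[]]  # output[state] -> patterns that end when this state is reached

    # Phase 1: insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for char in pattern:
            if char not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][char] = len(goto) - 1
            state = goto[state][char]
        output[state].append(pattern)

    # Phase 2: breadth-first traversal to compute the failure links.
    queue = deque(goto[0].values())  # depth-1 states always fail back to the root
    while queue:
        state = queue.popleft()
        for char, child in goto[state].items():
            queue.append(child)
            fallback = fail[state]
            while fallback and char not in goto[fallback]:
                fallback = fail[fallback]
            fail[child] = goto[fallback].get(char, 0)
            # Inherit the patterns that end at the fallback state (suffix matches).
            output[child] = output[child] + output[fail[child]]
    return goto, fail, output


def find_all(text, automaton):
    """Scan the text once and return (start, end, pattern) for every match."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for index, char in enumerate(text):
        # Follow failure links until some state can consume this character.
        while state and char not in goto[state]:
            state = fail[state]
        state = goto[state].get(char, 0)
        for pattern in output[state]:
            matches.append((index - len(pattern) + 1, index, pattern))
    return matches


automaton = build_automaton(["cat", "dog", "hat"])
print(find_all("The cat in the hat saw the dog.", automaton))
# → [(4, 6, 'cat'), (15, 17, 'hat'), (27, 29, 'dog')]
```

The text is consumed one character at a time and never re-read; the failure links are what let the scan fall back to a shorter candidate instead of restarting from scratch.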
Key Features:

- Multi-pattern search: Find many strings at once.
- Linear time complexity: The search time is proportional to the length of the text plus the number of matches reported, not to the number of patterns.
- Can find overlapping matches: It can find "ana" and "anana" in "banana".
- Can find all occurrences: It reports the end index of every match, and the start index follows from the pattern's length.
Installation
First, you need to install the library. It's published on PyPI as pyahocorasick; the module you import is named ahocorasick.
pip install pyahocorasick
Core Concepts and Usage
The main class you'll interact with is ahocorasick.Automaton. The workflow is always the same:
- Create an Automaton: automaton = ahocorasick.Automaton()
- Add your patterns (words) to the Automaton: Use automaton.add_word(word, value). The value can be anything you want to associate with the word (like the word itself, a category, an ID, etc.).
- Convert the trie into a searchable automaton: automaton.make_automaton(). This is a crucial step that builds the "failure links" that make the algorithm so fast.
- Search through your text: Use automaton.iter(text). Note that automaton.get(word, default) is a dict-style lookup of a stored word, not a text search.
Example 1: Basic Keyword Search
Let's find a list of programming-related keywords in a sentence.
import ahocorasick

# 1. Create an Automaton
automaton = ahocorasick.Automaton()

# 2. Add words to the dictionary
# We'll store the word itself as the value for easy retrieval.
keywords = ["python", "java", "script", "code", "error"]
for keyword in keywords:
    automaton.add_word(keyword, keyword)

# 3. "Compile" the automaton
automaton.make_automaton()

# 4. Search in a text
text = "Writing Python script is fun, but a Java script error can be frustrating."
print(f"Searching in: '{text}'\n")

# Matching is case-sensitive, so search a lowercased copy of the text.
# (Lowercasing keeps the indices unchanged for this ASCII text.)
haystack = text.lower()

# Using iter() to find all matches
for end_index, matched_keyword in automaton.iter(haystack):
    start_index = end_index - len(matched_keyword) + 1
    print(f"Found '{matched_keyword}' at index {start_index}-{end_index}")

# iter() is lazy, so next() retrieves only the first match.
# (get() looks up a stored word like dict.get(); it does not search a text.)
first_match = next(automaton.iter(haystack), None)
if first_match:
    print(f"\nFirst match found: '{first_match[1]}'")
else:
    print("\nNo matches found.")
Output:

Searching in: 'Writing Python script is fun, but a Java script error can be frustrating.'

Found 'python' at index 8-13
Found 'script' at index 15-20
Found 'java' at index 36-39
Found 'script' at index 41-46
Found 'error' at index 48-52

First match found: 'python'
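One thing to keep in mind is that the automaton matches substrings, so a keyword such as "script" is also reported inside "javascript". If only whole words are wanted, the matches can be filtered afterwards. The sketch below assumes (end_index, keyword) pairs of the kind produced in Example 1; whole_word_matches is a hypothetical helper and not part of the library, and the pairs are written out by hand so the snippet runs on its own.

```python
def whole_word_matches(text, matches):
    """Keep only matches that are not embedded in a longer alphanumeric run.

    `matches` is an iterable of (end_index, keyword) pairs, where each
    keyword was stored as its own value.
    """
    kept = []
    for end_index, keyword in matches:
        start_index = end_index - len(keyword) + 1
        # A match is a whole word if the characters on both sides (if any)
        # are not letters or digits.
        before_ok = start_index == 0 or not text[start_index - 1].isalnum()
        after_ok = end_index == len(text) - 1 or not text[end_index + 1].isalnum()
        if before_ok and after_ok:
            kept.append((start_index, end_index, keyword))
    return kept


text = "javascript and a java script"
# Hand-written pairs; in practice they would come from automaton.iter(text).
raw_matches = [(3, "java"), (9, "script"), (20, "java"), (27, "script")]
print(whole_word_matches(text, raw_matches))
# → [(17, 20, 'java'), (22, 27, 'script')]
```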
Example 2: Handling Overlapping Matches
The iter() method reports every match, including overlapping ones. To keep only the longest, non-overlapping matches, use iter_long() instead.
import ahocorasick

automaton = ahocorasick.Automaton()

# Find "ana" and "anana" in "banana"
automaton.add_word("ana", "ana")
automaton.add_word("anana", "anana")
automaton.make_automaton()

text = "banana"

print("Finding all matches (including overlaps):")
for end_idx, keyword in automaton.iter(text):
    start_idx = end_idx - len(keyword) + 1
    print(f"Found '{keyword}' at index {start_idx}-{end_idx}")

print("\nFinding matches without overlaps:")
# iter_long() yields only the longest match at each position and skips overlaps
for end_idx, keyword in automaton.iter_long(text):
    start_idx = end_idx - len(keyword) + 1
    print(f"Found '{keyword}' at index {start_idx}-{end_idx}")
Output:
Finding all matches (including overlaps):
Found 'ana' at index 1-3
Found 'anana' at index 1-5
Found 'ana' at index 3-5

Finding matches without overlaps:
Found 'anana' at index 1-5
Example 3: Associating Patterns with Custom Values
You don't have to store the keyword itself. You can store any object. This is useful for categorizing keywords.
import ahocorasick
automaton = ahocorasick.Automaton()
# Add words and associate them with a category
automaton.add_word("error", "System Issue")
automaton.add_word("exception", "System Issue")
automaton.add_word("crash", "System Issue")
automaton.add_word("login", "Authentication")
automaton.add_word("password", "Authentication")
automaton.add_word("user", "Authentication")
automaton.add_word("sale", "E-commerce")
automaton.add_word("purchase", "E-commerce")
automaton.add_word("cart", "E-commerce")
automaton.make_automaton()
text = "A user login error caused the system to crash. Check the purchase cart."
print("Found issues and their categories:")
for end_index, category in automaton.iter(text):
    # Only the category was stored, so recover the matched word by scanning
    # back from end_index to the previous space. This assumes every match
    # starts at a word boundary; storing (word, category) is more robust.
    start_index = end_index
    while start_index > 0 and text[start_index - 1] != ' ':
        start_index -= 1
    matched_word = text[start_index:end_index + 1]
    print(f"- '{matched_word}' is a '{category}'")
Output:
Found issues and their categories:
- 'user' is a 'Authentication'
- 'login' is a 'Authentication'
- 'error' is a 'System Issue'
- 'crash' is a 'System Issue'
- 'purchase' is a 'E-commerce'
- 'cart' is a 'E-commerce'
Advanced Features
Longest-Only Matching
Sometimes you only care about the longest match at any given position (e.g., finding "python" but not "py"). The iter_long() method does this.
import ahocorasick

automaton = ahocorasick.Automaton()
# Store (word, label) so the matched word and its length are available later
automaton.add_word("py", ("py", "short"))
automaton.add_word("python", ("python", "long"))
automaton.make_automaton()

text = "I love python programming."

# iter_long() reports only the longest match at each position
for end_idx, (word, label) in automaton.iter_long(text):
    start_idx = end_idx - len(word) + 1
    print(f"Found '{word}' ({label}) at index {start_idx}-{end_idx}")
Output:
Found 'python' (long) at index 7-12
Notice it found "python" (labeled "long") and ignored the shorter "py".
Storing Match Lengths Instead of Values
If you don't need an associated value, you can create the automaton with ahocorasick.STORE_LENGTH. Each word's length is then stored automatically as its value, so iter() yields (end_index, length) pairs.
import ahocorasick

# STORE_LENGTH makes the automaton store each word's length as its value
automaton = ahocorasick.Automaton(ahocorasick.STORE_LENGTH)

# No value argument is passed in this mode
automaton.add_word("cat")
automaton.add_word("bat")
automaton.make_automaton()

text = "The cat and the bat are friends."

# iter() now yields (end_index, length) pairs
print("Match positions:")
for end_idx, length in automaton.iter(text):
    start_idx = end_idx - length + 1
    print(f"{start_idx}-{end_idx}")
Output:
Match positions:
4-6
16-18
When to Use ahocorasick vs. re (Regular Expressions)
This is a very common question.
| Feature | ahocorasick | re.findall() / re.finditer() |
|---|---|---|
| Use Case | Finding a fixed list of many strings. | Finding patterns described by a regex grammar. |
| Performance | Much faster for large sets of fixed strings. O(N) for text length. | Slower, especially with complex regexes. Performance can degrade unexpectedly. |
| Regex Power | None. It only matches exact strings. | Full power. Can match character classes, wildcards, repetitions, etc. |
| Code | More setup (add_word, make_automaton). | Simpler for one-off searches (re.findall(pattern, text)). |
Rule of Thumb:
- Use ahocorasick when you have a known, fixed dictionary of keywords (e.g., a list of names, product SKUs, error codes, IP addresses) and you need to find them in a large body of text quickly.
- Use re when your search pattern is dynamic or complex (e.g., "find all email addresses," "find all 5-digit numbers," "find words that start with 'pre' and end with 'ed'").
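
For comparison, here is a small sketch of the re side of that rule of thumb, using only the standard library. The keyword list and log line are made up for illustration. A short list of fixed keywords can still be searched with re by escaping each one and joining them with "|", although that approach typically scales worse than an automaton as the list grows.

```python
import re

keywords = ["error", "crash", "login"]

# Fixed keywords: escape each one and join them into a single alternation.
keyword_pattern = re.compile("|".join(re.escape(k) for k in keywords))

# A 5-digit number is a pattern rather than a fixed string, so it needs a regex.
five_digit_pattern = re.compile(r"\b\d{5}\b")

log_line = "login error 40412 followed by crash 7 at 12345"

# Report (start, end, keyword) with an inclusive end index, as in the examples above.
keyword_hits = [
    (m.start(), m.end() - 1, m.group()) for m in keyword_pattern.finditer(log_line)
]
number_hits = five_digit_pattern.findall(log_line)

print(keyword_hits)  # → [(0, 4, 'login'), (6, 10, 'error'), (30, 34, 'crash')]
print(number_hits)   # → ['40412', '12345']
```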
