How Do Python Regex Groups Capture Their Contents?

Regex groups are a powerful feature that lets you extract and work with specific parts of a matched string.


What is a Regex Group?

A group is a way to treat multiple characters as a single unit. You create a group by enclosing part of your regular expression in parentheses: (...).

Groups serve two primary purposes:

  1. Grouping for Quantifiers: To apply a quantifier (like *, +, or {3,5}) to a sequence of characters instead of just a single character.

    • Without grouping: ab+ matches a followed by one or more bs (ab, abb, abbb).
    • With grouping: (ab)+ matches one or more repetitions of the sequence ab (ab, abab, ababab).
  2. Capturing: To "capture" the part of the string that matches the group and extract it later. This is the most common use case.
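
The quantifier difference described above can be checked directly. This is a minimal sketch using re.fullmatch so that the whole string has to match; the variable names are only illustrative.

```python
import re

# Without grouping, + applies only to the preceding "b".
plain_abbb = re.fullmatch(r"ab+", "abbb") is not None
plain_abab = re.fullmatch(r"ab+", "abab") is not None

# With grouping, + applies to the whole sequence "ab".
grouped_abab = re.fullmatch(r"(ab)+", "abab") is not None
grouped_abbb = re.fullmatch(r"(ab)+", "abbb") is not None

print(plain_abbb, plain_abab)      # True False
print(grouped_abab, grouped_abbb)  # True False
```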


Capturing Groups: The Basics

This is what most people mean when they talk about "groups" in regex. When you use parentheses in a pattern, the text that matches the expression inside them is captured and stored.

You can access these captured groups using the group() method of the re match object.

How to Access Captured Groups

  • match.group(0): Returns the entire match.
  • match.group(1): Returns the first captured group.
  • match.group(2): Returns the second captured group.
  • ...and so on.
  • match.groups(): Returns a tuple containing all the captured groups (from 1 onwards).

Example: Parsing a Date

Let's say we want to parse dates in the format YYYY-MM-DD.

import re
text = "The event is scheduled for 2025-10-27."
pattern = r"(\d{4})-(\d{2})-(\d{2})" # 4 digits, then 2 digits, then 2 digits
match = re.search(pattern, text)
if match:
    # The entire matched string
    print(f"Full match: {match.group(0)}")
    # Output: Full match: 2025-10-27
    # The first captured group (the year)
    print(f"Year: {match.group(1)}")
    # Output: Year: 2025
    # The second captured group (the month)
    print(f"Month: {match.group(2)}")
    # Output: Month: 10
    # The third captured group (the day)
    print(f"Day: {match.group(3)}")
    # Output: Day: 27
    # All captured groups as a tuple
    print(f"All groups: {match.groups()}")
    # Output: All groups: ('2025', '10', '27')
else:
    print("No match found.")
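
Captured groups are always strings, so a common next step is to convert them. The sketch below wraps the same pattern in a helper, parse_iso_date, which is a hypothetical name chosen for illustration, and unpacks match.groups() into integers.

```python
import re
from datetime import date

def parse_iso_date(text):
    """Return the first YYYY-MM-DD date found in text, or None if there is none."""
    match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
    if match is None:
        return None
    # groups() returns ('2025', '10', '27'); convert each part to int.
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)

parsed = parse_iso_date("The event is scheduled for 2025-10-27.")
missing = parse_iso_date("No date here.")
print(parsed)   # 2025-10-27
print(missing)  # None
```

Note that the pattern only checks the shape of the date. A string such as 2025-13-40 still matches, and date() then raises ValueError, so callers may want to handle that case.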

Named Groups

When you have many groups, remembering which is group(1) and which is group(5) can be confusing. Named groups solve this by letting you assign a name to a group. This makes your code much more readable and maintainable.

Syntax

You create a named group using the syntax (?P<name>pattern). The P stands for "Python extension".

How to Access Named Groups

  • match.group('name'): Returns the group with the specified name.
  • match.groupdict(): Returns a dictionary where keys are the group names and values are the matched strings.

Example: Parsing a Date (Again, but with Names)

import re
text = "The event is scheduled for 2025-10-27."
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.search(pattern, text)
if match:
    # Access by name
    print(f"Year: {match.group('year')}")
    # Output: Year: 2025
    print(f"Month: {match.group('month')}")
    # Output: Month: 10
    # All named groups as a dictionary
    print(f"All named groups: {match.groupdict()}")
    # Output: All named groups: {'year': '2025', 'month': '10', 'day': '27'}
else:
    print("No match found.")
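
Named groups also work with the other re functions. The following sketch, which uses an illustrative input string, collects a groupdict() per match with finditer, shows that a named group keeps its positional number, and refers to groups by name in a re.sub replacement using \g<name>.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
text = "Releases: 2024-03-15 and 2025-10-27."

# One dictionary per date found in the text.
records = [m.groupdict() for m in pattern.finditer(text)]
print(records)
# [{'year': '2024', 'month': '03', 'day': '15'}, {'year': '2025', 'month': '10', 'day': '27'}]

# A named group is still numbered: 'year' is also group 1.
first = pattern.search(text)
same_value = first.group("year") == first.group(1)
print(same_value)  # True

# In a replacement string, \g<name> inserts the named group's text.
reordered = pattern.sub(r"\g<day>/\g<month>/\g<year>", text)
print(reordered)  # Releases: 15/03/2024 and 27/10/2025.
```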

Non-Capturing Groups

Sometimes you need parentheses for grouping (e.g., (ab)+) but you don't actually need to capture the matched text. Using a capturing group in this case creates an unnecessary entry in the result tuple, which can be inefficient and confusing.

For this, you use a non-capturing group: (?:pattern).

The ?: at the start tells the regex engine to group the expression but not to capture it.
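
To see the effect on the result, here is a small sketch comparing an alternation written with a capturing group and with a non-capturing one. The sample text and variable names are assumptions made for the illustration; Pattern.groups reports how many capturing groups a compiled pattern has.

```python
import re

text = "Dr. Smith met Ms. Jones"

# Capturing alternation: the title is group 1, the surname is group 2.
capturing = re.compile(r"(Mr|Ms|Dr)\. (\w+)")
# Non-capturing alternation: only the surname is captured.
non_capturing = re.compile(r"(?:Mr|Ms|Dr)\. (\w+)")

print(capturing.groups)      # 2
print(non_capturing.groups)  # 1

cap_result = capturing.findall(text)
non_cap_result = non_capturing.findall(text)
print(cap_result)      # [('Dr', 'Smith'), ('Ms', 'Jones')]
print(non_cap_result)  # ['Smith', 'Jones']
```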

Example: Finding Words Ending in "ing"

Let's find all words that end in "ing". We want to capture the whole word, not just the "ing" part.

import re
text = "I am running, jumping, and singing."
# Two capturing groups: findall returns one (stem, suffix) tuple per match.
# This is not what we want.
pattern_capturing = r"(\w+)(ing)"
matches_capturing = re.findall(pattern_capturing, text)
print(f"Capturing group result: {matches_capturing}")
# Output: Capturing group result: [('runn', 'ing'), ('jump', 'ing'), ('sing', 'ing')]
# Making the suffix non-capturing leaves a single capturing group, so
# findall returns only that group: the stem, still not the whole word.
pattern_non_capturing = r"(\w+)(?:ing)"
matches_non_capturing = re.findall(pattern_non_capturing, text)
print(f"Non-capturing group result: {matches_non_capturing}")
# Output: Non-capturing group result: ['runn', 'jump', 'sing']
# With no capturing groups at all, findall returns each full match.
pattern_whole_word = r"\b\w+(?:ing)\b"
matches_whole_word = re.findall(pattern_whole_word, text)
print(f"Whole word result: {matches_whole_word}")
# Output: Whole word result: ['running', 'jumping', 'singing']
# With re.finditer, the full match is always available as group(0),
# whatever groups the pattern contains.
for match in re.finditer(pattern_non_capturing, text):
    print(f"Found word: {match.group(0)}")
# Output:
# Found word: running
# Found word: jumping
# Found word: singing

Key takeaway: Use (...) when you need to extract a part of the string. Use (?:...) when you only need the parentheses for logical grouping.


Other Types of Groups

There are several other specialized group types, but these are the most common.

| Group Syntax | Name | Description |
| --- | --- | --- |
| (a\|b) | Alternation Group | Matches either a or b. This is the standard "OR" operator; the parentheses also capture whichever alternative matched. |
| (?:...) | Non-Capturing Group | Groups the regex but does not capture the match. |
| (?P<name>...) | Named Group | Captures the match and assigns it a name. |
| (?=...) | Positive Lookahead | Asserts that the following characters match the pattern, but does not consume them. Matching resumes at the position where the lookahead started. |
| (?!...) | Negative Lookahead | Asserts that the following characters do not match the pattern, but does not consume them. |
| (?<=...) | Positive Lookbehind | Asserts that the preceding characters match the pattern, but does not consume them. The matched text starts after the lookbehind. |
| (?<!...) | Negative Lookbehind | Asserts that the preceding characters do not match the pattern, but does not consume them. |

Example: Lookahead for File Extensions

Let's find all words that are followed by a .py extension, but we only want the word itself, not the .py.

import re
text = "Use script.py or main.py, but not config.txt."
# Positive lookahead: (?=\.py)
# This means "match a word boundary followed by letters, but only if it's
# immediately followed by a literal '.py' string"
pattern = r"\b(\w+)(?=\.py)"
matches = re.findall(pattern, text)
print(f"Scripts found: {matches}")
# Output: Scripts found: ['script', 'main']
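
The other lookaround types from the table follow the same idea. This is a minimal sketch on an invented price string; the text and variable names are assumptions chosen for the example.

```python
import re

text = "Price: $42, discount 15%, total $36"

# Positive lookbehind: digits that are immediately preceded by "$".
dollar_amounts = re.findall(r"(?<=\$)\d+", text)
print(dollar_amounts)  # ['42', '36']

# Negative lookbehind: digits not preceded by "$" or by another digit.
# Excluding a preceding digit stops the match from starting mid-number.
not_dollars = re.findall(r"(?<![$\d])\d+", text)
print(not_dollars)  # ['15']

# Negative lookahead: whole numbers that are not followed by "%".
# The \b anchors stop the engine from backtracking to a partial number.
not_percent = re.findall(r"\b\d+\b(?!%)", text)
print(not_percent)  # ['42', '36']
```

In Python's re module a lookbehind pattern must match strings of a fixed length, which is why the examples look behind for a single character.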

Summary and Best Practices

| Group Type | Syntax | Use Case | How to Access |
| --- | --- | --- | --- |
| Capturing | (...) | You need to extract a specific part of the match. | match.group(1), match.groups() |
| Named Capturing | (?P<name>...) | You need to extract parts and want readable code. | match.group('name'), match.groupdict() |
| Non-Capturing | (?:...) | You need parentheses for logic (quantifiers, alternation) but don't want to capture the text. | Not accessible via group() methods. |
| Lookahead | (?=...) | You want to match something only if it's followed by a specific pattern, without including that pattern in the result. | Not accessible via group() methods. |

Golden Rule: If you don't need to use the matched text later, use a non-capturing group (?:...). It avoids storing a capture you never read and makes your code clearer.
