
How Does Python's tldextract Accurately Extract Domains?

tldextract is an essential Python library for anyone working with URLs. It reliably separates a URL's subdomain, domain, and top-level domain (TLD), even for multi-part suffixes like .co.uk and newer TLDs like .google.


Here's a comprehensive guide covering what it is, why you need it, how to install and use it, and advanced tips.


What is tldextract and Why Do You Need It?

When you look at a URL like https://blog.example.co.uk/path?query=1, it's tempting to split it by dots:

  • blog
  • example
  • co
  • uk

But which part is the domain, and which is the TLD? The answer is: it depends on a public list. The TLD for this URL is actually .co.uk, not just .uk.

Manually parsing this is a nightmare: the list of valid public suffixes changes constantly (e.g., .app and .dev were only added in recent years).


tldextract solves this by consulting the Public Suffix List (PSL), a community-maintained registry of all known suffixes, which it fetches and caches locally so the parsing is done correctly for you.

Correct Parsing of https://blog.example.co.uk/path:

  • Subdomain: blog
  • Domain: example
  • TLD: co.uk

Installation

First, you need to install the library. pip pulls in its dependencies (such as requests, which is used to fetch the suffix list) automatically.

pip install tldextract

Basic Usage

The core function is tldextract.extract(). It takes a URL or hostname string and returns an ExtractResult with three attributes: subdomain, domain, and suffix (the public suffix, i.e. the TLD).

import tldextract
# --- Example 1: Simple URL ---
url = "https://www.google.com/search?q=python"
ext = tldextract.extract(url)
print(f"URL: {url}")
print(f"Subdomain: {ext.subdomain}")  # 'www'
print(f"Domain: {ext.domain}")        # 'google'
print(f"Suffix (TLD): {ext.suffix}")  # 'com'
print("-" * 20)
# --- Example 2: URL with a complex TLD ---
url_uk = "https://www.bbc.co.uk/news"
ext_uk = tldextract.extract(url_uk)
print(f"URL: {url_uk}")
print(f"Subdomain: {ext_uk.subdomain}") # 'www'
print(f"Domain: {ext_uk.domain}")       # 'bbc'
print(f"Suffix (TLD): {ext_uk.suffix}") # 'co.uk'
print("-" * 20)
# --- Example 3: Just a hostname ---
hostname = "api.github.com"
ext_hostname = tldextract.extract(hostname)
print(f"Hostname: {hostname}")
print(f"Subdomain: {ext_hostname.subdomain}") # 'api'
print(f"Domain: {ext_hostname.domain}")       # 'github'
print(f"Suffix (TLD): {ext_hostname.suffix}") # 'com'
print("-" * 20)
# --- Example 4: An IP Address ---
ip_url = "http://127.0.0.1:8000"
ext_ip = tldextract.extract(ip_url)
print(f"URL: {ip_url}")
print(f"Subdomain: {ext_ip.subdomain}") # ''
print(f"Domain: {ext_ip.domain}")       # '127.0.0.1'
print(f"Suffix (TLD): {ext_ip.suffix}") # '' (IPs have no suffix; the port is stripped)
print("-" * 20)

Output:

URL: https://www.google.com/search?q=python
Subdomain: www
Domain: google
Suffix (TLD): com
--------------------
URL: https://www.bbc.co.uk/news
Subdomain: www
Domain: bbc
Suffix (TLD): co.uk
--------------------
Hostname: api.github.com
Subdomain: api
Domain: github
Suffix (TLD): com
--------------------
URL: http://127.0.0.1:8000
Subdomain: 
Domain: 127.0.0.1
Suffix (TLD): 
--------------------

Advanced Usage & Configuration

A. Updating the TLD List

The library caches the Public Suffix List so it doesn't have to download it on every run. The list changes regularly, however, so long-running applications should refresh it periodically.

import tldextract

# Build an extractor; the module-level tldextract.extract() uses a
# default instance with the same behavior.
extractor = tldextract.TLDExtract()

# Clear the cached copy of the Public Suffix List and re-fetch it now
extractor.update(fetch_now=True)

# Subsequent calls use the freshly fetched list
ext = extractor("https://www.bbc.co.uk/news")
print(ext.suffix)  # 'co.uk'

# The same refresh is also available from the command line:
#   tldextract --update

B. Using a Custom TLD List

If you're in a restricted environment or need to use a specific version of the list, you can provide your own file path.

import os
import tldextract

# Write a small suffix list in Public Suffix List format (one suffix per line)
custom_list_path = os.path.abspath("custom_suffix_list.dat")
with open(custom_list_path, "w") as f:
    f.write("com\norg\nco.uk\nmyapp.dev\n")

# Point tldextract at the local file instead of the live list
extractor = tldextract.TLDExtract(
    suffix_list_urls=[f"file://{custom_list_path}"],
    cache_dir=None,              # don't cache the custom list
    fallback_to_snapshot=False,  # don't fall back to the bundled snapshot
)

# 'myapp.dev' is a suffix in our custom list, so 'blog' becomes the domain
ext = extractor("https://blog.myapp.dev")
print(f"Subdomain: {ext.subdomain}") # ''
print(f"Domain: {ext.domain}")       # 'blog'
print(f"Suffix (TLD): {ext.suffix}") # 'myapp.dev'

C. Handling URLs Without a TLD

tldextract is smart enough to handle invalid or incomplete URLs gracefully.

import tldextract
# No TLD
ext = tldextract.extract("localhost:8000")
print(f"Subdomain: '{ext.subdomain}'") # ''
print(f"Domain: '{ext.domain}'")       # 'localhost'
print(f"Suffix: '{ext.suffix}'")       # '' ('localhost' has no public suffix; the port is stripped)
# Just a path
ext = tldextract.extract("/some/path/to/file.html")
print(f"Subdomain: '{ext.subdomain}'") # ''
print(f"Domain: '{ext.domain}'")       # ''
print(f"Suffix: '{ext.suffix}'")       # ''

Practical Use Cases

Here are common scenarios where tldextract shines.

Use Case 1: Normalizing URLs for Comparison

You want to check if two URLs point to the same domain, ignoring www and protocol differences.

import tldextract

def normalize_domain(url):
    ext = tldextract.extract(url)
    # ext.registered_domain is equivalent to f"{ext.domain}.{ext.suffix}",
    # but returns '' instead of a stray '.' when a part is missing
    return ext.registered_domain
url1 = "https://www.example.com/page"
url2 = "http://example.com/page"
url3 = "https://blog.example.co.uk"
print(f"Normalized 1: {normalize_domain(url1)}") # example.com
print(f"Normalized 2: {normalize_domain(url2)}") # example.com
print(f"Normalized 3: {normalize_domain(url3)}") # example.co.uk
# Now you can easily compare them
if normalize_domain(url1) == normalize_domain(url2):
    print("\nURL 1 and URL 2 belong to the same registered domain.")

Use Case 2: Filtering by Domain

You have a list of URLs and want to find all that belong to a specific organization, like nytimes.com.

import tldextract

urls = [
    "https://www.nytimes.com/2025/10/01/world/europe/uk-economy.html",
    "https://www.bbc.co.uk/news",
    "https://www.nytimes.com/spotlight",
    "https://github.com/psf/requests",
    "https://cooking.nytimes.com/recipes"
]
target_domain = "nytimes.com"
matching_urls = []
for url in urls:
    ext = tldextract.extract(url)
    # We check the full registered domain (domain + suffix)
    registered_domain = f"{ext.domain}.{ext.suffix}"
    if registered_domain == target_domain:
        matching_urls.append(url)
print(f"URLs from {target_domain}:")
for url in matching_urls:
    print(f"- {url}")

Output:

URLs from nytimes.com:
- https://www.nytimes.com/2025/10/01/world/europe/uk-economy.html
- https://www.nytimes.com/spotlight
- https://cooking.nytimes.com/recipes

Notice how cooking.nytimes.com is correctly included because its registered domain is also nytimes.com.


Comparison with Alternatives

tldextract
  • Pros: Accurate, handles complex TLDs, actively maintained, easy to use.
  • Cons: Adds an external dependency; fetches the suffix list (usually automatic).
  • Best for: Almost all URL parsing tasks. This is the recommended, robust solution.

urllib.parse
  • Pros: Built into Python, no dependencies.
  • Cons: Gives you only the raw hostname; splitting it on dots yourself is naive and fails on bbc.co.uk (it would treat uk as the TLD).
  • Best for: Simple parsing where you know the TLDs in advance or don't care about accuracy (e.g., extracting path components).

re (Regex)
  • Pros: Full control over the pattern.
  • Cons: Brittle and hard to maintain; a regex covering the Public Suffix List would be enormous and complex.
  • Best for: Quick-and-dirty scripts where you only expect simple TLDs.

Example of urllib.parse failure:

from urllib.parse import urlparse
url = "https://www.bbc.co.uk/news"
parsed = urlparse(url)
hostname = parsed.netloc
# This is the naive part
parts = hostname.split('.')
print(parts) # ['www', 'bbc', 'co', 'uk']
# You'd have to manually hardcode rules for 'co.uk', 'com.au', etc.
# This is exactly what tldextract automates for you.

Summary

For any serious URL manipulation in Python, tldextract is the go-to library. It saves you from the headache of managing the Public Suffix List yourself and provides reliable, accurate domain parsing. Its simple API makes it a drop-in solution for most use cases.
