
How Does Python's tldextract Accurately Extract Domains?

tldextract is an essential Python library for anyone working with URLs. It reliably separates a URL's subdomain, domain, and top-level domain (TLD), even for multi-part suffixes like .co.uk and newer TLDs like .google.


Here's a comprehensive guide covering what it is, why you need it, how to install and use it, and advanced tips.


What is tldextract and Why Do You Need It?

When you look at a URL like https://blog.example.co.uk/path?query=1, it's tempting to split it by dots:

  • blog
  • example
  • co
  • uk

But which part is the domain, and which is the TLD? The answer is: it depends on a public list. The TLD for this URL is actually .co.uk, not just .uk.

Manually parsing this is a nightmare: the list of valid public suffixes changes constantly (e.g., .app and .dev were only added in recent years).


tldextract solves this by consulting the Public Suffix List (PSL), a community-maintained registry of all known suffixes, which it fetches and caches locally so the parsing is done correctly for you.

Correct Parsing of https://blog.example.co.uk/path:

  • Subdomain: blog
  • Domain: example
  • TLD: co.uk

Installation

First, you need to install the library. pip pulls in its dependencies (such as requests, which is used to fetch the suffix list) automatically.

pip install tldextract

Basic Usage

The core function is tldextract.extract(). It takes a URL or hostname string and returns an ExtractResult with three attributes: subdomain, domain, and suffix (the public suffix, i.e. the TLD).

import tldextract
# --- Example 1: Simple URL ---
url = "https://www.google.com/search?q=python"
ext = tldextract.extract(url)
print(f"URL: {url}")
print(f"Subdomain: {ext.subdomain}")  # 'www'
print(f"Domain: {ext.domain}")        # 'google'
print(f"Suffix (TLD): {ext.suffix}")  # 'com'
print("-" * 20)
# --- Example 2: URL with a complex TLD ---
url_uk = "https://www.bbc.co.uk/news"
ext_uk = tldextract.extract(url_uk)
print(f"URL: {url_uk}")
print(f"Subdomain: {ext_uk.subdomain}") # 'www'
print(f"Domain: {ext_uk.domain}")       # 'bbc'
print(f"Suffix (TLD): {ext_uk.suffix}") # 'co.uk'
print("-" * 20)
# --- Example 3: Just a hostname ---
hostname = "api.github.com"
ext_hostname = tldextract.extract(hostname)
print(f"Hostname: {hostname}")
print(f"Subdomain: {ext_hostname.subdomain}") # 'api'
print(f"Domain: {ext_hostname.domain}")       # 'github'
print(f"Suffix (TLD): {ext_hostname.suffix}") # 'com'
print("-" * 20)
# --- Example 4: An IP Address ---
ip_url = "http://127.0.0.1:8000"
ext_ip = tldextract.extract(ip_url)
print(f"URL: {ip_url}")
print(f"Subdomain: {ext_ip.subdomain}") # ''
print(f"Domain: {ext_ip.domain}")       # '127.0.0.1'
print(f"Suffix (TLD): {ext_ip.suffix}") # '' (IPs have no suffix; the port is stripped)
print("-" * 20)

Output:

URL: https://www.google.com/search?q=python
Subdomain: www
Domain: google
Suffix (TLD): com
--------------------
URL: https://www.bbc.co.uk/news
Subdomain: www
Domain: bbc
Suffix (TLD): co.uk
--------------------
Hostname: api.github.com
Subdomain: api
Domain: github
Suffix (TLD): com
--------------------
URL: http://127.0.0.1:8000
Subdomain: 
Domain: 127.0.0.1
Suffix (TLD): 
--------------------

Advanced Usage & Configuration

A. Updating the TLD List

The library caches the Public Suffix List so it doesn't have to download it on every run. The list changes regularly, however, so long-running applications should refresh it periodically.

import tldextract

# Build an extractor; the module-level tldextract.extract() uses a
# default instance with the same behavior.
extractor = tldextract.TLDExtract()

# Clear the cached copy of the Public Suffix List and re-fetch it now
extractor.update(fetch_now=True)

# Subsequent calls use the freshly fetched list
ext = extractor("https://www.bbc.co.uk/news")
print(ext.suffix)  # 'co.uk'

# The same refresh is also available from the command line:
#   tldextract --update

B. Using a Custom TLD List

If you're in a restricted environment or need to use a specific version of the list, you can provide your own file path.

import os
import tldextract

# Write a small suffix list in Public Suffix List format (one suffix per line)
custom_list_path = os.path.abspath("custom_suffix_list.dat")
with open(custom_list_path, "w") as f:
    f.write("com\norg\nco.uk\nmyapp.dev\n")

# Point tldextract at the local file instead of the live list
extractor = tldextract.TLDExtract(
    suffix_list_urls=[f"file://{custom_list_path}"],
    cache_dir=None,              # don't cache the custom list
    fallback_to_snapshot=False,  # don't fall back to the bundled snapshot
)

# 'myapp.dev' is a suffix in our custom list, so 'blog' becomes the domain
ext = extractor("https://blog.myapp.dev")
print(f"Subdomain: {ext.subdomain}") # ''
print(f"Domain: {ext.domain}")       # 'blog'
print(f"Suffix (TLD): {ext.suffix}") # 'myapp.dev'

C. Handling URLs Without a TLD

tldextract is smart enough to handle invalid or incomplete URLs gracefully.

import tldextract
# No TLD
ext = tldextract.extract("localhost:8000")
print(f"Subdomain: '{ext.subdomain}'") # ''
print(f"Domain: '{ext.domain}'")       # 'localhost'
print(f"Suffix: '{ext.suffix}'")       # '' ('localhost' has no public suffix; the port is stripped)
# Just a path
ext = tldextract.extract("/some/path/to/file.html")
print(f"Subdomain: '{ext.subdomain}'") # ''
print(f"Domain: '{ext.domain}'")       # ''
print(f"Suffix: '{ext.suffix}'")       # ''

Practical Use Cases

Here are common scenarios where tldextract shines.

Use Case 1: Normalizing URLs for Comparison

You want to check if two URLs point to the same domain, ignoring www and protocol differences.

import tldextract

def normalize_domain(url):
    ext = tldextract.extract(url)
    # ext.registered_domain is equivalent to f"{ext.domain}.{ext.suffix}",
    # but returns '' instead of a stray '.' when a part is missing
    return ext.registered_domain
url1 = "https://www.example.com/page"
url2 = "http://example.com/page"
url3 = "https://blog.example.co.uk"
print(f"Normalized 1: {normalize_domain(url1)}") # example.com
print(f"Normalized 2: {normalize_domain(url2)}") # example.com
print(f"Normalized 3: {normalize_domain(url3)}") # example.co.uk
# Now you can easily compare them
if normalize_domain(url1) == normalize_domain(url2):
    print("\nURL 1 and URL 2 belong to the same registered domain.")

Use Case 2: Filtering by Domain

You have a list of URLs and want to find all that belong to a specific organization, like nytimes.com.

import tldextract

urls = [
    "https://www.nytimes.com/2025/10/01/world/europe/uk-economy.html",
    "https://www.bbc.co.uk/news",
    "https://www.nytimes.com/spotlight",
    "https://github.com/psf/requests",
    "https://cooking.nytimes.com/recipes"
]
target_domain = "nytimes.com"
matching_urls = []
for url in urls:
    ext = tldextract.extract(url)
    # We check the full registered domain (domain + suffix)
    registered_domain = f"{ext.domain}.{ext.suffix}"
    if registered_domain == target_domain:
        matching_urls.append(url)
print(f"URLs from {target_domain}:")
for url in matching_urls:
    print(f"- {url}")

Output:

URLs from nytimes.com:
- https://www.nytimes.com/2025/10/01/world/europe/uk-economy.html
- https://www.nytimes.com/spotlight
- https://cooking.nytimes.com/recipes

Notice how cooking.nytimes.com is correctly included because its registered domain is also nytimes.com.


Comparison with Alternatives

tldextract
  • Pros: Accurate, handles complex TLDs, actively maintained, easy to use.
  • Cons: Adds an external dependency; fetches the suffix list (usually automatic).
  • Best for: Almost all URL parsing tasks. This is the recommended, robust solution.

urllib.parse
  • Pros: Built into Python, no dependencies.
  • Cons: Gives you only the raw hostname; splitting it on dots yourself is naive and fails on bbc.co.uk (it would treat uk as the TLD).
  • Best for: Simple parsing where you know the TLDs in advance or don't care about accuracy (e.g., extracting path components).

re (Regex)
  • Pros: Full control over the pattern.
  • Cons: Brittle and hard to maintain; a regex covering the Public Suffix List would be enormous and complex.
  • Best for: Quick-and-dirty scripts where you only expect simple TLDs.

Example of urllib.parse failure:

from urllib.parse import urlparse
url = "https://www.bbc.co.uk/news"
parsed = urlparse(url)
hostname = parsed.netloc
# This is the naive part
parts = hostname.split('.')
print(parts) # ['www', 'bbc', 'co', 'uk']
# You'd have to manually hardcode rules for 'co.uk', 'com.au', etc.
# This is exactly what tldextract automates for you.

Summary

For any serious URL manipulation in Python, tldextract is the go-to library. It saves you from the headache of managing the Public Suffix List yourself and provides reliable, accurate domain parsing. Its simple API makes it a drop-in solution for most use cases.
