How do you join URLs correctly with Python's urljoin? - 杰瑞科技汇

What is `urljoin`?

urljoin is a function in the urllib.parse module of the Python standard library. Its name is short for "URL join", and its job is to combine a base URL with a relative URL into a complete, absolute URL.

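This is very useful in web scraping and crawling, because links in a page (such as <a href="...">) are often relative paths, and you need a complete URL before you can request them.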
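
As a rough illustration of that scraping use case, the sketch below collects the links from an HTML snippet and resolves each one against the page URL. The class name LinkCollector, the page URL, and the HTML string are all made up for this example; only HTMLParser and urljoin come from the standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags (illustrative helper)."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url  # URL of the page the HTML came from
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve the (possibly relative) href against the page URL
                self.links.append(urljoin(self.page_url, value))


# Hypothetical page URL and HTML snippet
page_url = "https://example.com/blog/index.html"
html = (
    '<a href="post-1.html">One</a> '
    '<a href="/about">About</a> '
    '<a href="https://www.python.org/">Python</a>'
)

collector = LinkCollector(page_url)
collector.feed(html)
for link in collector.links:
    print(link)
```

The three links come out as https://example.com/blog/post-1.html, https://example.com/about, and https://www.python.org/.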

Function signature

from urllib.parse import urljoin
urllib.parse.urljoin(base, url, allow_fragments=True)

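base: the base URL, a string; usually the full URL of the page you are currently on.
url: the URL to resolve against base; it can be a relative URL or an absolute one.
allow_fragments: a boolean that controls whether URL fragments (the part after #) are recognized. It defaults to True, and you normally do not need to change it.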
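
If the goal is simply to drop the fragment from a joined URL (for example, to avoid crawling the same page twice), one possible approach is to leave allow_fragments at its default and strip the fragment afterwards with urllib.parse.urldefrag. This is a sketch of that alternative rather than something urljoin requires, and the URLs are hypothetical.

```python
from urllib.parse import urljoin, urldefrag

base = "https://example.com/docs/guide.html"  # hypothetical page URL

joined = urljoin(base, "intro.html#install")
print(joined)  # https://example.com/docs/intro.html#install

# urldefrag returns a (url, fragment) named tuple
clean, fragment = urldefrag(joined)
print(clean)     # https://example.com/docs/intro.html
print(fragment)  # install
```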

Core logic (very important!)

Understanding how urljoin behaves is the key. Its resolution rules can be summarized as follows:

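url is an absolute URL (starts with http://, https://, ftp://, and so on):
- urljoin ignores base and returns url itself.
- urljoin('https://example.com/base', 'https://another.com/path') -> 'https://another.com/path'
url is a relative path (does not start with /):
- urljoin replaces the last segment of the base path with url. If base ends with /, that last segment is empty, so url is simply appended.
- urljoin('https://example.com/base/path', 'page.html') -> 'https://example.com/base/page.html'
- urljoin('https://example.com/base/path/', 'page.html') -> 'https://example.com/base/path/page.html'
url is a root-relative path (starts with /):
- urljoin replaces the whole path of base, keeping only its scheme and host, then appends url.
- urljoin('https://example.com/base/path', '/new/page.html') -> 'https://example.com/new/page.html'
url contains .. or .:
- These are handled the way a file system handles them.
- .. means go up one directory.
- . means stay in the current directory.
- urljoin('https://example.com/base/path/', '../other.html') -> 'https://example.com/base/other.html'
- urljoin('https://example.com/base/path/', './index.html') -> 'https://example.com/base/path/index.html'
The trailing / on base:
- The trailing / on base matters a great deal.
- If base ends with /, its last segment is treated as a directory.
- If base does not end with /, its last segment is treated as a file.
- This is the easiest place to go wrong!
- urljoin('https://example.com/base/path', 'page.html') -> 'https://example.com/base/page.html' (base treated as a file: path is replaced by page.html)
- urljoin('https://example.com/base/path/', 'page.html') -> 'https://example.com/base/path/page.html' (base treated as a directory)
- urljoin('https://example.com/base/path', '../page.html') -> 'https://example.com/page.html' (base treated as a file: path is the file name, its directory is base/, and one level up is the root)
- urljoin('https://example.com/base/path/', '../page.html') -> 'https://example.com/base/page.html' (base treated as a directory: one level up is base/)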
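
One way to check these rules against your own Python installation is a small table of (base, url, expected) triples, as in the sketch below. The example.com URLs are the illustrative ones used in this section, and the expected values reflect how the standard library resolves them.

```python
from urllib.parse import urljoin

# (base, url, expected) triples covering the rules above
cases = [
    ("https://example.com/base", "https://another.com/path", "https://another.com/path"),
    ("https://example.com/base/path", "page.html", "https://example.com/base/page.html"),
    ("https://example.com/base/path/", "page.html", "https://example.com/base/path/page.html"),
    ("https://example.com/base/path", "/new/page.html", "https://example.com/new/page.html"),
    ("https://example.com/base/path/", "../other.html", "https://example.com/base/other.html"),
    ("https://example.com/base/path/", "./index.html", "https://example.com/base/path/index.html"),
    ("https://example.com/base/path", "../page.html", "https://example.com/page.html"),
]

for base, url, expected in cases:
    result = urljoin(base, url)
    # Fail loudly if the installed Python disagrees with the table
    assert result == expected, (base, url, result)
    print(f"{base!r} + {url!r} -> {result!r}")
```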

Code examples

Let's work through a series of examples to make this concrete.

from urllib.parse import urljoin
# --- Scenario 1: url is an absolute URL ---
base1 = 'https://www.example.com/base/page.html'
url1 = 'https://www.python.org/doc/'
print(f"Scenario 1: {urljoin(base1, url1)}")
# Output: https://www.python.org/doc/ (url is used as-is, base is ignored)
# --- Scenarios 2 and 3: url is a relative path (no leading /) ---
base2 = 'https://www.example.com/base/path/'
url2 = 'page2.html'
print(f"Scenario 2: {urljoin(base2, url2)}")
# Output: https://www.example.com/base/path/page2.html
base3 = 'https://www.example.com/base/path' # note: no trailing / here
url3 = 'page2.html'
print(f"Scenario 3: {urljoin(base3, url3)}")
# Output: https://www.example.com/base/page2.html ('path' is treated as a file, so 'page2.html' replaces it)
# --- Scenario 4: url is a root-relative path (leading /) ---
base4 = 'https://www.example.com/base/path/'
url4 = '/another/page.html'
print(f"Scenario 4: {urljoin(base4, url4)}")
# Output: https://www.example.com/another/page.html (the whole base path is replaced)
# --- Scenarios 5 and 6: url contains .. or . ---
base5 = 'https://www.example.com/base/path/'
url5 = '../other.html'
print(f"Scenario 5: {urljoin(base5, url5)}")
# Output: https://www.example.com/base/other.html
base6 = 'https://www.example.com/base/path/'
url6 = './index.html'
print(f"Scenario 6: {urljoin(base6, url6)}")
# Output: https://www.example.com/base/path/index.html
# --- Scenario 7: fragments (the # part) ---
# allow_fragments=True is the default
base7 = 'https://www.example.com/base/path#section1'
url7 = '#section2'
print(f"Scenario 7 (default): {urljoin(base7, url7)}")
# Output: https://www.example.com/base/path#section2
# With allow_fragments=False, # is no longer parsed as a fragment delimiter,
# so '#section2' is resolved like an ordinary relative path segment
print(f"Scenario 7 (False): {urljoin(base7, url7, allow_fragments=False)}")
# Output: https://www.example.com/base/#section2
# --- Scenario 8: query strings (the ? part) ---
base8 = 'https://www.example.com/base/search?p=1&q=abc'
url8 = '?p=2'
print(f"Scenario 8: {urljoin(base8, url8)}")
# Output: https://www.example.com/base/search?p=2 (the query string is replaced)
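
Two further cases tend to show up in real pages and follow from the absolute-URL rule: hrefs with a different scheme (such as mailto:) and scheme-relative hrefs that start with //. The sketch below shows how urljoin treats them; the URLs and the e-mail address are hypothetical.

```python
from urllib.parse import urljoin

base = "https://www.example.com/base/page.html"  # hypothetical page URL

# A different scheme: url is returned unchanged, base is ignored
mail = urljoin(base, "mailto:someone@example.com")
print(mail)  # mailto:someone@example.com

# A scheme-relative URL: only the scheme is taken from base
cdn = urljoin(base, "//cdn.example.com/lib.js")
print(cdn)  # https://cdn.example.com/lib.js
```

In a crawler you may want to skip results whose scheme is not http or https before requesting them.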

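Common pitfalls and best practices

Pitfall: does the base URL have to end with /?

Not necessarily. What matters is that base is a complete, valid URL and that its trailing / reflects what the URL actually is: keep it when the last segment is a directory, and omit it when the last segment is a file.

In a crawler, base is usually the URL of the current page. If you extract url from a link such as <a href="about.html">, base should be the full URL of the page that contains that link.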

from urllib.parse import urljoin
# Suppose this is the page you are currently visiting
current_page_url = 'https://news.sina.com.cn/world/'
# Link 1: relative path
link1 = '2025/12/01/doc-imzxyzxy...html'
full_url1 = urljoin(current_page_url, link1)
print(full_url1)
# Output: https://news.sina.com.cn/world/2025/12/01/doc-imzxyzxy...html (correct)
# Link 2: absolute path (starts with /)
link2 = '/sports/2025/12/01/doc-imzxyzxy...html'
full_url2 = urljoin(current_page_url, link2)
print(full_url2)
# Output: https://news.sina.com.cn/sports/2025/12/01/doc-imzxyzxy...html (correct)
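
When you know that the last segment of a base URL is a directory but the URL arrives without its trailing /, one option is to normalize it before joining. The helper as_directory below is hypothetical (it is not part of urllib), and the URL is made up; treat it as a minimal sketch.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def as_directory(base):
    """Return base with a trailing / on its path (hypothetical helper).

    Only use this when the last path segment is known to be a directory.
    """
    parts = urlsplit(base)
    path = parts.path if parts.path.endswith("/") else parts.path + "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


base = "https://example.com/docs/v2"  # hypothetical URL; v2 is a directory

print(urljoin(base, "intro.html"))                # https://example.com/docs/intro.html
print(urljoin(as_directory(base), "intro.html"))  # https://example.com/docs/v2/intro.html
```

Applying this blindly to page URLs such as .../index.html would be wrong, which is why the helper is opt-in.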

Best practice: always use a full URL as `base`

In crawler logic, the base you join against should always be the full URL of the current page. Do not use an abbreviated path as base.

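# Wrong
base = 'www.example.com/path' # missing scheme
url = 'page.html'
print(urljoin(base, url))
# Output: www.example.com/page.html (without a scheme the host is parsed as part of the path; probably not what you want)
# Right
base = 'https://www.example.com/path/' # full URL; the trailing / marks path as a directory
url = 'page.html'
print(urljoin(base, url))
# Output: https://www.example.com/path/page.html (correct)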
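
To catch an incomplete base early, you could validate it before joining. The function join_checked below is a hypothetical wrapper, shown only as a sketch of the idea; it relies on urlparse reporting an empty scheme or netloc for a base such as 'www.example.com/path'.

```python
from urllib.parse import urljoin, urlparse


def join_checked(base, url):
    """Join url onto base, refusing a base without scheme or host (hypothetical helper)."""
    parsed = urlparse(base)
    if not parsed.scheme or not parsed.netloc:
        raise ValueError(f"base must be an absolute URL, got {base!r}")
    return urljoin(base, url)


print(join_checked("https://www.example.com/path/", "page.html"))
# https://www.example.com/path/page.html

try:
    join_checked("www.example.com/path", "page.html")
except ValueError as exc:
    print(f"rejected: {exc}")
```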

Using it together with `urlparse`

urljoin is often used together with urllib.parse.urlparse. urlparse splits a URL into its components (scheme, host, path, and so on).

from urllib.parse import urljoin, urlparse
url = 'https://www.example.com/base/path/page.html?query=1#fragment'
parsed = urlparse(url)
print(f"Parsed: {parsed}")
# Output: ParseResult(scheme='https', netloc='www.example.com', path='/base/path/page.html', params='', query='query=1', fragment='fragment')
# You can easily read individual components
print(f"Scheme: {parsed.scheme}")
print(f"Host: {parsed.netloc}")
print(f"Path: {parsed.path}")
# Rebuild a base URL without query or fragment, then join a new relative URL
base_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
new_relative_url = '../other.html'
full_new_url = urljoin(base_url, new_relative_url)
print(f"Joined URL: {full_new_url}")
# Output: https://www.example.com/base/other.html
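
Another common pairing, sketched here under the assumption that a crawler should stay on one host, is to resolve each link with urljoin and then compare hosts with urlparse. The function same_site_links and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_site_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep those on the same host (illustrative)."""
    host = urlparse(page_url).netloc
    resolved = (urljoin(page_url, href) for href in hrefs)
    return [u for u in resolved if urlparse(u).netloc == host]


page_url = "https://www.example.com/base/path/page.html"  # hypothetical
hrefs = ["../other.html", "/contact", "https://www.python.org/doc/", "?page=2"]

for link in same_site_links(page_url, hrefs):
    print(link)
```

The python.org link is dropped; the other three resolve to URLs on www.example.com.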

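Type of `url`	Behavior of `urljoin(base, url)`	Example (`base='https://a.com/b/c/'`)
Absolute URL	Ignores `base` and returns `url`	`urljoin(base, 'https://x.com')` -> `https://x.com`
Relative path	Replaces the last segment of the `base` path with `url` (appends when `base` ends with `/`)	`urljoin(base, 'd')` -> `https://a.com/b/c/d`
Root-relative path (starts with `/`)	Replaces the whole `base` path	`urljoin(base, '/e/f')` -> `https://a.com/e/f`
`..` (parent directory)	Moves up one level in the `base` path	`urljoin(base, '../d')` -> `https://a.com/b/d`
`.` (current directory)	Resolves against the current directory of `base` (same effect as a plain relative path)	`urljoin(base, './d')` -> `https://a.com/b/c/d`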

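At its core, urljoin is about resolving relative paths intelligently. Pay particular attention to how the trailing / on base affects the result. In crawler development, using urljoin correctly is a key step in building valid links.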
