杰瑞科技汇

How do you use Python's urllib.request.urlopen()?

Python has no function called urllib.open. The function for opening URLs is urllib.request.urlopen(), which lives in the built-in urllib.request module and is the standard library's basic tool for making HTTP requests.

Let's break it down, from the basic concept to more advanced usage.

The Core Concept: urllib.request.urlopen()

The main function you'll use is urllib.request.urlopen(). It opens a URL (like a webpage) and returns a file-like object. This means you can read from it just like you would read from a local file.

Basic Example: Fetching a Web Page

Here's the simplest way to use it to get the HTML of a webpage.

import urllib.error
import urllib.request
# The URL you want to open
url = 'https://www.example.com'
try:
    # Open the URL
    # This returns a response object
    with urllib.request.urlopen(url) as response:
        # Read the response and decode it from bytes to a string
        html_content = response.read().decode('utf-8')
        # Print the first 500 characters
        print(html_content[:500])
except urllib.error.URLError as e:
    print(f"Failed to open the URL: {e.reason}")

What's happening here?

  1. import urllib.request: We import the necessary module.
  2. with urllib.request.urlopen(url) as response:: This opens the URL. The with statement is best practice as it automatically closes the connection for you. The result, response, is a file-like object.
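  3. response.read(): This reads the entire content of the response from the server. It always returns bytes, not str.
  4. .decode('utf-8'): We convert the bytes object into a string. UTF-8 is the most common encoding for web pages, but a page served in another encoding has to be decoded with that encoding instead.
  5. except urllib.error.URLError: This is good practice. If the host can't be found, the server is down, or there's a network problem, urlopen() raises a URLError. An HTTP error status such as 404 raises HTTPError, which is a subclass of URLError, so this handler catches it too. A malformed URL (for example, one with no scheme) raises ValueError instead.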
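
urlopen() can fail in more than one way: HTTPError when the server answers with an error status, URLError when no usable answer arrives, and ValueError when the URL itself is malformed. The following is a minimal sketch of telling them apart. The helper name describe_fetch is hypothetical and exists only for illustration, and a timeout that strikes while reading the response is not covered here.

```python
import urllib.error
import urllib.request


def describe_fetch(url, timeout=5):
    """Try to open url and return a short description of the outcome.

    describe_fetch is a hypothetical helper used only for illustration.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"OK: status {response.status}"
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status (4xx or 5xx).
        # HTTPError is a subclass of URLError, so it must be caught first.
        return f"HTTP error: {e.code} {e.reason}"
    except urllib.error.URLError as e:
        # No usable answer: DNS failure, refused connection, and so on.
        return f"Network error: {e.reason}"
    except ValueError as e:
        # The URL itself is malformed, for example it has no scheme.
        return f"Bad URL: {e}"
```

Calling describe_fetch('not-a-url') takes the ValueError branch without touching the network, while describe_fetch('https://www.example.com') should normally take the success branch.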

Working with the Response Object

The object returned by urlopen() has several useful attributes and methods:

  • response.read(): Reads the entire body of the response.
  • response.readline(): Reads one line at a time.
  • response.readlines(): Reads all lines into a list.
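  • response.status: The HTTP status code (e.g., 200 for OK). By default, error statuses such as 404 never show up here, because urlopen() raises HTTPError for them.
  • response.getcode(): An older equivalent of response.status, deprecated since Python 3.9.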
  • response.headers: A dictionary-like object containing the response headers (e.g., Content-Type, Server).
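
Because the response behaves like a file, it can also be streamed to disk in chunks instead of being read into memory in one call. The sketch below assumes a hypothetical helper named download_to_file; the name, the default timeout, and the returned byte count are illustrative choices, not part of urllib.

```python
import os
import shutil
import urllib.request


def download_to_file(url, path, timeout=10):
    """Stream the body of url into the file at path; return its size in bytes.

    download_to_file is a hypothetical helper used only for illustration.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(path, "wb") as out_file:
            # copyfileobj() reads and writes in chunks, so the whole body
            # is never held in memory at once.
            shutil.copyfileobj(response, out_file)
    return os.path.getsize(path)
```

A call such as download_to_file('https://www.example.com', 'example.html') would save the page next to the script.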

Example: Inspecting the Response

import urllib.error
import urllib.request
url = 'https://httpbin.org/get' # A great site for testing HTTP requests
try:
    with urllib.request.urlopen(url) as response:
        print(f"Status Code: {response.status}")
        print("-" * 30)
        print("Headers:")
        for header, value in response.headers.items():
            print(f"{header}: {value}")
        print("-" * 30)
        print("Response Body (first 200 chars):")
        body = response.read().decode('utf-8')
        print(body[:200])
except urllib.error.URLError as e:
    print(f"Error: {e.reason}")
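
The headers are also useful for decoding the body correctly. Rather than hard-coding UTF-8, a script can ask the headers object which charset the server declared. This is a minimal sketch; read_text is a hypothetical helper name, and falling back to UTF-8 when no charset is declared is an assumption that suits most modern pages but not all.

```python
import urllib.request


def read_text(url, default_encoding="utf-8", timeout=10):
    """Return the body of url as str, using the charset the server declares.

    read_text is a hypothetical helper used only for illustration.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # get_content_charset() reads the charset parameter of the
        # Content-Type header and returns None when there is none.
        encoding = response.headers.get_content_charset() or default_encoding
        # errors="replace" avoids a crash if the declared charset is wrong.
        return response.read().decode(encoding, errors="replace")
```

For instance, read_text('https://www.example.com') returns the page as a string without the caller having to name an encoding.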

Making POST Requests

By default, urlopen() makes a GET request. To make a POST request, you need to pass some extra data.

The data must be encoded into bytes.

Example: Making a POST Request

import urllib.error
import urllib.parse
import urllib.request
url = 'https://httpbin.org/post'
# Data to send in the POST request
# This should be a dictionary
data = {
    'username': 'testuser',
    'password': 'securepassword123'
}
# Encode the data into bytes
# urllib.parse.urlencode() is perfect for this
post_data = urllib.parse.urlencode(data).encode('utf-8')
try:
    # Create a request object with the URL and data
    request = urllib.request.Request(url, data=post_data, method='POST')
    # Open the request
    with urllib.request.urlopen(request) as response:
        response_body = response.read().decode('utf-8')
        print("POST Request Successful!")
        print(response_body)
except urllib.error.URLError as e:
    print(f"Error: {e.reason}")

Key changes for POST:

  1. urllib.parse.urlencode(data): This takes a dictionary and turns it into a URL-encoded string like username=testuser&password=securepassword123.
  2. .encode('utf-8'): The urlopen() function requires the data to be in bytes.
  3. urllib.request.Request(url, data=post_data, method='POST'): We create a Request object, which allows us to specify the data and the HTTP method.
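
A Request object can be built and inspected without sending anything, which makes it easy to check what urlencode() produces and which HTTP method will be used. The sketch below uses made-up form values. It also shows that supplying data is enough to make the request a POST, so method='POST' is explicit rather than required.

```python
import urllib.parse
import urllib.request

# Made-up form values, chosen to include characters that need escaping.
form = {"username": "testuser", "note": "a b&c"}

# urlencode() escapes special characters: a space becomes "+", and
# "&" becomes "%26" so it cannot be mistaken for a field separator.
encoded = urllib.parse.urlencode(form)
print(encoded)  # username=testuser&note=a+b%26c

body = encoded.encode("utf-8")

# Building a Request sends nothing yet, so it is safe to inspect.
with_data = urllib.request.Request("https://httpbin.org/post", data=body)
without_data = urllib.request.Request("https://httpbin.org/get")

print(with_data.get_method())     # POST
print(without_data.get_method())  # GET
```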

Adding Headers (e.g., User-Agent)

Some websites block default urllib requests because they don't look like they come from a real browser: by default, urllib identifies itself with a User-Agent of the form Python-urllib/3.x. You can add your own headers to the request to change this.

Example: Adding a User-Agent Header

import urllib.error
import urllib.request
url = 'https://httpbin.org/user-agent' # This endpoint returns the User-Agent it sees
# Create a dictionary of headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': 'application/json' # Ask for JSON data
}
# Create a request object and add the headers
request = urllib.request.Request(url, headers=headers)
try:
    with urllib.request.urlopen(request) as response:
        response_body = response.read().decode('utf-8')
        print("Request with custom User-Agent:")
        print(response_body)
except urllib.error.URLError as e:
    print(f"Error: {e.reason}")
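
Headers can be checked locally before the request is sent. One detail worth knowing is that Request stores header names in capitalized form, so lookups must use that spelling. The sketch below uses a placeholder User-Agent value (my-script/1.0) that is purely illustrative.

```python
import urllib.request

request = urllib.request.Request(
    "https://httpbin.org/user-agent",
    headers={"User-Agent": "my-script/1.0"},  # placeholder value
)
# add_header() is an alternative to the headers= argument.
request.add_header("Accept", "application/json")

# Request normalizes header names with str.capitalize(), so
# "User-Agent" is stored as "User-agent".
print(request.header_items())
print(request.get_header("User-agent"))  # my-script/1.0
print(request.has_header("User-Agent"))  # False: the lookup is case-sensitive
```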

Handling Timeouts

If a server is slow or unresponsive, your program could hang indefinitely. You should always set a timeout.

Example: Setting a Timeout

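import socket
import urllib.error
import urllib.request
url = 'https://httpbin.org/delay/5' # This endpoint waits 5 seconds before responding
try:
    # Set a timeout of 3 seconds
    # The request will fail if the server doesn't respond within 3 seconds
    with urllib.request.urlopen(url, timeout=3) as response:
        print("Request succeeded within the timeout!")
        print(response.read().decode('utf-8'))
except socket.timeout:
    # A timeout while waiting for the response is raised directly,
    # not wrapped in URLError
    print("Request timed out")
except urllib.error.URLError as e:
    # Connection-level failures, including a timeout while connecting
    print(f"Request failed: {e.reason}")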
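
A timeout is often temporary, so one common follow-up is to try again a few times before giving up. The following is a rough sketch of that idea, not a production retry policy. fetch_with_retries, its parameters, and the fixed pause between attempts are all assumptions made for illustration.

```python
import socket
import time
import urllib.error
import urllib.request


def fetch_with_retries(url, timeout=3.0, attempts=3, pause=0.5):
    """Return the response body as bytes, or None if every attempt fails.

    fetch_with_retries is a hypothetical helper used only for illustration.
    """
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.HTTPError:
            # The server answered with an error status. Retrying blindly
            # rarely helps, so let the caller decide what to do.
            raise
        except (urllib.error.URLError, socket.timeout) as exc:
            # Connection problems and timeouts may be temporary.
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < attempts:
                time.sleep(pause)
    return None
```

With the delay endpoint used above, fetch_with_retries('https://httpbin.org/delay/5', timeout=3) would be expected to time out on every attempt and return None.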

Summary: urllib.request.urlopen() vs. requests

While urllib is built-in and powerful, the requests library is much more user-friendly and is the de facto standard for most Python developers.

| Feature | urllib.request.urlopen() | requests library |
| --- | --- | --- |
| Ease of Use | Verbose, requires manual encoding and decoding. | Very simple, intuitive syntax. |
| JSON Handling | Requires manual parsing with json.loads(). | response.json() method for automatic parsing. |
| Sessions | No built-in session object. | requests.Session() handles cookies and connections across requests. |
| Installation | Built-in with Python. | pip install requests |
| Recommendation | Good for simple scripts or environments where you can't install packages. | Highly recommended for almost all HTTP tasks in Python. |

Example with requests for comparison:

# Using the requests library
import requests
url = 'https://httpbin.org/get'
params = {'key1': 'value1', 'key2': 'value2'}
try:
    # GET request with params
    response = requests.get(url, params=params, timeout=5)
    # Raise an exception if the status code is 4xx or 5xx
    response.raise_for_status()
    # Access JSON data directly
    data = response.json()
    print("Status Code:", response.status_code)
    print("JSON Response:", data)
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")
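
For a side-by-side view, the same GET request with query parameters and JSON parsing can be approximated with only the standard library. This is a sketch under stated assumptions: build_url and get_json are hypothetical helper names, and build_url assumes the base URL has no query string of its own.

```python
import json
import urllib.parse
import urllib.request


def build_url(base_url, params):
    """Append params to base_url as a query string.

    Assumes base_url does not already contain a query string.
    """
    return f"{base_url}?{urllib.parse.urlencode(params)}"


def get_json(url, timeout=5):
    """Fetch url and parse the body as JSON.

    Like raise_for_status() in requests, urlopen() raises an exception
    (urllib.error.HTTPError) when the server returns a 4xx or 5xx status.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # json.loads() accepts bytes and detects UTF-8/16/32 itself.
        return json.loads(response.read())


url = build_url("https://httpbin.org/get", {"key1": "value1", "key2": "value2"})
print(url)  # https://httpbin.org/get?key1=value1&key2=value2
```

Calling get_json(url) then plays the role of response.json() in the requests version.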

Final Recommendation

For learning purposes, understanding urllib is great because it's built-in and shows you what's happening under the hood. However, for any serious project, use the requests library. It will save you a lot of time and make your code cleaner and more robust.
