
How do you set a proxy with Python's urlopen?

Using a proxy with Python's urllib.request.urlopen is a common task, and the standard library provides a clean, built-in way to do it.


Here’s a comprehensive guide covering the basic method, advanced configurations, and important considerations.

The Core Concept: ProxyHandler

The key to using a proxy with urllib is the ProxyHandler class. You create a handler with your proxy URL(s) and install it into an "opener." This opener is then used to open URLs, which will automatically route your request through the specified proxy.


Basic Example: HTTP Proxy

This is the most straightforward case. Let's assume you have an HTTP proxy server.

Proxy Details:

  • Host: 123.123.123.123
  • Port: 8080

Code

import urllib.request
import urllib.error
# Define your proxy details
proxy_host = '123.123.123.123'
proxy_port = '8080'
proxy_address = f'http://{proxy_host}:{proxy_port}'
# The URL you want to visit
url_to_visit = 'http://httpbin.org/ip' # This site shows your public IP address
try:
    # 1. Create a proxy handler
    # This tells urllib to use the proxy for all HTTP and HTTPS connections.
    proxy_handler = urllib.request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address  # Many proxies handle both HTTP and HTTPS
    })
    # 2. Build an opener with the proxy handler
    opener = urllib.request.build_opener(proxy_handler)
    # 3. Install the opener (optional but recommended)
    # This makes urllib.request.urlopen() use your opener by default.
    urllib.request.install_opener(opener)
    # 4. Open the URL using the standard urlopen function
    # It will now automatically use the proxy we configured.
    print(f"Requesting {url_to_visit} via proxy {proxy_address}...")
    with urllib.request.urlopen(url_to_visit, timeout=10) as response:
        # Read and decode the response content
        response_data = response.read().decode('utf-8')
        print("\n--- Response from Server ---")
        print(response_data)
        print("---------------------------\n")
except urllib.error.URLError as e:
    print(f"Error: Failed to reach the server. Reason: {e.reason}")
    print("This could be due to an incorrect proxy, proxy being down, or network issues.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Explanation

  1. ProxyHandler({'http': ..., 'https': ...}): We create a dictionary mapping the protocol (http, https) to the proxy's URL. If you only need a proxy for http, you can omit the https key.
  2. build_opener(proxy_handler): This function creates an "opener," which is an object capable of opening URLs. We pass our ProxyHandler to it to tell the opener how to handle proxies.
  3. install_opener(opener): This is a convenience function. It installs your custom opener so that any subsequent calls to urllib.request.urlopen() will automatically use it. If you don't want to change the global behavior, you can skip this step and use opener.open(url) directly, as sketched below.
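If you skip install_opener, here is a minimal sketch of calling the opener directly (the proxy address is a placeholder); urlopen() elsewhere in your program keeps its default, proxy-free behavior:

import urllib.request

# Placeholder proxy address - replace with your own
proxy_address = 'http://123.123.123.123:8080'
proxy_handler = urllib.request.ProxyHandler({
    'http': proxy_address,
    'https': proxy_address
})
opener = urllib.request.build_opener(proxy_handler)

# opener.open() routes through the proxy; the global urlopen() is untouched
with opener.open('http://httpbin.org/ip', timeout=10) as response:
    print(response.read().decode('utf-8'))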

Advanced Configurations

a) SOCKS Proxy

The standard urllib library does not support SOCKS proxies directly. For SOCKS, you need a third-party library like requests (with requests[socks]) or urllib3 with a SOCKS proxy handler.

However, if you must use urllib, you either have to establish the SOCKS connection and tunnel your HTTP request through it manually, which is complex, or lean on a third-party SOCKS library.
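One practical workaround, sketched below on the assumption that the third-party PySocks package is installed (pip install PySocks) and that the proxy host and port are placeholders, is to patch the socket module so that every connection urllib makes goes through the SOCKS proxy:

import socket
import urllib.request
import socks  # from the third-party PySocks package

# Placeholder SOCKS5 proxy - replace with your own host and port
socks.set_default_proxy(socks.SOCKS5, 'your_socks_proxy.com', 1080)
socket.socket = socks.socksocket  # route all new sockets through the proxy

with urllib.request.urlopen('http://httpbin.org/ip', timeout=10) as response:
    print(response.read().decode('utf-8'))

Note that this patch is process-wide: every socket created afterwards will use the proxy.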

Easier Alternative with requests:

First, install the necessary library:

pip install requests[socks]

Then, use it like this:

import requests
proxy_host = 'your_socks_proxy.com'
proxy_port = '1080'
proxies = {
    'http': f'socks5://{proxy_host}:{proxy_port}',
    'https': f'socks5://{proxy_host}:{proxy_port}'
}
try:
    response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=10)
    print(response.json())
except requests.exceptions.ProxyError as e:
    print(f"Proxy Error: {e}")

b) Authentication for Proxies

If your proxy requires a username and password, you simply add them to the proxy URL.

Proxy Details:

  • Host: secure-proxy.com
  • Port: 8080
  • Username: myuser
  • Password: mypassword

Code

import urllib.request
import urllib.error
proxy_host = 'secure-proxy.com'
proxy_port = '8080'
username = 'myuser'
password = 'mypassword'
# Format: http://username:password@host:port
proxy_address = f'http://{username}:{password}@{proxy_host}:{proxy_port}'
proxy_handler = urllib.request.ProxyHandler({
    'http': proxy_address,
    'https': proxy_address
})
opener = urllib.request.build_opener(proxy_handler)
# No need to install_opener if you use opener.open directly
url_to_visit = 'http://httpbin.org/ip'
try:
    with opener.open(url_to_visit, timeout=10) as response:
        print(response.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    if e.code == 407:
        print("Error: Proxy Authentication Failed.")
    else:
        print(f"HTTP Error: {e.code} - {e.reason}")
except Exception as e:
    print(f"An error occurred: {e}")

Important Considerations & Best Practices

a) Proxy Rotation

If you are using proxies for web scraping, sending too many requests through a single IP will get you blocked. You need to rotate proxies.

A simple way to do this is to have a list of proxies and pick one at random for each request.

import random
import urllib.request
# List of working proxies
PROXY_LIST = [
    'http://123.123.123.123:8080',
    'http://124.124.124.124:3128',
    'http://125.125.125.125:8888',
]
def get_random_proxy():
    return random.choice(PROXY_LIST)
def make_request(url):
    proxy_address = get_random_proxy()
    print(f"Using proxy: {proxy_address}")
    proxy_handler = urllib.request.ProxyHandler({
        'http': proxy_address,
        'https': proxy_address
    })
    # Create a new opener for each request to handle proxy failures gracefully
    opener = urllib.request.build_opener(proxy_handler)
    try:
        with opener.open(url, timeout=10) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        print(f"Proxy {proxy_address} failed. Reason: {e.reason}")
        # In a real scraper, you'd remove this proxy from your list and try again
        return None
# --- Usage ---
url = 'http://httpbin.org/ip'
response_data = make_request(url)
if response_data:
    print("\n--- Success! ---")
    print(response_data)
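Building on the make_request() function above, a minimal retry sketch (the attempt count of 3 is arbitrary) simply picks a new random proxy until one succeeds or the attempts run out:

def make_request_with_retries(url, max_attempts=3):
    # Each call to make_request() picks a fresh random proxy
    for _ in range(max_attempts):
        result = make_request(url)
        if result is not None:
            return result
    return None

print(make_request_with_retries('http://httpbin.org/ip'))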

b) Timeouts

Always set a timeout when opening URLs. Proxies can be slow or unresponsive, and without a timeout your script could hang indefinitely. A value between 10 and 30 seconds is common.

# Good
with urllib.request.urlopen(url, timeout=15) as response:
    ...
# Bad - can hang forever
# with urllib.request.urlopen(url) as response:
#     ...

c) HTTPS and SSL Proxies

When you specify a proxy for https, urllib sends a CONNECT request to the proxy, which opens a tunnel to the final destination server (e.g., google.com). The SSL/TLS handshake then happens end-to-end with the destination, so the proxy can see which host you are connecting to but cannot read the encrypted traffic, unless it is an intercepting (MITM, Man-in-the-Middle) proxy that presents its own certificate.

d) Error Handling

Proxies are less reliable than direct connections. You should always wrap your urlopen calls in a try...except block to catch:

  • urllib.error.URLError: For general network issues, including proxy connection failures.
  • urllib.error.HTTPError: For specific HTTP error codes, like 407 Proxy Authentication Required.
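Because HTTPError is a subclass of URLError, list the HTTPError clause first; a minimal sketch:

import urllib.request
import urllib.error

try:
    with urllib.request.urlopen('http://httpbin.org/ip', timeout=10) as response:
        print(response.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # Subclass of URLError, so it must be caught before the generic clause
    print(f"HTTP error: {e.code} {e.reason}")
except urllib.error.URLError as e:
    print(f"Network or proxy error: {e.reason}")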

Summary: urllib vs. requests

Feature        | urllib.request (Standard Library)                            | requests (3rd Party Library)
Ease of Use    | More verbose; requires manual setup of handlers and openers. | Much simpler, more intuitive API.
Proxy Support  | HTTP/HTTPS only; no built-in SOCKS support.                  | Excellent support for HTTP, HTTPS, and SOCKS (with requests[socks]).
Authentication | Built-in; just add user:pass@ to the URL.                    | Very flexible; supports different auth types (basic, digest, etc.).