
How Do You Achieve Efficient Multitasking with Concurrency in Python 2?

Let's dive into concurrency in Python 2. This is a crucial topic because Python 2's approach to concurrency differs in important ways from modern Python 3's.


The Core Concept: The Global Interpreter Lock (GIL)

Before we talk about the tools, you must understand the Global Interpreter Lock (GIL).

  • What it is: The GIL is a mutex (a lock) that protects access to Python objects, preventing multiple native threads from executing Python bytecode at the same time within a single process.
  • The Implication: This means that even if you use multiple threads, only one thread can execute Python code at any given moment. This prevents true parallelism on multi-core processors for CPU-bound tasks.
  • Why it exists: It simplifies memory management in CPython (the standard Python implementation). Reference counting is not thread-safe on its own, so the GIL acts as one giant lock around the interpreter's internals. Note that it protects the interpreter, not your program's data: your own shared state can still suffer race conditions, which is why primitives like Lock and Queue exist.

So, when does concurrency help in Python 2?

  1. I/O-Bound Tasks: When a thread is waiting for an external operation to complete (like reading a file, making a network request, or querying a database), it releases the GIL. This allows another thread to run. This is the primary use case for threading in Python 2/3.
  2. CPU-Bound Tasks: For tasks that are purely computational and involve no I/O (e.g., heavy calculations, image processing), the GIL is a bottleneck. For these tasks in Python 2, the best tool is multiprocessing, which gets around the GIL by using separate processes, each with its own memory space and its own GIL. The sketch just below shows why threads don't help here.
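You can see the GIL's effect directly. Here is a minimal sketch (the countdown workload is made up purely for illustration): on CPython 2, the two-thread version is typically no faster than the sequential one, and often slightly slower due to lock contention.

# gil_demo.py
import threading
import time
def countdown(n):
    """A pure CPU-bound loop; it never releases the GIL by waiting on I/O."""
    while n > 0:
        n -= 1
if __name__ == "__main__":
    N = 10000000
    # Sequential: one thread does all the work
    start = time.time()
    countdown(N)
    countdown(N)
    print "Sequential: %.2f seconds" % (time.time() - start)
    # Threaded: two threads split the same work, but the GIL allows only
    # one of them to execute Python bytecode at any given moment
    start = time.time()
    t1 = threading.Thread(target=countdown, args=(N,))
    t2 = threading.Thread(target=countdown, args=(N,))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    print "Two threads: %.2f seconds" % (time.time() - start)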

The Main Concurrency Tools in Python 2

Python 2 provided two primary modules for concurrency: threading and multiprocessing. A third, multiprocessing.dummy, is a lesser-known but very useful wrapper.

threading Module

The threading module is the standard way to handle multiple threads. It's perfect for I/O-bound applications.


Use Case: Making multiple network requests, reading from multiple files, or any task that spends most of its time waiting.

Key Concepts:

  • Thread: The class used to create and manage a new thread.
  • Lock: A synchronization primitive to prevent race conditions when multiple threads try to modify the same shared resource.
  • Queue: A thread-safe data structure for passing data between threads. This is strongly recommended over manually sharing lists or dictionaries; see the producer/consumer sketch after this list.
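Here is a minimal producer/consumer sketch using Queue (the worker count and task items are arbitrary placeholders):

# queue_workers.py
import threading
import Queue  # renamed to the lowercase `queue` module in Python 3
task_queue = Queue.Queue()
def worker():
    """Pull items off the shared queue until a None sentinel arrives."""
    while True:
        item = task_queue.get()
        if item is None:
            # Sentinel value: mark it done and exit the thread
            task_queue.task_done()
            break
        print "Processing %s" % item
        task_queue.task_done()
if __name__ == "__main__":
    threads = [threading.Thread(target=worker) for _ in range(3)]
    for t in threads:
        t.start()
    for item in ["task-a", "task-b", "task-c", "task-d"]:
        task_queue.put(item)
    # One sentinel per worker so every thread can shut down cleanly
    for _ in threads:
        task_queue.put(None)
    task_queue.join()  # blocks until every item has been marked done
    for t in threads:
        t.join()

The sentinel pattern lets each worker exit on its own once the queue is drained, so you never have to forcibly kill a thread.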

Example: Downloading Multiple URLs with Threads

This classic example downloads several web pages concurrently. Because each thread releases the GIL while it waits on the network, the total time is close to that of the slowest single download rather than the sum of all of them.

# concurrent_downloader.py
import threading
import urllib2
import time
# A shared list to store results
results = []
# A lock to prevent race conditions when appending to the list
lock = threading.Lock()
def download_url(url):
    """Downloads a single URL and appends its content length to results."""
    try:
        print "Starting download: %s" % url
        response = urllib2.urlopen(url)
        content = response.read()
        # Use the lock to safely modify the shared list
        with lock:
            results.append((url, len(content)))
            print "Finished download: %s (size: %d bytes)" % (url, len(content))
    except Exception as e:
        print "Error downloading %s: %s" % (url, e)
if __name__ == "__main__":
    urls = [
        'http://www.python.org',
        'http://www.yahoo.com',
        'http://www.google.com',
        'http://www.apache.org',
        'http://www.github.com'
    ]
    start_time = time.time()
    # Create a list of thread objects
    threads = []
    for url in urls:
        thread = threading.Thread(target=download_url, args=(url,))
        threads.append(thread)
        thread.start()
    # Wait for all threads to complete
    for thread in threads:
        thread.join()
    end_time = time.time()
    print "\n--- All downloads complete ---"
    for url, size in results:
        print "%s: %d bytes" % (url, size)
    print "\nTotal time: %f seconds" % (end_time - start_time)

To run this, you'd execute it from the command line. You'll see the output messages from different threads interleaved, demonstrating that they are running concurrently.

multiprocessing Module

The multiprocessing module was introduced in Python 2.6 to address the GIL limitation for CPU-bound tasks. It creates new processes, each with its own Python interpreter and memory space.

Use Case: Video encoding, scientific calculations, data processing, any task that is heavy on CPU.

Key Concepts:

  • Process: The class used to create and manage a new process.
  • Queue, Pipe: Inter-process communication (IPC) mechanisms for passing data between processes, which is necessary because processes don't share memory (see the sketch after this list).
  • Pool: A high-level abstraction that manages a pool of worker processes, making it easy to parallelize a function across multiple inputs.
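To make the IPC point concrete, here is a minimal Process-plus-Queue sketch (the squaring job is purely illustrative):

# process_ipc.py
import multiprocessing
def square(numbers, result_queue):
    """Run in a child process; send each result back through the queue."""
    for n in numbers:
        result_queue.put((n, n * n))
if __name__ == "__main__":
    result_queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=square, args=([1, 2, 3], result_queue))
    p.start()
    # Drain the queue before joining to avoid blocking on a full pipe
    for _ in range(3):
        n, sq = result_queue.get()
        print "%d squared is %d" % (n, sq)
    p.join()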

Example: CPU-Bound Task with multiprocessing.Pool

Let's create a function that simulates a heavy computation and then run it on multiple inputs in parallel.

# cpu_bound_worker.py
import multiprocessing
import time
import random
def is_prime(n):
    """A CPU-bound function to check if a number is prime."""
    if n <= 1:
        return False
    if n == 2:
        return True
    if n % 2 == 0:
        return False
    for i in range(3, int(n**0.5) + 1, 2):
        if n % i == 0:
            return False
    return True
def check_prime_chunk(numbers_chunk):
    """Processes a chunk of numbers and returns the primes found."""
    primes_found = []
    for num in numbers_chunk:
        if is_prime(num):
            primes_found.append(num)
    return primes_found
if __name__ == "__main__":
    # We use the `if __name__ == "__main__":` guard to prevent issues on some platforms
    # (like Windows) when importing this module.
    # Generate a large list of random numbers to check
    all_numbers = [random.randint(1, 100000) for _ in range(100000)]
    # Create a pool of 4 worker processes
    pool = multiprocessing.Pool(processes=4)
    # Split the work into explicit chunks (used only by the commented-out
    # alternative below); pool.map can also choose a chunk size itself.
    chunk_size = len(all_numbers) / 4  # integer division in Python 2
    start_time = time.time()
    # map() blocks until all results are ready
    results = pool.map(is_prime, all_numbers)
    # Alternatively, for more control, we can use chunks
    # results_chunks = pool.map(check_prime_chunk, [all_numbers[i:i+chunk_size] for i in range(0, len(all_numbers), chunk_size)])
    # all_primes = [prime for chunk in results_chunks for prime in chunk]
    end_time = time.time()
    print "Checked %d numbers." % len(all_numbers)
    print "Total time: %f seconds" % (end_time - start_time)
    pool.close()
    pool.join()

When you run this, you'll see that it uses all available CPU cores (up to 4 in this case), significantly speeding up the computation compared to a single-threaded version.
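If you want results back one at a time instead of all at once, Pool also provides apply_async. Here is a minimal sketch, assuming the is_prime function from the cpu_bound_worker.py example above:

# async_primes.py
import multiprocessing
from cpu_bound_worker import is_prime  # the module defined above
if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    candidates = [104729, 104730, 104731]
    # Submit each job individually; apply_async returns immediately
    async_results = [pool.apply_async(is_prime, (n,)) for n in candidates]
    pool.close()
    for n, res in zip(candidates, async_results):
        # get() blocks until that particular result is ready
        print "%d prime? %s" % (n, res.get())
    pool.join()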

multiprocessing.dummy Module (The Hidden Gem)

This is a thread-backed wrapper that ships inside the multiprocessing package: it exposes the exact multiprocessing.Pool API, but implemented with threads instead of processes (the threading module itself has no public pool class).

What it does: It provides a Pool class that uses threads instead of processes.

Why is it useful? It gives you the clean, high-level API of multiprocessing.Pool but for threading. This is great for I/O-bound tasks where you want to write concise code without manually managing threading.Thread objects.

Example: Using multiprocessing.dummy for the downloader

This achieves the same result as the first example but is often considered more readable.

# dummy_downloader.py
import urllib2
import time
from multiprocessing.dummy import Pool as ThreadPool
def download_url(url):
    """Downloads a single URL and returns its URL and content length."""
    try:
        print "Starting download: %s" % url
        response = urllib2.urlopen(url)
        content = response.read()
        print "Finished download: %s (size: %d bytes)" % (url, len(content))
        return (url, len(content))
    except Exception as e:
        print "Error downloading %s: %s" % (url, e)
        return (url, 0)
if __name__ == "__main__":
    urls = [
        'http://www.python.org',
        'http://www.yahoo.com',
        'http://www.google.com',
        'http://www.apache.org',
        'http://www.github.com'
    ]
    start_time = time.time()
    # Create a ThreadPool with 4 worker threads
    pool = ThreadPool(processes=4)
    # map() applies the function to each item in the list and returns a list of results
    results = pool.map(download_url, urls)
    # Close the pool and wait for the work to finish
    pool.close()
    pool.join()
    end_time = time.time()
    print "\n--- All downloads complete ---"
    for url, size in results:
        if size > 0:
            print "%s: %d bytes" % (url, size)
    print "\nTotal time: %f seconds" % (end_time - start_time)

Summary: Which One to Use in Python 2?

  • I/O-bound task (network, disk, database) → threading or multiprocessing.dummy.Pool: threads release the GIL while waiting, letting other threads run, and dummy.Pool gives threads a clean high-level API.
  • CPU-bound task (heavy calculations) → multiprocessing: bypasses the GIL with separate processes, allowing true parallel execution on multiple cores.
  • Sharing simple data between threads → threading.Lock: essential for preventing race conditions when modifying shared variables.
  • Passing data between threads/processes → Queue.Queue (threads) or multiprocessing.Queue (processes): safe data structures designed for producer-consumer patterns.
  • Parallelizing a function easily → multiprocessing.Pool: a high-level API that manages a pool of workers and hides the complexity of raw Process objects.

The Modern Path: Python 3

It's important to know that Python 3 has refined these concepts:

  • Python 2's Queue module was renamed to the lowercase queue; multiprocessing still provides its own process-safe Queue.
  • The concurrent.futures module provides a high-level, unified interface for both ThreadPoolExecutor and ProcessPoolExecutor, making the choice even cleaner (see the sketch after this list).
  • The asyncio module offers a different concurrency model based on single-threaded asynchronous I/O, which is extremely powerful for high-performance network services.
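For comparison, here is a minimal Python 3 sketch of the earlier downloader using concurrent.futures; swapping ThreadPoolExecutor for ProcessPoolExecutor switches it from threads to processes with no other changes:

# futures_downloader.py (Python 3)
import time
from urllib.request import urlopen  # Python 3's replacement for urllib2
from concurrent.futures import ThreadPoolExecutor, as_completed
def download_url(url):
    return url, len(urlopen(url).read())
if __name__ == "__main__":
    urls = ['http://www.python.org', 'http://www.apache.org']
    start = time.time()
    with ThreadPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(download_url, u) for u in urls]
        for future in as_completed(futures):
            url, size = future.result()
            print("%s: %d bytes" % (url, size))
    print("Total time: %f seconds" % (time.time() - start))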

However, if you are maintaining or working on a Python 2 codebase, understanding threading and multiprocessing is essential.
