杰瑞科技汇

Python Concurrent Programming: Multithreading or Multiprocessing?

Concurrency is essential for writing efficient Python programs, especially when dealing with I/O-bound tasks like network requests, database calls, or file operations.

Let's break it down from the basics to the more advanced concepts.

The Core Problem: The Global Interpreter Lock (GIL)

Before diving into the solutions, you must understand the Global Interpreter Lock (GIL).

  • What is it? The GIL is a mutex (lock) in CPython, the reference Python implementation. It protects access to Python objects by preventing multiple native threads from executing Python bytecode at the same time within a single process.
  • What does it mean? On standard CPython builds, threads cannot run CPU-bound Python code in parallel across cores. Only one thread executes Python bytecode at any given moment.
  • When does it matter?
    • CPU-bound tasks: Tasks that are heavy on computation (e.g., complex math, image processing, data crunching). The GIL will prevent threads from running in parallel, so you won't get a speedup.
    • I/O-bound tasks: Tasks that spend most of their time waiting for external operations (e.g., reading a file, making a network request, waiting for a database query). While one thread is waiting, the GIL is released, allowing another thread to run. This is where threads excel.
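
The difference between the two cases can be observed with a small timing experiment. The sketch below is illustrative only: io_task, cpu_task, and run_in_threads are hypothetical helper names, and time.sleep stands in for a real network or disk wait because it releases the GIL in the same way.

```python
import threading
import time


def io_task(delay):
    # time.sleep releases the GIL, standing in for a network or disk wait
    time.sleep(delay)


def cpu_task(n):
    # Pure-Python loop: the thread needs the GIL for every iteration
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_in_threads(target, args, count):
    """Run `target` in `count` threads and return the elapsed wall-clock time."""
    threads = [threading.Thread(target=target, args=args) for _ in range(count)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Four 0.2s waits overlap, so the total stays far below 0.8s
    print(f"I/O-style, 4 threads: {run_in_threads(io_task, (0.2,), 4):.2f}s")

    # Baseline: the same CPU work done four times in one thread
    start = time.perf_counter()
    for _ in range(4):
        cpu_task(2_000_000)
    print(f"CPU-bound, sequential: {time.perf_counter() - start:.2f}s")

    # With the GIL, this usually takes about as long as the sequential run
    print(f"CPU-bound, 4 threads: {run_in_threads(cpu_task, (2_000_000,), 4):.2f}s")
```

Exact timings depend on the machine and the Python build, so treat the printed numbers as a rough indication rather than a benchmark.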

Because of the GIL, Python offers several concurrency models, each suited for different scenarios.


The Four Main Concurrency Models in Python

Here are the primary tools you'll use.

Threads (threading module)

Threads are the most basic form of concurrency. They are lightweight and share the same memory space.

  • Best for: I/O-bound tasks. While one thread is waiting for a network response, another can continue its work.
  • How it works: You create multiple threads that run parts of your program concurrently. The Operating System switches between them rapidly.
  • Caveat: Due to the GIL, threads do not speed up CPU-bound Python code. They can still help with CPU-bound work when the heavy lifting happens in code that releases the GIL (e.g., many operations in C libraries like NumPy).
  • Key Challenge: Shared State. Because all threads in a process share memory, you must use synchronization primitives like Lock or Semaphore, or thread-safe structures like queue.Queue, to prevent race conditions and data corruption.
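
As a minimal sketch of protecting shared state, the example below guards a counter with threading.Lock. The Counter class and count_with_threads function are hypothetical names chosen for illustration, not part of any library.

```python
import threading


class Counter:
    """A counter that several threads can safely increment."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread at a time may run the read-modify-write step
        with self._lock:
            self.value += 1


def count_with_threads(num_threads, increments_per_thread):
    counter = Counter()

    def work():
        for _ in range(increments_per_thread):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads(4, 10_000))
```

Without the lock, `self.value += 1` is a read followed by a write, and two threads can interleave between those steps and lose an update.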

Example: Downloading images (I/O-bound)

import threading
import requests
def download_image(url, filename):
    print(f"Downloading {url}...")
    # A timeout keeps a stalled server from blocking this thread forever
    response = requests.get(url, stream=True, timeout=10)
    if response.status_code == 200:
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(1024):
                f.write(chunk)
    print(f"Finished downloading {url}")
urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
    "https://example.com/image3.jpg",
]
# Create a list to hold thread objects
threads = []
for i, url in enumerate(urls):
    filename = f"image_{i+1}.jpg"
    # Create a thread for each download task
    thread = threading.Thread(target=download_image, args=(url, filename))
    threads.append(thread)
    thread.start()
# Wait for all threads to complete
for thread in threads:
    thread.join()
print("All downloads finished!")

Multiprocessing (multiprocessing module)

Multiprocessing bypasses the GIL by creating separate processes, each with its own Python interpreter and memory space.

  • Best for: CPU-bound tasks. Since each process has its own GIL, they can run on different CPU cores in true parallel.
  • How it works: It spawns new processes, which are heavier than threads but offer full parallelism. Inter-process communication (IPC) is required to share data, which is more complex than sharing memory.
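  • Key Concept: if __name__ == "__main__": protects the entry point of the program. On platforms that start child processes with the spawn method (such as Windows and macOS), each child re-imports the main module, so unguarded top-level code that creates processes would run again in every child.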
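
Because processes do not share memory, results have to be passed back explicitly. The sketch below shows one way to do that with multiprocessing.Queue, under the entry-point guard described above. The function name sum_of_squares and the sample numbers are made up for illustration.

```python
import multiprocessing


def sum_of_squares(numbers, out_queue):
    # Runs in the child process; the result travels back through the queue
    out_queue.put(sum(n * n for n in numbers))


if __name__ == "__main__":
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(
        target=sum_of_squares, args=([1, 2, 3], result_queue)
    )
    proc.start()
    # Read the result before joining so the child can flush its queue data
    result = result_queue.get()
    proc.join()
    print(f"Sum of squares: {result}")
```

Anything put on the queue is pickled and copied between processes, so large payloads add overhead that threads sharing memory would not pay.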

Example: Calculating squares of numbers (CPU-bound)

import multiprocessing
def calculate_square(number):
    result = number * number
    print(f"The square of {number} is {result}")
    return result
if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5, 6, 7, 8]
    # Create a pool of worker processes
    with multiprocessing.Pool(processes=4) as pool:
        # The pool.map function distributes the work across the processes
        results = pool.map(calculate_square, numbers)
    print(f"Final results: {results}")

Asyncio (asyncio module)

Asyncio is a different paradigm. It's single-threaded and uses an event loop to manage a set of "coroutines."

  • Best for: High-concurrency I/O-bound tasks, especially network programming. It's more efficient than threads for handling thousands of simultaneous connections (e.g., web servers, chat applications).
  • How it works: Instead of using threads, asyncio uses async and await keywords. When a coroutine hits an await on an I/O operation, it yields control back to the event loop, which can then run another coroutine. This is called cooperative multitasking.
  • Key Concept: Coroutines are functions that can be paused and resumed. They are extremely lightweight compared to threads or processes.
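
Cooperative multitasking is easier to see with a tiny example that needs no network access. In this sketch, worker and run_demo are hypothetical names, and `await asyncio.sleep(0)` stands in for an I/O wait: it simply hands control back to the event loop.

```python
import asyncio


async def worker(name, steps, log):
    for i in range(steps):
        log.append(f"{name}-{i}")
        # Yield control to the event loop so other coroutines can run
        await asyncio.sleep(0)


async def run_demo():
    log = []
    # gather schedules both coroutines on the same single-threaded loop
    await asyncio.gather(worker("A", 2, log), worker("B", 2, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(run_demo()))
```

The log shows the two workers taking turns at each await. A coroutine that never awaits would keep the loop to itself, which is why long CPU-bound work does not belong inside a coroutine.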

Example: Fetching web pages (I/O-bound with asyncio)

import asyncio
import aiohttp # A popular async HTTP client library
async def fetch_url(session, url):
    print(f"Fetching {url}...")
    try:
        # aiohttp expects a ClientTimeout object rather than a bare number
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as response:
            # await pauses here until the response is received
            data = await response.text()
            print(f"Finished fetching {url}, length: {len(data)}")
            return len(data)
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return 0
async def main():
    urls = [
        "https://www.python.org",
        "https://github.com",
        "https://www.wikipedia.org",
    ]
    # aiohttp.ClientSession is the equivalent of requests.Session for async
    async with aiohttp.ClientSession() as session:
        # asyncio.gather runs multiple coroutines concurrently
        tasks = [fetch_url(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
    print(f"Total lengths: {results}")
# Run the main coroutine
asyncio.run(main())

Concurrent Futures (concurrent.futures module)

This is a high-level interface that provides a way to execute callable objects asynchronously using either threads or processes. It's often considered the "best of both worlds" because it abstracts away the low-level details of threading and multiprocessing.

  • Best for: A general-purpose, easy-to-use interface for both I/O-bound and CPU-bound tasks.
  • How it works: You create a ThreadPoolExecutor or ProcessPoolExecutor and submit tasks to it. Each submit() call returns a Future object, which represents the eventual result of an asynchronous operation.
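
Before the larger example, here is a minimal sketch of the submit-then-result cycle. The names square and run_with_futures are hypothetical, and a thread pool is used only to keep the sketch short.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    return n * n


def run_with_futures(numbers, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # submit() returns immediately with a Future for each task
        futures = [executor.submit(square, n) for n in numbers]
        # result() blocks until that particular task has finished
        return [future.result() for future in futures]


if __name__ == "__main__":
    print(run_with_futures([1, 2, 3]))
```

Collecting results in submission order keeps the output aligned with the input. Use as_completed instead when you want to handle each result as soon as it is ready.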

Example: A unified approach for I/O and CPU tasks

import concurrent.futures
import requests
# I/O-bound task
def download_image(url):
    print(f"Downloading {url}...")
    response = requests.get(url, timeout=10)
    return len(response.content)
# CPU-bound task
def calculate_factorial(n):
    if n == 0:
        return 1
    return n * calculate_factorial(n - 1)
if __name__ == "__main__":
    # --- Using ThreadPoolExecutor for I/O-bound tasks ---
    urls = ["https://www.python.org", "https://github.com"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        future_to_url = {executor.submit(download_image, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            url = future_to_url[future]
            try:
                length = future.result()
                print(f"Downloaded {url}, length: {length}")
            except Exception as e:
                print(f"Error downloading {url}: {e}")
    print("-" * 20)
    # --- Using ProcessPoolExecutor for CPU-bound tasks ---
    numbers = [35, 36, 37]
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        future_to_num = {executor.submit(calculate_factorial, num): num for num in numbers}
        for future in concurrent.futures.as_completed(future_to_num):
            num = future_to_num[future]
            try:
                fact = future.result()
                print(f"Factorial of {num} is {fact}")
            except Exception as e:
                print(f"Error calculating factorial of {num}: {e}")

Summary and When to Use What

| Model | Best For | How it Works | Pros | Cons |
|---|---|---|---|---|
| threading | I/O-bound tasks | Multiple threads in a single process. | Lightweight. Simple to use. Good for I/O. | GIL limits CPU-bound parallelism. Shared state requires locks. |
| multiprocessing | CPU-bound tasks | Multiple processes, each with its own GIL and memory. | True parallelism on multi-core CPUs. Bypasses GIL. | High overhead (memory, startup time). Inter-process communication is complex. |
| asyncio | High-concurrency I/O (e.g., web servers) | Single-threaded event loop with coroutines (async/await). | Extremely efficient for thousands of connections. Low overhead. | Not for CPU-bound tasks. Steeper learning curve. Requires async-compatible libraries. |
| concurrent.futures | General purpose | High-level wrapper for threading and multiprocessing. | Easy to use, abstracts complexity. Clean API. | Adds a layer of abstraction. Still subject to the underlying model's limitations. |

Final Recommendation

  • For simple I/O tasks (e.g., downloading a few files): threading or concurrent.futures.ThreadPoolExecutor is a great start.
  • For CPU-heavy tasks (e.g., data analysis, video processing): Use multiprocessing or concurrent.futures.ProcessPoolExecutor.
  • For building high-performance network services (e.g., web APIs, chat servers): asyncio is usually the most efficient choice, provided the libraries you depend on support async.
  • For general-purpose concurrency where you might switch between I/O and CPU tasks: concurrent.futures is often the most convenient and readable option.