
How to Deploy and Run Inference Efficiently with Python's onnxruntime?

onnxruntime is a key library in the Python ecosystem for machine learning, especially for deploying models. This guide covers what it is, why it's used, how to install it, and how to use it, with code examples.


What is ONNX Runtime?

ONNX Runtime is a high-performance inference engine for Open Neural Network Exchange (ONNX) models.

Let's break that down:

  • ONNX (Open Neural Network Exchange): This is an open standard format for representing machine learning models. Think of it as a universal "container" for a model. If you train a model in one framework (like PyTorch or TensorFlow), you can convert it to the ONNX format. Then, another framework (or a production environment) can use this ONNX file without needing the original training framework.
  • Runtime: This is the "engine" that takes an ONNX model and runs it (performs inference) on your hardware. It's optimized for speed and efficiency.

Key takeaway: onnxruntime allows you to run a model that has been converted to the ONNX format efficiently, regardless of the original framework it was created in.
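
To make the conversion step above concrete, here is a minimal sketch of exporting a PyTorch model to ONNX with torch.onnx.export (the model choice, file name, and input shape are illustrative assumptions; the weights argument requires torchvision 0.13 or newer):

import torch
import torchvision.models as models
# Load a pretrained model; any torch.nn.Module can be exported the same way
model = models.mobilenet_v2(weights="IMAGENET1K_V1")
model.eval()
# A dummy input with the shape the model expects (batch, channels, height, width)
dummy_input = torch.randn(1, 3, 224, 224)
# Export to ONNX; "mobilenetv2.onnx" is just an example file name
torch.onnx.export(
    model,
    dummy_input,
    "mobilenetv2.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch_size"}, "output": {0: "batch_size"}},
    opset_version=13,
)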


Why Use ONNX Runtime?

There are several compelling reasons to use it:

  1. Framework Agnosticism: This is the biggest advantage. You can train in PyTorch, convert to ONNX, and deploy your model using onnxruntime in a production environment that doesn't have PyTorch installed. This simplifies deployment pipelines.
  2. Performance: ONNX Runtime is highly optimized. It uses techniques like graph optimizations and hardware-specific kernels (e.g., for CPUs, GPUs, and NPUs) to achieve very fast inference, often faster than running the model in its original framework.
  3. Hardware Acceleration: It supports a wide range of hardware through pluggable execution providers:
    • CPU: Uses its own highly optimized default CPU execution provider, with optional acceleration via Intel's OpenVINO on Intel hardware.
    • GPU: Uses CUDA or TensorRT on NVIDIA GPUs, or DirectML on Windows.
    • Edge Devices: Supports specialized hardware like ARM CPUs, Qualcomm Hexagon DSPs, and Neural Processing Units (NPUs).
  4. Cross-Platform: It works on Windows, Linux, and macOS, making it easy to deploy across different environments.

Installation

Installation is straightforward using pip. The basic package covers CPU execution.

# Install the CPU version
pip install onnxruntime
# If you need GPU support (NVIDIA)
pip install onnxruntime-gpu
# For Windows machines with a DirectX 12 GPU, use the DirectML package
pip install onnxruntime-directml

Note: For onnxruntime-gpu, you must have compatible versions of CUDA and cuDNN installed on your system. The onnxruntime-directml package is Windows-only; on Apple Silicon (M1/M2/M3) Macs, the standard onnxruntime package works out of the box and can use the CoreML execution provider for acceleration.
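
After installing, you can confirm which execution providers your build actually exposes by querying the runtime directly; a quick sanity check using only the packages above:

import onnxruntime
# Print the installed version
print("onnxruntime version:", onnxruntime.__version__)
# List the execution providers compiled into this build,
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] for onnxruntime-gpu
print("Available providers:", onnxruntime.get_available_providers())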


How to Use ONNX Runtime: A Step-by-Step Guide

Let's walk through the process with a complete, runnable example.

Step 1: Get an ONNX Model

You can either:

  • Use a pre-trained model from the ONNX Model Zoo.
  • Convert a model from another framework (like PyTorch) to ONNX.

For this example, we'll use a simple pre-trained model from the ONNX Model Zoo. We'll download a MobileNetV2 model for image classification.

import urllib.request
import os
# URL for a pre-trained MobileNetV2 model (ONNX format)
model_url = "https://github.com/onnx/models/raw/main/vision/classification/mobilenet/model/mobilenetv2-7.onnx"
model_path = "mobilenetv2-7.onnx"
# Download the model if it doesn't exist
if not os.path.exists(model_path):
    print(f"Downloading model from {model_url}...")
    urllib.request.urlretrieve(model_url, model_path)
    print("Download complete.")
else:
    print(f"Model already exists at {model_path}")

Step 2: Prepare Input Data

For an image classification model, the input is typically an image. We need to:

  1. Load the image.
  2. Resize it to the model's expected input size (e.g., 224x224 for MobileNetV2).
  3. Convert it to a numerical array (a NumPy tensor).
  4. Normalize the pixel values.
  5. Add a "batch" dimension, as models expect a batch of inputs.

import numpy as np
from PIL import Image
import onnxruntime
# Model's expected input shape (batch_size, channels, height, width)
input_shape = (1, 3, 224, 224)
# Load a sample image (you can replace this with any image)
# For this example, let's create a dummy image
image = Image.new('RGB', (224, 224), color = 'red')
# To use a real image, uncomment the line below:
# image = Image.open("path/to/your/image.jpg")
# Preprocess the image
image = image.resize((224, 224))
image_array = np.array(image).astype(np.float32)
# Normalize the image (e.g., ImageNet mean and std)
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
image_array = (image_array / 255.0 - mean) / std
# Reorder dimensions from (H, W, C) to (C, H, W) and add batch dimension
image_array = np.transpose(image_array, (2, 0, 1))
image_array = np.expand_dims(image_array, axis=0)
print("Input array shape:", image_array.shape) # Should be (1, 3, 224, 224)

Step 3: Create an ONNX Runtime Session and Run Inference

This is the core of the process. We create a Session object and then use it to run the model.

# Create an ONNX Runtime inference session
# For GPU, you would specify providers=['CUDAExecutionProvider']
try:
    # Prefer the GPU provider when this build reports GPU support
    providers = ['CUDAExecutionProvider'] if onnxruntime.get_device() == 'GPU' else ['CPUExecutionProvider']
    session = onnxruntime.InferenceSession(model_path, providers=providers)
except Exception as e:
    print(f"Could not create session with GPU: {e}. Falling back to CPU.")
    session = onnxruntime.InferenceSession(model_path, providers=['CPUExecutionProvider'])
# Get the input name of the model
input_name = session.get_inputs()[0].name
print(f"Model input name: {input_name}")
# Run the model
# The first argument is a list of output names to fetch (None returns all outputs)
# The second argument is a dictionary mapping input names to NumPy arrays
results = session.run(None, {input_name: image_array})
# The output is a list of numpy arrays (one for each output specified)
output = results[0]
print("Output shape:", output.shape) # Should be (1, 1000) for 1000 ImageNet classes

Step 4: Process the Output

The raw output is a set of logits (scores) for each class. To get a human-readable prediction, we typically:

  1. Apply the softmax function to get probabilities.
  2. Find the index of the highest probability.
  3. Map that index to the corresponding class label.

# Apply softmax to convert logits to probabilities
exp_scores = np.exp(output - np.max(output))  # subtract the max logit for numerical stability
probabilities = exp_scores / np.sum(exp_scores)
# Get the index of the highest probability
predicted_class_index = np.argmax(probabilities)
predicted_probability = np.max(probabilities)
print(f"\nPredicted Class Index: {predicted_class_index}")
print(f"Predicted Probability: {predicted_probability:.4f}")
# To get the actual class name, you need a mapping file (e.g., from ImageNet)
# For this example, we'll just print the index.
# A real application would load a file like 'imagenet_classes.txt' to map the index to a name.
# For example:
# with open("imagenet_classes.txt", "r") as f:
#     classes = [line.strip() for line in f.readlines()]
# print(f"Predicted Class: {classes[predicted_class_index]}")

Key Concepts and Advanced Usage

  • Execution Providers (EPs): This is how you tell ONNX Runtime which hardware to use.

    # Prioritize GPU, fall back to CPU if not available
    session = onnxruntime.InferenceSession(
        model_path,
        providers=['CUDAExecutionProvider', 'CPUExecutionProvider']
    )

    Common EPs include CUDAExecutionProvider, TensorrtExecutionProvider, CPUExecutionProvider, and DirectMLExecutionProvider.

  • Model Metadata: You can inspect the model's input and output details.

    # Get input details
    input_info = session.get_inputs()[0]
    print(f"Input Name: {input_info.name}")
    print(f"Input Type: {input_info.type}")
    print(f"Input Shape: {input_info.shape}") # Shape can be dynamic, e.g., ['batch_size', 3, 224, 224]
    # Get output details
    output_info = session.get_outputs()[0]
    print(f"Output Name: {output_info.name}")
    print(f"Output Type: {output_info.type}")
    print(f"Output Shape: {output_info.shape}")
  • Advanced Optimizations: For production, you can tune SessionOptions (for example, the graph optimization level), quantize models with onnxruntime.quantization, and, for transformer architectures, apply the more aggressive fusions in onnxruntime.transformers to improve performance further; see the sketch below.
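
    As a rough illustration of session-level tuning (the file names below are placeholders), you can raise the graph optimization level and save the optimized graph so the work is not repeated at every startup:

    import onnxruntime
    sess_options = onnxruntime.SessionOptions()
    # Enable all graph optimizations (basic, extended, and layout)
    sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    # Optionally write the optimized model to disk for reuse
    sess_options.optimized_model_filepath = "mobilenetv2-7.optimized.onnx"
    session = onnxruntime.InferenceSession(
        "mobilenetv2-7.onnx",
        sess_options,
        providers=["CPUExecutionProvider"],
    )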
