杰瑞科技汇

How Does Python Protobuf Serialize Data Efficiently?

This guide covers Python Protocol Buffers (often shortened to "protobuf"): what they are, why you'd use them, and how to use them in Python with a complete, step-by-step example.


What are Protocol Buffers?

Imagine you need to send data from one program to another, or save it to a file. You need a structured format. Common options are JSON or XML.

Protocol Buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. Think of them as a more efficient, faster, and smaller alternative to JSON or XML.

Here's the core workflow:

  1. Define your data structure in a special .proto file (a schema).
  2. Use the Protocol Buffer compiler (protoc) to generate data access classes in your chosen language (like Python, Java, C++, Go, etc.) from your .proto file.
  3. Use the generated classes to easily write your structured data to a file or send it over the network. You can also read the data from a file or stream and parse it back into easy-to-use objects.

Why Use Protocol Buffers?

Feature | Protocol Buffers | JSON | XML
Size | Binary format. Very small and compact. | Text-based. Larger due to human-readable keys. | Text-based. Very large due to tags.
Speed | Extremely fast to serialize (write) and deserialize (read). | Slower due to text parsing. | Slowest due to complex parsing (DOM/SAX).
Schema | Required (.proto file). Enforces structure and allows schema evolution. | Optional (JSON Schema). Often not enforced. | Optional (XSD). Can be complex.
Human-Readable | No, it's binary. | Yes, very easy to read and write. | Yes, human-readable but verbose.
Data Types | Rich set (int32, int64, float, double, bool, string, enums, nested messages, etc.). | Basic types (string, number, boolean, array, object). | Rich set, but very verbose.

Key Takeaway: Use Protocol Buffers when performance, size, and strict data structure are critical (e.g., microservices communication, data storage). Use JSON when human-readability and simplicity are more important (e.g., web APIs, configuration files).


Step-by-Step Python Example

Let's create a simple example where we define a Person message, serialize a few Person objects to a file, and then read them back.

Prerequisites

First, you need to install the Protocol Buffer compiler and the Python library.

Install the Protocol Buffer Compiler (protoc)

The easiest way is often using a package manager.

  • On macOS (using Homebrew):
    brew install protobuf
  • On Debian/Ubuntu:
    sudo apt-get update
    sudo apt-get install protobuf-compiler
  • On Windows: Download the prebuilt protoc zip archive for Windows from the Protocol Buffers GitHub releases page, extract it, and add its bin directory to your PATH.

Install the Python Protobuf Library

This library contains the Python runtime needed to use the generated classes.

pip install protobuf
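
Before moving on, it can help to confirm that both pieces are actually visible from your environment. The snippet below is a minimal sketch using only the standard library; the helper names has_module and check_prerequisites are made up for this illustration and are not part of protobuf. It only checks that protoc is on your PATH and that the google.protobuf package can be found, not that their versions are compatible.

```python
import importlib.util
import shutil


def has_module(name: str) -> bool:
    """Return True if `name` can be located for import in this environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (here, `google`) is missing entirely.
        return False


def check_prerequisites() -> dict:
    """Report whether the compiler and the Python runtime were found."""
    return {
        "protoc on PATH": shutil.which("protoc") is not None,
        "protobuf runtime importable": has_module("google.protobuf"),
    }


for item, ok in check_prerequisites().items():
    print(f"{item}: {'yes' if ok else 'no'}")
```

If either line prints "no", revisit the corresponding install step above before continuing.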

Step 1: Define the Schema (person.proto)

Create a file named person.proto. This is where you define your data structure.

// person.proto
syntax = "proto3"; // Use proto3 syntax
// The package name helps prevent name collisions.
package tutorial;
// Define the Person message.
// A message is just like a class or a struct.
message Person {
  // Fields have a type, a name, and a unique number (tag).
  // The tag is used to identify fields in the binary format.
  // If you change the tag number, existing serialized data will break.
  string name = 1;
  int32 id = 2;       // Unique ID number for this person.
  string email = 3;
  // A person can have multiple phone numbers.
  // This is a "repeated" field, like a list or array.
  repeated string phones = 4;
}
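
To make the role of the field numbers concrete, here is a hand-rolled sketch of how a Person with name, id, and email set would be laid out in the binary format. It uses only the standard library and does not call the protobuf runtime; the helper functions (encode_varint, encode_key, and so on) are hypothetical names written for this illustration, and the sketch handles only non-negative integers and strings.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_key(field_number: int, wire_type: int) -> bytes:
    """A field's key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_int_field(field_number: int, value: int) -> bytes:
    # Wire type 0 = varint.
    return encode_key(field_number, 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    # Wire type 2 = length-delimited: key, byte length, then the UTF-8 bytes.
    data = text.encode("utf-8")
    return encode_key(field_number, 2) + encode_varint(len(data)) + data


encoded = (
    encode_string_field(1, "Alice")                # string name = 1;
    + encode_int_field(2, 123)                     # int32 id = 2;
    + encode_string_field(3, "alice@example.com")  # string email = 3;
)
as_json = json.dumps({"name": "Alice", "id": 123, "email": "alice@example.com"})

print(encoded.hex())
print(len(encoded), "bytes in binary form vs", len(as_json.encode("utf-8")), "bytes as JSON")
```

Notice that the field names never appear in the encoded bytes, only the numbers. The id field, for example, takes two bytes: 0x10 (field 2, wire type 0) followed by 0x7b (123). That is why changing a field's number makes previously written data unreadable, while renaming a field does not.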

Step 2: Generate the Python Code

Now, use the protoc compiler to generate the Python classes from your .proto file.

  1. Make sure you are in the same directory as person.proto.

  2. Run the following command:

    # The --python_out flag tells protoc to generate Python code.
    # The '.' tells it to output the files in the current directory.
    protoc --python_out=. person.proto

This command creates a new file, person_pb2.py, which defines the Python Person class that you can now use in your code. It is generated output, so never edit it by hand; rerun protoc whenever person.proto changes.

Step 3: Use the Generated Code (Write to a File)

Now, let's write a Python script to create Person objects and serialize them to a binary file. Create a file named create_persons.py.

# create_persons.py
import struct

import person_pb2  # Import the generated module


def create_persons():
    """Creates Person messages and serializes them to a file."""
    # Create a Person object and populate it with data.
    person1 = person_pb2.Person()
    person1.name = "Alice"
    person1.id = 123
    person1.email = "alice@example.com"
    person1.phones.append("555-1234")
    person1.phones.append("555-5678")

    # Create another Person object.
    person2 = person_pb2.Person()
    person2.name = "Bob"
    person2.id = 456
    person2.email = "bob@example.com"
    person2.phones.append("555-8765")

    # Serialize the objects to a binary file.
    # SerializeToString() returns the encoded message as bytes.
    with open("persons.bin", "wb") as f:
        for person in (person1, person2):
            data = person.SerializeToString()
            # Protobuf messages are not self-delimiting, so prefix each one
            # with its length as a 4-byte little-endian unsigned integer.
            f.write(struct.pack("<I", len(data)))
            f.write(data)

    print("Serialized 2 persons to persons.bin")


if __name__ == "__main__":
    create_persons()

Run this script from your terminal:

python create_persons.py

You will now have a persons.bin file in your directory. If you try to open it, it will look like gibberish because it's binary.

Step 4: Use the Generated Code (Read from a File)

Finally, let's create another script to read the binary file and parse the data back into Person objects. Create a file named read_persons.py.

# read_persons.py
import struct

import person_pb2  # Import the generated module


def read_persons():
    """Reads and deserializes Person messages from a file."""
    # Create an empty list to hold the deserialized persons.
    persons = []

    with open("persons.bin", "rb") as f:
        while True:
            # Each message is preceded by a 4-byte length prefix.
            header = f.read(4)
            if not header:
                break  # End of file
            (size,) = struct.unpack("<I", header)

            # ParseFromString() parses the entire byte string it is given,
            # so hand it exactly one message's worth of bytes.
            person = person_pb2.Person()
            person.ParseFromString(f.read(size))
            persons.append(person)

    # Print the deserialized data.
    for p in persons:
        print(f"Name: {p.name}")
        print(f"ID: {p.id}")
        print(f"Email: {p.email}")
        print("Phones:")
        for phone in p.phones:
            print(f"  - {phone}")
        print("-" * 20)


if __name__ == "__main__":
    read_persons()

Note on Reading Multiple Messages: A serialized message carries no marker for where it ends, and ParseFromString consumes the whole byte string you pass it. If the two messages were written back to back with no length prefix and the file were parsed in one call, the result would be a single merged Person: Bob's name, id, and email would overwrite Alice's, and all three phone numbers would end up in one list. The 4-byte prefix used here is a simple convention chosen for this tutorial; other protobuf tooling commonly uses a varint length prefix instead. The read_persons.py script above assumes a well-formed file. Here's a more defensive version that reports truncated or corrupt input instead of failing silently:

# A more defensive way to read multiple messages from a single file
import struct

import person_pb2
from google.protobuf.message import DecodeError


def read_persons_robust(path="persons.bin"):
    """Returns all Person messages in the file, or raises ValueError."""
    persons = []
    with open(path, "rb") as f:
        # Loop until the end of the file.
        while True:
            header = f.read(4)
            if not header:
                break  # Clean end of file
            if len(header) < 4:
                raise ValueError("Truncated length prefix")
            (size,) = struct.unpack("<I", header)

            data = f.read(size)
            if len(data) < size:
                raise ValueError("Truncated message body")

            person = person_pb2.Person()
            try:
                # ParseFromString() raises DecodeError on malformed input.
                person.ParseFromString(data)
            except DecodeError as exc:
                raise ValueError("Corrupt Person message") from exc
            persons.append(person)
    return persons

Let's run our read_persons.py script:

python read_persons.py

Expected Output:

Name: Alice
ID: 123
Email: alice@example.com
Phones:
  - 555-1234
  - 555-5678
--------------------
Name: Bob
ID: 456
Email: bob@example.com
Phones:
  - 555-8765
--------------------


Schema Evolution (A Key Strength)

One of the best features of Protobuf is that you can evolve your schema without breaking old data.

Example:

  1. You have a person.proto with name, id, and email.
  2. You deploy your application, and millions of person.bin files are created.
  3. Later, you realize you need to add an age field.

New person.proto:

message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  repeated string phones = 4;
  // NEW FIELD!
  int32 age = 5; 
}

Now, you can:

  • Read old data (person.bin) with the new person_pb2.py code. The age field will read as its default value (0 for int32), because proto3 does not distinguish an absent scalar field from one set to its default.
  • Write new data with the new code. The new data will include the age field.
  • Old code can still read new data (it will just ignore the age field it doesn't know about).

This makes Protobuf incredibly resilient to changes in your application's data structure over time.
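
The reason old code can read new data comes down to how a parser walks the bytes: every field starts with a key holding its number and wire type, so a reader can step over fields whose numbers it does not recognize. The following is a simplified, standard-library-only sketch of that idea. It is not how the protobuf runtime is implemented; the names read_varint, OLD_SCHEMA, and decode_with_old_schema are hypothetical, and only the two wire types used by this message (varint and length-delimited) are handled.

```python
def read_varint(buf: bytes, pos: int) -> tuple:
    """Return (value, next_pos) for the varint starting at buf[pos]."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Fields the "old" reader knows about: number -> name (age = 5 is absent).
OLD_SCHEMA = {1: "name", 2: "id", 3: "email"}


def decode_with_old_schema(buf: bytes) -> dict:
    """Decode known fields and step over any field number not in OLD_SCHEMA."""
    known = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in OLD_SCHEMA:
            if isinstance(value, bytes):
                value = value.decode("utf-8")
            known[OLD_SCHEMA[field_number]] = value
        # Otherwise the value has already been consumed, so it is skipped.
    return known


# Hand-encoded bytes for name="Alice", id=123, age=30, as a newer writer
# with the age field would lay them out.
new_data = b"\x0a\x05Alice" + b"\x10\x7b" + b"\x28\x1e"
print(decode_with_old_schema(new_data))
```

The old reader recovers name and id and never needs to know what field 5 means. This only works as long as existing field numbers keep their meaning, which is why you should add new fields with new numbers rather than changing or reusing old ones.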

Summary

Step | Action | Command / File | Purpose
1 | Define Schema | person.proto | Describe your data structure.
2 | Generate Code | protoc --python_out=. person.proto | Create Python classes (person_pb2.py) from the schema.
3 | Write Data | create_persons.py | Use generated classes to create objects and serialize them to a binary file.
4 | Read Data | read_persons.py | Use generated classes to read binary data and parse it back into objects.