杰瑞科技汇

How Do You Convert Between Strings and Bytes in Python?

Converting between strings and bytes is a fundamental and often confusing topic in Python. Let's break it down clearly.

The Core Difference: Text vs. Bytes

Think of it like this:

  • A String (str) is an abstract representation of text. It's a sequence of human-readable characters.

    • Example: "Hello, 世界!"
    • Conceptually, a str is a sequence of Unicode code points, not bytes. The character 'H' is the code point U+0048, and the character '世' is U+4E16.
  • Bytes (bytes) is a concrete sequence of raw, 8-bit values (integers from 0 to 255). It's a low-level representation of data.

    • Example: b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
    • This is what you actually send over a network or write to a file. It is the result of encoding a string, here with UTF-8.

The key takeaway: A str is text, while bytes is the binary data that represents that text in a specific encoding.
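To make the distinction concrete, here is a minimal sketch that inspects the same text from both sides. The variable names are illustrative only; the string is the one used in the examples above.

```python
text = "Hello, 世界!"
data = text.encode("utf-8")

# A str is indexed by character; each character maps to one Unicode code point.
code_points = [f"U+{ord(ch):04X}" for ch in text[7:9]]

# bytes is indexed by byte; each item is an integer from 0 to 255.
first_char_bytes = list(data[7:10])

print(len(text), len(data))  # 10 characters, 14 bytes
print(code_points)           # ['U+4E16', 'U+754C']
print(first_char_bytes)      # [228, 184, 150], i.e. 0xE4 0xB8 0x96
```

The lengths differ because each of the two Chinese characters takes three bytes in UTF-8, while each ASCII character takes one.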


Python 3: The Clear Separation (Recommended)

In Python 3, str and bytes are two distinct, incompatible types. This is a major improvement over Python 2 and forces you to be explicit about encoding and decoding.
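A short sketch of what "incompatible" means in practice: Python 3 refuses to concatenate the two types, and an equality check between them is simply False. The variable names here are illustrative.

```python
text = "Hello"
data = b"Hello"

mix_error = None
try:
    text + data  # mixing str and bytes is not allowed
except TypeError as exc:
    mix_error = str(exc)

# Comparing str with bytes does not raise; the two are just never equal.
equal = (text == data)

# An explicit decode makes the types match, so concatenation works.
joined = text + data.decode("ascii")

print(mix_error)
print(equal)   # False
print(joined)  # HelloHello
```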

Encoding: Converting str to bytes

Call the .encode() method on a string to turn it into bytes. In Python 3 the encoding argument defaults to 'utf-8', but passing it explicitly (for example 'utf-8', 'ascii', or 'latin-1') makes the intent clear.

# Our text string
my_string = "Hello, 世界!"
# --- UTF-8 Encoding (Most Common) ---
# UTF-8 is a variable-width encoding that can represent every character in Unicode.
# It's the standard for the web and most modern applications.
utf8_bytes = my_string.encode('utf-8')
print(f"Original String: {my_string}")
print(f"Type: {type(my_string)}")
print(f"UTF-8 Bytes: {utf8_bytes}")
print(f"Type: {type(utf8_bytes)}")
print("-" * 20)
# --- ASCII Encoding (Limited) ---
# ASCII can only represent characters from 0-127. It will fail on characters outside this range.
try:
    ascii_bytes = my_string.encode('ascii')
except UnicodeEncodeError as e:
    print(f"Encoding to ASCII failed: {e}")
    print("Because '世' and '界' are not ASCII characters.")
print("-" * 20)
# --- Latin-1 Encoding (Still Limited) ---
# Latin-1 (ISO-8859-1) covers only code points 0-255, so it cannot represent '世' or '界'.
# Like ASCII, it raises UnicodeEncodeError rather than silently corrupting the text.
try:
    latin1_bytes = my_string.encode('latin-1')
except UnicodeEncodeError as e:
    print(f"Encoding to Latin-1 failed: {e}")
    print("Because '世' and '界' are outside the Latin-1 range.")

Output:

Original String: Hello, 世界!
Type: <class 'str'>
UTF-8 Bytes: b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
Type: <class 'bytes'>
--------------------
Encoding to ASCII failed: 'ascii' codec can't encode characters in position 7-8: ordinal not in range(128)
Because '世' and '界' are not ASCII characters.
--------------------
Encoding to Latin-1 failed: 'latin-1' codec can't encode characters in position 7-8: ordinal not in range(256)
Because '世' and '界' are outside the Latin-1 range.
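When an encoding cannot represent some characters, .encode() also accepts an errors argument that controls what happens instead of raising. The sketch below shows three of the standard handlers; which one is appropriate (if any) depends on whether losing information is acceptable in your situation.

```python
my_string = "Hello, 世界!"

# Replace each unencodable character with '?'
replaced = my_string.encode("ascii", errors="replace")

# Drop unencodable characters entirely
ignored = my_string.encode("ascii", errors="ignore")

# Keep the information as a Python-style escape sequence
escaped = my_string.encode("ascii", errors="backslashreplace")

print(replaced)  # b'Hello, ??!'
print(ignored)   # b'Hello, !'
print(escaped)   # b'Hello, \\u4e16\\u754c!'
```

The first two handlers are lossy: the original characters cannot be recovered from the resulting bytes.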

Decoding: Converting bytes to str

Call the .decode() method on a bytes object to turn it back into a string. The encoding argument defaults to 'utf-8', but you should pass the encoding that was actually used to create the bytes; bytes carry no record of it themselves.

# Let's use the UTF-8 bytes from the previous example
utf8_bytes = b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
# Decode the bytes back into a string
decoded_string = utf8_bytes.decode('utf-8')
print(f"Original Bytes: {utf8_bytes}")
print(f"Decoded String: {decoded_string}")
print(f"Type: {type(decoded_string)}")
print("-" * 20)
# --- What if you use the wrong encoding? ---
# If you try to decode bytes with the wrong encoding, you get garbage or an error.
# Let's try to decode the UTF-8 bytes using ASCII.
try:
    wrong_decoded = utf8_bytes.decode('ascii')
except UnicodeDecodeError as e:
    print(f"Decoding with ASCII failed: {e}")
    print("Because the byte \\xe4 is not a valid ASCII character.")

Output:

Original Bytes: b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'
Decoded String: Hello, 世界!
Type: <class 'str'>
--------------------
Decoding with ASCII failed: 'ascii' codec can't decode byte 0xe4 in position 7: ordinal not in range(128)
Because the byte \xe4 is not a valid ASCII character.
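The ASCII case above shows the "error" outcome. The "garbage" outcome is sketched below, assuming the same UTF-8 bytes are decoded as Latin-1: every byte value is valid in Latin-1, so nothing fails, but the text comes out wrong. The example also shows the errors argument of .decode().

```python
utf8_bytes = "Hello, 世界!".encode("utf-8")

# Latin-1 maps every byte value to some character, so decoding never fails,
# but the result is mojibake rather than the original text.
garbled = utf8_bytes.decode("latin-1")

# As long as no bytes were lost, the mistake is reversible.
recovered = garbled.encode("latin-1").decode("utf-8")

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
lossy = utf8_bytes.decode("ascii", errors="replace")

print(repr(garbled))  # 'Hello, ä¸\x96ç\x95\x8c!'
print(recovered)      # Hello, 世界!
print(repr(lossy))    # 'Hello, ������!'
```

Silent mojibake is usually harder to notice than an exception, which is one reason to decode with the correct encoding rather than one that merely "works".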

Python 2: The "Messy" Way (For Legacy Code)

In Python 2, there was no clear separation. str was a sequence of bytes, and unicode was for text. This led to many bugs.

  • str: A sequence of bytes. Its encoding was ambiguous.
  • unicode: A sequence of Unicode characters (like Python 3's str).

Key Python 2 Concepts:

  1. Creating a Unicode string: Use the u prefix.

    my_unicode_string = u"Hello, 世界!"
  2. Encoding a Unicode string to a byte string (str): Use .encode().

    my_byte_string = my_unicode_string.encode('utf-8')
    # my_byte_string is now a 'str' type, but it's properly encoded UTF-8 bytes.
  3. Decoding a byte string (str) to a Unicode string: Use .decode().

    back_to_unicode = my_byte_string.decode('utf-8')

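The danger in Python 2 was that if you mixed str and unicode without encoding or decoding explicitly, the interpreter converted between them implicitly using the default encoding (normally ascii). Code that worked with ASCII-only data then failed with a UnicodeDecodeError or UnicodeEncodeError as soon as non-ASCII input appeared, or silently corrupted data.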


Practical Examples

Example 1: Reading from a File

When you read from a file in binary mode ('rb'), you get bytes. You must decode it to get a str.

# Let's create a dummy file
with open("my_data.txt", "w", encoding='utf-8') as f:
    f.write("This is some text with a newline.\n")
# Now, read it back in binary mode
with open("my_data.txt", "rb") as f:
    # read() returns bytes
    file_content_bytes = f.read()
print(f"Content from file (bytes): {file_content_bytes}")
print(f"Type: {type(file_content_bytes)}")
# You MUST decode it to work with it as a string
file_content_str = file_content_bytes.decode('utf-8')
print(f"Content from file (str): {repr(file_content_str)}")
print(f"Type: {type(file_content_str)}")
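For comparison, here is a sketch of the same round trip in text mode, where open() performs the decoding for you when given an encoding argument. It writes to a temporary directory so it leaves no files behind; the file name is arbitrary.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "my_data.txt"
    path.write_text("Hello, 世界!", encoding="utf-8")

    # Text mode: read() returns str, already decoded with the given encoding
    with open(path, "r", encoding="utf-8") as f:
        text_mode = f.read()

    # Binary mode: read() returns the raw bytes stored on disk
    with open(path, "rb") as f:
        binary_mode = f.read()

print(type(text_mode), type(binary_mode))
print(text_mode == binary_mode.decode("utf-8"))  # True
```

Passing encoding explicitly in text mode avoids depending on the platform's default encoding, which is not UTF-8 everywhere.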

Example 2: Networking (HTTP Request)

When you send data over a network, it must be in bytes.

import urllib.request
# The URL we want to request
url = "https://www.example.com"
# urllib.request.urlopen() returns a response object (http.client.HTTPResponse for HTTP URLs)
with urllib.request.urlopen(url) as response:
    # response.read() returns bytes
    html_bytes = response.read()
    # The charset is often declared in the Content-Type header; fall back to UTF-8 if it is missing
    charset = response.headers.get_content_charset() or 'utf-8'
# To inspect or process the HTML as a string, decode it
html_string = html_bytes.decode(charset)
print(f"First 100 characters of HTML (str):\n{html_string[:100]}")
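The HTTP example only covers the receiving side. The sending side can be sketched without any network access using socket.socketpair(), which creates two connected sockets in the same process. This is an illustration under simplified assumptions: a real protocol would loop on recv() and define its own message framing.

```python
import socket

sender, receiver = socket.socketpair()
send_error = None
try:
    try:
        sender.sendall("Hello, 世界!")  # a str is rejected by the socket API
    except TypeError as exc:
        send_error = str(exc)

    # Encode first, then send the resulting bytes
    sender.sendall("Hello, 世界!".encode("utf-8"))

    # The receiving side gets bytes and decodes them back to str
    received_bytes = receiver.recv(1024)
    received_text = received_bytes.decode("utf-8")
finally:
    sender.close()
    receiver.close()

print(send_error)
print(received_text)
```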

Summary Table

| Feature | Python 3 str (Text) | Python 3 bytes (Binary) | Python 2 str (Bytes) | Python 2 unicode (Text) |
|---|---|---|---|---|
| Purpose | Abstract text representation | Raw 8-bit data | Raw 8-bit data | Abstract text representation |
| Literal | "hello" | b'hello' | "hello" | u"hello" |
| Methods | .encode() | .decode() | .decode() | .encode() |
| Use Case | Storing text, in-memory processing | Network I/O, File I/O, Cryptography | Same as Python 3 bytes | Same as Python 3 str |
| Key Rule | Always encode to bytes before sending. | Always decode to string after receiving. | Ambiguous encoding led to bugs. | Safer, but required explicit conversion. |

Final Advice: Always use Python 3. Be explicit about your encoding (UTF-8 is the best default). Remember the mantra: encode str to bytes for I/O, and decode bytes to str for processing.
