杰瑞科技汇

What Is the Python Unicode Type, and How Do You Use It?

Let's break down the "Unicode type" in Python.

In Python 3, the concept of a "Unicode type" is central to how strings work. Unlike Python 2, which had two distinct string types (str and unicode), Python 3 has one single string type.

This guide starts with the basics and then moves on to encodings and normalization.


The Core Concept: str is Unicode

In Python 3, every string you create is a sequence of Unicode code points. The built-in str type is the Unicode string type.

# When you write a string literal in your code, Python 3 interprets it as Unicode.
my_string = "Hello, 世界! 😊"
# The type is simply 'str'
print(type(my_string))
# <class 'str'>
# You can see the underlying Unicode code points using the `ord()` function
# (though iterating over the string is more common).
for char in my_string:
    print(f"Character: '{char}', Unicode Code Point: U+{ord(char):04X}")

Output:

Character: 'H', Unicode Code Point: U+0048
Character: 'e', Unicode Code Point: U+0065
Character: 'l', Unicode Code Point: U+006C
Character: 'l', Unicode Code Point: U+006C
Character: 'o', Unicode Code Point: U+006F
Character: ',', Unicode Code Point: U+002C
Character: ' ', Unicode Code Point: U+0020
Character: '世', Unicode Code Point: U+4E16
Character: '界', Unicode Code Point: U+754C
Character: '!', Unicode Code Point: U+0021
Character: ' ', Unicode Code Point: U+0020
Character: '😊', Unicode Code Point: U+1F60A

Note that the emoji is a single code point, so the loop visits it once.

Key Takeaway: In Python 3, you don't have a separate "Unicode type." The str type is Unicode. This is one of the biggest improvements in Python 3 and eliminates a whole class of bugs related to encoding that were common in Python 2.
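Because a str is a sequence of code points, the same character can be written in several ways in source code. The short sketch below shows a literal, an escape sequence, and chr() producing identical strings. The variable names are illustrative only.

```python
import unicodedata

# Three ways to write the same one-character string (U+4E16).
literal = "世"
by_escape = "\u4e16"         # 4-digit escape, for code points up to U+FFFF
by_code_point = chr(0x4E16)  # chr() turns an integer into a 1-character str

print(literal == by_escape == by_code_point)  # True

# ord() and chr() are inverses for any valid code point.
smiley = "\U0001F60A"        # code points above U+FFFF need the 8-digit escape
print(chr(ord(smiley)) == smiley)  # True

# unicodedata.name() looks up the official character name.
print(hex(ord(smiley)), unicodedata.name(smiley))
```

All three spellings compare equal because Python stores the code point itself, not the way it was typed.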


The Role of Encodings: bytes vs. str

While str holds Unicode text, it often needs to be stored or transmitted. This is where encodings come in. An encoding is a set of rules for converting Unicode code points into a sequence of bytes.

The bytes type in Python represents this raw sequence of bytes.

The Golden Rule of Python 3 Text Handling:

"Unicode sandwich": All text should be str (Unicode) in your program's core logic. It should only be converted to bytes (an encoded representation) when writing to a file or network, and converted back from bytes to str when reading from a file or network.

my_unicode_string = "Hello, 世界!"

# --- ENCODING: str -> bytes ---
# Encode the string into a byte sequence using a specific encoding (here UTF-8).
utf8_bytes = my_unicode_string.encode('utf-8')
print(f"Original string (str): {my_unicode_string}")
print(f"Encoded as UTF-8 (bytes): {utf8_bytes}")
# Encoded as UTF-8 (bytes): b'Hello, \xe4\xb8\x96\xe7\x95\x8c!'  (each \xNN is one byte)
print(f"Type of encoded data: {type(utf8_bytes)}")
# Type of encoded data: <class 'bytes'>

# --- DECODING: bytes -> str ---
# Decode the byte sequence back into a string using the same encoding.
decoded_string = utf8_bytes.decode('utf-8')
print(f"Decoded string (str): {decoded_string}")
# Decoded string (str): Hello, 世界!
print(f"Type of decoded data: {type(decoded_string)}")
# Type of decoded data: <class 'str'>
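To make the "set of rules" idea concrete, the sketch below encodes one string with three standard codecs and compares the sizes. The text stays the same, but the byte count depends on the encoding chosen. The codec names are the standard Python ones. The "-le" variants are used here only to leave out the byte-order mark.

```python
text = "Hello, 世界!"

utf8 = text.encode("utf-8")       # 1 byte per ASCII char, 3 per CJK char here
utf16 = text.encode("utf-16-le")  # 2 bytes per code point for this text
utf32 = text.encode("utf-32-le")  # always 4 bytes per code point

print(len(text))   # 10 code points
print(len(utf8))   # 14 bytes
print(len(utf16))  # 20 bytes
print(len(utf32))  # 40 bytes

# Each encoding round-trips back to the same str.
print(utf8.decode("utf-8") == utf16.decode("utf-16-le") == text)  # True
```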

Why is this important? If you use the wrong encoding, you get errors or "mojibake" (corrupted text).

# Example of a problem: trying to decode UTF-8 bytes as ASCII.
# This raises a UnicodeDecodeError because the bytes that encode the Chinese
# characters have values above 127, which are not valid ASCII.
try:
    my_unicode_string.encode('utf-8').decode('ascii')
except UnicodeDecodeError as e:
    print(f"\nError! {e}")
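A wrong encoding does not always raise an exception. The sketch below uses Latin-1 as an example of a codec that accepts every byte value, so decoding succeeds but produces mojibake. It also shows the errors="replace" option of bytes.decode(), which trades exceptions for U+FFFD replacement characters.

```python
data = "世界".encode("utf-8")  # b'\xe4\xb8\x96\xe7\x95\x8c'

# Wrong codec, no exception: Latin-1 maps every byte to some character,
# so the result is six unrelated characters instead of two.
garbled = data.decode("latin-1")
print(len(garbled), repr(garbled))

# Nothing was lost, so reversing the mistake recovers the text.
recovered = garbled.encode("latin-1").decode("utf-8")
print(recovered == "世界")  # True

# errors="replace" substitutes U+FFFD for every undecodable byte.
# This avoids the exception but destroys the original data.
lossy = data.decode("ascii", errors="replace")
print(lossy == "\ufffd" * 6)  # True
```

Silent corruption of this kind is often harder to notice than an exception, which is one reason to state the encoding explicitly at every I/O boundary.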

Practical Operations on Unicode Strings

Python's built-in string methods handle Unicode correctly.

Case Conversion

.upper(), .lower(), .capitalize() work with Unicode characters.

greeting = "café"
print(greeting.upper())  # Output: CAFÉ
# Special case: German 'ß' (eszett)
german_word = "straße"
print(german_word.upper()) # Output: STRASSE (upper() expands 'ß' to 'SS'; it does not produce the capital 'ẞ', U+1E9E)
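Case conversion is often used for case-insensitive comparison. For that task the standard library provides str.casefold(), which is intended for caseless matching. A minimal sketch follows; equal_ignoring_case is a hypothetical helper name, not a built-in.

```python
def equal_ignoring_case(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings without regard to case."""
    return a.casefold() == b.casefold()

print(equal_ignoring_case("Straße", "STRASSE"))  # True
print("Straße".lower() == "STRASSE".lower())     # False: lower() keeps 'ß'
```

This sketch does not normalize its inputs, so strings that differ only in composed versus decomposed accents would still compare as unequal.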

Length

len() counts the number of Unicode characters (code points), not bytes.

emoji_string = "Hi! 😊"
print(len(emoji_string)) # Output: 5 (H, i, !, space, 😊)
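Counting code points is not the same as counting bytes, and it is not always the same as counting what a reader sees as one character. The sketch below compares the two counts that Python can report directly. The sample strings are chosen for illustration.

```python
emoji = "😊"                        # one code point
accented = "e\u0301"                # 'e' plus a combining acute accent
flag = "\U0001F1EF\U0001F1F5"       # two regional-indicator code points

for label, s in [("emoji", emoji), ("accented e", accented), ("flag", flag)]:
    print(f"{label}: code points={len(s)}, UTF-8 bytes={len(s.encode('utf-8'))}")

# emoji: code points=1, UTF-8 bytes=4
# accented e: code points=2, UTF-8 bytes=3
# flag: code points=2, UTF-8 bytes=8
```

The accented letter and the flag each display as a single symbol, yet len() reports 2 for both, because len() counts code points.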

Checking Character Types

You can check if a character is a letter, number, etc., using methods like .isalpha(), .isdigit(). These are Unicode-aware.

char = "世"
print(f"Is '{char}' a letter? {char.isalpha()}") # Output: True
char_num = "٣" # This is the Arabic-Indic digit three
print(f"Is '{char_num}' a digit? {char_num.isdigit()}") # Output: True
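The numeric checks come in three levels of strictness: isdecimal(), isdigit(), and isnumeric(). The sketch below shows how they differ on a few sample characters and which of those characters int() accepts. The samples are chosen for illustration.

```python
import unicodedata

samples = ["3", "٣", "²", "½"]  # ASCII digit, Arabic-Indic digit, superscript, fraction

for ch in samples:
    print(
        repr(ch),
        f"isdecimal={ch.isdecimal()}",
        f"isdigit={ch.isdigit()}",
        f"isnumeric={ch.isnumeric()}",
        f"category={unicodedata.category(ch)}",
    )

# int() accepts decimal digits from any script...
arabic_three = int("٣")
print(arabic_three)  # 3

# ...but not characters that are merely "digit-like", such as a superscript.
try:
    int("²")
    superscript_ok = True
except ValueError:
    superscript_ok = False
print(superscript_ok)  # False
```

If a string is going to be passed to int(), isdecimal() is the check that matches what int() accepts for a single character.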

Advanced Unicode: Normalization

Sometimes, the same text can be represented by different sequences of Unicode code points. This is called canonical equivalence.

A classic example is the letter é:

  1. As a single character: é (U+00E9)
  2. As a base character e (U+0065) followed by a combining acute accent (U+0301)

To a human, they are identical, but a computer sees them as different strings.

s1 = "café"        # Single character 'é'
s2 = "cafe\u0301"  # 'e' followed by combining acute accent
print(s1 == s2) # Output: False

Unicode Normalization is the process of converting text into a standardized form. Python's unicodedata module provides tools for this.

  • NFC (Normalization Form C): Composes characters where possible. This is the most common form. It's generally what you want for storing text.
  • NFD (Normalization Form D): Decomposes characters. This can be useful for text processing where you want to separate base characters from accents.

import unicodedata
s1 = "café"
s2 = "cafe\u0301"
# Normalize both strings to NFC form
nfc_s1 = unicodedata.normalize('NFC', s1)
nfc_s2 = unicodedata.normalize('NFC', s2)
print(f"s1: {s1}, s2: {s2}")
print(f"s1 == s2? {s1 == s2}") # False
print(f"\nNFC of s1: {nfc_s1}, NFC of s2: {nfc_s2}")
print(f"NFC of s1 == NFC of s2? {nfc_s1 == nfc_s2}") # True
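In practice, normalization is usually wrapped in small helpers. The sketch below shows one that compares strings after NFC normalization and one that uses NFD to drop accents. The names same_text and strip_accents are hypothetical, and dropping accents loses information, so it only suits tasks such as loose search matching.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def strip_accents(text: str) -> str:
    """Hypothetical helper: decompose with NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for non-combining characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

composed = "caf\u00e9"     # precomposed 'é'
decomposed = "cafe\u0301"  # 'e' + combining acute accent

print(same_text(composed, decomposed))  # True
print(strip_accents(composed))          # cafe
print(strip_accents(decomposed))        # cafe
```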

Summary Table: Python 2 vs. Python 3

Feature                  | Python 2                        | Python 3
String for text          | unicode                         | str
String for bytes         | str                             | bytes
Default string literal   | str (bytes)                     | str (Unicode)
Example                  | s = "hello" (bytes)             | s = "hello" (Unicode)
Unicode literal          | u"hello"                        | Not needed, "hello" is Unicode
Encoding (text to bytes) | u.encode('utf-8') (unicode to str) | s.encode('utf-8') (str to bytes)
Decoding (bytes to text) | s.decode('utf-8') (str to unicode) | b.decode('utf-8') (bytes to str)
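One practical consequence of the Python 3 column is that str and bytes do not mix implicitly. The short sketch below shows the TypeError raised when they are concatenated, and the explicit conversions that resolve it. The variable names are illustrative only.

```python
text = "hello"   # str: Unicode text
data = b"hello"  # bytes: raw bytes

# Mixing the two types raises TypeError instead of guessing an encoding.
try:
    text + data
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print(f"TypeError: {exc}")

# The fix is an explicit conversion in one direction or the other.
as_text = text + data.decode("ascii")
as_bytes = text.encode("ascii") + data
print(as_text)   # hellohello
print(as_bytes)  # b'hellohello'
```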

Key Takeaways

  1. In Python 3, str is the Unicode type. You don't need a special one.
  2. Distinguish between text (str) and data (bytes). This is the most important concept for handling internationalization.
  3. Follow the "Unicode sandwich": Keep text as str inside your program. Encode to bytes only for I/O.
  4. Always specify an encoding when working with files or network communication. UTF-8 is the recommended default.
  5. Use the unicodedata module for advanced tasks like normalization if you encounter issues with text that looks the same but compares as different.
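Takeaways 3 and 4 can be sketched together with a small file round trip. The example writes to a temporary directory. The file name greeting.txt is hypothetical, and UTF-8 is passed explicitly rather than relying on the platform default.

```python
import tempfile
from pathlib import Path

text = "Hello, 世界! 😊"  # str inside the program

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "greeting.txt"  # hypothetical file name

    # Output boundary: open() encodes str to bytes using the given encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # What is actually on disk is bytes.
    raw = path.read_bytes()

    # Input boundary: open() decodes bytes back to str.
    with open(path, "r", encoding="utf-8") as f:
        loaded = f.read()

print(raw == text.encode("utf-8"))  # True
print(loaded == text)               # True
```

Passing encoding="utf-8" at both boundaries keeps the result independent of the operating system's locale settings.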