杰瑞科技汇

How Does Python's JSON Module Handle Unicode?

Handling Unicode correctly is a fundamental part of working with JSON in Python, because JSON is a text-based format. This guide covers the essentials, from basic usage to common pitfalls and their solutions.

The Short Answer: What You Need to Know

  1. Python 3: By default, json.loads() (string to Python) and json.dumps() (Python to string) handle Unicode correctly. You'll get Python str objects, not bytes.
  2. Python 2: This is where the main Unicode challenges arise. str (bytes) and unicode are separate types, so you must encode and decode explicitly to avoid UnicodeEncodeError or UnicodeDecodeError and unexpected \uXXXX escape sequences in your output.
  3. Non-ASCII Characters: By default, json.dumps() escapes non-ASCII characters (like é or ü) as \uXXXX sequences. To preserve them as actual characters in the output string, use ensure_ascii=False.

Detailed Breakdown

Let's dive into the specifics for Python 3 and Python 2.

Python 3 (The Modern, Easy Way)

Python 3's json module is Unicode-aware by default. It's designed to work with text, not bytes.

json.loads() (Decoding JSON)

This function takes a JSON string and converts it into a Python object.

import json
# A JSON string with Unicode characters
json_string = '{"name": "José", "city": "München", "id": 123}'
# Load the JSON string into a Python dictionary
python_dict = json.loads(json_string)
print(python_dict)
# Output: {'name': 'José', 'city': 'München', 'id': 123}
# Check the type of the values
print(type(python_dict['name'])) # Output: <class 'str'>
print(type(python_dict['city'])) # Output: <class 'str'>

As you can see, the values are standard Python str objects, correctly representing the Unicode characters.
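The same decoding applies when the input uses escapes. As a small sketch (the sample strings are made up for illustration), the following shows that \uXXXX escapes, surrogate pairs, and, on Python 3.6 and later, UTF-8 encoded bytes all decode to ordinary str values:

```python
import json

# The same document written with \uXXXX escapes and with literal characters.
escaped = '{"name": "Jos\\u00e9"}'
literal = '{"name": "José"}'
same_result = json.loads(escaped) == json.loads(literal)

# A character outside the Basic Multilingual Plane arrives as a surrogate
# pair; json.loads() joins the pair into a single code point.
emoji = json.loads('"\\ud83d\\ude00"')

# Python 3.6+ also accepts UTF-8 encoded bytes directly.
from_bytes = json.loads(literal.encode("utf-8"))

print(same_result)         # True
print(len(emoji))          # 1
print(from_bytes["name"])  # José
```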

json.dumps() (Encoding JSON)

This function takes a Python object and converts it into a JSON string.

Default Behavior (ensure_ascii=True)

By default, non-ASCII characters are escaped to their \uXXXX representation. This ensures the resulting string is pure ASCII, which is valid JSON.

import json
python_dict = {'name': 'José', 'city': 'München'}
# Default behavior: escapes non-ASCII characters
json_string_default = json.dumps(python_dict)
print(json_string_default)
# Output: {"name": "Jos\u00e9", "city": "M\u00fcnchen"}
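To make the "pure ASCII" claim concrete, here is a brief sketch (the sample values are illustrative) checking that the default output encodes as ASCII, that a character beyond U+FFFF becomes a surrogate pair, and that loading the result restores the original text:

```python
import json

data = {"city": "München", "mood": "😀"}
encoded = json.dumps(data)

# Encoding to ASCII succeeds only if every character in the string is ASCII.
ascii_bytes = encoded.encode("ascii")

print(encoded)
# {"city": "M\u00fcnchen", "mood": "\ud83d\ude00"}

# The escapes are purely a serialization detail; loading restores the text.
restored = json.loads(encoded)
print(restored == data)  # True
```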

Preserving Characters (ensure_ascii=False)

If you want the JSON string to contain the actual Unicode characters (e.g., for writing to a UTF-8 encoded file), set ensure_ascii=False.

import json
python_dict = {'name': 'José', 'city': 'München'}
# Preserve non-ASCII characters
json_string_unicode = json.dumps(python_dict, ensure_ascii=False)
print(json_string_unicode)
# Output: {"name": "José", "city": "München"}
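One practical consequence is payload size. As a rough sketch (the sample text is illustrative), literal characters usually take fewer bytes in UTF-8 than their six-character escapes. Note also that ensure_ascii=False does not turn escaping off altogether: quotes, backslashes, and control characters are still escaped, because JSON requires it.

```python
import json

data = {"greeting": "你好", "name": "José"}

# Compare the UTF-8 size of the escaped and the literal serialization.
escaped = json.dumps(data).encode("utf-8")
literal = json.dumps(data, ensure_ascii=False).encode("utf-8")
print(len(literal) < len(escaped))  # True

# Control characters are escaped regardless of ensure_ascii.
newline = json.dumps("line1\nline2", ensure_ascii=False)
print(newline)  # "line1\nline2"
```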

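Important Note on ensure_ascii=False and Files: In Python 3, json.dumps() always returns a str, whatever the value of ensure_ascii. With ensure_ascii=False that str can contain non-ASCII characters, so the file you write it to must use an encoding that can represent them. You do not encode the string yourself; open the file in text mode with encoding='utf-8' instead of relying on the platform default.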

import json
python_dict = {'name': 'José', 'city': 'München'}
# Get the Unicode string
json_string_unicode = json.dumps(python_dict, ensure_ascii=False)
print(f"Type of json_string_unicode: {type(json_string_unicode)}")
# Output: Type of json_string_unicode: <class 'str'>
# Write to a file with UTF-8 encoding
with open('data.json', 'w', encoding='utf-8') as f:
    # file.write() expects a str in text mode, which is what we have.
    # encoding='utf-8' tells Python how to encode that str into bytes on disk.
    f.write(json_string_unicode)
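As a quick check of what actually lands on disk, this sketch writes the same kind of dictionary and then reads the raw bytes back. The temporary directory and file name are assumptions made to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

data = {"name": "José", "city": "München"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.json"

    # Text mode plus an explicit encoding: Python encodes the str to UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(data, ensure_ascii=False))

    # Inspect the raw bytes, then decode and parse them again.
    raw = path.read_bytes()
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print("José".encode("utf-8") in raw)  # True
print(loaded == data)                 # True
```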

Python 2 (The Tricky, Legacy Way)

In Python 2, str and unicode are different types, and you have to manage the encoding yourself.

json.loads() (Decoding JSON)

In Python 2, json.loads() can accept either a str (byte string) or a unicode object.

  • If you pass a str, it's assumed to be encoded in UTF-8 (the standard for JSON) and will be decoded into unicode objects.
  • If you pass a unicode object, it's used directly.

# Python 2
import json
# Case 1: Input is a byte string (str)
json_str = '{"name": "Jos\\u00e9", "city": "M\\u00fcnchen"}'
python_dict_from_str = json.loads(json_str)
print(python_dict_from_str)
# Output: {u'name': u'Jos\xe9', u'city': u'M\xfcnchen'}
print(type(python_dict_from_str['name'])) # Output: <type 'unicode'>
# Case 2: Input is a unicode string
json_unicode = u'{"name": "Jos\u00e9", "city": "M\u00fcnchen"}'
python_dict_from_unicode = json.loads(json_unicode)
print(python_dict_from_unicode)
# Output: {u'name': u'Jos\xe9', u'city': u'M\xfcnchen'}
print(type(python_dict_from_unicode['name'])) # Output: <type 'unicode'>

The key takeaway for Python 2 is that json.loads() consistently produces unicode objects for string values.

json.dumps() (Encoding JSON)

This is where it gets tricky. The default behavior often leads to unwanted escape sequences.

Default Behavior (ensure_ascii=True)

This is the default. It produces a str (byte string) where all non-ASCII characters are escaped.

# Python 2
import json
python_unicode_dict = {u'name': u'José', u'city': u'München'}
# Default behavior: produces a str with escaped characters
json_str_default = json.dumps(python_unicode_dict)
print(json_str_default)
# Output: {"name": "Jos\u00e9", "city": "M\u00fcnchen"}
print(type(json_str_default)) # Output: <type 'str'>

Preserving Characters (ensure_ascii=False)

When the input contains unicode objects, as it does here, this produces a unicode string with the actual characters.

# Python 2
import json
python_unicode_dict = {u'name': u'José', u'city': u'München'}
# Produce a unicode string with actual characters
json_unicode_str = json.dumps(python_unicode_dict, ensure_ascii=False)
print(json_unicode_str)
# Output: {"name": "José", "city": "München"}
print(type(json_unicode_str)) # Output: <type 'unicode'>

Writing to a File in Python 2

If you have a unicode string from json.dumps(..., ensure_ascii=False) and want to write it to a file, you must encode it to a byte string first.

# Python 2
import json
python_unicode_dict = {u'name': u'José', u'city': u'München'}
json_unicode_str = json.dumps(python_unicode_dict, ensure_ascii=False)
# Encode the unicode string to a byte string before writing
with open('data_py2.json', 'w') as f:
    f.write(json_unicode_str.encode('utf-8'))
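An alternative that avoids the manual encode() call is io.open(), which accepts an encoding argument on Python 2.6+ and is the same function as the built-in open() on Python 3. The sketch below is written to run on both versions; the temporary directory is an assumption for illustration.

```python
import io
import json
import os
import shutil
import tempfile

# \xe9 and \xfc are the escapes for é and ü, so no source encoding
# declaration is needed on Python 2.
data = {u'name': u'Jos\xe9', u'city': u'M\xfcnchen'}
json_text = json.dumps(data, ensure_ascii=False)

tmp_dir = tempfile.mkdtemp()
try:
    path = os.path.join(tmp_dir, 'data_py2.json')

    # io.open() encodes the unicode text to UTF-8 as it is written.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(json_text)

    # Reading with the same encoding yields unicode text again.
    with io.open(path, 'r', encoding='utf-8') as f:
        loaded = json.load(f)
finally:
    shutil.rmtree(tmp_dir)

print(loaded == data)  # True
```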

Common Pitfalls and Solutions

| Pitfall | Cause | Solution |
| --- | --- | --- |
| Getting \uXXXX escapes instead of characters | Using json.dumps() with the default ensure_ascii=True. | Set ensure_ascii=False in json.dumps(). |
| UnicodeEncodeError when writing to a file | Python 3: writing a str with non-ASCII characters to a text-mode file whose encoding (often the platform default) cannot represent them. Python 2: writing a unicode string directly to a file, which triggers an implicit ASCII encode. (Writing a Python 3 str to a file opened in binary mode, 'wb', raises TypeError instead.) | Python 3: Open the file in text mode with an explicit encoding: open(..., 'w', encoding='utf-8'). Python 2: Encode the unicode string to bytes before writing: my_unicode_str.encode('utf-8'). |
| TypeError or ValueError with json.loads() | Passing a Python dict to json.loads() instead of a string (TypeError), or passing a malformed JSON string (ValueError). | Make sure the input to json.loads() is a valid JSON-formatted string. |
| Garbled or corrupted text | Reading or writing a JSON file with the wrong encoding (e.g., reading UTF-8 as Latin-1). | Always specify the correct encoding when opening files. For JSON, this is almost always utf-8. |
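These failure modes can be reproduced in a few lines of Python 3. The snippet below is a minimal sketch with made-up sample data; it captures each exception in a variable so the whole script runs to completion.

```python
import json

text = json.dumps({"name": "José"}, ensure_ascii=False)

# Pitfall: the target encoding cannot represent the characters.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc

# Pitfall: UTF-8 bytes decoded as Latin-1 give garbled text, not an error.
garbled = text.encode("utf-8").decode("latin-1")
print(garbled)  # {"name": "JosÃ©"}

# Pitfall: json.loads() expects text (or bytes), not a dict.
try:
    json.loads({"name": "José"})
    type_error = None
except TypeError as exc:
    type_error = exc

# Pitfall: malformed JSON raises json.JSONDecodeError, a ValueError subclass.
try:
    json.loads("{'name': 'José'}")  # single quotes are not valid JSON
    decode_error = None
except ValueError as exc:
    decode_error = exc

print(type(encode_error).__name__)  # UnicodeEncodeError
print(type(type_error).__name__)    # TypeError
print(type(decode_error).__name__)  # JSONDecodeError
```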

Best Practices Summary

  1. Use Python 3. It handles this much more intuitively.

  2. Be Explicit with File Encodings: Always use encoding='utf-8' when opening files for JSON I/O. This prevents subtle bugs.

    import json

    # Best practice for reading
    with open('data.json', 'r', encoding='utf-8') as f:
        data = json.load(f) # Use load() for file objects
    # Best practice for writing
    with open('data.json', 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False) # Use dump() for file objects

  3. Use ensure_ascii=False for Readability: If your JSON will be read by humans (e.g., in a config file or API response), ensure_ascii=False makes it much more readable.

  4. Validate Your JSON: If you're dealing with untrusted data, consider using a library like jsonschema to validate the structure and types of your JSON after parsing.
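If adding a dependency is not an option, a hand-written check can cover simple cases. The sketch below is a standard-library stand-in for the idea, not the jsonschema API; the function name parse_user and the expected fields are hypothetical.

```python
import json


def parse_user(raw_json):
    """Parse JSON text and check that it has the expected fields (hypothetical)."""
    data = json.loads(raw_json)  # raises json.JSONDecodeError on bad syntax
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("name"), str):
        raise ValueError("'name' must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    user_id = data.get("id")
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        raise ValueError("'id' must be an integer")
    return data


user = parse_user('{"name": "José", "id": 123}')
print(user["name"])  # José
```

Because json.JSONDecodeError is a subclass of ValueError, a caller can handle both syntax and structure problems with a single except ValueError clause.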
