杰瑞科技汇

How do you use String's getBytes() method for charset conversion?

The String.getBytes() method in Java encodes a String object into a sequence of bytes. This step is required whenever you transfer or store text, because files, network sockets, and databases work with bytes, not characters.

Let's break it down in detail.

What is getBytes() for?

Computers store and process data in bytes (8-bit chunks). A String in Java is an object that represents a sequence of Unicode characters. To write a String to a file, send it over a network, or store it in a database, you must first convert it into a byte array using a specific character encoding.

Encoding is the process of mapping characters from a character set (like Unicode) to a sequence of bytes.
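
As a rough illustration of that mapping, the sketch below counts how many bytes a few single characters need in UTF-8. The class name EncodingWidths and the helper utf8Length are made up for this example; only String.getBytes(Charset) and StandardCharsets come from the JDK.

```java
import java.nio.charset.StandardCharsets;

public class EncodingWidths {
    // Returns how many bytes the given text occupies once encoded as UTF-8.
    public static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Each sample is one Unicode code point, written with escapes so the
        // encoding of this source file cannot affect the result:
        // Latin "A", accented "e" (U+00E9), the CJK character U+4E16, and an emoji (U+1F600).
        String[] samples = {"A", "\u00E9", "\u4E16", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.println(s + " -> " + utf8Length(s) + " byte(s) in UTF-8");
        }
    }
}
```

The four samples need 1, 2, 3, and 4 bytes respectively, which is why the length of a String and the length of its byte array usually differ.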


Method Signatures

The getBytes() method has several overloaded versions, which can be confusing. Here are the main ones:

a) public byte[] getBytes()

This is the simplest form. It uses the platform's default charset to encode the string into a byte array.

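  • How it works: It relies on the JVM's default charset (Charset.defaultCharset()). Since Java 18 that default is UTF-8 on every platform unless it is overridden with the file.encoding system property. On earlier Java versions it is derived from the operating system's locale and can vary from system to system (e.g., typically UTF-8 on Linux/macOS, Cp1252 on Western-European Windows installations).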
  • When to use: Avoid this in production code. Your application's behavior can change if it's moved to a different machine with a different default encoding. It's acceptable for quick, local testing.
String text = "Hello, 世界!";
// Uses the JVM's default charset
byte[] bytes = text.getBytes();
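
To see which default your own JVM picks, a small sketch along these lines may help. The class name DefaultCharsetCheck and the helper defaultMatchesUtf8 are invented for illustration; Charset.defaultCharset() is the standard JDK call.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetCheck {
    // True when encoding with the platform default yields the same bytes as UTF-8.
    public static boolean defaultMatchesUtf8(String text) {
        return Arrays.equals(text.getBytes(), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Unicode escapes keep the sample independent of the source file encoding.
        String text = "Hello, \u4E16\u754C!";
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Same bytes as UTF-8? " + defaultMatchesUtf8(text));
    }
}
```

If the second line prints false on one of your machines, any code that calls getBytes() without arguments will behave differently there.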

b) public byte[] getBytes(String charsetName)

This version lets you specify the character encoding explicitly, by name. It is the form most often seen in older code.

  • How it works: You pass the name of a supported character encoding (e.g., "UTF-8", "ISO-8859-1") as a String.
  • When to use: When the charset name is only known at runtime (for example, read from a configuration file or a protocol header). For a fixed, well-known charset, prefer version (c). Either way, naming the encoding makes your code predictable and portable across different platforms.
  • Throws: UnsupportedEncodingException (a checked exception) if the specified charset name is not supported by the JVM.
import java.io.UnsupportedEncodingException;
import java.util.Arrays;
String text = "Hello, 世界!";
try {
    // Explicitly use UTF-8 encoding
    byte[] utf8Bytes = text.getBytes("UTF-8");
    System.out.println("Encoded with UTF-8: " + Arrays.toString(utf8Bytes));
    // Explicitly use ISO-8859-1 (Latin-1) encoding
    byte[] latin1Bytes = text.getBytes("ISO-8859-1");
    System.out.println("Encoded with ISO-8859-1: " + Arrays.toString(latin1Bytes));
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
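
The checked exception only matters when the name is wrong, so here is a hedged sketch of what that looks like in practice. The class CharsetNameCheck, the helper tryEncode, and the deliberately bogus name "NO-SUCH-CHARSET" are all assumptions made for this example.

```java
import java.io.UnsupportedEncodingException;
import java.util.Optional;

public class CharsetNameCheck {
    // Encodes text with a charset chosen by name.
    // Returns an empty Optional when the JVM does not support that name.
    public static Optional<byte[]> tryEncode(String text, String charsetName) {
        try {
            return Optional.of(text.getBytes(charsetName));
        } catch (UnsupportedEncodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 supported: "
                + tryEncode("abc", "UTF-8").isPresent());
        // "NO-SUCH-CHARSET" is a made-up name used to trigger the exception.
        System.out.println("NO-SUCH-CHARSET supported: "
                + tryEncode("abc", "NO-SUCH-CHARSET").isPresent());
    }
}
```

Wrapping the call like this keeps the try/catch in one place instead of repeating it at every call site.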

c) public byte[] getBytes(Charset charset)

This is a modern, type-safe alternative to version (b). The overload itself was added in Java 6; the StandardCharsets constants used in the example below arrived in Java 7.

  • How it works: Instead of passing a String for the charset name, you pass a java.nio.charset.Charset object.
  • When to use: This is the best practice in modern Java. With the StandardCharsets constants, a misspelled charset is a compile-time error rather than a runtime UnsupportedEncodingException, and no try/catch block is needed.
  • Throws: No checked exceptions. The charsets in StandardCharsets are guaranteed to be available on every Java platform. If you build the Charset from a name with Charset.forName(), an unknown name fails at that call with an unchecked UnsupportedCharsetException.
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
String text = "Hello, 世界!";
// The modern, type-safe way to specify the charset
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
byte[] isoBytes = text.getBytes(StandardCharsets.ISO_8859_1);
System.out.println("Encoded with UTF-8 (safe): " + Arrays.toString(utf8Bytes));
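
When the charset name arrives at runtime, you can still use this overload by resolving the name to a Charset first. The sketch below shows one possible policy, falling back to UTF-8 for unusable names; the class CharsetLookup and the helper resolveOrUtf8 are hypothetical, and whether a silent fallback is appropriate depends on your application.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetLookup {
    // Resolves a charset name that is only known at runtime, falling back to
    // UTF-8 when the name is syntactically illegal or not supported by this JVM.
    public static Charset resolveOrUtf8(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        String text = "Hello, \u4E16\u754C!";
        // "NO-SUCH-CHARSET" is a made-up name used to exercise the fallback.
        for (String name : new String[] {"ISO-8859-1", "NO-SUCH-CHARSET"}) {
            Charset charset = resolveOrUtf8(name);
            System.out.println(name + " -> " + charset + ", "
                    + text.getBytes(charset).length + " bytes");
        }
    }
}
```

Resolving the name once and passing the Charset object around means the rest of the code never has to deal with charset-related exceptions.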

Example: Different Encodings Produce Different Results

This example shows why choosing the right encoding is critical. The characters '世' and '界' cannot be represented in US-ASCII or in ISO-8859-1, so those encodings lose them.

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
public class GetBytesExample {
    public static void main(String[] args) {
        String text = "A test with 世界";
        System.out.println("Original String: " + text);
        System.out.println("Original String length: " + text.length());
        System.out.println("-------------------------------------------------");
        // 1. Using the default charset (not recommended for portability)
        // On Java 18+ this is UTF-8 unless overridden; on older versions it depends on the OS locale.
        byte[] defaultBytes = text.getBytes();
        System.out.println("Default Charset: " + java.nio.charset.Charset.defaultCharset());
        System.out.println("Bytes: " + Arrays.toString(defaultBytes));
        System.out.println("Bytes length: " + defaultBytes.length);
        System.out.println();
        // 2. Using UTF-8 (a universal and common encoding)
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 Charset:");
        System.out.println("Bytes: " + Arrays.toString(utf8Bytes));
        System.out.println("Bytes length: " + utf8Bytes.length);
        System.out.println();
        // 3. Using US-ASCII (can only handle basic English characters)
        // Characters not in the ASCII table will be replaced with a '?'.
        byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
        System.out.println("US-ASCII Charset:");
        System.out.println("Bytes: " + Arrays.toString(asciiBytes));
        System.out.println("Bytes length: " + asciiBytes.length);
        System.out.println();
        // 4. Using ISO-8859-1 (Latin-1, can handle some European characters)
        // Characters not in the ISO-8859-1 table will be replaced with a '?'.
        byte[] isoBytes = text.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("ISO-8859-1 Charset:");
        System.out.println("Bytes: " + Arrays.toString(isoBytes));
        System.out.println("Bytes length: " + isoBytes.length);
    }
}

Output (on a system with UTF-8 as default):

Original String: A test with 世界
Original String length: 14
-------------------------------------------------
Default Charset: UTF-8
Bytes: [65, 32, 116, 101, 115, 116, 32, 119, 105, 116, 104, 32, -28, -72, -106, -25, -107, -116]
Bytes length: 18

UTF-8 Charset:
Bytes: [65, 32, 116, 101, 115, 116, 32, 119, 105, 116, 104, 32, -28, -72, -106, -25, -107, -116]
Bytes length: 18

US-ASCII Charset:
Bytes: [65, 32, 116, 101, 115, 116, 32, 119, 105, 116, 104, 32, 63, 63]
Bytes length: 14

ISO-8859-1 Charset:
Bytes: [65, 32, 116, 101, 115, 116, 32, 119, 105, 116, 104, 32, 63, 63]
Bytes length: 14

Notice how UTF-8 represents each Chinese character with 3 bytes, while US-ASCII and ISO-8859-1 replace each one with a single '?' byte (63). That substitution is silent and irreversible: the original characters cannot be recovered from those bytes.
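
Because getBytes() never reports this loss, it can be useful to check beforehand whether a charset can represent your text. Below is a minimal sketch using the JDK's CharsetEncoder.canEncode(); the class EncodableCheck and its helper are invented names for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodableCheck {
    // True if every character of the text can be represented in the charset.
    public static boolean canEncode(String text, Charset charset) {
        // A fresh encoder per call: CharsetEncoder instances are not thread-safe.
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Same text as the example above, written with Unicode escapes.
        String text = "A test with \u4E16\u754C";
        System.out.println("UTF-8:      " + canEncode(text, StandardCharsets.UTF_8));
        System.out.println("US-ASCII:   " + canEncode(text, StandardCharsets.US_ASCII));
        System.out.println("ISO-8859-1: " + canEncode(text, StandardCharsets.ISO_8859_1));
    }
}
```

This prints true for UTF-8 and false for the other two. If you would rather fail loudly than check first, CharsetEncoder.encode(CharBuffer) throws a CharacterCodingException on unmappable input instead of substituting '?'.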


The Reverse: Converting Bytes back to a String

When you receive a byte array, you must decode it back into a String using the same encoding that was used to create it. Otherwise, you will get garbled text (mojibake).

The corresponding method is the String constructor: new String(byte[] bytes, Charset charset).

import java.nio.charset.StandardCharsets;
public class StringFromBytesExample {
    public static void main(String[] args) {
        // The text includes non-ASCII characters on purpose: pure ASCII text
        // decodes identically as UTF-8 and ISO-8859-1, so it would hide the problem.
        String original = "Decoding is the reverse of encoding: 世界";
        // 1. Encode the string to bytes using UTF-8
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println("Original string: " + original);
        // 2. Decode the bytes back to a string using the SAME UTF-8 encoding
        String decodedCorrectly = new String(utf8Bytes, StandardCharsets.UTF_8);
        System.out.println("Decoded correctly: " + decodedCorrectly); // Success!
        System.out.println("-------------------------------------------------");
        // 3. A common mistake: decoding with the wrong encoding
        // Let's pretend we received the 'utf8Bytes' but thought they were ISO-8859-1
        String decodedIncorrectly = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println("Decoded incorrectly (as ISO-8859-1): " + decodedIncorrectly);
        // The Chinese characters come out garbled: ISO-8859-1 treats each byte of a
        // multi-byte UTF-8 sequence as a separate character.
    }
}
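
Putting the two directions together gives charset conversion: decode the bytes with the charset they were written in, then encode the resulting String with the target charset. The sketch below is one way to write that; the class Transcode and its helper are hypothetical names, and it assumes you already know the source charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Transcode {
    // Re-encodes bytes from one charset to another by going through a String.
    public static byte[] transcode(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    public static void main(String[] args) {
        // "cafe" with an accented e (U+00E9), stored as ISO-8859-1 bytes.
        byte[] latin1 = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = transcode(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(latin1)); // [99, 97, 102, -23]
        System.out.println(Arrays.toString(utf8));   // [99, 97, 102, -61, -87]
    }
}
```

The accented character takes one byte in ISO-8859-1 and two in UTF-8, so the converted array is one byte longer. Passing the wrong source charset here produces the same garbling as the decoding mistake shown above.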