杰瑞科技汇

How to Specify a dtype for a Python Array

The dtype (data type) is fundamental to working with arrays in Python, especially in libraries like NumPy and Pandas. This article covers what a dtype is, how to specify one, and how dtypes behave in both libraries.


What is a dtype?

In short, dtype is an object that describes how the bytes in a fixed-size block of memory should be interpreted.

Think of it like this:

  • A Python list is a flexible container. It can hold numbers, strings, objects, and other lists, all mixed together. Each element in the list is a Python object, which has a lot of overhead.
  • A NumPy array is a contiguous block of memory. To make it fast and efficient, it needs to know exactly what kind of data it's holding. Is it a whole number (int)? A decimal number (float)? A piece of text (string)? The dtype tells the program how to read and write that block of memory.
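
To make the contrast concrete, here is a minimal sketch comparing the memory used by a list and by an array holding the same 1,000 integers. The variable names are illustrative, and the list figure assumes CPython, where every int is a separate object.

```python
import sys
import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# Every array element occupies exactly `itemsize` bytes in one contiguous buffer.
print(arr.itemsize)  # 8
print(arr.nbytes)    # 8000

# A list stores pointers to separate Python int objects, each with its own overhead.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes > arr.nbytes)  # True
```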

Why is dtype Important?

  1. Memory Efficiency: Specifying a dtype allows you to control the memory footprint of your array. Storing a million integers as int8 (1 byte each) uses 1MB of memory, while storing them as int64 (8 bytes each) uses 8MB.
  2. Performance: Because every element has the same known type, NumPy does not have to check the type of each element. It can run operations over the whole block in optimized, compiled loops (C or Fortran-level speed).
  3. Type Enforcement: NumPy arrays are homogeneous, meaning all elements share one dtype. If you assign a value of another type to an integer array, NumPy either casts it to the array's dtype or raises an error when it cannot, whereas a Python list accepts any object.
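
The memory and type-enforcement points can be checked directly. The sketch below uses `nbytes` for the footprint of a million-element array, then assigns mismatched values to an integer array. The array names are only for illustration.

```python
import numpy as np

n = 1_000_000
small = np.zeros(n, dtype=np.int8)
large = np.zeros(n, dtype=np.int64)
print(small.nbytes)  # 1000000 (about 1 MB)
print(large.nbytes)  # 8000000 (about 8 MB)

ints = np.array([1, 2, 3], dtype=np.int64)

# Assigning a float casts it to the array's dtype: the fraction is dropped.
ints[0] = 7.9
print(ints[0])  # 7

# Assigning text that cannot be parsed as an integer raises an error.
error_raised = False
try:
    ints[1] = "abc"
except ValueError as exc:
    error_raised = True
    print(f"ValueError: {exc}")
```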

Common dtypes in NumPy

Here are the most fundamental data types you'll encounter. The names follow a pattern: <kind><size>, where the size is given in bits (int8 is 8 bits, or 1 byte).

Numeric Types

| Type    | Description                                  | Size in Bytes | Example                                |
|---------|----------------------------------------------|---------------|----------------------------------------|
| int8    | Integer                                      | 1             | np.array([1, 2, 3], dtype=np.int8)     |
| int16   | Integer                                      | 2             | np.array([1, 2, 3], dtype=np.int16)    |
| int32   | Integer                                      | 4             | np.array([1, 2, 3], dtype=np.int32)    |
| int64   | Integer (default on most 64-bit platforms)   | 8             | np.array([1, 2, 3])                    |
| uint8   | Unsigned Integer (0 to 255)                  | 1             | np.array([100, 200], dtype=np.uint8)   |
| float16 | Half-precision float                         | 2             | np.array([1.0, 2.0], dtype=np.float16) |
| float32 | Single-precision float                       | 4             | np.array([1.0, 2.0], dtype=np.float32) |
| float64 | Double-precision float (default)             | 8             | np.array([1.0, 2.0])                   |
| bool    | Boolean (True / False)                       | 1             | np.array([True, False, True])          |

Note on int vs. uint:

  • int (signed) can be positive or negative (e.g., int8 range is -128 to 127).
  • uint (unsigned) can only be zero or positive (e.g., uint8 range is 0 to 255).
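
The ranges above can be read from NumPy itself with `np.iinfo`. The sketch below also shows what happens when array arithmetic goes past the upper limit: the value wraps around silently rather than raising an error.

```python
import numpy as np

# np.iinfo reports the representable range of an integer dtype.
for int_type in (np.int8, np.uint8, np.int16):
    info = np.iinfo(int_type)
    print(int_type.__name__, info.min, info.max)

# 250 + 10 does not fit in uint8, so the result wraps modulo 256.
a = np.array([250], dtype=np.uint8)
b = np.array([10], dtype=np.uint8)
wrapped = a + b
print(wrapped)  # [4]
```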

String and Object Types

| Type   | Description |
|--------|-------------|
| <UN    | Unicode string. N is the maximum number of characters and < marks little-endian byte order, e.g., <U10 for strings of up to 10 characters. |
| object | A catch-all type. The array holds references to Python objects, losing the speed benefits of NumPy. Useful for storing values of different types (e.g., a mix of integers and strings). |
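
A short sketch of how these two dtypes behave in practice. NumPy picks the string width from the longest element, so assigning a longer string later truncates it. The sample values are illustrative.

```python
import numpy as np

# The width is taken from the longest string: 5 characters here.
words = np.array(["hi", "hello"])
print(words.dtype)  # <U5 on little-endian machines

# Longer strings are silently truncated to the fixed width.
words[0] = "a much longer string"
print(words[0])  # a muc

# An object array keeps each element as its original Python object.
objs = np.array([1, "two", 3.0], dtype=object)
element_types = [type(x).__name__ for x in objs]
print(objs.dtype)     # object
print(element_types)  # ['int', 'str', 'float']
```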

Working with dtypes in NumPy

Creating an Array with a Specific dtype

You can specify the dtype when you create an array using the dtype argument.

import numpy as np
# Default integer type (usually int64 on 64-bit systems)
arr_int = np.array([1, 2, 3])
print(f"Array: {arr_int}")
print(f"Default dtype: {arr_int.dtype}\n")
# Explicitly set to a smaller integer type
arr_int8 = np.array([1, 2, 3], dtype=np.int8)
print(f"Array: {arr_int8}")
print(f"Specified dtype: {arr_int8.dtype}\n")
# Default float type (usually float64)
arr_float = np.array([1.1, 2.2, 3.3])
print(f"Array: {arr_float}")
print(f"Default dtype: {arr_float.dtype}\n")
# Explicitly set to a smaller float type
arr_float32 = np.array([1.1, 2.2, 3.3], dtype=np.float32)
print(f"Array: {arr_float32}")
print(f"Specified dtype: {arr_float32.dtype}\n")
# Create an array of strings
arr_str = np.array(['hello', 'world'])
print(f"Array: {arr_str}")
print(f"String dtype: {arr_str.dtype}") # <U5: the longest string has 5 characters

Checking an Array's dtype

Every NumPy array has a .dtype attribute.

arr = np.array([10, 20, 30])
print(arr.dtype)  # Output: int64 (or int32 depending on system)
arr_float = np.array([1.0, 2.0, 3.0])
print(arr_float.dtype) # Output: float64
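
Beyond printing it, the dtype object exposes a few attributes that are handy for inspection. A minimal sketch follows. The attributes `name`, `kind` and `itemsize` are part of NumPy's dtype object, and the variable names are illustrative.

```python
import numpy as np

arr = np.array([1.5, 2.5], dtype=np.float32)
dt = arr.dtype

print(dt.name)      # float32
print(dt.kind)      # f  (i = signed int, u = unsigned int, f = float, b = bool, U = unicode)
print(dt.itemsize)  # 4  (bytes per element)

# The same dtype can be spelled as a NumPy type, a name string, or a short code.
same_dtype = np.dtype(np.float32) == np.dtype("float32") == np.dtype("f4")
print(same_dtype)   # True
```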

Converting or "Casting" an Array's dtype

You can change the dtype of an existing array using the .astype() method. This is called casting.

# Create an array of floats
arr_float = np.array([1.1, 2.2, 3.3, 4.4])
print(f"Original array: {arr_float}, dtype: {arr_float.dtype}")
# Cast to integers (the fractional part is truncated toward zero!)
arr_int = arr_float.astype(np.int32)
print(f"Cast array:  {arr_int}, dtype: {arr_int.dtype}")
# Cast to a different float type (float16 keeps only about 3 decimal digits)
arr_float16 = arr_float.astype(np.float16)
print(f"Cast to float16: {arr_float16}, dtype: {arr_float16.dtype}")

Warning: Casting can lead to loss of data or precision, as seen when casting float to int.
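
One way to guard against accidental lossy casts is to ask NumPy whether a conversion is safe before doing it. The sketch below uses `np.can_cast` and the `casting` argument of `.astype()`. By default `.astype()` allows unsafe casts, which is why the truncation above happens silently.

```python
import numpy as np

arr = np.array([1.9, -1.9, 300.7])

# The default cast truncates toward zero without complaint.
truncated = arr.astype(np.int32)
print(truncated)  # [  1  -1 300]

# can_cast answers whether a conversion preserves every value.
print(np.can_cast(np.float64, np.int32))  # False
print(np.can_cast(np.int32, np.float64))  # True

# casting="safe" turns a lossy conversion into an error.
safe_error = None
try:
    arr.astype(np.int32, casting="safe")
except TypeError as exc:
    safe_error = exc
    print(f"TypeError: {exc}")
```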


Type Conversion Rules

When you perform operations on arrays with different dtypes, NumPy follows a set of promotion rules to determine the resulting dtype. Generally, the result is the smallest type that can safely represent the values of both inputs.

# Integer + Integer -> Integer
a = np.array([1, 2], dtype=np.int8)
b = np.array([3, 4], dtype=np.int16)
result = a + b
print(f"int8 + int16 -> {result.dtype}") # Output: int16
# Integer + Float -> Float
c = np.array([1, 2], dtype=np.int32)
d = np.array([3.0, 4.0], dtype=np.float64)
result = c + d
print(f"int32 + float64 -> {result.dtype}") # Output: float64
# Integer + Boolean -> Integer
e = np.array([1, 2, 3], dtype=np.int8)
f = np.array([True, False, True])
result = e + f
print(f"int8 + bool -> {result.dtype}") # Output: int8 (bool casts safely to any integer type)
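
You can also query the promotion rules without building arrays. A small sketch using `np.promote_types` and `np.result_type`, including two cases where the result is wider than either input:

```python
import numpy as np

print(np.promote_types(np.int8, np.int16))     # int16
print(np.promote_types(np.int32, np.float64))  # float64

# Mixing signed and unsigned needs a wider signed type to cover both ranges.
print(np.promote_types(np.int8, np.uint8))     # int16

# float32 cannot represent every int64 exactly, so the result is float64.
print(np.promote_types(np.int64, np.float32))  # float64

# result_type applies the same rules and accepts dtypes or arrays.
print(np.result_type(np.bool_, np.int8))       # int8
```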

dtype in Pandas

Pandas is built on top of NumPy, so its core data types are based on NumPy dtypes. However, Pandas adds its own names and a special type for handling missing data.

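| Pandas Type     | NumPy Equivalent | Description                                |
|-----------------|------------------|--------------------------------------------|
| int64           | np.int64         | Integer.                                   |
| float64         | np.float64       | Floating point number.                     |
| bool            | np.bool_         | Boolean.                                   |
| object          | np.object_       | Python object.                             |
| category        | N/A              | For categorical data (fixed set of values). |
| datetime64[ns]  | np.datetime64    | For dates and times.                       |
| timedelta64[ns] | np.timedelta64   | For differences between two dates/times.   |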

The key difference in Pandas is the introduction of missing data support. A NumPy array with dtype=float64 can have np.nan (Not a Number) to represent missing values. However, an array with dtype=int64 cannot have np.nan. Pandas handles this elegantly with its own types (e.g., Int64, Float64), which are nullable and can hold missing values while still behaving like integers or floats.

import pandas as pd
import numpy as np
# A standard NumPy integer array CANNOT hold NaN
try:
    arr = np.array([1, 2, np.nan], dtype=np.int64)
except ValueError as e:  # NumPy raises ValueError: NaN cannot be converted to an integer
    print(f"NumPy Error: {e}")
# A Pandas Series with dtype 'Int64' (capital I) CAN hold missing values
# represented by the special pd.NA value.
s = pd.Series([1, 2, None], dtype="Int64")
print("\nPandas Series with nullable Int64:")
print(s)
print(f"Pandas dtype: {s.dtype}") # Output: Int64
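
A related behaviour worth seeing: without an explicit dtype, Pandas stores integers with a missing value as float64, because NaN is a float. The sketch below shows that fallback and the conversion to the nullable Int64 type. The Series names are illustrative.

```python
import pandas as pd

# Without a dtype, the missing value forces a fallback to float64 with NaN.
plain = pd.Series([1, 2, None])
print(plain.dtype)  # float64

# Converting to the nullable Int64 dtype restores integer values and uses pd.NA.
nullable = plain.astype("Int64")
print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [False, False, True]

# Reductions skip missing values by default.
total = nullable.sum()
print(total)  # 3
```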

Summary

| Feature      | Python List                    | NumPy Array                                       | Pandas Series                                              |
|--------------|--------------------------------|---------------------------------------------------|------------------------------------------------------------|
| Data Type    | Heterogeneous (can mix types)  | Homogeneous (one dtype for all)                   | Homogeneous (one dtype), but with special nullable types   |
| dtype Role   | N/A                            | Crucial. Defines memory layout and enables speed. | Crucial. Built on NumPy, with extensions for missing data. |
| Performance  | Slow for numerical operations  | Very Fast                                         | Fast, leverages NumPy.                                     |
| Missing Data | None                           | np.nan (only for float/complex types)             | pd.NA, np.nan (with nullable dtypes like Int64)            |

Understanding dtype is a key step toward writing efficient, robust numerical code in Python. It's the bridge between the high-level flexibility of Python and the low-level performance of compiled languages.
