杰瑞科技汇

python loadtxt csv

Loading data from a CSV file is a fundamental task in Python. For clean numerical data, a common choice is the NumPy library's loadtxt function. The standard Python csv module and the Pandas library are also excellent tools, each with its own strengths.


This guide covers all three methods, with a strong focus on numpy.loadtxt.


The Scenario: A Sample CSV File

Let's assume you have a CSV file named data.csv with the following content. This file has a header row and some missing data represented by an empty string.

data.csv

Name,Age,Score,Department
Alice,25,88.5,Engineering
Bob,30,92.1,
Charlie,22,75.0,Marketing
Diana,35,,Finance

Using numpy.loadtxt

numpy.loadtxt is well suited to loading numerical data from simple text files. Its main advantages are speed and direct conversion into a NumPy array.


Key Parameters:

  • fname: The filename or file-like object.
  • delimiter: The character that separates values (',' for CSV).
  • skiprows: Number of rows to skip at the beginning (use 1 to skip the header).
  • usecols: A list of column indices to load. Useful for selecting specific data.
  • dtype: The data type of the output array (e.g., float, int, str).
  • unpack: If True, the returned array is transposed, making it easy to assign columns to variables.

Example Code

import numpy as np

# Loading every column as float fails: 'Name' and 'Department' hold text.
try:
    data = np.loadtxt('data.csv', delimiter=',', skiprows=1, dtype=float)
    print("Successfully loaded data as float:")
    print(data)
except ValueError as e:
    print(f"Error as expected: {e}")

# Load only the 'Age' column (index 1), which is numeric and has no gaps.
ages = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=1, dtype=float)
print("\nAges:", ages)

# 'Score' (index 2) has an empty cell in Diana's row, which loadtxt cannot convert.
try:
    ages, scores = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(1, 2), unpack=True)
except ValueError as e:
    print(f"\nError on the empty 'Score' cell: {e}")

# max_rows=3 stops before Diana's row, so both columns load cleanly.
# unpack=True transposes the result so each column can go to its own variable.
ages, scores = np.loadtxt('data.csv', delimiter=',', skiprows=1, usecols=(1, 2), unpack=True, max_rows=3)
print("\nUnpacked into separate variables (first three rows):")
print("Ages:", ages)
print("Scores:", scores)

Output of the Example Code:

Error as expected: <ValueError message naming the text 'Alice'; exact wording depends on the NumPy version>

Ages: [25. 30. 22. 35.]

Error on the empty 'Score' cell: <ValueError message; exact wording depends on the NumPy version>

Unpacked into separate variables (first three rows):
Ages: [25. 30. 22.]
Scores: [88.5 92.1 75. ]

⚠️ Important Limitations of loadtxt:

  • Homogeneous Data: It's designed for data of a single type (e.g., all floats or all integers). Mixing types (like numbers and strings) will cause a ValueError.
  • Missing Data: It doesn't handle missing data. An empty cell cannot be converted to a float and raises a ValueError, and so does a text placeholder such as "N/A". For files with gaps, numpy.genfromtxt is the NumPy function that fills missing cells with nan (Not a Number).
  • Headers: You must manually skip header rows with skiprows.
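
The missing-data limitation above can be worked around inside NumPy with numpy.genfromtxt. The following is a minimal sketch, assuming the sample data is held in an in-memory string (CSV_TEXT is a name introduced here for illustration) so that it runs without a file on disk:

```python
import io

import numpy as np

# Hypothetical in-memory copy of the sample file, so the sketch is self-contained.
CSV_TEXT = """Name,Age,Score,Department
Alice,25,88.5,Engineering
Bob,30,92.1,
Charlie,22,75.0,Marketing
Diana,35,,Finance
"""

# genfromtxt fills cells it cannot parse as a number with nan by default.
numeric = np.genfromtxt(
    io.StringIO(CSV_TEXT),
    delimiter=",",
    skip_header=1,   # genfromtxt's counterpart to loadtxt's skiprows
    usecols=(1, 2),  # Age and Score only
)

ages, scores = numeric[:, 0], numeric[:, 1]
print(numeric)
# nanmean skips the nan left behind by Diana's empty Score cell.
print("Mean score, ignoring nan:", np.nanmean(scores))
```

genfromtxt does more per-cell checking than loadtxt, so loadtxt remains a reasonable choice when the file is known to be complete.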

When to use numpy.loadtxt: When you have a clean, purely numerical CSV file and need the data as a fast NumPy array for scientific computing or machine learning.


Using the Standard csv Module

This is Python's built-in solution. It reads every field as a string, so mixed data types and empty cells never cause a parsing error; converting values to numbers is left to you.

Key Functions:

  • csv.reader: Reads the file row by row, returning each row as a list of strings.
  • csv.DictReader: Reads the file and returns each row as a dictionary, using the header row as keys. This is often more convenient.

Example Code

import csv
print("--- Using csv.reader ---")
with open('data.csv', 'r', newline='') as file:  # newline='' is recommended by the csv docs
    # csv.reader returns an iterator
    csv_reader = csv.reader(file)
    # Skip the header row
    next(csv_reader)
    # Iterate over the remaining rows
    for row in csv_reader:
        # Each row is a list of strings
        print(f"Name: {row[0]}, Age: {row[1]}, Score: {row[2]}, Dept: {row[3]}")
print("\n--- Using csv.DictReader (often more useful) ---")
with open('data.csv', 'r', newline='') as file:  # newline='' is recommended by the csv docs
    # DictReader uses the first row of the file as keys for the dictionaries
    dict_reader = csv.DictReader(file)
    # You can access data by column name
    for row in dict_reader:
        # The missing data for 'Score' will be an empty string ''
        print(f"Name: {row['Name']}, Age: {row['Age']}, Score: '{row['Score']}', Dept: {row['Department']}")

Output of the Example Code:

--- Using csv.reader ---
Name: Alice, Age: 25, Score: 88.5, Dept: Engineering
Name: Bob, Age: 30, Score: 92.1, Dept: 
Name: Charlie, Age: 22, Score: 75.0, Dept: Marketing
Name: Diana, Age: 35, Score: , Dept: Finance
--- Using csv.DictReader (often more useful) ---
Name: Alice, Age: 25, Score: '88.5', Dept: Engineering
Name: Bob, Age: 30, Score: '92.1', Dept: 
Name: Charlie, Age: 22, Score: '75.0', Dept: Marketing
Name: Diana, Age: 35, Score: '', Dept: Finance

Note: All data is read as strings. You would need to manually convert types (e.g., int(row['Age'])).
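
The manual conversion mentioned in the note can be sketched as follows. This is only one possible approach: to_float_or_none and CSV_TEXT are names introduced here for illustration, and the sample data is read from an in-memory string instead of data.csv so the snippet runs on its own.

```python
import csv
import io

# Hypothetical in-memory copy of the sample file.
CSV_TEXT = """Name,Age,Score,Department
Alice,25,88.5,Engineering
Bob,30,92.1,
Charlie,22,75.0,Marketing
Diana,35,,Finance
"""


def to_float_or_none(text):
    """Illustrative helper: convert a CSV cell to float, or None if it is empty."""
    text = text.strip()
    return float(text) if text else None


records = []
with io.StringIO(CSV_TEXT, newline="") as handle:
    for row in csv.DictReader(handle):
        records.append({
            "Name": row["Name"],
            "Age": int(row["Age"]),                      # every Age cell is filled
            "Score": to_float_or_none(row["Score"]),     # may be empty
            "Department": row["Department"] or None,     # '' becomes None
        })

for record in records:
    print(record)
```

Mapping empty cells to None keeps missing values distinguishable from a real score of 0.0, which matters if the records are averaged later.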

When to use the csv module: When you need maximum flexibility, are working with mixed data types, or want to avoid external dependencies. It's perfect for simple scripts and data cleaning tasks.


Using the pandas Library (Recommended for Data Analysis)

Pandas is the standard for data analysis in Python. Its read_csv function is incredibly robust and feature-rich.

Key Parameters:

  • filepath_or_buffer: The filename.
  • header: Row number(s) to use as the column names (0 for the first row).
  • usecols: List of column names or indices to load.
  • dtype: Dictionary of column names to data types.
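  • na_values: Additional strings to be recognized as NaN (e.g., 'missing', 'unknown'). Empty fields and common markers such as 'N/A' and 'NA' are already treated as NaN by default.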
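
The parameters above can be combined in a single call. The sketch below is illustrative only: it uses a modified copy of the sample data in which Bob's score is recorded as the placeholder text "missing" (an assumption made here to show na_values), and reads it from an in-memory string named CSV_TEXT.

```python
import io

import pandas as pd

# Hypothetical variant of the sample data with a custom placeholder for a missing score.
CSV_TEXT = """Name,Age,Score,Department
Alice,25,88.5,Engineering
Bob,30,missing,
Charlie,22,75.0,Marketing
Diana,35,,Finance
"""

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=0,                          # first row holds the column names
    usecols=["Name", "Age", "Score"],  # skip the Department column
    dtype={"Age": "int64"},            # force Age to a 64-bit integer
    na_values=["missing"],             # treat this placeholder as NaN too
)

print(df)
print(df.dtypes)
```

Both the "missing" placeholder and Diana's empty cell end up as NaN, so the Score column is still parsed as floating point.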

Example Code

import pandas as pd
# Load the entire CSV into a DataFrame
# Pandas automatically infers data types and handles headers
df = pd.read_csv('data.csv')
print("--- Full Pandas DataFrame ---")
print(df)
print("\nDataFrame Info:")
df.info()
# --- Accessing data ---
print("\n--- Accessing specific columns ---")
print(df[['Name', 'Age']])
print("\n--- Accessing specific rows with .loc ---")
print(df.loc[df['Age'] > 28])
# --- Handling missing data ---
# Pandas automatically interprets empty strings as NaN (Not a Number)
print("\n--- Checking for missing values (NaN) ---")
print(df.isnull())
# You can easily fill missing values
print("\n--- Filling missing scores with the mean ---")
mean_score = df['Score'].mean()
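# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['Score'] = df['Score'].fillna(mean_score)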
print(df)

Output of the Example Code:

--- Full Pandas DataFrame ---
      Name  Age  Score   Department
0    Alice   25   88.5  Engineering
1      Bob   30   92.1          NaN
2  Charlie   22   75.0    Marketing
3    Diana   35    NaN      Finance
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Name        4 non-null      object 
 1   Age         4 non-null      int64  
 2   Score       3 non-null      float64
 3   Department  3 non-null      object 
dtypes: float64(1), int64(1), object(2)
memory usage: 256.0+ bytes
--- Accessing specific columns ---
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   22
3    Diana   35
--- Accessing specific rows with .loc ---
    Name  Age  Score  Department
1    Bob   30   92.1         NaN
3  Diana   35    NaN     Finance
--- Checking for missing values (NaN) ---
    Name    Age  Score  Department
0  False  False  False       False
1  False  False  False        True
2  False  False  False       False
3  False  False   True       False
--- Filling missing scores with the mean ---
      Name  Age  Score   Department
0    Alice   25   88.5  Engineering
1      Bob   30   92.1          NaN
2  Charlie   22   75.0    Marketing
3    Diana   35   85.2      Finance

Note: The mean of (88.5, 92.1, 75.0) is 85.2.

When to use pandas: This is the recommended approach for almost any data analysis task. It handles headers, mixed data types, missing data, and provides powerful tools for filtering, grouping, and manipulating your data.
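
As a brief illustration of the filtering and grouping mentioned above, here is a minimal sketch. It assumes the sample data is available as an in-memory string (CSV_TEXT is a name introduced here); with a real file, the path would be passed to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the sample file.
CSV_TEXT = """Name,Age,Score,Department
Alice,25,88.5,Engineering
Bob,30,92.1,
Charlie,22,75.0,Marketing
Diana,35,,Finance
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Filtering: keep only the rows where Age is greater than 28.
older = df[df["Age"] > 28]
print(older)

# Grouping: mean Score per Department.
# Rows with a missing Department (Bob) are left out of the groups by default.
mean_by_dept = df.groupby("Department")["Score"].mean()
print(mean_by_dept)
```

Finance still appears in the grouped result, but its mean is NaN because Diana's score is missing.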


Summary and Comparison

Feature      | numpy.loadtxt                  | csv Module                      | pandas.read_csv
-------------|--------------------------------|---------------------------------|------------------------------------
Primary Use  | Numerical data, speed          | Simple, flexible text parsing   | Data analysis, manipulation
Data Types   | Homogeneous (one type)         | All strings (manual conversion) | Heterogeneous (infers types)
Headers      | Must be skipped (skiprows)     | Must be handled manually        | Automatic (header=0)
Missing Data | Poor (often fails)             | Handled as empty strings        | Excellent (converts to NaN)
Output Type  | NumPy Array                    | List of strings / Dicts         | Pandas DataFrame
Dependencies | NumPy                          | None (built-in)                 | Pandas
Best For     | Scientific computing, ML input | Quick scripts, simple parsing   | Data analysis, cleaning, exploration

Final Recommendation:

  • For quick, simple tasks or when you can't install external libraries, use the csv module.
  • For purely numerical data where performance is critical, use numpy.loadtxt.
  • For any serious data analysis, manipulation, or exploration, use pandas.read_csv. It is the industry standard for a reason.