杰瑞科技汇

How can you load datasets efficiently in Python?

Loading datasets is a fundamental first step in almost any data analysis or machine learning project in Python. The method you use depends heavily on the source and format of your data.


Here’s a comprehensive guide covering the most common scenarios, from simple text files to specialized libraries.

Summary of Methods

| Data Source | Common File Formats | Best Python Library(s) | Key Function |
|---|---|---|---|
| Local files | CSV, TSV | pandas | pd.read_csv() |
| Local files | Excel (.xls, .xlsx) | pandas | pd.read_excel() |
| Local files | JSON | pandas | pd.read_json() |
| Local files | Plain text | pandas or built-in open() | pd.read_table(), open() |
| Databases | SQL tables | pandas, sqlalchemy | pd.read_sql_query() |
| Online repos | CSV, etc. | pandas | pd.read_csv() with a URL |
| Machine learning | Built-in datasets | scikit-learn | sklearn.datasets.load_...() |
| Machine learning | Image datasets | tensorflow.keras, torchvision | tf.keras.utils.image_dataset_from_directory() |

Loading Local Files with pandas

The pandas library is the de facto standard for data manipulation in Python. Its read_* functions are powerful and flexible.

a) Loading CSV/TSV Files (Most Common)

A CSV (Comma-Separated Values) file is the most common format for tabular data.

import pandas as pd
# Basic loading from a local file
df = pd.read_csv('my_data.csv')
# Display the first 5 rows
print(df.head())
# --- Common and Useful Arguments ---
# Specify a different delimiter (e.g., for a TSV file)
# df = pd.read_csv('my_data.tsv', sep='\t')
# Specify which columns to load
# df = pd.read_csv('my_data.csv', usecols=['column1', 'column3'])
# Specify the data types for columns to save memory
# df = pd.read_csv('my_data.csv', dtype={'column1': 'int32', 'column2': 'category'})
# Skip the first 2 rows
# df = pd.read_csv('my_data.csv', skiprows=2)
# Use the first column as the index
# df = pd.read_csv('my_data.csv', index_col=0)
# Parse dates automatically
# df = pd.read_csv('my_data.csv', parse_dates=['date_column'])
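For files too large to fit in memory, pd.read_csv can also stream the data in pieces via the chunksize argument, which returns an iterator of DataFrames instead of one big frame. A minimal sketch (the inline CSV text here is illustrative, standing in for a large file on disk):

```python
import io
import pandas as pd

# Illustrative CSV content standing in for a large file on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize=4 yields DataFrames of up to 4 rows each
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Process each chunk independently (filter, aggregate, write out, ...)
    total += len(chunk)

print(total)  # 10
```

Because each chunk is processed and discarded, peak memory stays bounded by the chunk size rather than the file size.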

b) Loading Excel Files

For .xls or .xlsx files, you'll need an engine library installed: openpyxl is the modern standard for .xlsx, while the legacy xlrd only reads old .xls files.

# First, install the library if you haven't:
# pip install openpyxl
import pandas as pd
# Load the first sheet by default
df_excel = pd.read_excel('my_data.xlsx')
# Load a specific sheet by name
df_sheet2 = pd.read_excel('my_data.xlsx', sheet_name='Sheet2')
# Load a specific sheet by index (0 for the first sheet)
df_sheet1_by_index = pd.read_excel('my_data.xlsx', sheet_name=0)

c) Loading JSON Files

JSON (JavaScript Object Notation) is very common for APIs and web data.

import pandas as pd
# Pandas can read JSON directly into a DataFrame
df_json = pd.read_json('data.json')
# For complex JSON, you might need to normalize it first
# Example: a JSON file with a list of records
# data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
# df_json_normalized = pd.json_normalize(data)
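To make the commented json_normalize idea concrete, here is a runnable sketch with hypothetical nested records (the field names are invented for illustration); nested keys are flattened into dotted column names:

```python
import pandas as pd

# Hypothetical nested records, as an API might return them
records = [
    {"id": 1, "user": {"name": "Alice", "city": "Oslo"}},
    {"id": 2, "user": {"name": "Bob", "city": "Lima"}},
]

# Nested dicts are flattened into columns like 'user.name'
df = pd.json_normalize(records)
print(sorted(df.columns.tolist()))  # ['id', 'user.city', 'user.name']
```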

Loading Data from Online Sources

You can often load data directly from a URL using Pandas, just like a local file.

import pandas as pd
# Example: Loading data from a URL (e.g., a CSV from GitHub)
url = 'https://raw.githubusercontent.com/cs109/2025_data/master/countries.csv'
df_online = pd.read_csv(url)
print(df_online.head())

Loading Data from Databases

For this, you'll need a library to connect to your database (e.g., psycopg2 for PostgreSQL, pymysql for MySQL) and sqlalchemy to create a connection engine.

# First, install the necessary libraries:
# pip install sqlalchemy psycopg2-binary
import pandas as pd
from sqlalchemy import create_engine
# 1. Create a database connection string
# Format: 'dialect+driver://username:password@host:port/database'
# Example for PostgreSQL:
# engine = create_engine('postgresql+psycopg2://my_user:my_password@localhost:5432/my_database')
# For SQLite (a simple file-based database):
engine = create_engine('sqlite:///my_database.db')
# 2. Load data using a SQL query
query = "SELECT * FROM users WHERE age > 30;"
df_sql = pd.read_sql_query(query, engine)
print(df_sql.head())
# You can also read an entire table directly
# df_table = pd.read_sql_table('users', engine)
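For SQLite specifically, pandas also accepts a plain DBAPI connection, which makes a fully self-contained demonstration possible. The sketch below (table name and rows invented for illustration) also shows passing user input via params rather than string formatting, which avoids SQL injection:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a small illustrative table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", 34), ("Bob", 28), ("Carol", 41)],
)
conn.commit()

# Bind the threshold with params instead of interpolating it into the SQL
df = pd.read_sql_query("SELECT * FROM users WHERE age > ?", conn, params=(30,))
print(sorted(df["name"]))  # ['Alice', 'Carol']
```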

Loading Datasets for Machine Learning

a) Scikit-learn's Built-in Datasets

Scikit-learn comes with several classic datasets for practice. These are returned as Bunch objects (dictionary-like containers with .data, .target, and metadata attributes).

from sklearn.datasets import load_iris
# Load the Iris dataset (classification)
iris_data = load_iris()
# The data is in a dictionary-like object
X_iris = iris_data.data  # Features
y_iris = iris_data.target  # Labels
feature_names = iris_data.feature_names
target_names = iris_data.target_names
print("Features shape:", X_iris.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)
# Note: the Boston housing dataset (load_boston) was deprecated and removed
# in scikit-learn 1.2; for a regression example, use fetch_california_housing:
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
# X_housing, y_housing = housing.data, housing.target
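If you prefer working with pandas, most scikit-learn loaders accept as_frame=True, which packages the features and target as a DataFrame:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the 4 feature columns plus the 'target' column
df = iris.frame
print(df.shape)  # (150, 5)
```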

b) Loading Image Datasets (Deep Learning)

For image data, you typically use utility functions from frameworks like TensorFlow/Keras or PyTorch.

TensorFlow/Keras Example:

This is useful when your images are organized in subdirectories, where each folder name is a class label.

dataset/
├── cats/
│   ├── cat1.jpg
│   ├── cat2.jpg
│   └── ...
└── dogs/
    ├── dog1.jpg
    ├── dog2.jpg
    └── ...
import tensorflow as tf
import matplotlib.pyplot as plt
# Define parameters
batch_size = 32
img_height = 180
img_width = 180
data_dir = 'path/to/your/dataset'
# Create a dataset from the directory
# This automatically infers labels from the subdirectory names
train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
# Get class names
class_names = train_ds.class_names
print("Class names:", class_names)
# Visualize some data
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
plt.show()

Key Takeaways

  1. For most tabular data (CSV, Excel, etc.), start with pandas. It's the most flexible and powerful tool.
  2. Know your data's format. The function you use (read_csv, read_excel, etc.) depends directly on the file type.
  3. Use arguments to control loading. pd.read_csv() has many arguments (sep, usecols, dtype, etc.) to handle messy or specific data requirements.
  4. For ML, use specialized loaders. Scikit-learn has its own loaders, and deep learning frameworks have utilities for images and text.
  5. Always inspect your data after loading. Use df.head(), df.info(), and df.describe() to make sure it loaded correctly.
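The inspection step in the last takeaway looks like this in practice, using a tiny illustrative DataFrame in place of freshly loaded data:

```python
import pandas as pd

# Tiny illustrative DataFrame standing in for freshly loaded data
df = pd.DataFrame({"age": [34, 28, None], "name": ["Alice", "Bob", "Carol"]})

print(df.head())        # first rows: spot-check values and column names
df.info()               # dtypes, memory usage, and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```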