杰瑞科技汇

How can you load datasets efficiently in Python?

Loading datasets is a fundamental first step in almost any data analysis or machine learning project in Python. The method you use depends heavily on the source and format of your data.


Here’s a comprehensive guide covering the most common scenarios, from simple text files to specialized libraries.

Summary of Methods

| Data Source | Common File Formats | Best Python Library(s) | Key Function |
|---|---|---|---|
| Local files | CSV, TSV | pandas | pd.read_csv() |
| Local files | Excel (.xls, .xlsx) | pandas | pd.read_excel() |
| Local files | JSON | pandas | pd.read_json() |
| Local files | Plain text | pandas or built-in open() | pd.read_table(), open() |
| Databases | SQL tables | pandas, sqlalchemy | pd.read_sql_query() |
| Online repos | CSV, etc. | pandas | pd.read_csv() with a URL |
| Machine learning | Built-in datasets | scikit-learn | sklearn.datasets.load_...() |
| Machine learning | Image datasets | tensorflow.keras, torchvision | tf.keras.utils.image_dataset_from_directory() |

Loading Local Files with pandas

The pandas library is the de facto standard for data manipulation in Python. Its read_* functions are powerful and flexible.

a) Loading CSV/TSV Files (Most Common)

A CSV (Comma-Separated Values) file is the most common format for tabular data.

import pandas as pd
# Basic loading from a local file
df = pd.read_csv('my_data.csv')
# Display the first 5 rows
print(df.head())
# --- Common and Useful Arguments ---
# Specify a different delimiter (e.g., for a TSV file)
# df = pd.read_csv('my_data.tsv', sep='\t')
# Specify which columns to load
# df = pd.read_csv('my_data.csv', usecols=['column1', 'column3'])
# Specify the data types for columns to save memory
# df = pd.read_csv('my_data.csv', dtype={'column1': 'int32', 'column2': 'category'})
# Skip the first 2 rows
# df = pd.read_csv('my_data.csv', skiprows=2)
# Use the first column as the index
# df = pd.read_csv('my_data.csv', index_col=0)
# Parse dates automatically
# df = pd.read_csv('my_data.csv', parse_dates=['date_column'])
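For files too large to fit in memory, pd.read_csv can also stream the data in pieces via the chunksize argument, which returns an iterator of DataFrames instead of one big frame. A minimal sketch (the inline CSV text here is illustrative, standing in for a large file on disk):

```python
import io
import pandas as pd

# Illustrative CSV content standing in for a large file on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# chunksize=4 yields DataFrames of up to 4 rows each
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Process each chunk independently (filter, aggregate, write out, ...)
    total += len(chunk)

print(total)  # 10
```

Because each chunk is processed and discarded, peak memory stays bounded by the chunk size rather than the file size.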

b) Loading Excel Files

For .xls or .xlsx files, you'll need an engine library installed: openpyxl is the modern standard for .xlsx, while the legacy xlrd only reads old .xls files.

# First, install the library if you haven't:
# pip install openpyxl
import pandas as pd
# Load the first sheet by default
df_excel = pd.read_excel('my_data.xlsx')
# Load a specific sheet by name
df_sheet2 = pd.read_excel('my_data.xlsx', sheet_name='Sheet2')
# Load a specific sheet by index (0 for the first sheet)
df_sheet1_by_index = pd.read_excel('my_data.xlsx', sheet_name=0)

c) Loading JSON Files

JSON (JavaScript Object Notation) is very common for APIs and web data.

import pandas as pd
# Pandas can read JSON directly into a DataFrame
df_json = pd.read_json('data.json')
# For complex JSON, you might need to normalize it first
# Example: a JSON file with a list of records
# data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
# df_json_normalized = pd.json_normalize(data)
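To make the commented json_normalize idea concrete, here is a runnable sketch with hypothetical nested records (the field names are invented for illustration); nested keys are flattened into dotted column names:

```python
import pandas as pd

# Hypothetical nested records, as an API might return them
records = [
    {"id": 1, "user": {"name": "Alice", "city": "Oslo"}},
    {"id": 2, "user": {"name": "Bob", "city": "Lima"}},
]

# Nested dicts are flattened into columns like 'user.name'
df = pd.json_normalize(records)
print(sorted(df.columns.tolist()))  # ['id', 'user.city', 'user.name']
```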

Loading Data from Online Sources

You can often load data directly from a URL using Pandas, just like a local file.

import pandas as pd
# Example: Loading data from a URL (e.g., a CSV from GitHub)
url = 'https://raw.githubusercontent.com/cs109/2025_data/master/countries.csv'
df_online = pd.read_csv(url)
print(df_online.head())

Loading Data from Databases

For this, you'll need a library to connect to your database (e.g., psycopg2 for PostgreSQL, pymysql for MySQL) and sqlalchemy to create a connection engine.

# First, install the necessary libraries:
# pip install sqlalchemy psycopg2-binary
import pandas as pd
from sqlalchemy import create_engine
# 1. Create a database connection string
# Format: 'dialect+driver://username:password@host:port/database'
# Example for PostgreSQL:
# engine = create_engine('postgresql+psycopg2://my_user:my_password@localhost:5432/my_database')
# For SQLite (a simple file-based database):
engine = create_engine('sqlite:///my_database.db')
# 2. Load data using a SQL query
query = "SELECT * FROM users WHERE age > 30;"
df_sql = pd.read_sql_query(query, engine)
print(df_sql.head())
# You can also read an entire table directly
# df_table = pd.read_sql_table('users', engine)
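For SQLite specifically, pandas also accepts a plain DBAPI connection, which makes a fully self-contained demonstration possible. The sketch below (table name and rows invented for illustration) also shows passing user input via params rather than string formatting, which avoids SQL injection:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a small illustrative table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("Alice", 34), ("Bob", 28), ("Carol", 41)],
)
conn.commit()

# Bind the threshold with params instead of interpolating it into the SQL
df = pd.read_sql_query("SELECT * FROM users WHERE age > ?", conn, params=(30,))
print(sorted(df["name"]))  # ['Alice', 'Carol']
```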

Loading Datasets for Machine Learning

a) Scikit-learn's Built-in Datasets

Scikit-learn comes with several classic datasets for practice. These are returned as Bunch objects (dictionary-like containers with .data, .target, and metadata attributes).

from sklearn.datasets import load_iris
# Load the Iris dataset (classification)
iris_data = load_iris()
# The data is in a dictionary-like object
X_iris = iris_data.data  # Features
y_iris = iris_data.target  # Labels
feature_names = iris_data.feature_names
target_names = iris_data.target_names
print("Features shape:", X_iris.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)
# Note: the Boston housing dataset (load_boston) was deprecated and removed
# in scikit-learn 1.2; for a regression example, use fetch_california_housing:
# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
# X_housing, y_housing = housing.data, housing.target
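If you prefer working with pandas, most scikit-learn loaders accept as_frame=True, which packages the features and target as a DataFrame:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the 4 feature columns plus the 'target' column
df = iris.frame
print(df.shape)  # (150, 5)
```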

b) Loading Image Datasets (Deep Learning)

For image data, you typically use utility functions from frameworks like TensorFlow/Keras or PyTorch.

TensorFlow/Keras Example:

This is useful when your images are organized in subdirectories, where each folder name is a class label.

dataset/
├── cats/
│   ├── cat1.jpg
│   ├── cat2.jpg
│   └── ...
└── dogs/
    ├── dog1.jpg
    ├── dog2.jpg
    └── ...
import tensorflow as tf
import matplotlib.pyplot as plt
# Define parameters
batch_size = 32
img_height = 180
img_width = 180
data_dir = 'path/to/your/dataset'
# Create a dataset from the directory
# This automatically infers labels from the subdirectory names
train_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
val_ds = tf.keras.utils.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)
# Get class names
class_names = train_ds.class_names
print("Class names:", class_names)
# Visualize some data
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
  for i in range(9):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(images[i].numpy().astype("uint8"))
    plt.title(class_names[labels[i]])
    plt.axis("off")
plt.show()

Key Takeaways

  1. For most tabular data (CSV, Excel, etc.), start with pandas. It's the most flexible and powerful tool.
  2. Know your data's format. The function you use (read_csv, read_excel, etc.) depends directly on the file type.
  3. Use arguments to control loading. pd.read_csv() has many arguments (sep, usecols, dtype, etc.) to handle messy or specific data requirements.
  4. For ML, use specialized loaders. Scikit-learn has its own loaders, and deep learning frameworks have utilities for images and text.
  5. Always inspect your data after loading. Use df.head(), df.info(), and df.describe() to make sure it loaded correctly.
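The inspection step in the last takeaway looks like this in practice, using a tiny illustrative DataFrame in place of freshly loaded data:

```python
import pandas as pd

# Tiny illustrative DataFrame standing in for freshly loaded data
df = pd.DataFrame({"age": [34, 28, None], "name": ["Alice", "Bob", "Carol"]})

print(df.head())        # first rows: spot-check values and column names
df.info()               # dtypes, memory usage, and non-null counts
print(df.describe())    # summary statistics for numeric columns
print(df.isna().sum())  # missing values per column
```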