Of course! Loading datasets is a fundamental first step in almost any data analysis or machine learning project in Python. The method you use depends heavily on the source and format of your data.

Here’s a comprehensive guide covering the most common scenarios, from simple text files to specialized libraries.
## Summary of Methods

| Data Source | Common File Formats | Best Python Library(s) | Key Function |
|---|---|---|---|
| Local files | CSV, TSV | pandas | `pd.read_csv()` |
| Local files | Excel (.xls, .xlsx) | pandas | `pd.read_excel()` |
| Local files | JSON | pandas | `pd.read_json()` |
| Local files | Plain text | pandas or built-in `open()` | `pd.read_table()`, `open()` |
| Databases | SQL tables | pandas, sqlalchemy | `pd.read_sql_query()` |
| Online repos | CSV, etc. | pandas | `pd.read_csv()` with a URL |
| Machine learning | Built-in datasets | scikit-learn | `sklearn.datasets.load_...()` |
| Machine learning | Image datasets | tensorflow.keras, torchvision | `tf.keras.utils.image_dataset_from_directory()` |
## Loading Local Files with pandas

The pandas library is the de facto standard for data manipulation in Python. Its `read_*` functions are incredibly powerful and flexible.
### a) Loading CSV/TSV Files (Most Common)
A CSV (Comma-Separated Values) file is the most common format for tabular data.
```python
import pandas as pd

# Basic loading from a local file
df = pd.read_csv('my_data.csv')

# Display the first 5 rows
print(df.head())

# --- Common and Useful Arguments ---

# Specify a different delimiter (e.g., for a TSV file)
# df = pd.read_csv('my_data.tsv', sep='\t')

# Specify which columns to load
# df = pd.read_csv('my_data.csv', usecols=['column1', 'column3'])

# Specify the data types for columns to save memory
# df = pd.read_csv('my_data.csv', dtype={'column1': 'int32', 'column2': 'category'})

# Skip the first 2 rows
# df = pd.read_csv('my_data.csv', skiprows=2)

# Use the first column as the index
# df = pd.read_csv('my_data.csv', index_col=0)

# Parse dates automatically
# df = pd.read_csv('my_data.csv', parse_dates=['date_column'])
```
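To see several of these arguments working together, here is a self-contained sketch that parses an in-memory TSV (`io.StringIO` stands in for a real file, and the column names are purely illustrative):

```python
import io
import pandas as pd

# In-memory TSV standing in for a real file (illustrative data)
raw = "date\tcity\ttemp\n2024-01-01\tOslo\t-3\n2024-01-02\tOslo\t-5\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep='\t',                    # tab-separated
    parse_dates=['date'],        # parse the date column into datetimes
    dtype={'city': 'category'},  # save memory on repeated strings
)
print(df.dtypes)
```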
### b) Loading Excel Files

For .xls or .xlsx files, you'll need an engine library installed. openpyxl is the modern standard for .xlsx; xlrd is only needed for legacy .xls files.

```python
# First, install the library if you haven't:
# pip install openpyxl
import pandas as pd

# Load the first sheet by default
df_excel = pd.read_excel('my_data.xlsx')

# Load a specific sheet by name
df_sheet2 = pd.read_excel('my_data.xlsx', sheet_name='Sheet2')

# Load a specific sheet by index (0 for the first sheet)
df_sheet1_by_index = pd.read_excel('my_data.xlsx', sheet_name=0)
```
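To load every sheet at once, pass `sheet_name=None`, which returns a dict mapping sheet names to DataFrames (the file name here is just a placeholder):

```python
import pandas as pd

# sheet_name=None loads all sheets into a dict keyed by sheet name
all_sheets = pd.read_excel('my_data.xlsx', sheet_name=None)
for name, sheet_df in all_sheets.items():
    print(name, sheet_df.shape)
```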
### c) Loading JSON Files
JSON (JavaScript Object Notation) is very common for APIs and web data.
```python
import pandas as pd

# Pandas can read JSON directly into a DataFrame
df_json = pd.read_json('data.json')

# For complex JSON, you might need to normalize it first
# Example: a JSON file with a list of records
# data = [{'id': 1, 'name': 'Alice'}, {'id': 2, 'name': 'Bob'}]
# df_json_normalized = pd.json_normalize(data)
```
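The summary table also lists plain text files. For those, the built-in `open()` covers raw lines, while `pd.read_table()` handles delimited text (tab-separated by default). A minimal sketch, assuming a placeholder file `my_data.txt`:

```python
import pandas as pd

# Read raw lines with the built-in open()
with open('my_data.txt') as f:
    lines = [line.rstrip('\n') for line in f]
print(lines[:5])

# pd.read_table() is essentially read_csv() with sep='\t' as the default
df_text = pd.read_table('my_data.txt')
```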
## Loading Data from Online Sources

You can often load data directly from a URL using pandas, just like a local file.
```python
import pandas as pd

# Example: Loading data from a URL (e.g., a CSV from GitHub)
url = 'https://raw.githubusercontent.com/cs109/2025_data/master/countries.csv'
df_online = pd.read_csv(url)
print(df_online.head())
```
## Loading Data from Databases

For this, you'll need a driver library for your database (e.g., psycopg2 for PostgreSQL, pymysql for MySQL) and sqlalchemy to create a connection engine.
```python
# First, install the necessary libraries:
# pip install sqlalchemy psycopg2-binary
import pandas as pd
from sqlalchemy import create_engine

# 1. Create a database connection string
# Format: 'dialect+driver://username:password@host:port/database'
# Example for PostgreSQL:
# engine = create_engine('postgresql+psycopg2://my_user:my_password@localhost:5432/my_database')

# For SQLite (a simple file-based database):
engine = create_engine('sqlite:///my_database.db')

# 2. Load data using a SQL query
query = "SELECT * FROM users WHERE age > 30;"
df_sql = pd.read_sql_query(query, engine)
print(df_sql.head())

# You can also read an entire table directly
# df_table = pd.read_sql_table('users', engine)
```
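Note that the query above assumes `my_database.db` already contains a `users` table. For a fully self-contained demo, you can create one first with `DataFrame.to_sql()` (the table and column names here are illustrative):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///my_database.db')

# Create a small illustrative 'users' table so the query has data to hit
pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [34, 28]}).to_sql(
    'users', engine, if_exists='replace', index=False)

df_sql = pd.read_sql_query("SELECT * FROM users WHERE age > 30;", engine)
print(df_sql)  # only Alice (age 34) matches the filter
```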
## Loading Datasets for Machine Learning

### a) Scikit-learn's Built-in Datasets

Scikit-learn comes with several classic datasets for practice. These are returned as dictionary-like `Bunch` objects.
```python
from sklearn.datasets import load_iris

# Load the Iris dataset (classification)
iris_data = load_iris()

# The data is in a dictionary-like Bunch object
X_iris = iris_data.data    # Features
y_iris = iris_data.target  # Labels
feature_names = iris_data.feature_names
target_names = iris_data.target_names

print("Features shape:", X_iris.shape)
print("Feature names:", feature_names)
print("Target names:", target_names)

# Note: the Boston housing dataset (load_boston) was deprecated and
# removed in scikit-learn 1.2, so importing it now raises an error.
# boston_data = load_boston()
# X_boston = boston_data.data
# y_boston = boston_data.target
```
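If you prefer working in pandas, most of these loaders accept `as_frame=True`, which exposes the dataset as a DataFrame:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects; .frame combines features and target
iris = load_iris(as_frame=True)
print(iris.frame.head())
```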
### b) Loading Image Datasets (Deep Learning)
For image data, you typically use utility functions from frameworks like TensorFlow/Keras or PyTorch.
TensorFlow/Keras Example:
This is useful when your images are organized in subdirectories, where each folder name is a class label.
```
dataset/
├── cats/
│   ├── cat1.jpg
│   ├── cat2.jpg
│   └── ...
└── dogs/
    ├── dog1.jpg
    ├── dog2.jpg
    └── ...
```
```python
import tensorflow as tf
import matplotlib.pyplot as plt

# Define parameters
batch_size = 32
img_height = 180
img_width = 180
data_dir = 'path/to/your/dataset'

# Create a dataset from the directory
# This automatically infers labels from the subdirectory names
train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="training",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

val_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    validation_split=0.2,
    subset="validation",
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size)

# Get class names
class_names = train_ds.class_names
print("Class names:", class_names)

# Visualize some data
plt.figure(figsize=(10, 10))
for images, labels in train_ds.take(1):
    for i in range(9):
        ax = plt.subplot(3, 3, i + 1)
        plt.imshow(images[i].numpy().astype("uint8"))
        plt.title(class_names[labels[i]])
        plt.axis("off")
plt.show()
```
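If you're using PyTorch instead, `torchvision.datasets.ImageFolder` plays the same role, inferring labels from the same directory layout. A minimal sketch (the path and image size mirror the TensorFlow example above):

```python
import torch
from torchvision import datasets, transforms

# Resize images and convert them to tensors
transform = transforms.Compose([
    transforms.Resize((180, 180)),
    transforms.ToTensor(),
])

# ImageFolder infers class labels from the subdirectory names
train_data = datasets.ImageFolder('path/to/your/dataset', transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

print("Classes:", train_data.classes)
```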
## Key Takeaways

- For most tabular data (CSV, Excel, etc.), start with pandas. It's the most flexible and powerful tool.
- Know your data's format. The function you use (`read_csv`, `read_excel`, etc.) depends directly on the file type.
- Use arguments to control loading. `pd.read_csv()` has many arguments (`sep`, `usecols`, `dtype`, etc.) to handle messy or specific data requirements.
- For ML, use specialized loaders. Scikit-learn has its own loaders, and deep learning frameworks have utilities for images and text.
- Always inspect your data after loading. Use `df.head()`, `df.info()`, and `df.describe()` to make sure it loaded correctly.
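For that last point, a quick post-load inspection pattern (where `df` is whatever DataFrame you just loaded):

```python
# Quick sanity checks on a freshly loaded DataFrame `df`
print(df.head())      # first rows: did the columns line up?
df.info()             # dtypes and non-null counts (prints directly)
print(df.describe())  # summary statistics for numeric columns
```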
