Of course! It sounds like you're looking for a way to use Python to work with Word (.docx) and Excel (.xlsx) files. This is a very common and powerful task for automating document and data processing.
Here's a comprehensive guide covering the most popular and effective Python libraries for this purpose.
Working with Excel Files (.xlsx, .xls)
For Excel, the go-to library is pandas. It's the industry standard for data manipulation in Python. For more advanced tasks like writing formulas or manipulating cell styles, you can use openpyxl.
Method 1: The pandas Library (Recommended for most tasks)
pandas is perfect for reading data into a DataFrame (a powerful table-like data structure), analyzing it, and then writing it back to an Excel file.
Installation:
pip install pandas openpyxl
(Note: openpyxl is required as an engine by pandas to write .xlsx files.)
Common Tasks:
a) Reading an Excel File
import pandas as pd
# Read the first sheet of an Excel file into a DataFrame
df = pd.read_excel('data.xlsx')
# Read a specific sheet by name
df_sales = pd.read_excel('data.xlsx', sheet_name='Sales')
# Read a specific sheet by index (0 is the first sheet)
df_customers = pd.read_excel('data.xlsx', sheet_name=0)
print(df.head()) # Display the first 5 rows
b) Writing to an Excel File
import pandas as pd
# Create a sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df_to_write = pd.DataFrame(data)
# Write the DataFrame to an Excel file
# This will create a new file or overwrite an existing one
df_to_write.to_excel('output.xlsx', index=False) # index=False prevents writing the DataFrame index
print("Excel file 'output.xlsx' created successfully.")
c) Appending Data to an Existing Sheet
This is a bit more complex as pandas doesn't have a direct "append" mode. The best way is to read the existing data, combine it with your new data, and write it all back.
import pandas as pd
# New data to append
new_data = {'Name': ['David'], 'Age': [40], 'City': ['Houston']}
new_df = pd.DataFrame(new_data)
try:
# Try to read the existing file
existing_df = pd.read_excel('output.xlsx')
# Combine the old and new DataFrames
final_df = pd.concat([existing_df, new_df], ignore_index=True)
except FileNotFoundError:
# If the file doesn't exist, just use the new data
final_df = new_df
# Write the combined data back to the file
final_df.to_excel('output.xlsx', index=False)
print("Data appended to 'output.xlsx'.")
Method 2: The openpyxl Library
Use openpyxl when you need more control over the Excel file itself, such as:
- Writing Excel formulas (
=SUM(A1:A10)). - Merging cells, changing cell colors, or adjusting fonts.
- Manipulating charts and images.
- Reading and modifying existing files without losing formatting.
Installation:
pip install openpyxl
Common Tasks:
a) Reading Data from Specific Cells
from openpyxl import load_workbook
# Load an existing workbook
wb = load_workbook('data.xlsx')
ws = wb.active # Get the active sheet
# Get the value from a specific cell
cell_value = ws['A1'].value
print(f"Value in cell A1: {cell_value}")
# Get all values from a column
column_values = []
for row in ws.iter_rows(min_row=2, max_row=10, min_col=1, max_col=1):
for cell in row:
column_values.append(cell.value)
print(f"Values in column A (rows 2-10): {column_values}")
wb.close() # Close the workbook
b) Writing Data and Formatting
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill
# Create a new workbook
wb = Workbook()
ws = wb.active= "Report"
# Write headers with bold font
headers = ['Product', 'Sales', 'Region']
ws.append(headers)
for cell in ws[1]:
cell.font = Font(bold=True)
# Write some data
ws.append(['Laptop', 1500, 'North'])
ws.append(['Mouse', 25, 'South'])
# Format a specific cell (e.g., make 'Sales' for Laptop green)
ws['B2'].fill = PatternFill(start_color="90EE90", end_color="90EE90", fill_type="solid")
# Save the workbook
wb.save('formatted_report.xlsx')
print("Formatted Excel file 'formatted_report.xlsx' created.")
Working with Word Documents (.docx)
For Word documents, the standard library is python-docx. It allows you to create, read, and modify .docx files.
Installation:
pip install python-docx
Important Note: python-docx manipulates the text and structure of a document. It does not handle the visual layout (like exact positioning of images or text boxes) in the same way you would in the Word GUI.
Common Tasks:
a) Creating a New Word Document
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Create a new Document object
doc = Document()
# Add a heading
doc.add_heading('Monthly Sales Report', level=1)
# Add a paragraph
doc.add_paragraph('This report summarizes the sales performance for the month of October.')
# Add a table
table = doc.add_table(rows=3, cols=3)
table.cell(0, 0).text = 'Product'
table.cell(0, 1).text = 'Units Sold'
table.cell(0, 2).text = 'Revenue'
# Populate the table
table.cell(1, 0).text = 'Laptop'
table.cell(1, 1).text = '50'
table.cell(1, 2).text = '$75,000'
table.cell(2, 0).text = 'Mouse'
table.cell(2, 1).text = '200'
table.cell(2, 2).text = '$5,000'
# Add a paragraph and align it to the right
p = doc.add_paragraph('End of Report.')
p.alignment = WD_ALIGN_PARAGRAPH.RIGHT
# Save the document
doc.save('sales_report.docx')
print("Word document 'sales_report.docx' created.")
b) Reading and Modifying an Existing Document
from docx import Document
# Open an existing document
doc = Document('sales_report.docx')
# --- Reading Content ---
print("--- Reading Document ---")
# Iterate through all paragraphs and print their text
for para in doc.paragraphs:
print(para.text)
# Iterate through all tables
print("\n--- Reading Tables ---")
for i, table in enumerate(doc.tables):
print(f"Table {i+1}:")
for row in table.rows:
for cell in row.cells:
print(cell.text, end='\t')
print() # New line after each row
# --- Modifying Content ---
# Find a paragraph and change its text
for para in doc.paragraphs:
if 'October' in para.text:
para.text = 'This report summarizes the sales performance for the month of November.'
# Add a bold run to the end of the paragraph
run = para.add_run(' (Revised)')
run.bold = True
break
# Add a new paragraph at the end
doc.add_paragraph('This document was automatically generated by Python.')
# Save the modified document (it's good practice to save with a new name)
doc.save('sales_report_modified.docx')
print("\nModified document saved as 'sales_report_modified.docx'")
Summary: Which Library to Use?
| Task | Excel (.xlsx) |
Word (.docx) |
Best For... |
|---|---|---|---|
| Data Analysis | pandas |
N/A | Reading large datasets, data cleaning, statistical analysis. |
| Simple Read/Write | pandas |
python-docx |
Quick tasks of getting data in and out of files. |
| Complex Formatting | openpyxl |
python-docx |
Changing cell styles, adding formulas, modifying document structure (headers, footers). |
| Appending Data | pandas (read, concat, write) |
N/A | Adding new rows to an existing sheet without losing data. |
| Automating Reports | pandas + openpyxl |
python-docx |
Creating dynamic reports by pulling data from a database (using pandas) and formatting it into a Word/Excel document. |
Putting It All Together: A Full Example
Let's create a script that reads data from an Excel file, performs a simple calculation, and generates a Word report.
Create the input Excel file (sales_data.xlsx)
| Product | Q1_Sales | Q2_Sales |
|---|---|---|
| Product A | 10000 | 12000 |
| Product B | 15000 | 18000 |
| Product C | 8000 | 9500 |
The Python Script (generate_report.py)
import pandas as pd
from docx import Document
from docx.shared import Pt, Inches
from docx.enum.text import WD_ALIGN_PARAGRAPH
# --- Step 1: Read data from Excel ---
try:
df = pd.read_excel('sales_data.xlsx')
print("Successfully read data from Excel.")
except FileNotFoundError:
print("Error: 'sales_data.xlsx' not found. Please create it first.")
exit()
# --- Step 2: Process the data ---
# Calculate total sales for each product
df['Total_Sales'] = df['Q1_Sales'] + df['Q2_Sales']
# Calculate overall total
grand_total = df['Total_Sales'].sum()
# --- Step 3: Generate the Word report ---
doc = Document()
doc.add_heading('Sales Performance Report', 0)
# Add an introduction
doc.add_paragraph('This report details the sales performance for Q1 and Q2 of the fiscal year.')
# Add the data table to the Word document
# We need to convert the DataFrame to a list of lists for the table
table_data = df.values.tolist()
# Add headers
table_headers = list(df.columns)
# Create the table in Word
table = doc.add_table(rows=1, cols=len(table_headers))
hdr_cells = table.rows[0].cells
for i, header in enumerate(table_headers):
hdr_cells[i].text = str(header)
# Add the data rows
for row_data in table_data:
row_cells = table.add_row().cells
for i, cell_data in enumerate(row_data):
row_cells[i].text = str(cell_data)
# Add a summary paragraph
summary_para = doc.add_paragraph()
summary_para.add_run('Summary: ').bold = True
summary_para.add_run(f'The grand total sales for all products is ${grand_total:,}.')
# Save the final report
doc.save('final_sales_report.docx')
print("\nReport generated successfully: 'final_sales_report.docx'")
When you run this script, it will read sales_data.xlsx, calculate the totals, and create a professional-looking final_sales_report.docx.
