Of course! Here is a comprehensive guide to using Beautiful Soup, one of the most popular Python libraries for web scraping.

What is Beautiful Soup?
Beautiful Soup is a Python library designed for pulling data out of HTML and XML files. It builds a parse tree from the document's contents, making it easy to navigate, search, and modify that tree.
Think of it as giving your Python program the ability to "read" and "understand" the structure of a webpage.
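For instance, here is a minimal, self-contained sketch of that idea (the HTML snippet is invented for illustration):

```python
from bs4 import BeautifulSoup

# A toy HTML snippet -- no network access needed
html = "<html><body><h1>Hello</h1><p class='intro'>World</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no install needed

print(soup.h1.get_text())                   # Hello
print(soup.find('p', class_='intro').text)  # World
```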
Key Features:
- Parses HTML and XML documents.
- Provides simple, Pythonic ways to navigate the parsed tree (e.g., find tags by name, find tags by CSS class).
- Converts incoming documents to Unicode and outgoing documents to UTF-8.
- Tolerates "badly-formed" markup, which is common on the real web.
- Works with various parsers (like Python's built-in html.parser, as well as lxml and html5lib).
Installation
First, you need to install Beautiful Soup. It's highly recommended to also install a parser like lxml, as it's much faster than the built-in html.parser.

```bash
pip install beautifulsoup4
pip install lxml
```
Basic Workflow: A Step-by-Step Example
Let's scrape the titles and links from the Wikipedia page for "Python (programming language)".
Step 1: Fetch the Web Page
You need to get the HTML content of the page. The de facto standard tool for this in Python is the third-party requests library.
```bash
pip install requests
```
Now, let's write the code to fetch the page.
```python
import requests
from bs4 import BeautifulSoup

# The URL of the page we want to scrape
url = 'https://en.wikipedia.org/wiki/Python_(programming_language)'

# Send an HTTP GET request to the URL
try:
    response = requests.get(url)
    # Raise an exception if the request was unsuccessful (e.g., 404 Not Found)
    response.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
else:
    # The HTML content of the page is in the .text attribute
    html_content = response.text
    print("Successfully fetched the page!")
```
Step 2: Create a BeautifulSoup Object
Now, we'll parse the HTML content using Beautiful Soup and the lxml parser.

```python
# Create a BeautifulSoup object
# The first argument is the HTML content
# The second argument is the parser to use ('lxml', 'html.parser', etc.)
soup = BeautifulSoup(html_content, 'lxml')
```
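As a quick sanity check that the parse worked, you can print the page's <title> (the exact text shown here is indicative):

```python
# Quick sanity check: the document's <title> tag
print(soup.title.get_text())  # e.g. "Python (programming language) - Wikipedia"
```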
Step 3: Find Elements in the Parsed Tree
This is the core of Beautiful Soup. We'll use methods like .find() and .find_all() to locate the data we need.
Goal: Find all the main section headings (e.g., "History", "Design philosophy and features", etc.). On Wikipedia, these are inside <h2> tags.
```python
# Find all <h2> tags on the page
section_headings = soup.find_all('h2')
print(f"Found {len(section_headings)} <h2> tags.")
```
Goal: Extract the text from these headings.
Each <h2> tag also contains a <span> with a class. We want the human-readable text inside the <h2> without the surrounding markup.
```python
for heading in section_headings:
    # .get_text() strips all tags and returns just the human-readable text
    heading_text = heading.get_text(strip=True)
    print(heading_text)
```
Step 4: Refine Your Search with Attributes and CSS Selectors
Often, you need to be more specific. Let's find the main infobox table on the right side of the page. It has a class called infobox.
```python
# Find the first element with the class 'infobox'
infobox = soup.find('table', class_='infobox')

if infobox:
    print("Found the infobox table!")
    # Find all rows (<tr>) in the infobox
    rows = infobox.find_all('tr')
    for row in rows:
        # Find the header cell (<th>) and the data cell (<td>) in the row
        header_cell = row.find('th')
        data_cell = row.find('td')
        if header_cell and data_cell:
            # .get_text(strip=True) cleans up whitespace
            header = header_cell.get_text(strip=True)
            data = data_cell.get_text(strip=True)
            print(f"{header}: {data}")
```
Note: Beautiful Soup uses class_ instead of class because class is a reserved keyword in Python.
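For example, these two calls are equivalent (the class name 'mw-headline' is just an illustration):

```python
# Two equivalent ways to match on the HTML 'class' attribute
spans = soup.find_all('span', class_='mw-headline')
spans = soup.find_all('span', attrs={'class': 'mw-headline'})
```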
Core Methods and Concepts
| Method | Description | Example |
|---|---|---|
| `soup.find(tag, attrs, ...)` | Finds the first matching element. | `soup.find('div', id='main-content')` |
| `soup.find_all(tag, attrs, ...)` | Finds all matching elements and returns a list. | `soup.find_all('a', class_='link')` |
| `.get_text()` | Returns all the text inside an element and its children. | `my_element.get_text()` |
| `.text` | Similar to `.get_text()`, but as a property. | `my_element.text` |
| `.string` | Returns the text inside an element only if it has a single string child; returns `None` otherwise. | `my_element.string` |
| `.parent` | Returns the parent of a tag. | `my_element.parent` |
| `.find_next_sibling()` | Finds the next sibling element. | `my_element.find_next_sibling('p')` |
| `.select(CSS_SELECTOR)` | A powerful method that uses CSS selectors to find elements. | `soup.select('div.content p a')` |
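To make the subtler distinctions concrete, here is a small self-contained demo (the HTML snippet is invented):

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup("<div><p>Hi <b>there</b></p></div>", "html.parser")
p = demo.p

print(p.get_text())            # 'Hi there'  -- all text, children included
print(p.string)                # None        -- <p> has more than one child
print(p.b.string)              # 'there'     -- <b> has a single string child
print(p.parent.name)           # 'div'
print(demo.select('div p b'))  # [<b>there</b>] -- .select() returns a list
```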
Navigating the Tree (The "Dot" Notation)
You can navigate the tree using dot notation for tags.
```python
# Find the <body> tag
body = soup.body
# Find the first <div> inside the body
first_div = body.div
# Find the first <p> inside that first <div>
first_paragraph = first_div.p
print(first_paragraph)
```
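Keep in mind that dot notation only ever returns the first match; it behaves like .find() with that tag name:

```python
# Dot notation is shorthand for .find() and returns only the FIRST match
first_div = soup.body.div
same_div = soup.body.find('div')
print(first_div is same_div)  # True -- both point at the same tag object
```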
Handling Different Parsers
Beautiful Soup supports several parsers. The choice can affect speed and how it handles "broken" HTML.
| Parser | Pros | Cons | How to Use |
|---|---|---|---|
| `lxml` | Very fast, lenient with broken HTML. | Requires an external C library to be installed. | `BeautifulSoup(html, 'lxml')` |
| `html.parser` | Built into Python, no extra installation needed. | Slower than `lxml`, less lenient. | `BeautifulSoup(html, 'html.parser')` |
| `html5lib` | Extremely lenient; parses the page exactly like a web browser. | Very slow, requires an external library. | `BeautifulSoup(html, 'html5lib')` |
Recommendation: Use lxml if you can install it. It's the best balance of speed and robustness. Use html.parser as a fallback if you can't install anything else.
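If you want a script to prefer lxml but still run where it isn't installed, one possible pattern is a small helper like this (make_soup is a hypothetical name, not part of Beautiful Soup):

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(html):
    """Parse with lxml when available, falling back to the stdlib parser."""
    try:
        return BeautifulSoup(html, 'lxml')
    except FeatureNotFound:
        # Beautiful Soup raises FeatureNotFound when the requested
        # parser library is not installed
        return BeautifulSoup(html, 'html.parser')
```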
A Complete, Runnable Example
Let's combine everything into a final script that scrapes the titles of the latest news from Hacker News.
```python
import requests
from bs4 import BeautifulSoup

def get_hacker_news_titles():
    """
    Fetches the Hacker News homepage and prints the titles of the top 30 news links.
    """
    url = 'https://news.ycombinator.com'
    headers = {
        'User-Agent': 'My Web Scraper 1.0 (https://example.com/my-scraper)'
    }
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching Hacker News: {e}")
        return

    soup = BeautifulSoup(response.text, 'lxml')

    # On Hacker News, story titles are in <span> tags with the class 'titleline'.
    # We use a CSS selector here, which is very powerful.
    title_spans = soup.select('span.titleline')

    print("--- Top Hacker News Stories ---")
    for i, span in enumerate(title_spans[:30]):  # Get the top 30
        # The actual title is in an <a> tag inside the <span>
        title_link = span.find('a')
        if title_link:
            title = title_link.get_text(strip=True)
            print(f"{i+1}. {title}")

if __name__ == "__main__":
    get_hacker_news_titles()
```
Important Considerations & Best Practices
- Be Respectful (Check robots.txt): Before scraping a site, check its robots.txt file (e.g., https://example.com/robots.txt). This file tells bots which parts of the site they are not allowed to access. Always respect these rules (a minimal check is sketched below).
- Set a User-Agent: Some websites block the default requests User-Agent because it is associated with simple scrapers. Set a custom one that identifies your scraper, as in the example above.
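As a minimal illustration of the robots.txt check mentioned above, the standard library's urllib.robotparser can do it (the URLs are placeholders):

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt check using only the standard library
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')  # placeholder URL
rp.read()

page = 'https://example.com/some/page'
if rp.can_fetch('My Web Scraper 1.0', page):
    print(f"robots.txt allows fetching {page}")
else:
    print(f"robots.txt disallows fetching {page}")
```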
