Of course! Here's a comprehensive guide on using boilerpipe in Python, including what it is, how to install it, and practical code examples.
What is Boilerpipe?
Boilerpipe is a Python wrapper around the original Java library of the same name, designed for content extraction from web pages. Its primary goal is to strip away the "boilerplate" content (the parts of a page you usually don't want, like navigation menus, headers, footers, ads, and sidebars) and leave behind the core, meaningful article text.
It's incredibly useful for:
- Web Scraping: Getting clean text from news articles, blog posts, etc.
- Data Mining: Preparing text data for analysis.
- Building Readers: Creating a "reader mode" for browsers.
Installation
First, you need to install the library. It's available on PyPI.
pip install boilerpipe
boilerpipe relies on a Java runtime environment (JRE) to function. You must have Java installed on your system and available in your system's PATH.
How to check if Java is installed: Open your terminal or command prompt and run:
java -version
If you see a version number (e.g., openjdk version "11.0.16"), you're good to go. If not, you'll need to install it. You can download it from the official Oracle Java website or use a package manager like apt (on Debian/Ubuntu) or brew (on macOS).
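The same check can be done from Python before importing boilerpipe, which is handy in scripts that should fail with a clear message. This is a minimal sketch using only the standard library; the helper names find_java and java_version_banner are made up for this example.

```python
import shutil
import subprocess


def find_java():
    """Return the path to the `java` executable on PATH, or None if absent."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Run `java -version` and return its banner (Java prints it to stderr)."""
    result = subprocess.run(
        [java_path, "-version"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        print("Java not found on PATH; install a JRE before importing boilerpipe.")
    else:
        print(java_version_banner(java_path))
```

If find_java() returns None, installing a JRE (or fixing PATH) is the first thing to try.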
Basic Usage
The main entry point is the Extractor class in the boilerpipe.extract module. You select an extraction strategy by name; the most common one is DefaultExtractor, which is a good general-purpose choice.
Let's start with a simple example.
Example 1: Extracting Text from a URL
This is the most typical use case. You provide a URL and get back the main text.
from boilerpipe.extract import Extractor
# The URL of the article you want to process
url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Create an extractor instance.
# The 'Extractor' class can take a URL directly.
extractor = Extractor(extractor='DefaultExtractor', url=url)
# Extract the text
article_text = extractor.getText()
print("--- Article Text ---")
print(article_text)
print("\n--- End of Text ---")
# You can also get the HTML of the main content
article_html = extractor.getHTML()
# print(article_html)
Output: The output will be the clean, main text of the Wikipedia article, without the navigation, sidebars, or footer.
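Passing url= leaves the download to the library. If you need your own timeout or headers, one option is to fetch the page yourself and hand the HTML string to the extractor (the html= form is covered under "Extracting from HTML Strings"). The sketch below uses only the standard library for the download; fetch_html, decode_body, and extract_from_url are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
import urllib.request


def decode_body(raw, charset=None, fallback="utf-8"):
    """Decode response bytes; on a bad or unknown charset, fall back to UTF-8."""
    try:
        return raw.decode(charset or fallback)
    except (LookupError, UnicodeDecodeError):
        return raw.decode(fallback, errors="replace")


def fetch_html(url, timeout=10.0, user_agent="Mozilla/5.0"):
    """Download a page with an explicit timeout and User-Agent; return text."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


def extract_from_url(url):
    """Fetch the page ourselves, then pass the HTML string to boilerpipe."""
    # Imported lazily so the JVM only starts when extraction is requested.
    from boilerpipe.extract import Extractor

    html = fetch_html(url)
    return Extractor(extractor="ArticleExtractor", html=html).getText()
```

Network failures surface as urllib.error.URLError (or a timeout error) from fetch_html, so callers can catch them separately from extraction problems.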
Key Features and Advanced Usage
Choosing the Right Extractor
boilerpipe comes with several pre-built extractors, each tuned for different types of pages. Choosing the right one can significantly improve accuracy.
- DefaultExtractor: A good general-purpose extractor and a sensible starting point.
- ArticleExtractor: Specifically tuned for news articles and blog posts. It's often more accurate than DefaultExtractor for this content type.
- KeepEverythingExtractor: The opposite of what you usually want. It marks every block as content and removes nothing, which is useful as a baseline or when you want to be very conservative.
- LargestContentExtractor: Keeps only the largest block of text on the page. Simple, but can be effective on some pages.
Example using ArticleExtractor:
from boilerpipe.extract import Extractor
url = 'https://www.nytimes.com/2025/10/27/technology/openai-sam-altman.html'
# Use the 'ArticleExtractor' for news articles
extractor = Extractor(extractor='ArticleExtractor', url=url)
article_text = extractor.getText()
print("--- Article Text (from ArticleExtractor) ---")
print(article_text[:500] + "...") # Print the first 500 chars
Extracting from HTML Strings
You don't have to use a URL. You can also pass raw HTML content directly.
from boilerpipe.extract import Extractor
# Some sample HTML with boilerplate
html = """
<html>
<head>
<title>My Awesome Page</title>
<link rel="stylesheet" href="style.css">
</head>
<body>
<div id="header">
<h1>Navigation</h1>
<ul>
<li><a href="/home">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
</div>
<div id="main-content">
<h2>The Real Article Title</h2>
<p>This is the first paragraph of the actual article content that I want to extract. It contains the main information.</p>
<p>Here is a second paragraph with more details and relevant text for the user.</p>
</div>
<div id="sidebar">
<h3>Related Links</h3>
<p>You might also be interested in...</p>
</div>
<div id="footer">
<p>© 2025 My Awesome Page. All rights reserved.</p>
</div>
</body>
</html>
"""
# Create an extractor from the HTML string
# Note: We don't specify a URL this time.
extractor = Extractor(extractor='DefaultExtractor', html=html)
article_text = extractor.getText()
print("--- Extracted Text from HTML ---")
print(article_text)
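# Expected output (approximate; boilerpipe's heuristics decide what survives,
# and very short pages like this one can behave differently):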
# The Real Article Title
# This is the first paragraph of the actual article content that I want to extract. It contains the main information.
# Here is a second paragraph with more details and relevant text for the user.
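Once the HTML is in memory, it costs little to run several extractors over the same input and compare what each keeps. The following is a sketch under assumptions: compare_extractors and summarize are hypothetical helpers, and the make_extractor parameter exists only so the function can be exercised without a JVM (by default it uses boilerpipe's Extractor).

```python
def compare_extractors(html, names, make_extractor=None):
    """Run several boilerpipe extractors on the same HTML; return {name: text}."""
    if make_extractor is None:
        # Imported lazily so the JVM only starts when boilerpipe is really used.
        from boilerpipe.extract import Extractor
        make_extractor = Extractor

    results = {}
    for name in names:
        extractor = make_extractor(extractor=name, html=html)
        results[name] = extractor.getText()
    return results


def summarize(results):
    """Return (name, character_count) pairs, longest output first."""
    return sorted(
        ((name, len(text)) for name, text in results.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
```

A typical call would be summarize(compare_extractors(html, ["DefaultExtractor", "ArticleExtractor", "KeepEverythingExtractor"])). Character counts are only a rough signal, so it is still worth reading the outputs side by side.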
Getting Different Kinds of Output
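Besides getText() and getHTML(), you may be able to get more structured data.
getArticle(): Described here as returning a dictionary containing title, text, and authors (if found). This method does not appear in the published python-boilerpipe source, so confirm that your installed version provides it before relying on the example below.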
from boilerpipe.extract import Extractor
url = 'https://en.wikipedia.org/wiki/Web_scraping'
extractor = Extractor(extractor='DefaultExtractor', url=url)
article = extractor.getArticle()
print("--- Article Metadata ---")
print(f"Title: {article['title']}")
print(f"Authors: {article['authors']}") # Often empty for Wikipedia
print("\n--- Article Text ---")
print(article['text'])
Customizing Extraction
You can fine-tune the extraction process by providing your own list of "boilerplate" blocks or by marking blocks as "content" or "boilerplate". This is an advanced feature but gives you full control.
from boilerpipe.extract import Extractor
# Imagine you have analyzed a page and know exactly what to keep/remove
html = "<html><body><div id='nav'>This is a menu</div><div id='content'>This is the good stuff</div></body></html>"
# Run the standard extraction first
extractor = Extractor(extractor='DefaultExtractor', html=html)
# Assumption: the wrapper keeps the processed Java TextDocument on
# extractor.source (as the python-boilerpipe source does; check your version).
doc = extractor.source
# Iterate through the text blocks and set their status manually.
# getTextBlocks(), getText() and setIsContent() belong to the underlying Java API.
for block in doc.getTextBlocks():
    block.setIsContent('good stuff' in str(block.getText()))
# getContent() returns the text of the blocks currently marked as content
text = doc.getContent()
print(text)
# Intended result: only "This is the good stuff" remains
Alternatives to boilerpipe
While boilerpipe is excellent, it's good to know other options.
- readability-lxml / python-readability: A Python port of the popular Readability.js library. It's generally considered to be very accurate and easy to use. It's often the go-to choice for new projects.
  - Install: pip install readability-lxml
  - It requires lxml: pip install lxml
- trafilatura: A modern and powerful library that combines its own extraction algorithm with a fallback to readability. It's known for its high accuracy and speed.
  - Install: pip install trafilatura
  - It also requires lxml.
- newspaper3k: A more comprehensive library for building news article scrapers. It includes extraction, but also NLP features like keyword and summary generation.
  - Install: pip install newspaper3k
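To give a feel for how the first two alternatives are called, here is a minimal sketch that tries trafilatura first and falls back to readability-lxml, returning None if neither is installed. The function name extract_main_text and the fallback order are choices made for this example, not something either library prescribes.

```python
def extract_main_text(html):
    """Return main text via an installed alternative library, else None."""
    try:
        import trafilatura
    except ImportError:
        trafilatura = None

    if trafilatura is not None:
        # extract() returns the main text, or None if nothing usable was found.
        text = trafilatura.extract(html)
        if text:
            return text

    try:
        import lxml.html
        from readability import Document
    except ImportError:
        return None

    # summary() returns the cleaned article as HTML; strip the tags for text.
    summary_html = Document(html).summary()
    if not summary_html.strip():
        return None
    return lxml.html.fromstring(summary_html).text_content().strip()
```

Neither path needs Java, which is the main practical difference from boilerpipe.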
Summary: boilerpipe vs. Alternatives
| Library | Pros | Cons | Best For |
|---|---|---|---|
| boilerpipe | Mature, robust, many extractors, good control. | Requires Java, can be slower. | General-purpose content extraction, especially when you need fine-grained control. |
| readability-lxml | Very accurate, simple to use, no Java dependency. | Less configurable than boilerpipe. | Most use cases where you just need the main article text reliably. |
| trafilatura | Extremely accurate, fast, modern, no Java dependency. | Can be slightly more complex to get started than readability. | High-performance applications and projects where accuracy is paramount. |
| newspaper3k | All-in-one (scraping, NLP, extraction), easy to use. | Can be heavy if you only need extraction, less focused on extraction quality. | Building a full-featured news aggregator or analysis tool. |
For most new Python projects, I would recommend starting with trafilatura or readability-lxml due to their excellent performance and lack of a Java dependency. However, boilerpipe remains a powerful and reliable tool, especially if you're working in an environment where Java is already a standard.
