杰瑞科技汇

How do you extract main body content with Python?

This guide covers using boilerpipe in Python: what it is, how to install it, and practical code examples.

What is Boilerpipe?

Boilerpipe is a content-extraction library for web pages. It was originally written in Java, and the Python package is a thin wrapper around that Java code rather than a pure-Python port. Its primary goal is to strip away the "boilerplate" content (the parts of a page you usually don't want, like navigation menus, headers, footers, ads, and sidebars) and leave behind the core, meaningful article text.

It's incredibly useful for:

  • Web Scraping: Getting clean text from news articles, blog posts, etc.
  • Data Mining: Preparing text data for analysis.
  • Building Readers: Creating a "reader mode" for browsers.
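To make the idea of "boilerplate" concrete, here is a toy sketch that uses only the standard library. It splits a page into text blocks and keeps those that are long enough and not dominated by link text. The names BlockCollector and toy_main_text and both thresholds are assumptions made for illustration; this is not boilerpipe's actual classifier, only a rough picture of the kind of shallow text features such tools rely on.

```python
from html.parser import HTMLParser

# Tags that start or end a text block in this simplified model.
BLOCK_TAGS = {"p", "div", "li", "ul", "h1", "h2", "h3", "td", "body"}


class BlockCollector(HTMLParser):
    """Split a page into text blocks and track how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished blocks as (text, link_density)
        self._words = []      # words of the block being built
        self._link_words = 0  # how many of those words sit inside <a>
        self._in_link = 0     # nesting depth of <a> tags

    def flush(self):
        if self._words:
            density = self._link_words / len(self._words)
            self.blocks.append((" ".join(self._words), density))
        self._words, self._link_words = [], 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._link_words += len(words)


def toy_main_text(html, min_words=8, max_link_density=0.3):
    """Keep blocks with enough words and a low share of link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [text for text, density in parser.blocks
            if len(text.split()) >= min_words and density <= max_link_density]
    return "\n".join(kept)


sample = """
<div><ul><li><a href="/">Home</a></li><li><a href="/about">About us</a></li></ul></div>
<div><p>Boilerplate removal keeps the long paragraphs that carry the actual article text.</p></div>
<div><p>Copyright notice</p></div>
"""
print(toy_main_text(sample))
```

On the sample page, the navigation links and the short footer line are dropped and only the long paragraph is printed. Real pages need far more careful rules, which is the reason to reach for a dedicated library.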

Installation

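First, you need to install the library. It's available on PyPI. If the command below fails on Python 3, the boilerpipe3 fork (installed with pip install boilerpipe3) is the usual substitute.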

pip install boilerpipe

boilerpipe relies on a Java runtime environment (JRE) to function. You must have Java installed on your system and available in your system's PATH.

How to check if Java is installed: Open your terminal or command prompt and run:

java -version

If you see a version number (e.g., openjdk version "11.0.16"), you're good to go. If not, you'll need to install it. You can download it from the official Oracle Java website or use a package manager like apt (on Debian/Ubuntu) or brew (on macOS).
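The same check can be done from Python before importing the library, which helps when a script runs on machines you don't control. This is a minimal sketch using only the standard library; the function name java_available is an assumption made for this example.

```python
import shutil
import subprocess


def java_available():
    """Return (found, detail): detail is the version banner or an explanation."""
    java_path = shutil.which("java")
    if java_path is None:
        return False, "no 'java' executable on PATH"
    try:
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True, text=True, timeout=15,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"could not run {java_path}: {exc}"
    # 'java -version' usually writes its banner to stderr, not stdout.
    banner = (result.stderr or result.stdout).strip()
    first_line = banner.splitlines()[0] if banner else ""
    return result.returncode == 0, first_line


found, detail = java_available()
print("Java found:" if found else "Java missing:", detail)
```

Checking the return code as well as the PATH matters on systems that ship a placeholder java command that only prints an installation hint.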


Basic Usage

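The main entry point is the Extractor class in the boilerpipe.extract module. You choose an extraction strategy by passing its name as the extractor argument. The most common choice is DefaultExtractor, which is a good general-purpose option.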

Let's start with a simple example.

Example 1: Extracting Text from a URL

This is the most typical use case. You provide a URL and get back the main text.

from boilerpipe.extract import Extractor
# The URL of the article you want to process
url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Create an extractor instance.
# The 'Extractor' class can take a URL directly.
extractor = Extractor(extractor='DefaultExtractor', url=url)
# Extract the text
article_text = extractor.getText()
print("--- Article Text ---")
print(article_text)
print("\n--- End of Text ---")
# You can also get the HTML of the main content
article_html = extractor.getHTML()
# print(article_html)

Output: The output will be the clean, main text of the Wikipedia article, without the navigation, sidebars, or footer.


Key Features and Advanced Usage

Choosing the Right Extractor

boilerpipe comes with several pre-built extractors, each tuned for different types of pages. Choosing the right one can significantly improve accuracy.

  • DefaultExtractor: A general-purpose extractor and a sensible starting point.
  • ArticleExtractor: Tuned for news articles and blog posts. It's often more accurate than DefaultExtractor for this content type.
  • KeepEverythingExtractor: Treats every text block as content and removes nothing. Useful as a baseline, or when you would rather keep boilerplate than risk losing text.
  • LargestContentExtractor: Works like DefaultExtractor but keeps only the largest block of content. Simple but can be effective on some pages.

Example using ArticleExtractor:

from boilerpipe.extract import Extractor
url = 'https://www.nytimes.com/2025/10/27/technology/openai-sam-altman.html'
# Use the 'ArticleExtractor' for news articles
extractor = Extractor(extractor='ArticleExtractor', url=url)
article_text = extractor.getText()
print("--- Article Text (from ArticleExtractor) ---")
print(article_text[:500] + "...") # Print the first 500 chars
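A quick way to choose between extractors is to run several on the same page and compare how much text each one keeps. The sketch below is hedged: compare_extractors is a hypothetical helper, and it takes the extractor class as a parameter so the logic can be tried without a Java setup. It assumes only the constructor arguments and getText() call shown in the examples here.

```python
EXTRACTOR_NAMES = (
    "DefaultExtractor",
    "ArticleExtractor",
    "KeepEverythingExtractor",
    "LargestContentExtractor",
)


def compare_extractors(html, make_extractor, names=EXTRACTOR_NAMES):
    """Run each named extractor on the same HTML and return {name: word_count}."""
    counts = {}
    for name in names:
        text = make_extractor(extractor=name, html=html).getText()
        # str() guards against the wrapper returning a non-str string type.
        counts[name] = len(str(text).split())
    return counts


# With boilerpipe installed, pass the real class:
#     from boilerpipe.extract import Extractor
#     print(compare_extractors(page_html, Extractor))
```

A much larger count for KeepEverythingExtractor than for the others is expected. A near-zero count for ArticleExtractor suggests the page is not article-shaped and DefaultExtractor may be the better fit.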

Extracting from HTML Strings

You don't have to use a URL. You can also pass raw HTML content directly.

from boilerpipe.extract import Extractor
# Some sample HTML with boilerplate
html = """
<html>
  <head>
    <title>My Awesome Page</title>
    <link rel="stylesheet" href="style.css">
  </head>
  <body>
    <div id="header">
      <h1>Navigation</h1>
      <ul>
        <li><a href="/home">Home</a></li>
        <li><a href="/about">About</a></li>
        <li><a href="/contact">Contact</a></li>
      </ul>
    </div>
    <div id="main-content">
      <h2>The Real Article Title</h2>
      <p>This is the first paragraph of the actual article content that I want to extract. It contains the main information.</p>
      <p>Here is a second paragraph with more details and relevant text for the user.</p>
    </div>
    <div id="sidebar">
      <h3>Related Links</h3>
      <p>You might also be interested in...</p>
    </div>
    <div id="footer">
      <p>&copy; 2025 My Awesome Page. All rights reserved.</p>
    </div>
  </body>
</html>
"""
# Create an extractor from the HTML string
# Note: We don't specify a URL this time.
extractor = Extractor(extractor='DefaultExtractor', html=html)
article_text = extractor.getText()
print("--- Extracted Text from HTML ---")
print(article_text)
# Intended output (the extractor's heuristics decide what survives,
# so results on a page this short may differ):
# The Real Article Title
# This is the first paragraph of the actual article content that I want to extract. It contains the main information.
# Here is a second paragraph with more details and relevant text for the user.
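Passing HTML yourself also means you control the download: headers, timeouts, and character decoding. Here is a minimal standard-library sketch of that step. The helper names decode_html and fetch_html and the User-Agent string are assumptions made for this example.

```python
from email.message import Message
from urllib.request import Request, urlopen


def decode_html(raw, content_type=None, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type header."""
    charset = None
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
    try:
        return raw.decode(charset or default, errors="replace")
    except LookupError:
        # The header named a charset Python does not know; fall back.
        return raw.decode(default, errors="replace")


def fetch_html(url, timeout=10):
    """Download a page with an explicit User-Agent and timeout."""
    request = Request(
        url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-fetcher)"}
    )
    with urlopen(request, timeout=timeout) as response:
        return decode_html(response.read(), response.headers.get("Content-Type"))


# The decoded string can then be handed to the extractor:
#     html = fetch_html("https://en.wikipedia.org/wiki/Web_scraping")
#     extractor = Extractor(extractor="DefaultExtractor", html=html)
```

Keeping the fetch separate also makes it easy to cache pages on disk and re-run extraction with different extractors without hitting the site again.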

Getting Different Kinds of Output

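Besides getText() and getHTML(), a more structured accessor is sometimes described.

  • getArticle(): Described as returning a dictionary containing title, text, and authors (if found). This method is not present in every release of the Python wrapper, so check for it before relying on it.

from boilerpipe.extract import Extractor
url = 'https://en.wikipedia.org/wiki/Web_scraping'
extractor = Extractor(extractor='DefaultExtractor', url=url)
# Guard the call: fall back to plain text when getArticle() is unavailable.
if hasattr(extractor, 'getArticle'):
    article = extractor.getArticle()
    print("--- Article Metadata ---")
    print(f"Title: {article['title']}")
    print(f"Authors: {article['authors']}") # Often empty for Wikipedia
    print("\n--- Article Text ---")
    print(article['text'])
else:
    print("--- Article Text ---")
    print(extractor.getText())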

Customizing Extraction

In principle you can fine-tune extraction by marking individual text blocks as "content" or "boilerplate". That block-level API belongs to the underlying Java library. The Python wrapper's constructor, as used throughout this guide, takes a url or an html argument, so treat manual block control as an advanced feature that means working with the Java objects directly.

from boilerpipe.extract import Extractor
# Imagine you have analyzed a page and know exactly what to keep/remove
html = "<html><body><div id='nav'>This is a menu</div><div id='content'>This is the good stuff</div></body></html>"
# Build the extractor from the HTML string. The chosen extractor runs
# over the parsed document as part of construction.
extractor = Extractor(extractor='DefaultExtractor', html=html)
# (Conceptual note: manual control would mean iterating over the document's
# text blocks, e.g. via getTextBlocks() on the Java document, and setting
# each block's status before reading the text. That step is not shown here.)
text = extractor.getText()
print(text)
# Intended result: only "This is the good stuff".
# On a page this short the heuristics may keep or drop both blocks.

Alternatives to boilerpipe

While boilerpipe is excellent, it's good to know other options.

  1. readability-lxml / python-readability: A Python port of the Readability algorithm, the same approach behind browser "reader" views. It's generally considered accurate and easy to use, and it's a common choice for new projects.

    • Install: pip install readability-lxml
    • It requires lxml: pip install lxml
  2. trafilatura: A modern and powerful library that combines its own extraction algorithm with fallbacks to other extractors such as readability-lxml. It's known for its high accuracy and speed.

    • Install: pip install trafilatura
    • It also requires lxml.
  3. newspaper3k: A more comprehensive library for building news article scrapers. It includes extraction, but also NLP features like keyword and summary generation.

    • Install: pip install newspaper3k
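For comparison, here is a hedged sketch of how the first two alternatives are typically called on an HTML string. The helper name extract_with_alternatives is an assumption for this example. The imports are done lazily so the snippet runs even when neither library is installed, in which case it returns (None, None).

```python
def extract_with_alternatives(html):
    """Return (library_name, text) from the first installed library that
    yields text, or (None, None) if none is installed or nothing is found."""
    try:
        import trafilatura
        text = trafilatura.extract(html)  # None when no main text is found
        if text:
            return "trafilatura", text
    except ImportError:
        pass
    try:
        from readability import Document
        import lxml.html
        # summary() returns cleaned HTML of the main content, not plain text.
        summary_html = Document(html).summary()
        text = lxml.html.fromstring(summary_html).text_content().strip()
        if text:
            return "readability-lxml", text
    except ImportError:
        pass
    return None, None


page = (
    "<html><head><title>Demo</title></head><body><article>"
    "<p>This paragraph stands in for the main article text of a page, "
    "long enough for an extractor to consider it content.</p>"
    "<p>A second paragraph adds a little more text to the same article.</p>"
    "</article></body></html>"
)
name, text = extract_with_alternatives(page)
print(name, (text or "")[:80])
```

Neither call needs a Java runtime, which is the main practical difference from boilerpipe.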

Summary: boilerpipe vs. Alternatives

| Library | Pros | Cons | Best For |
|---|---|---|---|
| boilerpipe | Mature, robust, many extractors, good control. | Requires Java, can be slower. | General-purpose content extraction, especially when you need fine-grained control. |
| readability-lxml | Very accurate, simple to use, no Java dependency. | Less configurable than boilerpipe. | Most use cases where you just need the main article text reliably. |
| trafilatura | Extremely accurate, fast, modern, no Java dependency. | Can be slightly more complex to get started than readability. | High-performance applications and projects where accuracy is paramount. |
| newspaper3k | All-in-one (scraping, NLP, extraction), easy to use. | Can be heavy if you only need extraction, less focused on extraction quality. | Building a full-featured news aggregator or analysis tool. |

For most new Python projects, I would recommend starting with trafilatura or readability-lxml due to their excellent performance and lack of a Java dependency. However, boilerpipe remains a powerful and reliable tool, especially if you're working in an environment where Java is already a standard.
