Python gethtmltext: How do I get the text content of a webpage?

The phrase "Python gethtmltext" is a common search query; it does not refer to a built-in function, but to the task "How do I get the text content from an HTML file or a webpage using Python?"

There are two primary ways to do this, depending on whether you have a local HTML file or a URL.

  1. For a local HTML file: Use Python's built-in libraries.
  2. For a live webpage (URL): Use the popular requests library to fetch the HTML and then BeautifulSoup to parse it.

Let's break down both methods.


Method 1: Parsing a Local HTML File (Using BeautifulSoup)

If you have an HTML file saved on your computer (e.g., my_page.html), you can read it with standard file I/O and extract the text with BeautifulSoup, which you'll need to install. BeautifulSoup can use Python's built-in html.parser, which is a good start, but for faster and more robust parsing the lxml parser is highly recommended.
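
As a quick illustration, the parser is chosen by the second argument when the soup object is created. Here is a minimal sketch comparing the two options (the tiny HTML string is just a stand-in for real content):

from bs4 import BeautifulSoup

html = '<p>Hello, <b>world</b>!</p>'

# Built-in parser: ships with Python, no extra dependency
soup_builtin = BeautifulSoup(html, 'html.parser')

# lxml parser: faster and more tolerant of broken markup (needs `pip install lxml`)
soup_lxml = BeautifulSoup(html, 'lxml')

print(soup_builtin.get_text())  # Hello, world!
print(soup_lxml.get_text())     # Hello, world!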

Step 1: Install BeautifulSoup

First, you need to install the library. It's best to install it with a parser like lxml.

pip install beautifulsoup4 lxml

Step 2: The Python Code

Let's say you have a file named my_page.html with the following content:

my_page.html

<!DOCTYPE html>
<html>
<head>
    <title>My Test Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <p>This is the first paragraph. It has some <b>bold text</b> and a <a href="https://example.com">link</a>.</p>
    <p>This is the second paragraph.</p>
    <div class="footer">
        <p>Footer text with some <i>italic</i> styling.</p>
    </div>
</body>
</html>

Here is the Python script to read this file and get the text.

from bs4 import BeautifulSoup
# The path to your local HTML file
file_path = 'my_page.html'
try:
    # Open the file in read mode ('r')
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the content of the file
        html_content = file.read()
    # Create a BeautifulSoup object to parse the HTML
    # 'lxml' is a fast and efficient parser
    soup = BeautifulSoup(html_content, 'lxml')
    # Get all the text from the parsed HTML
    # .get_text() strips all HTML tags and returns a single string
    all_text = soup.get_text()
    # Print the extracted text
    print("--- Extracted Text (All) ---")
    print(all_text)
    print("\n" + "="*30 + "\n")
    # You can also get text from specific tags
    # For example, get text from all <p> tags
    paragraph_texts = soup.find_all('p')
    print("--- Extracted Text (from <p> tags only) ---")
    for p in paragraph_texts:
        print(p.get_text(strip=True)) # strip=True removes leading/trailing whitespace
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Output of the script:

--- Extracted Text (All) ---
My Test Page
Welcome to the Page
This is the first paragraph. It has some bold text and a link.
This is the second paragraph.
Footer text with some italic styling.
==============================
--- Extracted Text (from <p> tags only) ---
This is the first paragraph. It has some bold text and a link.
This is the second paragraph.
Footer text with some italic styling.
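
Note that get_text() keeps the whitespace and newlines that sit between tags, so the raw output usually contains extra blank lines. The separator and strip parameters control how the text fragments are joined; a minimal sketch on the same file (keep in mind that inline tags like <b> produce their own fragments):

from bs4 import BeautifulSoup

with open('my_page.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file.read(), 'lxml')

# Every text fragment is stripped and then joined with a newline
clean_text = soup.get_text(separator='\n', strip=True)
print(clean_text)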

Method 2: Fetching and Parsing a Live Webpage (URL)

This is the more common scenario. You want to get the text from a live website. The standard process is:

  1. Fetch the HTML content of the URL using the requests library.
  2. Parse the HTML with BeautifulSoup.
  3. Extract the text.
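
Put together, the three steps above fit in just a few lines. Here is a minimal sketch of the pipeline, with no error handling yet (the full version follows in Step 2):

import requests
from bs4 import BeautifulSoup

# Fetch -> parse -> extract in one pass
html = requests.get('http://quotes.toscrape.com/', timeout=10).text
text = BeautifulSoup(html, 'lxml').get_text()
print(text[:200])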

Step 1: Install requests

If you don't have it, install the requests library.

pip install requests

Step 2: The Python Code

This code fetches the HTML from http://quotes.toscrape.com/, a website designed for scraping practice.

import requests
from bs4 import BeautifulSoup
# The URL of the webpage you want to scrape
url = 'http://quotes.toscrape.com/'
try:
    # Send an HTTP GET request to the URL
    response = requests.get(url, timeout=10) # timeout is good practice
    # Raise an exception if the request was unsuccessful (e.g., 404 Not Found)
    response.raise_for_status()
    # The HTML content of the page is in the .text attribute of the response
    html_content = response.text
    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'lxml')
    # Get all the text from the parsed HTML
    all_text = soup.get_text()
    # Print the extracted text (it might be very long!)
    print("--- Extracted Text from URL ---")
    # Let's just print the first 500 characters to avoid flooding the console
    print(all_text[:500] + "...")
    # A more practical example: Get all quotes and their authors
    print("\n" + "="*30 + "\n")
    print("--- Extracting Quotes and Authors ---")
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').get_text(strip=True)
        author = quote.find('small', class_='author').get_text(strip=True)
        print(f'"{text}" - {author}')
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Output of the script:

--- Extracted Text from URL ---
...
    Quotes To Scrape
    Login
    Quotes to Scrape
    “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    by Albert Einstein
    (about)
    “It is our choices, Harry, that show what we truly are, far more than our abilities.”
    by J.K. Rowling
    (about)
...
==============================
--- Extracting Quotes and Authors ---
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle." - Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." - Jane Austen
...
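
If you want the "gethtmltext" idea from the question as one reusable function, the fetch-parse-extract steps can be wrapped in a small helper. This is only a sketch: the name get_html_text and the User-Agent string are arbitrary choices for this example (some sites reject the default requests user agent), and dropping <script>/<style> tags is an optional clean-up that keeps only human-readable text.

import requests
from bs4 import BeautifulSoup

def get_html_text(url, timeout=10):
    """Fetch a webpage and return its visible text, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative UA string, not required by any library
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    soup = BeautifulSoup(response.text, 'lxml')
    # Remove <script> and <style> blocks so their code does not end up in the text
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)

# Example usage
text = get_html_text('http://quotes.toscrape.com/')
if text:
    print(text[:300])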

Key Functions and Methods Recap

requests.get(url)
    Fetches the HTML content from a given URL.
    Example: response = requests.get('http://example.com')

response.text
    Returns the raw HTML content of the response as a string.
    Example: html = response.text

BeautifulSoup(html, 'lxml')
    Creates a soup object that parses the HTML, making it easy to navigate.
    Example: soup = BeautifulSoup(html, 'lxml')

soup.get_text()
    Extracts all the text from the soup object, stripping all HTML tags.
    Example: all_text = soup.get_text()

soup.find('tag', attrs={...})
    Finds the first occurrence of a tag matching the criteria.
    Example: first_p = soup.find('p')

soup.find_all('tag', attrs={...})
    Finds all occurrences of a tag matching the criteria and returns a list.
    Example: all_divs = soup.find_all('div', class_='content')

.get_text(strip=True)
    A method on a tag element that gets its text and removes leading/trailing whitespace.
    Example: p_text = my_p_tag.get_text(strip=True)
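
One more option worth knowing: BeautifulSoup also accepts CSS selectors through soup.select() and soup.select_one(), which some people find more concise than chaining find()/find_all(). A minimal sketch equivalent to the quote extraction above:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://quotes.toscrape.com/', timeout=10).text, 'lxml')
# 'div.quote' matches <div class="quote">, 'span.text' matches <span class="text">
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text(strip=True)
    author = quote.select_one('small.author').get_text(strip=True)
    print(f'"{text}" - {author}')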