Python gethtmltext: How do I get the text content of a webpage?

The phrase "Python gethtmltext" is a common search query; it does not refer to a built-in function, but to the task "How do I get the text content from an HTML file or a webpage using Python?"

There are two primary ways to do this, depending on whether you have a local HTML file or a URL.

  1. For a local HTML file: Use Python's built-in libraries.
  2. For a live webpage (URL): Use the popular requests library to fetch the HTML and then BeautifulSoup to parse it.

Let's break down both methods.


Method 1: Parsing a Local HTML File (Using BeautifulSoup)

If you have an HTML file saved on your computer (e.g., my_page.html), you can read it with standard file I/O and extract the text with BeautifulSoup, which you'll need to install. BeautifulSoup can use Python's built-in html.parser, which is a good start, but for faster and more robust parsing the lxml parser is highly recommended.
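
As a quick illustration, the parser is chosen by the second argument when the soup object is created. Here is a minimal sketch comparing the two options (the tiny HTML string is just a stand-in for real content):

from bs4 import BeautifulSoup

html = '<p>Hello, <b>world</b>!</p>'

# Built-in parser: ships with Python, no extra dependency
soup_builtin = BeautifulSoup(html, 'html.parser')

# lxml parser: faster and more tolerant of broken markup (needs `pip install lxml`)
soup_lxml = BeautifulSoup(html, 'lxml')

print(soup_builtin.get_text())  # Hello, world!
print(soup_lxml.get_text())     # Hello, world!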

Step 1: Install BeautifulSoup

First, you need to install the library. It's best to install it with a parser like lxml.

pip install beautifulsoup4 lxml

Step 2: The Python Code

Let's say you have a file named my_page.html with the following content:

my_page.html

<!DOCTYPE html>
<html>
<head>
    <title>My Test Page</title>
</head>
<body>
    <h1>Welcome to the Page</h1>
    <p>This is the first paragraph. It has some <b>bold text</b> and a <a href="https://example.com">link</a>.</p>
    <p>This is the second paragraph.</p>
    <div class="footer">
        <p>Footer text with some <i>italic</i> styling.</p>
    </div>
</body>
</html>

Here is the Python script to read this file and get the text.

from bs4 import BeautifulSoup
# The path to your local HTML file
file_path = 'my_page.html'
try:
    # Open the file in read mode ('r')
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read the content of the file
        html_content = file.read()
    # Create a BeautifulSoup object to parse the HTML
    # 'lxml' is a fast and efficient parser
    soup = BeautifulSoup(html_content, 'lxml')
    # Get all the text from the parsed HTML
    # .get_text() strips all HTML tags and returns a single string
    all_text = soup.get_text()
    # Print the extracted text
    print("--- Extracted Text (All) ---")
    print(all_text)
    print("\n" + "="*30 + "\n")
    # You can also get text from specific tags
    # For example, get text from all <p> tags
    paragraph_texts = soup.find_all('p')
    print("--- Extracted Text (from <p> tags only) ---")
    for p in paragraph_texts:
        print(p.get_text(strip=True)) # strip=True removes leading/trailing whitespace
except FileNotFoundError:
    print(f"Error: The file '{file_path}' was not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Output of the script:

--- Extracted Text (All) ---
My Test Page
Welcome to the Page
This is the first paragraph. It has some bold text and a link.
This is the second paragraph.
Footer text with some italic styling.
==============================
--- Extracted Text (from <p> tags only) ---
This is the first paragraph. It has some bold text and a link.
This is the second paragraph.
Footer text with some italic styling.
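
Note that get_text() keeps the whitespace and newlines that sit between tags, so the raw output usually contains extra blank lines. The separator and strip parameters control how the text fragments are joined; a minimal sketch on the same file (keep in mind that inline tags like <b> produce their own fragments):

from bs4 import BeautifulSoup

with open('my_page.html', 'r', encoding='utf-8') as file:
    soup = BeautifulSoup(file.read(), 'lxml')

# Every text fragment is stripped and then joined with a newline
clean_text = soup.get_text(separator='\n', strip=True)
print(clean_text)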

Method 2: Fetching and Parsing a Live Webpage (URL)

This is the more common scenario. You want to get the text from a live website. The standard process is:

  1. Fetch the HTML content of the URL using the requests library.
  2. Parse the HTML with BeautifulSoup.
  3. Extract the text.
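
Put together, the three steps above fit in just a few lines. Here is a minimal sketch of the pipeline, with no error handling yet (the full version follows in Step 2):

import requests
from bs4 import BeautifulSoup

# Fetch -> parse -> extract in one pass
html = requests.get('http://quotes.toscrape.com/', timeout=10).text
text = BeautifulSoup(html, 'lxml').get_text()
print(text[:200])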

Step 1: Install requests

If you don't have it, install the requests library.

pip install requests

Step 2: The Python Code

This code fetches the HTML from http://quotes.toscrape.com/, a website designed for scraping practice.

import requests
from bs4 import BeautifulSoup
# The URL of the webpage you want to scrape
url = 'http://quotes.toscrape.com/'
try:
    # Send an HTTP GET request to the URL
    response = requests.get(url, timeout=10) # timeout is good practice
    # Raise an exception if the request was unsuccessful (e.g., 404 Not Found)
    response.raise_for_status()
    # The HTML content of the page is in the .text attribute of the response
    html_content = response.text
    # Create a BeautifulSoup object to parse the HTML
    soup = BeautifulSoup(html_content, 'lxml')
    # Get all the text from the parsed HTML
    all_text = soup.get_text()
    # Print the extracted text (it might be very long!)
    print("--- Extracted Text from URL ---")
    # Let's just print the first 500 characters to avoid flooding the console
    print(all_text[:500] + "...")
    # A more practical example: Get all quotes and their authors
    print("\n" + "="*30 + "\n")
    print("--- Extracting Quotes and Authors ---")
    quotes = soup.find_all('div', class_='quote')
    for quote in quotes:
        text = quote.find('span', class_='text').get_text(strip=True)
        author = quote.find('small', class_='author').get_text(strip=True)
        print(f'"{text}" - {author}')
except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Output of the script:

--- Extracted Text from URL ---
...
    Quotes To Scrape
    Login
    Quotes to Scrape
    “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
    by Albert Einstein
    (about)
    “It is our choices, Harry, that show what we truly are, far more than our abilities.”
    by J.K. Rowling
    (about)
...
==============================
--- Extracting Quotes and Authors ---
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking." - Albert Einstein
"It is our choices, Harry, that show what we truly are, far more than our abilities." - J.K. Rowling
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle." - Albert Einstein
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid." - Jane Austen
...
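
If you want the "gethtmltext" idea from the question as one reusable function, the fetch-parse-extract steps can be wrapped in a small helper. This is only a sketch: the name get_html_text and the User-Agent string are arbitrary choices for this example (some sites reject the default requests user agent), and dropping <script>/<style> tags is an optional clean-up that keeps only human-readable text.

import requests
from bs4 import BeautifulSoup

def get_html_text(url, timeout=10):
    """Fetch a webpage and return its visible text, or None on failure."""
    headers = {'User-Agent': 'Mozilla/5.0'}  # illustrative UA string, not required by any library
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None
    soup = BeautifulSoup(response.text, 'lxml')
    # Remove <script> and <style> blocks so their code does not end up in the text
    for tag in soup(['script', 'style']):
        tag.decompose()
    return soup.get_text(separator='\n', strip=True)

# Example usage
text = get_html_text('http://quotes.toscrape.com/')
if text:
    print(text[:300])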

Key Functions and Methods Recap

requests.get(url)
    Fetches the HTML content from a given URL.
    Example: response = requests.get('http://example.com')

response.text
    Returns the raw HTML content of the response as a string.
    Example: html = response.text

BeautifulSoup(html, 'lxml')
    Creates a soup object that parses the HTML, making it easy to navigate.
    Example: soup = BeautifulSoup(html, 'lxml')

soup.get_text()
    Extracts all the text from the soup object, stripping all HTML tags.
    Example: all_text = soup.get_text()

soup.find('tag', attrs={...})
    Finds the first occurrence of a tag matching the criteria.
    Example: first_p = soup.find('p')

soup.find_all('tag', attrs={...})
    Finds all occurrences of a tag matching the criteria and returns a list.
    Example: all_divs = soup.find_all('div', class_='content')

.get_text(strip=True)
    A method on a tag element that gets its text and removes leading/trailing whitespace.
    Example: p_text = my_p_tag.get_text(strip=True)
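
One more option worth knowing: BeautifulSoup also accepts CSS selectors through soup.select() and soup.select_one(), which some people find more concise than chaining find()/find_all(). A minimal sketch equivalent to the quote extraction above:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://quotes.toscrape.com/', timeout=10).text, 'lxml')
# 'div.quote' matches <div class="quote">, 'span.text' matches <span class="text">
for quote in soup.select('div.quote'):
    text = quote.select_one('span.text').get_text(strip=True)
    author = quote.select_one('small.author').get_text(strip=True)
    print(f'"{text}" - {author}')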