How to Efficiently Extract Keywords from a PDF in Python

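Searching for keywords in a PDF is a common task in Python. A popular library for this job is PyPDF2, which is easy to use and designed for reading and manipulating PDF files. Note that PyPDF2 is no longer actively developed: the project continues under the name pypdf, which keeps the same PdfReader interface used in this guide.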

Here’s a complete guide covering the basics, advanced techniques, and alternative libraries.


The Best Tool: PyPDF2

PyPDF2 is a pure-Python library capable of splitting, merging, cropping, and transforming PDF files. Its main feature for our purpose is reading text content from a PDF.

Installation

First, you need to install the library. Open your terminal or command prompt and run:

pip install PyPDF2

Basic Keyword Search

This is the most straightforward approach. We'll read the text from each page and check if our keyword is present.

import PyPDF2
def find_keyword_in_pdf(pdf_path, keyword):
    """
    Searches for a keyword in a PDF file and prints the pages where it's found.
    """
    try:
        with open(pdf_path, 'rb') as file:
            # Create a PDF reader object
            reader = PyPDF2.PdfReader(file)
            # Get the total number of pages
            num_pages = len(reader.pages)
            print(f"PDF has {num_pages} pages.")
            found = False
            # Loop through each page
            for page_num in range(num_pages):
                # Get the page object
                page = reader.pages[page_num]
                # Extract text from the page
                text = page.extract_text()
                # Check if the keyword is in the text (case-insensitive)
                if keyword.lower() in text.lower():
                    print(f"Keyword '{keyword}' found on page {page_num + 1}")
                    found = True
            if not found:
                print(f"Keyword '{keyword}' not found in the document.")
    except FileNotFoundError:
        print(f"Error: The file '{pdf_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
# --- Example Usage ---
# Replace 'my_document.pdf' with the path to your own PDF file
pdf_file = 'my_document.pdf'
search_keyword = 'Python'
find_keyword_in_pdf(pdf_file, search_keyword)

How it works:

  1. open(pdf_path, 'rb'): Opens the PDF file in binary read mode ('rb'), which is required for PDFs.
  2. PyPDF2.PdfReader(file): Creates a reader object to parse the PDF.
  3. len(reader.pages): Gets the total number of pages in the document.
  4. reader.pages[page_num]: Accesses a specific page by its index (0-based).
  5. page.extract_text(): This is the key method. It extracts all the text from the page.
  6. keyword.lower() in text.lower(): A case-insensitive check to see if the keyword exists in the page's text.
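
The matching logic in steps 4 to 6 can also be written as a small function that returns page numbers instead of printing them, which makes it easier to reuse and test. The sketch below is only an illustration: `pages_containing` is a hypothetical helper name, and it takes already-extracted page texts (a list of strings) so it does not depend on any PDF library.

```python
def pages_containing(page_texts, keyword):
    """Return the 1-based page numbers whose text contains the keyword."""
    needle = keyword.lower()
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        # Treat a missing text layer (None or "") as an empty page
        if needle in (text or "").lower():
            hits.append(page_number)
    return hits


# Stand-in for [page.extract_text() for page in reader.pages]
sample_pages = ["Intro to Python", "", "More python tips", None]
print(pages_containing(sample_pages, "Python"))
```

Returning a list keeps the search separate from the reporting, so the same function can feed a printout, a log, or a later processing step.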

Advanced Search with PyPDF2

What if you need more than just a "yes/no" answer? You might want to see the context in which the keyword appears.

Extracting Sentences with the Keyword

This function will find the keyword and print the sentence it's in.

import PyPDF2
import re # Regular expressions are great for finding sentences
def find_keyword_with_context(pdf_path, keyword):
    """
    Finds a keyword and prints the sentence it appears in.
    """
    try:
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
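            # Build a regex that captures the sentence containing the keyword:
            # [^.!?]* grabs text up to the nearest sentence-ending punctuation, and
            # \b keeps the match on whole words (assumes the keyword starts and ends
            # with a letter or digit)
            pattern = re.compile(
                rf'[^.!?]*\b{re.escape(keyword)}\b[^.!?]*[.!?]?', re.IGNORECASE)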
            for page_num, page in enumerate(reader.pages):
                text = page.extract_text()
                # Find all matches on the page
                matches = pattern.findall(text)
                if matches:
                    print(f"--- Page {page_num + 1} ---")
                    for match in matches:
                        # Clean up whitespace and print
                        print(f"  -> {match.strip()}")
                    print() # Add a newline for readability
    except FileNotFoundError:
        print(f"Error: The file '{pdf_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
# --- Example Usage ---
pdf_file = 'my_document.pdf'
search_keyword = 'data'
find_keyword_with_context(pdf_file, search_keyword)

Key improvements:

  • re.escape(keyword): This is important! It escapes any special regex characters in your keyword (such as . or *), so the keyword is treated as a literal string.
  • re.IGNORECASE: Makes the search case-insensitive.
  • pattern.findall(text): Finds all occurrences of the pattern in the text, returning them as a list.
  • enumerate(reader.pages): A more Pythonic way to loop through pages while also getting the index.
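
To see why re.escape matters, here is a minimal sketch that works on a plain string rather than a PDF. It uses a keyword containing regex metacharacters ("C++") and a simple split on sentence-ending punctuation. The helper name `sentences_with_keyword` and the sample text are made up for illustration, and the sentence splitting is a rough heuristic, not a full sentence tokenizer.

```python
import re


def sentences_with_keyword(text, keyword):
    """Return the sentences in text that contain keyword (case-insensitive)."""
    # Split after ., ! or ? when followed by whitespace (a rough heuristic)
    sentences = re.split(r'(?<=[.!?])\s+', text)
    # re.escape makes characters such as + or . match literally
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    return [s.strip() for s in sentences if pattern.search(s)]


sample = "I write C++ daily. Python is fun too! Is c++ hard? Not really."
print(sentences_with_keyword(sample, "C++"))
```

Without re.escape, the plus signs in "C++" would be interpreted as regex quantifiers rather than literal characters, so the search would not mean what the user typed.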

Handling Scanned PDFs (Images)

This is a crucial point: PyPDF2 and similar text-based libraries cannot read text from a scanned PDF. A scanned PDF is essentially a collection of images. To extract text from it, you need to use Optical Character Recognition (OCR).

A widely used library for this is pytesseract, a Python wrapper for the open-source Tesseract OCR engine, whose development was long sponsored by Google.
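
Before reaching for OCR, it can help to check whether a PDF actually lacks a text layer. One rough heuristic, sketched below, is to look at how much text normal extraction returns per page. The function name `looks_scanned` and the threshold of 20 characters per page are assumptions chosen for illustration; real documents may need a different cutoff.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Guess whether a PDF lacks a text layer, based on extracted text length."""
    if not page_texts:
        return False  # no pages, nothing to judge
    # Count non-whitespace-trimmed characters, treating None as empty
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


# Stand-in for [page.extract_text() for page in reader.pages]
print(looks_scanned(["", None, "  "]))
print(looks_scanned(["A normal page with plenty of extractable text."]))
```

If the check suggests a scanned document, the OCR route described in this section is the next step; otherwise plain text extraction is usually enough.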

Installation (More Complex)

  1. Install Tesseract OCR Engine:

    • Windows: Download the installer from the Tesseract at UB Mannheim page. During installation, make sure to note the installation path (e.g., C:\Program Files\Tesseract-OCR).
    • macOS: brew install tesseract
    • Linux (Debian/Ubuntu): sudo apt-get install tesseract-ocr
  2. Install pytesseract and Pillow (Image Processing Library):

    pip install pytesseract Pillow
  3. (Windows) Configure Tesseract Path: You might need to tell pytesseract where the tesseract.exe file is located. You can do this in your code:

    import pytesseract
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe' # Update this path
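
Instead of hard-coding the path, one option is to look for the executable on the system PATH first and fall back to the default Windows location only if needed. This is a sketch using only the standard library; `locate_tesseract` is a hypothetical helper, and the fallback path is the example path from step 1, which may differ on your machine.

```python
import os
import shutil


def locate_tesseract(fallback=r'C:\Program Files\Tesseract-OCR\tesseract.exe'):
    """Return a path to the tesseract executable, or None if it is not found."""
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Fall back to the assumed default install location
    return fallback if os.path.isfile(fallback) else None


tesseract_path = locate_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or set the path manually.")
else:
    print(f"Using Tesseract at: {tesseract_path}")
    # With pytesseract installed, the path could then be applied with:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```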

OCR Example with pytesseract

This example works on a single image file, such as one scanned page saved as PNG or JPG. For multi-page scanned PDFs, you'd need a library like pdf2image to convert each page to an image first.
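
For the multi-page case, a rough sketch might look like the following. It assumes pdf2image is installed (pip install pdf2image) along with the Poppler utilities it depends on, and uses its convert_from_path function to render each page as an image. The function names, the dpi value of 300, and the `ocr` parameter (which lets you swap in another text-recognition function) are illustrative choices, not part of either library.

```python
def find_keyword_in_images(images, keyword, ocr=None):
    """Return 1-based indexes of the images whose OCR text contains keyword."""
    if ocr is None:
        # Imported lazily so the helper can be used with a custom ocr function
        import pytesseract
        ocr = pytesseract.image_to_string
    needle = keyword.lower()
    return [number for number, image in enumerate(images, start=1)
            if needle in ocr(image).lower()]


def find_keyword_in_scanned_pages(pdf_path, keyword, dpi=300):
    """Render each PDF page to an image, then OCR it and search for keyword."""
    from pdf2image import convert_from_path  # requires Poppler to be installed
    images = convert_from_path(pdf_path, dpi=dpi)
    return find_keyword_in_images(images, keyword)


# Example (needs a real scanned PDF, pdf2image, Poppler, and Tesseract):
# print(find_keyword_in_scanned_pages('scanned_document.pdf', 'invoice'))
```

Rendering at a higher dpi generally gives the OCR engine more detail to work with, at the cost of more memory and time per page.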

import pytesseract
from PIL import Image
# --- IMPORTANT: Set the Tesseract path if on Windows ---
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
def find_keyword_in_scanned_pdf(image_path, keyword):
    """
    Uses OCR to find a keyword in an image (e.g., a page from a scanned PDF).
    """
    try:
        # Open the image file
        img = Image.open(image_path)
        # Use Tesseract to extract text
        text = pytesseract.image_to_string(img)
        # Check for the keyword
        if keyword.lower() in text.lower():
            print(f"Keyword '{keyword}' found in the scanned image.")
            # You can add the context-finding logic from the previous example here
        else:
            print(f"Keyword '{keyword}' not found in the scanned image.")
    except FileNotFoundError:
        print(f"Error: The file '{image_path}' was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")
# --- Example Usage ---
# This would be a path to a PNG or JPG file of a scanned page
scanned_image_file = 'scanned_page.png'
search_keyword = 'invoice'
find_keyword_in_scanned_pdf(scanned_image_file, search_keyword)

Alternative Libraries

While PyPDF2 is excellent, other libraries have their strengths.

Library | Best For | Pros | Cons
pdfplumber | Precise Text & Table Extraction | Extremely accurate text layout analysis, great for tables, handles complex layouts well. | Slightly slower than PyPDF2 for simple text extraction.
pymupdf (fitz) | High Performance & Rich Features | Blazing fast, can extract images, vector graphics, and handle encrypted PDFs. | Can be more complex for beginners.
pdfminer.six | Low-Level Text Parsing | Gives you fine-grained control over text parsing, good for complex or malformed PDFs. | API is more complex and less intuitive.

Example with pdfplumber

pdfplumber is fantastic when PyPDF2 misses text or when you need to preserve layout information.

pip install pdfplumber

import pdfplumber
def find_keyword_with_pdfplumber(pdf_path, keyword):
    """
    Searches for a keyword using pdfplumber, which is great for accuracy.
    """
    try:
        with pdfplumber.open(pdf_path) as pdf:
            for i, page in enumerate(pdf.pages):
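                # Fall back to "" because some pdfplumber versions return None
                # for pages without extractable text
                text = page.extract_text() or ""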
                if keyword.lower() in text.lower():
                    print(f"Found '{keyword}' on page {i + 1} with pdfplumber.")
                    # pdfplumber can also give you words with their exact coordinates
                    words = page.extract_words()
                    for word in words:
                        if keyword.lower() in word['text'].lower():
                            print(f"  - Found at coordinates: {word['x0']}, {word['top']}")
    except Exception as e:
        print(f"An error occurred: {e}")
# --- Example Usage ---
pdf_file = 'my_document.pdf'
search_keyword = 'Python'
find_keyword_with_pdfplumber(pdf_file, search_keyword)
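
The coordinate lookup above compares the keyword against one word at a time, so it only locates single-word keywords. A possible extension for multi-word phrases is sketched below. It works on a list of dictionaries shaped like the output of page.extract_words() (with 'text', 'x0', and 'top' keys); the helper name `find_phrase` and the sample data are made up for illustration, and the punctuation stripping is a simple assumption rather than anything pdfplumber does itself.

```python
def find_phrase(words, phrase):
    """Return (x0, top) of the first word of each occurrence of phrase."""
    targets = phrase.lower().split()
    if not targets:
        return []
    # Normalise each extracted word: lowercase, trim surrounding punctuation
    texts = [w['text'].lower().strip('.,;:!?()"\'') for w in words]
    hits = []
    for i in range(len(texts) - len(targets) + 1):
        if texts[i:i + len(targets)] == targets:
            hits.append((words[i]['x0'], words[i]['top']))
    return hits


# Stand-in for page.extract_words()
sample_words = [
    {'text': 'Machine', 'x0': 72.0, 'top': 100.0},
    {'text': 'learning,', 'x0': 120.5, 'top': 100.0},
    {'text': 'in', 'x0': 180.0, 'top': 100.0},
    {'text': 'Python', 'x0': 195.0, 'top': 100.0},
]
print(find_phrase(sample_words, 'machine learning'))
```

This assumes the phrase's words appear consecutively in the extracted word order, which usually holds within a line but may not across columns or line breaks.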

Summary: Which One Should I Use?

Your Goal | Recommended Library | Why?
Simple keyword search in a text-based PDF. | PyPDF2 | Easy to install and use, perfect for the job.
I need high accuracy or am dealing with tables/complex layouts. | pdfplumber | Its text extraction is more robust and reliable.
My PDF is a scan (image). | pytesseract | It wraps the Tesseract OCR engine, and OCR is the only way to get text from images.
I need maximum performance or advanced features (images, vectors). | pymupdf | It's a fast, powerful, and modern PDF library.
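
For completeness, here is a rough sketch of what a keyword search with pymupdf might look like, since it is the one recommended library without an example in this guide. It assumes PyMuPDF is installed (pip install pymupdf) and relies on its page-level search_for method, which returns one rectangle per match. The function names are hypothetical, and the counting logic is written against any iterable of page-like objects so it can be tried without a real PDF.

```python
def count_hits_per_page(pages, keyword):
    """Map 1-based page numbers to the number of keyword matches on that page."""
    hits = {}
    for page_number, page in enumerate(pages, start=1):
        # search_for returns a list of match rectangles (empty if no match)
        rectangles = page.search_for(keyword)
        if rectangles:
            hits[page_number] = len(rectangles)
    return hits


def search_pdf_with_pymupdf(pdf_path, keyword):
    """Open a PDF with PyMuPDF and count keyword matches per page."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency
    with fitz.open(pdf_path) as doc:
        return count_hits_per_page(doc, keyword)


# Example (needs a real PDF and PyMuPDF installed):
# print(search_pdf_with_pymupdf('my_document.pdf', 'Python'))
```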