
How do I perform OCR text recognition in Python with pytesseract?

Here is a comprehensive guide to using pytesseract in Python, from installation to advanced usage.


What is pytesseract?

pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. Tesseract is one of the most accurate open-source Optical Character Recognition (OCR) engines available. pytesseract allows you to use this powerful engine directly within your Python scripts to extract text from images.


Installation

Using pytesseract requires two main components:

  1. The Tesseract OCR Engine itself.
  2. The pytesseract Python package.

Step 1: Install the Tesseract OCR Engine

You need to install Tesseract on your system before you can use the wrapper.

For Windows:

  1. Download the installer from the Tesseract at UB Mannheim page.
  2. Run the installer. Important: During installation, make sure to note the installation path (e.g., C:\Program Files\Tesseract-OCR). You will need this path for the next step.
  3. Also, during installation, you can select additional language data. If you need to recognize text in languages other than English, make sure to check them.

For macOS (using Homebrew):

brew install tesseract

This will install the engine and the English language data by default. For other languages:

brew install tesseract-lang

For Linux (Debian/Ubuntu):

sudo apt update
sudo apt install tesseract-ocr

To install additional languages (e.g., French, Spanish):

sudo apt install tesseract-ocr-fra tesseract-ocr-spa
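
Whatever your platform, you can check from Python whether the tesseract binary is actually discoverable before going any further. This sketch uses only the standard library:

```python
import shutil

# shutil.which returns the full path of the executable if it is on PATH,
# or None if it cannot be found
tesseract_path = shutil.which("tesseract")

if tesseract_path:
    print(f"Found Tesseract at: {tesseract_path}")
else:
    print("Tesseract is not on your PATH - install it, or point "
          "pytesseract.pytesseract.tesseract_cmd at the executable")
```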

Step 2: Install the pytesseract Python Package

Use pip to install the Python wrapper:

pip install pytesseract

You will also need a library like Pillow (PIL) to handle image files.

pip install Pillow

Basic Usage

Here's a simple "Hello World" example to get you started.

A. Windows Configuration (Important!)

On Windows, the Tesseract installation directory is usually not added to your system PATH automatically, even with the default location (C:\Program Files\Tesseract-OCR). In that case you need to tell pytesseract where to find the executable, either via an environment variable or directly in your script.

Method 1: Adding Tesseract to PATH. Add the installation directory (e.g., C:\Program Files\Tesseract-OCR) to your system PATH environment variable. (The separate TESSDATA_PREFIX variable is only needed when your language data lives outside the default tessdata folder.)

Method 2: Specifying the Path in Code You can specify the path to the tesseract.exe file directly.

import pytesseract
from PIL import Image
# --- Only for Windows if Tesseract is not in your PATH ---
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# -----------------------------------------------------------
# Open an image file
image_path = 'path/to/your/image.png'
img = Image.open(image_path)
# Use pytesseract to get text from the image
text = pytesseract.image_to_string(img)
# Print the extracted text
print(text)

Image Pre-processing for Better Accuracy

The quality of your OCR results heavily depends on the quality of the input image. Pre-processing can dramatically improve accuracy.

Here are some common techniques using OpenCV and Pillow.

First, install OpenCV:

pip install opencv-python

Example: Pre-processing a scanned document

import cv2
import pytesseract
from PIL import Image
# Load the image
image_path = 'noisy_image.png'
img = cv2.imread(image_path)
# 1. Convert to Grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# 2. Apply Gaussian Blur to reduce noise
blurred = cv2.GaussianBlur(gray, (5, 5), 0)
# 3. Apply Adaptive Thresholding
# This is excellent for documents with varying lighting
thresh = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                               cv2.THRESH_BINARY, 11, 2)
# 4. Invert so the text becomes white on black; dilation grows the white
#    foreground, which thickens the character strokes
inverted = cv2.bitwise_not(thresh)
# Optional: Dilate to make text thicker
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
dilated = cv2.dilate(inverted, kernel, iterations=1)
# 5. Invert back: Tesseract works best with black text on a white background
processed = cv2.bitwise_not(dilated)
# Save the pre-processed image to inspect the result
cv2.imwrite('preprocessed_image.png', processed)
# Use pytesseract on the pre-processed image
# You can also pass lang='eng', etc., to specify the language
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(processed, config=custom_config)
print(text)
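
To make the thresholding and inversion steps concrete without any OpenCV dependency, here is a toy illustration on a made-up 3x3 grayscale "image" (pixel values 0-255 are invented for the example):

```python
# Made-up 3x3 grayscale image: low values are dark (ink), high values are paper
gray = [
    [ 30, 200, 210],
    [ 25, 190,  40],
    [220,  35, 230],
]

THRESHOLD = 128

# Global binary threshold: dark pixels (text) -> 0, light pixels -> 255
thresh = [[255 if px > THRESHOLD else 0 for px in row] for row in gray]

# Inversion flips foreground and background: 0 <-> 255
inverted = [[255 - px for px in row] for row in thresh]

print(thresh)    # [[0, 255, 255], [0, 255, 0], [255, 0, 255]]
print(inverted)  # [[255, 0, 0], [255, 0, 255], [0, 255, 0]]
```

Adaptive thresholding (used in the OpenCV example above) works the same way, except the threshold is computed per neighborhood rather than being one global value, which is why it copes better with uneven lighting.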

Advanced Configuration and Parameters

pytesseract offers several parameters to fine-tune its behavior.

--oem (OCR Engine Mode)

Controls the type of algorithm used.

  • 0 - Legacy Tesseract engine only.
  • 1 - Neural nets LSTM engine only.
  • 2 - Legacy + LSTM engines.
  • 3 - Default, based on what is available.

--psm (Page Segmentation Mode)

Controls how the page is analyzed.

  • 0 - Orientation and script detection (OSD) only.
  • 1 - Automatic page segmentation, but no OSD.
  • 2 - Automatic page segmentation, but with OSD.
  • 3 - Fully automatic page segmentation, but no OSD. (Default)
  • 4 - Assume a single column of text.
  • 5 - Assume a single uniform block of vertically aligned text.
  • 6 - Assume a single uniform block of text.
  • 7 - Treat the image as a single text line.
  • 8 - Treat the image as a single word.
  • 9 - Treat the image as a single word in a circle.
  • 10 - Treat the image as a single character.
  • 11 - Sparse text. Find as much text as possible in no particular order.
  • 12 - Sparse text with OSD.
  • 13 - Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.

Example of using a custom configuration:

config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(img, config=config)

Extracting Detailed Data (Bounding Boxes, Confidence, etc.)

pytesseract can return more than just plain text. It can provide the location and confidence of each recognized word or text block.

image_to_data()

This is the most powerful function for getting detailed output. It returns a string in a TSV (Tab-Separated Values) format.

import pytesseract
from PIL import Image
import csv
image_path = 'path/to/your/image.png'
img = Image.open(image_path)
# Get detailed data including bounding boxes
data = pytesseract.image_to_data(img)
# The output is a string. Let's parse it.
# Each line represents a detected element (block, paragraph, line, word).
# We are usually interested in the lines where 'conf' (confidence) is not '-1'.
for i, line in enumerate(data.split('\n')):
    if i == 0: # Skip the header row
        continue
    cols = line.split('\t')
    if len(cols) == 12:
        # [level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text]
        level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, text = cols
        if conf != '-1':
            print(f"Text: {text}, Confidence: {conf}, BBox: ({left}, {top}, {width}, {height})")
# You can also parse it into a list of dictionaries
data_list = []
for i, line in enumerate(data.split('\n')):
    if i == 0:
        continue
    cols = line.split('\t')
    if len(cols) == 12 and cols[11].strip() != '':
        data_list.append({
            'level': int(cols[0]),
            'page_num': int(cols[1]),
            'block_num': int(cols[2]),
            'par_num': int(cols[3]),
            'line_num': int(cols[4]),
            'word_num': int(cols[5]),
            'left': int(cols[6]),
            'top': int(cols[7]),
            'width': int(cols[8]),
            'height': int(cols[9]),
            'conf': float(cols[10]),  # newer Tesseract versions report decimal confidences
            'text': cols[11]
        })
print("\n--- Parsed Data ---")
for item in data_list:
    print(item)
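
Once you have word-level rows like data_list above, a common next step is to rebuild whole lines of text by grouping words that share the same block, paragraph, and line numbers. A minimal sketch (the sample rows below are invented for illustration):

```python
# Sample word-level rows in the same shape as data_list above (values invented)
words = [
    {'block_num': 1, 'par_num': 1, 'line_num': 1, 'word_num': 1, 'text': 'Hello'},
    {'block_num': 1, 'par_num': 1, 'line_num': 1, 'word_num': 2, 'text': 'world'},
    {'block_num': 1, 'par_num': 1, 'line_num': 2, 'word_num': 1, 'text': 'Goodbye'},
]

def group_into_lines(rows):
    """Join word rows into whole lines, keyed by (block, paragraph, line)."""
    lines = {}
    for row in rows:
        key = (row['block_num'], row['par_num'], row['line_num'])
        lines.setdefault(key, []).append(row)
    # Sort words within each line by their word number, then join with spaces
    return [
        ' '.join(w['text'] for w in sorted(group, key=lambda w: w['word_num']))
        for key, group in sorted(lines.items())
    ]

print(group_into_lines(words))  # ['Hello world', 'Goodbye']
```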

image_to_boxes()

This function provides the coordinates for each recognized character.

boxes = pytesseract.image_to_boxes(img)
print(boxes)
# Output format:
# T 10 90 20 100 0
# h 12 88 21 96 0
# e 14 86 23 94 0
# ...
# Each line is: <char> <left> <bottom> <right> <top> <page_num>
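
One common pitfall: image_to_boxes measures its y-coordinates from the bottom edge of the image, whereas image_to_data (and most drawing libraries) measure from the top. A small parser that converts box lines to a top-left origin, given the image height (the sample input below is invented):

```python
def parse_boxes(boxes_str, image_height):
    """Parse image_to_boxes output and flip y-coordinates to a top-left origin.

    Tesseract box lines use a bottom-left origin:
        <char> <left> <bottom> <right> <top> <page_num>
    """
    results = []
    for line in boxes_str.strip().split('\n'):
        char, left, bottom, right, top, page = line.split(' ')
        results.append({
            'char': char,
            'left': int(left),
            # Flip: distance from the top edge instead of the bottom edge
            'top': image_height - int(top),
            'right': int(right),
            'bottom': image_height - int(bottom),
            'page': int(page),
        })
    return results

# Sample box output in the format shown above (values invented)
sample = "T 10 90 20 100 0\nh 12 88 21 96 0"
for box in parse_boxes(sample, image_height=120):
    print(box)
```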

Handling Different Languages

To recognize text in a language other than English, you need to have the corresponding language data installed (see Step 1) and specify it in the function call.

# Assuming you installed 'fra' (French) language data
img = Image.open('french_text.png')
# Specify the language using a 3-letter code
text_french = pytesseract.image_to_string(img, lang='fra')
print(text_french)
# For multiple languages
text_multi = pytesseract.image_to_string(img, lang='eng+fra')
print(text_multi)
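
When combining languages, Tesseract expects language codes joined by +, and the first language in the string gets priority. A small helper to build and sanity-check the lang string (the validation rule here is a loose assumption, not Tesseract's own):

```python
def build_lang_string(codes):
    """Join Tesseract language codes into the 'eng+fra' form, with a basic check."""
    for code in codes:
        # Tesseract packs mostly use 3-letter codes such as 'eng', 'fra',
        # 'deu'; some use longer names like 'chi_sim' (Simplified Chinese)
        if not code or not all(c.isalpha() or c == '_' for c in code):
            raise ValueError(f"Suspicious language code: {code!r}")
    return '+'.join(codes)

print(build_lang_string(['eng', 'fra']))      # eng+fra
print(build_lang_string(['chi_sim', 'eng']))  # chi_sim+eng
```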

Summary and Best Practices

  1. Install Everything: Make sure Tesseract is installed on your system and pytesseract is installed in your Python environment.
  2. Pre-process Images: This is the most crucial step for getting good results. Convert to grayscale, threshold, and invert images.
  3. Choose the Right PSM: Experiment with different --psm values. 6 (single uniform block) is often a good starting point for documents.
  4. Use image_to_data(): When you need more than just the text, this function provides invaluable information like confidence scores and bounding boxes.
  5. Handle Languages Explicitly: Always specify the lang parameter if you are not using English.
  6. Be Aware of Limitations: Tesseract is powerful but not perfect. It struggles with:
    • Handwritten text (unless specifically trained for it).
    • Very stylized or artistic fonts.
    • Images with extremely low resolution or heavy distortion.