杰瑞科技汇

How do you convert PDF to XML in Java?

Converting a PDF to XML in Java is a common requirement for data extraction, document processing, and content analysis. The JDK has no built-in PDF support, so you need a third-party library.

The process typically involves two main steps:

  1. Parse the PDF: Extract text, structure (like paragraphs, tables), and sometimes images from the PDF file.
  2. Generate XML: Serialize the extracted data into a well-formed XML structure.
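Step 2 can be sketched independently of any PDF library. The example below is a minimal illustration, assuming step 1 has already produced one string per page; the class name PageXmlWriter and the element names document and page are hypothetical. It uses the JDK's StAX writer, which escapes special characters so the output stays well-formed.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

class PageXmlWriter {
    /** Serializes one <page> element per extracted page text. */
    static String toXml(List<String> pageTexts) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument("UTF-8", "1.0");
            w.writeStartElement("document");
            for (int i = 0; i < pageTexts.size(); i++) {
                w.writeStartElement("page");
                w.writeAttribute("number", String.valueOf(i + 1));
                // writeCharacters escapes & and < in the text content
                w.writeCharacters(pageTexts.get(i));
                w.writeEndElement();
            }
            w.writeEndElement();
            w.writeEndDocument();
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML serialization failed", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in page texts; in practice these come from the PDF parsing step
        System.out.println(toXml(List.of("Profit & loss", "a < b")));
    }
}
```

Letting an XML writer do the escaping avoids the malformed output that plain string concatenation produces when the extracted text contains characters such as & or <.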

Here’s a comprehensive guide covering the most popular and effective libraries, with code examples.


Recommended Libraries

Here are the top choices, categorized by their approach:

Library | Approach | Pros | Cons | Best For
Apache PDFBox | Text Extraction | Free, open source (Apache 2.0), pure Java, good for simple text. | Layout preservation is poor. Struggles with complex layouts, tables, and scanned images. | Simple text extraction where layout isn't critical.
PDFTextStream | Text Extraction | Commercial (free trial), very accurate text extraction. | Not free for production use. | Projects with a budget where high accuracy for text is needed.
iText 7 | Layout & Structure | Dual-licensed (AGPL or commercial), powerful layout analysis, can extract tables. | Complex licensing (AGPL can be problematic for commercial apps), steeper learning curve. | Extracting structured data like tables and preserving document layout.
Aspose.PDF | Layout & Structure | Commercial (free trial), excellent layout and table extraction, mature API. | Not free for production use. | Professional, high-fidelity conversion where budget is available.
OCR with Tesseract | Image-based PDFs | Free, open source (Apache 2.0), extracts text from scanned documents. | Requires a separate OCR step, complex to integrate, less accurate than native text extraction. | Converting scanned PDFs (image-only) into searchable text/XML.

Method 1: Apache PDFBox (Simple & Free)

PDFBox is a widely used free option. It's great for getting the raw text out of a PDF. The resulting XML will be very basic, just a container for the text.

Step 1: Add Dependency

Add this to your pom.xml:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version> <!-- Check for the latest version -->
</dependency>

Step 2: Java Code

This code will load a PDF, extract all text, and wrap it in a simple XML structure.

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class PdfBoxToXml {
    public static void main(String[] args) {
        String pdfFilePath = "path/to/your/document.pdf";
        Path outputXmlFile = Paths.get("output/document.xml");
        // PDFBox 3.x loads documents through Loader; PDDocument.load was removed in 3.0
        try (PDDocument document = Loader.loadPDF(new File(pdfFilePath))) {
            // PDFTextStripper is used to extract text
            PDFTextStripper stripper = new PDFTextStripper();
            // Optional: Extract text from a specific page
            // stripper.setStartPage(1);
            // stripper.setEndPage(1);
            // Get all text from the PDF
            String text = stripper.getText(document);
            // Escape XML special characters so the output stays well-formed
            String escaped = text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            // Create a simple XML structure
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                         "<document>\n" +
                         "  <content>\n" +
                         "    " + escaped.replace("\n", "\n    ") + "\n" + // Preserve newlines
                         "  </content>\n" +
                         "</document>";
            // Create the output directory if needed, then write the XML as UTF-8
            // to match the encoding named in the XML declaration
            Files.createDirectories(outputXmlFile.toAbsolutePath().getParent());
            Files.write(outputXmlFile, xml.getBytes(StandardCharsets.UTF_8));
            System.out.println("PDF converted to XML successfully: " + outputXmlFile);
        } catch (IOException e) {
            System.err.println("Error processing PDF: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Limitation: This approach loses all formatting, page breaks, and structural information. It's just a blob of text.
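Some structure can be recovered from the flat text after the fact. The sketch below is one possible heuristic, not part of PDFBox: it assumes paragraphs in the extracted text are separated by blank lines, which depends on the document and on how the text was extracted. The class name ParagraphSplitter and the content and p element names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ParagraphSplitter {
    /** Splits text on blank lines and joins the wrapped lines of each paragraph. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        for (String block : text.split("\\R\\s*\\R")) {
            String joined = block.trim().replaceAll("\\s+", " ");
            if (!joined.isEmpty()) {
                paragraphs.add(joined);
            }
        }
        return paragraphs;
    }

    /** Escapes the characters that would break XML element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps each paragraph in its own <p> element. */
    static String toXml(String text) {
        StringBuilder xml = new StringBuilder("<content>\n");
        for (String p : splitParagraphs(text)) {
            xml.append("  <p>").append(escape(p)).append("</p>\n");
        }
        return xml.append("</content>").toString();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by the text extraction step
        System.out.println(toXml("Line one\ncontinues\n\nSecond & last\n"));
    }
}
```

If the extracted text has no blank lines between paragraphs, this heuristic returns a single paragraph, so check it against a sample of your own documents first.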


Method 2: iText 7 (Advanced & Structured)

iText 7 is dual-licensed: it is free under the AGPL, and a commercial license is available. Its kernel module includes TaggedPdfReaderTool, which writes the structure tree of a tagged PDF (paragraphs, tables, lists, headings) as XML. This only works for tagged PDFs; an untagged PDF has no structure tree to export.

Step 1: Add Dependency

Add this to your pom.xml:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext7-core</artifactId>
    <version>7.2.5</version> <!-- Check for the latest version -->
    <type>pom</type>
</dependency>

Step 2: Java Code

TaggedPdfReaderTool walks the structure tree and writes it to an output stream.

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.utils.TaggedPdfReaderTool;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
public class ItextToXml {
    public static void main(String[] args) {
        String pdfFilePath = "path/to/your/document.pdf";
        String outputXmlFilePath = "output/document_itext.xml";
        try (PdfDocument pdfDoc = new PdfDocument(new PdfReader(pdfFilePath))) {
            // Only tagged PDFs carry a structure tree
            if (!pdfDoc.isTagged()) {
                System.err.println("PDF is not tagged; there is no structure tree to export.");
                return;
            }
            try (OutputStream out = new FileOutputStream(outputXmlFilePath)) {
                // Export the structure tree as XML under a custom root element
                new TaggedPdfReaderTool(pdfDoc)
                        .setRootTag("document")
                        .convertToXml(out, "UTF-8");
            }
            System.out.println("PDF converted to XML successfully with iText: " + outputXmlFilePath);
        } catch (IOException e) {
            System.err.println("Error processing PDF with iText: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

Advantage: For a tagged PDF, the output XML is much richer. Its elements are named after the document's structure roles, such as P, Table, and H1, so the original structure is preserved. This is useful for data mining and content analysis. For untagged PDFs, fall back to plain text extraction.
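Once the structure is in XML, it can be queried with the JDK's XPath API. The sketch below is an illustration only: the sample XML is hand-written, and the element names (Document, P, Table, TR, TD) are assumptions, so substitute whatever names your converter emits. The class name StructuredXmlQuery is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class StructuredXmlQuery {
    /** Returns the trimmed text of every node matched by the XPath expression. */
    static List<String> select(String xml, String xpathExpression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpathExpression, new InputSource(new StringReader(xml)),
                            XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath expression or malformed XML", e);
        }
    }

    public static void main(String[] args) {
        // Hand-written sample; real converter output will differ
        String xml = "<Document><P>Intro</P>"
                + "<Table><TR><TD>Item</TD><TD>42</TD></TR></Table></Document>";
        System.out.println(select(xml, "//Table//TD"));
    }
}
```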


Method 3: Handling Scanned PDFs (OCR)

If your PDF is a scanned image (contains no "real" text), you must use Optical Character Recognition (OCR) first.
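One way to decide whether OCR is needed is to try normal text extraction first and check how much text comes back. The sketch below is a rough heuristic under stated assumptions: the per-page strings come from a prior extraction step, and the threshold of 20 non-whitespace characters per page is an arbitrary starting value, not a recommendation from any library. The class name ScanDetector is hypothetical.

```java
import java.util.List;

class ScanDetector {
    /**
     * Returns true when the average number of non-whitespace characters
     * per page falls below the given threshold.
     */
    static boolean looksScanned(List<String> pageTexts, int minCharsPerPage) {
        if (pageTexts.isEmpty()) {
            return false; // no pages, nothing to OCR
        }
        long nonWhitespace = 0;
        for (String page : pageTexts) {
            nonWhitespace += page.chars().filter(c -> !Character.isWhitespace(c)).count();
        }
        return nonWhitespace < (long) minCharsPerPage * pageTexts.size();
    }

    public static void main(String[] args) {
        // Stand-in extraction results: two pages with no real text
        System.out.println(looksScanned(List.of("", "  \n"), 20));
    }
}
```

A mixed document, with some scanned and some native pages, would need the same check applied page by page rather than to the whole file.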

Step 1: Add Dependencies

You'll need PDFBox to load the PDF and Tesseract for OCR. You also need the Tesseract OCR data files (traineddata).

<!-- PDFBox for PDF handling -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
</dependency>
<!-- Tesseract OCR for Java -->
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>5.7.0</version> <!-- Check for the latest version -->
</dependency>

Setup: Download the trained language data from the tesseract-ocr/tessdata repository on GitHub. Place the eng.traineddata file (for English) in a directory, for example C:/tessdata.

Step 2: Java Code

This code first converts each page of the PDF to an image, then uses Tesseract to extract text from that image.

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class OcrPdfToXml {
    public static void main(String[] args) {
        String pdfFilePath = "path/to/your/scanned_document.pdf";
        Path outputXmlFile = Paths.get("output/scanned_document.xml");
        String tesseractDataPath = "C:/tessdata"; // Path to your tessdata directory
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tesseractDataPath);
        tesseract.setLanguage("eng"); // Set language
        StringBuilder fullText = new StringBuilder();
        // PDFBox 3.x loads documents through Loader; PDDocument.load was removed in 3.0
        try (PDDocument document = Loader.loadPDF(new File(pdfFilePath))) {
            PDFRenderer pdfRenderer = new PDFRenderer(document);
            for (int page = 0; page < document.getNumberOfPages(); page++) {
                System.out.println("Processing page " + (page + 1));
                // Render the PDF page as an image
                BufferedImage image = pdfRenderer.renderImageWithDPI(page, 300); // 300 DPI for good quality
                // Perform OCR on the image
                String pageText = tesseract.doOCR(image);
                fullText.append(pageText).append("\n\n"); // Add page separator
            }
            // Escape XML special characters so the output stays well-formed
            String escaped = fullText.toString()
                    .replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
            // Create a simple XML structure
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                         "<scanned_document>\n" +
                         "  <pages>\n" +
                         "    " + escaped.replace("\n", "\n    ") + "\n" +
                         "  </pages>\n" +
                         "</scanned_document>";
            // Create the output directory if needed, then write the XML as UTF-8
            Files.createDirectories(outputXmlFile.toAbsolutePath().getParent());
            Files.write(outputXmlFile, xml.getBytes(StandardCharsets.UTF_8));
            System.out.println("Scanned PDF converted to XML successfully: " + outputXmlFile);
        } catch (IOException | TesseractException e) {
            System.err.println("Error processing scanned PDF: " + e.getMessage());
            e.printStackTrace();
        }
    }
}
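Extracted or recognized text can contain control characters (a form feed, for example) that XML 1.0 does not allow, even in escaped form. The sketch below filters a string down to the characters permitted by the XML 1.0 Char production before it is written out. The class name XmlSanitizer is hypothetical, and silently dropping the characters is only one possible policy; replacing them with a space is another.

```java
class XmlSanitizer {
    /** True for code points allowed by the XML 1.0 Char production. */
    static boolean isAllowed(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Removes every code point that XML 1.0 forbids, including unpaired surrogates. */
    static String stripInvalidXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().filter(XmlSanitizer::isAllowed).forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // A form feed and a backspace stand in for stray control characters
        String cleaned = stripInvalidXmlChars("Page 1\f\bPage 2\n");
        System.out.println(cleaned.equals("Page 1Page 2\n"));
    }
}
```

Run this before escaping &, < and >, so that the text placed inside the XML elements contains only legal characters.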

Summary and Recommendation

Your Goal | Recommended Library
Quick & Free text extraction | Apache PDFBox. It's easy to set up and gets the job done for simple documents.
Preserve structure (tables, layout) | iText 7. If you can work with the AGPL license or have a budget, this is the most powerful option for structured data.
Convert scanned documents | PDFBox + Tesseract (OCR). This is the standard, free approach for image-based PDFs.
High-fidelity, commercial-grade conversion | Aspose.PDF. If budget is not a constraint, this is a top-tier alternative to iText.

For most projects starting out, begin with Apache PDFBox. If you find its lack of structure limiting, then invest time in learning iText 7. If you're dealing with scans, PDFBox + Tesseract is the way to go.
