杰瑞科技汇

How to Convert Word to HTML in Java

Converting Word documents (.docx) to HTML is a common requirement for web applications. This guide covers three ways to do it in Java, using the most popular libraries for the job.

Summary of Approaches

Library | How it Works | Pros | Cons | Best For
Apache POI | Low-level API that reads the Word file's XML structure. | Full control; no libraries needed beyond POI itself. | Very complex and verbose. You have to handle everything manually. | Developers who need maximum control and are willing to write a lot of code.
docx4j | Higher-level library that maps the Word XML to Java objects via JAXB. | Much easier to use. Good support for complex Word features (headers, footers, tables, images). | Can be heavy. It is open source, with commercial support available. | Most use cases. The recommended choice for robust, high-fidelity conversions.
Freemarker | A templating engine. You design an HTML template and populate it with data extracted from a Word doc. | Maximum flexibility for the final HTML output. You control the exact structure and styling. | Requires two steps: extract data from Word, then apply it to the template. | Projects where the final HTML must adhere to a very specific, pre-defined structure.

Approach 1: Using Apache POI (The "Hard Way")

Apache POI is the most famous Java library for Microsoft Office formats. However, its Word processing component (XWPF) is very low-level. Converting to HTML requires you to iterate through every paragraph, run, and style, and manually generate the corresponding HTML tags.

This is not recommended for a quick solution but is good to understand what's happening under the hood.
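To make "under the hood" concrete: a .docx file is a ZIP archive, and the main body normally lives in an entry named word/document.xml. The sketch below uses only the JDK (no POI) to list the entries of such an archive. The class name DocxEntryLister and the file name input.docx are placeholders chosen for illustration.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists the parts stored inside a .docx (ZIP) container.
class DocxEntryLister {
    // Returns the entry names in archive order and closes the stream afterwards.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumes input.docx exists in the working directory.
        try (InputStream in = new FileInputStream("input.docx")) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

Libraries such as POI parse these XML parts for you; the listing is only meant to show what they are reading.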

Example Code (Simplified)

import org.apache.poi.xwpf.usermodel.*;
import java.io.*;
public class PoiToHtmlConverter {
    public static void main(String[] args) throws Exception {
        StringBuilder htmlBuilder = new StringBuilder();
        // 1. Load the Word document (try-with-resources closes it when done)
        try (InputStream in = new FileInputStream("input.docx");
             XWPFDocument document = new XWPFDocument(in)) {
            // 2. Start building the HTML string
            htmlBuilder.append("<html><head><meta charset=\"UTF-8\"></head><body>");
            // 3. Iterate through top-level paragraphs
            for (XWPFParagraph p : document.getParagraphs()) {
                String alignment = getAlignment(p.getAlignment());
                htmlBuilder.append("<p style=\"text-align: ").append(alignment).append("\">");
                // 4. Iterate through runs (text with the same formatting)
                for (XWPFRun r : p.getRuns()) {
                    String text = r.getText(0);
                    if (text == null) continue; // runs without text (e.g. pictures) return null
                    String bold = r.isBold() ? "font-weight: bold;" : "";
                    String italic = r.isItalic() ? "font-style: italic;" : "";
                    // getFontSize() returns -1 when the run does not set a size itself
                    String fontSize = r.getFontSize() != -1 ? "font-size: " + r.getFontSize() + "pt;" : "";
                    htmlBuilder.append("<span style=\"").append(bold).append(italic).append(fontSize).append("\">")
                               .append(escapeHtml(text))
                               .append("</span>");
                }
                htmlBuilder.append("</p>");
            }
            // 5. Handle tables (this is even more complex).
            // Note: this emits all tables after all paragraphs; iterate
            // document.getBodyElements() instead to keep the original order.
            for (XWPFTable table : document.getTables()) {
                htmlBuilder.append("<table border=\"1\">");
                for (XWPFTableRow row : table.getRows()) {
                    htmlBuilder.append("<tr>");
                    for (XWPFTableCell cell : row.getTableCells()) {
                        htmlBuilder.append("<td>");
                        for (XWPFParagraph p : cell.getParagraphs()) {
                            // Plain text only; per-run styling would repeat the logic above
                            htmlBuilder.append(escapeHtml(p.getText()));
                        }
                        htmlBuilder.append("</td>");
                    }
                    htmlBuilder.append("</tr>");
                }
                htmlBuilder.append("</table>");
            }
            htmlBuilder.append("</body></html>");
        }
        // 6. Write the HTML to a file, in the encoding declared in the <meta> tag
        try (PrintWriter out = new PrintWriter("output_poi.html", "UTF-8")) {
            out.println(htmlBuilder.toString());
        }
        System.out.println("Conversion complete. Check output_poi.html");
    }
    // Maps POI's paragraph alignment to a CSS text-align value
    private static String getAlignment(ParagraphAlignment alignment) {
        if (alignment == null) return "left";
        switch (alignment) {
            case CENTER: return "center";
            case RIGHT: return "right";
            case BOTH: return "justify";
            default: return "left";
        }
    }
    private static String escapeHtml(String input) {
        if (input == null) return "";
        return input.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;")
                    .replace("'", "&#39;");
    }
}

As you can see, this is a lot of work and doesn't even cover images, headers, footers, or complex styles properly.
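Images are one of those gaps. One common way to handle them in a hand-written converter is to embed each picture in the HTML as a data URI, so the output stays a single file. The sketch below shows only that encoding step using the JDK. The class and method names are hypothetical, and it assumes you have already obtained the picture bytes and MIME type (with POI, the bytes would typically come from XWPFPictureData.getData() on the entries of document.getAllPictures()).

```java
import java.util.Base64;

// Hypothetical helper: turns raw image bytes into a self-contained <img> tag.
class DataUriImages {
    // mimeType is something like "image/png" or "image/jpeg".
    static String toImgTag(byte[] imageBytes, String mimeType) {
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "<img src=\"data:" + mimeType + ";base64," + base64 + "\" alt=\"\">";
    }

    public static void main(String[] args) {
        // Three placeholder bytes stand in for real picture data.
        byte[] sample = {1, 2, 3};
        System.out.println(toImgTag(sample, "image/png"));
    }
}
```

Data URIs inflate the HTML by roughly a third of the image size, so for large or numerous pictures writing them to separate files and linking to them is usually the better trade-off.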


Approach 2: Using docx4j (The Recommended Way)

docx4j is designed specifically for this kind of task. It has built-in functionality to convert a Word document to a well-formed HTML string, handling most formatting automatically.

Step 1: Add the Dependency

Add the docx4j library to your project. If you're using Maven, add this to your pom.xml:

<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-core</artifactId>
    <version>11.4.4</version> <!-- Use the latest version -->
</dependency>
<dependency>
    <groupId>org.docx4j</groupId>
    <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
    <version>11.4.4</version> <!-- docx4j 11.x needs one JAXB implementation module on the classpath -->
</dependency>

Step 2: Write the Java Code

The code is remarkably simple.

import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
public class Docx4jToHtmlConverter {
    public static void main(String[] args) throws Exception {
        // 1. Load the Word document
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File("input.docx"));
        // 2. Configure the HTML export
        HTMLSettings htmlSettings = Docx4J.createHTMLSettings();
        htmlSettings.setOpcPackage(wordMLPackage); // older docx4j releases use setWmlPackage(...)
        // Images are written to this directory and referenced from the HTML by this relative URI
        htmlSettings.setImageDirPath("output_docx4j_files");
        htmlSettings.setImageTargetUri("output_docx4j_files");
        // 3. Convert and write the HTML to a file
        try (OutputStream os = new FileOutputStream("output_docx4j.html")) {
            Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
        }
        System.out.println("Conversion complete. Check output_docx4j.html");
    }
}

This code produces an output_docx4j.html file with the document's styles carried over as CSS, preserving the look and feel of the original document much better than the POI example.


Approach 3: Using Freemarker (The Template-Driven Way)

This approach is different. You don't directly convert Word to HTML. Instead, you use a library (like docx4j or Apache POI) to extract data from the Word document, and then use Freemarker to render this data into a pre-defined HTML template.

This is ideal when you need the final HTML to match a specific design (e.g., a corporate website template).

Step 1: Add Dependencies

You'll need docx4j to read the Word file and Freemarker for templating.

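<!-- pom.xml -->
<dependencies>
    <dependency>
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-core</artifactId>
        <version>11.4.4</version>
    </dependency>
    <dependency>
        <!-- docx4j 11.x needs one JAXB implementation module on the classpath -->
        <groupId>org.docx4j</groupId>
        <artifactId>docx4j-JAXB-ReferenceImpl</artifactId>
        <version>11.4.4</version>
    </dependency>
    <dependency>
        <groupId>org.freemarker</groupId>
        <artifactId>freemarker</artifactId>
        <version>2.3.32</version>
    </dependency>
</dependencies>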

Step 2: Create an HTML Template

Create a file named template.ftl in a src/main/resources/templates directory.

<!-- src/main/resources/templates/template.ftl -->
<!DOCTYPE html>
<html>
<head>
    <title>${document.title}</title>
    <style>
        body { font-family: sans-serif; }
        .content { max-width: 800px; margin: auto; }
    </style>
</head>
<body>
    <div class="content">
        <h1>${document.title}</h1>
        <p><em>Generated on: ${.now?string("yyyy-MM-dd HH:mm")}</em></p>
        <#list document.paragraphs as para>
            <p>${para.text}</p>
        </#list>
        <#if document.hasTable>
            <h2>Data Table</h2>
            <table border="1">
                <#list document.tableData as row>
                    <tr>
                        <#list row as cell>
                            <td>${cell}</td>
                        </#list>
                    </tr>
                </#list>
            </table>
        </#if>
    </div>
</body>
</html>
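The template reads everything through a single root variable named document: document.title, document.paragraphs (each item exposing a text field), document.hasTable, and document.tableData (rows of cell strings). Whatever extraction code you write has to produce a data model with exactly that shape, or Freemarker will fail on the first missing variable. The following is a minimal sketch of such a model built from plain JDK collections; the class name TemplateDataModel and its build method are hypothetical, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: assembles the data model shape that template.ftl reads.
class TemplateDataModel {
    static Map<String, Object> build(String title,
                                     List<String> paragraphTexts,
                                     List<List<String>> tableRows) {
        List<Map<String, String>> paragraphs = new ArrayList<>();
        for (String text : paragraphTexts) {
            Map<String, String> para = new HashMap<>();
            para.put("text", text); // read in the template as ${para.text}
            paragraphs.add(para);
        }
        Map<String, Object> document = new HashMap<>();
        document.put("title", title);
        document.put("paragraphs", paragraphs);
        document.put("hasTable", !tableRows.isEmpty());
        document.put("tableData", tableRows);
        // "document" is the only top-level name the template references
        Map<String, Object> root = new HashMap<>();
        root.put("document", document);
        return root;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        rows.add(Arrays.asList("Name", "Value"));
        Map<String, Object> root = build("My Document Report",
                Arrays.asList("First paragraph.", "Second paragraph."), rows);
        System.out.println(root.keySet());
    }
}
```

Maps are used here because Freemarker's default object wrapper resolves expressions such as para.text against map keys; small Java beans with getters would work as well.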

Step 3: Write the Java Code

This code extracts simple data from the Word document and uses Freemarker to fill the template.

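import org.docx4j.XmlUtils;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.*;
import freemarker.core.HTMLOutputFormat;
import freemarker.template.Configuration;
import freemarker.template.Template;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class FreemarkerDocxConverter {
    public static void main(String[] args) throws Exception {
        // 1. Load the document
        WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage.load(new File("input.docx"));
        MainDocumentPart documentPart = wordMLPackage.getMainDocumentPart();
        // 2. Extract data from the document
        List<Map<String, String>> paragraphs = new ArrayList<>();
        List<List<String>> tableData = new ArrayList<>();
        for (Object o : documentPart.getContent()) {
            Object element = XmlUtils.unwrap(o); // tables arrive wrapped in a JAXBElement
            if (element instanceof P) {
                Map<String, String> para = new HashMap<>();
                para.put("text", getParagraphText((P) element)); // template reads ${para.text}
                paragraphs.add(para);
            } else if (element instanceof Tbl) {
                addTableRows((Tbl) element, tableData);
            }
        }
        // The template reads everything through a root variable named "document"
        Map<String, Object> document = new HashMap<>();
        document.put("title", "My Document Report"); // You could extract this from a custom property
        document.put("paragraphs", paragraphs);
        document.put("hasTable", !tableData.isEmpty());
        document.put("tableData", tableData);
        Map<String, Object> dataModel = new HashMap<>();
        dataModel.put("document", document);
        // 3. Configure Freemarker
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_32);
        cfg.setDirectoryForTemplateLoading(new File("src/main/resources/templates"));
        cfg.setDefaultEncoding("UTF-8");
        cfg.setOutputFormat(HTMLOutputFormat.INSTANCE); // auto-escape extracted text as HTML
        // 4. Process the template
        Template template = cfg.getTemplate("template.ftl");
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("output_freemarker.html"), StandardCharsets.UTF_8)) {
            template.process(dataModel, out);
        }
        System.out.println("Conversion complete. Check output_freemarker.html");
    }
    // Concatenates the text of all runs that sit directly inside the paragraph
    private static String getParagraphText(P paragraph) {
        StringBuilder text = new StringBuilder();
        for (Object o : paragraph.getContent()) {
            Object run = XmlUtils.unwrap(o);
            if (!(run instanceof R)) continue;
            for (Object c : ((R) run).getContent()) {
                Object value = XmlUtils.unwrap(c);
                if (value instanceof Text) {
                    text.append(((Text) value).getValue());
                }
            }
        }
        return text.toString();
    }
    // Appends one list of cell strings per table row
    private static void addTableRows(Tbl table, List<List<String>> tableData) {
        for (Object rowObj : table.getContent()) {
            Object row = XmlUtils.unwrap(rowObj);
            if (!(row instanceof Tr)) continue;
            List<String> cells = new ArrayList<>();
            for (Object cellObj : ((Tr) row).getContent()) {
                Object cell = XmlUtils.unwrap(cellObj);
                if (!(cell instanceof Tc)) continue;
                StringBuilder cellText = new StringBuilder();
                for (Object p : ((Tc) cell).getContent()) {
                    if (p instanceof P) cellText.append(getParagraphText((P) p));
                }
                cells.add(cellText.toString());
            }
            tableData.add(cells);
        }
    }
}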

Final Recommendation

  • For quick and reliable conversions that preserve formatting, use docx4j. It's the most practical and powerful solution for this task.
  • If you need complete control over the final HTML structure and styling, and are willing to build a data extraction process, use Freemarker.
  • Avoid using Apache POI directly for this conversion unless you have very specific, low-level requirements and are prepared for a significant development effort.