杰瑞科技汇

How do you run ICTCLAS word segmentation with Python on Linux?

This guide explains how to use ICTCLAS50, the 5.0 release of ICTCLAS, with Python on a Linux system.

ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a well-known Chinese lexical analysis (word segmentation) toolkit. ICTCLAS50 is the 5.0 release of that system; later versions were published under the name NLPIR. Although ICTCLAS50 itself is no longer actively developed, it remains a popular choice with a reputation for high accuracy.

This guide will walk you through the entire process: from installation and compilation to creating a simple Python wrapper.

Overview of the Steps

  1. Download: Get the source code for ICTCLAS50.
  2. Install Dependencies: Install the build tools (a C++ compiler and make) and the Python development headers.
  3. Compile: Build the C++ shared library (.so file) that Python will interact with.
  4. Python Wrapper: Write a Python script using the ctypes library to load and call the functions from the compiled C++ library.
  5. Run and Test: Execute the Python script to perform word segmentation.

Step 1: Download ICTCLAS50

The official source for ICTCLAS50 is hosted by the Institute of Computing Technology, Chinese Academy of Sciences (CAS). You can find it on their website.

  1. Go to the Download Page: Navigate to the ICTCLAS50 download page. A direct link might change, but you can search for "ICTCLAS50 下载" (ICTCLAS50 Download).
  2. Get the Source Package: Download the source code package, which is usually a .tar.gz or .zip file. For this guide, we'll assume a file named ICTCLAS50_SRC_2025.zip.
  3. Extract the File: Place the downloaded file in your home directory or a preferred location and extract it.
# Navigate to your home directory
cd ~
# Unzip the file (you might need to install unzip first: sudo apt-get install unzip)
unzip ICTCLAS50_SRC_2025.zip

This will create a directory named ICTCLAS50_SRC_2025.
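
If you would rather script the extraction, the same step can be sketched with Python's standard zipfile module. This is only an illustration: extract_archive is a hypothetical helper written for this guide, and the archive name is the one assumed above.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .zip archive into dest_dir and return its top-level entries."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest)
        # The first path component of each member tells us which directory
        # (or file) the archive created at the top level.
        top_level = sorted({name.split("/")[0] for name in zf.namelist() if name})
    return top_level
```

For example, extract_archive(Path.home() / "ICTCLAS50_SRC_2025.zip", Path.home()) should report the source directory the archive unpacks to, which is a quick way to confirm the directory name before you cd into it.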

Step 2: Install Dependencies

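ICTCLAS50 is written in C++, so you need a C++ compiler and make to build it. The Python development headers are only required when code is compiled against the Python C API; the ctypes approach used later in this guide loads the library at run time and does not strictly need them, but installing them is harmless.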

On Debian/Ubuntu-based systems, run:

sudo apt-get update
sudo apt-get install build-essential g++ python3-dev
  • build-essential: Installs gcc, g++, and make, which are required for compilation.
  • python3-dev: Installs the C header files for Python 3, allowing you to build C extensions.
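
Before moving on, it can help to confirm that the build tools are actually on your PATH. The following is a minimal sketch using only the standard library; missing_tools is a hypothetical helper name, and the default tool list simply mirrors what build-essential is said to provide above.

```python
import shutil


def missing_tools(tools=("gcc", "g++", "make")):
    """Return the names of build tools that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found.")
```

An empty list means every tool was found; anything else names what still has to be installed.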

Step 3: Compile the C++ Library

Now, let's compile the ICTCLAS50 source code into a shared library (.so file) that Python can load.

  1. Navigate to the Source Directory:

    cd ~/ICTCLAS50_SRC_2025
  2. Modify the Makefile (Important!): The Makefile is configured for older systems. We need to update the compiler flags to be compatible with modern Linux systems.

    Open the Makefile with a text editor like nano or vim:

    nano Makefile

    Find the CFLAGS and CXXFLAGS lines. They likely look like this:

    # Old lines
    CFLAGS = -g -O2 -Wall -fPIC
    CXXFLAGS = -g -O2 -Wall -fPIC

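    Append the include paths for your Python installation, which you can print by running python3-config --includes. These paths only matter if the build compiles code against Python.h; the ctypes wrapper in Step 4 loads the library at run time and relies on -fPIC alone, so the extra flags are harmless in that case. The updated lines would be: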

    # New lines
    CFLAGS = -g -O2 -Wall -fPIC $(shell python3-config --includes)
    CXXFLAGS = -g -O2 -Wall -fPIC $(shell python3-config --includes)

    Save the file and exit (Ctrl+X, then Y, then Enter in nano).

  3. Run Make: The Makefile is designed to build everything. Simply run make.

    make

    This will take a few minutes. If it completes without errors, you will find the compiled shared library in the src directory. It will be named libICTCLAS50.so.
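
If the library does not show up where you expect, a short search of the build tree can locate it. This is a sketch under the assumptions of this guide: find_shared_libraries is a hypothetical helper, and the default pattern is based on the library name given above.

```python
from pathlib import Path


def find_shared_libraries(build_root, pattern="libICTCLAS50*.so"):
    """Recursively list shared libraries under build_root that match pattern."""
    root = Path(build_root).expanduser()
    return sorted(root.rglob(pattern))
```

Calling find_shared_libraries("~/ICTCLAS50_SRC_2025") returns a list of paths; an empty list suggests the build did not produce a shared library, or that it was given a different name.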


Step 4: Create the Python Wrapper

We will use Python's built-in ctypes library to load and use the libICTCLAS50.so file. ctypes allows you to call functions in foreign libraries directly from Python.
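
To see the ctypes workflow in isolation before pointing it at ICTCLAS50, here is a small sketch that calls strlen from the C standard library. It follows the same three steps the wrapper uses: load a shared library, declare argument and return types, then call the function. The helper name c_strlen is made up for this example.

```python
import ctypes
import ctypes.util

# Load the C standard library. On Linux, find_library("c") usually returns a
# name such as "libc.so.6"; if it returns None, CDLL(None) still exposes the
# libc symbols that are already loaded into the Python process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(text):
    """Return the UTF-8 byte length of text as computed by libc's strlen."""
    return libc.strlen(text.encode("utf-8"))


print(c_strlen("hello"))     # 5
print(c_strlen("中文分词"))  # 12: each of these characters is 3 bytes in UTF-8
```

Note that the C side counts bytes, not characters, which is why the wrapper later passes the length of the encoded bytes rather than the length of the Python string.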

  1. Prepare the Data Files: ICTCLAS50 needs data files for its dictionary and models. These are usually in the data directory. For the Python script to find them, it's easiest to create a symbolic link to the data directory in your current working directory.

    # From inside ~/ICTCLAS50_SRC_2025
    ln -s data data_link
  2. Write the Python Script: Create a new Python file named run_ictclas.py in the ~/ICTCLAS50_SRC_2025 directory.

    nano run_ictclas.py

    Paste the following code into the file. The comments explain each part.

    import os
    import ctypes
    from ctypes import POINTER, byref, c_char_p, c_int, c_void_p
    # --- 1. Load the Library ---
    # Run this script from the ICTCLAS50 source directory so that the relative
    # path below resolves to the library built in Step 3.
    try:
        ictclas = ctypes.CDLL('./src/libICTCLAS50.so')
    except OSError as e:
        print(f"Error loading library: {e}")
        print("Please ensure you are running this script from the ICTCLAS50 source directory")
        print("and that the 'src/libICTCLAS50.so' file exists.")
        raise SystemExit(1)
    # --- 2. Define Function Prototypes ---
    # ctypes cannot verify these declarations. The signatures below are the ones
    # this guide assumes; check them against the header shipped with your package.
    # void ICTCLAS_Init(const char* sInitDirPath, int nLogLevel)
    ictclas.ICTCLAS_Init.argtypes = [c_char_p, c_int]
    ictclas.ICTCLAS_Init.restype = None
    # const char* ICTCLAS_ParagraphProcess(const char* sParagraph, int nCount, int *nRes)
    # restype is c_void_p rather than c_char_p: c_char_p would convert the result
    # into a Python bytes copy and discard the original address, so the buffer
    # could no longer be handed back to a free function.
    ictclas.ICTCLAS_ParagraphProcess.argtypes = [c_char_p, c_int, POINTER(c_int)]
    ictclas.ICTCLAS_ParagraphProcess.restype = c_void_p
    # void ICTCLAS_Exit()
    ictclas.ICTCLAS_Exit.argtypes = []
    ictclas.ICTCLAS_Exit.restype = None
    # --- 3. Define Constants (from ICTCLAS.h) ---
    # The log level. 0 for no log.
    LOG_LEVEL = 0
    # The result buffer size.
    RESULT_BUFFER_SIZE = 65536
    def segment_text(text):
        """
        Segments a given Chinese text string using ICTCLAS50.
        """
        # Ensure the text is in bytes, as C functions expect char*
        text_bytes = text.encode('utf-8')
        # Initialize ICTCLAS
        # The first argument is the path to the data directory. We use our symlink.
        data_path = b'./data_link'
        ictclas.ICTCLAS_Init(data_path, LOG_LEVEL)
        print("ICTCLAS Initialized.")
        # Process the text
        # The second argument is the length of the text in bytes.
        # The third argument is a pointer to an integer that will store the result length.
        res_len = c_int(0)
        result_pointer = ictclas.ICTCLAS_ParagraphProcess(text_bytes, len(text_bytes), byref(res_len))
        if not result_pointer:
            print("Error: ICTCLAS_ParagraphProcess returned NULL.")
            ictclas.ICTCLAS_Exit()
            return None
        # result_pointer is the address of a NUL-terminated string. Copy it into
        # a Python bytes object, then decode it.
        segmented_text = ctypes.string_at(result_pointer).decode('utf-8')
        # Release the C buffer if the library exports a function for that.
        # ICTCLAS_FreeResult is an assumed name: check your header for the real
        # one. If there is none, the library most likely owns the buffer.
        if hasattr(ictclas, 'ICTCLAS_FreeResult'):
            ictclas.ICTCLAS_FreeResult.argtypes = [c_void_p]
            ictclas.ICTCLAS_FreeResult.restype = None
            ictclas.ICTCLAS_FreeResult(result_pointer)
        # Exit ICTCLAS
        ictclas.ICTCLAS_Exit()
        return segmented_text
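
The comments in the script raise the question of how to free a string that a C function returns. The pattern can be tried safely with the C standard library instead of ICTCLAS50: strdup returns a malloc'ed copy that the caller must release with free. This is a sketch of the general ctypes technique, not of the ICTCLAS50 API, and copy_via_c is a made-up helper name.

```python
import ctypes
import ctypes.util

# CDLL(None) is the fallback if find_library cannot locate libc on Linux.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# char *strdup(const char *s): returns a malloc'ed copy the caller must free.
libc.strdup.argtypes = [ctypes.c_char_p]
# Use c_void_p so ctypes hands back the raw address. With c_char_p the result
# would be converted to a Python bytes copy and the address would be lost.
libc.strdup.restype = ctypes.c_void_p

# void free(void *ptr)
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def copy_via_c(text):
    """Round-trip text through a C-allocated buffer and release the buffer."""
    ptr = libc.strdup(text.encode("utf-8"))
    if not ptr:
        raise MemoryError("strdup returned NULL")
    try:
        # string_at copies the NUL-terminated bytes at ptr into Python memory.
        return ctypes.string_at(ptr).decode("utf-8")
    finally:
        # Safe to free: the Python string above no longer refers to the buffer.
        libc.free(ptr)
```

The same shape applies to any library function that returns an allocated string: keep the address, copy the bytes out, then pass the address to whatever release function the library documents.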