This guide covers how to use ICTCLAS (now known as ICTCLAS50) with Python on a Linux system.

ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a well-known Chinese lexical analysis (word segmentation) toolkit. While the original ICTCLAS is no longer actively developed, its successor, ICTCLAS50, remains a popular choice.
This guide will walk you through the entire process: from installation and compilation to creating a simple Python wrapper.
Overview of the Steps
- Download: Get the source code for ICTCLAS50.
- Install Dependencies: Install necessary libraries like C++ compilers and Python development headers.
- Compile: Build the C++ shared library (.so file) that Python will interact with.
- Python Wrapper: Write a Python script using the ctypes library to load and call the functions from the compiled C++ library.
- Run and Test: Execute the Python script to perform word segmentation.
Step 1: Download ICTCLAS50
The official source for ICTCLAS50 is hosted by the Institute of Computing Technology, Chinese Academy of Sciences (CAS). You can find it on their website.
- Go to the Download Page: Navigate to the ICTCLAS50 download page. A direct link might change, but you can search for "ICTCLAS50 下载" (ICTCLAS50 Download).
- Get the Source Package: Download the source code package, which is usually a .tar.gz or .zip file. For this guide, we'll assume a file named ICTCLAS50_SRC_2025.zip.
- Extract the File: Place the downloaded file in your home directory or a preferred location and extract it.

    # Navigate to your home directory
    cd ~
    # Unzip the file (you might need to install unzip first: sudo apt-get install unzip)
    unzip ICTCLAS50_SRC_2025.zip

This will create a directory named ICTCLAS50_SRC_2025.
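
If the rest of the setup is scripted, the extraction step can also be done from Python. The following is a minimal sketch that uses shutil.unpack_archive from the standard library, which picks the archive format from the file extension (.zip, .tar.gz, and so on). The function name extract_archive is hypothetical, and ICTCLAS50_SRC_2025.zip is only the file name this guide assumes.

```python
import shutil
from pathlib import Path


def extract_archive(archive, dest_dir):
    """Unpack a .zip or .tar.gz archive into dest_dir and return the destination path."""
    archive = Path(archive).expanduser()
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    # unpack_archive infers the archive format from the file extension.
    shutil.unpack_archive(str(archive), str(dest))
    return dest


# Example use (file name assumed by this guide; adjust it to your download):
#   extract_archive("~/ICTCLAS50_SRC_2025.zip", "~")
```

Whether the archive unpacks into a single top-level directory depends on how the package was built, so it is worth listing the destination afterwards.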

Step 2: Install Dependencies
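ICTCLAS50 is written in C++, so you need a C++ compiler to build it. The Python development headers are installed as well because Step 3 adds the Python include paths to the compiler flags; the ctypes module used in Step 4 ships with Python and does not need them at run time.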
On Debian/Ubuntu-based systems, run:
    sudo apt-get update
    sudo apt-get install build-essential g++ python3-dev

- build-essential: Installs gcc, g++, and make, which are required for compilation.
- python3-dev: Installs the C header files for Python 3, allowing you to build C extensions.
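
Before moving on, it can help to confirm that the tools are actually on the PATH. The sketch below is one way to do that with shutil.which. The helper name missing_tools and the REQUIRED_TOOLS list are illustrative choices based on the packages named above, not something ICTCLAS50 prescribes.

```python
import shutil


def missing_tools(tools):
    """Return the names from tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


# Tools this guide relies on: the compilers, make, and python3-config.
REQUIRED_TOOLS = ["gcc", "g++", "make", "python3-config"]

# Example use:
#   absent = missing_tools(REQUIRED_TOOLS)
#   if absent:
#       print("Missing:", ", ".join(absent))
```

An empty result only means the commands exist; it does not check their versions.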
Step 3: Compile the C++ Library
Now, let's compile the ICTCLAS50 source code into a shared library (.so file) that Python can load.
- Navigate to the Source Directory:

    cd ~/ICTCLAS50_SRC_2025

- Modify the Makefile (Important!): The Makefile is configured for older systems. We need to update the compiler flags to be compatible with modern Linux systems.

  Open the Makefile with a text editor like nano or vim:

    nano Makefile

  Find the CFLAGS and CXXFLAGS lines. They likely look like this:

    # Old lines
    CFLAGS = -g -O2 -Wall -fPIC
    CXXFLAGS = -g -O2 -Wall -fPIC

  Change them to include the necessary paths for your Python installation. You can find the correct path by running python3-config --includes. A good, modern configuration would be:

    # New lines
    CFLAGS = -g -O2 -Wall -fPIC $(shell python3-config --includes)
    CXXFLAGS = -g -O2 -Wall -fPIC $(shell python3-config --includes)

  Save the file and exit (Ctrl+X, then Y, then Enter in nano).

- Run Make: The Makefile is designed to build everything. Simply run make.

    make

  This will take a few minutes. If it completes without errors, you will find the compiled shared library in the src directory. It will be named libICTCLAS50.so.
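
If the library is not where you expect it after the build, a quick search of the source tree can save time. The following sketch walks a directory with pathlib and returns every file with the expected name. The function name find_built_library is hypothetical, and the default file name libICTCLAS50.so is simply the one this guide assumes the build produces.

```python
from pathlib import Path


def find_built_library(source_dir, name="libICTCLAS50.so"):
    """Return every file under source_dir whose file name equals name, sorted."""
    root = Path(source_dir).expanduser()
    # rglob searches the directory tree recursively for the given file name.
    return sorted(p for p in root.rglob(name) if p.is_file())


# Example use:
#   for path in find_built_library("~/ICTCLAS50_SRC_2025"):
#       print(path)
```

An empty list suggests that the build failed or that the Makefile writes its output under a different name.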
Step 4: Create the Python Wrapper
We will use Python's built-in ctypes library to load and use the libICTCLAS50.so file. ctypes allows you to call functions in foreign libraries directly from Python.
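
To show the ctypes pattern in isolation, here is a small sketch that calls functions from the C standard library instead of ICTCLAS50, so it runs without the compiled library. It declares argument and return types, passes UTF-8 bytes to a function expecting char*, and keeps a returned pointer as a raw address so it can be freed afterwards. The helper names c_strlen and c_roundtrip are made up for this illustration, and the code assumes a Linux system.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the current process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# char *strdup(const char *s); declared as void* so the raw address is kept
libc.strdup.argtypes = [ctypes.c_char_p]
libc.strdup.restype = ctypes.c_void_p

# void free(void *ptr)
libc.free.argtypes = [ctypes.c_void_p]
libc.free.restype = None


def c_strlen(text):
    """Return the length in bytes of text once encoded as UTF-8."""
    return libc.strlen(text.encode("utf-8"))


def c_roundtrip(text):
    """Copy text through a C-allocated buffer and release the buffer."""
    ptr = libc.strdup(text.encode("utf-8"))
    if not ptr:
        raise MemoryError("strdup returned NULL")
    try:
        # string_at reads bytes from the address up to the terminating NUL.
        return ctypes.string_at(ptr).decode("utf-8")
    finally:
        libc.free(ptr)
```

Note that C functions count bytes, not characters: a two-character Chinese string occupies six bytes in UTF-8, which matters whenever a length argument is passed to the library.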
- Prepare the Data Files: ICTCLAS50 needs data files for its dictionary and models. These are usually in the data directory. For the Python script to find them, it's easiest to create a symbolic link to the data directory in your current working directory.

    # From inside ~/ICTCLAS50_SRC_2025
    ln -s data data_link

- Write the Python Script: Create a new Python file named run_ictclas.py in the ~/ICTCLAS50_SRC_2025 directory.

    nano run_ictclas.py

  Paste the following code into the file. The comments explain each part.

    import sys
    import ctypes
    from ctypes import POINTER, byref, c_char_p, c_int, c_void_p

    # --- 1. Load the Library ---
    # Make sure the script is run from the directory containing src/libICTCLAS50.so
    try:
        ictclas = ctypes.CDLL('./src/libICTCLAS50.so')
    except OSError as e:
        print(f"Error loading library: {e}")
        print("Please ensure you are running this script from the ICTCLAS50 source directory")
        print("and that the 'src/libICTCLAS50.so' file exists.")
        sys.exit(1)

    # --- 2. Define Function Prototypes ---
    # We need to tell ctypes about the arguments and return types of the C functions.
    # These signatures are the ones this guide assumes; check them against the
    # header file shipped with your package before relying on them.

    # void ICTCLAS_Init(const char* sInitDirPath, int nLogLevel)
    ictclas.ICTCLAS_Init.argtypes = [c_char_p, c_int]
    ictclas.ICTCLAS_Init.restype = None

    # const char* ICTCLAS_ParagraphProcess(const char* sParagraph, int nCount, int *nRes)
    # restype is c_void_p rather than c_char_p so that ctypes keeps the raw address
    # instead of converting the result into a Python bytes object.
    ictclas.ICTCLAS_ParagraphProcess.argtypes = [c_char_p, c_int, POINTER(c_int)]
    ictclas.ICTCLAS_ParagraphProcess.restype = c_void_p

    # void ICTCLAS_Exit()
    ictclas.ICTCLAS_Exit.argtypes = []
    ictclas.ICTCLAS_Exit.restype = None

    # --- 3. Define Constants (from ICTCLAS.h) ---
    # The log level. 0 for no log.
    LOG_VERBOSE = 0
    # The result buffer size.
    RESULT_BUFFER_SIZE = 65536

    def segment_text(text):
        """
        Segments a given Chinese text string using ICTCLAS50.
        """
        # Ensure the text is in bytes, as C functions expect char*
        text_bytes = text.encode('utf-8')

        # Initialize ICTCLAS
        # The first argument is the path to the data directory. We use our symlink.
        data_path = b'./data_link'
        ictclas.ICTCLAS_Init(data_path, LOG_VERBOSE)
        print("ICTCLAS Initialized.")

        # Process the text
        # The second argument is the length of the text in bytes.
        # The third argument is a pointer to an integer that will store the result length.
        res_len = c_int(0)
        result_pointer = ictclas.ICTCLAS_ParagraphProcess(text_bytes, len(text_bytes), byref(res_len))

        if not result_pointer:
            print("Error: ICTCLAS_ParagraphProcess returned NULL.")
            ictclas.ICTCLAS_Exit()
            return None

        # The C function returns a pointer to a null-terminated string.
        # Copy it out of C memory and decode it from bytes to a Python string.
        segmented_text = ctypes.string_at(result_pointer).decode('utf-8')

        # Free the memory allocated by the C function (VERY IMPORTANT!)
        # The original C++ library has a function for this, but it's often omitted
        # in the C interface. If you see a `ICTCLAS_Free` function, use it.
        # If not, you might need to manage memory carefully or find a way to avoid leaks.
        # For this example, we'll assume the library manages its own memory or the
        # OS will clean it up when the process exits, but this is not ideal.
        # A better library would provide a free function.
        # Let's assume the provided library has `ICTCLAS_FreeResult`
        if hasattr(ictclas, 'ICTCLAS_FreeResult'):
            # Pass the raw address back as a pointer-sized argument.
            ictclas.ICTCLAS_FreeResult.argtypes = [c_void_p]
            ictclas.ICTCLAS_FreeResult.restype = None
            ictclas.ICTCLAS_FreeResult(result_pointer)

        # Exit ICTCLAS
        ictclas.ICTCL
