Let's dive into how to work with Kylin cubes using Python. This is a common task for data engineers and data scientists who need to query pre-aggregated data in a fast, interactive way.

We'll cover three primary methods, starting with the most common and recommended one.
What is a Kylin Cube?
First, a quick recap. Apache Kylin is an open-source, distributed analytical engine designed to provide sub-second query latency on massive datasets (petabytes). It achieves this by:
- Cube Build: Pre-computing and aggregating data from a data source (typically Hive tables on HDFS or cloud storage) into a multi-dimensional data structure called a Cube.
- Query Acceleration: When a query comes in, Kylin doesn't scan raw data. It uses the pre-calculated results from the cube to answer the query almost instantly.
Python is used to automate the process of building these cubes and, more frequently, to query them for applications like dashboards, reports, and data analysis.
Method 1: Using pykylin (The Recommended Python SDK)
The pykylin library is a community-maintained Python SDK for Kylin's REST API (note: it is not an official Apache Kylin project, and its client API can vary between versions, so treat the snippets below as a guide for the version you install). It covers the major operations: querying, fetching cube metadata, and managing the build process.

Installation
First, install the library:
pip install pykylin
Authentication & Connection
You need to connect to your Kylin instance, which requires credentials and the URL of the Kylin server.
from pykylin.client import KylinClient
# Replace with your Kylin instance details
kylin_server_url = 'http://your-kylin-server:7070' # No trailing slash
username = 'ADMIN'
password = 'KYLIN'
# Create a client instance
client = KylinClient(kylin_server_url, username, password)
# Test the connection (optional but good practice)
try:
    client.get_all_projects()
    print("Successfully connected to Kylin!")
except Exception as e:
    print(f"Connection failed: {e}")
Querying a Cube (The Core Task)
This is the most common use case. You'll execute SQL-like queries against your cube.
# Define your SQL query.
# Note: the FROM clause names the fact table or model (e.g. KYLIN_SALES from
# Kylin's sample dataset), not the cube itself; Kylin automatically routes
# the query to a matching cube.
sql_query = """
SELECT
    part_dt,
    SUM(price) AS total_sales
FROM
    kylin_sales  -- replace with your fact table / model name
WHERE
    part_dt >= '2025-01-01' AND part_dt <= '2025-01-31'
GROUP BY
    part_dt
ORDER BY
    total_sales DESC
LIMIT 10
"""
try:
    # Execute the query
    result_df = client.execute(sql_query)

    # The result is a pandas DataFrame, which is very convenient!
    print("\nQuery Result:")
    print(result_df)

    # You can now use the DataFrame for further analysis or plotting.
    # For example, with matplotlib installed:
    # import matplotlib.pyplot as plt
    # result_df.plot(x='part_dt', y='total_sales', kind='bar')
    # plt.title("Total Sales per Day in January 2025")
    # plt.show()
except Exception as e:
    print(f"Query failed: {e}")
Other Useful Operations with pykylin
The SDK is powerful beyond just querying.

Getting a List of Cubes:
project_name = 'your_project_name' # Replace with your project name
cubes = client.get_cubes(project_name)
print("\nAvailable Cubes:")
for cube in cubes:
    print(f"- Cube Name: {cube['name']}, Description: {cube['description']}")
Getting Cube Metadata (Columns, Dims, Measures):
cube_name = 'your_kylin_cube_name' # Replace with your cube name
cube_meta = client.get_cube(cube_name)
print(f"\nCube Metadata for '{cube_name}':")
print(f"Dimensions: {[dim['name'] for dim in cube_meta['dimensions']]}")
print(f"Measures: {[measure['name'] for measure in cube_meta['measures']]}")
Triggering a Cube Build: This is useful for ETL automation. Note that building a large cube can take a long time.
# Get the cube object to find its unique name/ID
cube_object = next((c for c in client.get_cubes(project_name) if c['name'] == cube_name), None)
if cube_object:
    try:
        print(f"\nStarting build for cube: {cube_name}...")
        job_info = client.cube_build(cube_object['uuid'])  # use the cube's UUID
        print(f"Build job started. Job ID: {job_info['uuid']}")
        # You can then monitor the job status using job_info['uuid']
    except Exception as e:
        print(f"Failed to start build: {e}")
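Monitoring the job can also be done against Kylin's REST API directly. A minimal polling sketch, assuming the `GET /kylin/api/jobs/{job_id}` endpoint and its `job_status` field (check the API docs for your Kylin version):

```python
import time
import requests

# Terminal job states reported by Kylin's job API (names may differ
# slightly between Kylin versions).
TERMINAL_STATES = {"FINISHED", "ERROR", "DISCARDED"}

def is_terminal(status: str) -> bool:
    """Return True once a build job can no longer make progress."""
    return status.upper() in TERMINAL_STATES

def wait_for_job(base_url, job_id, headers, poll_seconds=30):
    """Poll the job endpoint until the job reaches a terminal state."""
    while True:
        resp = requests.get(f"{base_url}/jobs/{job_id}", headers=headers)
        resp.raise_for_status()
        status = resp.json().get("job_status", "")
        print(f"Job {job_id}: {status}")
        if is_terminal(status):
            return status
        time.sleep(poll_seconds)
```

A 30-second poll interval is a reasonable default, since cube builds usually run for minutes to hours.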
Method 2: Using requests (Direct API Calls)
If you can't install pykylin or need more control over the API calls, you can use the standard requests library to interact directly with Kylin's REST API. This is essentially what pykylin does under the hood.
Authentication
Kylin's REST API uses standard HTTP Basic authentication: every request carries base64-encoded "username:password" in the Authorization header. Calling the /user/authentication endpoint once is a convenient way to verify the credentials up front.
import base64
import json
import requests

# Configuration
KYLIN_URL = 'http://your-kylin-server:7070/kylin/api'
USERNAME = 'ADMIN'
PASSWORD = 'KYLIN'
CUBE_NAME = 'kylin_sales'  # despite the name, this is the fact table/model you query

# --- Step 1: Authenticate ---
# Kylin expects Basic auth: base64("username:password"). There is no
# separate API token to fetch.
credentials = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
headers = {
    'Content-Type': 'application/json',
    'Authorization': f'Basic {credentials}',
}

response = requests.post(f"{KYLIN_URL}/user/authentication", headers=headers)
if response.status_code == 200:
    print("Authentication successful.")
else:
    print(f"Authentication failed: {response.text}")
    raise SystemExit(1)
Querying a Cube
The endpoint for querying is /query. The request body contains the SQL query plus the target project, which the query API needs in order to resolve table names.
# --- Step 2: Execute a query ---
query_url = f"{KYLIN_URL}/query"
sql_query = f"""
SELECT part_dt, SUM(price) AS total_sales
FROM {CUBE_NAME}
GROUP BY part_dt
LIMIT 5
"""
query_payload = {
    "sql": sql_query,
    "project": "your_project_name",  # required by the query API
    "offset": 0,
    "limit": 50000,
    "acceptPartial": False,
}
response = requests.post(query_url, data=json.dumps(query_payload), headers=headers)
if response.status_code == 200:
    query_result = response.json()
    print("\nQuery Result (via requests):")
    # Rows are under 'results'; column names come from 'columnMetas'
    results = query_result['results']
    columns = [meta['label'] for meta in query_result['columnMetas']]
    for row in results:
        print(dict(zip(columns, row)))
else:
    print(f"Query failed: {response.text}")
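Since the REST response is plain JSON, it is easy to turn into a pandas DataFrame yourself. A small helper, assuming the response carries rows under 'results' and column metadata under 'columnMetas' (each entry with a 'label'), which is the shape recent Kylin versions return:

```python
import pandas as pd

def result_to_dataframe(query_result: dict) -> pd.DataFrame:
    """Build a DataFrame from a Kylin /query JSON response."""
    columns = [meta["label"] for meta in query_result["columnMetas"]]
    return pd.DataFrame(query_result["results"], columns=columns)

# Demo with a response shaped like Kylin's (the values are made up):
sample = {
    "columnMetas": [{"label": "PART_DT"}, {"label": "TOTAL_SALES"}],
    "results": [["2025-01-01", "1024.5"], ["2025-01-02", "998.0"]],
}
df = result_to_dataframe(sample)
print(df)
```

Note that Kylin serializes all cell values as strings, so cast numeric columns (e.g. with `pd.to_numeric`) before doing arithmetic on them.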
Method 3: Using JDBC (For Advanced Use Cases)
For some applications, especially those requiring a more "database-like" connection (e.g., using SQLAlchemy, pandas read_sql_table, or BI tools that connect via JDBC), you can use the JayDeBeApi library to connect to Kylin's JDBC driver.
Setup
- Download the JDBC Driver: Get the Kylin JDBC JAR file from your Kylin server's tools/jdbc directory or the official download site.
- Install JayDeBeApi:
pip install JayDeBeApi
Connection and Query
import jaydebeapi

# Configuration
jdbc_driver_path = '/path/to/kylin-jdbc-4.0.2.jar'  # <-- IMPORTANT: update this path
jdbc_url = 'jdbc:kylin://your-kylin-server:7070/your_project_name'
jdbc_user = 'ADMIN'
jdbc_password = 'KYLIN'
table_name = 'kylin_sales'  # fact table / model to query

try:
    # Establish the connection (the driver class is org.apache.kylin.jdbc.Driver)
    conn = jaydebeapi.connect(
        'org.apache.kylin.jdbc.Driver',
        jdbc_url,
        [jdbc_user, jdbc_password],
        jdbc_driver_path,
    )

    # Get a cursor and execute a query
    curs = conn.cursor()
    sql_query = f"SELECT part_dt, SUM(price) FROM {table_name} GROUP BY part_dt LIMIT 5"
    curs.execute(sql_query)

    # Fetch and print results
    print("\nQuery Result (via JDBC):")
    column_names = [desc[0] for desc in curs.description]
    print(column_names)
    for row in curs.fetchall():
        print(row)

    # Close the connection
    curs.close()
    conn.close()
except Exception as e:
    print(f"JDBC connection or query failed: {e}")
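Because jaydebeapi connections implement the standard DB-API (PEP 249), pandas can read from them directly with pd.read_sql. The sketch below uses an in-memory sqlite3 database as a stand-in so it runs without a Kylin server; swapping in the jaydebeapi connection from above is the same call (the kylin_sales table and its rows here are purely illustrative):

```python
import sqlite3
import pandas as pd

# sqlite3 stands in for the Kylin JDBC connection: both follow DB-API 2.0,
# so pd.read_sql works identically on either.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kylin_sales (part_dt TEXT, price REAL)")
conn.executemany(
    "INSERT INTO kylin_sales VALUES (?, ?)",
    [("2025-01-01", 10.0), ("2025-01-01", 5.0), ("2025-01-02", 7.5)],
)

# One call turns the query result into a DataFrame
df = pd.read_sql(
    "SELECT part_dt, SUM(price) AS total_sales "
    "FROM kylin_sales GROUP BY part_dt ORDER BY part_dt",
    conn,
)
print(df)
conn.close()
```

This is the main payoff of the JDBC route: any tool in the pandas/SQLAlchemy ecosystem that accepts a DB-API connection can treat Kylin like an ordinary database.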
Summary: Which Method to Use?
| Method | Pros | Cons | Best For |
|---|---|---|---|
| pykylin (SDK) | High-level and easy to use; returns pandas DataFrames; handles auth and API details. | Requires an extra library installation. | Almost all use cases: data analysis, ETL automation, and general Python scripting. This is the recommended approach. |
| requests (Direct API) | No extra libraries beyond requests; full control over API calls. | More verbose; requires manual handling of authentication and JSON parsing. | Quick scripts, systems where you can't add new dependencies, or learning the Kylin API. |
| JDBC | Integrates with tools that speak JDBC (SQLAlchemy, pandas, Tableau, Power BI). | More complex setup (requires the JDBC driver and a JVM); slower than the REST API for simple queries. | Connecting BI tools, or using Python's data ecosystem (like SQLAlchemy) to treat Kylin like a regular database. |
For any new project, start with pykylin. It provides the most Pythonic and efficient way to work with Kylin cubes.
