杰瑞科技汇

Python Kylin Cube如何高效构建与优化?

Of course! Let's dive deep into how to work with Kylin cubes using Python. This is a common task for data engineers and data scientists who need to query pre-aggregated data in a fast, interactive way.

Python Kylin Cube如何高效构建与优化?-图1
(图片来源网络,侵删)

We'll cover three primary methods, starting with the most common and recommended one.

What is a Kylin Cube?

First, a quick recap. Apache Kylin is an open-source, distributed analytical engine designed to provide sub-second query latency on massive datasets (petabytes). It achieves this by:

  1. Cube Build: Pre-computing and aggregating data from a data source (like Hadoop HDFS or cloud storage) into a multi-dimensional data structure called a Cube.
  2. Query Acceleration: When a query comes in, Kylin doesn't scan raw data. It uses the pre-calculated results from the cube to answer the query almost instantly.

Python is used to automate the process of building these cubes and, more frequently, to query them for applications like dashboards, reports, and data analysis.


Method 1: Using pykylin (The Recommended Python SDK)

The pykylin library is the official and most robust Python SDK for interacting with the Kylin API. It allows you to perform all major operations: querying, getting cube metadata, and managing the build process.

Python Kylin Cube如何高效构建与优化?-图2
(图片来源网络,侵删)

Installation

First, install the library:

pip install pykylin

Authentication & Connection

You need to connect to your Kylin instance, which requires credentials and the URL of the Kylin server.

from pykylin.client import KylinClient
# Replace with your Kylin instance details
kylin_server_url = 'http://your-kylin-server:7070'  # No trailing slash
username = 'ADMIN'
password = 'KYLIN'
# Create a client instance
client = KylinClient(kylin_server_url, username, password)
# Test the connection (optional but good practice)
try:
    client.get_all_projects()
    print("Successfully connected to Kylin!")
except Exception as e:
    print(f"Connection failed: {e}")

Querying a Cube (The Core Task)

This is the most common use case. You'll execute SQL-like queries against your cube.

# Define your SQL query against the cube
# Note: The cube name is specified in the FROM clause
sql_query = """
SELECT
    part_dt,
    sum(price) AS total_sales
FROM
    your_kylin_cube_name  <-- IMPORTANT: Replace with your cube name
WHERE
    part_dt >= '2025-01-01' AND part_dt <= '2025-01-31'
GROUP BY
    part_dt
ORDER BY
    total_sales DESC
LIMIT 10
"""
try:
    # Execute the query
    result_df = client.execute(sql_query)
    # The result is a pandas DataFrame, which is very convenient!
    print("\nQuery Result:")
    print(result_df)
    # You can now use the DataFrame for further analysis or plotting
    # For example, if you have matplotlib installed:
    # import matplotlib.pyplot as plt
    # result_df.plot(x='part_dt', y='total_sales', kind='bar')
    # plt.title("Total Sales per Day in January 2025")
    # plt.show()
except Exception as e:
    print(f"Query failed: {e}")

Other Useful Operations with pykylin

The SDK is powerful beyond just querying.

Python Kylin Cube如何高效构建与优化?-图3
(图片来源网络,侵删)

Getting a List of Cubes:

project_name = 'your_project_name' # Replace with your project name
cubes = client.get_cubes(project_name)
print("\nAvailable Cubes:")
for cube in cubes:
    print(f"- Cube Name: {cube['name']}, Description: {cube['description']}")

Getting Cube Metadata (Columns, Dims, Measures):

cube_name = 'your_kylin_cube_name' # Replace with your cube name
cube_meta = client.get_cube(cube_name)
print(f"\nCube Metadata for '{cube_name}':")
print(f"Dimensions: {[dim['name'] for dim in cube_meta['dimensions']]}")
print(f"Measures: {[measure['name'] for measure in cube_meta['measures']]}")

Triggering a Cube Build: This is useful for ETL automation. Note that building a large cube can take a long time.

# Get the cube object to find its unique name/ID
cube_object = next((c for c in client.get_cubes(project_name) if c['name'] == cube_name), None)
if cube_object:
    try:
        print(f"\nStarting build for cube: {cube_name}...")
        job_info = client.cube_build(cube_object['uuid']) # Use the cube's UUID
        print(f"Build job started. Job ID: {job_info['uuid']}")
        # You can then monitor the job status using job_info['uuid']
    except Exception as e:
        print(f"Failed to start build: {e}")

Method 2: Using requests (Direct API Calls)

If you can't install pykylin or need more control over the API calls, you can use the standard requests library to interact directly with Kylin's REST API. This is essentially what pykylin does under the hood.

Authentication

Kylin uses a session-based authentication. You first call the authentication endpoint to get a token, which you then include in all subsequent requests.

import requests
import json
# Configuration
KYLIN_URL = 'http://your-kylin-server:7070/kylin/api'
USERNAME = 'ADMIN'
PASSWORD = 'KYLIN'
CUBE_NAME = 'your_kylin_cube_name'
# --- Step 1: Authenticate and get a token ---
auth_url = f"{KYLIN_URL}/user/authentication"
auth_payload = {
    "credentials": {
        "username": USERNAME,
        "password": PASSWORD
    }
}
headers = {'Content-Type': 'application/json'}
response = requests.post(auth_url, data=json.dumps(auth_payload), headers=headers)
if response.status_code == 200:
    token = response.json()['uuid']
    print("Authentication successful. Token received.")
    # Add the token to the headers for future requests
    headers['Authorization'] = f'Basic {token}'
else:
    print(f"Authentication failed: {response.text}")
    exit()

Querying a Cube

The endpoint for querying is /query. The request body contains the SQL query.

# --- Step 2: Execute a query ---
query_url = f"{KYLIN_URL}/query"
sql_query = f"""
SELECT part_dt, sum(price) as total_sales
FROM {CUBE_NAME}
GROUP BY part_dt
LIMIT 5
"""
query_payload = {
    "sql": sql_query,
    "offset": 0,
    "limit": 50000,
    "accept": "application/json"
}
response = requests.post(query_url, data=json.dumps(query_payload), headers=headers)
if response.status_code == 200:
    query_result = response.json()
    print("\nQuery Result (via requests):")
    # The result is in 'results' key, with column names in 'columnNames'
    results = query_result['results']
    columns = query_result['columnNames']
    for row in results:
        print(dict(zip(columns, row)))
else:
    print(f"Query failed: {response.text}")

Method 3: Using JDBC (For Advanced Use Cases)

For some applications, especially those requiring a more "database-like" connection (e.g., using SQLAlchemy, pandas read_sql_table, or BI tools that connect via JDBC), you can use the JayDeBeApi library to connect to Kylin's JDBC driver.

Setup

  • Download the JDBC Driver: Get the Kylin JDBC JAR file from your Kylin server's tools/jdbc directory or the official download site.
  • Install JayDeBeApi:
    pip install JayDeBeApi

Connection and Query

import jaydebeapi
# Configuration
# Path to the downloaded Kylin JDBC driver JAR file
jdbc_driver_path = '/path/to/kylin-jdbc-4.0.2.jar' # <-- IMPORTANT: Update this path
jdbc_url = 'jdbc:kylin://your-kylin-server:7070/your_project_name'
jdbc_user = 'ADMIN'
jdbc_password = 'KYLIN'
try:
    # Establish the connection
    conn = jaydebeapi.connect(
        'org.apache.kylin.jdbc.KylinDriver',
        jdbc_url,
        [jdbc_user, jdbc_password],
        jdbc_driver_path
    )
    # Get a cursor
    curs = conn.cursor()
    # Execute a query
    sql_query = f"SELECT part_dt, SUM(price) FROM {CUBE_NAME} GROUP BY part_dt LIMIT 5"
    curs.execute(sql_query)
    # Fetch and print results
    print("\nQuery Result (via JDBC):")
    column_names = [desc[0] for desc in curs.description]
    print(column_names)
    for row in curs.fetchall():
        print(row)
    # Close the connection
    curs.close()
    conn.close()
except Exception as e:
    print(f"JDBC connection or query failed: {e}")

Summary: Which Method to Use?

Method Pros Cons Best For
pykylin (SDK) Official, high-level, easy to use. Returns pandas DataFrames. Handles auth and API details. Requires an extra library installation. Almost all use cases. Data analysis, ETL automation, and general Python scripting. This is the recommended approach.
requests (Direct API) No extra libraries needed (requests is standard). Full control over API calls. More verbose. Requires manual handling of authentication, tokens, and JSON parsing. Quick scripts, integration into systems where you can't add new dependencies, or for learning the Kylin API.
JDBC Allows integration with tools that speak JDBC (SQLAlchemy, pandas, Tableau, Power BI). More complex setup (requires JDBC driver). Slower than the REST API for simple queries. Connecting BI tools or using Python's data ecosystem (like SQLAlchemy ORM) to treat Kylin like a regular database.

For any new project, start with pykylin. It provides the most Pythonic and efficient way to work with Kylin cubes.

分享:
扫描分享到社交APP
上一篇
下一篇