aio-starrocks-best-practices
From plugin aio-starrocks ·
v1.0.2· Install:/plugin install aio-starrocks@aiocean-plugins
StarRocks Best Practices
Official best practices from docs.starrocks.io/docs/best_practices/.
Core design philosophy: "Designing for efficiency does more than improve query speed — it decreases costs by reducing storage, CPU, and object storage API costs."
Related Skills
/aio-starrocks-query-tuning— Query performance tuning, EXPLAIN plans, operator metrics, hints/starrocks— Query syntax, cluster connections, Grafana integration/starrocks-expert— General table types, data loading, query optimization
1. Partitioning
Partitioning enables coarse-grain data pruning via partition elimination AND metadata-only lifecycle operations (TTL, GDPR deletes, tenant isolation).
Partition Key Selection
- Time-first — If 80%+ of queries filter by time, lead with
date_trunc('day', dt) - Tenant isolation — Include
tenant_idwhen managing data per-tenant - Retention alignment — Include columns you'll purge via
DROP PARTITION - Composite keys — Creates
#tenants x #dayspartitions — keep total below ~100K to avoid FE memory strain
Granularity Decision
| Granularity | Best For | Advantage | Limitation |
|---|---|---|---|
| Daily | BI/reporting | 365 partitions/year; simple TTL | Coarse for sub-day queries |
| Hourly | IoT/burst workloads | Hot-spot isolation | 8,700 partitions/year |
| Weekly/Monthly | Historical archive | Minimal metadata | Less precise pruning |
Sizing Rule
- Each partition <= 100GB
- <= 20K tablets per partition (across replicas)
- Total partitions < 100K (FE memory constraint)
- Tablets per BE < 200K
DDL Templates
Single-tenant clickstream:
CREATE TABLE click_stream (
user_id BIGINT, event_time DATETIME, url STRING, ...)
DUPLICATE KEY(user_id, event_time)
PARTITION BY date_trunc('day', event_time)
DISTRIBUTED BY HASH(user_id) BUCKETS xxx;
Multi-tenant SaaS (recommended for sales-engine pattern):
CREATE TABLE metrics (
tenant_id INT, dt DATETIME, metric_name STRING, v DOUBLE)
PRIMARY KEY(tenant_id, dt, metric_name)
PARTITION BY date_trunc('DAY', dt)
DISTRIBUTED BY HASH(tenant_id) BUCKETS xxx;
Large-tenant composite (when single tenant > 100GB/partition):
CREATE TABLE activity (
tenant_id INT, dt DATETIME, id BIGINT, ....)
DUPLICATE KEY(dt, id)
PARTITION BY tenant_id, date_trunc('MONTH', dt)
DISTRIBUTED BY HASH(id) BUCKETS xxx;
Partitioning vs Bucketing
- Partitions = lifecycle management tools (TTL, DROP PARTITION, GDPR). Enable query-time partition pruning — skip entire data blocks.
- Buckets = parallelism levers. Distribute data within partitions for parallel scan/ingest.
2. Table Clustering (Sort Keys)
"A thoughtful sort-key is the highest-leverage physical-design knob in StarRocks."
Why Sort Keys Matter
Sort keys deliver compounding benefits across write, storage, and read:
- I/O elimination — Segment and page pruning via min/max metadata skips irrelevant data blocks
- Point lookups — Sparse prefix index enables millisecond queries on leading sort columns
- Sorted aggregation — Streaming aggregation (2-3x faster) when GROUP BY aligns with sort key
- Compression & caching — Sorted data improves encoding efficiency and CPU cache locality
Sort Key Selection Playbook
Decision hierarchy:
- Equality columns first — High-cardinality columns with frequent
=/INfilters - Range columns second — Timestamps or numeric ranges for temporal/value windows
- Aggregation helpers third — GROUP BY columns that enable sorted aggregation
Configuration Rules
| Rule | Guidance |
|---|---|
| Width | 3-5 columns max. Wider keys degrade ingest and exhaust the 36-byte prefix-index limit |
| Cardinality order | Low-cardinality before high-cardinality enhances compression |
| String columns | Long strings consume prefix-index bytes, blocking subsequent columns from indexing |
Reference Templates
| Scenario | Partition | Sort Key | Why |
|---|---|---|---|
| B2C Orders | date_trunc('day', order_ts) | (user_id, order_ts) | User-first filters; then temporal ranges |
| IoT Telemetry | date_trunc('day', ts) | (device_id, ts) | Device time-series dominates |
| SaaS Multi-Tenant | tenant_id | (dt, event_id) | Tenant isolation; dashboard time clustering |
| Dimension Lookup | none | (dim_id) | Point lookups only |
DDL Example
CREATE TABLE telemetry (
device_id VARCHAR,
ts DATETIME,
value DOUBLE
)
ENGINE=OLAP
PRIMARY KEY(device_id, ts)
PARTITION BY date_trunc('day', ts)
DISTRIBUTED BY HASH(device_id) BUCKETS 16
ORDER BY (device_id, ts);
Anti-Patterns
- Placing long string columns at the sort-key head (wastes prefix-index bytes)
- Overly wide sort keys (>5 columns)
- Misaligning partition and sort keys (defeating logical pruning order)
3. Bucketing (Distribution Strategy)
Quick Decision Framework
| Scenario | Choice | Rationale |
|---|---|---|
| Stable join/filter keys, high cardinality | Hash | Enables pruning, colocated joins, local aggregation |
| Write-heavy logs/events; multi-tenant | Random | Prevents skew, uniform throughput, elastic growth |
| Aggregate/Primary Key tables | Hash | Only option for these table types |
| Duplicate Key tables needing elasticity | Random | Auto-splits when bucket_size set |
Hash Bucketing
DISTRIBUTED BY HASH(column1, column2) BUCKETS n
PROPERTIES ("colocate_with" = "group_name")
Key requirements:
- Key must be stable, evenly distributed, high-cardinality
- Cardinality rule: >= 1000x the number of BE nodes to prevent skew
- Tablet sizing: Target 1-10 GB per tablet initially
- Tablets > 10GB = compaction efficiency degradation
Query optimizations enabled:
-- Tablet pruning: single tablet accessed
SELECT sum(amount) FROM sales WHERE customer_id = 123;
-- Local aggregation: no shuffle phase
SELECT customer_id, sum(amount) FROM sales GROUP BY customer_id;
-- Colocated join: no network shuffle between BEs
SELECT c.region, sum(s.amount)
FROM sales s JOIN customers c USING (customer_id)
WHERE s.sale_date BETWEEN '2025-01-01' AND '2025-01-31'
GROUP BY c.region;
Colocated join setup (matching key + bucket count required):
CREATE TABLE sales (
sale_id BIGINT, customer_id INT, sale_date DATE, amount DECIMAL(10,2))
DISTRIBUTED BY HASH(customer_id) BUCKETS 48
PARTITION BY date_trunc('DAY', sale_date)
PROPERTIES ("colocate_with" = "group1");
CREATE TABLE customers (
customer_id INT, region VARCHAR(32), status TINYINT)
DISTRIBUTED BY HASH(customer_id) BUCKETS 48
PROPERTIES ("colocate_with" = "group1");
Random Bucketing
DISTRIBUTED BY RANDOM
PROPERTIES ("bucket_size" = "1GB") -- Enables auto-split (v3.2+)
- Round-robin row assignment (no hash key)
- Auto tablet splitting when partition grows (requires
bucket_size) - Growth-only — no shrinking
- Limitation: Duplicate Key tables only
- Trade-off: No bucket pruning; every query scans all tablets in a partition; no colocated joins
Operational Maintenance
- Random: Always set
bucket_size(e.g., 1GB) for auto-split - Hash: Monitor tablet size; re-shard before tablets exceed 5-10 GB (
ALTER TABLE ... BUCKETS n) - Both: Watch for metadata bloat with excessive tablet counts
Anti-Patterns
- Low-cardinality hash keys — Creates hot tablets and imbalanced writes
- Undersizing initial buckets — Hampers ingestion parallelism and compaction
- Random bucketing for dimensional joins — Eliminates locality optimizations
- Ignoring
bucket_sizein Random mode — Tablets never split; metadata grows unbounded
4. Primary Key Table Tuning
Primary Key Index Types
| Type | Recommendation | Notes |
|---|---|---|
| Full in-memory | NOT recommended | Excessive memory waste |
| Local disk persistent | Standard option | Good for shared-nothing |
| Cloud-native persistent | Recommended for shared-data | Avoids disk capacity constraints, eliminates index rebuilds after rebalancing |
Key Design Principles
- Focus on uniqueness requirements during import/updates, NOT query acceleration
- Minimize column count and size (default max: 128 bytes)
- Use
ORDER BYclause separately for query optimization via sort keys
Resource Consumption Formula
- Storage:
(key_size + 8 bytes) x row_count x 50% - Memory:
min(l0_max_mem_usage x tablet_count, update_memory_limit_percent x BE_memory)
Memory Management
Monitor: http://be_ip:be_http_port/mem_tracker?type=update
Reduce import memory overhead:
l0_max_mem_usage = <value < 104857600> # Default 104857600 (100MB)
skip_pk_preload = true
transaction_apply_worker_count = <cpu_cores - n>
transaction_publish_version_worker_count = <cpu_cores - n>
Trade-off: Reduced memory increases I/O; fewer worker threads slow ingestion
Performance Balance
| Goal | Configuration |
|---|---|
| High freshness + low query latency | Increase compact_threads; decrease update_compaction_per_tablet_min_interval_seconds |
| Good freshness, limited resources | Raise lake_ingest_slowdown_threshold and lake_compaction_score_upper_bound |
| Good latency, limited resources | Reduce write frequency; batch larger data loads |
Monitoring
- Shared-data:
SHOW PROC '/transactions/{db}/running'for compaction slowdown messages - Shared-nothing: Monitor
tablet_max_versionsthreshold before ingestion failures
5. Authentication & Authorization
Three-Layer Access Control
- Identity Authentication — "I am who I claim to be" (user verification)
- Access Authentication — Group/role-based login eligibility to the cluster
- Operation Authorization — Query execution and data access permissions
Authentication Methods
| Method | User Storage | Setup | Best For |
|---|---|---|---|
| Native User | In-cluster | Manual creation | Small user bases |
| Security Integration | External system | Configuration-driven | Large enterprises |
Configuration Examples
Native user with external auth (LDAP):
CREATE USER <username> IDENTIFIED WITH authentication_ldap_simple
AS 'uid=tom,ou=company,dc=example,dc=com';
Security integration (LDAP):
CREATE SECURITY INTEGRATION <name> PROPERTIES (
"type" = "authentication_ldap_simple",
"authentication_ldap_simple_server_host" = "",
"authentication_ldap_simple_server_port" = "",
"authentication_ldap_simple_bind_base_dn" = "",
"authentication_ldap_simple_user_search_attr" = ""
);
ADMIN SET FRONTEND CONFIG (
"authentication_chain" = "<security_integration_name>"
);
Group provider + role grants:
CREATE GROUP PROVIDER <name> PROPERTIES (
"type" = "ldap",
"ldap_conn_url" = "",
"ldap_bind_root_dn" = "",
"ldap_bind_base_dn" = ""
);
GRANT <role> TO EXTERNAL GROUP <group_name>;
Solution Selection
- Full external control — Security Integration + Apache Ranger
- Minimal setup, native control — Security Integration + Internal RBAC
- Legacy — Native users with GRANT statements
Supported protocols: LDAP, OIDC, OAuth 2.0, JWT, native passwords
Critical: User IDs and group names must match across auth, group provider, and authorization systems. Mismatches cause permission failures.
6. Audit Log & Resource Groups
Core principle: Use data-driven resource allocation by analyzing starrocks_audit_db__.starrocks_audit_tbl__ rather than guesswork.
CPU Resource Allocation
Analyze per-user CPU consumption, allocate proportionally:
-- Aggregate cpuCostNs per user, last 30 days
SELECT user, SUM(cpuCostNs) / 1e9 AS cpu_seconds
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE state IN ('EOF', 'OK')
AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user
ORDER BY cpu_seconds DESC;
Configuration:
exclusive_cpu_cores— Cannot exceed single BE core count; sum across all groups <= BE totalcpu_weight— For soft-isolation groups; determines relative share on remaining cores
Rule of thumb: If a user is 16% of CPU on a 64-core BE, allocate ~11 cores.
Memory Management
-- Peak single-query memory per user
SELECT user, MAX(memCostBytes) / (1024 * 1024) AS peak_mem_mb
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE state IN ('EOF', 'OK')
AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY user;
big_query_mem_limit— Set high to avoid false-positive termination of legitimate large queriesmem_limit— Set high (e.g.,0.9for 90%)- Per-BE usage ~=
total_max_mem_mb / number_of_BEs
Concurrency Control
-- Peak concurrent queries per user per minute
SELECT user, DATE_FORMAT(timestamp, '%Y-%m-%d %H:%i') AS minute_bucket,
COUNT(*) AS concurrent_queries
FROM starrocks_audit_db__.starrocks_audit_tbl__
WHERE queryType = 'query' AND state IN ('EOF', 'OK')
AND timestamp >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user, minute_bucket
ORDER BY concurrent_queries DESC;
concurrency_limit— Set to 1.5x observed peak for headroom- For extreme spikes: enable Query Queues for load smoothing
- Minute-level analysis may underestimate per-second spikes
Materialized View Resource Isolation
Prevent async MV refreshes from degrading interactive queries:
CREATE RESOURCE GROUP rg_mv (
user = 'mv_user',
query_type IN ('insert', 'select')
)
WITH (
'cpu_weight' = '32',
'mem_limit' = '0.9',
'concurrency_limit' = '10',
'spill_mem_limit_threshold' = '0.5'
);
-- Assign to MV at creation
CREATE MATERIALIZED VIEW ... PROPERTIES ('resource_group' = 'rg_mv');
-- Or existing MV
ALTER MATERIALIZED VIEW ... SET ("resource_group" = "rg_mv");
Anti-Patterns
- Relying on guesswork instead of audit log analysis
- Setting
exclusive_cpu_coressum to exceed available BE cores - Using low
concurrency_limitwithout headroom buffer - Applying low
mem_limitthat terminates legitimate queries - Allowing MV refreshes to share resources with interactive workloads
Decision Matrix — Table Design by Use Case
| Decision | Single-Tenant Fact | Multi-Tenant SaaS | IoT/Events | Dimension Lookup |
|---|---|---|---|---|
| Partition | date_trunc('day', ts) | date_trunc('DAY', dt) | date_trunc('day', ts) | None |
| Sort Key | (user_id, ts) | (tenant_id, dt) | (device_id, ts) | (dim_id) |
| Distribution | HASH(user_id) | HASH(tenant_id) | RANDOM w/ bucket_size | HASH(dim_id) |
| Bucket Type | Hash (colocate joins) | Hash (tenant filter pruning) | Random (prevent skew) | Hash |
| PK Index | Local disk persistent | Cloud-native (elastic) | Cloud-native | Local disk |
| Achievement | Query on user; streaming agg | Tenant isolation; fast TTL | Write throughput; device clustering | Millisecond lookups |