GitHub Copilot Instructions for StarRocks

Project Overview

StarRocks is a high-performance, cloud-native analytical database system designed for real-time analytics and ad-hoc queries. It features a streamlined architecture with both shared-nothing and shared-data deployment modes, supporting sub-second query performance for complex analytical workloads.

Key Technologies:

Backend (BE): C++ - Core analytical engine, storage layer, and query execution
Frontend (FE): Java - SQL parsing, query planning, metadata management, and coordination
Java Extensions: Java - External connectors and UDF framework
Testing: Python - Integration tests and SQL test framework

Architecture Components

Backend (be/) - C++

The core analytical engine responsible for data storage, processing, and query execution:

Core Components:

be/src/exec/ - Query execution operators (scan, join, aggregate, etc.)
be/src/storage/ - Storage engine (tablets, rowsets, segments, compaction)
be/src/exprs/ - Expression evaluation and vectorized computation
be/src/formats/ - Data format support (Parquet, ORC, CSV, JSON)
be/src/runtime/ - Runtime services (memory management, load balancing, stream processing)
be/src/connector/ - External data source connectors (Hive, Iceberg, Delta Lake)
be/src/service/ - RPC services and BE coordination
be/src/common/ - Shared utilities and common data structures

Performance Focus:

Vectorized query execution
Columnar storage format
Memory-efficient algorithms
SIMD optimizations where applicable

📋 Note: See be/.cursorrules for detailed backend component breakdown

Frontend (fe/) - Java

SQL interface and query coordination layer:

Core Components:

fe/fe-core/src/main/java/com/starrocks/
- sql/ - SQL parser, analyzer, and AST
- planner/ - Query planning and optimization (CBO)
- catalog/ - Metadata management (tables, partitions, statistics)
- scheduler/ - Query scheduling and execution coordination
- load/ - Data loading coordination (Broker Load, Stream Load, etc.)
- backup/ - Backup and restore functionality
- privilege/ - Authentication and authorization
- qe/ - Query execution coordination and session management
fe/fe-common/ - Common frontend utilities
fe/plugin-common/ - Plugin framework common components
fe/spark-dpp/ - Spark data preprocessing integration
fe/hive-udf/ - Hive UDF compatibility layer

Key Responsibilities:

Parse and validate SQL statements
Generate optimized query plans using Cost-Based Optimizer (CBO)
Manage cluster metadata and coordination
Handle user sessions and security

📋 Note: See fe/.cursorrules for detailed frontend component breakdown

Java Extensions (java-extensions/) - Java

External connectivity and extensibility:

Data Source Connectors:

hive-reader/ - Apache Hive integration
iceberg-metadata-reader/ - Apache Iceberg support
hudi-reader/ - Apache Hudi integration
paimon-reader/ - Apache Paimon support
jdbc-bridge/ - JDBC connectivity for external databases
odps-reader/ - Alibaba ODPS integration

Extension Framework:

udf-extensions/ - User-Defined Function framework
common-runtime/ - Shared runtime for extensions
hadoop-ext/ - Hadoop ecosystem integration

📋 Note: See java-extensions/.cursorrules for detailed extensions breakdown

Additional Important Directories

Generated Sources (gensrc/):

gensrc/proto/ - Protocol buffer definitions
gensrc/thrift/ - Thrift interface definitions
gensrc/script/ - Code generation scripts

Testing Framework (test/):

test/sql/ - SQL test cases organized by functionality
test/common/ - Common test utilities
test/lib/ - Test libraries and helpers

Tools and Utilities:

tools/ - Diagnostic tools, benchmarks, and utilities
bin/ - Binary executables and scripts
conf/ - Configuration files and templates
build-support/ - Build system support files
docker/ - Docker build configurations

Other Key Directories:

thirdparty/ - External dependencies and patches
fs_brokers/ - File system broker implementations
webroot/ - Web UI static files
format-sdk/ - Format SDK for data interchange

Coding Guidelines

C++ (Backend)

// Use modern C++ features (C++17/C++20)
// Follow Google C++ Style Guide conventions
// Use RAII for resource management
// Prefer smart pointers over raw pointers
// Use const-correctness

// Example: Vectorized processing pattern (illustrative sketch)
Status ColumnProcessor::process_batch(const ChunkPtr& chunk) {
    const auto& column = chunk->get_column_by_name("column_name");
    const size_t num_rows = chunk->num_rows();

    // Operate on the whole column in one tight loop (batch at a time),
    // not one row per call
    for (size_t i = 0; i < num_rows; ++i) {
        // Process element i of `column`
    }

    return Status::OK();
}

Java (Frontend)

// Follow Java coding conventions
// Use dependency injection where appropriate  
// Implement proper exception handling
// Use builder patterns for complex objects
// Follow existing naming conventions

// Example: Query planning pattern
// (simplified sketch; class, method, and constructor shapes are illustrative)
public class ScanNodePlanner extends OptExpressionVisitor<PlanFragment, ExecPlan> {
    @Override
    public PlanFragment visitLogicalScanOperator(
            OptExpression optExpression, ExecPlan context) {
        LogicalScanOperator scanOperator =
            (LogicalScanOperator) optExpression.getOp();

        // Create physical scan node
        ScanNode scanNode = createScanNode(scanOperator);
        return new PlanFragment(scanNode);
    }
}

⚠️ CRITICAL BUILD SYSTEM WARNING

DO NOT attempt to build or run unit tests (UT) for this project unless explicitly requested by the user.

The build system is extremely resource-intensive and time-consuming. Building the full project can take hours and requires significant system resources.

Specific commands and files to AVOID:

build.sh - Main build script (extremely resource-intensive)
build-in-docker.sh - Docker-based build
run-be-ut.sh / run-fe-ut.sh / run-java-exts-ut.sh - Unit test runners
docker-compose commands - Heavy resource usage
Makefile* - Make build files
pom.xml - Maven build files (for Java components)

Focus on code analysis and targeted changes instead of full builds.
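
The avoid list above could be enforced mechanically before a command is run. The following is a minimal sketch, not part of the StarRocks tooling: the function name `touches_build_system` and the choice to match on each word's base name are assumptions made for illustration.

```python
import fnmatch
import os
import shlex

# Patterns copied from the avoid list above; matched against each word's base name.
AVOID_PATTERNS = (
    "build.sh",
    "build-in-docker.sh",
    "run-be-ut.sh",
    "run-fe-ut.sh",
    "run-java-exts-ut.sh",
    "docker-compose",
    "Makefile*",
    "pom.xml",
)


def touches_build_system(command: str) -> bool:
    """Return True if any word of the shell command names an avoided script or file."""
    for word in shlex.split(command):
        name = os.path.basename(word)
        if any(fnmatch.fnmatchcase(name, pattern) for pattern in AVOID_PATTERNS):
            return True
    return False
```

This sketch only looks at names that appear literally in the command, so a bare `make` or `mvn` invocation with no file argument would not be caught.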

Important Guidelines

Pull Request Requirements

PR Title Format: Must include category prefix:

[BugFix] - Bug fixes and error corrections
[Feature] - New features and capabilities
[Enhancement] - Improvements to existing functionality
[Refactor] - Code refactoring without functional changes
[Test] - Test-related changes
[Doc] - Documentation updates
[Build] - Build system and CI/CD changes
[Performance] - Performance optimizations

Example: [Feature] Add Apache Paimon table format support
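
The title rule above can be checked with a short script. This is a minimal sketch under two assumptions not stated in the source: the prefix is case-sensitive, and exactly one space separates it from a non-empty description. The name `is_valid_pr_title` is hypothetical.

```python
import re

# Category prefixes copied from the list above.
PR_CATEGORIES = (
    "BugFix", "Feature", "Enhancement", "Refactor",
    "Test", "Doc", "Build", "Performance",
)

# Assumed shape: "[Category] " followed by a non-empty description.
_TITLE_RE = re.compile(r"^\[(%s)\] \S.*$" % "|".join(PR_CATEGORIES))


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title starts with one of the known category prefixes."""
    return _TITLE_RE.match(title) is not None
```

For instance, `is_valid_pr_title("[Feature] Add Apache Paimon table format support")` is accepted, while the same title without the bracketed prefix is rejected.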

Code Review Focus Areas

Performance Considerations:

Query execution efficiency
Memory usage patterns
Lock contention in concurrent scenarios
Network I/O optimization

Correctness Priorities:

SQL standard compliance
Data type handling accuracy
Transaction consistency
Error handling completeness

Security Considerations:

Input validation and sanitization
Authentication and authorization
Resource usage limits
Information leak prevention

Common Development Patterns

Adding New SQL Functions

Define function signature in fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java
Implement evaluation logic in be/src/exprs/
Add comprehensive tests in test/sql/test_functions/
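
The three steps above can be turned into a quick completeness check over the files a change touches. This is a hedged sketch: the helper `missing_function_steps` and the labels are hypothetical, and matching is by simple path prefix using the locations named in the steps.

```python
# Locations copied from the steps above, keyed by a short illustrative label.
FUNCTION_CHANGE_LOCATIONS = {
    "FE signature": "fe/fe-core/src/main/java/com/starrocks/catalog/FunctionSet.java",
    "BE implementation": "be/src/exprs/",
    "SQL tests": "test/sql/test_functions/",
}


def missing_function_steps(changed_files):
    """Return the labels of steps that no changed file falls under."""
    missing = []
    for label, location in FUNCTION_CHANGE_LOCATIONS.items():
        if not any(path.startswith(location) for path in changed_files):
            missing.append(label)
    return missing
```

Given a change that edits `FunctionSet.java` and a file under `be/src/exprs/` but adds no test, the function reports `["SQL tests"]` as the remaining step.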

Adding New Data Source Connectors

Implement connector interface in java-extensions/
Add metadata reader and schema handling
Integrate with query planner in fe/fe-core/src/main/java/com/starrocks/connector/
Add integration tests

Query Optimization Improvements

Analyze optimizer rules in fe/fe-core/src/main/java/com/starrocks/sql/optimizer/rule/
Update cost model if needed in fe/fe-core/src/main/java/com/starrocks/sql/optimizer/cost/
Add test cases in test/sql/test_optimizer/

Documentation References

Contributing Guide: CONTRIBUTING.md
Development Setup: StarRocks Documentation
Architecture Overview: README.md
PR Template: .github/PULL_REQUEST_TEMPLATE.md

Quick Reference

Key File Extensions:

.cpp, .h, .cc - C++ backend code
.java - Java frontend/extensions code
.sql - SQL test cases
.py - Python test scripts
.proto - Protocol buffer definitions
.thrift - Thrift interface definitions
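
The extension list above maps directly onto a lookup table. A minimal sketch, assuming a hypothetical helper named `classify` and treating any unlisted extension as "other":

```python
import os

# Mapping copied from the list above.
EXTENSION_KINDS = {
    ".cpp": "C++ backend code",
    ".h": "C++ backend code",
    ".cc": "C++ backend code",
    ".java": "Java frontend/extensions code",
    ".sql": "SQL test cases",
    ".py": "Python test scripts",
    ".proto": "Protocol buffer definitions",
    ".thrift": "Thrift interface definitions",
}


def classify(path: str) -> str:
    """Return the kind of file for a path, or "other" if the extension is not listed."""
    _, ext = os.path.splitext(path)
    return EXTENSION_KINDS.get(ext.lower(), "other")
```

The file names used to exercise it are placeholders: `classify("be/src/exec/scan_node.cpp")` yields the C++ backend label, and `classify("README.md")` falls through to "other".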

Important Configuration:

conf/ - Runtime configuration templates
gensrc/ - Auto-generated code from IDL definitions
thirdparty/ - External dependencies

Testing Structure:

test/sql/ - SQL correctness tests organized by functionality
be/test/ - C++ unit tests
fe/fe-core/src/test/ - Java unit tests

Build System Files to Avoid:

build.sh - Main build script (very resource-intensive)
build-in-docker.sh - Docker-based build
run-*-ut.sh - Unit test runners
Makefile* - Make build files
pom.xml - Maven build files (for Java components)

This project prioritizes performance, correctness, and scalability. When contributing, consider the impact on query performance and ensure changes maintain SQL standard compliance.
