CelerData Doctor v0.1

Overview

CelerData Doctor is a diagnostic script for StarRocks clusters. It collects metadata and performance indicators from a cluster to support troubleshooting, optimization, and capacity planning.

Key Features

  • Collects schema, partitions, and tablets for all tables in all databases.
  • Detects cluster architecture (shared-nothing, shared-data, hybrid).
  • Identifies tables with potential performance issues, including:
    • Incorrect replication factors
    • Oversized partitions
    • Empty tables and partitions
    • Too many tablets
    • Severe data skew (table and partition level)
    • Abnormal replica status
  • Generates a compressed .tar.gz file for easy sharing with support teams.
  • Summarizes findings in a single summary.txt for rapid assessment.
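
This README does not say how the script measures data skew. As an illustration only, one common heuristic compares the largest tablet with the average tablet size. The function names and the threshold of 3.0 below are hypothetical and are not taken from the script.

```python
def skew_ratio(tablet_sizes):
    """Return the largest tablet size divided by the mean size.

    A result of 1.0 means perfectly even tablets; larger values mean
    more skew. Returns 0.0 when there is no data at all.
    """
    sizes = list(tablet_sizes)
    total = sum(sizes)
    if not sizes or total == 0:
        return 0.0
    return max(sizes) / (total / len(sizes))


def is_skewed(tablet_sizes, threshold=3.0):
    """Flag a table or partition whose skew ratio exceeds `threshold`.

    The default threshold is an assumption chosen for this sketch.
    """
    return skew_ratio(tablet_sizes) > threshold
```

Passing the tablet sizes of a whole table gives a table-level check, and passing the tablets of one partition gives a partition-level check.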

Installation

Prerequisites

  • Python 3.8+
  • Required packages:
pip3 install mysql-connector-python

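The script imports the following modules. Only mysql.connector needs to be installed (via the package above); the others are part of the Python standard library: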

import mysql.connector
import os
import csv
from datetime import datetime
import tarfile
import shutil
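
To confirm that these modules are available before running the script, a small check along the following lines may help. It is a sketch and not part of the script; the names `REQUIRED_MODULES` and `missing_modules` are introduced here for illustration.

```python
import importlib.util

# Modules listed above. Only mysql.connector comes from a third-party
# package (mysql-connector-python); the rest ship with Python.
REQUIRED_MODULES = ["mysql.connector", "os", "csv", "datetime", "tarfile", "shutil"]


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names when the parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(absent) if absent else "none")
```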

Clone the Repository

git clone <REPO_URL>
cd celerdata-doctor

Usage

Run the script:

./celerdata-doctor.py

You will be prompted for the following connection details:

  • Host (default: 127.0.0.1)
  • Port (default: 9030)
  • Username (default: root)
  • Password (required)

Example:

Enter StarRocks host (default: 127.0.0.1):
Enter StarRocks port (default: 9030):
Enter StarRocks username (default: root):
Enter StarRocks password:
✅ Connected to StarRocks
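
The prompt flow shown above can be sketched as follows. This is an approximation, not the script's actual code: the helper names (`ask`, `collect_connection_details`, `connect`) are hypothetical, and treating an empty password as an error is one reading of "required". The connection itself uses `mysql.connector.connect`, since StarRocks is reached over the MySQL protocol.

```python
from getpass import getpass


def ask(label, default, input_fn=input):
    """Prompt for a value; an empty reply falls back to `default`."""
    reply = input_fn(f"Enter StarRocks {label} (default: {default}): ").strip()
    return reply or default


def collect_connection_details(input_fn=input, password_fn=getpass):
    """Gather host, port, username, and password as keyword arguments."""
    details = {
        "host": ask("host", "127.0.0.1", input_fn),
        "port": int(ask("port", "9030", input_fn)),
        "user": ask("username", "root", input_fn),
        # getpass hides the password while it is typed.
        "password": password_fn("Enter StarRocks password: "),
    }
    if not details["password"]:
        raise ValueError("A password is required")
    return details


def connect(details):
    """Open a connection with the collected details."""
    # Imported lazily so the helpers above work without the package.
    import mysql.connector

    return mysql.connector.connect(**details)
```

Accepting `input_fn` and `password_fn` as parameters keeps the prompts testable without a terminal; in normal use, `connect(collect_connection_details())` is enough.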

Output Structure

When it finishes, the script writes a compressed archive named:

starrocks_metadata_<timestamp>.tar.gz
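
Producing an archive of this kind with the standard `tarfile` and `shutil` modules could look roughly like this. It is a sketch under assumptions: the function names are hypothetical, the timestamp format is a guess, and whether the script deletes the uncompressed folder afterwards is not stated in this README.

```python
import os
import shutil
import tarfile
from datetime import datetime


def output_dir_name(now=None):
    """Build a folder name such as starrocks_metadata_20240101_120000.

    The timestamp format is an assumption made for this sketch.
    """
    now = now or datetime.now()
    return "starrocks_metadata_" + now.strftime("%Y%m%d_%H%M%S")


def archive_output(output_dir, remove_source=True):
    """Pack `output_dir` into `<output_dir>.tar.gz` and return that path."""
    output_dir = os.path.normpath(output_dir)
    archive_path = output_dir + ".tar.gz"
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the top-level folder name inside the archive.
        tar.add(output_dir, arcname=os.path.basename(output_dir))
    if remove_source:
        shutil.rmtree(output_dir)
    return archive_path
```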

Extracting this archive reveals:

starrocks_metadata_<timestamp>/
├── default_catalog/
│   ├── <database>/
│   │   ├── tables.txt
│   │   ├── <table>.sql
│   │   ├── <table>_partitions.csv
│   │   └── <table>_tablets.csv
└── performance_indicators/
    ├── backends.csv
    ├── frontends.csv
    ├── compute_nodes.csv
    ├── cluster_architecture.txt
    ├── oversized_tablets.csv
    ├── too_many_tablets.csv
    ├── data_skew.csv
    ├── partition_data_skew.csv
    ├── empty_tables.csv
    ├── empty_partitions.csv
    ├── replication_check.csv
    └── summary.txt
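
After extracting the archive, a quick way to see which checks produced findings is to count the rows in each file under performance_indicators/. The sketch below assumes each CSV starts with a header row, which this README does not confirm, and it avoids relying on any column names. The function names are hypothetical.

```python
import csv
from pathlib import Path


def count_csv_rows(root):
    """Map each CSV in performance_indicators/ to its number of data rows."""
    counts = {}
    for path in sorted(Path(root, "performance_indicators").glob("*.csv")):
        with path.open(newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumes the first row is a header; an empty file counts as zero.
        counts[path.name] = max(len(rows) - 1, 0)
    return counts


def read_summary(root):
    """Return the text of performance_indicators/summary.txt."""
    return Path(root, "performance_indicators", "summary.txt").read_text(encoding="utf-8")
```

For example, `count_csv_rows("starrocks_metadata_<timestamp>")` returns a dictionary keyed by file name, so a non-zero count for `empty_tables.csv` points to that file for details.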

Key Files

  • summary.txt - High-level overview of detected issues.
  • <table>.sql - Table schemas.
  • <table>_partitions.csv - Partition information.
  • <table>_tablets.csv - Tablet metadata.

Known Limitations

  • Does not currently detect tables with incomplete or failed replicas.
  • Assumes all databases in a catalog should be included (no exclusion filter).
  • Assumes partition and tablet data is available via information_schema (requires appropriate StarRocks version).

Future Improvements

  • Add more comprehensive health checks (e.g., data corruption, index issues).
  • Integrate with Prometheus metrics for real-time insights.
  • Support for multi-cluster environments.
  • Option to generate HTML or PDF reports for non-technical stakeholders.

Contributing

Contributions are welcome! Please submit pull requests or open issues on GitHub.

License

Apache License, Version 2.0