[Doc] add automatic partitioning (#20577)

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

* add automatic partitioning

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>

---------

Signed-off-by: hellolilyliuyi <hellolilyliuyi123@163.com>
This commit is contained in:
hellolilyliuyi 2023-03-30 17:13:40 +08:00 committed by GitHub
parent 5249bc5931
commit e6f6ebc33b
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
3 changed files with 302 additions and 127 deletions

View File

@ -18,6 +18,7 @@
+ [Data models](./table_design/Data_model.md)
+ Data distribution
+ [Data distribution](./table_design/Data_distribution.md)
+ [Automatic partitioning](./table_design/automatic_partitioning.md)
+ [Dynamic partitioning](./table_design/dynamic_partitioning.md)
+ [Data compression](./table_design/data_compression.md)
+ [Sort keys and prefix indexes](./table_design/Sort_key.md)

View File

@ -168,155 +168,167 @@ If you intend to set the number of buckets, StarRocks 2.4 and later versions sup
### Create partitions
You can partition a table in threes ways by using `PARTITION BY RANGE` clause:
Since version 3.0, StarRocks supports [automatic partitioning](./automatic_partitioning.md) during data loading. You no longer need to create a large number of partitions in advance. This on-demand partition creation method can help you reduce the O&M costs.
- Partition a table with the LESS THAN clause. For more information, see [CREATE TABLE](../sql-reference/sql-statements/data-definition/CREATE%20TABLE.md).
If you need to create partitions in advance, you can use other partition creation methods, such as enabling dynamic partitioning and manually creating partitions.
```SQL
PARTITION BY RANGE (k1, k2, ...)
(
PARTITION partition_name1 VALUES LESS THAN ("value1", "value2", ...),
PARTITION partition_name2 VALUES LESS THAN ("value1", "value2", ...),
PARTITION partition_name3 VALUES LESS THAN (MAXVALUE)
)
```
- Dynamic partitioning
- Partition a table by specifying values of a fixed range. For more information, see CREATE TABLE.
StarRocks supports [dynamic partitioning](./dynamic_partitioning.md), which can automatically manage the time to live (TTL) of partitions, such as partitioning new input data in tables and deleting expired partitions. This feature significantly reduces maintenance costs.
```SQL
PARTITION BY RANGE (k1, k2, k3, ...)
(
PARTITION partition_name1 VALUES [("k1-lower1", "k2-lower1", "k3-lower1",...), ("k1-upper1", "k2-upper1", "k3-upper1", ...)],
PARTITION partition_name2 VALUES [("k1-lower1-2", "k2-lower1-2", ...), ("k1-upper1-2", MAXVALUE, )],
"k3-upper1-2", ...
)
```
- Manually create partitions
- Partition a table by specifying START, END, and EVERY. You can create multiple partitions at a time by using this method. For more information, see CREATE TABLE.
- Partition a table with the LESS THAN clause. For more information, see [CREATE TABLE](../sql-reference/sql-statements/data-definition/CREATE%20TABLE.md).
```SQL
PARTITION BY RANGE (k1, k2, ...)
(
START ("value1") END ("value2") EVERY (INTERVAL value3 day)
)
```
```SQL
PARTITION BY RANGE (k1, k2, ...)
(
PARTITION partition_name1 VALUES LESS THAN ("value1", "value2", ...),
PARTITION partition_name2 VALUES LESS THAN ("value1", "value2", ...),
PARTITION partition_name3 VALUES LESS THAN (MAXVALUE)
)
```
#### Examples
- Partition a table by specifying values of a fixed range. For more information, see [CREATE TABLE](../sql-reference/sql-statements/data-definition/CREATE%20TABLE.md).
The following examples demonstrate how to partition a table by specifying START, END, and EVERY.
```SQL
PARTITION BY RANGE (k1, k2, k3, ...)
(
PARTITION partition_name1 VALUES [("k1-lower1", "k2-lower1", "k3-lower1",...), ("k1-upper1", "k2-upper1", "k3-upper1", ...)],
PARTITION partition_name2 VALUES [("k1-lower1-2", "k2-lower1-2", ...), ("k1-upper1-2", MAXVALUE, )],
"k3-upper1-2", ...
)
```
- The data type of the partitioning column is DATE and you specify the time range of partitioning via START and END and define the time range via EVERY. Example:
- Create multiple partitions at a time.
```SQL
CREATE TABLE site_access (
datekey DATE,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (
START ("2021-01-01") END ("2021-01-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1"
);
```
Partition a table by specifying START, END, and EVERY. You can create multiple partitions at a time by using this method. For more information, see [CREATE TABLE](../sql-reference/sql-statements/data-definition/CREATE%20TABLE.md).
The PARTITION BY RANGE clause in this example is equal to the following:
```SQL
PARTITION BY RANGE (k1, k2, ...)
(
START ("value1") END ("value2") EVERY (INTERVAL value3 day)
)
```
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p20210101 VALUES [('2021-01-01'), ('2021-01-02')),
PARTITION p20210102 VALUES [('2021-01-02'), ('2021-01-03')),
PARTITION p20210103 VALUES [('2021-01-03'), ('2021-01-04'))
)
```
**Examples**
- The data type of the partitioning column is DATE and you specify different EVERY clauses for different time ranges (which cannot overlap with each other). Example:
The following examples demonstrate how to partition a table by specifying START, END, and EVERY.
```SQL
CREATE TABLE site_access(
datekey DATE,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey)
(
START ("2019-01-01") END ("2021-01-01") EVERY (INTERVAL 1 YEAR),
START ("2021-01-01") END ("2021-05-01") EVERY (INTERVAL 1 MONTH),
START ("2021-05-01") END ("2021-05-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES(
"replication_num" = "1"
);
```
- The data type of the partitioning column is DATE and you specify the time range of partitioning via START and END and define the time range via EVERY. Example:
The PARTITION BY RANGE clause in this example is equal to the following:
```SQL
CREATE TABLE site_access (
datekey DATE,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (
START ("2021-01-01") END ("2021-01-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1"
);
```
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p202101 VALUES [('2021-01-01'), ('2021-02-01')),
PARTITION p202102 VALUES [('2021-02-01'), ('2021-03-01')),
PARTITION p202103 VALUES [('2021-03-01'), ('2021-04-01')),
PARTITION p202104 VALUES [('2021-04-01'), ('2021-05-01')),
PARTITION p20210501 VALUES [('2021-05-01'), ('2021-05-02')),
PARTITION p20210502 VALUES [('2021-05-02'), ('2021-05-03')),
PARTITION p20210503 VALUES [('2021-05-03'), ('2021-05-04'))
)
```
The PARTITION BY RANGE clause in this example is equal to the following:
- The data type of the partitioning column is INT and you specify the value range of partitioning by using START and END and define the incremental value via EVERY. Example:
> Note: Do not double quote the incremental value defined by EVERY.
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p20210101 VALUES [('2021-01-01'), ('2021-01-02')),
PARTITION p20210102 VALUES [('2021-01-02'), ('2021-01-03')),
PARTITION p20210103 VALUES [('2021-01-03'), ('2021-01-04'))
)
```
```SQL
CREATE TABLE site_access (
datekey INT,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (START ("1") END ("5") EVERY (1)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1"
);
```
- The data type of the partitioning column is DATE and you specify different EVERY clauses for different time ranges (which cannot overlap with each other). Example:
The PARTITION BY RANGE clause in this example is equal to the following:
```SQL
CREATE TABLE site_access(
datekey DATE,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey)
(
START ("2019-01-01") END ("2021-01-01") EVERY (INTERVAL 1 YEAR),
START ("2021-01-01") END ("2021-05-01") EVERY (INTERVAL 1 MONTH),
START ("2021-05-01") END ("2021-05-04") EVERY (INTERVAL 1 DAY)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES(
"replication_num" = "1"
);
```
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p202101 VALUES [('2021-01-01'), ('2021-02-01')),
PARTITION p202102 VALUES [('2021-02-01'), ('2021-03-01')),
PARTITION p202103 VALUES [('2021-03-01'), ('2021-04-01')),
PARTITION p202104 VALUES [('2021-04-01'), ('2021-05-01')),
PARTITION p20210501 VALUES [('2021-05-01'), ('2021-05-02')),
PARTITION p20210502 VALUES [('2021-05-02'), ('2021-05-03')),
PARTITION p20210503 VALUES [('2021-05-03'), ('2021-05-04'))
)
```
The PARTITION BY RANGE clause in this example is equal to the following:
- After a table is created, you can use the [ALTER TABLE](../sql-reference/sql-statements/data-definition/ALTER%20TABLE.md) statement to add partitions for the table.
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p202101 VALUES [('2021-01-01'), ('2021-02-01')),
PARTITION p202102 VALUES [('2021-02-01'), ('2021-03-01')),
PARTITION p202103 VALUES [('2021-03-01'), ('2021-04-01')),
PARTITION p202104 VALUES [('2021-04-01'), ('2021-05-01')),
PARTITION p20210501 VALUES [('2021-05-01'), ('2021-05-02')),
PARTITION p20210502 VALUES [('2021-05-02'), ('2021-05-03')),
PARTITION p20210503 VALUES [('2021-05-03'), ('2021-05-04'))
)
```
```SQL
ALTER TABLE site_access
ADD PARTITIONS START ("2021-01-04") END ("2021-01-06") EVERY (INTERVAL 1 DAY);
```
- The data type of the partitioning column is INT and you specify the value range of partitioning by using START and END and define the incremental value via EVERY. Example:
> **NOTE**
>
> Do not double quote the incremental value defined by EVERY.
```SQL
CREATE TABLE site_access (
datekey INT,
site_id INT,
city_code SMALLINT,
user_name VARCHAR(32),
pv BIGINT DEFAULT '0'
)
ENGINE=olap
DUPLICATE KEY(datekey, site_id, city_code, user_name)
PARTITION BY RANGE (datekey) (START ("1") END ("5") EVERY (1)
)
DISTRIBUTED BY HASH(site_id) BUCKETS 10
PROPERTIES ("replication_num" = "1"
);
```
The PARTITION BY RANGE clause in this example is equal to the following:
```SQL
PARTITION BY RANGE (datekey) (
PARTITION p2019 VALUES [('2019-01-01'), ('2020-01-01')),
PARTITION p2020 VALUES [('2020-01-01'), ('2021-01-01')),
PARTITION p202101 VALUES [('2021-01-01'), ('2021-02-01')),
PARTITION p202102 VALUES [('2021-02-01'), ('2021-03-01')),
PARTITION p202103 VALUES [('2021-03-01'), ('2021-04-01')),
PARTITION p202104 VALUES [('2021-04-01'), ('2021-05-01')),
PARTITION p20210501 VALUES [('2021-05-01'), ('2021-05-02')),
PARTITION p20210502 VALUES [('2021-05-02'), ('2021-05-03')),
PARTITION p20210503 VALUES [('2021-05-03'), ('2021-05-04'))
)
```
- After a table is created, you can use the [ALTER TABLE](../sql-reference/sql-statements/data-definition/ALTER%20TABLE.md) statement to add partitions for the table.
```SQL
ALTER TABLE site_access
ADD PARTITIONS START ("2021-01-04") END ("2021-01-06") EVERY (INTERVAL 1 DAY);
```
### Add a partition

View File

@ -0,0 +1,162 @@
# Automatic partitioning
This topic describes how to create a table that supports automatic partitioning. This topic also describes the usage notes and limits of automatic partitioning.
## Introduction
To make partitions creation more easy to use and flexible, StarRocks supports the partitioning expression and automatic partitioning since version 3.0. You only need to specify a partition column of the DATE or DATETIME data type and a partition granularity (year, month, day, or hour) in the partition expression, which includes a time function. With this implicit partitioning method implemented by using a expression, you do not need to create a large number of partitions in advance. StarRocks automatically creates partitions when new data is written. It is recommended that you prioritize using automatic partitioning.
## Enable automatic partitioning
### Syntax
The `PARTITION BY` clause contains a function expression that specifies the partition granularity and partition column for automatic partitioning.
```SQL
PARTITION BY date_trunc(<time_unit>,<partition_column_name>)
...
[PROPERTIES("partition_live_number" = "xxx")];
```
Or
```SQL
PARTITION BY time_slice(<partition_column_name>,INTERVAL N <time_unit>[, boundary]))
...
[PROPERTIES("partition_live_number" = "xxx")];
```
### Parameters
- Functions: Currently, only the [date_trunc](../sql-reference/sql-functions/date-time-functions/date_trunc.md) and [time_slice](../sql-reference/sql-functions/date-time-functions/time_slice.md) functions are supported. If you use the function `time_slice`, you do not need to pass the `boundary` parameter. It is because in this scenario, the default and valid value for this parameter is `floor`, and the value can not be `ceil`.
- `time_unit`: the partition granularity, which can be `hour`, `day`, `month` or `year`. The `week` partition granularity is not supported. If the partition granularity is `hour`, the partition column must be of the DATETIME data type and cannot be of the DATE data type.
- `partition_column_name`: the name of the partition column. The partition type is RANGE, and therefore the partition column can only be of the DATE or DATETIME data type. Currently, you can specify only one partition column and multiple partition columns are not supported. If the `date_trunc` function is used, the partition column can be of the DATE or DATETIME data type. If the `time_slice` function is used, the partition column must be of the DATETIME data type. The partition column allows `NULL` values. If the partition column is of the DATE data type, the supported range is [0000-01-01~9999-12-31]. If the partition column is of the DATETIME data type, the supported range is [0000-01-01 01:01:019999-12-31 23:59:59].
- `partition_live_number`: the number of the most recent partitions to be retained. "Recent" refers to sorting the partitions in chronological order, and then keeping the number of partitions that counted backwards, while deleting the rest of the partitions. StarRocks schedules tasks to manage the number of partitions, and the scheduling interval can be configured through the FE dynamic parameter `dynamic_partition_check_interval_seconds`, which defaults to 600 seconds (10 minutes).
## Examples
Example 1: Use the `date_trunc` function to create a table that supports automatic partitioning. Set the partition granularity to `day` and the partition column to `event_day`.
```SQL
CREATE TABLE site_access (
event_day DATETIME NOT NULL,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY date_trunc('day', event_day)
DISTRIBUTED BY HASH(event_day, site_id)
PROPERTIES(
"replication_num" = "1"
);
```
When the following two data rows are inserted, StarRocks automatically creates two partitions, `p20230226` and `p20230227`, whose ranges are [2023-02-26 00:00:00, 2023-02-27 00:00:00) and [2023-02-27 00:00:00, 2023-02-28 00:00:00) respectively.
```SQL
-- insert two data rows
INSERT INTO site_access VALUES
("2023-02-26 20:12:04",002,"New York","Sam Smith",1),
("2023-02-27 21:06:54",001,"Los Angeles","Taylor Swift",1);
-- view partitions
SHOW PARTITIONS FROM site_access\G
*************************** 1. row ***************************
PartitionId: 135846228
PartitionName: p20230226
VisibleVersion: 2
VisibleVersionTime: 2023-03-22 14:50:17
VisibleVersionHash: 0
State: NORMAL
PartitionKey: event_day
Range: [types: [DATETIME]; keys: [2023-02-26 00:00:00]; ..types: [DATETIME]; keys: [2023-02-27 00:00:00]; )
DistributionKey: event_day, site_id
Buckets: 6
ReplicationNum: 1
StorageMedium: HDD
CooldownTime: 9999-12-31 23:59:59
LastConsistencyCheckTime: NULL
DataSize: 0B
IsInMemory: false
RowCount: 0
*************************** 2. row ***************************
PartitionId: 135846215
PartitionName: p20230227
VisibleVersion: 2
VisibleVersionTime: 2023-03-22 14:50:17
VisibleVersionHash: 0
State: NORMAL
PartitionKey: event_day
Range: [types: [DATETIME]; keys: [2023-02-27 00:00:00]; ..types: [DATETIME]; keys: [2023-02-28 00:00:00]; )
DistributionKey: event_day, site_id
Buckets: 6
ReplicationNum: 1
StorageMedium: HDD
CooldownTime: 9999-12-31 23:59:59
LastConsistencyCheckTime: NULL
DataSize: 0B
IsInMemory: false
RowCount: 0
2 rows in set (0.00 sec)
```
Example 2: Use the `date_trunc` function to create a table that supports automatic partitioning. Set the partition granularity to `month` and the partition column to `event_day`. Additionally, create some historical partitions before loading data and specify that the table only retains the most recent three partitions.
```SQL
CREATE TABLE site_access(
event_day DATETIME NOT NULL,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY date_trunc('month', event_day)(
START ("2022-06-01") END ("2022-12-01") EVERY (INTERVAL 1 month)
)
DISTRIBUTED BY HASH(event_day, site_id) BUCKETS 32
PROPERTIES(
"partition_live_number" = "3",
"replication_num" = "1"
);
```
Example 3: Use the `time_slice` function to create a table that supports automatic partitioning. Set the partition granularity to 7 days and the partition column to `event_day`.
```SQL
CREATE TABLE site_access(
event_day DATETIME NOT NULL,
site_id INT DEFAULT '10',
city_code VARCHAR(100),
user_name VARCHAR(32) DEFAULT '',
pv BIGINT DEFAULT '0'
)
DUPLICATE KEY(event_day, site_id, city_code, user_name)
PARTITION BY time_slice(event_day, INTERVAL 7 day)
DISTRIBUTED BY HASH(event_day, site_id) BUCKETS 32
PROPERTIES("replication_num" = "1");
```
## Usage notes
- StarRocks automatically creates partitions and sets the start time and end time of the partitions based on the loaded data and the automatic partitioning rule configured at table creation. For example, if the value of the partition column for the data row is `2015-06-05`, and the partition granularity is `month`, then a partition named `p201506` is created with a range of [2015-06-01, 2015-07-01) rather than [2015-06-05, 2015-07-05).
- For tables that support automatic partitioning, StarRocks sets the default maximum number of automatically created partitions to 4096, which can be configured by the FE parameter `max_automatic_partition_number`. This parameter can prevent you from accidentally creating too many partitions, such as when specifying a partition column of DATETIME type with a too-fine partition granularity like `hour`, which can generate a large number of partitions.
- During data loading, StarRocks automatically creates some partitions based on the loaded data, but if the load job fails for some reason, the partitions that are automatically created by StarRocks cannot be automatically deleted.
- Note that the `PARTITION BY` clause is only used to calculate the partition range for the loaded data and does not change the values of the data. For example, if the original data is `2023-02-27 21:06:54`, the function expression in `PARTITION BY date_trunc('day', event_day)` computes it as 2023-02-27 00:00:00 and infers that it belongs to the partition range [2023-02-27 00:00:00, 2023-02-28 00:00:00), but the data is still written as `2023-02-27 21:06:54`. If you want the written data's value to be the same as the start time of the partition range, you need to use the function specified in the `PARTITION BY` clause, such as `date_trunc`, on the `event_day` column when creating a load job.
- The naming rules for automatic partitioning are consistent with the naming rules for dynamic partitioning.
## Limits
- Only the range partitioning type is supported, whereas the list partitioning type is not supported.
- When a table that supports automatic partitioning is created, it is generally not recommended to create partitions in advance. If you need to create partitions in advance, you can create multiple partitions all at a time, as shown in the preceding Example 2. The statement in Example 2 has the following limits:
- The granularity of the partitions created in advance must be consistent with that of the automatically created partitions.
- When you configure automatic partitioning, you can only use the function `date_trunc` rather than `time_slice`.
- The syntax for creating multiple partitions all at a time only supports an interval of `1`.
- After a table that supports automatic partitioning is created, you can use `ALTER TABLE ADD PARTITION` to add partitions for that table. And the statement `ALTER TABLE ADD PARTITION` also has the above limits.
- Currently, StarRocks's shared-data mode does not support this feature.
- Currently, using CTAS to create a table that supports automatic partitioning is not supported.
- Currently, using Spark Load to load data to tables that support automatic partitioning is not supported.
- If you use automatic partitioning, you can only roll back your StarRocks cluster to version 2.5.4 or later.
- To view the specific information of automatically created partitions, use the SHOW PARTITIONS FROM statement, rather than the SHOW CREATE TABLE statement.