[Doc] Modify files by feedbacks (#28284)

Signed-off-by: EsoragotoSpirit <wanglichen@starrocks.com>
絵空事スピリット 2023-07-31 15:29:35 +08:00 committed by GitHub
parent 18abed2130
commit 8fd8e2a064
2 changed files with 79 additions and 59 deletions

View File

@@ -2,7 +2,7 @@
This topic describes how to load data into StarRocks by using the SQL statement INSERT.
Similar to MySQL and many other database management systems, StarRocks supports loading data into an internal table with INSERT. You can insert one or more rows directly with the VALUES clause to test a function or create a demo. You can also insert data defined by the results of a query into an internal table from an [external table](../data_source/External_table.md). From StarRocks v3.1 onwards, you can directly load data from files on cloud storage using the INSERT command and the table function [FILES()](../sql-reference/sql-functions/table-functions/files.md).
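For instance, the two basic forms look like the following minimal sketch, where `t` and `src` are hypothetical tables with matching schemas:

```SQL
-- Insert literal rows directly with the VALUES clause.
INSERT INTO t VALUES (1, 'starrocks'), (2, 'insert');

-- Insert the result set of a query on another table.
INSERT INTO t SELECT * FROM src;
```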
StarRocks v2.4 further supports overwriting data in a table by using INSERT OVERWRITE. The INSERT OVERWRITE statement integrates the following operations to implement the overwrite function:
@@ -35,20 +35,19 @@ CREATE DATABASE IF NOT EXISTS load_test;
USE load_test;
CREATE TABLE insert_wiki_edit
(
event_time DATETIME,
channel VARCHAR(32) DEFAULT '',
user VARCHAR(128) DEFAULT '',
is_anonymous TINYINT DEFAULT '0',
is_minor TINYINT DEFAULT '0',
is_new TINYINT DEFAULT '0',
is_robot TINYINT DEFAULT '0',
is_unpatrolled TINYINT DEFAULT '0',
delta INT DEFAULT '0',
added INT DEFAULT '0',
deleted INT DEFAULT '0'
)
DUPLICATE KEY(
event_time,
channel,
user,
@@ -58,8 +57,7 @@ DUPLICATE KEY
is_robot,
is_unpatrolled
)
PARTITION BY RANGE(event_time)(
PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'),
PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'),
PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'),
@@ -69,31 +67,28 @@ DISTRIBUTED BY HASH(user);
CREATE TABLE source_wiki_edit
(
event_time DATETIME,
channel VARCHAR(32) DEFAULT '',
user VARCHAR(128) DEFAULT '',
is_anonymous TINYINT DEFAULT '0',
is_minor TINYINT DEFAULT '0',
is_new TINYINT DEFAULT '0',
is_robot TINYINT DEFAULT '0',
is_unpatrolled TINYINT DEFAULT '0',
delta INT DEFAULT '0',
added INT DEFAULT '0',
deleted INT DEFAULT '0'
)
DUPLICATE KEY(
event_time,
channel,
user,
is_anonymous,
is_minor,
is_new,
is_robot,
is_unpatrolled
)
PARTITION BY RANGE(event_time)(
PARTITION p06 VALUES LESS THAN ('2015-09-12 06:00:00'),
PARTITION p12 VALUES LESS THAN ('2015-09-12 12:00:00'),
PARTITION p18 VALUES LESS THAN ('2015-09-12 18:00:00'),
@@ -126,7 +121,7 @@ VALUES
## Insert data via INSERT INTO SELECT
You can load the result of a query on a data source table into the target table via the INSERT INTO SELECT command. INSERT INTO SELECT performs ETL operations on the data from the data source table and loads the data into an internal table in StarRocks. The data source can be one or more internal or external tables, or even data files on cloud storage. The target table MUST be an internal table in StarRocks. For detailed instructions and parameter references, see [SQL Reference - INSERT](../sql-reference/sql-statements/data-manipulation/insert.md).
### Insert data from an internal or external table into an internal table
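For example, the following minimal sketch loads every row of `source_wiki_edit` (created in the preparations above) into `insert_wiki_edit`; the two tables share the same schema:

```SQL
-- Load the full result set of the query into the target table.
INSERT INTO insert_wiki_edit
SELECT * FROM source_wiki_edit;
```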
@@ -187,18 +182,7 @@ SELECT event_time, channel FROM source_wiki_edit;
### Insert data directly from files in an external source using FILES()
From v3.1 onwards, StarRocks supports directly loading data from files on cloud storage using the INSERT command and the [FILES()](../sql-reference/sql-functions/table-functions/files.md) function, so you do not need to create an external catalog or file external table first. In addition, FILES() can automatically infer the table schema of the files, which greatly simplifies data loading.
The following example inserts data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table `insert_wiki_edit`:
@@ -209,7 +193,7 @@ INSERT INTO insert_wiki_edit
"format" = "parquet",
"aws.s3.access_key" = "XXXXXXXXXX",
"aws.s3.secret_key" = "YYYYYYYYYY",
"aws.s3.region" = "ap-southeast-1"
"aws.s3.region" = "us-west-2"
);
```
@@ -336,7 +320,7 @@ SELECT event_time, channel FROM source_wiki_edit;
## Insert data into a table with generated columns
A generated column is a special column whose value is derived from a pre-defined expression or evaluation based on other columns. Generated columns are especially useful when your queries involve evaluations of expensive expressions, for example, querying a certain field from a JSON value or calculating ARRAY data. StarRocks evaluates the expression and stores the results in the generated columns while data is being loaded into the table, thereby avoiding expression evaluation during queries and improving query performance.
You can load data into a table with generated columns using INSERT.
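For instance, here is a minimal sketch: a hypothetical table `gen_col_demo` whose generated column extracts a field from a JSON column (following the v3.1 generated-column syntax), and an INSERT that loads only the regular columns:

```SQL
CREATE TABLE gen_col_demo
(
    id INT NOT NULL,
    data_json JSON NOT NULL,
    -- Generated column: evaluated once at load time instead of at query time.
    field_a String AS json_string(json_query(data_json, "$.a"))
)
PRIMARY KEY (id)
DISTRIBUTED BY HASH(id);

-- List only the regular columns; StarRocks computes `field_a` during loading.
INSERT INTO gen_col_demo (id, data_json)
VALUES (1, parse_json('{"a": 1, "b": 2}'));
```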

View File

@@ -2,7 +2,7 @@
## Description
Reads a data file from cloud storage. FILES() accesses the cloud storage with the path-related properties of the file, infers the table schema of the data in the file, and returns the data rows. You can directly query the data rows using [SELECT](../../sql-statements/data-manipulation/SELECT.md), load the data rows into an existing table using [INSERT](../../sql-statements/data-manipulation/insert.md), or create a new table and load the data rows into it using [CREATE TABLE AS SELECT](../../sql-statements/data-definition/CREATE%20TABLE%20AS%20SELECT.md). This feature is supported from v3.1.0 onwards.
Currently, the FILES() function supports the following data sources and file formats:
@@ -18,20 +18,41 @@ Currently, the FILES() function supports the following data sources and file formats:
## Syntax
```SQL
FILES("key" = "value"...)
FILES( data_location , data_format [, StorageCredentialParams ] )
data_location ::=
"path" = "s3://<s3_path>"
data_format ::=
"format" = "{parquet | orc}"
```
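Assembled, a call takes the following shape; the bucket, object path, and credential values are placeholders:

```SQL
FILES(
    -- data_location: where the file lives.
    "path" = "s3://<bucket>/<path_to_file>/<file_name>.parquet",
    -- data_format: how to parse it.
    "format" = "parquet",
    -- StorageCredentialParams: how StarRocks authenticates (see below).
    "aws.s3.access_key" = "<access_key>",
    "aws.s3.secret_key" = "<secret_key>",
    "aws.s3.region" = "<region>"
)
```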
## Parameters
All parameters are passed in `"key" = "value"` pairs.
| **Key** | **Required** | **Description** |
| ------- | ------------ | ------------------------------------------------------------ |
| path | Yes | The URI used to access the file. Example: `s3://testbucket/parquet/test.parquet`. |
| format | Yes | The format of the data file. Valid values: `parquet` and `orc`. |
### StorageCredentialParams
The authentication information used by StarRocks to access your storage system.
- Use the IAM user-based authentication to access AWS S3:
```SQL
"aws.s3.access_key" = "xxxxxxxxxx",
"aws.s3.secret_key" = "yyyyyyyyyy",
"aws.s3.region" = "<s3_region>"
```
| **Key** | **Required** | **Description** |
| ----------------- | ------------ | ------------------------------------------------------------ |
| aws.s3.access_key | Yes | The Access Key ID that you can use to access the Amazon S3 bucket. |
| aws.s3.secret_key | Yes | The Secret Access Key that you can use to access the Amazon S3 bucket. |
| aws.s3.region | Yes | The region in which your AWS S3 bucket resides. Example: `us-west-2`. |
## Examples
@@ -43,7 +64,7 @@ MySQL > SELECT * FROM FILES(
"format" = "parquet",
"aws.s3.access_key" = "XXXXXXXXXX",
"aws.s3.secret_key" = "YYYYYYYYYY",
"aws.s3.region" = "ap-southeast-1"
"aws.s3.region" = "us-west-2"
);
+------+---------------------------------------------------------+
| c1 | c2 |
@@ -63,8 +84,23 @@ MySQL > INSERT INTO insert_wiki_edit
"format" = "parquet",
"aws.s3.access_key" = "XXXXXXXXXX",
"aws.s3.secret_key" = "YYYYYYYYYY",
"aws.s3.region" = "ap-southeast-1"
"aws.s3.region" = "us-west-2"
);
Query OK, 2 rows affected (23.03 sec)
{'label':'insert_d8d4b2ee-ac5c-11ed-a2cf-4e1110a8f63b', 'status':'VISIBLE', 'txnId':'2440'}
```
Example 3: Create a table named `ctas_wiki_edit` and insert the data rows from the Parquet file **parquet/insert_wiki_edit_append.parquet** within the AWS S3 bucket `inserttest` into the table:
```Plain
MySQL > CREATE TABLE ctas_wiki_edit AS
SELECT * FROM FILES(
"path" = "s3://inserttest/parquet/insert_wiki_edit_append.parquet",
"format" = "parquet",
"aws.s3.access_key" = "XXXXXXXXXX",
"aws.s3.secret_key" = "YYYYYYYYYY",
"aws.s3.region" = "us-west-2"
);
Query OK, 2 rows affected (22.09 sec)
{'label':'insert_1a217d70-2f52-11ee-9e4a-7a563fb695da', 'status':'VISIBLE', 'txnId':'3248'}
```
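Because FILES() infers the table schema from the Parquet file itself, you can check what the CTAS statement produced; this follow-up is illustrative, and the actual output depends on your data file:

```Plain
MySQL > DESC ctas_wiki_edit;
```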