---
displayed_sidebar: English
unlisted: true
---
Technical documentation typically has links to other documents all over the place. When you look at this document you may notice that there are few links out from the page, and that almost all of the links are at the bottom of the doc in the more information section. Not every keyword needs to be linked out to another page; assume that the reader knows what `CREATE TABLE` means, and that if they do not know, they can click in the search bar and find out. It is fine to pop a note in the docs to tell the reader that there are other options and that the details are described in the more information section; this lets the people who need the information know that they can read it later, after they accomplish the task at hand.
This template is based on the process to load data from Amazon S3; some parts of it will not be applicable to loading from other sources. Please concentrate on the flow of this template and do not worry about including every section. The flow is meant to be:
Introductory text that lets the reader know what the end result will be if they follow this guide. In the case of the S3 doc, the end result is "Getting data loaded from S3 in either an asynchronous manner, or a synchronous manner."
- A description of the business problem solved with the technique
- Advantages and the disadvantages (if any) of the method(s) described
Diagrams or images can be helpful. If you are describing a technique that is complex and an image helps, then use one. If you are describing a technique that produces something visual (for example, the use of Superset to analyze data), then definitely include an image of the end product.
Use a data flow diagram if the flow is non-obvious. When a command causes StarRocks to run several processes and combine the output of those processes and then manipulate the data it is probably time for a description of the data flow. In this template there are two methods for loading data described. One of them is simple, and has no data flow section; the other is more complicated (StarRocks is handling the complex work, not the user!), and the complex option includes a data flow section.
Note that examples should come before syntax details and other deep technical details. Many readers will be coming to the docs to find a particular technique that they can copy, paste, and modify.
If possible, give an example that will work and includes a dataset to use. The example in this template uses a dataset stored in S3 that anyone who has an AWS account and can authenticate with a key and secret can use. Providing a dataset makes the examples more valuable to the reader because they can fully experience the described technique.
Make sure that the example works as written. This implies two things:
- you have run the commands in the order presented
- you have included the necessary prerequisites. For example, if your example refers to database `foo`, then you probably need to preface it with `CREATE DATABASE foo;` and `USE foo;`, as shown in the sketch below.
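A minimal sketch of such a prerequisite block, using the hypothetical database `foo` from the example above (with `IF NOT EXISTS` added so the block is rerunnable):

```sql
-- Prerequisites for an example that references database foo
CREATE DATABASE IF NOT EXISTS foo;
USE foo;
```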
Verification is so important. If the process that you are describing includes several steps, then include a verification step whenever something should have been accomplished; this helps avoid having the reader get to the end and realize that they had a typo in step 10. In this example, the check progress and `DESCRIBE user_behavior_inferred;` steps are for verification.
At the end of the template there is a spot to put links to related information including the ones to optional information that you mentioned in the main body.
The template notes are intentionally formatted differently than the way we format documentation notes to bring them to your attention when you are working through the template. Please remove the bold italic notes as you go along:
***Note: descriptive text***
Note: If there are multiple recommended choices, tell the reader this in the intro. For example, when loading from S3 there are options for both synchronous and asynchronous loading:

StarRocks provides two options for loading data from S3:

- Asynchronous loading using Broker Load
- Synchronous loading using the `FILES()` table function
Note: Tell the reader WHY they would choose one choice over the other:
Small datasets are often loaded synchronously using the `FILES()` table function, and large datasets are often loaded asynchronously using Broker Load. The two methods have different advantages and are described below.
NOTE
You can load data into StarRocks tables only as a user who has the INSERT privilege on those StarRocks tables. If you do not have the INSERT privilege, follow the instructions provided in GRANT to grant the INSERT privilege to the user that you use to connect to your StarRocks cluster.
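A minimal sketch of such a grant, assuming a hypothetical user `jack` who connects from any host (see the GRANT documentation for the full privilege model):

```sql
-- Hypothetical example: allow user jack to load data into the table used in this guide
GRANT INSERT ON TABLE project.user_behavior TO USER 'jack'@'%';
```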
An asynchronous Broker Load process handles making the connection to S3, pulling the data, and storing the data in StarRocks.
- Broker Load supports data transformation, UPSERT, and DELETE operations during loading.
- Broker Load runs in the background and clients don't need to stay connected for the job to continue.
- Broker Load is preferred for long-running jobs; the default timeout is 4 hours.
- In addition to Parquet and ORC file formats, Broker Load supports CSV files.
Note: Processes that involve multiple components or steps may be easier to understand with a diagram. This example includes a diagram that helps describe the steps that happen when a user chooses the Broker Load option.
- The user creates a load job.
- The frontend (FE) creates a query plan and distributes the plan to the backend nodes (BE).
- The backend (BE) nodes pull the data from the source and load the data into StarRocks.
Create a table, start a load process that pulls a Parquet file from S3, and verify the progress and success of the data loading.
NOTE
The examples use a sample dataset in Parquet format. If you want to load a CSV or ORC file, that information is linked at the bottom of this page.
Create a database for your table:
```sql
CREATE DATABASE IF NOT EXISTS project;
USE project;
```
Create a table. This schema matches a sample dataset in an S3 bucket hosted in a StarRocks account.
```sql
DROP TABLE IF EXISTS user_behavior;

CREATE TABLE `user_behavior` (
    `UserID` int(11),
    `ItemID` int(11),
    `CategoryID` int(11),
    `BehaviorType` varchar(65533),
    `Timestamp` datetime
) ENGINE=OLAP
DUPLICATE KEY(`UserID`)
DISTRIBUTED BY HASH(`UserID`)
PROPERTIES (
    "replication_num" = "1"
);
```
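Per the verification advice in the template notes, it is cheap to confirm the table was created with the intended schema before loading:

```sql
DESCRIBE user_behavior;
```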
NOTE
The examples use IAM user-based authentication. Other authentication methods are available and linked at the bottom of this page.
Loading data from S3 requires having the following:
- S3 bucket
- S3 object keys (object names) if accessing a specific object in the bucket. Note that the object key can include a prefix if your S3 objects are stored in sub-folders. The full syntax is linked in the more information section.
- S3 region
- Access key and secret
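For the sample dataset used below, these map as follows: in the URI `s3://starrocks-examples/user_behavior_sample_data.parquet`, the bucket is `starrocks-examples` and the object key is `user_behavior_sample_data.parquet`; the region, access key, and secret appear in the `BROKER` clause of the load job.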
This job has four main sections:
- `LABEL`: A string used when querying the state of a `LOAD` job.
- `LOAD` declaration: The source URI, the destination table, and the source data format.
- `BROKER`: The connection details for the source.
- `PROPERTIES`: The timeout value and any other properties to apply to this job.
NOTE
The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below.
```sql
LOAD LABEL user_behavior
(
    DATA INFILE("s3://starrocks-examples/user_behavior_sample_data.parquet")
    INTO TABLE user_behavior
    FORMAT AS "parquet"
)
WITH BROKER
(
    "aws.s3.enable_ssl" = "true",
    "aws.s3.use_instance_profile" = "false",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
)
PROPERTIES
(
    "timeout" = "72000"
);
```
Query the `information_schema.loads` table to track progress. If you have multiple `LOAD` jobs running, you can filter on the `LABEL` associated with the job. In the output below there are two entries for the load job `user_behavior`. The first record shows a state of `CANCELLED`; scroll to the end of the output and you see that `listPath failed`. The second record shows success with a valid AWS IAM access key and secret.
```sql
SELECT * FROM information_schema.loads;

SELECT * FROM information_schema.loads WHERE LABEL = 'user_behavior';
```

```plaintext
JOB_ID|LABEL |DATABASE_NAME|STATE |PROGRESS |TYPE |PRIORITY|SCAN_ROWS|FILTERED_ROWS|UNSELECTED_ROWS|SINK_ROWS|ETL_INFO|TASK_INFO |CREATE_TIME |ETL_START_TIME |ETL_FINISH_TIME |LOAD_START_TIME |LOAD_FINISH_TIME |JOB_DETAILS |ERROR_MSG |TRACKING_URL|TRACKING_SQL|REJECTED_RECORD_PATH|
------+-------------------------------------------+-------------+---------+-------------------+------+--------+---------+-------------+---------------+---------+--------+----------------------------------------------------+-------------------+-------------------+-------------------+-------------------+-------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------+------------+------------+--------------------+
10121|user_behavior |project |CANCELLED|ETL:N/A; LOAD:N/A |BROKER|NORMAL | 0| 0| 0| 0| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:59:30| | | |2023-08-10 14:59:34|{"All backends":{},"FileNumber":0,"FileSize":0,"InternalTableLoadBytes":0,"InternalTableLoadRows":0,"ScanBytes":0,"ScanRows":0,"TaskNumber":0,"Unfinished backends":{}} |type:ETL_RUN_FAIL; msg:listPath failed| | | |
10106|user_behavior |project |FINISHED |ETL:100%; LOAD:100%|BROKER|NORMAL | 86953525| 0| 0| 86953525| |resource:N/A; timeout(s):72000; max_filter_ratio:0.0|2023-08-10 14:50:15|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:50:19|2023-08-10 14:55:10|{"All backends":{"a5fe5e1d-d7d0-4826-ba99-c7348f9a5f2f":[10004]},"FileNumber":1,"FileSize":1225637388,"InternalTableLoadBytes":2710603082,"InternalTableLoadRows":86953525,"ScanBytes":1225637388,"ScanRows":86953525,"TaskNumber":1,"Unfinished backends":{"a5| | | | |
```

You can also check a subset of the data at this point.
```sql
SELECT * FROM user_behavior LIMIT 10;
```

```plaintext
UserID|ItemID|CategoryID|BehaviorType|Timestamp |
------+------+----------+------------+-------------------+
171146| 68873| 3002561|pv |2017-11-30 07:11:14|
171146|146539| 4672807|pv |2017-11-27 09:51:41|
171146|146539| 4672807|pv |2017-11-27 14:08:33|
171146|214198| 1320293|pv |2017-11-25 22:38:27|
171146|260659| 4756105|pv |2017-11-30 05:11:25|
171146|267617| 4565874|pv |2017-11-27 14:01:25|
171146|329115| 2858794|pv |2017-12-01 02:10:51|
171146|458604| 1349561|pv |2017-11-25 22:49:39|
171146|458604| 1349561|pv |2017-11-27 14:03:44|
171146|478802| 541347|pv |2017-12-02 04:52:39|
```
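Beyond spot-checking rows, you can compare the table's row count with the `SINK_ROWS` value reported in `information_schema.loads` (a sketch; the expected count comes from the `FINISHED` record above):

```sql
-- Expect 86953525 rows, matching SINK_ROWS of the FINISHED load job
SELECT COUNT(*) FROM user_behavior;
```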
`FILES()` can infer the data types of the columns of the Parquet data and generate the schema for a StarRocks table. This provides the ability to query the file directly from S3 with a `SELECT`, or to have StarRocks automatically create a table for you based on the Parquet file schema.
NOTE
Schema inference is a new feature in version 3.1. It is provided for Parquet format only, and nested types are not yet supported.
There are three examples using the `FILES()` table function:
- Querying the data directly from S3
- Creating and loading the table using schema inference
- Creating a table by hand and then loading the data
NOTE
The dataset used in these examples is hosted in an S3 bucket in a StarRocks account. Any valid `aws.s3.access_key` and `aws.s3.secret_key` can be used, as the object is readable by any AWS authenticated user. Substitute your credentials for `AAA` and `BBB` in the commands below.
Querying directly from S3 using `FILES()` can give a good preview of the content of a dataset before you create a table. For example:
- Get a preview of the dataset without storing the data.
- Query for the min and max values and decide what data types to use.
- Check for nulls.
```sql
SELECT * FROM FILES(
    "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB BBBBBBBB"
) LIMIT 10;
```
NOTE
Notice that the column names are provided by the Parquet file.
```plaintext
UserID|ItemID |CategoryID|BehaviorType|Timestamp |
------+-------+----------+------------+-------------------+
1|2576651| 149192|pv |2017-11-25 01:21:25|
1|3830808| 4181361|pv |2017-11-25 07:04:53|
1|4365585| 2520377|pv |2017-11-25 07:49:06|
1|4606018| 2735466|pv |2017-11-25 13:28:01|
1| 230380| 411153|pv |2017-11-25 21:22:22|
1|3827899| 2920476|pv |2017-11-26 16:24:33|
1|3745169| 2891509|pv |2017-11-26 19:44:31|
1|1531036| 2920476|pv |2017-11-26 22:02:12|
1|2266567| 4145813|pv |2017-11-27 00:11:11|
1|2951368| 1080785|pv |2017-11-27 02:47:08|
```
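The list above also suggests checking min and max values and looking for nulls before deciding on column types. A minimal sketch of such a check, reusing the same `FILES()` parameters (the aggregate query itself is illustrative and not part of the original example):

```sql
-- Illustrative: value range of Timestamp and a count of null values
SELECT MIN(`Timestamp`) AS min_ts,
       MAX(`Timestamp`) AS max_ts,
       COUNT(*) - COUNT(`Timestamp`) AS null_timestamps
FROM FILES(
    "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
);
```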
This is a continuation of the preview example; the `SELECT * FROM FILES(...)` query shown above is wrapped in `CREATE TABLE` to automate the table creation using schema inference. The column names and types are not required to create a table when using the `FILES()` table function with Parquet files, as the Parquet format includes the column names and types and StarRocks will infer the schema.
NOTE
The syntax of `CREATE TABLE` when using schema inference does not allow setting the number of replicas, so set it before creating the table. The example below is for a system with a single replica:

```sql
ADMIN SET FRONTEND CONFIG ('default_replication_num' = "1");
```
```sql
CREATE DATABASE IF NOT EXISTS project;
USE project;

CREATE TABLE `user_behavior_inferred` AS
SELECT * FROM FILES(
    "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
);
```
```sql
DESCRIBE user_behavior_inferred;
```

```plaintext
Field |Type |Null|Key |Default|Extra|
------------+----------------+----+-----+-------+-----+
UserID |bigint |YES |true | | |
ItemID |bigint |YES |true | | |
CategoryID |bigint |YES |true | | |
BehaviorType|varchar(1048576)|YES |false| | |
Timestamp |varchar(1048576)|YES |false| | |
```
NOTE
Compare the inferred schema with the schema created by hand:
- data types
- nullable
- key fields
```sql
SELECT * FROM user_behavior_inferred LIMIT 10;
```

```plaintext
UserID|ItemID|CategoryID|BehaviorType|Timestamp |
------+------+----------+------------+-------------------+
171146| 68873| 3002561|pv |2017-11-30 07:11:14|
171146|146539| 4672807|pv |2017-11-27 09:51:41|
171146|146539| 4672807|pv |2017-11-27 14:08:33|
171146|214198| 1320293|pv |2017-11-25 22:38:27|
171146|260659| 4756105|pv |2017-11-30 05:11:25|
171146|267617| 4565874|pv |2017-11-27 14:01:25|
171146|329115| 2858794|pv |2017-12-01 02:10:51|
171146|458604| 1349561|pv |2017-11-25 22:49:39|
171146|458604| 1349561|pv |2017-11-27 14:03:44|
171146|478802| 541347|pv |2017-12-02 04:52:39|
```
You may want to customize the table that you are inserting into; for example, the:
- column data type, nullable setting, or default values
- key types and columns
- distribution
- etc.
NOTE
Creating the most efficient table structure requires knowledge of how the data will be used and of the content of the columns. This document does not cover table design; there is a link in the more information section at the end of the page.
In this example we are creating a table based on knowledge of how the table will be queried and the data in the Parquet file. The knowledge of the data in the Parquet file can be gained by querying the file directly in S3.
- Since a query of the file in S3 indicates that the `Timestamp` column contains data that matches a `datetime` data type, the column type is specified in the following DDL.
- By querying the data in S3 you can find that there are no null values in the dataset, so the DDL does not set any columns as nullable.
- Based on knowledge of the expected query types, the sort key and bucketing column are set to the column `UserID` (your use case might be different for this data; you might decide to use `ItemID` in addition to or instead of `UserID` for the sort key):
```sql
CREATE TABLE `user_behavior_declared` (
    `UserID` int(11),
    `ItemID` int(11),
    `CategoryID` int(11),
    `BehaviorType` varchar(65533),
    `Timestamp` datetime
) ENGINE=OLAP
DUPLICATE KEY(`UserID`)
DISTRIBUTED BY HASH(`UserID`)
PROPERTIES (
    "replication_num" = "1"
);
```
After creating the table, you can load it with `INSERT INTO … SELECT FROM FILES()`:
```sql
INSERT INTO user_behavior_declared
SELECT * FROM FILES(
    "path" = "s3://starrocks-examples/user_behavior_sample_data.parquet",
    "format" = "parquet",
    "aws.s3.region" = "us-east-1",
    "aws.s3.access_key" = "AAAAAAAAAAAAAAAAAAAA",
    "aws.s3.secret_key" = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
);
```
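Following the verification pattern used throughout this guide, a quick sanity check after the insert (a sketch; the expected row count is taken from the Broker Load output earlier on this page):

```sql
-- The sample dataset contains 86953525 rows
SELECT COUNT(*) FROM user_behavior_declared;
```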
- For more details on synchronous and asynchronous data loading, see Loading concepts.
- Learn about how Broker Load supports data transformation during loading at Transform data at loading and Change data through loading.
- This document only covered IAM user-based authentication. For other options, please see authenticate to AWS resources.
- The AWS CLI Command Reference covers the S3 URI in detail.
- Learn more about table design.
- Broker Load provides many more configuration and use options than those in the above examples; the details are in Broker Load.