-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-28794][SQL][DOC] Documentation for Create table Command #26759
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-28794][SQL][DOC] Documentation for Create table Command #26759
Conversation
|
||
<dl> | ||
<dt><code><em>USING datasource</em></code></dt> | ||
<dd>Datasource using which the table is created.Data source can be CSV, TXT, ORC, JDBC,PARQUET, etc.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs some proofreading, here and below. Space needs to follow punctuation.
"Data source" needs to be consistent and refer to the argument above.
"using which the table is created" -> used to create the table.
But can you say any more about what this means? This isn't really adding much documentation.
This still has a lot of basic syntax, grammar and formatting problems. Please proofread per above. |
18be12a
to
70cae22
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the docs need to have a little more, well, documentation here. It's just repeating what the syntax implies already. It doesn't need to be super in depth, but, it's worth asking: what is a reader getting out of this page that they won't already know?
### Parameters | ||
|
||
<dl> | ||
<dt><code><em>USING DATASOURCE</em></code></dt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it sound like DATASOURCE
is a keyword. Don't you want to write something like USING data_source
above? and match it below?
|
||
<dl> | ||
<dt><code><em>USING DATASOURCE</em></code></dt> | ||
<dd>Data Source is the file format used to create the table. Data Source can be CSV, TXT, ORC, JDBC, PARQUET, etc. which is an implementation of DataSourceRegister in spark.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's list all the possible valid values at the moment, or link to them somehow. I don't think the implementation detail in the last clause is important.
|
||
<dl> | ||
<dt><code><em>CLUSTERED BY</em></code></dt> | ||
<dd>Partitions are created on the table will be bucketed into fixed buckets based on the column specified for bucketing.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove 'are'. Can we provide any links to what bucketing means?
|
||
<dl> | ||
<dt><code><em>TBLPROPERTIES</em></code></dt> | ||
<dd>Table properties that has to be set are specified.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awkwardly worded. "Sets key-value properties on the table, such as ..."
|
||
<dl> | ||
<dt><code><em>LOCATION</em></code></dt> | ||
<dd>Specified Location is used to store table data.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we say a little more -- it's a path to a directory, right?
--- | ||
### Description | ||
|
||
The `CREATE TABLE` statement creates a new table using Hive format. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does Hive format mean here? (for the reader)
### Description | ||
`CREATE TABLE` statement is used to create a table in an exsisting database. | ||
|
||
The INSERT statements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not INSERT but CREATE?
@@ -19,4 +19,9 @@ license: | | |||
limitations under the License. | |||
--- | |||
|
|||
**This page is under construction** | |||
### Description | |||
`CREATE TABLE` statement is used to create a table in an exsisting database. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about create
-> define
to avoid doubly saying create
...
USING DATASOURCE | ||
[OPTIONS (key1=val1, key2=val2, ...)] | ||
[PARTITIONED BY (col_name1, col_name2, ...)] | ||
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adds SORTED BY
:
spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Line 289 in 18e8d1d
(SORTED BY orderedIdentifierList)? |
|
||
The INSERT statements: | ||
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html) | ||
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to add CREATE TABLE LIKE
, too?
spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Line 129 in 18e8d1d
| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier |
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name | ||
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)] | ||
[COMMENT table_comment] | ||
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the other PARTITIONED BY
case:
spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Line 121 in 18e8d1d
PARTITIONED BY partitionColumnNames=identifierList) | |
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)] | ||
[COMMENT table_comment] | ||
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)] | ||
[ROW FORMAT row_format] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SKEWED BY:
spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Line 123 in 18e8d1d
skewSpec | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu I think it is not supported.I checked SparkSqlParser class and found this.
kindly ping again, @PavithraRamachandran |
I shall update the PR , with the necessary corrections as per the review comments. |
### Examples | ||
{% highlight sql %} | ||
|
||
CREATE TABLE Student (Id INT,name STRING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not to create a Hive serde table since Spark 3.0. See #26736
70cae22
to
62502a2
Compare
ok to test |
Test build #116731 has finished for PR 26759 at commit
|
@PavithraRamachandran there are still unresolved comments from a month ago. Please address all of them. |
@PavithraRamachandran If you don't have enogh time to keep this, I can take this over. |
i shall complete today |
62502a2
to
50996f2
Compare
Test build #116818 has finished for PR 26759 at commit
|
{% highlight sql %} | ||
|
||
--Using data source | ||
CREATE TABLE Student (width INT, length INT, height INT) USING CSV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps change the column names to id, name, age to be more meaningful ? Also can you please put semi colon at the end in the examples just to be consistent with other docs ?
cc @huaxingao can you please check on the consistency part if you have some time ?
|
||
<dl> | ||
<dt><code><em>USING data_source</em></code></dt> | ||
<dd>Data Source is the file format used to create the table. Data source can be CSV, TXT, ORC, JDBC, PARQUET, etc.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we say "input format" instead of "file format". For example, JDBC is data source is not a file format, right ?
</dl> | ||
|
||
<dl> | ||
<dt><code><em>STORED</em></code></dt> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
STORED AS ?
Test build #116836 has finished for PR 26759 at commit
|
f835058
to
b7dab5d
Compare
Test build #116848 has finished for PR 26759 at commit
|
Test build #116860 has finished for PR 26759 at commit
|
The CREATE statements: | ||
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html) | ||
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html) | ||
* [CREATE TABLE LIKE](sql-ref-syntax-ddl-create-table-hiveformat.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sql-ref-syntax-ddl-create-table-like.html
|
||
### Syntax | ||
{% highlight sql %} | ||
CREATE TABLE [IF NOT EXISTS] [db_name.]new_table_name LIKE [db_name.]source_table_name [LOCATION path] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More options here:
spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
Line 131 in 3848999
| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier |
|
||
### Syntax | ||
{% highlight sql %} | ||
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please use table_identifier
instead of [db_name.]table_name
? Put the syntax of table_identifier
in Parameters section. You can refer to any of the docs that has table_identifier
.
[LOCATION path] | ||
[COMMENT table_comment] | ||
[TBLPROPERTIES (key1=val1, key2=val2, ...)] | ||
[AS select_statement] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to make all the docs follow the same convention: put a space in between the symbols (e.g. '|', '=', '[]') and text. Refer to sql-ref-syntax-ddl-create-database as an example.
USING CSV | ||
PARTITIONED BY (age) | ||
CLUSTERED BY (Id) INTO 4 buckets | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add ;
in the end of all the sql statements in example sections?
PARTITIONED BY (age) | ||
CLUSTERED BY (Id) INTO 4 buckets | ||
|
||
{% endhighlight %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a Related Statements
section to link the related statements?
c545e9a
to
bc2aef8
Compare
Test build #116953 has finished for PR 26759 at commit
|
Test build #116955 has finished for PR 26759 at commit
|
|
||
### Related Statements | ||
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html) | ||
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This link is broken. You have an extra s in ssql
|
||
### Related Statements | ||
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html) | ||
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broken link
The CREATE statements: | ||
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html) | ||
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html) | ||
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
broken link
USING data_source | ||
[ OPTIONS ( key1=val1, key2=val2, ... ) ] | ||
[ PARTITIONED BY ( col_name1, col_name2, ... ) ] | ||
[ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier | ||
[ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ] | ||
[ COMMENT table_comment ] | ||
[ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
break the line
Test build #116961 has finished for PR 26759 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth another proofreading too
|
||
<dl> | ||
<dt><code><em>ROW FORMAT</em></code></dt> | ||
<dd>SERDE is used to specify a custom SerDe or the DELIMITED clause inorder to use the native SerDe.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inorder -> in order
|
||
<dl> | ||
<dt><code><em>LOCATION</em></code></dt> | ||
<dd>Path to the directory where table data is stored, could be filesystem, HDFS, etc.</dd> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here and below, better as "... data is stored, which could be a path on distributed storage like HDFS, etc."
<dl> | ||
<dt><code><em>TBLPROPERTIES</em></code></dt> | ||
<dd> | ||
Table properties that has to be set are specified,such as `created.by.user`, `owner`, etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that have to be set
space after comma
Test build #117244 has finished for PR 26759 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Close enough, I think.
Merged to master |
What changes were proposed in this pull request?
Document CREATE TABLE statement in SQL Reference Guide.
Why are the changes needed?
Adding documentation for SQL reference.
Does this PR introduce any user-facing change?
yes
Before:
There was no documentation for this.
How was this patch tested?
Used jekyll build and serve to verify.