[SPARK-28794][SQL][DOC] Documentation for Create table Command #26759

Closed

PavithraRamachandran
Contributor

@PavithraRamachandran PavithraRamachandran commented Dec 4, 2019

What changes were proposed in this pull request?

Document CREATE TABLE statement in SQL Reference Guide.

Why are the changes needed?

Adding documentation for SQL reference.

Does this PR introduce any user-facing change?

yes

Before:
There was no documentation for this.

How was this patch tested?

Used jekyll build and serve to verify.


<dl>
<dt><code><em>USING datasource</em></code></dt>
<dd>Datasource using which the table is created.Data source can be CSV, TXT, ORC, JDBC,PARQUET, etc.</dd>
Member

This needs some proofreading, here and below. Space needs to follow punctuation.
"Data source" needs to be consistent and refer to the argument above.
"using which the table is created" -> used to create the table.
But can you say any more about what this means? This isn't really adding much documentation.

@srowen
Member

srowen commented Dec 9, 2019

This still has a lot of basic syntax, grammar and formatting problems. Please proofread per above.

@PavithraRamachandran PavithraRamachandran force-pushed the create_doc branch 2 times, most recently from 18be12a to 70cae22 on December 18, 2019 10:20
Member

@srowen srowen left a comment

I think the docs need to have a little more, well, documentation here. It's just repeating what the syntax implies already. It doesn't need to be super in depth, but, it's worth asking: what is a reader getting out of this page that they won't already know?

### Parameters

<dl>
<dt><code><em>USING DATASOURCE</em></code></dt>
Member

This makes it sound like DATASOURCE is a keyword. Don't you want to write something like USING data_source above? and match it below?


<dl>
<dt><code><em>USING DATASOURCE</em></code></dt>
<dd>Data Source is the file format used to create the table. Data Source can be CSV, TXT, ORC, JDBC, PARQUET, etc. which is an implementation of DataSourceRegister in spark.</dd>
Member

Let's list all the possible valid values at the moment, or link to them somehow. I don't think the implementation detail in the last clause is important.


<dl>
<dt><code><em>CLUSTERED BY</em></code></dt>
<dd>Partitions are created on the table will be bucketed into fixed buckets based on the column specified for bucketing.</dd>
Member

remove 'are'. Can we provide any links to what bucketing means?


<dl>
<dt><code><em>TBLPROPERTIES</em></code></dt>
<dd>Table properties that has to be set are specified.</dd>
Member

This is awkwardly worded. "Sets key-value properties on the table, such as ..."
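
As a sketch of what "sets key-value properties on the table" could look like in practice: the property key `created.by.user` appears in a later revision of this PR, while the table name, columns, and the value `'John'` are assumptions made up for illustration.

```sql
-- Hypothetical example: attach a user-defined key-value property to a table.
-- TBLPROPERTIES takes a parenthesized list of 'key' = 'value' pairs.
CREATE TABLE student (id INT, name STRING, age INT)
    USING PARQUET
    TBLPROPERTIES ('created.by.user' = 'John');
```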


<dl>
<dt><code><em>LOCATION</em></code></dt>
<dd>Specified Location is used to store table data.</dd>
Member

Can we say a little more -- it's a path to a directory, right?
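
To illustrate the point that `LOCATION` takes a path to a directory, a minimal sketch follows. The table name `student` and the path `/tmp/student` are assumptions chosen for illustration, not values from this PR.

```sql
-- Hypothetical example: the table's data files live in the directory
-- named by LOCATION instead of the default warehouse directory.
CREATE TABLE student (id INT, name STRING, age INT)
    USING PARQUET
    LOCATION '/tmp/student';
```

The same clause should also accept a path on distributed storage such as HDFS.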

---
### Description

The `CREATE TABLE` statement creates a new table using Hive format.
Member

What does Hive format mean here? (for the reader)

@maropu maropu changed the title from "[SPARK-28794] [DOC] Documentation for Create table Command" to "[SPARK-28794][SQL][DOC] Documentation for Create table Command" on Dec 20, 2019
### Description
`CREATE TABLE` statement is used to create a table in an exsisting database.

The INSERT statements:
Member

not INSERT but CREATE?

@@ -19,4 +19,9 @@ license: |
limitations under the License.
---

**This page is under construction**
### Description
`CREATE TABLE` statement is used to create a table in an exsisting database.
Member

How about create -> define to avoid doubly saying create...

USING DATASOURCE
[OPTIONS (key1=val1, key2=val2, ...)]
[PARTITIONED BY (col_name1, col_name2, ...)]
[CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
Member

Adds SORTED BY:
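
For illustration only, a bucketed table that also declares a sort order within each bucket might be written as below. The table and column names are assumptions; the clause order follows the `CLUSTERED BY ... SORTED BY ... INTO num_buckets BUCKETS` form that appears in a later revision of the syntax in this PR.

```sql
-- Hypothetical example: partition by age, hash rows into 4 buckets by id,
-- and keep rows inside each bucket sorted by name.
CREATE TABLE student (id INT, name STRING, age INT)
    USING PARQUET
    PARTITIONED BY (age)
    CLUSTERED BY (id) SORTED BY (name ASC) INTO 4 BUCKETS;
```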


The INSERT statements:
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html)
Member

We need to add CREATE TABLE LIKE, too?

| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
Member

Add the other PARTITIONED BY case:

PARTITIONED BY partitionColumnNames=identifierList) |

[(col_name1[:] col_type1 [COMMENT col_comment1], ...)]
[COMMENT table_comment]
[PARTITIONED BY (col_name2[:] col_type2 [COMMENT col_comment2], ...)]
[ROW FORMAT row_format]
Member

Contributor Author

@PavithraRamachandran PavithraRamachandran Jan 16, 2020

@maropu I think it is not supported. I checked the SparkSqlParser class and found this.
[Screenshot from 2020-01-16 14-12-08]

@srowen
Member

srowen commented Jan 4, 2020

Ping @PavithraRamachandran

@maropu
Member

maropu commented Jan 8, 2020

kindly ping again, @PavithraRamachandran

@PavithraRamachandran
Contributor Author

I shall update the PR with the necessary corrections as per the review comments.

### Examples
{% highlight sql %}

CREATE TABLE Student (Id INT,name STRING)
Member

@gatorsmile gatorsmile Jan 12, 2020

This is not to create a Hive serde table since Spark 3.0. See #26736

@maropu
Member

maropu commented Jan 14, 2020

ok to test

@SparkQA

SparkQA commented Jan 15, 2020

Test build #116731 has finished for PR 26759 at commit 62502a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Jan 15, 2020

@PavithraRamachandran there are still unresolved comments from a month ago. Please address all of them.

@maropu
Member

maropu commented Jan 15, 2020

@PavithraRamachandran If you don't have enough time to keep this going, I can take it over.

@PavithraRamachandran
Contributor Author

I shall complete this today.

@SparkQA

SparkQA commented Jan 16, 2020

Test build #116818 has finished for PR 26759 at commit 50996f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

{% highlight sql %}

--Using data source
CREATE TABLE Student (width INT, length INT, height INT) USING CSV
Contributor

Perhaps change the column names to id, name, age to be more meaningful? Also, can you please put a semicolon at the end of the examples, just to be consistent with other docs?

cc @huaxingao can you please check on the consistency part if you have some time ?
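
A sketch of how the example might read with both suggestions applied (meaningful column names and a terminating semicolon). The lowercase table name `student` is an assumption made here for consistency, not something the reviewer asked for.

```sql
-- Using data source
CREATE TABLE student (id INT, name STRING, age INT) USING CSV;
```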


<dl>
<dt><code><em>USING data_source</em></code></dt>
<dd>Data Source is the file format used to create the table. Data source can be CSV, TXT, ORC, JDBC, PARQUET, etc.</dd>
Contributor

Should we say "input format" instead of "file format"? For example, JDBC is a data source but not a file format, right?

</dl>

<dl>
<dt><code><em>STORED</em></code></dt>
Contributor

STORED AS ?

@SparkQA

SparkQA commented Jan 16, 2020

Test build #116836 has finished for PR 26759 at commit f835058.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2020

Test build #116848 has finished for PR 26759 at commit b7dab5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 16, 2020

Test build #116860 has finished for PR 26759 at commit c545e9a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

The CREATE statements:
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html)
* [CREATE TABLE LIKE](sql-ref-syntax-ddl-create-table-hiveformat.html)
Member

sql-ref-syntax-ddl-create-table-like.html


### Syntax
{% highlight sql %}
CREATE TABLE [IF NOT EXISTS] [db_name.]new_table_name LIKE [db_name.]source_table_name [LOCATION path]
Member

More options here:

| CREATE TABLE (IF NOT EXISTS)? target=tableIdentifier
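
For orientation, the basic `CREATE TABLE ... LIKE` form quoted above could be exercised roughly as follows. All table names and the path are hypothetical, and the sketch covers only the clauses shown in the quoted syntax line, not the additional options the reviewer refers to.

```sql
-- Hypothetical source table to copy the definition from.
CREATE TABLE student (id INT, name STRING, age INT) USING CSV;

-- Create an empty table with the same definition as student.
CREATE TABLE IF NOT EXISTS student_copy LIKE student;

-- Same, but store the new table's data under an explicit directory.
CREATE TABLE student_copy_at_path LIKE student LOCATION '/tmp/student_copy';
```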


### Syntax
{% highlight sql %}
CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
Contributor

Could you please use table_identifier instead of [db_name.]table_name? Put the syntax of table_identifier in Parameters section. You can refer to any of the docs that has table_identifier.

[LOCATION path]
[COMMENT table_comment]
[TBLPROPERTIES (key1=val1, key2=val2, ...)]
[AS select_statement]
Contributor

I am trying to make all the docs follow the same convention: put a space in between the symbols (e.g. '|', '=', '[]') and text. Refer to sql-ref-syntax-ddl-create-database as an example.

USING CSV
PARTITIONED BY (age)
CLUSTERED BY (Id) INTO 4 buckets

Contributor

Could you please add ; in the end of all the sql statements in example sections?

PARTITIONED BY (age)
CLUSTERED BY (Id) INTO 4 buckets

{% endhighlight %}
Contributor

Could you add a Related Statements section to link the related statements?

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116953 has finished for PR 26759 at commit bc2aef8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116955 has finished for PR 26759 at commit 2f26e55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


### Related Statements
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html)
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html)
Contributor

This link is broken. You have an extra s in ssql


### Related Statements
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html)
Contributor

broken link

The CREATE statements:
* [CREATE TABLE USING DATASOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
* [CREATE TABLE USING HIVE FORMAT](sql-ref-syntax-ddl-create-table-hiveformat.html)
* [CREATE TABLE LIKE](ssql-ref-syntax-ddl-create-table-like.html)
Contributor

broken link

USING data_source
[ OPTIONS ( key1=val1, key2=val2, ... ) ]
[ PARTITIONED BY ( col_name1, col_name2, ... ) ]
[ CLUSTERED BY ( col_name3, col_name4, ... ) [ SORTED BY ( col_name [ ASC | DESC ], ... ) ] INTO num_buckets BUCKETS ]
Contributor

This line is too long. You may want to break it to make it look better. It looks like this in my Google Chrome:
[image]

CREATE [ EXTERNAL ] TABLE [ IF NOT EXISTS ] table_identifier
[ ( col_name1[:] col_type1 [ COMMENT col_comment1 ], ... ) ]
[ COMMENT table_comment ]
[ PARTITIONED BY ( col_name2[:] col_type2 [ COMMENT col_comment2 ], ... ) | ( col_name1, col_name2, ... ) ]
Contributor

break the line

@SparkQA

SparkQA commented Jan 17, 2020

Test build #116961 has finished for PR 26759 at commit 7efd7f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Worth another proofreading too


<dl>
<dt><code><em>ROW FORMAT</em></code></dt>
<dd>SERDE is used to specify a custom SerDe or the DELIMITED clause inorder to use the native SerDe.</dd>
Member

inorder -> in order


<dl>
<dt><code><em>LOCATION</em></code></dt>
<dd>Path to the directory where table data is stored, could be filesystem, HDFS, etc.</dd>
Member

Here and below, better as "... data is stored, which could be a path on distributed storage like HDFS, etc."

<dl>
<dt><code><em>TBLPROPERTIES</em></code></dt>
<dd>
Table properties that has to be set are specified,such as `created.by.user`, `owner`, etc.
Member

that have to be set
space after comma

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117244 has finished for PR 26759 at commit 0ea1268.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@srowen srowen left a comment

Close enough, I think.

@srowen
Member

srowen commented Jan 23, 2020

Merged to master

@srowen srowen closed this in afe70b3 Jan 23, 2020