[SPARK-33687][SQL] Support analyze all tables in a specific database #30648

wangyum · 2020-12-07T13:46:19Z

What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:

 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?

Why are the changes needed?

Make it easy to analyze all tables in a specific database.
PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The feature is tested by unit tests.
The documentation is tested by regenerating it:

menu-sql.yaml	sql-ref-syntax-aux-analyze-tables.md

SparkQA · 2020-12-07T14:32:57Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36970/

SparkQA · 2020-12-07T14:59:20Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36970/

SparkQA · 2020-12-07T16:04:16Z

Test build #132370 has finished for PR 30648 at commit 78b9ffc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class AnalyzeTables(
case class AnalyzeTablesCommand(

dongjoon-hyun · 2020-12-07T23:19:35Z

Thank you for making this contribution, @wangyum .

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala

dongjoon-hyun · 2020-12-07T23:24:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala

+          }
+        }
+      } catch {
+        case e: Exception =>


This looks too general. Can we use a more specific one instead of Exception?

Change it to:

case NonFatal(e) => logWarning(s"Failed to analyze table ${tbl.table} in the " + s"database $db because of ${e.toString}", e)
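A minimal, self-contained sketch of the pattern suggested here, not the actual `AnalyzeTablesCommand` code: `analyzeOne` and `analyzeAll` are hypothetical names introduced for illustration, and only `scala.util.control.NonFatal` is a real library API. It illustrates why matching on `NonFatal` lets the loop skip a failing table (for example, one the current user cannot read) and continue, while fatal errors such as `OutOfMemoryError` still propagate.

```scala
import scala.util.control.NonFatal

object AnalyzeAllSketch {
  // Hypothetical stand-in for analyzing a single table. It fails for one
  // name to imitate a table the current user is not allowed to analyze.
  def analyzeOne(db: String, table: String): Unit =
    if (table == "restricted") throw new IllegalStateException("permission denied")

  // Analyze every table, logging and skipping the ones that fail.
  // Returns the names of the tables that were analyzed successfully.
  def analyzeAll(db: String, tables: Seq[String]): Seq[String] =
    tables.flatMap { tbl =>
      try {
        analyzeOne(db, tbl)
        Some(tbl)
      } catch {
        // NonFatal does not match VirtualMachineError, InterruptedException,
        // LinkageError, etc., so those are not swallowed here.
        case NonFatal(e) =>
          Console.err.println(
            s"Failed to analyze table $tbl in the database $db because of ${e.toString}")
          None
      }
    }
}
```

Calling `AnalyzeAllSketch.analyzeAll("school_db", Seq("students", "restricted", "teachers"))` skips the failing table and still processes the ones after it.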

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

dongjoon-hyun

I left a few comments. In general, this looks like a good feature for Apache Spark 3.2.0.

maropu

Looks like a nice feature. Btw (just a question), did you refer to any system when defining this syntax?

maropu · 2020-12-07T23:35:40Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

@@ -134,6 +134,8 @@ statement
        (AS? query)?                                                   #replaceTable
    | ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
        (identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)?    #analyze
+    | ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
+        (identifier)?                                                  #analyzeTables


If identifier is only for NOSCAN, how about defining a new ANTLR token for that?

The analyze rule also uses this identifier; how about doing it in a separate PR?
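For context, because the grammar accepts any `identifier` in that position, the check that it actually spells NOSCAN has to happen after parsing. A hedged sketch of such a check follows; `NoScanSketch` and `isNoScan` are hypothetical names, the optional identifier is modelled as an `Option[String]`, and the real parser code may differ.

```scala
import java.util.Locale

object NoScanSketch {
  // Hypothetical helper: `option` is the text of the optional trailing
  // identifier, if one was written after COMPUTE STATISTICS.
  // Returns true when NOSCAN was requested, false when it was omitted,
  // and throws for any other identifier.
  def isNoScan(option: Option[String]): Boolean = option match {
    case None => false
    case Some(text) if text.toLowerCase(Locale.ROOT) == "noscan" => true
    case Some(text) =>
      throw new IllegalArgumentException(s"Expected `NOSCAN` instead of `$text`")
  }
}
```

A dedicated ANTLR token, as proposed above, would move this validation into the grammar itself.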

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala

maropu · 2020-12-07T23:54:33Z

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala

+        }
+      } catch {
+        case e: Exception =>
+          logError(s"Failed to analyze table: ${tbl.identifier}.", e)


AnalysisException instead of logError, I think. And, we need tests for this code path.

This is because the current user may not have permission for some tables.

This is from the PostgreSQL doc:

ANALYZE processes every table and materialized view in the current database that the current user has permission to analyze.

https://www.postgresql.org/docs/12/sql-analyze.html

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala

maropu · 2020-12-08T00:14:04Z

Also, I think we need a SQL doc page for this new feature.

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

wangyum · 2020-12-09T07:08:10Z

Looks like a nice feature. Btw (just a question), did you refer to any system when defining this syntax?

PostgreSQL syntax is:

ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
ANALYZE [ VERBOSE ] [ table_and_columns [, ...] ]

where option can be one of:

    VERBOSE [ boolean ]
    SKIP_LOCKED [ boolean ]

and table_and_columns is:

    table_name [ ( column_name [, ...] ) ]

This syntax (ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?) is based on

spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

Lines 135 to 136 in 1a16397

    
    | ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
        (identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)?    #analyze

and

spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

Lines 200 to 201 in 1a16397

    
    | SHOW TABLES ((FROM | IN) multipartIdentifier)?
        (LIKE? pattern=STRING)?                                        #showTables

SparkQA · 2020-12-09T07:24:48Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37069/

maropu · 2020-12-09T07:37:12Z

docs/sql-ref-syntax-aux-analyze-tables.md

+|  database  | tableName  | isTemporary  |                    information                     |
+------------+------------+--------------+----------------------------------------------------+
+| school_db  | students   | false        | Database: school_db
+Table: students


Formatting this like the SHOW TABLE EXTENDED page looks better:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-table.html#examples

docs/sql-ref-syntax-aux-analyze-tables.md

maropu · 2020-12-09T07:39:38Z

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

@@ -134,6 +134,8 @@ statement
        (AS? query)?                                                   #replaceTable
    | ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
        (identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)?    #analyze
+    | ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
+        (identifier)?                                                  #analyzeTables


docs/sql-ref-syntax-aux-analyze-tables.md

SparkQA · 2020-12-09T07:53:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37069/

SparkQA · 2020-12-09T08:22:24Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37075/

SparkQA · 2020-12-09T08:49:55Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37075/

SparkQA · 2020-12-09T09:03:03Z

Test build #132467 has finished for PR 30648 at commit 198babc.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2020-12-09T09:28:54Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37079/

HyukjinKwon · 2020-12-09T09:41:02Z

Looks fine to me

SparkQA · 2020-12-09T10:01:54Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37079/

SparkQA · 2020-12-09T10:23:35Z

Test build #132477 has finished for PR 30648 at commit aafa658.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-09T11:29:19Z

Test build #132473 has finished for PR 30648 at commit 1a16397.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-02-24T12:03:22Z

@wangyum Could you resolve the conflict? This feature looks useful, so I wanna finish and merge it.

# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala # sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala

SparkQA · 2021-02-24T16:30:22Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40008/

SparkQA · 2021-02-24T16:59:05Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40008/

SparkQA · 2021-02-24T20:06:31Z

Test build #135428 has finished for PR 30648 at commit a75a30d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

I left minor comments and it looks fine otherwise.

maropu · 2021-02-25T00:37:41Z

docs/sql-ref-syntax-aux-analyze.md

@@ -20,3 +20,4 @@ license: |
 ---

 * [ANALYZE TABLE statement](sql-ref-syntax-aux-analyze-table.html)
+ * [ANALYZE TABLES statement](sql-ref-syntax-aux-analyze-tables.html)


plz update the index page, too.

maropu · 2021-02-25T00:42:43Z

docs/sql-ref-syntax-aux-analyze-tables.md

+                                          Provider: hive
+                                          Table Properties: [transient_lastDdlTime=1607495311]
+                                          Statistics: 24 bytes, 2 rows
+                                          Location: file:/opt/spark1/spark/spark-warehouse/school_db.db/students


Could you use DESC EXTENDED to follow the ANALYZE TABLE example? https://spark.apache.org/docs/3.0.2/sql-ref-syntax-aux-analyze-table.html I think the current example has a lot of unnecessary information. The simpler, the better.

maropu · 2021-02-25T00:43:32Z

docs/sql-ref-syntax-aux-analyze-tables.md

+
+### Description
+
+The `ANALYZE TABLES` statement collects statistics about all the tables in a database to be used by the query optimizer to find a better query execution plan.


nit: in a database -> in a specified database?

docs/sql-ref-syntax-aux-analyze-tables.md

maropu · 2021-02-25T00:50:51Z

docs/sql-ref-syntax-aux-analyze-tables.md

+
+* **[ NOSCAN ]**
+
+    Collects only the table's size in bytes ( which does not require scanning the entire table ).


nit: there seem to be unnecessary spaces in "( which" and "table )".

It also has spaces:

spark/docs/sql-ref-syntax-aux-analyze-table.md

Lines 51 to 53 in aafa658

* **NOSCAN**

Collects only the table's size in bytes ( which does not require scanning the entire table ).

yea, can you fix them, too?

SparkQA · 2021-02-25T04:32:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40029/

maropu

Looks fine cc: @HyukjinKwon @dongjoon-hyun

HyukjinKwon · 2021-02-25T05:06:26Z

I am okay with this. cc @peter-toth too if you're interested in this.

SparkQA · 2021-02-25T05:07:22Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40029/

SparkQA · 2021-02-25T07:49:55Z

Test build #135449 has finished for PR 30648 at commit 81efb9d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-02-26T13:03:33Z

If no one has more comments, I'll merge this in a few days.

maropu · 2021-03-01T00:07:11Z

Thanks! Merged to master.

dongjoon-hyun · 2021-03-01T00:25:53Z

Thank you, @wangyum and all!

…E and ANALYZE TABLES

### What changes were proposed in this pull request?

This is a followup of #30648

ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

…E and ANALYZE TABLES

### What changes were proposed in this pull request?

This is a followup of #30648

ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:

```g4
ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature is tested by unit tests.
The documentation is tested by regenerating it:

menu-sql.yaml | sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes apache#30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

analyze all tables in a specific database

78b9ffc

github-actions bot added the SQL label Dec 7, 2020

wangyum requested review from cloud-fan, dongjoon-hyun, HyukjinKwon and maropu December 7, 2020 23:17

dongjoon-hyun reviewed Dec 7, 2020

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 7, 2020

sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 7, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 7, 2020

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 7, 2020

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala (outdated, resolved)

dongjoon-hyun reviewed Dec 7, 2020

maropu reviewed Dec 8, 2020

Address comments

198babc

github-actions bot added the DOCS label Dec 9, 2020

Merge remote-tracking branch 'upstream/master' into SPARK-33687

1a16397

# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

maropu reviewed Dec 9, 2020

Fix doc format

aafa658

wangyum added 2 commits February 24, 2021 20:55

Merge remote-tracking branch 'upstream/master' into SPARK-33687

a71540c

# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala # sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala

Merged upstream

a75a30d

maropu reviewed Feb 25, 2021

Address comments

81efb9d

maropu approved these changes Feb 25, 2021

maropu closed this in d07fc30 Mar 1, 2021

wangyum deleted the SPARK-33687 branch March 1, 2021 01:02

cloud-fan mentioned this pull request Aug 18, 2021

[SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES #33781

Closed


