Skip to content

[SPARK-33687][SQL] Support analyze all tables in a specific database #30648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

[SPARK-33687][SQL] Support analyze all tables in a specific database #30648

wants to merge 7 commits into from

Conversation

wangyum
Copy link
Member

@wangyum wangyum commented Dec 7, 2020

What changes were proposed in this pull request?

This pr add support analyze all tables in a specific database:

 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?

Why are the changes needed?

  1. Make it easy to analyze all tables in a specific database.
  2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

The feature tested by unit test.
The documentation tested by regenerating the documentation:

menu-sql.yaml  sql-ref-syntax-aux-analyze-tables.md
image image

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36970/

@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36970/

@github-actions github-actions bot added the SQL label Dec 7, 2020
@SparkQA
Copy link

SparkQA commented Dec 7, 2020

Test build #132370 has finished for PR 30648 at commit 78b9ffc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class AnalyzeTables(
  • case class AnalyzeTablesCommand(

@dongjoon-hyun
Copy link
Member

Thank you for making this contribution, @wangyum .

}
}
} catch {
case e: Exception =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like too general. Can we use more specific one instead of Exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change it to:

        case NonFatal(e) =>
          logWarning(s"Failed to analyze table ${tbl.table} in the " +
            s"database $db because of ${e.toString}", e)

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few comments. In general, this looks like a good feature for Apache Spark 3.2.0.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks a nice feature. Btw, (just a question) Any system you referred to for defining this syntax?

@@ -134,6 +134,8 @@ statement
(AS? query)? #replaceTable
| ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
(identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)? #analyze
| ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
(identifier)? #analyzeTables
Copy link
Member

@maropu maropu Dec 7, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If identifier is only for NOSCAN, how about defining a new ANTLR token for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

analyze also uses this identifier , how about do it in a separate pr?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

}
} catch {
case e: Exception =>
logError(s"Failed to analyze table: ${tbl.identifier}.", e)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnalysisException instead of logError, I think. And, we need tests for this code path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because the current user may not have permission for some tables.

This is PostgreSQL doc:

ANALYZE processes every table and materialized view in the current database that the current user has permission to analyze.

https://www.postgresql.org/docs/12/sql-analyze.html

@maropu
Copy link
Member

maropu commented Dec 8, 2020

Also, I think we need a SQL doc page for this new feature.

@github-actions github-actions bot added the DOCS label Dec 9, 2020
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@wangyum
Copy link
Member Author

wangyum commented Dec 9, 2020

Looks a nice feature. Btw, (just a question) Any system you referred to for defining this syntax?

PostgreSQL syntax is:

ANALYZE [ ( option [, ...] ) ] [ table_and_columns [, ...] ]
ANALYZE [ VERBOSE ] [ table_and_columns [, ...] ]

where option can be one of:

    VERBOSE [ boolean ]
    SKIP_LOCKED [ boolean ]

and table_and_columns is:

    table_name [ ( column_name [, ...] ) ]

This syntax ( ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?) is from

| ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
(identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)? #analyze

and
| SHOW TABLES ((FROM | IN) multipartIdentifier)?
(LIKE? pattern=STRING)? #showTables

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37069/

| database | tableName | isTemporary | information |
+------------+------------+--------------+----------------------------------------------------+
| school_db | students | false | Database: school_db
Table: students
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting this like the SHOW TABLE EXTENDED page looks better:
https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-table.html#examples

@@ -134,6 +134,8 @@ statement
(AS? query)? #replaceTable
| ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
(identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)? #analyze
| ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
(identifier)? #analyzeTables
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sgtm

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37069/

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37075/

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37075/

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Test build #132467 has finished for PR 30648 at commit 198babc.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37079/

@HyukjinKwon
Copy link
Member

Looks fine to me

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37079/

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Test build #132477 has finished for PR 30648 at commit aafa658.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 9, 2020

Test build #132473 has finished for PR 30648 at commit 1a16397.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Feb 24, 2021

@wangyum Could you resolve the conflict? This feature looks useful, so I wanna finish and merge it.

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala
@SparkQA
Copy link

SparkQA commented Feb 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40008/

@SparkQA
Copy link

SparkQA commented Feb 24, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40008/

@SparkQA
Copy link

SparkQA commented Feb 24, 2021

Test build #135428 has finished for PR 30648 at commit a75a30d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left minor comments and it looks fine otherwise.

@@ -20,3 +20,4 @@ license: |
---

* [ANALYZE TABLE statement](sql-ref-syntax-aux-analyze-table.html)
* [ANALYZE TABLES statement](sql-ref-syntax-aux-analyze-tables.html)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plz update the index page, too.

Provider: hive
Table Properties: [transient_lastDdlTime=1607495311]
Statistics: 24 bytes, 2 rows
Location: file:/opt/spark1/spark/spark-warehouse/school_db.db/students
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use DESC EXTENDED to follow the ANALYZE TABLE example? https://spark.apache.org/docs/3.0.2/sql-ref-syntax-aux-analyze-table.html I think the current example has many unnecessary information. The simpler is the better.


### Description

The `ANALYZE TABLES` statement collects statistics about all the tables in a database to be used by the query optimizer to find a better query execution plan.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: in a database -> in a specified database?


* **[ NOSCAN ]**

Collects only the table's size in bytes ( which does not require scanning the entire table ).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: it seems unnecessary spaces found: ( which and table )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also has spaces:

* **NOSCAN**
Collects only the table's size in bytes ( which does not require scanning the entire table ).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, can you fix them, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40029/

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine cc: @HyukjinKwon @dongjoon-hyun

@HyukjinKwon
Copy link
Member

I am okay with this. cc @peter-toth too if you're interested in this.

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/40029/

@SparkQA
Copy link

SparkQA commented Feb 25, 2021

Test build #135449 has finished for PR 30648 at commit 81efb9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Feb 26, 2021

If no one have more comments, I'll merge this in a few days.

@maropu maropu closed this in d07fc30 Mar 1, 2021
@maropu
Copy link
Member

maropu commented Mar 1, 2021

Thanks! Merged to master.

@dongjoon-hyun
Copy link
Member

Thank you, @wangyum and all!

@wangyum wangyum deleted the SPARK-33687 branch March 1, 2021 01:02
cloud-fan added a commit that referenced this pull request Aug 19, 2021
…E and ANALYZE TABLES

### What changes were proposed in this pull request?

This is a followup of #30648

ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan added a commit that referenced this pull request Aug 19, 2021
…E and ANALYZE TABLES

### What changes were proposed in this pull request?

This is a followup of #30648

ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
### What changes were proposed in this pull request?

This pr add support analyze all tables in a specific database:
```g4
 ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature tested by unit test.
The documentation tested by regenerating the documentation:

menu-sql.yaml |  sql-ref-syntax-aux-analyze-tables.md
-- | --
![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png)

Closes apache#30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
…E and ANALYZE TABLES

### What changes were proposed in this pull request?

This is a followup of apache#30648

ANALYZE TABLE and TABLES are essentially the same command, it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes apache#33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants