[SPARK-33687][SQL] Support analyze all tables in a specific database #30648
Conversation
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132370 has finished for PR 30648 at commit
Thank you for making this contribution, @wangyum.
      }
    }
  } catch {
    case e: Exception =>
This looks too general. Can we use a more specific one instead of `Exception`?
Change it to:

    case NonFatal(e) =>
      logWarning(s"Failed to analyze table ${tbl.table} in the " +
        s"database $db because of ${e.toString}", e)
I left a few comments. In general, this looks like a good feature for Apache Spark 3.2.0.
Looks like a nice feature. Btw (just a question): is there any system you referred to for defining this syntax?
@@ -134,6 +134,8 @@ statement
         (AS? query)?                                                   #replaceTable
     | ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
         (identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)?    #analyze
     | ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
         (identifier)?                                                  #analyzeTables
If `identifier` is only for `NOSCAN`, how about defining a new ANTLR token for that?
`analyze` also uses this `identifier`; how about doing it in a separate PR?
sgtm
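For reference, statements accepted by the new `#analyzeTables` rule would look like the following (a sketch, assuming a `SparkSession` named `spark`; `school_db` is a made-up database name):

```scala
// Analyze every table in the current database.
spark.sql("ANALYZE TABLES COMPUTE STATISTICS")

// Analyze every table in an explicit database; FROM and IN are equivalent.
spark.sql("ANALYZE TABLES IN school_db COMPUTE STATISTICS")

// NOSCAN collects only table sizes, without scanning the data.
spark.sql("ANALYZE TABLES FROM school_db COMPUTE STATISTICS NOSCAN")
```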
    }
  } catch {
    case e: Exception =>
      logError(s"Failed to analyze table: ${tbl.identifier}.", e)
`AnalysisException` instead of `logError`, I think. And we need tests for this code path.
This is because the current user may not have permission for some tables. From the PostgreSQL doc:

> ANALYZE processes every table and materialized view in the current database that the current user has permission to analyze.
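A rough sketch of the kind of test that could cover this skip-and-log path, assuming the suite's `withTable`/`withView` helpers and assuming that analyzing a permanent view fails for that entry:

```scala
// Sketch only, inside StatisticsCollectionSuite; names and behavior are assumptions.
test("ANALYZE TABLES keeps going when one entry cannot be analyzed") {
  withTable("t1") {
    withView("v1") {
      sql("CREATE TABLE t1 (id INT) USING parquet")
      sql("INSERT INTO t1 VALUES (1), (2)")
      sql("CREATE VIEW v1 AS SELECT id FROM t1")
      // The view entry is expected to fail and only be logged, while the
      // command still finishes and collects stats for t1.
      sql("ANALYZE TABLES COMPUTE STATISTICS")
      val stats = spark.sessionState.catalog
        .getTableMetadata(org.apache.spark.sql.catalyst.TableIdentifier("t1")).stats
      assert(stats.nonEmpty)
    }
  }
}
```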
Also, I think we need a SQL doc page for this new feature.
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
PostgreSQL syntax is:

This syntax (spark/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4, lines 135 to 136 in 1a16397, and SqlBase.g4, lines 200 to 201 in 1a16397)
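(For reference, the ANALYZE synopsis in the linked PostgreSQL 12 manual is `ANALYZE [ VERBOSE ] [ table_and_columns [, ...] ]`; when no table is given, it processes every table in the current database.)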
Kubernetes integration test starting
| database   | tableName  | isTemporary  | information                                        |
+------------+------------+--------------+----------------------------------------------------+
| school_db  | students   | false        | Database: school_db
Table: students
Formatting this like the `SHOW TABLE EXTENDED` page looks better: https://spark.apache.org/docs/latest/sql-ref-syntax-aux-show-table.html#examples
sgtm
Kubernetes integration test status failure
Kubernetes integration test starting
Kubernetes integration test status success
Test build #132467 has finished for PR 30648 at commit
Kubernetes integration test starting
Looks fine to me
Kubernetes integration test status success
Test build #132477 has finished for PR 30648 at commit
Test build #132473 has finished for PR 30648 at commit
@wangyum Could you resolve the conflict? This feature looks useful, so I wanna finish and merge it.
# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/StatisticsCollectionTestBase.scala
Kubernetes integration test starting
Kubernetes integration test status success
Test build #135428 has finished for PR 30648 at commit
I left minor comments and it looks fine otherwise.
@@ -20,3 +20,4 @@ license: |
---

* [ANALYZE TABLE statement](sql-ref-syntax-aux-analyze-table.html)
* [ANALYZE TABLES statement](sql-ref-syntax-aux-analyze-tables.html)
plz update the index page, too.
Provider: hive
Table Properties: [transient_lastDdlTime=1607495311]
Statistics: 24 bytes, 2 rows
Location: file:/opt/spark1/spark/spark-warehouse/school_db.db/students
Could you use `DESC EXTENDED` to follow the `ANALYZE TABLE` example? https://spark.apache.org/docs/3.0.2/sql-ref-syntax-aux-analyze-table.html I think the current example has a lot of unnecessary information. The simpler, the better.
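A sketch of the slimmer example the reviewer seems to be suggesting, reusing the `school_db`/`students` names from the draft page (not the final doc content):

```scala
// Collect size-only stats for every table in school_db, then inspect one table.
spark.sql("ANALYZE TABLES IN school_db COMPUTE STATISTICS NOSCAN")
spark.sql("DESC EXTENDED school_db.students").show(truncate = false)
```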
### Description

The `ANALYZE TABLES` statement collects statistics about all the tables in a database to be used by the query optimizer to find a better query execution plan.
nit: `in a database` -> `in a specified database`?
* **[ NOSCAN ]**

  Collects only the table's size in bytes ( which does not require scanning the entire table ).
nit: it seems there are unnecessary spaces: `( which` and `table )`
It also has spaces in spark/docs/sql-ref-syntax-aux-analyze-table.md, lines 51 to 53 in aafa658:

* **NOSCAN**

  Collects only the table's size in bytes ( which does not require scanning the entire table ).
yea, can you fix them, too?
fixed
Kubernetes integration test starting
Looks fine cc: @HyukjinKwon @dongjoon-hyun
I am okay with this. cc @peter-toth too if you're interested in this.
Kubernetes integration test status failure
Test build #135449 has finished for PR 30648 at commit
If no one has more comments, I'll merge this in a few days.
Thanks! Merged to master.
Thank you, @wangyum and all!
…E and ANALYZE TABLES

### What changes were proposed in this pull request?
This is a followup of #30648. ANALYZE TABLE and TABLES are essentially the same command; it's weird to put them in 2 different doc pages. This PR proposes to merge them into one doc page.

### Why are the changes needed?
simplify the doc

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds support for analyzing all tables in a specific database:

```g4
ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
    (identifier)?
```

### Why are the changes needed?
1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The feature is tested by unit tests. The documentation was tested by regenerating it (screenshots of menu-sql.yaml and sql-ref-syntax-aux-analyze-tables.md omitted).

Closes apache#30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
What changes were proposed in this pull request?
This PR adds support for analyzing all tables in a specific database:
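```g4
ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
    (identifier)?
```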
Why are the changes needed?
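1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.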
Does this PR introduce any user-facing change?
No.
How was this patch tested?
The feature is tested by unit tests.
The documentation was tested by regenerating it (the resulting menu-sql.yaml and sql-ref-syntax-aux-analyze-tables.md screenshots are omitted here).