Commit d07fc30

wangyum authored and maropu committed
[SPARK-33687][SQL] Support analyze all tables in a specific database
### What changes were proposed in this pull request?

This PR adds support for analyzing all tables in a specific database:

```g4
ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS (identifier)?
```

### Why are the changes needed?

1. Make it easy to analyze all tables in a specific database.
2. PostgreSQL has a similar implementation: https://www.postgresql.org/docs/12/sql-analyze.html.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The feature was tested by unit tests. The documentation was tested by regenerating it:

| menu-sql.yaml | sql-ref-syntax-aux-analyze-tables.md |
| -- | -- |
| ![image](https://user-images.githubusercontent.com/5399861/109098769-dc33a200-775c-11eb-86b1-55531e5425e0.png) | ![image](https://user-images.githubusercontent.com/5399861/109098841-02594200-775d-11eb-8588-de8da97ec94a.png) |

Closes #30648 from wangyum/SPARK-33687.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
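For orientation (not part of the diff): a minimal end-to-end sketch of the new statement through the public SQL API, assuming a Spark build that includes this change; the database and table names are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object AnalyzeTablesExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("analyze-tables-example")
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS school_db")
    spark.sql("CREATE TABLE IF NOT EXISTS school_db.teachers (name STRING, teacher_id INT) USING parquet")
    spark.sql("INSERT INTO school_db.teachers VALUES ('Tom', 1), ('Jerry', 2)")

    // Size-only statistics for every table in school_db (no table scan):
    spark.sql("ANALYZE TABLES IN school_db COMPUTE STATISTICS NOSCAN")

    // Size and row-count statistics for every table in the current database:
    spark.sql("USE school_db")
    spark.sql("ANALYZE TABLES COMPUTE STATISTICS")

    // The collected statistics appear in the table's extended description:
    spark.sql("DESC EXTENDED school_db.teachers").show(100, truncate = false)

    spark.stop()
  }
}
```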
1 parent 5a48eb8 commit d07fc30

File tree

15 files changed: +285 −35 lines

docs/_data/menu-sql.yaml

Lines changed: 2 additions & 0 deletions
```diff
@@ -198,6 +198,8 @@
         subitems:
           - text: ANALYZE TABLE
             url: sql-ref-syntax-aux-analyze-table.html
+          - text: ANALYZE TABLES
+            url: sql-ref-syntax-aux-analyze-tables.html
           - text: CACHE
             url: sql-ref-syntax-aux-cache.html
             subitems:
```

docs/sql-ref-syntax-aux-analyze-table.md

Lines changed: 5 additions & 1 deletion
````diff
@@ -50,7 +50,7 @@ ANALYZE TABLE table_identifier [ partition_spec ]
 * If no analyze option is specified, `ANALYZE TABLE` collects the table's number of rows and size in bytes.
 * **NOSCAN**
 
-    Collects only the table's size in bytes ( which does not require scanning the entire table ).
+    Collects only the table's size in bytes (which does not require scanning the entire table).
 * **FOR COLUMNS col [ , ... ] `|` FOR ALL COLUMNS**
 
     Collects column statistics for each column specified, or alternatively for every column, as well as table statistics.
@@ -122,3 +122,7 @@ DESC EXTENDED students name;
 |      histogram|      NULL|
 +--------------+----------+
 ```
+
+### Related Statements
+
+* [ANALYZE TABLES](sql-ref-syntax-aux-analyze-tables.html)
````
docs/sql-ref-syntax-aux-analyze-tables.md

Lines changed: 110 additions & 0 deletions

````diff
@@ -0,0 +1,110 @@
+---
+layout: global
+title: ANALYZE TABLES
+displayTitle: ANALYZE TABLES
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+The `ANALYZE TABLES` statement collects statistics about all the tables in a specified database to be used by the query optimizer to find a better query execution plan.
+
+### Syntax
+
+```sql
+ANALYZE TABLES [ { FROM | IN } database_name ] COMPUTE STATISTICS [ NOSCAN ]
+```
+
+### Parameters
+
+* **{ FROM `|` IN } database_name**
+
+    Specifies the name of the database to be analyzed. Without a database name, `ANALYZE` collects all tables in the current database that the current user has permission to analyze.
+
+* **[ NOSCAN ]**
+
+    Collects only the table's size in bytes (which does not require scanning the entire table).
+
+### Examples
+
+```sql
+CREATE DATABASE school_db;
+USE school_db;
+
+CREATE TABLE teachers (name STRING, teacher_id INT);
+INSERT INTO teachers VALUES ('Tom', 1), ('Jerry', 2);
+
+CREATE TABLE students (name STRING, student_id INT, age SHORT);
+INSERT INTO students VALUES ('Mark', 111111, 10), ('John', 222222, 11);
+
+ANALYZE TABLES IN school_db COMPUTE STATISTICS NOSCAN;
+
+DESC EXTENDED teachers;
++--------------------+--------------------+-------+
+|            col_name|           data_type|comment|
++--------------------+--------------------+-------+
+|                name|              string|   null|
+|          teacher_id|                 int|   null|
+|                 ...|                 ...|    ...|
+|            Provider|             parquet|       |
+|          Statistics|          1382 bytes|       |
+|                 ...|                 ...|    ...|
++--------------------+--------------------+-------+
+
+DESC EXTENDED students;
++--------------------+--------------------+-------+
+|            col_name|           data_type|comment|
++--------------------+--------------------+-------+
+|                name|              string|   null|
+|          student_id|                 int|   null|
+|                 age|            smallint|   null|
+|                 ...|                 ...|    ...|
+|          Statistics|          1828 bytes|       |
+|                 ...|                 ...|    ...|
++--------------------+--------------------+-------+
+
+ANALYZE TABLES COMPUTE STATISTICS;
+
+DESC EXTENDED teachers;
++--------------------+--------------------+-------+
+|            col_name|           data_type|comment|
++--------------------+--------------------+-------+
+|                name|              string|   null|
+|          teacher_id|                 int|   null|
+|                 ...|                 ...|    ...|
+|            Provider|             parquet|       |
+|          Statistics|  1382 bytes, 2 rows|       |
+|                 ...|                 ...|    ...|
++--------------------+--------------------+-------+
+
+DESC EXTENDED students;
++--------------------+--------------------+-------+
+|            col_name|           data_type|comment|
++--------------------+--------------------+-------+
+|                name|              string|   null|
+|          student_id|                 int|   null|
+|                 age|            smallint|   null|
+|                 ...|                 ...|    ...|
+|            Provider|             parquet|       |
+|          Statistics|  1828 bytes, 2 rows|       |
+|                 ...|                 ...|    ...|
++--------------------+--------------------+-------+
+```
+
+### Related Statements
+
+* [ANALYZE TABLE](sql-ref-syntax-aux-analyze-table.html)
````

docs/sql-ref-syntax-aux-analyze.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -20,3 +20,4 @@ license: |
 ---
 
 * [ANALYZE TABLE statement](sql-ref-syntax-aux-analyze-table.html)
+* [ANALYZE TABLES statement](sql-ref-syntax-aux-analyze-tables.html)
```

docs/sql-ref-syntax.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -77,6 +77,7 @@ Spark SQL is Apache Spark's module for working with structured data. The SQL Syn
 * [ADD FILE](sql-ref-syntax-aux-resource-mgmt-add-file.html)
 * [ADD JAR](sql-ref-syntax-aux-resource-mgmt-add-jar.html)
 * [ANALYZE TABLE](sql-ref-syntax-aux-analyze-table.html)
+* [ANALYZE TABLES](sql-ref-syntax-aux-analyze-tables.html)
 * [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html)
 * [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html)
 * [DESCRIBE DATABASE](sql-ref-syntax-aux-describe-database.html)
```

sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4

Lines changed: 2 additions & 0 deletions
```diff
@@ -134,6 +134,8 @@ statement
         (AS? query)?                                                   #replaceTable
     | ANALYZE TABLE multipartIdentifier partitionSpec? COMPUTE STATISTICS
         (identifier | FOR COLUMNS identifierSeq | FOR ALL COLUMNS)?    #analyze
+    | ANALYZE TABLES ((FROM | IN) multipartIdentifier)? COMPUTE STATISTICS
+        (identifier)?                                                  #analyzeTables
     | ALTER TABLE multipartIdentifier
         ADD (COLUMN | COLUMNS)
         columns=qualifiedColTypeWithPositionList                       #addTableColumns
```
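For reference (not part of the diff; assumes a build containing this commit): the statements the new rule accepts through the Catalyst parser. The trailing `(identifier)?` slot is deliberately loose in the grammar and is validated against `NOSCAN` later, in `AstBuilder`.

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// All parse via the new #analyzeTables alternative:
println(CatalystSqlParser.parsePlan("ANALYZE TABLES COMPUTE STATISTICS"))
println(CatalystSqlParser.parsePlan("ANALYZE TABLES FROM school_db COMPUTE STATISTICS"))
println(CatalystSqlParser.parsePlan("ANALYZE TABLES IN school_db COMPUTE STATISTICS NOSCAN"))

// Grammatically valid but rejected during plan building, since only NOSCAN is allowed:
//   ANALYZE TABLES COMPUTE STATISTICS xxxx
//   => ParseException: Expected `NOSCAN` instead of `xxxx`
```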

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines changed: 2 additions & 0 deletions
```diff
@@ -859,6 +859,8 @@ class Analyzer(override val catalogManager: CatalogManager)
       s.copy(namespace = ResolvedNamespace(currentCatalog, catalogManager.currentNamespace))
     case s @ ShowViews(UnresolvedNamespace(Seq()), _, _) =>
       s.copy(namespace = ResolvedNamespace(currentCatalog, catalogManager.currentNamespace))
+    case a @ AnalyzeTables(UnresolvedNamespace(Seq()), _) =>
+      a.copy(namespace = ResolvedNamespace(currentCatalog, catalogManager.currentNamespace))
     case UnresolvedNamespace(Seq()) =>
       ResolvedNamespace(currentCatalog, Seq.empty[String])
     case UnresolvedNamespace(CatalogAndNamespace(catalog, ns)) =>
```
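In plain terms (an illustration, not diff content): when `ANALYZE TABLES` names no database, the parser emits an empty `UnresolvedNamespace`, and this rule fills in the session's current catalog and namespace; an explicit database is handled by the generic `UnresolvedNamespace` cases below.

```scala
// Illustration of the rewrite this rule performs (plans shown schematically;
// a current database of school_db is an assumed example):
//
//   "ANALYZE TABLES COMPUTE STATISTICS"
//     parses to:   AnalyzeTables(UnresolvedNamespace(Seq()), noScan = false)
//     resolves to: AnalyzeTables(ResolvedNamespace(currentCatalog, Seq("school_db")), noScan = false)
//
//   "ANALYZE TABLES IN other_db COMPUTE STATISTICS"
//     keeps UnresolvedNamespace(Seq("other_db")), which the generic
//     UnresolvedNamespace cases resolve.
```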

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

Lines changed: 19 additions & 0 deletions
```diff
@@ -3654,6 +3654,25 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
     }
   }
 
+  /**
+   * Create an [[AnalyzeTables]].
+   * Example SQL for analyzing all tables in default database:
+   * {{{
+   *   ANALYZE TABLES IN default COMPUTE STATISTICS;
+   * }}}
+   */
+  override def visitAnalyzeTables(ctx: AnalyzeTablesContext): LogicalPlan = withOrigin(ctx) {
+    if (ctx.identifier != null &&
+        ctx.identifier.getText.toLowerCase(Locale.ROOT) != "noscan") {
+      throw new ParseException(s"Expected `NOSCAN` instead of `${ctx.identifier.getText}`",
+        ctx.identifier())
+    }
+    val multiPart = Option(ctx.multipartIdentifier).map(visitMultipartIdentifier)
+    AnalyzeTables(
+      UnresolvedNamespace(multiPart.getOrElse(Seq.empty[String])),
+      noScan = ctx.identifier != null)
+  }
+
   /**
    * Create a [[RepairTable]].
    *
```

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/v2Commands.scala

Lines changed: 9 additions & 0 deletions
```diff
@@ -660,6 +660,15 @@ case class AnalyzeTable(
   override def children: Seq[LogicalPlan] = child :: Nil
 }
 
+/**
+ * The logical plan of the ANALYZE TABLES command.
+ */
+case class AnalyzeTables(
+    namespace: LogicalPlan,
+    noScan: Boolean) extends Command {
+  override def children: Seq[LogicalPlan] = Seq(namespace)
+}
+
 /**
  * The logical plan of the ANALYZE TABLE FOR COLUMNS command.
  */
```

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/DDLParserSuite.scala

Lines changed: 9 additions & 0 deletions
```diff
@@ -1873,6 +1873,15 @@ class DDLParserSuite extends AnalysisTest {
       "Expected `NOSCAN` instead of `xxxx`")
   }
 
+  test("SPARK-33687: analyze tables statistics") {
+    comparePlans(parsePlan("ANALYZE TABLES IN a.b.c COMPUTE STATISTICS"),
+      AnalyzeTables(UnresolvedNamespace(Seq("a", "b", "c")), noScan = false))
+    comparePlans(parsePlan("ANALYZE TABLES FROM a COMPUTE STATISTICS NOSCAN"),
+      AnalyzeTables(UnresolvedNamespace(Seq("a")), noScan = true))
+    intercept("ANALYZE TABLES IN a.b.c COMPUTE STATISTICS xxxx",
+      "Expected `NOSCAN` instead of `xxxx`")
+  }
+
   test("analyze table column statistics") {
     intercept("ANALYZE TABLE a.b.c COMPUTE STATISTICS FOR COLUMNS", "")
```

sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala

Lines changed: 3 additions & 0 deletions
```diff
@@ -373,6 +373,9 @@ class ResolveSessionCatalog(val catalogManager: CatalogManager)
         AnalyzePartitionCommand(ident.asTableIdentifier, partitionSpec, noScan)
       }
 
+    case AnalyzeTables(DatabaseInSessionCatalog(db), noScan) =>
+      AnalyzeTablesCommand(Some(db), noScan)
+
     case AnalyzeColumn(ResolvedV1TableOrViewIdentifier(ident), columnNames, allColumns) =>
       AnalyzeColumnCommand(ident.asTableIdentifier, columnNames, allColumns)
 
```

sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTableCommand.scala

Lines changed: 3 additions & 33 deletions
```diff
@@ -17,49 +17,19 @@
 
 package org.apache.spark.sql.execution.command
 
-import org.apache.spark.sql.{AnalysisException, Row, SparkSession}
+import org.apache.spark.sql.{Row, SparkSession}
 import org.apache.spark.sql.catalyst.TableIdentifier
-import org.apache.spark.sql.catalyst.catalog.CatalogTableType
 
 
 /**
  * Analyzes the given table to generate statistics, which will be used in query optimizations.
  */
 case class AnalyzeTableCommand(
     tableIdent: TableIdentifier,
-    noscan: Boolean = true) extends RunnableCommand {
+    noScan: Boolean = true) extends RunnableCommand {
 
   override def run(sparkSession: SparkSession): Seq[Row] = {
-    val sessionState = sparkSession.sessionState
-    val db = tableIdent.database.getOrElse(sessionState.catalog.getCurrentDatabase)
-    val tableIdentWithDB = TableIdentifier(tableIdent.table, Some(db))
-    val tableMeta = sessionState.catalog.getTableMetadata(tableIdentWithDB)
-    if (tableMeta.tableType == CatalogTableType.VIEW) {
-      // Analyzes a catalog view if the view is cached
-      val table = sparkSession.table(tableIdent.quotedString)
-      val cacheManager = sparkSession.sharedState.cacheManager
-      if (cacheManager.lookupCachedData(table.logicalPlan).isDefined) {
-        if (!noscan) {
-          // To collect table stats, materializes an underlying columnar RDD
-          table.count()
-        }
-      } else {
-        throw new AnalysisException("ANALYZE TABLE is not supported on views.")
-      }
-    } else {
-      // Compute stats for the whole table
-      val newTotalSize = CommandUtils.calculateTotalSize(sparkSession, tableMeta)
-      val newRowCount =
-        if (noscan) None else Some(BigInt(sparkSession.table(tableIdentWithDB).count()))
-
-      // Update the metastore if the above statistics of the table are different from those
-      // recorded in the metastore.
-      val newStats = CommandUtils.compareAndGetNewStats(tableMeta.stats, newTotalSize, newRowCount)
-      if (newStats.isDefined) {
-        sessionState.catalog.alterTableStats(tableIdentWithDB, newStats)
-      }
-    }
-
+    CommandUtils.analyzeTable(sparkSession, tableIdent, noScan)
     Seq.empty[Row]
   }
 }
```
sql/core/src/main/scala/org/apache/spark/sql/execution/command/AnalyzeTablesCommand.scala

Lines changed: 46 additions & 0 deletions

```diff
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.command
+
+import scala.util.control.NonFatal
+
+import org.apache.spark.sql.{Row, SparkSession}
+
+
+/**
+ * Analyzes all tables in the given database to generate statistics.
+ */
+case class AnalyzeTablesCommand(
+    databaseName: Option[String],
+    noScan: Boolean) extends RunnableCommand {
+
+  override def run(sparkSession: SparkSession): Seq[Row] = {
+    val catalog = sparkSession.sessionState.catalog
+    val db = databaseName.getOrElse(catalog.getCurrentDatabase)
+    catalog.listTables(db).foreach { tbl =>
+      try {
+        CommandUtils.analyzeTable(sparkSession, tbl, noScan)
+      } catch {
+        case NonFatal(e) =>
+          logWarning(s"Failed to analyze table ${tbl.table} in the " +
+            s"database $db because of ${e.toString}", e)
+      }
+    }
+    Seq.empty[Row]
+  }
+}
```
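A usage sketch (not part of the diff; an active session and the school_db database are assumed): the command that `ANALYZE TABLES` resolves to in the session catalog can also be invoked directly. Note the per-table `NonFatal` handler above: a table that fails to analyze only logs a warning, so the remaining tables are still processed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.command.AnalyzeTablesCommand

val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

// Equivalent to: ANALYZE TABLES IN school_db COMPUTE STATISTICS NOSCAN
AnalyzeTablesCommand(databaseName = Some("school_db"), noScan = true).run(spark)

// Equivalent to: ANALYZE TABLES COMPUTE STATISTICS (current database, with row counts)
AnalyzeTablesCommand(databaseName = None, noScan = false).run(spark)
```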

sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala

Lines changed: 36 additions & 1 deletion
```diff
@@ -27,7 +27,7 @@ import org.apache.hadoop.fs.{FileSystem, Path, PathFilter}
 import org.apache.spark.internal.Logging
 import org.apache.spark.sql.{AnalysisException, SparkSession}
 import org.apache.spark.sql.catalyst.{InternalRow, TableIdentifier}
-import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable}
+import org.apache.spark.sql.catalyst.catalog.{CatalogStatistics, CatalogTable, CatalogTableType}
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
 import org.apache.spark.sql.catalyst.plans.logical._
@@ -199,6 +199,41 @@ object CommandUtils extends Logging {
     newStats
   }
 
+  def analyzeTable(
+      sparkSession: SparkSession,
+      tableIdent: TableIdentifier,
+      noScan: Boolean): Unit = {
+    val sessionState = sparkSession.sessionState
+    val db = tableIdent.database.getOrElse(sessionState.catalog.getCurrentDatabase)
+    val tableIdentWithDB = TableIdentifier(tableIdent.table, Some(db))
+    val tableMeta = sessionState.catalog.getTableMetadata(tableIdentWithDB)
+    if (tableMeta.tableType == CatalogTableType.VIEW) {
+      // Analyzes a catalog view if the view is cached
+      val table = sparkSession.table(tableIdent.quotedString)
+      val cacheManager = sparkSession.sharedState.cacheManager
+      if (cacheManager.lookupCachedData(table.logicalPlan).isDefined) {
+        if (!noScan) {
+          // To collect table stats, materializes an underlying columnar RDD
+          table.count()
+        }
+      } else {
+        throw new AnalysisException("ANALYZE TABLE is not supported on views.")
+      }
+    } else {
+      // Compute stats for the whole table
+      val newTotalSize = CommandUtils.calculateTotalSize(sparkSession, tableMeta)
+      val newRowCount =
+        if (noScan) None else Some(BigInt(sparkSession.table(tableIdentWithDB).count()))
+
+      // Update the metastore if the above statistics of the table are different from those
+      // recorded in the metastore.
+      val newStats = CommandUtils.compareAndGetNewStats(tableMeta.stats, newTotalSize, newRowCount)
+      if (newStats.isDefined) {
+        sessionState.catalog.alterTableStats(tableIdentWithDB, newStats)
+      }
+    }
+  }
+
   /**
    * Compute stats for the given columns.
    * @return (row count, map from column name to CatalogColumnStats)
```
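The net effect of this refactoring (a sketch, assuming an active SparkSession `spark` and an existing school_db.teachers table): the body that previously lived in `AnalyzeTableCommand.run` is now the shared helper `CommandUtils.analyzeTable`, called once by `AnalyzeTableCommand` and once per table by `AnalyzeTablesCommand`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.command.CommandUtils

val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

// Size in bytes only; skips scanning the data:
CommandUtils.analyzeTable(spark, TableIdentifier("teachers", Some("school_db")), noScan = true)

// Size plus row count; triggers a count() over the table:
CommandUtils.analyzeTable(spark, TableIdentifier("teachers", Some("school_db")), noScan = false)
```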
