[SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file #29497

anchovYu · 2020-08-20T21:03:10Z

What changes were proposed in this pull request?

Group the messages of all AnalysisExceptions created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer into one file.

Create a new object: org.apache.spark.sql.CatalystErrors with many exception-creating functions.
When the Analyzer wants to create and throw a new AnalysisException, call functions of CatalystErrors.

Why are the changes needed?

This is a sample PR for grouping exception messages together in several files. It will largely help with standardization of error messages and its maintenance.

Does this PR introduce any user-facing change?

No. Error messages remain unchanged.

How was this patch tested?

No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Naming of exception functions

All function names end with Error.

For specific errors like groupingIDMismatch and groupingColInvalid, use them directly as the name, as in groupingIDMismatchError and groupingColInvalidError.
For generic errors like dataTypeMismatch,
- if confident about the context, a prefix and condition can be added, as in pivotValDataTypeMismatchError
- if not sure about the context, add a For suffix naming the specific component that this exception is related to, as in dataTypeMismatchForDeserializerError
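The naming rules above could look roughly like this in code. This is only a sketch: `NamingSketch`, the stand-in `AnalysisException` class, the simplified `String` parameters, and the message texts of the last two functions are assumptions for illustration, not the contents of the actual file.

```scala
object NamingSketch {
  // Stand-in for org.apache.spark.sql.AnalysisException so the sketch compiles without Spark.
  class AnalysisException(message: String) extends Exception(message)

  // Hypothetical error object; parameter types are simplified to String/Int.
  object Errors {
    // Specific error: use the error name directly and append "Error".
    def groupingSizeTooLargeError(sizeLimit: Int): Throwable =
      new AnalysisException(s"Grouping sets size cannot be greater than $sizeLimit")

    // Generic error, context known: prefix the context (here, a pivot value).
    def pivotValDataTypeMismatchError(pivotVal: String, pivotCol: String): Throwable =
      new AnalysisException(
        s"Pivot value '$pivotVal' does not match the data type of pivot column '$pivotCol'")

    // Generic error, context uncertain: add a "For<Component>" suffix.
    def dataTypeMismatchForDeserializerError(actual: String, expected: String): Throwable =
      new AnalysisException(s"Deserializer expected $expected but found $actual")
  }

  // Call site: build the exception through the object, then throw it.
  def checkGroupingSize(size: Int, limit: Int): Unit =
    if (size > limit) throw Errors.groupingSizeTooLargeError(limit)
}
```

The functions return `Throwable` and leave the `throw` to the caller, which matches the signatures shown in the diff hunks in this PR.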

gatorsmile · 2020-08-20T22:17:00Z

ok to test

gatorsmile · 2020-08-20T22:33:18Z

cc @cloud-fan @maropu

maropu · 2020-08-20T23:13:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/CatalystErrors.scala

+ * org.apache.spark.sql.catalyst.analysis.Analyzer.
+ */
+object CatalystErrors {
+  def groupingIDMismatchError(groupingID: GroupingID, groupByExprs: Seq[Expression]): Throwable = {


nit: how about moving these methods into object AnalysisException?

This is a summary sheet of raw exceptions thrown in Catalyst of branch-3.0: OSS Catalyst Exception Messages: Branch-3.0. Apart from AnalysisException and its sub-exceptions, there are a lot of other exceptions, for example, IllegalArgumentException and UnsupportedOperationException.

We generally have two ways to group these messages:

By component. All messages from a single component, regardless of exception type, are grouped into one exception file.

By exception type. All messages of a single exception type are created in the corresponding exception object. This is nice because by calling AnalysisException.groupingIDMismatchError we know the exception type this function throws. But this approach has a problem: Java/Scala built-in exception types cannot follow this pattern.

Because of this problem with the latter approach, I chose the first one: grouping all exception messages from Catalyst in one file. Do you have any thoughts on that? Thanks!
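To make the trade-off concrete, here is a minimal sketch of both grouping styles. Everything in it is hypothetical (the stand-in `AnalysisException`, the function names other than the `CatalystErrors` object name, and all message texts); it only illustrates why built-in exception types do not fit the second style.

```scala
object GroupingStylesSketch {
  // Stand-in for Spark's AnalysisException so the sketch compiles without Spark.
  class AnalysisException(message: String) extends Exception(message)

  // By exception type: factories live in the companion object of a class we own.
  object AnalysisException {
    def groupingColInvalidError(col: String): AnalysisException =
      new AnalysisException(s"Invalid grouping column: $col")
  }
  // There is no equivalent place for java.lang.IllegalArgumentException:
  // we do not own that class, so we cannot give it a companion object.

  // By component: one object holds every message, whatever the exception type.
  object CatalystErrors {
    def groupingColInvalidError(col: String): Throwable =
      new AnalysisException(s"Invalid grouping column: $col")

    def negativeSizeLimitError(limit: Int): Throwable =
      new IllegalArgumentException(s"Size limit must not be negative: $limit")
  }
}
```

With the by-component style the call site no longer reveals the exception type from the object name alone, which is the cost mentioned above.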

maropu · 2020-08-20T23:18:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/CatalystErrors.scala

+/**
+ * Object for grouping all error messages in catalyst.
+ * Currently it includes all AnalysisExcpetions created and thrown directly in
+ * org.apache.spark.sql.catalyst.analysis.Analyzer.


Just a question; do you propose to group all the analysis exceptions here? I think we already have too many places (~500) where analysis exceptions are used, though... https://gist.github.com/maropu/29bbfa9a93c41bd4ba1e0eaa038af087

We need to do all of them, I think. This is just an example PR.

This can help us do error message auditing. In the future, we can review this file, improve the error message quality, and make the messages unified and standard.

It looks nice, @gatorsmile. I think we might be able to add a new rule in scalastyle for forbidding the use of AnalysisException in the other files.

SparkQA · 2020-08-21T02:56:22Z

Test build #127708 has finished for PR 29497 at commit 29451f9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2020-08-21T08:51:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/CatalystErrors.scala

+  def legacyStoreAssignmentPolicyError(): Throwable = {
+    val configKey = SQLConf.STORE_ASSIGNMENT_POLICY.key
+    new AnalysisException(
+      s"LEGACY store assignment policy is disallowed in Spark data source V2. " +


nit: let's remove the leading s where it isn't needed.
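For reference, a small sketch of what this nit means. The `configKey` value is a placeholder standing in for `SQLConf.STORE_ASSIGNMENT_POLICY.key`, and the second half of the message is illustrative rather than copied from the diff.

```scala
object InterpolatorSketch {
  // Placeholder standing in for SQLConf.STORE_ASSIGNMENT_POLICY.key.
  val configKey: String = "some.config.key"

  // The first literal contains no $-expression, so it needs no `s` prefix.
  // Only the second literal interpolates a value and keeps the prefix.
  val message: String =
    "LEGACY store assignment policy is disallowed in Spark data source V2. " +
      s"Please set the configuration $configKey to other values."

  // Without the `s` prefix, "$configKey" stays in the string as literal text.
  val notInterpolated: String = "Please set the configuration $configKey to other values."
}
```

Dropping the unneeded prefix does not change the resulting string; it only makes it obvious which literals actually depend on a value.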

HyukjinKwon · 2020-08-21T08:52:49Z

standardization of error messages and its maintenance.

How do we plan this?

gatorsmile · 2020-08-21T18:18:43Z

standardization of error messages and its maintenance.

How do we plan this?

The whole community needs to work on this together, define a guide for error messages, and gradually improve the error messages in future releases.

Grouping the messages in a single file should not be blocked by the guide and message improvement. Basically, these PRs are like code cleaning and refactoring. This effort will help us audit the error messages before each release.

cloud-fan · 2020-08-24T16:48:36Z

I'm +1 to this idea. Error messages are super important to end users, as they tell them what went wrong and how to fix it. It's easier to audit them if they are grouped together.

We need a clear way to organize it. This PR proposes org.apache.spark.sql.CatalystErrors for sql/catalyst, but how about sql/core, sql/hive and thriftserver? In general, there are 2 kinds of errors: query compilation errors and query execution errors. Shall we group them separately?

one idea is to use different package names for different modules:

org.apache.spark.sql.catalyst.QueryCompilationErrors
org.apache.spark.sql.catalyst.QueryExecutionErrors
org.apache.spark.sql.QueryCompilationErrors
org.apache.spark.sql.QueryExecutionErrors
org.apache.spark.sql.hive.QueryCompilationErrors
org.apache.spark.sql.hive.QueryExecutionErrors
...

I'm open to other ideas as well. cc @maropu @viirya @dongjoon-hyun

viirya · 2020-08-25T03:53:29Z

This idea sounds okay. I just have a few questions.

Is it only for certain exceptions or all exceptions? For example, this PR targets only AnalysisException.
How to organize them? query compilation/query execution + package seems okay, but are there possibly unclear cases? For example, AnalysisException is also thrown in RunnableCommand; is it a query compilation error or a query execution error?
Is a top-level error message object too big to hold all exceptions in the module? For example, org.apache.spark.sql.QueryCompilationErrors and org.apache.spark.sql.QueryExecutionErrors might contain many exceptions.

maropu · 2020-08-25T07:58:32Z

one idea is to use different package names for different modules:

Ah, I see. Splitting exceptions into multiple fine-grained groups looks like a nice idea.

How to organize them? query compilation/query execution + package seems okay, but are there possibly unclear cases? For example, AnalysisException is also thrown in RunnableCommand; is it a query compilation error or a query execution error?

I agree with @viirya's suggestion, and I think we need a simple but clear rule to categorize exceptions into groups so that developers do not get confused. Probably, we can follow a simple mapping based on exception classes? e.g.,

AnalysisException => QueryCompilationErrors
SparkException, RuntimeException(UnsupportedOperationException, IllegalStateException, ...) => QueryExecutionErrors
...
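The class-based mapping suggested above can be sketched as a simple pattern match. The `ErrorGroup` type, `groupFor`, and the stand-in exception classes are hypothetical names introduced for illustration; in the proposal the groups are objects holding error functions, not values returned by a function.

```scala
object ErrorGroupSketch {
  // Hypothetical tags for the two proposed groups.
  sealed trait ErrorGroup
  case object QueryCompilationErrors extends ErrorGroup
  case object QueryExecutionErrors extends ErrorGroup

  // Stand-ins for Spark's exception classes so the sketch compiles without Spark.
  class AnalysisException(message: String) extends Exception(message)
  class SparkException(message: String) extends Exception(message)

  // Decide the group purely from the exception class.
  def groupFor(e: Throwable): Option[ErrorGroup] = e match {
    case _: AnalysisException => Some(QueryCompilationErrors)
    // UnsupportedOperationException and IllegalStateException are RuntimeExceptions.
    case _: SparkException | _: RuntimeException => Some(QueryExecutionErrors)
    case _ => None
  }
}
```

A rule like this is easy for contributors to follow, but it does not settle the RunnableCommand case raised above, where an AnalysisException is thrown while a command is running.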

anchovYu · 2020-09-10T06:31:20Z

Is it only for certain exceptions or all exceptions? For example, this PR targets only AnalysisException.

In the ideal case, all exception messages should be grouped for easy maintenance and auditing. This PR first starts from the AnalyzerExceptions in Catalyst.

And yeah, if we want to divide these exceptions in groups, then the mapping rule, and how to divide so that contributors can easily follow the rule to commit is a problem. And, another concern of the QueryCompilationErrors and QueryExecutionErrors is that they are different package names + same object name. When developers would like to call functions of compilation errors from different components, they may have to write the full package name.
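A rough sketch of the naming concern, assuming two modules each define an object called QueryCompilationErrors. Nested objects stand in for the packages here so the snippet is self-contained, and the error functions and messages are made up for illustration.

```scala
object SameNameSketch {
  // Nested objects stand in for the packages org.apache.spark.sql.catalyst and
  // org.apache.spark.sql.hive; the error functions below are hypothetical.
  object catalyst {
    object QueryCompilationErrors {
      def unresolvedColumnError(name: String): Throwable =
        new Exception(s"[catalyst] cannot resolve column $name")
    }
  }

  object hive {
    object QueryCompilationErrors {
      def unsupportedSerdeError(serde: String): Throwable =
        new Exception(s"[hive] unsupported serde $serde")
    }
  }

  object Caller {
    // Importing both objects under the same simple name would be ambiguous,
    // so the caller either writes qualified names or renames on import.
    import hive.{QueryCompilationErrors => HiveCompilationErrors}

    def errors(): Seq[Throwable] = Seq(
      catalyst.QueryCompilationErrors.unresolvedColumnError("a"),
      HiveCompilationErrors.unsupportedSerdeError("x"))
  }
}
```

Either workaround compiles, but both add noise at every call site that needs errors from more than one module.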

Grouping exceptions in a single file will inevitably inflate the file size, but I think the process is: first we group them, then we look for ways to optimize, for example by finding duplicate error messages and combining them into one function, or by dividing them into smaller, more refined groups.

Open to all ideas and great thanks for your review! cc @cloud-fan @maropu @viirya

SparkQA · 2020-09-10T07:05:03Z

Test build #128490 has finished for PR 29497 at commit 5877edf.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-11T08:24:52Z

that they are different package names + same object name.

This is a good point. Is it possible that we put all the error messages in the catalyst module? Other modules depend on catalyst and can access error messages.

gatorsmile · 2020-10-07T05:33:09Z

sql/catalyst/src/main/scala/org/apache/spark/sql/CatalystErrors.scala

+ * limitations under the License.
+ */
+
+package org.apache.spark.sql


move org.apache.spark.sql => org.apache.spark.sql.errors

gatorsmile · 2020-10-07T05:33:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/CatalystErrors.scala

+ * Currently it includes all AnalysisExcpetions created and thrown directly in
+ * org.apache.spark.sql.catalyst.analysis.Analyzer.
+ */
+object CatalystErrors {


rename CatalystErrors => QueryCompilationErrors

maropu · 2020-10-22T00:25:51Z

In the ideal case, all exception messages should be grouped for easy maintenance and auditing. This PR first starts from the AnalyzerExceptions in Catalyst.
And yeah, if we want to divide these exceptions in groups, then the mapping rule, and how to divide so that contributors can easily follow the rule to commit is a problem. And, another concern of the QueryCompilationErrors and QueryExecutionErrors is that they are different package names + same object name. When developers would like to call functions of compilation errors from different components, they may have to write the full package name.

This is a good point. Is it possible that we put all the error messages in the catalyst module? Other modules depend on catalyst and can access error messages.

Any update? I think it's okay to target grouping AnalyzerExceptions first in this PR and follow @cloud-fan's suggestion above.

SparkQA · 2020-11-16T05:23:26Z

Test build #131131 has finished for PR 29497 at commit 032c916.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-16T06:21:21Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35736/

SparkQA · 2020-11-16T06:51:31Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35736/

cloud-fan · 2020-11-16T07:44:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/QueryCompilationErrors.scala

+import org.apache.spark.sql.types.{AbstractDataType, DataType, StructType}
+
+/**
+ * Object for grouping all error messages in catalyst.


Object for grouping all error messages of the query compilation.

cloud-fan · 2020-11-16T07:45:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/QueryCompilationErrors.scala

+    )
+  }
+
+  def nonliteralPivotValError(pivotVal: Expression): Throwable = {


nonLiteral...

SparkQA · 2020-11-16T08:05:02Z

Test build #131140 has finished for PR 29497 at commit 645d81b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-11-16T08:05:02Z

Test build #131133 has finished for PR 29497 at commit f391721.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-11-16T08:23:38Z

retest this please

SparkQA · 2020-11-16T08:54:18Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35743/

SparkQA · 2020-11-16T09:22:23Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35743/

SparkQA · 2020-11-16T09:26:17Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35747/

SparkQA · 2020-11-16T09:56:37Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35747/

SparkQA · 2020-11-16T13:18:17Z

Test build #131144 has finished for PR 29497 at commit 645d81b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2020-11-20T22:35:15Z

LGTM

HyukjinKwon · 2020-11-20T23:33:31Z

Merged to master.

dongjoon-hyun · 2020-11-23T00:44:31Z

sql/catalyst/src/main/scala/org/apache/spark/sql/QueryCompilationErrors.scala

+
+/**
+ * Object for grouping all error messages of the query compilation.
+ * Currently it includes all AnalysisExcpetions created and thrown directly in


AnalysisExcpetions -> AnalysisExceptions?

dongjoon-hyun · 2020-11-23T00:50:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/QueryCompilationErrors.scala

+
+  def groupingSizeTooLargeError(sizeLimit: Int): Throwable = {
+    new AnalysisException(
+      s"Grouping sets size cannot be greater than $sizeLimit")


Although this is related to the original message, maybe, set's size or sets' size is better?

dongjoon-hyun · 2020-11-23T01:02:25Z

late LGTM. Thank you all.

…lyzer in one file

### What changes were proposed in this pull request?
This PR follows up #29497. Because #29497 only gives an example of grouping the `AnalysisException`s in Analyzer into QueryCompilationErrors, this PR groups the other `AnalysisException`s into QueryCompilationErrors.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30564 from beliefer/SPARK-32670-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

probot-autolabeler bot added the SQL label Aug 20, 2020

maropu reviewed Aug 20, 2020


HyukjinKwon reviewed Aug 21, 2020


anchovYu and others added 3 commits September 9, 2020 23:10

initial commit

65ff151

remove unnecessary function

2d76356

update string prefix

5877edf

anchovYu force-pushed the 32670 branch from 29451f9 to 5877edf Compare September 10, 2020 06:14

gatorsmile reviewed Oct 7, 2020


update package and object name

032c916

update object name and packege include

f391721

cloud-fan reviewed Nov 16, 2020


update comment and function name

645d81b

cloud-fan approved these changes Nov 16, 2020


gatorsmile changed the title ~~[WIP][SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file~~ [SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file Nov 20, 2020

HyukjinKwon approved these changes Nov 20, 2020


HyukjinKwon closed this in de0f50a Nov 20, 2020

dongjoon-hyun reviewed Nov 23, 2020


beliefer mentioned this pull request Dec 2, 2020

[SPARK-32670][SQL][FOLLOWUP] Group exception messages in Catalyst Analyzer in one file #30564

Closed
