Skip to content

[SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file #29497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from

Conversation

anchovYu
Copy link
Contributor

@anchovYu anchovYu commented Aug 20, 2020

What changes were proposed in this pull request?

Group all messages of AnalysisExcpetions created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file.

  • Create a new object: org.apache.spark.sql.CatalystErrors with many exception-creating functions.
  • When the Analyzer wants to create and throw a new AnalysisException, call functions of CatalystErrors

Why are the changes needed?

This is the sample PR that groups exception messages together in several files. It will largely help with standardization of error messages and its maintenance.

Does this PR introduce any user-facing change?

No. Error messages remain unchanged.

How was this patch tested?

No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Naming of exception functions

All function names ended with Error.

  • For specific errors like groupingIDMismatch and groupingColInvalid, directly use them as name, just like groupingIDMismatchError and groupingColInvalidError.
  • For generic errors like dataTypeMismatch,
    • if confident with the context, prefix and condition can be added, like pivotValDataTypeMismatchError
    • if not sure about the context, add a For suffix of the specific component that this exception is related to, like dataTypeMismatchForDeserializerError

@gatorsmile
Copy link
Member

ok to test

@gatorsmile
Copy link
Member

cc @cloud-fan @maropu

* org.apache.spark.sql.catalyst.analysis.Analyzer.
*/
object CatalystErrors {
def groupingIDMismatchError(groupingID: GroupingID, groupByExprs: Seq[Expression]): Throwable = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about moving these methods into object AnalysisException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a summary sheet of raw exceptions thrown in Catalyst of branch-3.0: OSS Catalyst Exception Messages: Branch-3.0. Apart from AnalysisException and its sub-exceptions, there are a lot of other exceptions, for example, IllegalArgumentException and UnsupportedOperationException.

We generally have two ways to group these messages:

  • By component. All messages from a single components, no matter the exception type, are grouped into one exception file.
  • By exception type. All messages of a single exception type are created in the corresponding exception object. It is great because by calling AnalysisException.groupingIDMismatchError we know the exception type this function throws. But this approach has a problem: those Java/Scala internal exception type cannot follow this way.

Due to the small problem of the latter one, I choose the first one of grouping all exception messages from Catalyst in one file. Or do you have any thoughts on that? Thanks!

/**
* Object for grouping all error messages in catalyst.
* Currently it includes all AnalysisExcpetions created and thrown directly in
* org.apache.spark.sql.catalyst.analysis.Analyzer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a question; did you propose to group all the analysis exception here? I think we already have too many places (~500) where the analysis exceptions used though... https://gist.github.com/maropu/29bbfa9a93c41bd4ba1e0eaa038af087

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do all of them, I think. This is just an example PR.

This can help us do error message auditing. In the future, we can review this file, improve the error message quality, and make them unified and standard in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks nice, @gatorsmile. I think we might be able to add a new rule in scalastyle for forbidding the use of AnalysisException in the the other files.

@SparkQA
Copy link

SparkQA commented Aug 21, 2020

Test build #127708 has finished for PR 29497 at commit 29451f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def legacyStoreAssignmentPolicyError(): Throwable = {
val configKey = SQLConf.STORE_ASSIGNMENT_POLICY.key
new AnalysisException(
s"LEGACY store assignment policy is disallowed in Spark data source V2. " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's remove leading s where it doesn't need.

@HyukjinKwon
Copy link
Member

standardization of error messages and its maintenance.

How do we plan this?

@gatorsmile
Copy link
Member

standardization of error messages and its maintenance.

How do we plan this?

The whole community needs to work on this together, define a guide for error messages and gradually improve the error messages in the future releases.

Grouping the messages in a single file should not be blocked by the guide and message improvement. Basically, these PRs are like code cleaning and refactoring. This effort will help us audit the error messages before each release.

@cloud-fan
Copy link
Contributor

I'm +1 to this idea, error message is super important to end-users, as it tells them what went wrong and how to fix it. It's easier to audit them if they are grouped together.

We need a clear way to organize it. This PR proposes org.apache.spark.sql.CatalystErrors for sql/catalyst, but how about sql/core, sql/hive and thriftserver? In general, there are 2 kinds of errors: query compilation error and query execution error, shall we group them separately?

one idea is to use different package names for different modules:

org.apache.spark.sql.catalyst.QueryCompilationErrors
org.apache.spark.sql.catalyst.QueryExecutionErrors
org.apache.spark.sql.QueryCompilationErrors
org.apache.spark.sql.QueryExecutionErrors
org.apache.spark.sql.hive.QueryCompilationErrors
org.apache.spark.sql.hive.QueryExecutionErrors
...

I'm open to other ideas as well. cc @maropu @viirya @dongjoon-hyun

@viirya
Copy link
Member

viirya commented Aug 25, 2020

This idea sounds okay. I just have few questions.

  1. Is it only for certain exceptions or all exceptions? For example, this PR targets only AnalysisException.

  2. How to organize them? query compilation/query execution + package seems okay, but is there possibly unclear cases? For example, AnalysisException is also thrown in RunnableCommand, is it query compilation error and query execution error?

  3. Is a top level object of error message too big to hold all exceptions in the module? For example org.apache.spark.sql.QueryCompilationErrors and org.apache.spark.sql.QueryExecutionErrors might contain many exceptions.

@maropu
Copy link
Member

maropu commented Aug 25, 2020

one idea is to use different package names for different modules:

Ah, I see. Splitting exceptions into multiple fine-graind groups looks a nice idea.

How to organize them? query compilation/query execution + package seems okay, but is there possibly unclear cases? For example, AnalysisException is also thrown in RunnableCommand, is it query compilation error and query execution error?

I agree to the @viirya suggestion and I think we need a simple but clear rule to categorize exceptions into groups so that developers do not get confused. Probably, we can follow a simple mapping based on exception classes? e.g.,

AnalyssiException => QueryCompilationErrors
SparkException, RuntimeException(UnsupportedOperationException, IllegalStateException, ...) => QueryExecutionErrors
...

@anchovYu
Copy link
Contributor Author

  1. Is it only for certain exceptions or all exceptions? For example, this PR targets only AnalysisException.

In the ideal case, all exception messages should be grouped for easy maintenance and auditing. This PR first starts from the AnalyzerExceptions in Catalyst.

And yeah, if we want to divide these exceptions in groups, then the mapping rule, and how to divide so that contributors can easily follow the rule to commit is a problem. And, another concern of the QueryCompilationErrors and QueryExecutionErrors is that they are different package names + same object name. When developers would like to call functions of compilation errors from different components, they may have to write the full package name.

Grouping exceptions in a single file will necessarily explode the file size; but I think the process is that first we group them, then we look for ways to optimize them, for example, find the duplicate error messages and combine them in one function, or, divide them into smaller and refined groups.

Open to all ideas and great thanks for your review! cc @cloud-fan @maropu @viirya

@SparkQA
Copy link

SparkQA commented Sep 10, 2020

Test build #128490 has finished for PR 29497 at commit 5877edf.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

that they are different package names + same object name.

This is a good point. Is it possible that we put all the error messages in the catalyst module? Other modules depend on catalyst and can access error messages.

* limitations under the License.
*/

package org.apache.spark.sql
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move org.apache.spark.sql => org.apache.spark.sql.errors

* Currently it includes all AnalysisExcpetions created and thrown directly in
* org.apache.spark.sql.catalyst.analysis.Analyzer.
*/
object CatalystErrors {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename CatalystErrors => QueryCompilationErrors

@maropu
Copy link
Member

maropu commented Oct 22, 2020

In the ideal case, all exception messages should be grouped for easy maintenance and auditing. This PR first starts from the AnalyzerExceptions in Catalyst.
And yeah, if we want to divide these exceptions in groups, then the mapping rule, and how to divide so that contributors can easily follow the rule to commit is a problem. And, another concern of the QueryCompilationErrors and QueryExecutionErrors is that they are different package names + same object name. When developers would like to call functions of compilation errors from different components, they may have to write the full package name.

This is a good point. Is it possible that we put all the error messages in the catalyst module? Other modules depend on catalyst and can access error messages.

Any update? I think its okay to target at grouping AnalyzerExceptions first in this PR and follow the @cloud-fan suggestion above.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131131 has finished for PR 29497 at commit 032c916.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35736/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35736/

import org.apache.spark.sql.types.{AbstractDataType, DataType, StructType}

/**
* Object for grouping all error messages in catalyst.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Object for grouping all error messages of the query compilation.

)
}

def nonliteralPivotValError(pivotVal: Expression): Throwable = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nonLiteral...

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131140 has finished for PR 29497 at commit 645d81b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131133 has finished for PR 29497 at commit f391721.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35743/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35743/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35747/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35747/

@SparkQA
Copy link

SparkQA commented Nov 16, 2020

Test build #131144 has finished for PR 29497 at commit 645d81b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile gatorsmile changed the title [WIP][SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file [SPARK-32670][SQL]Group exception messages in Catalyst Analyzer in one file Nov 20, 2020
@gatorsmile
Copy link
Member

LGTM

@HyukjinKwon
Copy link
Member

Merged to master.


/**
* Object for grouping all error messages of the query compilation.
* Currently it includes all AnalysisExcpetions created and thrown directly in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnalysisExcpetions -> AnalysisExceptions?


def groupingSizeTooLargeError(sizeLimit: Int): Throwable = {
new AnalysisException(
s"Grouping sets size cannot be greater than $sizeLimit")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this is related to the original message, maybe, set's size or sets' size is better?

@dongjoon-hyun
Copy link
Member

late LGTM. Thank you all.

cloud-fan pushed a commit that referenced this pull request Dec 10, 2020
…lyzer in one file

### What changes were proposed in this pull request?
This PR follows up #29497.
Because #29497 just give us an example to group all `AnalysisExcpetion` in Analyzer into QueryCompilationErrors.
This PR group other `AnalysisExcpetion` into QueryCompilationErrors.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30564 from beliefer/SPARK-32670-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants