[SPARK-32018][FOLLOWUP][Doc] Add migration guide for decimal value overflow in sum aggregation #29458

Closed
4 changes: 4 additions & 0 deletions docs/sql-migration-guide.md
@@ -36,6 +36,10 @@ license: |

- In Spark 3.1, NULL elements of structures, arrays and maps are converted to "null" in casting them to strings. In Spark 3.0 or earlier, NULL elements are converted to empty strings. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.castComplexTypesToString.enabled` to `true`.

- In Spark 3.1, when `spark.sql.ansi.enabled` is false, sum aggregation of decimal type column always returns `null` on decimal value overflow. In Spark 3.0 or earlier, when `spark.sql.ansi.enabled` is false and decimal value overflow happens in sum aggregation of decimal type column:
- If it is hash aggregation with `group by` clause, a runtime exception is thrown.
Contributor

Not many users know the physical plan nodes. How about:

In Spark 3.1, Spark always returns null if the sum of decimal overflows under non-ANSI
mode (`spark.sql.ansi.enabled` is false). In Spark 3.0 or earlier, the sum of decimal may
fail at runtime under non-ANSI mode (when the query has GROUP BY and is planned as hash aggregate)

Member Author

@gengliangwang, Aug 18, 2020

The name "non-ANSI mode" is a bit weird.
Also, we have to mention that Spark 3.0 or earlier returns null under certain conditions.

Contributor

@cloud-fan, Aug 18, 2020

We can use "default mode".

I don't see a difference between "may fail at runtime" and "may return null". They are mutually exclusive.

Member Author

Thanks, I have updated the doc and screenshot

- Otherwise, null is returned.
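
To make the behavior described in this entry concrete, here is a toy model in plain Python. It is a sketch of the Spark 3.1 semantics only (null on overflow when ANSI mode is off, an error when it is on), not Spark's implementation, and it does not reproduce the Spark 3.0 hash-aggregate failure. The names `sum_decimal` and `ansi_enabled` are hypothetical, and the cap of 38 digits is assumed from Spark's maximum decimal precision.

```python
from decimal import Decimal, localcontext
from typing import Iterable, Optional

MAX_PRECISION = 38  # assumption: Spark's maximum decimal precision


def sum_decimal(
    values: Iterable[Optional[Decimal]],
    precision: int = MAX_PRECISION,
    scale: int = 0,
    ansi_enabled: bool = False,
) -> Optional[Decimal]:
    """Toy model of summing a decimal(precision, scale) column.

    Hypothetical helper for illustration; it is not part of any Spark API.
    """
    with localcontext() as ctx:
        # Python's default context keeps only 28 significant digits, so widen
        # it enough that the running total is exact for this illustration.
        ctx.prec = 2 * precision + 10
        total = Decimal(0)
        for value in values:
            if value is None:
                continue  # NULL inputs are skipped, as in SQL SUM
            total += value
        # The integer part may use at most (precision - scale) digits.
        limit = Decimal(10) ** (precision - scale)
        if abs(total) < limit:
            return total
    if ansi_enabled:
        # Modeled after the ANSI-mode behavior: overflow is an error.
        raise ArithmeticError("decimal overflow in sum")
    return None  # default mode: overflow yields NULL


if __name__ == "__main__":
    largest = Decimal("9" * 38)  # largest value of decimal(38, 0)
    print(sum_decimal([Decimal(1), Decimal(2)]))  # 3
    print(sum_decimal([largest, Decimal(1)]))     # None
```

Calling `sum_decimal([largest, Decimal(1)], ansi_enabled=True)` raises `ArithmeticError` in this model instead of returning `None`.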

Member

nit: do we need to state that `spark.sql.ansi.enabled` is false twice? I think it's okay to just describe it like this:

In Spark 3.0 or earlier, the sum of...

or

In Spark 3.0 or earlier, in that case, the sum of...

Member Author

@maropu Thanks

## Upgrading from Spark SQL 3.0 to 3.0.1

- In Spark 3.0, JSON datasource and JSON function `schema_of_json` infer TimestampType from string values if they match to the pattern defined by the JSON option `timestampFormat`. Since version 3.0.1, the timestamp type inference is disabled by default. Set the JSON option `inferTimestamp` to `true` to enable such type inference.