
[SPARK-29837][SQL] PostgreSQL dialect: cast to boolean #26463


Closed
wants to merge 17 commits into from

Conversation

Ngone51
Member

@Ngone51 Ngone51 commented Nov 11, 2019

What changes were proposed in this pull request?

Make SparkSQL's cast-to-boolean behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL.

Why are the changes needed?

By default, SparkSQL and PostgreSQL differ in much of their cast behavior between types. We should make SparkSQL's cast behavior consistent with PostgreSQL's when spark.sql.dialect is configured as PostgreSQL.

Does this PR introduce any user-facing change?

Yes. If a user switches to the PostgreSQL dialect now, they will

  • get an exception if they input an invalid string, e.g. "erut", whereas they got null before;

  • get an exception if they input TimestampType, DateType, LongType, ShortType, ByteType, DecimalType, DoubleType, or FloatType values, whereas they got a true or false result before.

Here is evidence from PostgreSQL for those unsupported types:

timestamp:

postgres=# select cast(cast('2019-11-11' as timestamp) as boolean);
ERROR:  cannot cast type timestamp without time zone to boolean

date:

postgres=# select cast(cast('2019-11-11' as date) as boolean);
ERROR:  cannot cast type date to boolean

bigint:

postgres=# select cast(cast('20191111' as bigint) as boolean);
ERROR:  cannot cast type bigint to boolean

smallint:

postgres=# select cast(cast(2019 as smallint) as boolean);
ERROR:  cannot cast type smallint to boolean

bytea:

postgres=# select cast(cast('2019' as bytea) as boolean);
ERROR:  cannot cast type bytea to boolean

decimal:

postgres=# select cast(cast('2019' as decimal) as boolean);
ERROR:  cannot cast type numeric to boolean

float:

postgres=# select cast(cast('2019' as float) as boolean);
ERROR:  cannot cast type double precision to boolean
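For context, the string side of this dialect behavior can be sketched as a standalone object. This is a hypothetical illustration, not Spark's actual implementation: the accepted literal set (case-insensitive, trimmed, with unambiguous prefixes allowed) follows PostgreSQL's documented boolean input syntax, and anything else throws instead of returning null as the default Spark dialect does.

```scala
// Standalone sketch of PostgreSQL-style string-to-boolean parsing.
// Accepted literals (case-insensitive, trimmed) and their unambiguous
// prefixes map to true/false; anything else throws instead of
// returning null.
object PgBooleanParser {
  private val trueLiterals  = Seq("true", "yes", "on", "1")
  private val falseLiterals = Seq("false", "no", "off", "0")

  def cast(input: String): Boolean = {
    val s = input.trim.toLowerCase
    val matchesTrue  = s.nonEmpty && trueLiterals.exists(_.startsWith(s))
    val matchesFalse = s.nonEmpty && falseLiterals.exists(_.startsWith(s))
    (matchesTrue, matchesFalse) match {
      case (true, false) => true
      case (false, true) => false
      // Ambiguous (e.g. "o") or unrecognized (e.g. "erut") input fails.
      case _ =>
        throw new IllegalArgumentException(
          s"invalid input syntax for type boolean: $input")
    }
  }
}
```

So `cast("t")` and `cast("yes")` succeed, while `cast("erut")` and the ambiguous prefix `cast("o")` raise an error.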

How was this patch tested?

Added unit tests and tested manually.

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113579 has finished for PR 26463 at commit 75d51ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 Ngone51 changed the title [SPARK-29837][SLQ] PostgreSQL dialect: cast to boolean [SPARK-29837][SQL] PostgreSQL dialect: cast to boolean Nov 11, 2019
@Ngone51
Member Author

Ngone51 commented Nov 11, 2019

cc @cloud-fan

})
case TimestampType | DateType | LongType | ShortType |
ByteType | DecimalType() | DoubleType | FloatType =>
_ => throw new UnsupportedOperationException(s"cannot cast type $from to boolean")
Contributor

This means we will get exception at runtime. It's better to detect this kind of type errors and fail at analysis time.

Member Author

moved into checkInputDataTypes()


override def dataType: DataType = BooleanType

override def nullable: Boolean = true
Contributor

when the result can be null?

Member Author

Ah, I see. After this PR, it won't be null.

@cloud-fan
Contributor

cc @gengliangwang

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113591 has finished for PR 26463 at commit 2ab1847.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.expressions.{CastBase, Expression, TimeZoneAwareExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.Block._
import org.apache.spark.sql.catalyst.expressions.codegen.JavaCode
Member Author

TODO: remove this unused import.

@SparkQA

SparkQA commented Nov 12, 2019

Test build #113606 has finished for PR 26463 at commit 6ad100c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -273,7 +273,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
private[this] def needsTimeZone: Boolean = Cast.needsTimeZone(child.dataType, dataType)

// [[func]] assumes the input is no longer null because eval already does the null check.
@inline private[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])
@inline private[catalyst] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])
Contributor

change it to private should be good enough?

Member Author

no, but protected[this] could be fine.

Contributor

this is minor but just want to make it clear. In PostgreCastToBoolean we call super.castToBoolean, why the modifier of buildCast becomes a problem?

Member Author

Because case StringType inside PostgreCastToBoolean.castToBoolean() needs to call buildCast.
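The underlying Scala rule here: private[this] members are object-private and invisible even to subclasses, so the subclass's castToBoolean could not reach buildCast until the modifier was widened. A minimal sketch with hypothetical class names:

```scala
// private[this] members are object-private: not even a subclass can call
// them. Widening buildCast to protected lets a subclass override reuse it.
abstract class BaseCast {
  protected def buildCast[T](a: Any, func: T => Any): Any =
    func(a.asInstanceOf[T])
}

class DialectCast extends BaseCast {
  // Mirrors the StringType branch of PostgreCastToBoolean.castToBoolean,
  // which needs the parent's buildCast; with private[this] this call
  // would not compile.
  def castToBoolean(s: Any): Any =
    buildCast[String](s, (v: String) => v.trim.nonEmpty)
}
```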

@@ -791,7 +791,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit

// The function arguments are: `input`, `result` and `resultIsNull`. We don't need `inputIsNull`
// in parameter list, because the returned code will be put in null safe evaluation region.
private[this] type CastFunction = (ExprValue, ExprValue, ExprValue) => Block
protected[this] type CastFunction = (ExprValue, ExprValue, ExprValue) => Block
Contributor

ditto

copy(timeZoneId = Option(timeZoneId))

override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
case StringType | IntegerType => super.checkInputDataTypes()
Contributor

why call super.checkInputDataTypes() here? shall we just return success?

Member Author

I just wanted StringType and IntegerType to go through the Cast.canCast() check in super.checkInputDataTypes(), though I know it's not necessary.

I'll return success directly.


override def dataType: DataType = BooleanType

override def nullable: Boolean = false
Contributor

what if child is null? should it be child.nullable?

Member Author

make sense.

@@ -36,7 +36,8 @@ class PostgreSQLDialectQuerySuite extends QueryTest with SharedSparkSession {
}

Seq("o", "abc", "").foreach { input =>
checkAnswer(sql(s"select cast('$input' as boolean)"), Row(null))
intercept[IllegalArgumentException](
checkAnswer(sql(s"select cast('$input' as boolean)"), Row(null)))
Contributor

since we expect error here, we can just run sql(s"select cast('$input' as boolean)").collect

@@ -273,7 +273,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit
private[this] def needsTimeZone: Boolean = Cast.needsTimeZone(child.dataType, dataType)

// [[func]] assumes the input is no longer null because eval already does the null check.
@inline private[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])
@inline protected[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T])
Contributor

nit: I don't think [this] here buys us anything. How about just using protected?

case class PostgreCastToBoolean(child: Expression, timeZoneId: Option[String])
extends CastBase {

override protected def ansiEnabled = SQLConf.get.ansiEnabled
Contributor

to be safe, shall we throw exception here? We don't expect to see this method get called. pgsql dialect should not be affected by the ansi mode config.


override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
case StringType | IntegerType => TypeCheckResult.TypeCheckSuccess
case dt => throw new UnsupportedOperationException(s"cannot cast type $dt to boolean")
Contributor

We should follow other expressions and return TypeCheckResult.TypeCheckFailure here.

checkPostgreCastStringToBoolean("abc", null)
checkPostgreCastStringToBoolean("", null)
test("unsupported data types to cast to boolean") {
intercept[Exception](checkPostgreCastToBoolean(new Timestamp(1), null))
Contributor

We expect to see AnalysisException for these cases.

@SparkQA

SparkQA commented Nov 12, 2019

Test build #113616 has finished for PR 26463 at commit 0b9a78d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case TimestampType | DateType | LongType | ShortType |
ByteType | DecimalType() | DoubleType | FloatType =>
TypeCheckResult.TypeCheckFailure(s"cannot cast type ${child.dataType} to boolean")
case _ => TypeCheckResult.TypeCheckSuccess
Contributor

It's simpler and safer to list the supported types, isn't it? For example, binary type can pass the check which is wrong.
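Listing the supported types (an allow-list) is the safer shape: any type not explicitly named, including BinaryType, fails analysis rather than slipping through a deny-list. A self-contained sketch of that pattern, using plain strings in place of Spark's DataType hierarchy:

```scala
// Sketch of an allow-list analysis-time check: only explicitly supported
// input types pass; everything else (including binary, which a deny-list
// would miss) is rejected before execution starts.
sealed trait TypeCheckResult
case object TypeCheckSuccess extends TypeCheckResult
final case class TypeCheckFailure(message: String) extends TypeCheckResult

def checkCastToBoolean(inputType: String): TypeCheckResult = inputType match {
  case "string" | "integer" => TypeCheckSuccess
  case other => TypeCheckFailure(s"cannot cast type $other to boolean")
}
```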

test("unsupported data types to cast to boolean") {
intercept[Exception](checkPostgreCastToBoolean(new Timestamp(1), null))
intercept[Exception](checkPostgreCastToBoolean(new Date(1), null))
intercept[Exception](checkPostgreCastToBoolean(1.toLong, null))
Contributor

It should be intercept[AnalysisException]...

@SparkQA

SparkQA commented Nov 12, 2019

Test build #113632 has finished for PR 26463 at commit b7c6baf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Nov 13, 2019

Filed the flaky test: ImageFileFormatTest.test_read_images

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113677 has finished for PR 26463 at commit 68c7a13.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51
Member Author

Ngone51 commented Nov 13, 2019

Jenkins, retest this please.

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113685 has finished for PR 26463 at commit 68c7a13.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113710 has finished for PR 26463 at commit 999ec61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 13, 2019

Test build #113712 has finished for PR 26463 at commit a178b79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in fe1f456 Nov 14, 2019
@cloud-fan
Contributor

thanks, merging to master!

@Ngone51
Member Author

Ngone51 commented Nov 14, 2019

Thanks a lot! @cloud-fan

extends CastBase {

override protected def ansiEnabled =
throw new UnsupportedOperationException("PostgreSQL dialect doesn't support ansi mode")
Member

This looks dangerous. Whenever the code checks the flag ansiEnabled, we will throw an exception. We are removing the pgSQL dialect, so this will be updated anyway.

cloud-fan pushed a commit that referenced this pull request Dec 10, 2019
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:

- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date subtraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]

### Why are the changes needed?
As discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove the PostgreSQL dialect from the code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.

### How was this patch tested?
Existing UT.

Closes #26763 from xuanyuanking/SPARK-30125.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
5 participants