-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-29837][SQL] PostgreSQL dialect: cast to boolean #26463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #113579 has finished for PR 26463 at commit
|
cc @cloud-fan |
}) | ||
case TimestampType | DateType | LongType | ShortType | | ||
ByteType | DecimalType() | DoubleType | FloatType => | ||
_ => throw new UnsupportedOperationException(s"cannot cast type $from to boolean") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we will get exception at runtime. It's better to detect this kind of type errors and fail at analysis time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved into checkInputDataTypes()
|
||
override def dataType: DataType = BooleanType | ||
|
||
override def nullable: Boolean = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when the result can be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see. After this pr, it won't be null.
Test build #113591 has finished for PR 26463 at commit
|
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult | ||
import org.apache.spark.sql.catalyst.expressions.{CastBase, Expression, TimeZoneAwareExpression} | ||
import org.apache.spark.sql.catalyst.expressions.codegen.Block._ | ||
import org.apache.spark.sql.catalyst.expressions.codegen.JavaCode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: remove this unused import.
Test build #113606 has finished for PR 26463 at commit
|
@@ -273,7 +273,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit | |||
private[this] def needsTimeZone: Boolean = Cast.needsTimeZone(child.dataType, dataType) | |||
|
|||
// [[func]] assumes the input is no longer null because eval already does the null check. | |||
@inline private[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T]) | |||
@inline private[catalyst] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change it to private
should be good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, but protected[this]
could be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is minor but just want to make it clear. In PostgreCastToBoolean
we call super.castToBoolean
, why the modifier of buildCast
becomes a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because case StringType
inside PostgreCastToBoolean.castToBoolean()
needs to call buildCast
.
@@ -791,7 +791,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit | |||
|
|||
// The function arguments are: `input`, `result` and `resultIsNull`. We don't need `inputIsNull` | |||
// in parameter list, because the returned code will be put in null safe evaluation region. | |||
private[this] type CastFunction = (ExprValue, ExprValue, ExprValue) => Block | |||
protected[this] type CastFunction = (ExprValue, ExprValue, ExprValue) => Block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
copy(timeZoneId = Option(timeZoneId)) | ||
|
||
override def checkInputDataTypes(): TypeCheckResult = child.dataType match { | ||
case StringType | IntegerType => super.checkInputDataTypes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why call super.checkInputDataTypes()
here? shall we just return success?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just want the StringType
and IntegerType
to do Cast.canCast()
check in super.checkInputDataTypes()
, though I know it's not necessary.
I'll return success directly.
|
||
override def dataType: DataType = BooleanType | ||
|
||
override def nullable: Boolean = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if child
is null? should it be child.nullable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense.
@@ -36,7 +36,8 @@ class PostgreSQLDialectQuerySuite extends QueryTest with SharedSparkSession { | |||
} | |||
|
|||
Seq("o", "abc", "").foreach { input => | |||
checkAnswer(sql(s"select cast('$input' as boolean)"), Row(null)) | |||
intercept[IllegalArgumentException]( | |||
checkAnswer(sql(s"select cast('$input' as boolean)"), Row(null))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we expect error here, we can just run sql(s"select cast('$input' as boolean)").collect
@@ -273,7 +273,7 @@ abstract class CastBase extends UnaryExpression with TimeZoneAwareExpression wit | |||
private[this] def needsTimeZone: Boolean = Cast.needsTimeZone(child.dataType, dataType) | |||
|
|||
// [[func]] assumes the input is no longer null because eval already does the null check. | |||
@inline private[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T]) | |||
@inline protected[this] def buildCast[T](a: Any, func: T => Any): Any = func(a.asInstanceOf[T]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I don't think [this]
here can buy us anything. How about just use protected
?
case class PostgreCastToBoolean(child: Expression, timeZoneId: Option[String]) | ||
extends CastBase { | ||
|
||
override protected def ansiEnabled = SQLConf.get.ansiEnabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be safe, shall we throw exception here? We don't expect to see this method get called. pgsql dialect should not be affected by the ansi mode config.
|
||
override def checkInputDataTypes(): TypeCheckResult = child.dataType match { | ||
case StringType | IntegerType => TypeCheckResult.TypeCheckSuccess | ||
case dt => throw new UnsupportedOperationException(s"cannot cast type $dt to boolean") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should follow other expressions and return TypeCheckResult.TypeCheckFailure
here.
checkPostgreCastStringToBoolean("abc", null) | ||
checkPostgreCastStringToBoolean("", null) | ||
test("unsupported data types to cast to boolean") { | ||
intercept[Exception](checkPostgreCastToBoolean(new Timestamp(1), null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We expect to see AnalysisException
for these cases.
Test build #113616 has finished for PR 26463 at commit
|
case TimestampType | DateType | LongType | ShortType | | ||
ByteType | DecimalType() | DoubleType | FloatType => | ||
TypeCheckResult.TypeCheckFailure(s"cannot cast type ${child.dataType} to boolean") | ||
case _ => TypeCheckResult.TypeCheckSuccess |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's simpler and safer to list the supported types, isn't it? For example, binary type can pass the check which is wrong.
test("unsupported data types to cast to boolean") { | ||
intercept[Exception](checkPostgreCastToBoolean(new Timestamp(1), null)) | ||
intercept[Exception](checkPostgreCastToBoolean(new Date(1), null)) | ||
intercept[Exception](checkPostgreCastToBoolean(1.toLong, null)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be intercept[AnalysisException]...
Test build #113632 has finished for PR 26463 at commit
|
Filed the flaky test: ImageFileFormatTest.test_read_images |
Test build #113677 has finished for PR 26463 at commit
|
Jenkins, retest this please. |
Test build #113685 has finished for PR 26463 at commit
|
Test build #113710 has finished for PR 26463 at commit
|
Test build #113712 has finished for PR 26463 at commit
|
thanks, merging to master! |
Thanks a lot ! @cloud-fan |
extends CastBase { | ||
|
||
override protected def ansiEnabled = | ||
throw new UnsupportedOperationException("PostgreSQL dialect doesn't support ansi mode") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks dangerous. Whenever the code checked the flag ansiEnabled, we will throw an exception. We are removing the pgSQL dialect, this will be updated anyway.
### What changes were proposed in this pull request? Reprocess all PostgreSQL dialect related PRs, listing in order: - #25158: PostgreSQL integral division support [revert] - #25170: UT changes for the integral division support [revert] - #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert] - #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert] - #26112: Date substraction support [keep the ANSI-compliant part] - #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert] - #26463: Cast to boolean support for PostgreSQL dialect [revert] - #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part] ### Why are the changes needed? As the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove PostgreSQL dialect form code base for several reasons: 1. The current approach makes the codebase complicated and hard to maintain. 2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now. ### Does this PR introduce any user-facing change? Yes, the config `spark.sql.dialect` will be removed. ### How was this patch tested? Existing UT. Closes #26763 from xuanyuanking/SPARK-30125. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Make SparkSQL's
cast to boolean
behavior be consistent with PostgreSQL whenspark.sql.dialect is configured as PostgreSQL.
Why are the changes needed?
SparkSQL and PostgreSQL have a lot different cast behavior between types by default. We should make SparkSQL's cast behavior be consistent with PostgreSQL when
spark.sql.dialect
is configured as PostgreSQL.Does this PR introduce any user-facing change?
Yes. If user switches to PostgreSQL dialect now, they will
get an exception if they input a invalid string, e.g "erut", while they get
null
before;get an exception if they input
TimestampType
,DateType
,LongType
,ShortType
,ByteType
,DecimalType
,DoubleType
,FloatType
values, while they gettrue
orfalse
result before.And here're evidences for those unsupported types from PostgreSQL:
timestamp:
date:
bigint:
smallint:
bytea:
decimal:
float:
How was this patch tested?
Added and tested manually.