
Gate non-default StringTypeWithCollation inputs on Spark 4.0 datetime expressions #4646

@andygrove

Describe the bug

Spark 4.0 widens many string-typed inputTypes on datetime expressions to StringTypeWithCollation(supportsTrimCollation = true). The affected datetime expressions include convert_timezone, date_format, date_trunc, from_unixtime, make_timestamp, next_day, to_unix_timestamp, trunc, and unix_timestamp.

Today the Comet serdes for these expressions accept those string inputs without checking the collation, so non-default collations are silently treated as compatible. Per the audit-comet-expression skill (rule 11), a non-default collation on a string input should flip the support level to Incompatible(Some(...)). That makes the divergence visible in EXPLAIN and in the auto-generated compatibility guide, and it makes the projection fall back to Spark rather than produce potentially divergent results.
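
To make the intended rule concrete, here is a minimal, dependency-free sketch of the gating check. It is a model only, not the Comet implementation: `SupportLevel`, `InputType`, `StringInput`, `TimestampInput`, and `CollationGate` are hypothetical names that mirror the wording of this issue, and the sketch assumes `UTF8_BINARY` is the only collation treated as default.

```scala
// Hypothetical model of the gating rule described above. None of these
// types come from the Comet or Spark code bases.
sealed trait SupportLevel
case object Compatible extends SupportLevel
final case class Incompatible(notes: Option[String]) extends SupportLevel

// Stand-in for an expression input type; only strings carry a collation.
sealed trait InputType
final case class StringInput(collation: String) extends InputType
case object TimestampInput extends InputType

object CollationGate {
  // Assumption: UTF8_BINARY is the default collation.
  val DefaultCollation = "UTF8_BINARY"

  // Returns Incompatible with a note as soon as any string input
  // uses a collation other than the default.
  def supportLevel(exprName: String, inputs: Seq[InputType]): SupportLevel = {
    val nonDefault = inputs.collect {
      case StringInput(c) if !c.equalsIgnoreCase(DefaultCollation) => c
    }.distinct
    if (nonDefault.isEmpty) Compatible
    else Incompatible(Some(
      s"$exprName has string input(s) with non-default collation: " +
        nonDefault.mkString(", ")))
  }
}
```

Under this model, `CollationGate.supportLevel("date_format", Seq(TimestampInput, StringInput("UTF8_LCASE")))` yields an `Incompatible` value whose note names the collation, while the same call with `StringInput("UTF8_BINARY")` yields `Compatible`. A real fix would perform the equivalent check against the actual Spark string type of each child expression inside each affected serde.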

Steps to reproduce

On Spark 4.0, apply a non-default collation (for example UTF8_LCASE or UNICODE_CI) to a string argument of one of the datetime expressions above and observe that Comet still runs the expression natively without distinguishing the collation.

Expected behavior

Non-default collations on string inputs to these datetime expressions should report Incompatible(Some(...)) (falling back unless explicitly opted in), consistent with how other expressions gate collation.

Additional context

Split out from the high-priority list in #4502 (item 5, originally tracked as medium priority) so that #4502 can be closed once the remaining fixes land. Cross-references #2190 and #4496.
