[Bug] The Hive connector encounters time offset issues when reading and writing data of the timestamp with local time zone field type. #5571
Purpose
Linked issue: open #5568
This PR fixes time zone offset issues in the Hive connector when handling the timestamp with local time zone data type. To explain the problem, we first introduce two semantic interpretations of timestamps: Instant semantics and Local semantics.
Instant Semantics
Instant semantics corresponds to timestamp with local time zone (equivalent to Flink's TIMESTAMP_LTZ type). It stores a fixed UTC timestamp, which is displayed according to the user's current process or session time zone. Users in different time zones see different local times, but they all represent the same absolute moment. The following Java code demonstrates Instant semantics:
Output (same moment, different time zones):
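As a minimal sketch of Instant semantics (the class name, the `render` helper, the sample instant, and the chosen zones are illustrative assumptions, not the PR's original snippet), a single `Instant` can be rendered in several zones:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class: one absolute moment, rendered per viewer time zone.
public class InstantSemanticsDemo {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Renders the given absolute moment as wall-clock time in the given zone.
    static String render(Instant moment, String zone) {
        return FMT.format(moment.atZone(ZoneId.of(zone)));
    }

    public static void main(String[] args) {
        // Assumed sample value: midnight UTC on 2024-01-01.
        Instant moment = Instant.parse("2024-01-01T00:00:00Z");

        System.out.println(render(moment, "UTC"));              // 2024-01-01 00:00:00
        System.out.println(render(moment, "Asia/Shanghai"));    // 2024-01-01 08:00:00
        System.out.println(render(moment, "America/New_York")); // 2023-12-31 19:00:00
    }
}
```

All three lines describe the same absolute moment; only the rendering changes with the zone.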
Local Semantics
Local semantics corresponds to a timezone-agnostic timestamp type. It displays local time without adjusting for time zones. Users in different time zones see the same displayed time, but these times do not represent the same absolute moment. The following Java code demonstrates Local semantics:
Output (same displayed time, different moments):
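As a minimal sketch of Local semantics (the class name, the `momentIn` helper, the sample wall-clock value, and the zones are illustrative assumptions, not the PR's original snippet), the same `LocalDateTime` maps to a different absolute moment in each zone:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical demo class: one wall-clock reading, interpreted per viewer time zone.
public class LocalSemanticsDemo {
    // Resolves the wall-clock value to the absolute moment it denotes in the given zone.
    static Instant momentIn(LocalDateTime wallClock, String zone) {
        return wallClock.atZone(ZoneId.of(zone)).toInstant();
    }

    public static void main(String[] args) {
        // Assumed sample value: every user sees "2024-01-01 00:00:00".
        LocalDateTime wallClock = LocalDateTime.of(2024, 1, 1, 0, 0, 0);

        System.out.println(momentIn(wallClock, "UTC"));              // 2024-01-01T00:00:00Z
        System.out.println(momentIn(wallClock, "Asia/Shanghai"));    // 2023-12-31T16:00:00Z
        System.out.println(momentIn(wallClock, "America/New_York")); // 2024-01-01T05:00:00Z
    }
}
```

The displayed value never changes, but the instant it stands for differs by the zone offset.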
Root Cause Analysis
The Hive connector incorrectly handled timestamp with local time zone data using Local semantics, while engines such as Spark and Flink correctly follow Instant semantics. In a GMT+8 session, this discrepancy produced the following offsets:
Hive writes + Spark/Flink reads: Timestamps appeared 8 hours ahead
Spark writes + Hive reads: Timestamps appeared 8 hours behind
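The two offsets above can be reproduced with a small model of the mismatch. This is only a sketch under stated assumptions: the class and method names are hypothetical, the stored value is modeled as UTC epoch milliseconds, and none of it is the connector's actual code.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical model of writers and readers that disagree on timestamp semantics.
public class SemanticsMismatchDemo {
    // Assumed session time zone (GMT+8).
    static final ZoneId SESSION_ZONE = ZoneId.of("Asia/Shanghai");

    // Instant semantics: interpret the wall clock in the session zone, store UTC millis.
    static long writeInstant(LocalDateTime wallClock) {
        return wallClock.atZone(SESSION_ZONE).toInstant().toEpochMilli();
    }

    // Local semantics: store the wall-clock digits as though they were already UTC.
    static long writeLocal(LocalDateTime wallClock) {
        return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Instant semantics: convert the stored UTC millis back into the session zone.
    static LocalDateTime readInstant(long millis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), SESSION_ZONE);
    }

    // Local semantics: show the stored digits without any zone adjustment.
    static LocalDateTime readLocal(long millis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(millis), ZoneOffset.UTC);
    }

    public static void main(String[] args) {
        LocalDateTime noon = LocalDateTime.of(2024, 1, 1, 12, 0, 0);

        // Local-semantics writer, Instant-semantics reader: 8 hours ahead (20:00).
        System.out.println(readInstant(writeLocal(noon)));
        // Instant-semantics writer, Local-semantics reader: 8 hours behind (04:00).
        System.out.println(readLocal(writeInstant(noon)));
        // Both sides on Instant semantics: the value round-trips unchanged (12:00).
        System.out.println(readInstant(writeInstant(noon)));
    }
}
```

In this model, the offset disappears once the writer and the reader agree on Instant semantics.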
Solution
The fix ensures the Hive connector adheres to Instant semantics for both reading and writing timestamp with local time zone data. This aligns its behavior with Spark and Flink, resolving the time zone offset mismatch.
References:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/timezone/#timestamp_ltz-type
https://spark.apache.org/docs/latest/sql-ref-datatypes.html#TimestampType
https://hive.apache.org/docs/latest/different-timestamp-types_103091503/
Tests
API and Format
Documentation