[Bug] The Hive connector encounters time offset issues when reading and writing data of the timestamp with local time zone field type. #5571

Open
wants to merge 1 commit into
base: master


@Jack1007 Jack1007 commented May 7, 2025

Purpose

Linked issue: open #5568
This PR fixes time zone offset issues in the Hive connector when handling the timestamp with local time zone data type. To explain the problem, we first introduce two semantic interpretations of timestamps: Instant semantics and Local semantics.

Instant Semantics

Instant semantics corresponds to timestamp with local time zone (equivalent to Flink's TIMESTAMP_LTZ type). It stores a fixed UTC timestamp, which is displayed according to the user's current process or session time zone. Users in different time zones see different local times, but they all represent the same absolute moment. The following Java code demonstrates Instant semantics:

// Instant semantics: one absolute moment, rendered in two time zones
import java.time.Instant;
import java.time.ZoneId;
long nowMs = System.currentTimeMillis();
Instant now = Instant.ofEpochMilli(nowMs);
System.out.println(now.atZone(ZoneId.of("Asia/Shanghai")));
System.out.println(now.atZone(ZoneId.of("Asia/Tokyo")));

Output (same moment, different time zones):

2025-05-07T10:40:32.107+08:00[Asia/Shanghai]
2025-05-07T11:40:32.107+09:00[Asia/Tokyo]

Local Semantics

Local semantics corresponds to a timezone-agnostic timestamp type. It displays local time without adjusting for time zones. Users in different time zones see the same displayed time, but these times do not represent the same absolute moment. The following Java code demonstrates Local semantics:

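// Local semantics: one wall-clock value, attached to two time zones
import java.time.LocalDateTime;
import java.time.ZoneId;
LocalDateTime localDateTime = LocalDateTime.now();
System.out.println(localDateTime.atZone(ZoneId.of("Asia/Shanghai")));
System.out.println(localDateTime.atZone(ZoneId.of("Asia/Tokyo")));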

Output (same displayed time, different moments):

2025-05-07T10:40:32.222+08:00[Asia/Shanghai]
2025-05-07T10:40:32.222+09:00[Asia/Tokyo]

Root Cause Analysis

The Hive connector incorrectly handled timestamp with local time zone data using Local semantics, while engines like Spark and Flink correctly follow Instant semantics. When the session time zone is GMT+8, this mismatch shifts values by 8 hours:

  • Hive writes + Spark/Flink reads: Timestamps appeared 8 hours ahead

  • Spark writes + Hive reads: Timestamps appeared 8 hours behind
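The two cases above can be reproduced with plain java.time, as a rough sketch. The class and method names below are hypothetical and are not taken from the connector; the sketch assumes that a writer with Local semantics stores the wall-clock fields as if they were UTC, and that the session time zone is Asia/Shanghai (GMT+8).

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration only; not the Hive connector's actual code.
public class OffsetMismatchSketch {
    // Assumed session time zone (GMT+8).
    static final ZoneId SESSION_ZONE = ZoneId.of("Asia/Shanghai");

    // Local semantics on write: wall-clock fields are stored as if they were UTC.
    static long writeWithLocalSemantics(LocalDateTime wallClock) {
        return wallClock.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Instant semantics on write: wall clock is interpreted in the session zone.
    static long writeWithInstantSemantics(LocalDateTime wallClock) {
        return wallClock.atZone(SESSION_ZONE).toInstant().toEpochMilli();
    }

    // Local semantics on read: stored value is shown without zone adjustment.
    static LocalDateTime readWithLocalSemantics(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), ZoneOffset.UTC);
    }

    // Instant semantics on read: stored instant is rendered in the session zone.
    static LocalDateTime readWithInstantSemantics(long epochMillis) {
        return LocalDateTime.ofInstant(Instant.ofEpochMilli(epochMillis), SESSION_ZONE);
    }

    public static void main(String[] args) {
        LocalDateTime wallClock = LocalDateTime.of(2025, 5, 7, 10, 0, 0);

        // Local-semantics writer, Instant-semantics reader: 8 hours ahead.
        System.out.println(readWithInstantSemantics(writeWithLocalSemantics(wallClock)));
        // prints 2025-05-07T18:00

        // Instant-semantics writer, Local-semantics reader: 8 hours behind.
        System.out.println(readWithLocalSemantics(writeWithInstantSemantics(wallClock)));
        // prints 2025-05-07T02:00
    }
}
```

When both sides use the same semantics, the shift cancels out and the original 10:00 is read back.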

Solution

The fix ensures the Hive connector adheres to Instant semantics for both reading and writing timestamp with local time zone data. This aligns its behavior with Spark and Flink, resolving the time zone offset mismatch.
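As a minimal sketch of what adhering to Instant semantics means on both paths (the helper names here are hypothetical and this is not the code changed in this PR): the writer converts the session-local wall clock to a UTC epoch value, and the reader converts that epoch value back using its own session time zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical illustration only; not the code changed in this PR.
public class InstantRoundTripSketch {
    // Write path: interpret the wall clock in the writer's session zone, store UTC epoch millis.
    static long toEpochMillis(LocalDateTime wallClock, ZoneId sessionZone) {
        return wallClock.atZone(sessionZone).toInstant().toEpochMilli();
    }

    // Read path: render the stored instant in the reader's session zone.
    static LocalDateTime fromEpochMillis(long epochMillis, ZoneId sessionZone) {
        return Instant.ofEpochMilli(epochMillis).atZone(sessionZone).toLocalDateTime();
    }

    public static void main(String[] args) {
        ZoneId shanghai = ZoneId.of("Asia/Shanghai");
        ZoneId tokyo = ZoneId.of("Asia/Tokyo");

        long stored = toEpochMillis(LocalDateTime.of(2025, 5, 7, 10, 0, 0), shanghai);

        // Same session zone on read: the original wall clock comes back.
        System.out.println(fromEpochMillis(stored, shanghai)); // prints 2025-05-07T10:00
        // Different session zone on read: same moment, shown one hour later.
        System.out.println(fromEpochMillis(stored, tokyo));    // prints 2025-05-07T11:00
    }
}
```

Because every engine stores and reads the same UTC instant, the displayed value depends only on the reader's session time zone, not on which engine wrote the data.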

References:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/timezone/#timestamp_ltz-type
https://spark.apache.org/docs/latest/sql-ref-datatypes.html#TimestampType
https://hive.apache.org/docs/latest/different-timestamp-types_103091503/

Tests

API and Format

Documentation


Jack1007 commented May 8, 2025

@LsomeYeah Please take another look at this PR. Thanks!
