
[Bug] throw Hive DDL and paimon schema mismatched exception when showing the table's schema with the timestamp type #5501


Merged
merged 3 commits on May 8, 2025

Conversation

Contributor

@Jack1007 Jack1007 commented Apr 21, 2025

Purpose

To fix issue #5450

This PR addresses three issues related to Hive3 compatibility with Paimon tables:

  1. Type Validation Failure for Timestamp LTZ
    Hive3 stores the timestamp with local time zone type as timestamp in its metastore, while Paimon explicitly uses the type name timestamp with local time zone during type conversion. Although Hive3 natively supports this type, the metastore's type name mismatch causes validation failures. We resolve this by allowing type compatibility validation to bypass the naming inconsistency.

  2. -8-Hour Timezone Offset in Query Results
    After fixing the first issue, Hive3 successfully executes DDL operations but returns timestamp values with an incorrect -8-hour offset for the Beijing timezone. The root cause lies in the PaimonTimestampLocalTZObjectInspector.getPrimitiveJavaObject method (in the paimon-hive-connector-3.1 module), which fails to apply the session timezone during deserialization. According to the Parquet specification, timestamp with local time zone fields store UTC timestamps and require conversion to the local timezone during processing. We fix this by ensuring proper UTC-to-session-timezone conversion in the deserialization logic.

  3. +8-Hour Offset in SparkSQL Queries After INSERT
    When inserting Beijing local time into a Hive3-created table with a timestamp with local time zone field, querying via SparkSQL reveals an 8-hour offset. This is caused by Hive incorrectly treating timestamp data as local time during serialization, violating the timestamp with local time zone semantics. Specifically, the PaimonTimestampLocalTZObjectInspector.convert method erroneously converts Hive time data to a local Timestamp type instead of adhering to UTC. The fix involves converting the Hive time data to a UTC timestamp before storing it in Paimon’s timestamp type, ensuring alignment with the expected UTC-based storage semantics.

References:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#instant-semantics-timestamps-normalized-to-utc
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/table/timezone/#timestamp_ltz-type
https://spark.apache.org/docs/latest/sql-ref-datatypes.html#TimestampType
https://hive.apache.org/docs/latest/different-timestamp-types_103091503/

Linked issue: #5450

Tests

-- issue 1: show the schema of the table created by spark is ok
-- show create table is ok
hive> show create table test.data_type_table_ts;
OK
CREATE TABLE `test.data_type_table_ts`(
  `id` int COMMENT 'from deserializer', 
  `ts` timestamp with local time zone COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'org.apache.paimon.hive.PaimonSerDe' 
STORED BY 
  'org.apache.paimon.hive.PaimonStorageHandler' 

LOCATION
  'hdfs://nameservice1/user/hive/warehouse/test.db/data_type_table_ts'
TBLPROPERTIES (
  'table_type'='PAIMON')

-- issue 2: select rows inserted by spark in hive is ok.
-- select is ok
hive> select * from test.data_type_table_ts;
OK
1       2025-04-13 16:00:00.0 PRC
2       2025-04-13 17:05:38.11 PRC
3       2025-04-15 13:46:30.0 PRC

-- select with functions is ok
hive> select id,from_unixtime(unix_timestamp(ts), 'yyyy-MM-dd HH:mm:ss') from test.data_type_table_ts;
OK
1       2025-04-13 16:00:00
2       2025-04-13 17:05:38
3       2025-04-15 13:46:30

-- issue 3: insert rows in hive and then select rows in hive and spark is ok.
-- create table using hive3 and insert into rows
hive> show create table test.hive_test_table;
OK
CREATE TABLE `test.hive_test_table`(
  `a` int COMMENT 'The a field', 
  `ts` timestamp with local time zone COMMENT 'The ts field')
ROW FORMAT SERDE 
  'org.apache.paimon.hive.PaimonSerDe' 
STORED BY 
  'org.apache.paimon.hive.PaimonStorageHandler' 
WITH SERDEPROPERTIES ( 
  'serialization.format'='1')
LOCATION
  'hdfs://nameservice1/user/hive/warehouse/test.db/hive_test_table'
TBLPROPERTIES (
  'bucketing_version'='2', 
  'last_modified_by'='hive', 
  'last_modified_time'='1745292917', 
  'transient_lastDdlTime'='1745292917')

hive> insert into test.hive_test_table values (9, '2023-01-12 20:00:01.12345');


-- query with spark and hive
spark-sql (default)> select * from test.hive_test_table where a = 9;
9       2023-01-12 20:00:01.12345

hive> select * from test.hive_test_table where a = 9;
OK
9       2023-01-12 20:00:01.12345 PRC

API and Format

no

Documentation

no

@Jack1007 Jack1007 changed the title fix issue 5450 [Bug] throw Hive DDL and paimon schema mismatched exception when showing the table's schema with the timestamp type Apr 21, 2025
@wwj6591812
Contributor

The ut case failed.
HiveWriteITCase#testInsertLocalZonedTimestamp

@Jack1007
Contributor Author

The ut case failed. HiveWriteITCase#testInsertLocalZonedTimestamp

Fixed, please review again. Thanks.

@wwj6591812
Contributor

cc @zhuangchong, could you take a look? Thanks.

@LsomeYeah
Contributor

LsomeYeah commented Apr 28, 2025

For the type validation failure for Timestamp LTZ, I think the key problem is perhaps not the validation failure itself. To summarize the main steps of creating a table with Spark SQL when the metastore is HMS:

  1. Transform the Spark type to a Paimon type (org.apache.paimon.spark.SparkTypeUtils). timestamp in Spark is transformed to TIMESTAMP(6) WITH LOCAL TIME ZONE.
  2. Create the Paimon table in HiveCatalog (org.apache.paimon.hive.HiveCatalog#createTableImpl). When transforming the Paimon type to a Hive type, TIMESTAMP(6) WITH LOCAL TIME ZONE is transformed to timestamp in Hive, causing the subsequent validation to fail.
    // org.apache.paimon.hive.HiveTypeUtils.PaimonToHiveTypeVisitor
    @Override
    public TypeInfo visit(LocalZonedTimestampType localZonedTimestampType) {
        return LocalZonedTimestampTypeUtils.hiveLocalZonedTimestampType();
    }
    // org.apache.paimon.hive.LocalZonedTimestampTypeUtils
    public static TypeInfo hiveLocalZonedTimestampType() {
        try {
            Class<?> typeInfoFactoryClass =
                    Class.forName("org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory");
            Field field = typeInfoFactoryClass.getField("timestampLocalTZTypeInfo");
            return (TypeInfo) field.get(null);
        } catch (Exception e) {
            return TypeInfoFactory.timestampTypeInfo;
        }
    }

In PR #4571, the timestamp ltz type is determined based on the Hive runtime version. Therefore, I think the key problem is to find out why TIMESTAMP(6) WITH LOCAL TIME ZONE is transformed to timestamp in Hive, rather than just skipping the validation.

Maybe you can test this in org.apache.paimon.spark.sql.DDLWithHiveCatalogTest in paimon-spark-3.5.

@Jack1007
Contributor Author

Jack1007 commented Apr 30, 2025

For type validation failure for Timestamp LTZ, I think the key problem is not the validation failure perhaps. [...]

First, thank you for your previous response. The two-step process you mentioned about Spark table creation is correct. The first step of converting Spark types to Paimon types works properly. However, an issue occurs during the second step when using HiveCatalog (specifically org.apache.paimon.hive.HiveCatalog#createTableImpl) to create the Hive table. The error happens during the convertToFieldSchema conversion to metastore types, particularly at the LocalZonedTimestampTypeUtils#hiveLocalZonedTimestampType method.

As shown in the code snippet you referenced:

public static TypeInfo hiveLocalZonedTimestampType() {
    try {
        Class<?> typeInfoFactoryClass =
                Class.forName("org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory");
        Field field = typeInfoFactoryClass.getField("timestampLocalTZTypeInfo");
        return (TypeInfo) field.get(null);
    } catch (Exception e) {
        // Exception occurs: java.lang.NoSuchFieldException: timestampLocalTZTypeInfo
        return TypeInfoFactory.timestampTypeInfo;
    }
}

The code itself is correct. It attempts to retrieve the timestampLocalTZTypeInfo property from Hive's TypeInfoFactory class in the hive-serde package. In Hive 3, this property exists to support the timestamp with local time zone type, which is a new feature compared to Hive 2. However, when executing the table creation via Spark SQL (Spark 3.5), the issue arises because Spark 3.5 internally bundles Hive 2.3.9. The Hive 2.x hive-serde dependency lacks the timestampLocalTZTypeInfo field, causing the code to fall back to returning timestampTypeInfo. Consequently, the metastore stores the field type as regular timestamp instead of timestamp with local time zone.

When using a native Hive 3 client to create Paimon tables, the timestamp with local time zone type is correctly preserved. To fundamentally resolve this Spark-related issue, one would need to upgrade Spark's built-in Hive dependency to Hive 3. However, this is highly challenging due to Spark's deep integration with Hive 2.x at the code level. Modifying Spark's internal dependencies would be non-trivial and falls outside Paimon's scope of responsibility.

Therefore, a more practical solution for Paimon users—as implemented in my PR—is to relax the type validation checks in Paimon's code. This adjustment allows the system to gracefully handle Spark's Hive 2.x limitations while maintaining compatibility with Hive 3 environments.
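The Hive 2 vs. Hive 3 behavior described above hinges on the reflection fallback in hiveLocalZonedTimestampType. The pattern can be demonstrated in isolation; this sketch uses java.lang.Integer as a stand-in class (the real probe targets TypeInfoFactory, whose fields depend on the Hive version on the classpath).

```java
import java.lang.reflect.Field;

public class ReflectionFallbackDemo {
    // Mirrors the pattern in LocalZonedTimestampTypeUtils: look up a
    // public static field by name via reflection and fall back to a
    // default when the runtime class does not declare it. Hive 3's
    // TypeInfoFactory has timestampLocalTZTypeInfo; the Hive 2.3.9 jars
    // bundled with Spark 3.5 do not, so the catch branch is taken there.
    public static String probe(Class<?> factory, String fieldName, String fallback) {
        try {
            Field f = factory.getField(fieldName);
            return String.valueOf(f.get(null));
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return fallback; // the "Hive 2.x" path: NoSuchFieldException lands here
        }
    }

    public static void main(String[] args) {
        // Integer has no field named "timestampLocalTZTypeInfo", so the
        // fallback is returned, like running against Hive 2.3.9.
        System.out.println(probe(Integer.class, "timestampLocalTZTypeInfo", "timestamp"));
        // Integer.MAX_VALUE is a real public static field: the "Hive 3" path.
        System.out.println(probe(Integer.class, "MAX_VALUE", "timestamp"));
    }
}
```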

@LsomeYeah
Contributor

@Jack1007 Thanks for your explanation! It seems like a compatibility problem between Hive 2 and Hive 3. Could you please split this PR into two: one for type validation and one for the time offset? And please create a new issue describing the time offset problem in detail, as the handling of time offsets in Hive is complex.

@Jack1007
Contributor Author

Jack1007 commented May 6, 2025

@Jack1007 Thanks for your explanation! It seems like a compatibility problem between Hive 2 and Hive 3. Could you please split this PR into two: one for type validation and one for the time offset? And please create a new issue describing the time offset problem in detail, as the handling of time offsets in Hive is complex.

OK. This PR will focus on fixing the timestamp type validation issue. I will create a new PR specifically to address and explain the timezone offset problems along with the proposed resolution steps.

@Jack1007
Contributor Author

Jack1007 commented May 7, 2025

I have split this PR into two separate ones. This PR retains only the code fixes for Hive timestamp type validation, while the other PR specifically addresses and explains the time zone offset issue with the timestamp with local time zone data type in Hive.

The other PR can be found here: #5571

@LsomeYeah Please review this PR again. Thank you!

Comment on lines 258 to 267
if (schemaFieldTypeInfos
        .get(i)
        .getTypeName()
        .equals("timestamp with local time zone")) {
    // Hive timestamp is compatible with paimon timestamp
    // with local time zone
    LOG.info(
            "Hive DDL and paimon schema mismatched, but Hive timestamp is compatible with paimon timestamp with local time zone");
    continue;
}
Contributor

It does not check whether the name of the paimon field is equal to the name of the hive field, and it does not verify whether the two types are compatible (e.g., schemaFieldTypeInfo is ltz but hiveFieldTypeInfo is int, which is not compatible). We can allow ltz in paimon to be compatible with both timestamp and ltz in hive.

Contributor Author

Indeed, there was a logical flaw in this section. I have adjusted the if conditions to maintain the validity of higher-level checks while enforcing stricter field validation—now requiring both matching field names and restricting compatibility solely between Hive's timestamp and Paimon's timestamp with local time zone types.
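The tightened check can be sketched as follows. This is illustrative only: fieldsMatch is a hypothetical name, not the actual Paimon method, but it captures the two conditions described, matching field names plus compatibility restricted to Hive timestamp vs. Paimon timestamp with local time zone.

```java
public class StrictFieldCheck {
    // Field names must match, and the only type-name mismatch tolerated
    // is Hive "timestamp" standing in for Paimon's
    // "timestamp with local time zone" (the Hive 2.x metastore limitation).
    public static boolean fieldsMatch(String hiveName, String hiveType,
                                      String paimonName, String paimonType) {
        if (!hiveName.equals(paimonName)) {
            return false;
        }
        return hiveType.equals(paimonType)
                || (hiveType.equals("timestamp")
                        && paimonType.equals("timestamp with local time zone"));
    }

    public static void main(String[] args) {
        // Accepted: same name, timestamp vs. ltz.
        System.out.println(fieldsMatch("ts", "timestamp", "ts", "timestamp with local time zone"));
        // Rejected: int is not compatible with ltz.
        System.out.println(fieldsMatch("ts", "int", "ts", "timestamp with local time zone"));
    }
}
```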

Contributor

@LsomeYeah LsomeYeah left a comment

+1

Contributor

@yuzelin yuzelin left a comment

+1

@yuzelin yuzelin merged commit ca2b76c into apache:master May 8, 2025
20 checks passed