Get erroneous values if the timestamp is represented by INT96 in the parquet file #9981
Comments
Hi @liukun4515 Can you share an example parquet file? I didn't quite follow your description -- is the issue that your data is already in UTC-7 time (and thus should not be adjusted) but that DataFusion is adjusting the timezone anyway?
Sorry for the bad description, I will share my parquet file next week. Our ETL engine (Spark) writes the parquet files to HDFS, and Spark uses the UTC/UNIX epoch for these timestamps. But in arrow-rs, when we meet INT96 we get the arrow Timestamp datatype. In the definition of timestamp in the arrow data types (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L303), if the timestamp type has no timezone value, it means we don't know the reference of the timestamp (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L318).
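For reference, a minimal sketch (using the arrow crate; the column name `ts` is just an example) of the two timestamp variants being discussed, one without a timezone and one tagged as UTC:

```rust
use arrow::datatypes::{DataType, Field, TimeUnit};

fn main() {
    // No timezone: per Schema.fbs, the epoch reference of the values is unspecified.
    let ts_unzoned = Field::new("ts", DataType::Timestamp(TimeUnit::Nanosecond, None), true);

    // Tagged as UTC: the i64 values are offsets from the UNIX epoch in UTC.
    let ts_utc = Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())),
        true,
    );

    println!("{:?}", ts_unzoned.data_type()); // Timestamp(Nanosecond, None)
    println!("{:?}", ts_utc.data_type()); // Timestamp(Nanosecond, Some("UTC"))
}
```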
I see -- so the expected behavior is that the timestamp should be read as a
@alamb thanks for your feedback. Yes, when we read the timestamp column from the parquet file using the arrow parquet crate, we get a column whose data type is timestamp(unit, None) or timestamp(unit, "UTC"), but we need to use the timestamp with the 'UTC-7' timezone (or another timezone specified by the customer) when we do computation or operations on the timestamp column. I have clarified the issue in the comment apache/arrow-rs#5605 (comment)
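As an illustration of that last step, a minimal sketch, assuming an already-decoded TimestampNanosecondArray whose i64 values are UTC epoch offsets, of attaching a customer-specified timezone with arrow-rs's `with_timezone`. It only re-tags the type metadata; the stored values are not shifted:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};

fn main() {
    // Pretend these nanosecond values came out of the parquet reader
    // as timestamp(Nanosecond, None) or timestamp(Nanosecond, "UTC").
    let raw = TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000i64]);

    // Re-tag the column with the timezone the customer asked for.
    // with_timezone changes only the type metadata; the underlying
    // i64 epoch offsets are left untouched.
    let tagged: ArrayRef = Arc::new(raw.with_timezone("-07:00"));

    println!("{:?}", tagged.data_type()); // Timestamp(Nanosecond, Some("-07:00"))
}
```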
If we could input the schema into the ParquetExec, we could control the timestamp type that is read. We need to wait for the next release of arrow (apache/arrow-rs#5657) before we can input the schema in the ParquetExec.
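A sketch of what supplying a schema at read time might look like once that release is available. This assumes the arrow parquet reader exposes something like `ArrowReaderOptions::with_schema` (the exact API added by apache/arrow-rs#5657 should be checked against the release notes); the file name and the column name `ts` are also assumptions.

```rust
use std::{fs::File, sync::Arc};

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path; the real file is the Spark-written one from this issue.
    let file = File::open("spark_int96.parquet")?;

    // Ask the reader to expose the INT96 column as a zoned timestamp
    // instead of the inferred timestamp(Nanosecond, None) / "UTC".
    let desired = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Nanosecond, Some("-07:00".into())),
        true,
    )]));

    let options = ArrowReaderOptions::new().with_schema(desired);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
    let reader = builder.build()?;

    for batch in reader {
        let batch = batch?;
        println!("{:?}", batch.schema().field(0).data_type());
    }
    Ok(())
}
```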
Describe the bug
We use the INT96 physical type to store the timestamp column, which is written by the Spark ETL engine. When we create the logical/physical plan to read the parquet file containing this timestamp column using the UTC-7 timezone, the resulting values of the timestamp column are adjusted.
The field of the timestamp column in the schema for the physical/logical plan is timestamp(nanosecond, "utc-7") or timestamp(nanosecond, "-07:00").
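For concreteness, here is a minimal sketch of such a read in DataFusion. It is not the reporter's actual setup: the file name `spark_int96.parquet` and the column name `ts` are assumptions, and it requests the zoned type via `arrow_cast` in SQL rather than by supplying a zoned schema to the plan as described above.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical Spark-written file with an INT96 timestamp column named "ts".
    ctx.register_parquet("t", "spark_int96.parquet", ParquetReadOptions::default())
        .await?;

    // Ask for the column as a zoned timestamp; whether the underlying
    // epoch values get shifted is the behavior reported in this issue.
    let df = ctx
        .sql(r#"SELECT arrow_cast(ts, 'Timestamp(Nanosecond, Some("-07:00"))') AS ts_tz FROM t"#)
        .await?;
    df.show().await?;
    Ok(())
}
```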
@waitingkuo @alamb
Do you have any insight into this?
To Reproduce
No response
Expected behavior
No response
Additional context
No response