Get erroneous values if the timestamp is represented by INT96 in the parquet file #9981
Comments
Hi @liukun4515 Can you share an example parquet file? I didn't quite follow your description -- is the issue that your data is already in UTC-7 time (and thus should not be adjusted) but that DataFusion is adjusting the timezone anyway?
Sorry for the bad description, I will share my parquet file next week. Our ETL engine (Spark) writes the parquet files to HDFS, and Spark uses the UTC/UNIX epoch for these timestamps. But in arrow-rs, when we meet INT96 we get the arrow Timestamp datatype. In the definition of timestamp in the arrow data types (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L303), if the timestamp type has no timezone value, it means we don't know the reference of the timestamp (https://github.com/apache/arrow/blob/main/format/Schema.fbs#L318).
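For reference, a minimal sketch (using the arrow crate; the column name `ts` is just an example) of the two timestamp variants being discussed, one without a timezone and one tagged as UTC:

```rust
use arrow::datatypes::{DataType, Field, TimeUnit};

fn main() {
    // No timezone: per Schema.fbs, the epoch reference of the values is unspecified.
    let ts_unzoned = Field::new("ts", DataType::Timestamp(TimeUnit::Nanosecond, None), true);

    // Tagged as UTC: the i64 values are offsets from the UNIX epoch in UTC.
    let ts_utc = Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".into())),
        true,
    );

    println!("{:?}", ts_unzoned.data_type()); // Timestamp(Nanosecond, None)
    println!("{:?}", ts_utc.data_type()); // Timestamp(Nanosecond, Some("UTC"))
}
```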
I see -- so the expected behavior is that the timestamp should be read as a
@alamb thanks for your feedback. Yes, when we read the timestamp column from the parquet file using the arrow parquet crate, we get a column whose data type is timestamp(unit, None) or timestamp(unit, "UTC"), but we need to use the timestamp with the 'UTC-7' timezone (or another timezone specified by the customer) when we do computation or operations on the timestamp column. I have clarified the issue in the comment apache/arrow-rs#5605 (comment)
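As an illustration of that last step, a minimal sketch, assuming an already-decoded TimestampNanosecondArray whose i64 values are UTC epoch offsets, of attaching a customer-specified timezone with arrow-rs's `with_timezone`. It only re-tags the type metadata; the stored values are not shifted:

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};

fn main() {
    // Pretend these nanosecond values came out of the parquet reader
    // as timestamp(Nanosecond, None) or timestamp(Nanosecond, "UTC").
    let raw = TimestampNanosecondArray::from(vec![1_700_000_000_000_000_000i64]);

    // Re-tag the column with the timezone the customer asked for.
    // with_timezone changes only the type metadata; the underlying
    // i64 epoch offsets are left untouched.
    let tagged: ArrayRef = Arc::new(raw.with_timezone("-07:00"));

    println!("{:?}", tagged.data_type()); // Timestamp(Nanosecond, Some("-07:00"))
}
```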
If we could input the schema into the ParquetExec, we could control the timestamp type that is read. We need to wait for the next release of arrow (apache/arrow-rs#5657) before we can input the schema in the ParquetExec.
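A sketch of what supplying a schema at read time might look like once that release is available. This assumes the arrow parquet reader exposes something like `ArrowReaderOptions::with_schema` (the exact API added by apache/arrow-rs#5657 should be checked against the release notes); the file name and the column name `ts` are also assumptions.

```rust
use std::{fs::File, sync::Arc};

use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical path; the real file is the Spark-written one from this issue.
    let file = File::open("spark_int96.parquet")?;

    // Ask the reader to expose the INT96 column as a zoned timestamp
    // instead of the inferred timestamp(Nanosecond, None) / "UTC".
    let desired = Arc::new(Schema::new(vec![Field::new(
        "ts",
        DataType::Timestamp(TimeUnit::Nanosecond, Some("-07:00".into())),
        true,
    )]));

    let options = ArrowReaderOptions::new().with_schema(desired);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;
    let reader = builder.build()?;

    for batch in reader {
        let batch = batch?;
        println!("{:?}", batch.schema().field(0).data_type());
    }
    Ok(())
}
```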
Describe the bug
We use the INT96 physical type to store the timestamp column, which is written by the Spark ETL engine. When we create the logical/physical plan to read the parquet file containing this timestamp column using the UTC-7 timezone, the resulting values of the timestamp column are adjusted.
The field of the timestamp column in the schema for the physical/logical plan is timestamp(nanosecond, "utc-7") or timestamp(nanosecond, "-07:00").
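For concreteness, here is a minimal sketch of such a read in DataFusion. It is not the reporter's actual setup: the file name `spark_int96.parquet` and the column name `ts` are assumptions, and it requests the zoned type via `arrow_cast` in SQL rather than by supplying a zoned schema to the plan as described above.

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical Spark-written file with an INT96 timestamp column named "ts".
    ctx.register_parquet("t", "spark_int96.parquet", ParquetReadOptions::default())
        .await?;

    // Ask for the column as a zoned timestamp; whether the underlying
    // epoch values get shifted is the behavior reported in this issue.
    let df = ctx
        .sql(r#"SELECT arrow_cast(ts, 'Timestamp(Nanosecond, Some("-07:00"))') AS ts_tz FROM t"#)
        .await?;
    df.show().await?;
    Ok(())
}
```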
@waitingkuo @alamb
Do you have any insight into this?
To Reproduce
No response
Expected behavior
No response
Additional context
No response