-
Notifications
You must be signed in to change notification settings - Fork 1.7k
[WIP] Upgrade to arrow/parquet 57.0.0 #17888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Many of the current failures are due because this used to work: select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))' or SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t After the arrow 57 upgrade it fails with errors like
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns) I believe the problem is that the format of the timezone has changed into I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue |
ed43cc0
to
ee2de0c
Compare
|
||
// Create Flight client | ||
let mut client = FlightServiceClient::connect("http://localhost:50051").await?; | ||
let endpoint = Endpoint::new("http://localhost:50051")?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is due to new version of tonic
|
||
// add an initial FlightData message that sends schema | ||
let options = arrow::ipc::writer::IpcWriteOptions::default(); | ||
let mut compression_context = CompressionContext::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
let validate = | ||
T::validate_decimal_precision(new_value, self.target_precision); | ||
let validate = T::validate_decimal_precision( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to this (get better messages)
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) | ||
List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many of the diffs in this file are related to improvements in DataType
display, tracked in this ticket
I will try and call out individual changes when I see them. Lists are way nicer now:
05)--------ProjectionExec: expr=[] | ||
06)----------CoalesceBatchesExec: target_batch_size=8192 | ||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }]) | ||
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SELECT arrow_typeof(now()::timestamp) | ||
---- | ||
Timestamp(Nanosecond, None) | ||
Timestamp(ns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
## Timestamps: Create a table | ||
|
||
statement ok |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamp format has changed (improved!) so let's also add tests for the new format
pbjson-types = { workspace = true } | ||
prost = { workspace = true } | ||
substrait = { version = "0.58", features = ["serde"] } | ||
substrait = { version = "0.59", features = ["serde"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since prost is updated, we also must update substrait
9d06200
to
1b7b559
Compare
8ecbbed
to
d3b328b
Compare
f61623e
to
9f6a390
Compare
d5bd26e
to
7709acc
Compare
| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false | | ||
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true | | ||
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false | | ||
| alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why the metadata size has increased. I will investigate
let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \ | ||
Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }"; | ||
assert_eq!(expected, arrow_schema.to_string()); | ||
insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
many many diffs are due to the changes in formatting of Fields and DataTypes (see below)
+----------------------+ | ||
| arrow_typeof(test.l) | | ||
+----------------------+ | ||
| List(nullable Int32) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the new display is much easier to read in my opinion
Ok, the tests are now looking good enough to test with the new thrift decoder |
7709acc
to
5e1ea80
Compare
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
🤖 |
🤖: Benchmark completed Details
|
Which issue does this PR close?
57.0.0
(October 2025) arrow-rs#7835TODO:
Rationale for this change
Upgrade to the latest arrow
Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new metadata reader.
What changes are included in this PR?
Are these changes tested?
By CI
Are there any user-facing changes?
New arrow