Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Oct 2, 2025

Which issue does this PR close?

TODO:

  • Fix enforce_sorting.rs tests

Rationale for this change

Upgrade to the latest arrow

Also, there are several new features in arrow-57 that I want to be able to test including Variant, arrow-avro, and a new metadata reader.

What changes are included in this PR?

  1. Update arrow/parquet
  2. Update prost
  3. Update substrait
  4. Update pbjson
  5. Make API changes to avoid deprecated APIs

Are these changes tested?

By CI

Are there any user-facing changes?

New arrow

@github-actions github-actions bot added the common Related to common crate label Oct 2, 2025
@github-actions github-actions bot added substrait Changes to the substrait crate proto Related to proto crate labels Oct 2, 2025
@alamb
Copy link
Contributor Author

alamb commented Oct 2, 2025

Many of the current failures are due because this used to work:

select arrow_cast('2021-01-01T00:00:00', 'Timestamp(Nanosecond, Some("-05:00"))'

or

SELECT arrow_cast(secs, 'Timestamp(Millisecond, None)') FROM t

After the arrow 57 upgrade it fails with errors like

statement error DataFusion error: Execution error: Unsupported type 'Timestamp\(Nanosecond, None\)'\. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'\. Error expected double quoted string for Timezone, got 'None'
# arrow_typeof_timestamp
query T
SELECT arrow_typeof(now()::timestamp)
----
Timestamp(ns)

I believe the problem is that the format of the timezone has changed into Timestamp(ns) and then the FromStr method doesn't handle that. I will work on filing an update

I think what we need to do is support both formats for backwards compatibility. I will work on an upstream issue


// Create Flight client
let mut client = FlightServiceClient::connect("http://localhost:50051").await?;
let endpoint = Endpoint::new("http://localhost:50051")?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is due to new version of tonic


// add an initial FlightData message that sends schema
let options = arrow::ipc::writer::IpcWriteOptions::default();
let mut compression_context = CompressionContext::default();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


let validate =
T::validate_decimal_precision(new_value, self.target_precision);
let validate = T::validate_decimal_precision(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(Field { name: "item", data_type: List(Field { name: "item", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of the diffs in this file are related to improvements in DataType display, tracked in this ticket

I will try and call out individual changes when I see them. Lists are way nicer now:

05)--------ProjectionExec: expr=[]
06)----------CoalesceBatchesExec: target_batch_size=8192
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} } }])
07)------------FilterExec: substr(md5(CAST(value@0 AS Utf8View)), 1, 32) IN ([Literal { value: Utf8View("7f4b18de3cfeb9b4ac78c381ee2ad278"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("a"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("b"), field: Field { name: "lit", data_type: Utf8View } }, Literal { value: Utf8View("c"), field: Field { name: "lit", data_type: Utf8View } }])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SELECT arrow_typeof(now()::timestamp)
----
Timestamp(Nanosecond, None)
Timestamp(ns)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


## Timestamps: Create a table

statement ok
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timestamp format has changed (improved!) so let's also add tests for the new format

pbjson-types = { workspace = true }
prost = { workspace = true }
substrait = { version = "0.58", features = ["serde"] }
substrait = { version = "0.59", features = ["serde"] }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since prost is updated, we also must update substrait

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from f61623e to 9f6a390 Compare October 3, 2025 16:04
@github-actions github-actions bot added sql SQL Planner physical-expr Changes to the physical-expr crates optimizer Optimizer rules functions Changes to functions implementation physical-plan Changes to the physical-plan crate labels Oct 3, 2025
@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from d5bd26e to 7709acc Compare October 3, 2025 20:26
| alltypes_plain.parquet | 1851 | 10181 | 2 | page_index=false |
| alltypes_tiny_pages.parquet | 454233 | 881418 | 2 | page_index=true |
| lz4_raw_compressed_larger.parquet | 380836 | 2939 | 2 | page_index=false |
| alltypes_plain.parquet | 1851 | 10309 | 2 | page_index=false |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why the metadata size has increased. I will investigate

let expected = "Field { name: \"c0\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, \
Field { name: \"c1\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }";
assert_eq!(expected, arrow_schema.to_string());
insta::assert_snapshot!(arrow_schema.to_string(), @r#"Field { "c0": nullable Boolean }, Field { "c1": nullable Boolean }"#);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

many many diffs are due to the changes in formatting of Fields and DataTypes (see below)

+----------------------+
| arrow_typeof(test.l) |
+----------------------+
| List(nullable Int32) |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new display is much easier to read in my opinion

@alamb
Copy link
Contributor Author

alamb commented Oct 3, 2025

Ok, the tests are now looking good enough to test with the new thrift decoder

@alamb alamb force-pushed the alamb/upgrade_arrow_57 branch from 7709acc to 5e1ea80 Compare October 3, 2025 21:11
@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: tpch_mem
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 172.82 ms │              170.20 ms │     no change │
│ QQuery 2     │  25.32 ms │               26.80 ms │  1.06x slower │
│ QQuery 3     │  40.67 ms │               35.01 ms │ +1.16x faster │
│ QQuery 4     │  28.11 ms │               27.70 ms │     no change │
│ QQuery 5     │  75.67 ms │               74.89 ms │     no change │
│ QQuery 6     │  19.71 ms │               19.40 ms │     no change │
│ QQuery 7     │ 211.76 ms │              211.24 ms │     no change │
│ QQuery 8     │  33.37 ms │               30.89 ms │ +1.08x faster │
│ QQuery 9     │ 102.87 ms │               93.38 ms │ +1.10x faster │
│ QQuery 10    │  58.60 ms │               57.87 ms │     no change │
│ QQuery 11    │  16.66 ms │               17.12 ms │     no change │
│ QQuery 12    │  51.03 ms │               50.17 ms │     no change │
│ QQuery 13    │  45.56 ms │               46.20 ms │     no change │
│ QQuery 14    │  13.72 ms │               13.95 ms │     no change │
│ QQuery 15    │  24.12 ms │               23.72 ms │     no change │
│ QQuery 16    │  24.35 ms │               24.28 ms │     no change │
│ QQuery 17    │ 147.33 ms │              145.37 ms │     no change │
│ QQuery 18    │ 316.29 ms │              315.38 ms │     no change │
│ QQuery 19    │  36.14 ms │               36.46 ms │     no change │
│ QQuery 20    │  47.59 ms │               48.14 ms │     no change │
│ QQuery 21    │ 327.74 ms │              294.32 ms │ +1.11x faster │
│ QQuery 22    │  20.95 ms │               19.61 ms │ +1.07x faster │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1840.37ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 1782.09ms │
│ Average Time (HEAD)                   │   83.65ms │
│ Average Time (alamb_upgrade_arrow_57) │   81.00ms │
│ Queries Faster                        │         5 │
│ Queries Slower                        │         1 │
│ Queries with No Change                │        16 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: clickbench_partitioned
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.22 ms │                2.14 ms │     no change │
│ QQuery 1     │    49.54 ms │               49.79 ms │     no change │
│ QQuery 2     │   133.80 ms │              141.22 ms │  1.06x slower │
│ QQuery 3     │   154.13 ms │              158.96 ms │     no change │
│ QQuery 4     │   999.02 ms │             1029.05 ms │     no change │
│ QQuery 5     │  1457.78 ms │             1484.94 ms │     no change │
│ QQuery 6     │     2.10 ms │                2.23 ms │  1.06x slower │
│ QQuery 7     │    53.99 ms │               54.98 ms │     no change │
│ QQuery 8     │  1378.95 ms │             1413.04 ms │     no change │
│ QQuery 9     │  1743.92 ms │             1760.12 ms │     no change │
│ QQuery 10    │   378.81 ms │              391.56 ms │     no change │
│ QQuery 11    │   435.88 ms │              445.91 ms │     no change │
│ QQuery 12    │  1318.89 ms │             1368.96 ms │     no change │
│ QQuery 13    │  2094.11 ms │             2109.04 ms │     no change │
│ QQuery 14    │  1255.34 ms │             1275.76 ms │     no change │
│ QQuery 15    │  1162.03 ms │             1224.07 ms │  1.05x slower │
│ QQuery 16    │  2635.18 ms │             2656.01 ms │     no change │
│ QQuery 17    │  2586.47 ms │             2645.02 ms │     no change │
│ QQuery 18    │  5483.36 ms │             4868.30 ms │ +1.13x faster │
│ QQuery 19    │   124.80 ms │              126.32 ms │     no change │
│ QQuery 20    │  2104.33 ms │             1991.58 ms │ +1.06x faster │
│ QQuery 21    │  2473.34 ms │             2322.82 ms │ +1.06x faster │
│ QQuery 22    │  4514.85 ms │             3935.77 ms │ +1.15x faster │
│ QQuery 23    │ 14978.72 ms │            12767.28 ms │ +1.17x faster │
│ QQuery 24    │   214.97 ms │              214.15 ms │     no change │
│ QQuery 25    │   518.82 ms │              516.03 ms │     no change │
│ QQuery 26    │   223.15 ms │              214.72 ms │     no change │
│ QQuery 27    │  2948.86 ms │             2848.18 ms │     no change │
│ QQuery 28    │ 23563.88 ms │            24515.81 ms │     no change │
│ QQuery 29    │   986.38 ms │              966.06 ms │     no change │
│ QQuery 30    │  1301.06 ms │             1316.39 ms │     no change │
│ QQuery 31    │  1344.98 ms │             1342.06 ms │     no change │
│ QQuery 32    │  4990.45 ms │             4519.35 ms │ +1.10x faster │
│ QQuery 33    │  5926.40 ms │             5679.44 ms │     no change │
│ QQuery 34    │  5836.64 ms │             6371.95 ms │  1.09x slower │
│ QQuery 35    │  1978.57 ms │             1959.59 ms │     no change │
│ QQuery 36    │   124.02 ms │              120.34 ms │     no change │
│ QQuery 37    │    51.09 ms │               54.29 ms │  1.06x slower │
│ QQuery 38    │   120.80 ms │              121.73 ms │     no change │
│ QQuery 39    │   198.05 ms │              198.87 ms │     no change │
│ QQuery 40    │    43.84 ms │               43.57 ms │     no change │
│ QQuery 41    │    36.81 ms │               38.73 ms │  1.05x slower │
│ QQuery 42    │    31.34 ms │               31.72 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 97961.63ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 95297.87ms │
│ Average Time (HEAD)                   │  2278.18ms │
│ Average Time (alamb_upgrade_arrow_57) │  2216.23ms │
│ Queries Faster                        │          6 │
│ Queries Slower                        │          6 │
│ Queries with No Change                │         31 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57 (0cfb693) to 0f3cf27 diff using: clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor Author

alamb commented Oct 4, 2025

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2748.60 ms │             2599.96 ms │ +1.06x faster │
│ QQuery 1     │  1344.48 ms │             1212.98 ms │ +1.11x faster │
│ QQuery 2     │  2453.12 ms │             2329.05 ms │ +1.05x faster │
│ QQuery 3     │  1166.94 ms │             1181.82 ms │     no change │
│ QQuery 4     │  2255.61 ms │             2220.57 ms │     no change │
│ QQuery 5     │ 27120.42 ms │            27691.74 ms │     no change │
│ QQuery 6     │  4325.56 ms │             4126.85 ms │     no change │
│ QQuery 7     │  3581.20 ms │             3545.46 ms │     no change │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 44995.92ms │
│ Total Time (alamb_upgrade_arrow_57)   │ 44908.43ms │
│ Average Time (HEAD)                   │  5624.49ms │
│ Average Time (alamb_upgrade_arrow_57) │  5613.55ms │
│ Queries Faster                        │          3 │
│ Queries Slower                        │          0 │
│ Queries with No Change                │          5 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant