Skip to content

Conversation

@ddupg
Copy link
Contributor

@ddupg ddupg commented Sep 17, 2025

This PR is to update the following dependencies:

  • DataFusion to 50.0.0
  • Arrow to 56.1

I am working on supporting GEO type #4678 , which is blocked by the DataFusion upgrade. Recently, DataFusion released 50 apache/datafusion#16799 , so follow up on the upgrade

Closes #4753

@LuQQiu
Copy link
Contributor

LuQQiu commented Sep 17, 2025

@ddupg Could you help fix the build issues?

@jackye1995 jackye1995 linked an issue Sep 17, 2025 that may be closed by this pull request
@ddupg
Copy link
Contributor Author

ddupg commented Sep 18, 2025

@ddupg Could you help fix the build issues?

@LuQQiu Sure, now blocked in datafusion-python is not published to pypi, but it may take a few weeks apache/datafusion#16799 (comment)

@westonpace
Copy link
Member

Sure, now blocked in datafusion-python is not published to pypi

Hmm...we use the FFI for our boundary between datafusion-python and lance rust which in theory has a stable ABI. Do we know why we have a hard pin on the datafusion-python version? Do the tests pass if we loosen the pin?

@codecov-commenter
Copy link

codecov-commenter commented Sep 19, 2025

Codecov Report

❌ Patch coverage is 48.00000% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.38%. Comparing base (e464c0a) to head (77f4aa7).

Files with missing lines Patch % Lines
rust/lance-datafusion/src/planner.rs 13.33% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4751      +/-   ##
==========================================
+ Coverage   80.85%   81.38%   +0.52%     
==========================================
  Files         333      333              
  Lines      131944   130020    -1924     
  Branches   131944   130020    -1924     
==========================================
- Hits       106687   105811     -876     
+ Misses      21508    20653     -855     
+ Partials     3749     3556     -193     
Flag Coverage Δ
unittests 81.38% <48.00%> (+0.52%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

let mut added_byte_size = stats
.num_rows
.map(|n| (n * 8).max(64))
.map(|n| n * 8)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ddupg
Copy link
Contributor Author

ddupg commented Sep 19, 2025

Do we know why we have a hard pin on the datafusion-python version? Do the tests pass if we loosen the pin?

Related to #4577, maybe Json type has a requirement for datafusion version.

Do the tests pass if we loosen the pin?

The currently failing tests are not related to datafusion version, I'm trying to fix it.

@ddupg
Copy link
Contributor Author

ddupg commented Sep 20, 2025

The failing test python/tests/test_scalar_index.py::test_null_handling is seems to be caused by a bug of DataFusion rust 50.0.0, and I have raised an issue apache/datafusion#17681 to the DataFusion

@ddupg
Copy link
Contributor Author

ddupg commented Sep 22, 2025

The failing test python/tests/test_scalar_index.py::test_null_handling is seems to be caused by a bug of DataFusion rust 50.0.0, and I have raised an issue apache/datafusion#17681 to the DataFusion

This is not a bug with DataFusion, it seems to be because Lance using indexes and DataFusion have different logic on how to handle NULL. After apache/datafusion#17088, datafusion simplifies the handling of NULL expressions, which affects the results of the test python/tests/test_scalar_index.py::test_null_handling.

In rust/lance-index/src/scalar/expression.rs:create_filter_plan https://github.com/lancedb/lance/blob/5c60975b2c032314304ca1d38865d6eefde4d790/rust/lance-index/src/scalar/expression.rs#L1560, if the filter executed is IN (1, NULL), then the passed Expr is looks like this:

BinaryExpr(
    BinaryExpr {
        left: BinaryExpr(
            BinaryExpr {
                left: Column(
                    Column {
                        relation: None,
                        name: "x",
                    },
                ),
                op: Eq,
                right: Literal(
                    Int32(1),
                    None,
                ),
            },
        ),
        op: Or,
        right: BinaryExpr(
            BinaryExpr {
                left: Column(
                    Column {
                        relation: None,
                        name: "x",
                    },
                ),
                op: Eq,
                right: Cast(
                    Cast {
                        expr: Literal(
                            NULL,
                            None,
                        ),
                        data_type: Int32,
                    },
                ),
            },
        ),
    },
)

but after self.optimize_expr(filter)?, the Eq Expr about NULL is simplified, it looks like this:

BinaryExpr(
    BinaryExpr {
        left: BinaryExpr(
            BinaryExpr {
                left: Column(
                    Column {
                        relation: None,
                        name: "x",
                    },
                ),
                op: Eq,
                right: Literal(
                    Int32(1),
                    None,
                ),
            },
        ),
        op: Or,
        right: Literal(
            Boolean(NULL),
            None,
        ),
    },
)

and then Literal sub Expr cannot apply indexes, resulting in the entire Expr not being able to use indexes, which eventually causes the test to fail.

Can @westonpace please give me some help on how to fix this test?

@timsaucer
Copy link
Contributor

This appears to be a duplicate of my PR: #4778 but since you're actively working on the issue I was just looking at with the null unit test failure, I can close my PR in favor of this one.

@timsaucer
Copy link
Contributor

timsaucer commented Sep 24, 2025

A little more from my own investigation comparing DF 49 to 50:

Screenshot 2025-09-24 at 9 59 41 AM Screenshot 2025-09-24 at 10 00 37 AM

I suspect the issue may not be in the change to the DF optimizer. I am not sure why the prior test did work right now.

Correction: I am beginning to suspect that because the new plan has a Boolean(NULL) that the index read filters are no longer passing these through. It also appears that the output now is closer to what DataFusion expects to give by accident not intention. By that I mean it the optimizer were enhanced to give Int64(NULL) we would go back to the prior result, which is not expected from a DF perspective.

@jackye1995
Copy link
Contributor

Thank you @timsaucer for sharing this. I also chatted with @ddupg offline a bit, sorry I forgot to share the analysis here.

I think we did not handle the case of a Literal(Boolean(Null)) in visit_node.

In the past because the right hand side is a X = NULL, it is handled by visit_comparison, but now it directly returns None, indicating there is no index match which is where the wrong logic is.

@timsaucer
Copy link
Contributor

Thank you @timsaucer for sharing this. I also chatted with @ddupg offline a bit, sorry I forgot to share the analysis here.

I think we did not handle the case of a Literal(Boolean(Null)) in visit_node.

In the past because the right hand side is a X = NULL, it is handled by visit_comparison, but now it directly returns None, indicating there is no index match which is where the wrong logic is.

I do want to point out that "fixing" the issue by handling the literal null boolean will still give different results than an equivalent SQL command. From a SQL perspective not returning the row with the null IS the correct behavior, even though I think it is counter intuitive.

@ddupg
Copy link
Contributor Author

ddupg commented Sep 25, 2025

Thank @jackye1995 @timsaucer both for sharing and thinking, which helped me a lot

I do want to point out that "fixing" the issue by handling the literal null boolean will still give different results than an equivalent SQL command. From a SQL perspective not returning the row with the null IS the correct behavior, even though I think it is counter intuitive.

Agreed.

My personal opinion is that root cause stems from the different semantics of handling NULL when indexing is enabled or not. This is weird behavior, so I think we should make semantics consistent first. I proposed a draft #4812 to solve this problem. This will fix the currently failed test.

Secondly, the DataFusion upgrade exposed a situation where the index is not working as expected. DF 50 optimizes Expr of NULL handling, for example, the Expr x = 1 or x = NULL is optimized to x = 1 or NULL, where the sub Expr NULL cannot infer indexes, resulting in the entire query being unable to apply indexes. This may be something we need to optimize in the future.

@westonpace
Copy link
Member

I agree that we should migrate to the SQL behavior and I agree with @ddupg's summary. Shall I make a PR to fix the second issue?

@westonpace
Copy link
Member

Also, there is a third issue, reported here, which will unfortunately be quite a bit trickier to fix. I don't think that one needs to block the DF upgrade however.

@westonpace
Copy link
Member

I ended up creating #4815 which complements #4812 although maybe #4812 is no longer necessary.

@ddupg ddupg force-pushed the chore/upgrade-df-50 branch from c6493af to a28bf28 Compare September 26, 2025 00:37
@jackye1995
Copy link
Contributor

looks like there are some additional errors, any clue?

@ddupg
Copy link
Contributor Author

ddupg commented Sep 26, 2025

looks like there are some additional errors, any clue?

called `Result::unwrap()` on an `Err` value: IO { source: Custom { kind: Other, error: Generic { store: "LocalFileSystem", source: InvalidUrl { url: Url { scheme: "file", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/C:%5CUsers%5CRUNNER~1%5CAppData%5CLocal%5CTemp%5C.tmpIGGVXH/some_file.lance", query: None, fragment: None } } } }, location: Location { file: "rust\\lance-file\\src\\v2\\writer.rs", line: 191, column: 9 } }

This unusual local path "/C:%5CUsers%5CRUNNER~1%5CAppData%5CLocal%5CTemp%5C.tmpIGGVXH/some_file.lance" appears to have a parsing issue in Windows.

@jackye1995
Copy link
Contributor

hmm this is quite strange, I don't see any change that could affect the windows path handling... any clue? @westonpace @wjones127 @timsaucer

@timsaucer
Copy link
Contributor

hmm this is quite strange, I don't see any change that could affect the windows path handling... any clue? @westonpace @wjones127 @timsaucer

I think I've seen this happen in CI in this repo before, but having trouble finding an exact example from my PRs. In the past rerunning CI has resolved it.

@ddupg ddupg force-pushed the chore/upgrade-df-50 branch 2 times, most recently from 1597a63 to ccac35f Compare September 29, 2025 02:42
@timsaucer
Copy link
Contributor

These new failures I'm also seeing on PRs that are completely unrelated. example: https://github.com/lancedb/lance/actions/runs/18080818538/job/51443808660?pr=4834

@westonpace
Copy link
Member

These new failures I'm also seeing on PRs that are completely unrelated. example: https://github.com/lancedb/lance/actions/runs/18080818538/job/51443808660?pr=4834

Agreed, the test_zonemap_zone_size test was fixed on main

@westonpace
Copy link
Member

I'll see if I can make sense of the Windows failures or at least see what changed

@ddupg ddupg force-pushed the chore/upgrade-df-50 branch 3 times, most recently from 9ed1a11 to 0cd5af9 Compare September 29, 2025 22:14
@ddupg
Copy link
Contributor Author

ddupg commented Sep 29, 2025

I rebased the main branch code and reran the tests multiple times. This error consistently occurs in Windows.

IO { source: Custom { kind: Other, error: Generic { store: "LocalFileSystem", source: InvalidUrl { url: Url { scheme: "file", cannot_be_a_base: false, username: "", password: None, host: None, port: None, path: "/C:%5CUsers%5CRUNNER~1%5CAppData%5CLocal%5CTemp%5C.tmpllPc01/some_file.lance", query: None, fragment: None } } } }, location: Location { file: "rust\\lance-file\\src\\v2\\writer.rs", line: 191, column: 9 } }

@westonpace
Copy link
Member

Tracked down the root cause at least. Still working on fix. The upgrade to DF caused a transitive dependency (url) to upgrade from 2.5.4 to 2.5.7 and this is the root of the failure. My guess is we are hitting this change: servo/rust-url#1060

Anyways, I'm sure there is just some magic incantation of path manipulation that has to happen. My trial and error continues 🫠

Comment on lines 8929 to 8892
version = "2.5.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32f8b686cadd1473f4bd0117a5d28d36b1ade384ea9b5069a1c40aefed7fda60"
checksum = "08bc136a29a3d1758e07a9cca267be308aeebf5cfd5a10f3f67ab2097683ef5b"
dependencies = [
"form_urlencoded",
"idna",
"percent-encoding",
"serde",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are issues filed upstream in both object-store and url crates, can we simply revert these lines of the lock file? Or we can pin url in the cargo.toml to 2.5.4

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datafusion asks for 2.5.7 specifically so we can't revert sadly 😢

@westonpace
Copy link
Member

I've got a fix (more of a workaround) in progress that avoids these tricky paths by reworking how we do temporary directories: #4860

enable_options_value_normalization: false,
collect_spans: false,
map_string_types_to_utf8view: false,
default_null_ordering: NullOrdering::NullsMax,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know if this was the same as before?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the PR apache/datafusion#16963, nulls_max is the default behavior.

Comment on lines 8929 to 8892
version = "2.5.7"
source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "32f8b686cadd1473f4bd0117a5d28d36b1ade384ea9b5069a1c40aefed7fda60"
checksum = "08bc136a29a3d1758e07a9cca267be308aeebf5cfd5a10f3f67ab2097683ef5b"
dependencies = [
"form_urlencoded",
"idna",
"percent-encoding",
"serde",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Datafusion asks for 2.5.7 specifically so we can't revert sadly 😢

@westonpace
Copy link
Member

westonpace commented Oct 2, 2025

The Windows issue is fixed in #4860 if anyone is up for a review (I'm shamelessly 🎣 for reviewers)

@westonpace
Copy link
Member

Ok, merged. If you rebase then hopefully you are good to go.

ddupg added 4 commits October 3, 2025 13:09
Change-Id: Iedc2a3133d2caf8f9630edde5723618a244cb958
Change-Id: Ic0a4e492eb128316df95d04e3780547100b93de1
Change-Id: Iddf384fafe1a59157b780eea0ffa8e037599d372
Change-Id: I833aa60aee64fd06907c1de10a973d3db7c93ec3
@ddupg ddupg force-pushed the chore/upgrade-df-50 branch from 0cd5af9 to 77f4aa7 Compare October 3, 2025 05:11
@ddupg
Copy link
Contributor Author

ddupg commented Oct 3, 2025

Ok, merged. If you rebase then hopefully you are good to go.

Thank @westonpace very much for the fix, all the tests have passed after I rebased

@ddupg
Copy link
Contributor Author

ddupg commented Oct 3, 2025

Could @jackye1995 @westonpace @timsaucer please help review this?

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for pushing this through!

Copy link
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉 🎉 🎉

@westonpace westonpace merged commit 8a0f525 into lance-format:main Oct 3, 2025
29 checks passed
@ddupg ddupg deleted the chore/upgrade-df-50 branch November 7, 2025 02:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bump datafusion (-> 50.0) and arrow (-> 56.0)

6 participants