Skip to content

Conversation

@houqp
Copy link
Member

@houqp houqp commented Sep 8, 2021

Which issue does this PR close?

closes #968

Rationale for this change

A Cargo.lock will be useful for projects like datafusion python binding and ballista since they both produce binaries. Right now, our source releases are not reproducible for these two subprojects because we don't ship a lock file to pin dependencies to exact versions. This means if users use our official source release to build binaries, their output could be different than the ones we tested when prepping for the release.

What changes are included in this PR?

Added auto generated lock file.

Are there any user-facing changes?

No

@houqp houqp added the development-process Related to development process of DataFusion label Sep 8, 2021
@github-actions github-actions bot added the python label Sep 8, 2021
@houqp houqp added this to the 5.1.0 milestone Sep 8, 2021
@jorgecarleitao
Copy link
Member

My understanding of cargo is that libraries should not contain a cargo.lock in their git - only binaries should - https://doc.rust-lang.org/cargo/faq.html#why-do-binaries-have-cargolock-in-version-control-but-not-libraries

I did not place Python bindings as a member of the workspace because a workspace by definition shares the cargo.lock across all its members.

So, I think the resolution here is to generate a python/Cargo.lock from the Python bindings alone, which maturin will use. This is also how polars does it.

@houqp
Copy link
Member Author

houqp commented Sep 8, 2021

@jorgecarleitao yes, I noticed the python binding was not part of the workspace later, so I have generated another lock file inside the python folder as well. The root level lockfile is for ballista, not for datafusion.

jorgecarleitao
jorgecarleitao previously approved these changes Sep 8, 2021
@Dandandan
Copy link
Contributor

Nice idea @houqp !

Some thoughts:

One concern I have with using it for a library is that users of datafusion (as a library) will still update dependencies, and any build errors (with newer, failing dependencies) are not caught by having a fixed lock file (whereas we mostly get those errors in CI now in PRs if the build breaks based on incompatible dependencies).

Next to that, do we have any users actually using the source release (e.g. with included Cargo.lock file), instead of only using datafusion in the Cargo.toml file?

On the other hand, it still might have benefit if someone wants to do an (automated) build for distributing the binaries. But I am not sure if currently that's done somewhere?

Also, before we go ahead with this, I think it makes sense to have a plan / document how to update the cargo lock file.

Some options I can think of:

  • Manual/automated frequent update (e.g. weekly / bi-weekly)
  • Update on need
  • Automated, like dependabot (with a lot of dependencies, can cause quite some PR noise)

@jorgecarleitao jorgecarleitao dismissed their stale review September 8, 2021 07:07

Dandan comment.

@houqp
Copy link
Member Author

houqp commented Sep 9, 2021

Good points @Dandandan. Having a lock file for datafusion definitely prevents us from catching build errors from latest dependency releases. I just took a look at the cargo project itself, it's interesting that even though they release binaries, they don't check-in a Cargo.lock file. I am going to do more research on this topic.

Next to that, do we have any users actually using the source release (e.g. with included Cargo.lock file), instead of only using datafusion in the Cargo.toml file?

I am certainly not expecting datafusion users to use the ASF source release. They should always use cargo to manage library dependencies, which will ignore lock files.

This is mainly for the python binding and ballista binaries. The problem right now is ballista and datafusion are in the same workspace, so they are sharing the same lock file :( But now that I took a second look at ballista crates, all of them including the scheduler and the executor are expected to be used as upstream dependencies in the client crate as libraries. So we should hold on adding lock file for ballista crates for now until we move its binaries into their own crates.

The lack of access to reproducible builds from the same source tree has impact to our development and release process too. For example, random build failures like #961 could have been avoided with a lock file. In that case, our CI broke without any code change. Another potential issue it could address is to make sure the final binary wheels we publish to pypi are built with the exact same set of dependencies that are used in our automated and manual tests. The test run and wheel release build could happen at very different times.

In short, I propose we keep the lock file to the python binding for now.

Also, before we go ahead with this, I think it makes sense to have a plan / document how to update the cargo lock file.

Definitely something good to document in the developer doc. I think on demand update would make the most sense because cargo.lock file would change if a new commit adds a new dependency, the lock file should be updated in the same commit, otherwise the build won't be reproducible anymore. I also found it useful to use lock file diff to look for red flags on dependency bloats.

@houqp houqp marked this pull request as draft September 9, 2021 07:29
@alamb
Copy link
Contributor

alamb commented Sep 9, 2021

FWIW we use Cargo.lock in IOx successfully (it works great!), though IOx is definitely an application rather than a library.

Using Cargo.lock to avoid situations where upstream library changes break DataFusion CI like #968 sounds appealing, but then we may just be kicking the problems downstream (and farther from the root of the problem) to the users of DataFusion.

What if we removed datafusion-cli and whatever crates create ballista binaries (executor and core)? out of the main workspace (aka remove them from arrow-datafusion/Cargo.toml)? Then we can have separate Cargo.lock for them but avoid adding Cargo.lock for the datafusion crate?

@houqp houqp marked this pull request as ready for review September 9, 2021 16:36
@houqp houqp requested a review from jorgecarleitao September 9, 2021 16:36
@houqp
Copy link
Member Author

houqp commented Sep 9, 2021

@jorgecarleitao @Dandandan @alamb I have updated to PR to only add the lock file for the python binding crate.

I think it's better for us to skip datafusion-cli and balllista scheduler/executor for now because we are not releasing any official binaries for them. We can come back to move them out of the workspace when we are ready to maintain official binaries for them.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thanks @houqp !

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, @houqp !

@Dandandan
Copy link
Contributor

Looks good @houqp , one minor comment

@Dandandan Dandandan merged commit 75fc80c into apache:master Sep 9, 2021
@Dandandan
Copy link
Contributor

🎉

@houqp houqp deleted the qp_lock branch September 9, 2021 17:07
unkloud pushed a commit to unkloud/datafusion that referenced this pull request Mar 23, 2025
H0TB0X420 pushed a commit to H0TB0X420/datafusion that referenced this pull request Oct 7, 2025
* feat: reads using global ctx

* Add text to io methods to describe the context they are using

---------

Co-authored-by: Tim Saucer <timsaucer@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

development-process Related to development process of DataFusion

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Cargo.lock to the repo root

4 participants