Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: remove automated ballista CI checks from DataFusion #2679

Closed
alamb opened this issue Jun 2, 2022 · 7 comments · Fixed by #2684
Closed

Proposal: remove automated ballista CI checks from DataFusion #2679

alamb opened this issue Jun 2, 2022 · 7 comments · Fixed by #2684
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jun 2, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Basically the tests added in #2582 to keep ballista and datafusion in sync add significant burden to DataFusion development and I propose removing them, at least temporarily

Here is a description of the process:
https://github.com/apache/arrow-datafusion/blob/907504c/.github/pull_request_template.md?plain=1#L31-L47

I think the rationale for the new CI was to add friction on DataFusion API changes to encourage a more stable API; However, with the currently ongoing efforts to rework the object store and parquet reading, I think all we are doing with the process is slowing things down.

The alternative, to have Ballista keep up with changes in DataFusion, sounds daunting at first, but my firsthand experience suggests it is not that bad. Specifically, https://github.com/influxdata/influxdb_iox, my project, uses DataFusion similarly to Ballista (as the core query engine) and uses a DataFusion pin directly from master. Instead of impinging on the DataFusion development process, we keep IOx up with DataFusion by manually updating the DataFusion pin in IOX about once a week , and sorting out any API changes.

This does take time, but it is mostly mechanical. We do occasionally find bugs that were introduced into DataFusion such as when we tried most recently with https://github.com/influxdata/influxdb_iox/pull/4743 and we then contribute a fix back upstream (e.g. #2674)

I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)

Describe the solution you'd like
I propose removing the Ballista CI check in DataFusion

Specifically this check: https://github.com/apache/arrow-datafusion/blob/907504c5aa768601f9d70ad2c8f928bedfa9b069/.github/workflows/rust.yml#L128-L172

And writing up instructions (maybe even automation) on how to upgrade the datafusion pin in Ballista manually

Describe alternatives you've considered

  • Do nothing
  • Bring ballsita back into DataFusion repositoru

Additional context
The move of Ballista to a new repo is tracked in: #2502

There are several discussions about this pain:

cc @andygrove @thinkharderdev @ming535 @Ted-Jiang @xudong963 @tustvold @korowa

@alamb alamb added the enhancement New feature or request label Jun 2, 2022
@tustvold
Copy link
Contributor

tustvold commented Jun 2, 2022

I think it is worth also highlighting that the arrow releases every 2 weeks are often breaking, at least on paper, and currently require going through this process.

I put some thoughts on this - #2583 (comment)

@andygrove
Copy link
Member

I'm fine with removing these checks.

@ovr
Copy link
Contributor

ovr commented Jun 2, 2022

I would be interested to hear how others keep up with pre-release DataFusion as well (maybe @ovr and cube-js?)

We have two separate projects in the company which use DF as a Query Engine, and we came to the decision to maintain two different versions in the forked repository because it's impossible to use the latest version, there are some problems:

It's hard to contribute back to the upstream. We had subqueries before DF introduced it, We have ANY and UDTFs, etc. We tried to push it back to the upstream, but It's not merged and probably, will not be merged soon.

#2549
#2177

Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny 😄
Don't forget that DF relies on the SQL parser & Arrow which we already forked. The same story, it requires rebasing each time :D

For my project, we rebase every 1-2 months. While rebasing we choose the latest commit from the master branch (We don't wait for releases).

@tustvold
Copy link
Contributor

tustvold commented Jun 2, 2022

Rebasing PRs/Forks when someone loves to move functions/doing breaking changes

Yeah, I do not envy you this task... I have enough difficulty keeping some of my longer-running PRs up to date 😅

FWIW if there is some way I could help you move onto tracking master more directly I would be happy to help out. In particular I notice your fork of arrow-rs is about 9 months out of date, which is likely to become quite painful as DataFusion comes to rely on it even more... 😅

It's hard to contribute back to the upstream

I'm sorry to hear this has been your experience. As mentioned above I'll happily help get contributions shepherded through, but the further things get from arrow-rs the less context I have to be able to meaningfully review things...

@andygrove
Copy link
Member

Rebasing PRs/Forks when someone loves to move functions/doing breaking changes on daily basis is funny smile

If it's any consolation, once #2675 and #2682 are merged I have no more refactoring planned other than removing some of the re-exports in the crate (#2683) in an effort to stabilize the public API for DataFusion.

We're also moving to monthly releases now, with the next release planned for ~2 weeks from now (#2676) so re-basing monthly might be worth considering.

With the refactoring work complete I will have more time for reviews and to help get PRs merged.

@alamb
Copy link
Contributor Author

alamb commented Jun 2, 2022

Also @ovr in IOx, we have found that by using the extension mechanisms in DataFusion (user defined Exprs, User Defined Plan Nodes, etc) we have been able to add IOx specific functionality / plans without a fork of DataFusion, which minimizes the amount of maintenance effort we need to expend to use master datafusion directly

@thinkharderdev
Copy link
Contributor

Seems reasonable. It does create a sort of cyclic dependency :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants