Merge dask, distributed and dask-expr repos? #402
Comments
I think merging distributed in sounds super painful given the long git history, open issues and PRs. Also, the distributed CI is very slow and flaky, so I would expect this to cause pain for dask/dask contributors. We would need to set up a lot more rules to only trigger certain workflows on certain file changes, which would increase CI complexity even further. It's less clear to me that this is a good idea.
As long as the two packages are still separated, this should be easy with two distinct workflow files that target the respective directories.
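To make the "two workflow files, two directories" idea concrete: GitHub Actions supports `paths` filters on `push` and `pull_request` triggers, so each workflow can be limited to its own directory. The selection logic amounts to something like the sketch below. The directory layout and the workflow names (`dask-tests`, `distributed-tests`) are assumptions for illustration, not the projects' actual CI configuration.

```python
from fnmatch import fnmatch

# Hypothetical monorepo layout: each package lives in its own top-level directory.
# Workflow names and path patterns are illustrative assumptions.
WORKFLOW_PATHS = {
    "dask-tests": ["dask/*"],
    "distributed-tests": ["distributed/*"],
}


def workflows_to_run(changed_files, workflow_paths=WORKFLOW_PATHS):
    """Return the workflows whose path patterns match at least one changed file."""
    selected = []
    for name, patterns in workflow_paths.items():
        # fnmatch's "*" also matches "/" so "dask/*" covers nested files.
        if any(fnmatch(path, pat) for path in changed_files for pat in patterns):
            selected.append(name)
    return selected


print(workflows_to_run(["dask/array/core.py"]))
print(workflows_to_run(["distributed/scheduler.py", "dask/base.py"]))
```

A PR that touches both directories triggers both workflows, which is the case where the slow distributed CI would also run for dask changes.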
Sure, if the source was completely separate then you could do that, but what value do you get from bringing things together if they are still separate? I guess you don't need to make two-part PRs any more and can change things in both packages in a single PR. I can definitely see the appeal of that. When working on distributed I blame and bisect a lot to figure things out, so we would have to be careful not to lose the history. But bringing everything into one repo would definitely make bisecting easier. Bringing the issues over from
Yes, that's currently the primary motivation. Eventually I might also be interested in talking about nuking distributed as a dedicated package, but I'm not there yet, and I figure this is a small step that does not bar any future direction.
Yes, I do that a lot, too. I would also only want to do this if we preserve the commit history of both repos. Off the top of my head, I don't know how, but I've done something similar in the past so it should be possible.
Transferring issues wouldn't be a problem, although I'm not sure that's a sensible thing to do. We have 1.3k open issues in distributed. I bet that only a fraction of them are not stale and still actionable. (I'm also happy to introduce a stale bot first if that's a concern.)
Indeed, but I don't think that'd be a terrible thing. I would assume that anybody who has decent knowledge of git could salvage a PR.
Merging git repos is definitely possible, see this SO answer for example. I gave it a quick try and much of it was successful. Notably, and not unexpectedly, there are conflicts with files that exist in both repos, such as GH files, CI, and some docs.
So it would probably take someone knowledgeable of both repos at least a few hours to carefully go through the conflicts to prevent breaking anything, plus renaming files into their own directories. With all this said, merging doesn't look impossible for Dask+Distributed, if all the other aspects (like open issues, PRs, etc.) are resolved in a satisfactory manner for everyone.
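For readers who want a feel for what such a history-preserving merge involves, here is a minimal sketch of one common approach: add the other repo as a remote, fetch it, and merge with `--allow-unrelated-histories`. It is not necessarily what the linked answer does. The helper names (`merge_plan`, `run_plan`), the `../distributed` path and the `main` branch name are assumptions for illustration.

```python
import subprocess


def merge_plan(remote_name, remote_url, branch="main"):
    """Build the git commands that pull another repo's full history into this one."""
    return [
        # Register the other repository as an extra remote.
        ["git", "remote", "add", remote_name, remote_url],
        # Download its commits so they can be merged locally.
        ["git", "fetch", remote_name],
        # The two histories share no common ancestor, so git needs this flag.
        ["git", "merge", "--allow-unrelated-histories", f"{remote_name}/{branch}"],
    ]


def run_plan(plan, cwd):
    """Run each command in order; raises CalledProcessError if one fails,
    which includes a merge that stops on conflicts."""
    for cmd in plan:
        subprocess.run(cmd, cwd=cwd, check=True)


# Print the plan instead of running it; "../distributed" is a placeholder path.
for cmd in merge_plan("distributed", "../distributed"):
    print(" ".join(cmd))
```

Files that exist in both repos (CI configuration, docs) conflict at the merge step, which is why moving each repo's files into its own directory before merging reduces the manual work.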
Yeah, stalebot and then transfer the rest would be a good move.
+1 on this. Recently, we've had an uptick in PRs or issues that span two or even all three repos. Having everything bundled up in a single repo would facilitate these changes, and it sounds like there is a path forward that has little downside.
I frequently feel pain from having two distinct repositories with dask/dask and dask/distributed. Lately we've been working much more on changes that affect both repos, and synchronizing PRs across repos is painful and cumbersome. The addition of dask-expr adds a third repo, and there are occasionally changes that span all three repos (e.g. sending Expr classes to the scheduler without materializing client side).
Additionally, documentation, maintenance and release procedures add work per repo.
The code is currently hard locked anyhow, so we have essentially already sacrificed almost all the flexibility of having multiple repos and are pretty much only paying for the disadvantages.
I would like to propose merging the two (three) repos into a single one. We should still maintain multiple Python packages, so nothing would change for the end user other than having a single issue tracker to report issues to.
The problems I suspect we'll be running into are
pyproject.toml files

Are there problems I haven't thought about? Any other reasons why the two code bases should remain separate? I'm not very familiar with packaging. Is there anything in this realm that needs consideration?
cc @mrocklin @jacobtomlinson @quasiben @jrbourbeau @rjzamora @charlesbluca @hendrikmakait @phofl