[python-package] Allow to pass Arrow table as training data #6034
Thanks for starting to break this into smaller pieces! I only had a few minutes to quickly scan this, so left some minor comments. I'll leave a more thorough review when I have time.
Could you please describe here why the approach you're recommending is to use pyarrow + converting in that library to corresponding C API functions, instead of having LightGBM link to libarrow and using its entrypoints directly?
Linking to libarrow might complicate LightGBM's build process and it'd introduce some conversations about how we want to handle packaging, but it'd also mean that Arrow support would be available to projects integrating directly with LightGBM's C API. It'd also reduce the amount of additional effort needed to support this in the R package, for example.
I'm NOT saying that we definitely should take that approach, I just would like to hear why you chose not to involve linking to libarrow here.
My primary thought was that including Since we essentially only need an iterator over C-style arrays, I reckoned that there is no harm in adding this implementation myself.
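For readers unfamiliar with the idea, the "iterator over C-style arrays" mentioned above can be sketched roughly as follows. This is only an illustration in Python using the standard library, not the code from this PR: the class name `ChunkedColumn` is hypothetical, and `array.array` merely stands in for the contiguous buffers that an Arrow chunked array exposes.

```python
from array import array
from typing import Iterator, List


class ChunkedColumn:
    """Hypothetical stand-in for one column stored as several contiguous chunks."""

    def __init__(self, chunks: List[array]) -> None:
        self.chunks = chunks

    def __len__(self) -> int:
        # Total number of values across all chunks.
        return sum(len(chunk) for chunk in self.chunks)

    def __iter__(self) -> Iterator[float]:
        # Walk the chunks in order, reading each contiguous buffer front to back.
        for chunk in self.chunks:
            for value in chunk:
                yield value


# Three chunks of doubles, one of them empty (chunks may have different lengths).
column = ChunkedColumn([array("d", [1.0, 2.0]), array("d", []), array("d", [3.0])])
values = list(column)
```

The point of the sketch is that a consumer only needs to step through a sequence of plain arrays, which is why no dependency on a full Arrow library is strictly required for this part.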
That's a fair point, I haven't thought about this tbh. In my mind, LightGBM is only used from Python and R. I don't think I can contribute much to such a conversation though. In any case, I'm happy to update the PR if you come to the conclusion that linking against |
As pointed out in #6022 (comment), nanoarrow might be an alternative.
Linking to |
Thanks @lorentzenchr and @xhochy for your input, and to @borchero for your thorough answer in #6034 (comment).
Let's proceed with the approach already outlined in this PR (no nanoarrow or libarrow). It's the lightest-weight, and I really like that we'd be able to switch to a different approach in the future without breaking users of the Python package, since it doesn't introduce new required dependencies or new code in the public interface (at least of the Python package).
Please see my first small set of additional review comments. We'll try to provide another review in the next few days.
Anybody got an idea what's wrong with the |
In the future when contributing here, please provide a link to the logs when you're asking about a CI job. There are multiple I clicked through the various CI jobs and saw multiple Linux Python jobs failing like this:
Those probably indicate a segmentation fault, invalid memory access, or other process-crashing issue caused by the current state of this branch. I saw that the Windows
That looks like the type of thing that is temporary and usually resolved by a new CI run. You can trigger a new run by pushing an empty commit.
git commit --allow-empty -m "empty commit"
git push origin HEAD
I'm very sorry, I did not realize that there are a bunch more jobs running on Azure, I was only really paying attention to the ones showing up directly on GitHub 👀 will do in the future! Do you have an idea why the CI still requires approval @jameslamb? Would be nice to be able to push more often to fix the remaining issue(s) 😬
I'm not sure why it's still requiring maintainer approval for your CI runs. Maybe that status gets assigned on GitHub's backend based on the time when you open the PR? I'll try to keep clicking the button whenever I see notifications here.
I just updated this with the latest changes from |
@borchero could you please fix the failing tests?
@jameslamb yes, definitely, I will finally take care of this on the weekend! 🙏🏼
Sorry it took so long to continue working on this! Everything I can (easily) run locally passes now, let's see how the CI is doing 😬
Yeah something's still not right with one of the tests... I can't replicate the failure locally (on Apple Silicon using gcc/g++) though. Any guidance on how I can properly debug this?
Don't think the |
The |
In this project, please be very very specific about what CI job you're talking about, and include a link to the logs. There are multiple CI jobs that could be described as "the For example, the Windows
Yes, I'd like to do another release soon. But I wouldn't support holding up that PR waiting for this or other Arrow PRs... I think we should try to get the fix to quantized training (#6092) and support for CUDA 12 (#6099) out soon.
Yeah, I wasn't expecting that 😄 I'm not sure how much work a release is, I was only eager to get a new release with Arrow support once everything has been finished ;)
@shiyu1994 could you please restart the VM we run the CUDA tests on? None have been started in the last 12 hours. https://github.com/microsoft/LightGBM/actions/workflows/cuda.yml They've all been stuck with messages like this:
Ah well, squashing and force-pushing didn't help with the CI @jameslamb 🙄 should I create yet another PR branching off from |
Yes please, could you? If that doesn't work, I'll go search the GitHub forums. I really don't want to turn off this setting... this repo has been the target of spam and abuse in the past, and this mechanism helps prevent some specific forms of it.
Ahh tried in #6165, no success 😬
Totally understandable. There are two different options though, i.e. whether to require approvals for first-time contributors only or for everyone, right? Are you sure that the correct setting is enabled?
@jameslamb I'm trying to restart the VM from my own laptop with the Linux command line but so far no luck. Perhaps I'll need to restart it from Azure tomorrow. Sorry for the inconvenience.
Thanks @shiyu1994, no problem.
The machine has been restarted from within Linux. But it seems that the CI jobs are still queued. |
I just tried manually marking all of the other CUDA CI jobs at https://github.com/microsoft/LightGBM/actions/workflows/cuda.yml Looks like it's still queued after doing that, like this:
I can't access the VM so I don't know, but things I'd investigate:
Just restarted the runner service on the VM. And now it works.
Thanks for the quick help @shiyu1994 ! Glad they're working again. I'll merge this.
@borchero please update the other Arrow PRs and I'll provide reviews when I can.
Motivation
This is the first of a set of replacement PRs to simplify #6022.
Changes
lgb.Dataset.data