Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] Allow to pass Arrow table as training data #6034

Merged
merged 1 commit into from
Nov 1, 2023
Merged

[python-package] Allow to pass Arrow table as training data #6034

merged 1 commit into from
Nov 1, 2023

Conversation

borchero
Copy link
Collaborator

Motivation

This is the first of a set of replacement PRs to simplify #6022.

Changes

  • Allow to Arrow table to lgb.Dataset.data
  • Add tests for Arrow C++ implementation and Python binding

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for starting to break this into smaller pieces! I only had a few minutes to quickly scan this, so left some minor comments. I'll leave a more thorough review when I have time.

Could you please describe here why the approach you're recommending is to use pyarrow + converting in that library to corresponding C API functions, instead of having LightGBM link to libarrow and using its entrypoints directly?

Linking to libarrow might complicate LightGBM's build process and it'd introduce some conversations about how we want to handle packaging, but it'd also mean that Arrow support would also be available to projects integrating directly with LightGBM's C API. It'd also reduce the amount of additional effort needed to support this in the R package, for example.

I'm NOT saying that we definitely should take that approach, I just would like to hear why you chose not to involve linking to libarrow here.

.gitignore Outdated Show resolved Hide resolved
python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
@borchero
Copy link
Collaborator Author

borchero commented Aug 12, 2023

Could you please describe here why the approach you're recommending is to use pyarrow + converting in that library to corresponding C API functions, instead of having LightGBM link to libarrow and using its entrypoints directly?

My primary thought was that including libarrow needlessly blows up the size of the dependencies (in particular, for the Python package where libarrow would need to be included). Using the C data interface follows the path that XGBoost has taken to add Arrow support (see dmlc/xgboost#7512). This happens to align with the design choice in other popular software such as DuckDB.

Since we essentially only need an iterator over C-style arrays, I reckoned that there is no harm in adding this implementation myself.

Linking to libarrow might complicate LightGBM's build process and it'd introduce some conversations about how we want to handle packaging, but it'd also mean that Arrow support would also be available to projects integrating directly with LightGBM's C API.

That's a fair point, I haven't thought about this tbh. In my mind, LightGBM is only used from Python and R. I don't think I can contribute much to such a conversation though.


In any case, I'm happy to update the PR if you come to the conclusion that linking against libarrow is preferable :)

@lorentzenchr
Copy link
Contributor

As pointed out in #6022 (comment), nanoarrow might be an alternative.

@xhochy
Copy link

xhochy commented Sep 6, 2023

I'm NOT saying that we definitely should take that approach, I just would like to hear why you chose not to involve linking to libarrow here.

Linking to libarrow is only worth it if you use additional functionality of it that goes beyond passing Arrow structures around. The additional functionality of libarrow comes at the cost of dealing with complex dependencies. If you have the luxury of having conda, this is fine but in a pip/CRAN-only world, it is a big hassle.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lorentzenchr and @xhochy for your input, and to @borchero for your thorough answer in #6034 (comment).

Let's proceed with the approach already outlined in this PR (no nanoarrow or `libarrow). It's the lightest-weight, and I really like that we'd be able to switch to a different approach in the future without breaking users of the Python package, since it doesn't introduce new required dependencies or new code in the public interface (at least of the Python package).

Please see my first small set of additional review comments. We'll try to provide another review in the next few days.

.gitignore Outdated Show resolved Hide resolved
python-package/lightgbm/arrow.py Outdated Show resolved Hide resolved
.github/workflows/python_package.yml Outdated Show resolved Hide resolved
@borchero
Copy link
Collaborator Author

Anybody got an idea what's wrong with the bdist job? The CI failure message is not particularly helpful and I can't replicate environment creation issues locally 👀

.ci/test.sh Outdated Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

Anybody got an idea what's wrong with the bdist job?

In the future when contributing here, please provide a link to the logs when you're asking about a CI job. There are multiple bdist CI jobs in this project, and you could save us some time and effort by linking to exactly what you're talking about.

I clicked through the various CI jobs and saw multiple Linux Python jobs failing like this:

tests/c_api_test/test_.py ..                                             [  0%]
...
tests/python_package_test/test_engine.py ..............................s [ 53%]
...s....................................................../__w/1/s/.ci/test.sh: line 201: 10546 Killed                  pytest $BUILD_DIRECTORY/tests

(example build)

Those probably indicate a segmentation fault, invalid memory access, or other process-crashing issue caused by the current state of this branch.

I saw that the Windows bdist job is failing like this:

pytest : The term 'pytest' is not recognized as the name of a cmdlet, function, script file, or operable program. 
Check the spelling of the name, or if a path was included, verify that the path is correct and try again.
At D:\a\1\s\.ci\test_windows.ps1:123 char:1
+ pytest $tests ; Check-Output $?
+ ~~~~~~
    + CategoryInfo          : ObjectNotFound: (pytest:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

(build link)

That looks like the type of thing that is temporary and usually resolved by a new CI run. You can trigger a new run by pushing an empty commit.

git commit --allow-empty -m "empty commit"
git push origin HEAD

@borchero
Copy link
Collaborator Author

In the future when contributing here, please provide a link to the logs when you're asking about a CI job

I'm very sorry, I did not realize that there is a bunch more jobs running on Azure, I was only really paying attention to the ones showing up directly on GitHub 👀 will do in the future!

Do you have an idea why the CI still requires approval @jameslamb? Would be nice to be able to push more often to fix the remaining issue(s) 😬

@jameslamb
Copy link
Collaborator

Do you have an idea why the CI still requires approval @jameslamb? Would be nice to be able to push more often to fix the remaining issue(s)

I'm not sure why it's still requiring maintainer approval for your CI runs. Maybe that status gets assigned on GitHub's backend based on the time when you open the PR?

I'll try to keep clicking the button whenever I see notifications here.

@jameslamb
Copy link
Collaborator

I just updated this with the latest changes from master. Once CI builds, I'll give it another review.

@jameslamb
Copy link
Collaborator

@borchero could you please fix the failing tests? @ me if you need help.

@borchero
Copy link
Collaborator Author

@jameslamb yes, definitely, I will finally take care of this on the weekend! 🙏🏼

@borchero
Copy link
Collaborator Author

borchero commented Oct 12, 2023

Sorry it took so long to continue working on this! Everything I can (easily) run locally passes now, let's see how the CI is doing 😬

@borchero
Copy link
Collaborator Author

Yeah something's still not right with one of the tests... I can't replicate the failure locally (on Apple Silicon using gcc/g++) though. Any guidance how I can properly debug this?

@borchero
Copy link
Collaborator Author

borchero commented Oct 12, 2023

Don't think the R-package failures are related to this PR (or at least it's not obvious to me how)

@borchero
Copy link
Collaborator Author

The bdist Python-package job seems to be failing because the conda env cannot be solved. I cannot yet replicate this locally though (even for osx-64)

@jameslamb
Copy link
Collaborator

The bdist Python-package job seems to be failing because the conda env cannot be solved

In this project, please be very very specific about what CI job you're talking about, and include a link to the logs. There are multiple CI jobs that could be described as "the bdist Python-package job".

For example, the Windows bdist job running on Windows is not failing because of a conda env resolution failure.... one of the unit tests is failing.

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=15326&view=logs&j=ea56812e-e7ae-55d0-6abc-4a217857fa9f&t=e61701f8-4e1b-5007-20cb-a000ab3587fb

FAILED ..\tests\python_package_test\test_engine.py::test_model_size - lightgb...

E lightgbm.basic.LightGBMError: bad allocation

@jmoralez
Copy link
Collaborator

Hey, sorry. I think @borchero meant this one but I reran it.

@jameslamb
Copy link
Collaborator

do you think it is viable to release a new version of LightGBM once this is done

Yes, I'd like to do another release soon. But I wouldn't support holding up that PR waiting for this or other Arrow PRs... I think we should try to get the fix to quantized training (#6092) and support for CUDA 12 (#6099) out soon.

@borchero
Copy link
Collaborator Author

I wouldn't support holding up that PR waiting for this or other Arrow PRs...

Yeah, I wasn't expecting that 😄 I'm not sure how much work a release is, I was only eager to get a new release with Arrow support once everything has been finished ;)

@jameslamb
Copy link
Collaborator

@shiyu1994 could you please restart the VM we run the CUDA tests on?

None have been started in the last 12 hours.

image

https://github.com/microsoft/LightGBM/actions/workflows/cuda.yml

They've all been stuck with messages like this:

Requested labels: self-hosted, linux
Job defined at: microsoft/LightGBM/.github/workflows/cuda.yml@refs/pull/6161/merge
Waiting for a runner to pick up this job...

@borchero
Copy link
Collaborator Author

Ah well, squashing and force-pushing didn't help with the CI @jameslamb 🙄 should I create yet another PR branching off from master? 👀

@jameslamb
Copy link
Collaborator

should I create yet another PR branching off from master? 👀

Yes please, could you? If that doesn't work, I'll go search the GitHub forums.

I really dont want to turn off this setting... this repo has been the target of spam and abuse in the past, and this mechanism helps prevent some specific forms of it.

@borchero
Copy link
Collaborator Author

Ahh tried in #6165, no success 😬

I really dont want to turn off this setting...

Totally understandable. There are two different options though, i.e. whether to require approvals for first-time contributors only or for everyone, right? Are you sure that the correct setting is enabled?

@shiyu1994
Copy link
Collaborator

@jameslamb I'm trying to restart the VM from my own laptop with the Linux command line but so far no luck. Perhaps I'll need to restart it from Azure tomorrow. Sorry for the inconvenience.

@jameslamb
Copy link
Collaborator

Thanks @shiyu1994 , no problem.

@shiyu1994
Copy link
Collaborator

The machine has been restarted from within Linux. But it seems that the CI jobs are still queued.

@jameslamb
Copy link
Collaborator

I just tried manually marking all of the other CUDA CI jobs at https://github.com/microsoft/LightGBM/actions/workflows/cuda.yml failed, and then restarting them for this PR: https://github.com/microsoft/LightGBM/actions/runs/6699716369

Looks like it's still queued after doing that, like this:

Requested labels: self-hosted, linux
Job defined at: microsoft/LightGBM/.github/workflows/cuda.yml@refs/pull/6034/merge
Waiting for a runner to pick up this job...

I can't access the VM so I don't know, but things I'd investigate:

  • does the VM have the correct labels in Azure
  • is it reachable? (like did the networking configuration change?)
  • did something else change in that Azure account, or do we need to re-authorize Azure in this repo's settings somehow?

@shiyu1994
Copy link
Collaborator

Just restarted the runner service on the VM. And now it works.

@jameslamb jameslamb self-requested a review November 1, 2023 13:19
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick help @shiyu1994 ! Glad they're working again. I'll merge this.

@borchero please update thd other Arrow PRs and I'll provide reviews when I can.

Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants