[python-package] add support for pandas nullable types #4927

jmoralez · 2022-01-06T02:25:51Z

This broadens the dtypes accepted in pandas DataFrames to include the pandas nullable dtypes (Int, boolean and Float).

Closes #4173.
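For readers unfamiliar with the nullable dtypes, the snippet below is a minimal sketch of the conversion this PR relies on. It is not the code added to basic.py: the column names are made up for illustration, and it only assumes that pandas converts pd.NA to NaN when a nullable column is cast to a float dtype.

```python
import numpy as np
import pandas as pd

# Illustrative frame: two nullable columns with missing values and one plain float column.
df = pd.DataFrame(
    {
        "a": pd.array([1, None, 3], dtype="Int64"),
        "b": pd.array([True, False, None], dtype="boolean"),
        "c": [0.5, 1.5, 2.5],
    }
)

# Casting to a float dtype turns pd.NA into NaN, so the result is a regular float array.
values = df.astype(np.float64).to_numpy()
print(values.dtype)
print(values)
```

Missing entries in the nullable columns end up as NaN in the resulting array, which is the representation LightGBM already treats as missing.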

python-package/lightgbm/basic.py

jameslamb

Thanks for this! I left a few small suggestions.

python-package/lightgbm/basic.py

tests/python_package_test/test_basic.py

jmoralez · 2022-01-11T05:14:50Z

Hmm, got this error in the lint task logs: /home/runner/miniconda/envs/test-env/lib/R/bin/exec/R: symbol lookup error: /home/runner/miniconda/envs/test-env/lib/R/bin/exec/../../lib/../../libreadline.so.8: undefined symbol: tputs

StrikerRUS · 2022-01-11T15:13:03Z

python-package/lightgbm/basic.py

-                                                           and (not is_dtype_sparse(dtype)
-                                                                or dtype.subtype.name not in pandas_dtype_mapper))]
-    return bad_indices
+    return [i for i, dtype in enumerate(dtypes) if not is_numeric_dtype(dtype)]


By switching from the fixed mapper to the general is_numeric_dtype() function, we now accept types that were not allowed previously, for example np.complex64 and np.float128. In other words, every subclass of numpy.number except numpy.timedelta64:

>>> subdtypes(np.generic)
[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8, numpy.int16, numpy.int32, numpy.int64, numpy.longlong, numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.ulonglong]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

https://pandas.pydata.org/docs/user_guide/basics.html#:~:text=All%20NumPy%20dtypes%20are%20subclasses%20of%20numpy.generic%3A

Oh you're right, I hadn't thought about that. Do you think it'd be better to explicitly list all the allowed dtypes and just do something like: [i for i, dtype in enumerate(dtypes) if not isinstance(dtype, allowed_dtypes)]?

TBH, I'm not sure... But if there isn't a better solution, I'm OK with explicitly listing all the allowed dtypes. Also, we can mix concrete dtypes and some subclasses there.

I used the idea of what is_numeric_dtype does and checked for more specific subclasses (np.integer, np.floating and np.bool_) in 98325e9. Running {dtype: subdtypes(dtype) for dtype in (np.integer, np.floating, np.bool_)} with the subdtypes function in the pandas link you posted returns:

{numpy.integer: [numpy.integer, [[numpy.signedinteger, [numpy.int8, numpy.int16, numpy.int32, numpy.int64, numpy.longlong, numpy.timedelta64]], [numpy.unsignedinteger, [numpy.uint8, numpy.uint16, numpy.uint32, numpy.uint64, numpy.ulonglong]]]], numpy.floating: [numpy.floating, [numpy.float16, numpy.float32, numpy.float64, numpy.float128]], numpy.bool_: numpy.bool_}

and I removed np.float128 and np.timedelta64. Let me know what you think.
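The subclass-based check discussed in this thread could look roughly like the sketch below. It is not the code from 98325e9: the helper names is_allowed_numpy_dtype and bad_indices are hypothetical, and the sketch handles only NumPy dtypes, not pandas extension dtypes. The getattr guard is an assumption that covers platforms where np.float128 does not exist.

```python
import numpy as np

# np.float128 is missing on some platforms; fall back to a type no dtype can match.
_FLOAT128 = getattr(np, "float128", type(None))


def is_allowed_numpy_dtype(dtype) -> bool:
    """Hypothetical helper: True for integer, floating and bool dtypes only."""
    scalar_type = np.dtype(dtype).type
    allowed = issubclass(scalar_type, (np.integer, np.floating, np.bool_))
    # timedelta64 subclasses signedinteger and float128 subclasses floating,
    # so both have to be excluded explicitly.
    excluded = issubclass(scalar_type, (np.timedelta64, _FLOAT128))
    return allowed and not excluded


def bad_indices(dtypes):
    """Hypothetical helper: positions of dtypes that would be rejected."""
    return [i for i, dtype in enumerate(dtypes) if not is_allowed_numpy_dtype(dtype)]


dtypes = [np.dtype("int32"), np.dtype("complex64"), np.dtype("float64"),
          np.dtype("m8[ns]"), np.dtype("bool"), np.dtype("O")]
print(bad_indices(dtypes))
```

Checking against the abstract classes keeps the list short, while the explicit exclusions address the concern that is_numeric_dtype() alone would let complex and extended-precision types through.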

StrikerRUS · 2022-01-11T15:19:07Z

python-package/lightgbm/basic.py

@@ -546,9 +551,8 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
            raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
                             "Did not expect the data types in the following fields: "
                             f"{bad_index_cols_str}")
-        data = data.values
-        if data.dtype != np.float32 and data.dtype != np.float64:


This if statement was introduced in #2383 to save memory when the original data already has a float[32|64] type. With the new unconditional data.astype(target_dtype).values expression and the default copy=True argument of astype(), we are losing that efficiency improvement. Am I wrong?

The copy argument of np.ndarray.astype also defaults to True (docs), so I think the behaviour is the same as it was previously. We could set copy=False here if we won't modify the values; then, if the dtype already matches (all of the columns are either f32 or f64), there won't be any copies.

It'd be nice to test if any copies are made.

The copy argument of np.ndarray.astype also defaults to True (docs), so I think the behaviour is the same as it was previously.

Previously, np.ndarray.astype() wasn't executed at all if the dtype was already float. So I think the behaviour has actually changed here.

I added the copy=False argument and a test that checks that no copies are made for a single float dtype df in 98325e9.
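The no-copy behaviour discussed above can be checked on plain NumPy arrays with np.shares_memory(). This is only a sketch of the idea with made-up variable names, not the test added in 98325e9.

```python
import numpy as np

X = np.arange(6, dtype=np.float64).reshape(3, 2)

# dtype already matches: copy=False lets astype() hand back the same buffer.
same_dtype = X.astype(np.float64, copy=False)

# dtype differs: a new buffer is unavoidable even with copy=False.
converted = X.astype(np.float32, copy=False)

# default copy=True: always a new buffer, even when the dtype matches.
default_copy = X.astype(np.float64)

print(np.shares_memory(X, same_dtype))
print(np.shares_memory(X, converted))
print(np.shares_memory(X, default_copy))
```

Only the first call reuses the original memory, which is the saving the earlier if statement was meant to preserve.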

Looking at this a bit more closely I think I could add np.float32 to

LightGBM/python-package/lightgbm/basic.py

Line 549 in 2db4d75

target_dtype = np.find_common_type((dtype.type for dtype in data.dtypes), [])

so that it considers floats in the target data type. Currently, if your data contains 2**31-1 as int32, the common dtype will be int32, and once it reaches any of the lines that convert it to float32 there will be a loss of precision (something that I think can also happen with the current implementation, although I don't think a lot of people use numbers that big). Example:

import numpy as np
X = np.array([2**31-1], dtype=np.int32)
print(X[0])  # 2147483647
print(X.astype(np.float32)[0])  # 2147483600.0

WDYT @StrikerRUS?

Looking at this a bit more closely I think I could add np.float32 to ...

Sorry, didn't get how adding np.float32 will help to avoid a loss of precision...

Sorry. If we add floats to that function it will cast to float64, i.e.

import numpy as np
np.find_common_type([np.int32, np.float32], [])  # float64

That also avoids a copy. If we have something like [int16, int32], the common dtype would be int32, so one copy would be made for that cast and then another when casting to float. By including float32 in there we can cast to the target dtype only once.

uhu, thanks for the explanation! I think it's a great idea!

added in 530828b
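The effect of including np.float32 when computing the common dtype can be sketched as follows. Note the substitution: the PR calls np.find_common_type, which later NumPy releases deprecated, so this sketch uses np.result_type as a stand-in that gives the same results for the dtypes shown. The function name target_dtype is hypothetical.

```python
import numpy as np


def target_dtype(column_dtypes):
    """Hypothetical helper: common dtype of the columns, forced to be a float type."""
    # Adding np.float32 to the candidates promotes integer-only inputs to the
    # smallest float type that can represent them without losing precision.
    return np.result_type(*column_dtypes, np.float32)


print(target_dtype([np.int8, np.int16]))   # small ints fit in float32
print(target_dtype([np.int16, np.int32]))  # int32 needs float64
print(target_dtype([np.float32]))          # float32 stays float32

x = np.array([2**31 - 1], dtype=np.int32)
single_cast = x.astype(target_dtype([x.dtype]))
print(int(single_cast[0]) == 2**31 - 1)
```

Because the target is already a float type, the data is cast once instead of first to a common integer dtype and then to float, and the int32 value from the example above survives the conversion.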

jmoralez · 2022-01-14T03:20:34Z

Hmm seems like there's no np.float128 in windows haha logs.

StrikerRUS · 2022-01-14T13:57:24Z

Hmm seems like there's no np.float128 in windows

Seems that's true: winpython/winpython#613.

UPD: and not only in Windows: pymc-devs/pymc-resources#90 (comment).

tests/python_package_test/test_basic.py

StrikerRUS

Thank you so much for this feature! I left two very minor style comments and one suggestion to check a model trained on nullable dtypes in the test for nullable dtype support.

Also, I think we should mark this PR as breaking due to the new smart casting algorithm (#4927 (comment)). Right now anything non-float is cast to float32 unconditionally.

tests/python_package_test/test_engine.py

StrikerRUS · 2022-02-12T21:53:28Z

@jmoralez Please update this branch to unblock the merge button for this PR.

@jameslamb Would you like to take a look at this PR?

jameslamb · 2022-02-14T06:26:28Z

@jameslamb Would you like to take a look at this PR?

Yes please. I would like to review this.

jameslamb

Amazing work, thank you! I read through the conversations and reviewed the diff, and agree with the decisions that were made.

I also learned some new numpy features from you through this PR! np.shares_memory() to detect if copies were made is awesome, I'll definitely use that in other projects in the future.

jameslamb · 2022-02-24T04:26:37Z

Merging this since both @StrikerRUS and I have approved. @jmoralez sorry I took so long to get back to this and provide a review.

github-actions · 2023-08-23T14:10:20Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

map nullable dtypes to regular float dtypes

facf766

jmoralez added the feature label Jan 6, 2022

jmoralez requested review from henry0312, hzy46, jameslamb, shiyu1994, StrikerRUS and tongwu-sh as code owners January 6, 2022 02:25

cast x3 to float after introducing missing values

169a891

StrikerRUS reviewed Jan 11, 2022

python-package/lightgbm/basic.py (resolved)

jameslamb requested changes Jan 11, 2022

python-package/lightgbm/basic.py (outdated, resolved)

tests/python_package_test/test_basic.py (outdated, resolved)

jmoralez added 2 commits January 10, 2022 21:27

add test for regular dtypes

d6fe9c8

use .astype and then values. update nullable_dtypes test and include …

e1cf6c9

…test for regular numpy dtypes

StrikerRUS reviewed Jan 11, 2022

more specific allowed dtypes. test no copy when single float dtype df

98325e9

jmoralez added 2 commits January 17, 2022 19:56

use np.find_common_type. set np.float128 to None when it isn't supported

b89bf06

set default as type(None)

2db4d75

StrikerRUS reviewed Jan 23, 2022

tests/python_package_test/test_basic.py (outdated, resolved)

StrikerRUS mentioned this pull request Jan 23, 2022

[tests][python] remove compatibility code for old versions in tests #4978

Merged

jmoralez added 2 commits January 25, 2022 13:50

move tests that use lgb.train to test_engine

8bf1617

include np.float32 when finding common dtype

530828b

StrikerRUS approved these changes Jan 26, 2022

tests/python_package_test/test_engine.py (outdated, resolved)

tests/python_package_test/test_engine.py (outdated, resolved)

tests/python_package_test/test_engine.py (outdated, resolved)

StrikerRUS requested a review from jameslamb January 26, 2022 17:58

Apply suggestions from code review

fb5160a

Co-authored-by: Nikita Titov <nekit94-08@mail.ru>

Merge branch 'master' into nullable-dtypes

b799847

jmoralez and others added 3 commits February 17, 2022 10:11

merge master

a1535dd

add linebreak

19e73aa

Merge branch 'master' into nullable-dtypes

c070ce2

jameslamb approved these changes Feb 24, 2022

jameslamb merged commit f185695 into microsoft:master Feb 24, 2022

jmoralez deleted the nullable-dtypes branch February 24, 2022 04:32

jameslamb mentioned this pull request Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

jameslamb mentioned this pull request Oct 31, 2022

Add support for pandas nullable types to the sklearn api #4173

Closed

jmoralez mentioned this pull request Nov 30, 2022

[python-package] replace .values usage with .to_numpy() #5612

Merged

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
