Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] add support for pandas nullable types #4927

Merged
merged 14 commits into from
Feb 24, 2022

Conversation

jmoralez
Copy link
Collaborator

@jmoralez jmoralez commented Jan 6, 2022

This broadens the accepted pandas dataframes dtypes to include the pandas nullable dtypes (Int, boolean and Float).

Closes #4173.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I left a few small suggestions.

python-package/lightgbm/basic.py Outdated Show resolved Hide resolved
tests/python_package_test/test_basic.py Outdated Show resolved Hide resolved
@jmoralez
Copy link
Collaborator Author

Hmm, got this error /home/runner/miniconda/envs/test-env/lib/R/bin/exec/R: symbol lookup error: /home/runner/miniconda/envs/test-env/lib/R/bin/exec/../../lib/../../libreadline.so.8: undefined symbol: tputs in the lint task logs.

and (not is_dtype_sparse(dtype)
or dtype.subtype.name not in pandas_dtype_mapper))]
return bad_indices
return [i for i, dtype in enumerate(dtypes) if not is_numeric_dtype(dtype)]
Copy link
Collaborator

@StrikerRUS StrikerRUS Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By switching from fixed mapper to general is_numeric_dtype() function we are allowing to pass new types that were not allowed previously. For example, np.complex64 and np.float128. In other words, every subclass of numpy.number except numpy.timedelta64:

>>> subdtypes(np.generic)

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int64,
        numpy.longlong,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint64,
        numpy.ulonglong]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]

https://pandas.pydata.org/docs/user_guide/basics.html#:~:text=All%20NumPy%20dtypes%20are%20subclasses%20of%20numpy.generic%3A

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh you're right, I hadn't thought about that. Do you think it'd be better to explicitly list all the allowed dtypes and just do something like: [i for i, dtype in enumerate(dtypes) if isinstance(dtype, allowed_dtypes)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I'm not sure... But if there is no any better solution, I'm OK with explicitly list all the allowed dtypes. Also, we can mix concrete dtypes and some subclasses there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the idea of what is_numeric_dtype does and checked for more specific subclasses (np.integer, np.floating and np.bool_) in 98325e9. Running {dtype: subdtypes(dtype) for dtype in (np.integer, np.floating, np.bool_)} with the subdtypes function in the pandas link you posted returns:

{numpy.integer: [numpy.integer,
  [[numpy.signedinteger,
    [numpy.int8,
     numpy.int16,
     numpy.int32,
     numpy.int64,
     numpy.longlong,
     numpy.timedelta64]],
   [numpy.unsignedinteger,
    [numpy.uint8,
     numpy.uint16,
     numpy.uint32,
     numpy.uint64,
     numpy.ulonglong]]]],
 numpy.floating: [numpy.floating,
  [numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
 numpy.bool_: numpy.bool_}

and I removed the np.float128 and np.timedelta64. Let me know what you think

@@ -546,9 +551,8 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica
raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n"
"Did not expect the data types in the following fields: "
f"{bad_index_cols_str}")
data = data.values
if data.dtype != np.float32 and data.dtype != np.float64:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if statement was introduced with the aim to save memory in case of original data already have float[32|64] type: #2383. With new unconditional data.astype(target_dtype).values expression and default argument copy=True of astype() function, we are loosing that efficiency improvement. Am I wrong?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The np.ndarray.astypes copy argument also defaults to True (docs) so I think the behaviour is the same as it previously was. We could set copy=False here if we won't modify the values, and if the dtype matches (all of the columns are either f32 or f64) there won't be any copies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to test if any copies are made.

Copy link
Collaborator

@StrikerRUS StrikerRUS Jan 11, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The np.ndarray.astypes copy argument also defaults to True (docs) so I think the behaviour is the same as it previously was.

Previously, np.ndarray.astype() wasn't executed at all if dtype is float already. So, I think the behaviour is actually changed here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the copy=False argument and added a test to check that no copies are made for a single float dtype df in 98325e9

Copy link
Collaborator Author

@jmoralez jmoralez Jan 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more closely I think I could add np.float32 to

target_dtype = np.find_common_type((dtype.type for dtype in data.dtypes), [])
so that it considers floats in the target data type, because currently if you have 2**31-1 in your data as int32 the common dtype will be int32 and once it reaches any of those lines that turn it to float32 there will be a loss of precision (which is something that I think could happen with the current implementation, although I don't think a lot of people use numbers that big). Example:

import numpy as np

X = np.array([2**31-1], dtype=np.int32)
print(X[0])  # 2147483647
print(X.astype(np.float32)[0])  # 2147483600.0

WDYT @StrikerRUS?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more closely I think I could add np.float32 to ...

Sorry, didn't get how adding np.float32 will help to avoid a loss of precision...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. If we add floats to that function it will cast to float64, i.e.

import numpy as np

np.find_common_type([np.int32, np.float32], [])  # float64

That also avoids a copy if we have like [int16, int32], the common dtype would be int32 so a copy would be made and then when casting to float another copy would be made. By including float32 in there we can cast to the target dtype only once.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uhu, thanks for the explanation! I think it's great idea!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in 530828b

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jan 14, 2022

Hmm seems like there's no np.float128 in windows haha logs.

@StrikerRUS
Copy link
Collaborator

StrikerRUS commented Jan 14, 2022

Hmm seems like there's no np.float128 in windows

Seems that's true: winpython/winpython#613.

UPD: and not only in Windows: pymc-devs/pymc-resources#90 (comment).

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this feature! I left two very minor stylish comments and one suggestion for checking model trained on nullable dtypes in test for a support of nullable dtypes.

Also, I think we should mark this PR as breaking due to new smart casting algorithm (#4927 (comment)). Right now anything non-float is casted to float32 unconditionally.

tests/python_package_test/test_engine.py Outdated Show resolved Hide resolved
tests/python_package_test/test_engine.py Outdated Show resolved Hide resolved
tests/python_package_test/test_engine.py Outdated Show resolved Hide resolved
@StrikerRUS StrikerRUS requested a review from jameslamb January 26, 2022 17:58
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@StrikerRUS
Copy link
Collaborator

@jmoralez Please update this branch to unblock merging button for this PR.

@jameslamb Would you like to take a look at this PR?

@jameslamb
Copy link
Collaborator

@jameslamb Would you like to take a look at this PR?

Yes please. I would like to review this.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work, thank you! I read through the conversations and reviewed the diff, and agree with the decisions that were made.

I also learned some new numpy features from you through this PR! np.shares_memory() to detect if copies were made is awesome, I'll definitely use that in other projects in the future.

@jameslamb
Copy link
Collaborator

Merging this since both @StrikerRUS and I have approved. @jmoralez sorry I took so long to get back to this and provide a review.

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for pandas nullable types to the sklearn api
3 participants