-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] add support for pandas nullable types #4927
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I left a few small suggestions.
…test for regular numpy dtypes
Hmm, got this error |
python-package/lightgbm/basic.py
Outdated
and (not is_dtype_sparse(dtype) | ||
or dtype.subtype.name not in pandas_dtype_mapper))] | ||
return bad_indices | ||
return [i for i, dtype in enumerate(dtypes) if not is_numeric_dtype(dtype)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By switching from fixed mapper to general is_numeric_dtype()
function we are allowing to pass new types that were not allowed previously. For example, np.complex64
and np.float128
. In other words, every subclass of numpy.number
except numpy.timedelta64
:
>>> subdtypes(np.generic)
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.longlong,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.ulonglong]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you're right, I hadn't thought about that. Do you think it'd be better to explicitly list all the allowed dtypes and just do something like: [i for i, dtype in enumerate(dtypes) if isinstance(dtype, allowed_dtypes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I'm not sure... But if there is no any better solution, I'm OK with explicitly list all the allowed dtypes. Also, we can mix concrete dtypes and some subclasses there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used the idea of what is_numeric_dtype does and checked for more specific subclasses (np.integer, np.floating and np.bool_) in 98325e9. Running {dtype: subdtypes(dtype) for dtype in (np.integer, np.floating, np.bool_)}
with the subdtypes function in the pandas link you posted returns:
{numpy.integer: [numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.longlong,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.ulonglong]]]],
numpy.floating: [numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
numpy.bool_: numpy.bool_}
and I removed the np.float128 and np.timedelta64. Let me know what you think
@@ -546,9 +551,8 @@ def _data_from_pandas(data, feature_name, categorical_feature, pandas_categorica | |||
raise ValueError("DataFrame.dtypes for data must be int, float or bool.\n" | |||
"Did not expect the data types in the following fields: " | |||
f"{bad_index_cols_str}") | |||
data = data.values | |||
if data.dtype != np.float32 and data.dtype != np.float64: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if
statement was introduced with the aim to save memory in case of original data already have float[32|64]
type: #2383. With new unconditional data.astype(target_dtype).values
expression and default argument copy=True
of astype()
function, we are loosing that efficiency improvement. Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The np.ndarray.astype
s copy
argument also defaults to True
(docs) so I think the behaviour is the same as it previously was. We could set copy=False
here if we won't modify the values, and if the dtype matches (all of the columns are either f32 or f64) there won't be any copies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to test if any copies are made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
np.ndarray.astype
scopy
argument also defaults toTrue
(docs) so I think the behaviour is the same as it previously was.
Previously, np.ndarray.astype()
wasn't executed at all if dtype is float
already. So, I think the behaviour is actually changed here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the copy=False
argument and added a test to check that no copies are made for a single float dtype df in 98325e9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more closely I think I could add np.float32
to
LightGBM/python-package/lightgbm/basic.py
Line 549 in 2db4d75
target_dtype = np.find_common_type((dtype.type for dtype in data.dtypes), []) |
2**31-1
in your data as int32 the common dtype will be int32 and once it reaches any of those lines that turn it to float32 there will be a loss of precision (which is something that I think could happen with the current implementation, although I don't think a lot of people use numbers that big). Example:
import numpy as np
X = np.array([2**31-1], dtype=np.int32)
print(X[0]) # 2147483647
print(X.astype(np.float32)[0]) # 2147483600.0
WDYT @StrikerRUS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more closely I think I could add
np.float32
to ...
Sorry, didn't get how adding np.float32
will help to avoid a loss of precision...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. If we add floats to that function it will cast to float64, i.e.
import numpy as np
np.find_common_type([np.int32, np.float32], []) # float64
That also avoids a copy if we have like [int16, int32], the common dtype would be int32 so a copy would be made and then when casting to float another copy would be made. By including float32 in there we can cast to the target dtype only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhu, thanks for the explanation! I think it's great idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in 530828b
Hmm seems like there's no np.float128 in windows haha logs. |
Seems that's true: winpython/winpython#613. UPD: and not only in Windows: pymc-devs/pymc-resources#90 (comment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this feature! I left two very minor stylish comments and one suggestion for checking model trained on nullable dtypes in test for a support of nullable dtypes.
Also, I think we should mark this PR as breaking
due to new smart casting algorithm (#4927 (comment)). Right now anything non-float is casted to float32 unconditionally.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@jmoralez Please update this branch to unblock merging button for this PR. @jameslamb Would you like to take a look at this PR? |
Yes please. I would like to review this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work, thank you! I read through the conversations and reviewed the diff, and agree with the decisions that were made.
I also learned some new numpy
features from you through this PR! np.shares_memory()
to detect if copies were made is awesome, I'll definitely use that in other projects in the future.
Merging this since both @StrikerRUS and I have approved. @jmoralez sorry I took so long to get back to this and provide a review. |
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
This broadens the accepted pandas dataframes dtypes to include the pandas nullable dtypes (Int, boolean and Float).
Closes #4173.