Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] add support for pandas nullable types #4927
[python-package] add support for pandas nullable types #4927
Changes from all commits
facf766
169a891
d6fe9c8
e1cf6c9
98325e9
b89bf06
2db4d75
8bf1617
530828b
fb5160a
b799847
a1535dd
19e73aa
c070ce2
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This
if
statement was introduced with the aim to save memory in case of original data already havefloat[32|64]
type: #2383. With new unconditionaldata.astype(target_dtype).values
expression and default argumentcopy=True
ofastype()
function, we are loosing that efficiency improvement. Am I wrong?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The
np.ndarray.astype
scopy
argument also defaults toTrue
(docs) so I think the behaviour is the same as it previously was. We could setcopy=False
here if we won't modify the values, and if the dtype matches (all of the columns are either f32 or f64) there won't be any copies.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd be nice to test if any copies are made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously,
np.ndarray.astype()
wasn't executed at all if dtype isfloat
already. So, I think the behaviour is actually changed here.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the
copy=False
argument and added a test to check that no copies are made for a single float dtype df in 98325e9There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more closely I think I could add
np.float32
toLightGBM/python-package/lightgbm/basic.py
Line 549 in 2db4d75
2**31-1
in your data as int32 the common dtype will be int32 and once it reaches any of those lines that turn it to float32 there will be a loss of precision (which is something that I think could happen with the current implementation, although I don't think a lot of people use numbers that big). Example:WDYT @StrikerRUS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, didn't get how adding
np.float32
will help to avoid a loss of precision...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. If we add floats to that function it will cast to float64, i.e.
That also avoids a copy if we have like [int16, int32], the common dtype would be int32 so a copy would be made and then when casting to float another copy would be made. By including float32 in there we can cast to the target dtype only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uhu, thanks for the explanation! I think it's great idea!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added in 530828b