-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix: support writing NAT in datetime column #146
Bugfix: support writing NAT in datetime column #146
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, looks good!
pyogrio/_io.pyx
Outdated
datetime.second + datetime.microsecond / 10**6, | ||
0 | ||
) | ||
if field_value is None or np.isnat(field_value): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it ever be None?
I am also wondering if we can do np.isnat
more efficiently (knowing that this is basically the smallest int64 value). Although it might not matter much for performance, since below we convert to a Python datetime object, which will probably be much slower and dominate the time anyway (so if we want to improve performance here, if it mattered, we should probably first look there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it ever be None?
Not sure. I was also wondering whether a check for nan would be useful or not. The only case I could imagine is if you are saving a string column into an existing datetime columns (when append is supported) or something like that, but it sounds a bit far fetched.
Possibly the same with None: never gonna happen in a numpy datetime column?
I am also wondering if we can do
np.isnat
more efficiently (knowing that this is basically the smallest int64 value). Although it might not matter much for performance, since below we convert to a Python datetime object, which will probably be much slower and dominate the time anyway (so if we want to improve performance here, if it mattered, we should probably first look there).
Indeed. When I implemented the datetime write support I noticed the performance is indeed bad. I experimented a bit back then trying to speed it up by implementing the parsing of the datetime using only c, functions, but I didn't get it working (immediately). So the python datetime is the big bottleneck, normally np.isnat even should be fast I think because the cdef version of numpy is imported (as well)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the pandas cython libraries, we have a bunch of code to convert a datime64 numpy int to the different fields, but that's quite a lot of code that we can't just copy/paste (and it's not public, so we also can't import it from pandas).
One possible avenue would be to convert the full array up front to the different fields (for that there are public features in pandas). That would help performance, but will increase memory usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some more testing regarding the "fieldvalue is None" check, and apparently
- if you add a value None to a datetime64 numpy array it is automatically converted to NaT
- if you add a value np.nan to a datetime64 numpy array you get an error
So, I removed is "fieldvalue is None" check and added a None case in the unittest, so the test demonstrates this behaviour.
The docker builds fail on the following error. I'm not familiar with the building of wheels... so ?
|
That's fixed in the meantime, if you merge in the main branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @theroggy ! One minor comment but otherwise this looks good to me.
Co-authored-by: Brendan Ward <bcward@astutespruce.com>
closes #147