-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
PERF: leverage tzlocal package to provide 2000x speedup for dateutil.tz.tzlocal operations #24737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24737 +/- ##
==========================================
- Coverage 92.39% 92.38% -0.01%
==========================================
Files 166 166
Lines 52358 52358
==========================================
- Hits 48374 48373 -1
- Misses 3984 3985 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24737 +/- ##
===========================================
- Coverage 92.37% 42.75% -49.62%
===========================================
Files 166 171 +5
Lines 52408 52582 +174
===========================================
- Hits 48412 22484 -25928
- Misses 3996 30098 +26102
Continue to review full report at Codecov.
|
mroeschke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall we need to consider whether this is worth bumping up our min dateutil version and adding an optional dependency to accelerate only dateutil.tz.tzlocal
Alternatively, could tzlocal be vendored?
| - numpy=1.12.0 | ||
| - openpyxl=2.5.5 | ||
| - pytables=3.4.2 | ||
| - python-dateutil=2.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to discuss bumping up our minimum version of dateutil
pandas/_libs/tslibs/timezones.pyx
Outdated
| return isinstance(tz, _dateutil_tzlocal) | ||
|
|
||
|
|
||
| cpdef bint have_tzlocal_package(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this just be run once with the imports and just assign the sentinel value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely
pandas/_libs/tslibs/timezones.pyx
Outdated
|
|
||
| try: | ||
| return pytz.timezone(tzlocal_tz) | ||
| except pytz.exceptions.UnknownTimeZoneError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think you need the exceptions here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tzlocal will report the system tz even if it isn't a valid pytz tz. For example, we'll hit this case if /etc/timezone contains foo/bar or, more likely, an ambiguous user timezone like STD.
| index = DatetimeIndex(['2018-06-04 10:20:30', pd.NaT], tz=tz) | ||
| result = index.timetz | ||
|
|
||
| print('result', result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover prints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, will clean up
| expected = Series([time(23, 56, tzinfo=tz), time(21, 24, tzinfo=tz), | ||
| time(22, 14, tzinfo=tz)]) | ||
| result = s.dt.timetz | ||
| print('result', result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover prints?
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we just moved min version for python dateutil
how hard is it to vendor tzlocal?
|
@mroeschke Agreed, it's worth mentioning that this mostly fixes the top performance regression between It'd be relatively simple to vendor it as it's only a handful of files. It's MIT licensed, so should be fine but worth mentioning now. I haven't been able to reproduce the |
|
if we bump to 2.6.0 would u still need tzlocal? |
|
I generally like the idea of avoiding returning Do we have any read on how accurate/canonical tzlocal.get_localzone is? For instance I just called it and got "America/Los_Angeles" back, while I was expecting to get "US/Pacific". Do/should we care about that distinction? Also FWIW, the way dateutil decides to return a tzlocal() object is based on looking at |
|
@jreback Yes, there's just a hard-to-reproduce conflict with 2.5.0-2.5.3. Not sure how serious it is at this point. @jbrockmendel Regarding accuracy, it's pulling from system files like There's no distinction between |
|
Although tzlocal hurts portability (unless that's the intension), I think pandas should ensure that timezones in == timezones out and shouldn't make any assumptions or transformation when returning the timezone (i.e tzlocal in, tzlocal out; US/Pacific in, US/Pacific out), which I can see your PR does @qwhelan . Overall agreed that I'd be great to address this regression. Shot in the dark: I think the proper way to compare dateutil tzs is with |
use |
64e8788 to
79bb80e
Compare
|
Hello @qwhelan! Thanks for updating the PR.
Comment last updated on February 05, 2019 at 07:53 Hours UTC |
3c66b5f to
702b29a
Compare
|
@jreback I'm finally able to reproduce the |
9c695f7 to
6e37589
Compare
…tz.tzlocal operations
|
put in pandas/tseries/_tzlocal.py i think - see how that works |
|
@jreback How about |
|
sure that’s ok too |
Is there anything in particular that bears cythonizing? |
|
Haven't looked closely, but I imagine most all functions here can be |
| cnp.import_array() | ||
|
|
||
|
|
||
| cimport pandas._libs.tslibs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cython was making noise about this when running under 2.7; looks like it doesn't raise the same error under 3. Will remove this.
| cdef bint is_tzlocal(object tz) | ||
|
|
||
| cpdef object get_tzlocal_tz(object tz) | ||
| cpdef _set_tzlocal_tz(object tz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used externally? If so, it shouldn’t be private. If not, it shouldn’t be in the pxd file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently being used to assist in testing, where we cycle through all pytz timezones and check equivalence against tzlocal.
jreback
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I wouldn't change anything in the vendored copy (its hard to tell if you did); I just saw a comment about 'cythonizing it'. Just drop it in place.
| @@ -0,0 +1,5 @@ | |||
| import sys | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't put this in src (which is not a real package, nor do we want it to be), rather just in pandas/_libs/tzlocal is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative would be something like pandas/vendored/. pd.io.clipboard would also belong in such a directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this pandas/_vendored/ (explicity private)
| @@ -0,0 +1,5 @@ | |||
| import sys | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make it clear that this is a vendored copy
|
|
||
|
|
||
| cpdef object get_tzlocal_tz(object tz): | ||
| global tzlocal_tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this to prevent import issues?
|
@qwhelan can you merge master and address build failures? |
| @@ -0,0 +1,5 @@ | |||
| import sys | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this pandas/_vendored/ (explicity private)
|
can you merge master and update |
1 similar comment
|
can you merge master and update |
|
closing as stale. nice idea though, pls ping if you can continue. |
This is a follow-up to #24491, where it was noted that
tzlocaloperations are notably slower:This PR makes them indistinguishable from other timezones:
To do this, we add a new optional dependency:
tzlocal. This package queries the OS to try and identify apytzcompatible timezone name for the local timezone. If this is successful, we can use thepytztimezone instead ofdateutil.tz.tzlocal(), yielding a huge speedup as the latter invokesdatetime.datetime(). In cases where the local timezone is also UTC, further speedups are possible.In all cases, the
tzlocal()object is maintained as the.tzto preserve any semantic meaning.One unresolved issue is an incompatibility with
python-dateutilversions prior to2.6.0. I'm unable to reproduce the error locally, but can repeatably get this error on Travis:While this only shows up on the 2.7 build, I believe this is due to it being the only one actually building against
python-dateutil=2.5.0. Any version>=2.6.0does not exhibit this error.Finally, here is the
asvbenchmark againstv0.24.0rc1:git diff upstream/master -u -- "*.py" | flake8 --diff