Description
Code Sample, a copy-pastable example if possible
If we create a Timestamp in an ambiguous DST period and specify, via the offset or by supplying Timestamp.value directly, which side of the DST switch the time falls on, the representation shows the other side of the switch. Calling Timestamp.tz.utcoffset(Timestamp) returns the same wrong offset.
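As a sanity check on which offset is the right one, the nanosecond value used in the sample can be decoded with the standard library alone. This is only a sketch: the switch instant is hard-coded on the assumption that UK clocks went back from BST to GMT at 01:00 UTC on 2013-10-27, and the names `VALUE_NS` and `switch_utc` are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

VALUE_NS = 1382837400000000000  # nanoseconds since the Unix epoch, as in the sample

# Interpret the value as a UTC instant (the sub-second part is zero here).
instant_utc = datetime.fromtimestamp(VALUE_NS // 10**9, tz=timezone.utc)

# Assumption: the BST -> GMT switch happened at 01:00 UTC on 2013-10-27.
switch_utc = datetime(2013, 10, 27, 1, 0, tzinfo=timezone.utc)

# At or after the switch the offset is +00:00, before it +01:00.
after_switch = instant_utc >= switch_utc
offset = timedelta(0) if after_switch else timedelta(hours=1)
local_wall = (instant_utc + offset).replace(tzinfo=None)

print(instant_utc.isoformat())  # 2013-10-27T01:30:00+00:00
print(after_switch, local_wall, offset)
```

Under that assumption the instant falls 30 minutes after the switch, so the local wall-clock time 01:30 should carry +0000.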
IN:
t1 = pd.Timestamp(1382837400000000000, tz='dateutil/Europe/London')
t1
OUT:
Timestamp('2013-10-27 01:30:00+0100', tz='dateutil/GB-Eire')
IN:
t2 = pd.Timestamp(1382837400000000000, tz='Europe/London')
t2
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='Europe/London')
Problem description
The reason for this bug looks to be buried deep in the interaction of pandas and dateutil.
So this is what I've been able to dig up. When we try to determine whether we are in DST or not, we rely on timezone.utcoffset of the underlying timezone package. What gets executed in dateutil is this:
def utcoffset(self, dt):
    ...
    return self._find_ttinfo(dt).delta

def _find_ttinfo(self, dt):
    idx = self._resolve_ambiguous_time(dt)
    ...

def _resolve_ambiguous_time(self, dt):
    idx = self._find_last_transition(dt)

    # If we have no transitions, return the index
    _fold = self._fold(dt)
    if idx is None or idx == 0:
        return idx

    # If it's ambiguous and we're in a fold, shift to a different index.
    idx_offset = int(not _fold and self.is_ambiguous(dt, idx))

    return idx - idx_offset

dateutil is expecting an ordinary datetime.timedelta object here, so this is what it does:
- Use _find_last_transition to get the index of the last DST transition before dt. This is done by computing timedelta.total_seconds since epoch time. Our pandas.Timedelta.total_seconds is smart, and returns different total_seconds for before and after DST, since we basically return Timedelta.value, which is the same as Timestamp.value when counting since epoch time (because of how _Timestamp.__sub__ in c_timestamp.pyx is implemented).
This is what we do (doesn't care about dt.replace(tzinfo=None)):
def total_seconds(self):
    """
    Total duration of timedelta in seconds (to microsecond precision).
    """
    # GH 31043
    # Microseconds precision to avoid confusing tzinfo.utcoffset
    return (self.value - self.value % 1000) / 1e9

This is what datetime.timedelta does (loses DST awareness after dt.replace(tzinfo=None)):
def total_seconds(self):
    """Total seconds in the duration."""
    return ((self.days * 86400 + self.seconds) * 10**6 +
            self.microseconds) / 10**6

- The remainder of _resolve_ambiguous_time corrects for ambiguous times, since datetime.timedelta.total_seconds after dt.replace(tzinfo=None) isn't DST-aware. It checks whether we are in an ambiguous period and whether this is the first time this wall-clock time has occurred: this is what self._fold is for. fold is 0 for the first time and 1 for the second time. If it's the first time, dateutil shifts the relevant transition index back by 1, since it assumes that total_seconds always returns the number of seconds calculated using the second time.
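The two steps above can be sketched with a toy transition table. This is a simplified model, not dateutil's actual code or data: `TRANSITIONS`, `OFFSETS`, and the helper functions are hypothetical names, and the table only covers the two 2013 switches for a London-like zone. It follows the same shape as the quoted `_resolve_ambiguous_time`: bisect on wall-clock seconds, then step back one index when the time is ambiguous and `fold` is 0.

```python
import bisect
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Toy table (hypothetical): transition times as wall-clock seconds in standard time.
TRANSITIONS = [
    (datetime(2013, 3, 31, 1, 0) - EPOCH).total_seconds(),   # GMT -> BST
    (datetime(2013, 10, 27, 1, 0) - EPOCH).total_seconds(),  # BST -> GMT
]
# UTC offset in effect after each transition.
OFFSETS = [timedelta(hours=1), timedelta(0)]


def wall_seconds(naive_dt):
    """Seconds since the epoch computed from the naive wall-clock reading."""
    return (naive_dt - EPOCH).total_seconds()


def find_last_transition(naive_dt):
    """Index of the last transition at or before the wall-clock time."""
    return bisect.bisect_right(TRANSITIONS, wall_seconds(naive_dt)) - 1


def is_ambiguous(naive_dt, idx):
    """True if the wall-clock time lies in the hour repeated after transition idx."""
    overlap = (OFFSETS[idx - 1] - OFFSETS[idx]).total_seconds()
    return wall_seconds(naive_dt) < TRANSITIONS[idx] + overlap


def resolve_ambiguous_time(naive_dt):
    idx = find_last_transition(naive_dt)
    fold = getattr(naive_dt, "fold", 0)  # objects without fold count as fold=0
    if idx <= 0:
        return idx
    # First occurrence (fold=0) of an ambiguous time: use the previous entry.
    return idx - int(not fold and is_ambiguous(naive_dt, idx))


def utcoffset(naive_dt):
    return OFFSETS[resolve_ambiguous_time(naive_dt)]


first = datetime(2013, 10, 27, 1, 30, fold=0)
second = datetime(2013, 10, 27, 1, 30, fold=1)
print(utcoffset(first), utcoffset(second))
```

In this model, a datetime that reports fold=0 inside the repeated hour always resolves to the earlier (+01:00) reading, which is the offset shown for t1 in the code sample.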
I'd like to discuss how we are going to approach this. From what I see, there isn't much we can do on our end. Making total_seconds non-DST-aware by default is bad, because that would make our implementation less precise unless the user passes a parameter. Scratch that. The problem isn't so much the total_seconds implementation as it is the Timestamp.__sub__ implementation, which preserves value when we subtract epoch time.
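To make the difference between a value-preserving subtraction and a wall-clock subtraction concrete, here is a small standard-library sketch. It does not use pandas or dateutil: the fixed-offset zones `bst` and `gmt` are stand-ins for the two readings of the ambiguous hour, and all names are chosen for illustration only.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)
EPOCH_NAIVE = datetime(1970, 1, 1)

# Fixed-offset stand-ins for the two sides of the switch (illustrative only).
bst = timezone(timedelta(hours=1))
gmt = timezone(timedelta(0))

# The same wall-clock reading, 01:30, on either side of the switch.
first = datetime(2013, 10, 27, 1, 30, tzinfo=bst)   # 00:30 UTC
second = datetime(2013, 10, 27, 1, 30, tzinfo=gmt)  # 01:30 UTC

# Aware subtraction works on the underlying instant, so the two differ by an hour.
aware_first = (first - EPOCH_UTC).total_seconds()
aware_second = (second - EPOCH_UTC).total_seconds()

# Dropping tzinfo first leaves only the wall-clock reading, so they coincide.
naive_first = (first.replace(tzinfo=None) - EPOCH_NAIVE).total_seconds()
naive_second = (second.replace(tzinfo=None) - EPOCH_NAIVE).total_seconds()

print(aware_second - aware_first)   # 3600.0
print(naive_first == naive_second)  # True
```

The fold correction quoted earlier is written for the second kind of number, where both occurrences of 01:30 map to the same seconds count.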
Another approach is to go to dateutil with this and implement a check there to avoid running the correction if they are dealing with a pandas.Timedelta. Might be tricky to do without introducing a dependency on pandas, though.
First came across this while solving #24329 in #30995
Expected Output
IN:
t1
OUT:
Timestamp('2013-10-27 01:30:00+0000', tz='dateutil/GB-Eire')
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : ru_RU.UTF-8
LOCALE : None.None
pandas : 0.26.0.dev0+1947.gca3bfcc54.dirty
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 44.0.0.post20200106
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.2.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.3.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : 0.8.6
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0