combine_first loses index type information with MultiIndices and different timezones #13650

Closed
@multiloc

See the title and the example below. I believe the cause is that combining indices with different timezones first converts them to object dtype, then rebases all timestamps to UTC for comparison, and then constructs a DatetimeIndex from the result. However, this does not seem to be applied to the individual levels of a MultiIndex. This is on the latest stable release, 0.18.1.

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'UTC'
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [4]: df1.index.get_level_values(0)
Out[4]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [5]: df2.index.get_level_values(0)
Out[5]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [6]: result.index.get_level_values(0)
Out[6]: 
Index([2016-01-01 00:00:00-05:00, 2016-01-02 00:00:00-05:00,
       2016-01-03 00:00:00+00:00, 2016-01-04 00:00:00+00:00],
      dtype='object')
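
To spot this kind of fallback quickly, a small check over the levels can help. The following is only a sketch: `non_datetime_levels` is a hypothetical helper name, not part of pandas, and it simply reports which level positions are not a DatetimeIndex.

```python
import pandas as pd


def non_datetime_levels(index):
    """Return positions of MultiIndex levels that are not a DatetimeIndex.

    Hypothetical helper for illustration; not part of pandas.
    """
    return [
        i
        for i in range(index.nlevels)
        if not isinstance(index.get_level_values(i), pd.DatetimeIndex)
    ]


tz = "America/New_York"
stamps = [pd.Timestamp("20160101", tz=tz), pd.Timestamp("20160102", tz=tz)]

# Both levels are tz-aware datetimes, so nothing is reported.
all_datetime = pd.MultiIndex.from_arrays([stamps, stamps])
# The second level holds strings, so position 1 is reported.
mixed = pd.MultiIndex.from_arrays([stamps, ["a", "b"]])

clean_report = non_datetime_levels(all_datetime)
mixed_report = non_datetime_levels(mixed)
```

Running the same helper on `result.index` from the example above would list the levels that fell back to object dtype.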

This works correctly if the inputs have the same timezone:

In [12]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'America/New_York' 
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:
:--

In [13]: result.index.get_level_values(0)
Out[13]: 
DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00',
               '2016-01-03 00:00:00-05:00', '2016-01-04 00:00:00-05:00'],
              dtype='datetime64[ns, America/New_York]', freq=None)
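
Since the same-timezone case keeps the datetime type, one possible workaround is to convert every tz-aware level to a common timezone before combining. This is a minimal sketch, not a fix for the underlying behavior: `levels_to_utc` is a hypothetical helper, and the choice of UTC as the common timezone is an assumption.

```python
import pandas as pd


def levels_to_utc(index):
    """Rebuild a MultiIndex with every tz-aware datetime level converted to UTC.

    Hypothetical helper for illustration; not part of pandas.
    """
    arrays = []
    for i in range(index.nlevels):
        values = index.get_level_values(i)
        # Only tz-aware datetime levels can be converted; others pass through.
        if isinstance(values, pd.DatetimeIndex) and values.tz is not None:
            values = values.tz_convert("UTC")
        arrays.append(values)
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


tz1, tz2 = "America/New_York", "UTC"

from1 = [pd.Timestamp("20160101", tz=tz1), pd.Timestamp("20160102", tz=tz1)]
to1 = [pd.Timestamp("20160102", tz=tz1), pd.Timestamp("20160103", tz=tz1)]
from2 = [pd.Timestamp("20160103", tz=tz2), pd.Timestamp("20160104", tz=tz2)]
to2 = [pd.Timestamp("20160104", tz=tz2), pd.Timestamp("20160105", tz=tz2)]

index1 = pd.MultiIndex.from_arrays([from1, to1])
index2 = pd.MultiIndex.from_arrays([from2, to2])

# Normalize both indices to UTC so the levels share one timezone.
df1 = pd.DataFrame([1, 2], index=levels_to_utc(index1))
df2 = pd.DataFrame([1, 2], index=levels_to_utc(index2))

result = df1.combine_first(df2)
level0 = result.index.get_level_values(0)
```

The trade-off is that the original timezone of each input is lost; if it matters, the levels can be converted back with `tz_convert` after combining.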

Behavior is correct for single indices:

In [7]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:
:tz1, tz2 = 'America/New_York', 'UTC'
:
:index1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)]
:index2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)]
:
:df1 = pd.DataFrame([1, 2], index=index1)
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [8]: df2.index
Out[8]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [9]: df1.index
Out[9]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [10]: result.index
Out[10]: 
DatetimeIndex(['2016-01-01 05:00:00+00:00', '2016-01-02 05:00:00+00:00',
               '2016-01-03 00:00:00+00:00', '2016-01-04 00:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

Output of pd.show_versions():

In [1]: import pandas as pd

In [2]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.6
pip: 8.1.1
setuptools: 20.3
Cython: 0.22
numpy: 1.9.2
scipy: 0.17.0
statsmodels: 0.6.1.post1
xarray: None
IPython: 3.1.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.4.3
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None
