issues with overlapping multi index intervals #27456

Open
mahdirajabi96 opened this issue Jul 18, 2019 · 13 comments
Labels
Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Interval (Interval data type); MultiIndex

Comments

@mahdirajabi96

mahdirajabi96 commented Jul 18, 2019

Scenario 1: single-level indexing, which works fine:

import pandas as pd  # pandas version 0.25.0, python version: 3.6.6
idx = pd.IntervalIndex.from_arrays([1, 3, 1, 2],
                                   [3, 4, 2, 4])
df = pd.DataFrame({'Value': [1, 2, 3, 4]}, index=idx)

which returns:

df =
        Value
(1, 3]      1
(3, 4]      2
(1, 2]      3
(2, 4]      4

query results:

df.loc[1.5] =
        Value
(1, 3]      1
(1, 2]      3

Scenario 2: Multi-level indexing:

idx1 = pd.MultiIndex.from_arrays([
    pd.Index(['label1','label1','label2','label2']),
    pd.IntervalIndex.from_arrays([1,3,1,2],
                             [3,4,2,4])
])
idx2 = pd.MultiIndex.from_arrays([
    pd.Index(['label1','label1','label2','label2']),
    pd.IntervalIndex.from_arrays([1,2,1,2],
                             [2,4,2,4])
])
df1 = pd.DataFrame({'Value':[1,2,3,4]},index=idx1) #with overlapping intervals 
df2 = pd.DataFrame({'Value':[1,2,3,4]},index=idx2) #without overlapping intervals

which returns:

df1 =
                 Value
label1  (1, 3]       1
label1  (3, 4]       2
label2  (1, 2]       3
label2  (2, 4]       4
df2 =
                 Value
label1  (1, 2]       1
label1  (2, 4]       2
label2  (1, 2]       3
label2  (2, 4]       4
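
The difference between the two indexes can be checked directly. This is a minimal sketch, assuming a pandas version that provides `IntervalIndex.is_overlapping` (added in 0.24); the names `level1`, `level2`, `overlaps1`, and `overlaps2` are illustrative and not from the report. It looks at the unique intervals stored in the second level of each MultiIndex, ignoring the label they are paired with.

```python
import pandas as pd

labels = pd.Index(['label1', 'label1', 'label2', 'label2'])
idx1 = pd.MultiIndex.from_arrays(
    [labels, pd.IntervalIndex.from_arrays([1, 3, 1, 2], [3, 4, 2, 4])])
idx2 = pd.MultiIndex.from_arrays(
    [labels, pd.IntervalIndex.from_arrays([1, 2, 1, 2], [2, 4, 2, 4])])

# levels[1] holds the unique intervals of the second level; wrapping it in
# IntervalIndex is a no-op if it already is one.
level1 = pd.IntervalIndex(idx1.levels[1])
level2 = pd.IntervalIndex(idx2.levels[1])

# is_overlapping is True when any two intervals share a point.
overlaps1 = level1.is_overlapping  # (1, 2] and (1, 3] share points
overlaps2 = level2.is_overlapping  # (1, 2] and (2, 4] do not
print(overlaps1, overlaps2)
```

Within each label the intervals of df1 do not overlap; the overlap only appears when the whole level is considered at once.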

Query method 1: works fine on both df1 and df2, but is slow

df1.Value.loc['label1'].loc[1.5]
1

Query method 2: works with df2 but fails on df1; it is about 10 times faster than query method 1

df2.Value.loc[('label1',1.5)]
1
df1.Value.loc[('label1',1.5)]

KeyError Traceback (most recent call last)
C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2889 try:
-> 2890 return self._engine.get_loc(key)
2891 except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1.5

During handling of the above exception, another exception occurred:

KeyError Traceback (most recent call last)
in <module>()
11 display(df)
12 print(df.loc['label1'].loc[1.5])
---> 13 print(df.loc[('label1',1.5)])

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1402 except (KeyError, IndexError, AttributeError):
1403 pass
-> 1404 return self._getitem_tuple(key)
1405 else:
1406 # we by definition only have the 0th axis

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
789 def _getitem_tuple(self, tup):
790 try:
--> 791 return self._getitem_lowerdim(tup)
792 except IndexingError:
793 pass

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
945 return section
946 # This is an elided recursive call to iloc/loc/etc'
--> 947 return getattr(section, self.name)[new_key]
948
949 raise IndexingError("not applicable")

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
1402 except (KeyError, IndexError, AttributeError):
1403 pass
-> 1404 return self._getitem_tuple(key)
1405 else:
1406 # we by definition only have the 0th axis

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
789 def _getitem_tuple(self, tup):
790 try:
--> 791 return self._getitem_lowerdim(tup)
792 except IndexingError:
793 pass

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
913 for i, key in enumerate(tup):
914 if is_label_like(key) or isinstance(key, tuple):
--> 915 section = self._getitem_axis(key, axis=i)
916
917 # we have yielded a scalar ?

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1823 # fall thru to straight lookup
1824 self._validate_key(key, axis)
-> 1825 return self._get_label(key, axis=axis)
1826
1827

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
155 raise IndexingError("no slices here, handle elsewhere")
156
--> 157 return self.obj._xs(label, axis=axis)
158
159 def _get_loc(self, key: int, axis: int):

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
3728
3729 if axis == 1:
-> 3730 return self[key]
3731
3732 self._consolidate_inplace()

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2973 if self.columns.nlevels > 1:
2974 return self._getitem_multilevel(key)
-> 2975 indexer = self.columns.get_loc(key)
2976 if is_integer(indexer):
2977 indexer = [indexer]

C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2890 return self._engine.get_loc(key)
2891 except KeyError:
-> 2892 return self._engine.get_loc(self._maybe_cast_indexer(key))
2893 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2894 if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1.5

@jreback
Contributor

jreback commented Jul 18, 2019

pls edit the top to include a fully reproducible example & version info

@mahdirajabi96
Author

I added the version info in the stackoverflow question. Please let me know if you need additional info. Thanks.

@jbrockmendel
Member

@mahdirajabi96 we've got over 3000 issues to triage; pointing to stackoverflow practically ensures this stays at the bottom of the list. pls follow @jreback's request

@mahdirajabi96
Author

mahdirajabi96 commented Jul 23, 2019

I am trying to query an interval index with a single number. If all the interval bounds are integers, it works, but if they are floats, it doesn't.

import pandas as pd
idx = pd.MultiIndex.from_arrays(
    [pd.Index(['FC','FC','FC','FC','OWNER','OWNER','OWNER','OWNER']),
     pd.Index(['RID1','RID1','RID2','RID2','RID1','RID1','RID2','RID2']),
     pd.IntervalIndex.from_arrays([0,1.2,10,11,0,1,10,11],
                                  [1.2,2,11,12,1,2,11,12])])
idx.names = ['Item','RID','MP']
df = pd.DataFrame({'Value':[1,2,3,4,5,6,7,8]})
df.index = idx
query_df = pd.DataFrame({
        'Item':['FC'  ,'OWNER','FC'  ,'OWNER','OWNER'],
        'RID' :['RID1','RID1' ,'RID1','RID2' ,'RID2' ],
        'MP'  :[0.2   ,1.5    ,1.6   ,11.1   ,10.9   ]})

idx = pd.MultiIndex.from_arrays([query_df.Item,query_df.RID,query_df.MP])
query_df.index = idx
query_df['Value'] = df.Value.loc[idx]  # raises the ValueError below

ValueError: setting an array element with a sequence.

Versions:

import sys
print(pd.__version__)
0.23.4
print(sys.version)
3.7.1 (default, Dec 10 2018, 22:54:23) [MSC v.1915 64 bit (AMD64)]

@mahdirajabi96
Author

This code would work if all the interval bounds in df were integers.

@jreback
Contributor

jreback commented Jul 23, 2019

@mahdirajabi96 pls try your example on 0.25, which substantially changed the handling of overlapping intervals.

@mahdirajabi96
Author

mahdirajabi96 commented Jul 23, 2019

The intervals shouldn't overlap within a specific Item and RID, and I don't think they do. The df output follows:

Out[24]: 
                     Value
Item  RID  MP             
FC    RID1 (0, 1.2]      1
           (1.2, 2]      2
      RID2 (10, 11]      3
           (11, 12]      4
OWNER RID1 (0, 1]        5
           (1, 2]        6
      RID2 (10, 11]      7
           (11, 12]      8

Route ID 1 (RID1) extends from milepost 0 to 2 and Route ID 2 (RID2) extends from milepost 10 to 12. Each route has two attributes called FC and OWNER.

@mahdirajabi96
Author

I tried 0.25 and got a different error:

InvalidIndexError: cannot handle overlapping indices; use IntervalIndex.get_indexer_non_unique

Again, it works fine if I define the df as follows:

Out[24]: 
                     Value
Item  RID  MP             
FC    RID1 (0, 1]        1
           (1, 2]        2
      RID2 (10, 11]      3
           (11, 12]      4
OWNER RID1 (0, 1]        5
           (1, 2]        6
      RID2 (10, 11]      7
           (11, 12]      8
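
The `InvalidIndexError` above points to `IntervalIndex.get_indexer_non_unique`. As a minimal sketch of what that method returns, here it is applied to the overlapping single-level index from Scenario 1, not to the MultiIndex itself. The query value `10.0` and the names `values`, `hits`, and `matched_values` are illustrative additions, not part of the report.

```python
import numpy as np
import pandas as pd

# Overlapping intervals from Scenario 1, with the matching 'Value' column.
intervals = pd.IntervalIndex.from_arrays([1, 3, 1, 2], [3, 4, 2, 4])
values = np.array([1, 2, 3, 4])

# One query can match several intervals, so the result is a flat array of
# positions (-1 where a query matched nothing) plus the positions of the
# unmatched queries within the target.
indexer, missing = intervals.get_indexer_non_unique([1.5, 10.0])

hits = indexer[indexer >= 0]   # positions of the intervals that contain 1.5
matched_values = values[hits]  # the corresponding entries of 'Value'
print(sorted(hits.tolist()), missing.tolist())
```

The order of the matched positions is not assumed here, which is why they are sorted before printing.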

@mahdirajabi96
Author

mahdirajabi96 commented Jul 23, 2019

One more comment: if I run my query as follows, it works regardless of whether the intervals use integer or float bounds:

query_df['Value'] = query_df.apply(lambda r:df.Value.loc[r.Item].loc[r.RID].loc[r.MP],axis=1)

The only problem is that it is 6 times slower. Because I analyze roadway data for DOTs, I work with millions of rows and need to run these queries very often, so the slowdown becomes a real issue.
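
One possible way to avoid the row-wise `apply` is sketched below, assuming the intervals do not overlap within any single (Item, RID) pair: group the queries by (Item, RID) and resolve each group's MP values in one vectorized `IntervalIndex.get_indexer` call. This is an untested-for-speed illustration rather than a fix for the underlying bug; the names `flat`, `ref_groups`, and `result` are hypothetical, and no timing claim is made.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [pd.Index(['FC', 'FC', 'FC', 'FC', 'OWNER', 'OWNER', 'OWNER', 'OWNER']),
     pd.Index(['RID1', 'RID1', 'RID2', 'RID2', 'RID1', 'RID1', 'RID2', 'RID2']),
     pd.IntervalIndex.from_arrays([0, 1.2, 10, 11, 0, 1, 10, 11],
                                  [1.2, 2, 11, 12, 1, 2, 11, 12])],
    names=['Item', 'RID', 'MP'])
df = pd.DataFrame({'Value': [1, 2, 3, 4, 5, 6, 7, 8]}, index=idx)

query_df = pd.DataFrame({
    'Item': ['FC', 'OWNER', 'FC', 'OWNER', 'OWNER'],
    'RID': ['RID1', 'RID1', 'RID1', 'RID2', 'RID2'],
    'MP': [0.2, 1.5, 1.6, 11.1, 10.9]})

# Flatten the reference table and split it once per (Item, RID) pair.
flat = df.reset_index()
ref_groups = {key: grp for key, grp in flat.groupby(['Item', 'RID'])}

result = pd.Series(np.nan, index=query_df.index)
for key, queries in query_df.groupby(['Item', 'RID']):
    ref = ref_groups.get(key)
    if ref is None:
        continue  # no intervals for this pair; leave NaN
    # Assumption: intervals within one pair do not overlap, so get_indexer
    # returns one position per query (-1 when no interval contains it).
    pos = pd.IntervalIndex(ref['MP']).get_indexer(queries['MP'].to_numpy())
    found = pos >= 0
    result.loc[queries.index[found]] = ref['Value'].to_numpy()[pos[found]]

query_df['Value'] = result
print(query_df)
```

If the intervals of a single pair did overlap, `get_indexer` would raise, and `get_indexer_non_unique` would be needed instead.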

jbrockmendel added the Indexing, Interval, and MultiIndex labels on Jul 29, 2019
@TomAugspurger
Contributor

@mahdirajabi96 can you update the original post to include the minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports

In particular, can you include a sample demonstrating your code working when the levels are float-dtype intervals rather than integer?

mahdirajabi96 changed the title from "issues with non-integer multi index intervals" to "issues with overlapping multi index intervals" on Aug 2, 2019
@mahdirajabi96
Author

> @mahdirajabi96 can you update the original post to include the minimal example? http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
>
> In particular, can you include a sample demonstrating your code working when the levels are float-dtype intervals rather than integer?

I updated the original post to include a minimal example. As indicated earlier, the problem is apparently not whether the bounds are float or integer; it is that the intervals overlap.

@mahdirajabi96
Author

Any updates on this ticket? Is anything else needed from me? Just trying to make sure that it will be addressed. Thank you all.

@jreback
Contributor

jreback commented Aug 16, 2019

@mahdirajabi96 if you can investigate, that would help this along

5 participants