Partial indexing of a Panel #8906

Closed
Dimchord opened this issue Nov 27, 2014 · 11 comments
Labels
API Design · Indexing (Related to indexing on series/frames, not to indexes themselves) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@Dimchord

See also: http://stackoverflow.com/questions/26736745/indexing-a-pandas-panel-counterintuitive-or-a-bug

These are actually two related(?) issues.
The first is that the DataFrame is transposed, when you index the major_indexer or minor_indexer:

from pandas import Panel
from numpy import arange
p = Panel(arange(24).reshape(2,3,4))
p.shape
Out[4]: (2, 3, 4)
p.iloc[0].shape # original order
Out[5]: (3, 4)
p.iloc[:,0].shape # I would expect (2,4), but it is transposed
Out[6]: (4, 2)
p.iloc[:,:,0].shape # also transposed
Out[7]: (3, 2)
p.iloc[:,0,:].shape # transposed (same as [6])
Out[8]: (4, 2)

This may be a design choice, but it seems counterintuitive to me and it is not in line with the way numpy indexing works.
On a related note, I would expect the following two commands to be equivalent:

p.iloc[1:,0,:].shape # Slicing item_indexer, then transpose
Out[9]: (4, 1)
p.iloc[1:,0].shape # Expected to get the same as [9], but slicing minor_indexer instead????
Out[10]: (3, 2)
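
For reference, a minimal sketch of the numpy behaviour the report compares against, using the same 2 x 3 x 4 layout. The array name `a` and the `shapes` dictionary are illustrative and not part of the original report:

```python
import numpy as np

# Same 2 x 3 x 4 layout as the Panel in the report, as a plain ndarray.
a = np.arange(24).reshape(2, 3, 4)

# An integer index drops its axis; the remaining axes keep their original order.
shapes = {
    "a[0]": a[0].shape,                # (3, 4)
    "a[:, 0]": a[:, 0].shape,          # (2, 4)
    "a[:, :, 0]": a[:, :, 0].shape,    # (2, 3)
    "a[1:, 0, :]": a[1:, 0, :].shape,  # (1, 4)
    "a[1:, 0]": a[1:, 0].shape,        # (1, 4), trailing axes are implied
}

for expr, shape in shapes.items():
    print(expr, shape)
```

In numpy the last two expressions are equivalent, which is the equivalence the report expected from Panel as well.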

INSTALLED VERSIONS

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
byteorder: little
LC_ALL: None
LANG: nl_NL

pandas: 0.15.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.2.0
sphinx: 1.2.2
patsy: 0.2.1
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

@max-sixty
Contributor

I've just hit this too, on 0.16.2. Is this intended? Is it related to #11369?

In [8]: panel = pd.Panel(pd.np.random.rand(2,3,4))

In [10]: panel.shape
Out[10]: (2, 3, 4)

In [11]: panel[:, :, 0].shape
Out[11]: (3, 2)

In numpy:

In [15]: npanel=pd.np.random.rand(2,3,4)

In [16]: npanel.shape
Out[16]: (2, 3, 4)

In [18]: npanel[:,:,0].shape
Out[18]: (2, 3)

CC @jreback, as this seemed like an abandoned issue

@jreback
Contributor

jreback commented Oct 30, 2015

yes this has always been like this. DataFrame is 'reversed' in that the columns axis (1) is the 'primary' (we call it the info) axis. This translates to indexing where a Panel is conceptually a dict of DataFrames. Not sure what, if anything, we can do about this, as it would break practically all code.
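
The "dict of DataFrames" picture can be sketched roughly as follows. This is only an analogy built from plain DataFrames, not Panel's actual implementation, and the names `frames`, `fixed_minor` and `fixed_major` are made up for illustration:

```python
import numpy as np
import pandas as pd

data = np.arange(24).reshape(2, 3, 4)

# Model the Panel as a dict of DataFrames keyed by item (an analogy only).
frames = {item: pd.DataFrame(data[item]) for item in range(data.shape[0])}

# Fix one minor label (a column) in every frame; each item becomes a column.
fixed_minor = pd.DataFrame({item: frame[0] for item, frame in frames.items()})

# Fix one major label (a row) in every frame; again each item becomes a column.
fixed_major = pd.DataFrame({item: frame.iloc[0] for item, frame in frames.items()})

print(fixed_minor.shape)  # (3, 2)
print(fixed_major.shape)  # (4, 2)
```

Under this reading the items end up on the columns axis of the result, which matches the (3, 2) and (4, 2) shapes in the original report.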

@jreback jreback added Indexing Related to indexing on series/frames, not to indexes themselves Reshaping Concat, Merge/Join, Stack/Unstack, Explode API Design Multi Dimensional labels Oct 30, 2015
@max-sixty
Contributor

This is a bigger issue than one we're going to solve here. But regardless a couple of points:

Panels generally
  • I have been working with Panels a lot over the past couple of weeks and - from my humble user perspective - it has felt pretty painful. I know it's a difficult challenge to go from 2 -> n dimensions. DataFrames are so beautiful, and Panels seem like an alpha of their functionality at a different level of quality (i.e. a 'preview' with low documentation & testing, rather than a fully functional subset).
  • FWIW, my basic approach now is to use pandas for the initial alignment, and then use numpy functions only. I wonder as pandas moves to a 1.0 release, whether Panel needs to either be given a lot of love or deprecated to 'experimental' or completely moved to something like xray infrastructure for >2D along with the current options for MultiIndex.
Panel indexing
  • I imagine there's something I don't understand, although I don't get why we have this design.
  • My understanding is that a DataFrame has row x column dimensions which are consistent across the indexers, and then there are some 'convenience' methods (such as df['a'] which reference the info_axis / columns and df[2:5] which reference the rows). In production, using the indexers is rigorous and predictable.
  • I would have thought a consistent design could exist for Panels - while there might be convenience methods, standard indexers would apply to items x rows (/ major) x columns (/ minor), and selecting a slice of one would collapse the others, in order. I had thought the info_axis & stat_axis were for convenience only, not affecting the core indexing operations (but sounds like I'm wrong).
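
A small sketch of the DataFrame conventions described in the list above, assuming a toy frame; the labels and variable names are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=list("wxyz"),
    columns=["a", "b", "c"],
)

# Convenience access: a label selects a column, a slice selects rows.
col = df["a"]    # Series of length 4
rows = df[1:3]   # DataFrame with rows "x" and "y"

# Explicit indexers always read as (rows, columns).
cell_block = df.iloc[1:3, 0:2]       # 2 rows x 2 columns, by position
by_label = df.loc["x":"y", "a":"b"]  # label slices include both endpoints

print(col.shape, rows.shape, cell_block.shape, by_label.shape)
```

The two explicit indexers select the same block here, and both keep the row axis first, which is the consistency the comment expected to carry over to three dimensions.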

xray mostly has the design I expected, I think, although does remember the collapsed dimension:

In [22]: panel_x=xray.DataArray(pd.np.random.rand(4,3,2))

In [24]: panel_x
Out[24]: 
<xray.DataArray (dim_0: 4, dim_1: 3, dim_2: 2)>
array([[[ 0.81499518,  0.73722039],
...
        [ 0.21864764,  0.93710684]]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1

In [25]: panel_x.loc[:,0,:]
Out[25]: 
<xray.DataArray (dim_0: 4, dim_2: 2)>
array([[ 0.81499518,  0.73722039],
       [ 0.41809174,  0.28529916],
       [ 0.82198192,  0.14365383],
       [ 0.55948113,  0.24809068]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
    dim_1    int64 0
  * dim_2    (dim_2) int64 0 1

Relevant xref: #9595, #10000
CC @shoyer

@jreback
Contributor

jreback commented Oct 30, 2015

@MaximilianR

Well, @shoyer and I had some discussions w.r.t. essentially making .to_panel() simply return a DataArray directly (then you would work with it), and deprecating Panel.

That's an option; more closely aligns pandas and x-ray.

However, I think there is a nice use case for a dense Panel, if you allow that x-ray is more 'geared' towards sparse-type nd-arrays (of course it has dense support); that is more its primary use case.

I happen to have used Panels quite a lot (well, in the past), where I would have things like:

fields x time-axis x tickers, where the pandas model makes a lot of sense.

So maybe you can elaborate where you think pandas is lacking (in docs/tests/etc). Pretty much everything is there. So aside from the indexing conventions, not sure what issues there are.

@max-sixty
Contributor

Here are a couple of issues I've had in addition to the above; I can provide more on these / others if helpful:

  • Very standard functions such as multiply that exist on Panels, when other is a different dimension. Without going to numpy, this is very slow as it iterates through each series combination. SO question here. I just had a go with xray and it seems decent:
In [56]: x
Out[56]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0,  1,  2,  3],
        [ 4,  5,  6,  7],
        [ 8,  9, 10, 11]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * dim_0    (dim_0) int64 0 1
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3

In [57]: x * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
Out[57]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ 0,  0,  0,  0],
        [ 0,  0,  0,  0],
        [ 0,  0,  0,  0]],

       [[12, 13, 14, 15],
        [16, 17, 18, 19],
        [20, 21, 22, 23]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1

In [58]: x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-58-18d40558bcd9> in <module>()
----> 1 x.to_pandas() * pd.np.asarray([0,1])[:, pd.np.newaxis, pd.np.newaxis]

/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/pandas/core/ops.py in f(self, other)
   1050             raise ValueError('Simple arithmetic with %s can only be '
   1051                              'done with scalar values' %
-> 1052                              self._constructor.__name__)
   1053 
   1054         return self._combine(other, op)

ValueError: Simple arithmetic with Panel can only be done with scalar values
  • Non-standard functions such as percentile that don't exist on a Panel. I ended up using np.nanpercentile here; the alternative was apply over series combinations, which was extremely slow. (I tried applying the DataFrame percentile over two of the axes and then reorganizing the axes, which I think was a bit faster, but awkward).
  • Selecting, as in REGR: Panel.where #11451.
    I ended up using np.where:
panel.loc[:, :, :] = pd.np.where(
        panel.notnull(),
        panel,
        fallback_df[:, :, pd.np.newaxis]
    )
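
The numpy fallbacks mentioned in the list above can be sketched on a plain array as follows. This is a minimal illustration, assuming `values` stands in for the Panel's underlying items x major x minor array; `weights` and `fallback` are hypothetical inputs:

```python
import numpy as np

# Stand-in for the Panel's data: items x major x minor (shapes are illustrative).
values = np.arange(24, dtype=float).reshape(2, 3, 4)
values[0, 1, 2] = np.nan

# 1. Multiply by one weight per item by broadcasting over the other two axes.
weights = np.asarray([0.0, 1.0])
scaled = values * weights[:, np.newaxis, np.newaxis]

# 2. Percentile along the minor axis, ignoring NaNs.
median_minor = np.nanpercentile(values, 50, axis=2)

# 3. Keep non-missing cells, otherwise take a 2-D fallback (items x major)
#    broadcast over the minor axis.
fallback = np.full((2, 3), -1.0)
filled = np.where(~np.isnan(values), values, fallback[:, :, np.newaxis])

print(scaled.shape, median_minor.shape, filled.shape)
```

All three operations are vectorised over the whole array, so none of them iterates over individual series.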

xray seems decent at this too:

In [61]: x.where(x>5)
Out[61]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan,  nan,  nan,  nan],
        [ nan,  nan,   6.,   7.],
        [  8.,   9.,  10.,  11.]],

       [[ 12.,  13.,  14.,  15.],
        [ 16.,  17.,  18.,  19.],
        [ 20.,  21.,  22.,  23.]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1

In [62]: x.where(x[0]>5)
Out[62]: 
<xray.DataArray (dim_0: 2, dim_1: 3, dim_2: 4)>
array([[[ nan,  nan,  nan,  nan],
        [ nan,  nan,   6.,   7.],
        [  8.,   9.,  10.,  11.]],

       [[ nan,  nan,  nan,  nan],
        [ nan,  nan,  18.,  19.],
        [ 20.,  21.,  22.,  23.]]])
Coordinates:
  * dim_1    (dim_1) int64 0 1 2
  * dim_2    (dim_2) int64 0 1 2 3
  * dim_0    (dim_0) int64 0 1

Hope this is helpful - thanks for your engagement @jreback

@shoyer
Member

shoyer commented Oct 31, 2015

Yes, these sorts of issues are exactly why we wrote xray in the first place. The pandas API and internals weren't really designed with n-dimensional data in mind, which makes panels and nd-panel quite awkward.

xray mostly has the design I expected, I think, although does remember the collapsed dimension:

The collapsed dimension is essentially just metadata and can be safely ignored. I think @jreback was a little confused here, but scalar coordinates are not used for any sort of alignment.

IMO the xray.DataArray is almost strictly more useful than panels. The main feature gap is that we currently don't support MultiIndex in xray, but hopefully that will change soon.

@jreback
Contributor

jreback commented Dec 30, 2015

@MaximilianR

since I understand you recently switched from using Panels to x-ray, can you elaborate on how it went? good-bad-ugly?

if we deprecate Panel entirely and make to_panel return an x-ray object, what are the upsides / downsides?

@max-sixty
Contributor

Sure - I'll give a short synthesis, and happy to answer any follow up questions you have.

Good:

  • Clear, explicit API, very few surprises. Indexing in particular is very reliable. Stark contrast to Panel!
  • Labeled dimensions, and the benefits that come with them - .sel, .isel (which becomes more important for higher dimensional datasets)
  • Clear difference between a DataArray and Dataset, independent of dimensionality (the ability to have DataArrays aligned on different dimensions is awesome)

Bad - minor, and very specific to my experience:

  • Index issues - for indexes whose .values aren't the same as the index (PeriodIndex, maybe tz?). PeriodIndex is very usable though given some recent minor changes. No MultiIndexes. @shoyer will have a better view here
  • Smaller API - greater need to use numpy / bottleneck / numbagg functions. For example, .where doesn't take an other argument
  • A bit less magic - for example, you can't slice a date index with a string ['2015']
  • I think this is a big plus for XRay generally, but given that DataArrays can only be a single type, that would have to be handled in .to_panel

Overall it's a beautiful library, both for exploratory work and for production. I'm very excited to be using it, and grateful to @shoyer for creating it.

I don't have a strong view on whether we should make to_panel return an XRay DataArray, but I do think we should choose and articulate a vision & roadmap on Panel vs XRay - the time the community spends on improving Panel around the edges is a waste IMHO, and it's the role of the maintainers to ensure that contributors know whether they're working on sustainable products.

Let me know if I can help beyond this at all,
Max

@shoyer
Member

shoyer commented Jan 4, 2016

The good news is that almost all of @MaximilianR's issues should be fixable with a bit more work -- there are no fundamental design issues. For example, I just made a PR adding MultiIndex support (pydata/xarray#702).

for example, you can't slice a date index with a string ['2015']

Could you share an example where this fails? There may be a bug here -- we've had support for string indexing of datetime indexes since almost the beginning: http://xray.readthedocs.org/en/stable/time-series.html#datetime-indexing

@max-sixty
Contributor

That should read PeriodIndex:

In [51]: ds=xray.Dataset(coords={'date':pd.period_range(periods=10,start='2000')})

In [52]: ds['d']=('date', pd.np.random.rand(10))

In [53]: ds.sel(date='2000')
Out[53]: 
<xray.Dataset>
Dimensions:  ()
Coordinates:
    date     object 2000-01-01
Data variables:
    d        float64 0.8965

Confirming it works for DatetimeIndex:

In [54]: ds=xray.Dataset(coords={'date':pd.date_range(periods=10,start='2000')})

In [55]: ds['d']=('date', pd.np.random.rand(10))

In [56]: ds.sel(date='2000')
Out[56]: 
<xray.Dataset>
Dimensions:  (date: 10)
Coordinates:
  * date     (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
    d        (date) float64 0.09303 0.5456 0.4934 0.08438 0.1854 0.2823 ...
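
For comparison, a minimal sketch (not from the thread) of the pandas "magic" referred to earlier: with plain pandas Series, a year string is resolved as a partial-date slice for both index types. The series names are illustrative, and the daily frequency is spelled out as an assumption rather than relying on defaults:

```python
import numpy as np
import pandas as pd

values = np.arange(10)

# Daily frequency is set explicitly here; the sessions above relied on defaults.
by_period = pd.Series(
    values, index=pd.period_range(start="2000", periods=10, freq="D")
)
by_datetime = pd.Series(
    values, index=pd.date_range(start="2000", periods=10, freq="D")
)

# A year string is treated as a partial-date slice for both index types.
print(len(by_period.loc["2000"]))
print(len(by_datetime.loc["2000"]))
```

Both selections should return every row that falls in the year 2000, rather than the single scalar match shown for the PeriodIndex case above.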

@jreback
Contributor

jreback commented Jul 10, 2017

closing as Panels are deprecated

@jreback jreback closed this as completed Jul 10, 2017
@jreback jreback modified the milestone: won't fix Jul 11, 2017
@TomAugspurger TomAugspurger modified the milestones: won't fix, No action Jul 6, 2018

6 participants