-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Better handling of MultiIndex with Excel #5254
Comments
I can sketch something out about how to handle it or even just write it as a preliminary PR. To get started could you give me a couple of small DataFrame examples with MultiIndexes that the Excel writers would be required to handle. Maybe something a little simpler that the examples in the docs to start begin. |
In [3]: df = pd.DataFrame(zip([0] * 6, [1] * 6, [2] * 6, [3] * 6, [4] * 6), columns=pd.MultiIndex.from_arrays([['Foo', 'Foo', 'Bar', 'Bar', 'Baz'], ['A', 'B', 'C', 'D', 'E']])
)
In [4]: df
Out[4]:
Foo Bar Baz
A B C D E
0 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4 Here's what the Excel File should look like: There's something in core/format that I believe handles this (or maybe it's on MI), so I think it should be trivial to convert an MI into a set of spans. (which I think is what the excel writer needs anyways). |
I'm less clear on MI index (as opposed to columns) and I'm inclined to not mess with that for now. |
I think it's as simple as passing sparsify=True to it's formatter and looking for empty cols. |
Example 2: In [7]: df = pd.DataFrame([list('abcde'), list('fghij'), list('klmno')])
In [8]: df
Out[8]:
0 1 2 3 4
0 a b c d e
1 f g h i j
2 k l m n o
In [11]: df.columns = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'C', 'C'], ['D', 'E', 'E', 'H', 'I'], ['J', 'L', 'M', 'N', 'N']])
In [12]: df
Out[12]:
A C
D E H I
J L M N N
0 a b c d e
1 f g h i j
2 k l m n o |
oops, need to make that last 3 level |
@cancan101 yeah, that's the reverse of this, still useful tho! |
I can have a look at that. Probably not before Sunday though. |
you guys going to try to throw this in? (no problem with it, esp if its experimental...so can even do after rc1)... |
I really want this, even if it means stepping through code for hours. Especially if we can get it on the reading side. |
haha! np |
I had a preliminary look at the Excel writer merge code. The merge mechanisms in the Excel writers will probably work without modification if the cells ranges for the merges are identified correctly. However, that part is completely missing (you probably already knew that, I'm just catching up). Of the existing writers the |
Feels like the easiest way would be to modify the sparsify function on |
pushing to 0.14 |
I pushed some initial code for this to the branch above. It is WIP and the All I've done so far is to uncomment and fix some of the existing code but it appears to be working. Using the above dataframes as an example: import pandas as pd
df = pd.DataFrame(zip([0] * 6, [1] * 6, [2] * 6, [3] * 6, [4] * 6),
columns=pd.MultiIndex.from_arrays([
['Foo', 'Foo', 'Bar', 'Bar', 'Baz'],
['A', 'B', 'C', 'D', 'E']]))
df.to_excel('merge1.xls')
df.to_excel('merge2.xlsx')
df.to_excel('merge3.xlsx', engine='xlsxwriter') The output from the The merged ranges aren't centred but that is easily fixable. The output is the same for For the second example: import pandas as pd
df = pd.DataFrame([list('abcde'), list('fghij'), list('klmno')])
df.columns = pd.MultiIndex.from_arrays([
['A', 'A', 'A', 'C', 'C'],
['D', 'E', 'E', 'H', 'I'],
['J', 'L', 'M', 'N', 'N']])
df.to_excel('merge4.xls')
df.to_excel('merge5.xlsx')
df.to_excel('merge6.xlsx', engine='xlsxwriter') Output: The Openpyxl output: Anyway, some progress. |
That looks really great! The hemstring test has to do with round-tripping MI. Would you be able to put your changes under a keyword argument and then we could work on reading back files like that separately? Also, if it's not working with openpyxl, I guess we could make it engine dependent and warn if you try to do it with openpyxl. |
I fixed two of the issues mentioned above:
I'll have a look at the hemstring test failure next. |
I could use your input on the failing Currently in the test on the
This is read back as follow and asserted to be true in the comparisons:
However, after my changes the following is read back (it is what is actually in the Excel file as well):
Which fails as follow:
I'm not sure if this is a genuine failure or if the test is based around the existing handling of MI indices and isn't going to work with the new MI handling. |
What you have is the behavior that it ought to be, I'm 👍 on changing the test (esp b/c we already said we were defaulting tupleize_columns to produce MIs in 0.13. But you should add it as a note to the doc/source/v0.13.0.txt section as well as release.rst that you've improved MI handling and MI round-tripping. @jreback and others - are you okay with this positive change (that's backwards incompatible only in that the existing behavior was poor)? Btw - does this handle hierarchical rows correctly too? |
yep I agree the new behavior is really the correct behavior |
Also @jmcnamara - thanks so much for tackling this! Looks like you've resolved a bunch of issues with Excel IO with hierarchical data. |
Not currently. Is this a suitable example to test with?: import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo']),
np.array(['one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(6, 4), index=arrays)
>>> df
0 1 2 3
bar one 0.405417 -0.045123 0.539755 0.102764
two -1.431279 -0.903300 -1.018812 -0.064565
baz one 0.084649 -0.497144 0.463838 -0.814132
two 0.790201 -2.513413 2.278565 -1.208934
foo one 0.246844 0.257885 0.919118 -1.678567
two 1.238596 0.889024 -1.078195 0.427454 If you have other examples to test with could you add them to the thread. The starting examples above were very useful. |
Yeah, that's fine. I'd just tweak slightly so you have different size chunks and different dtypes, i.e.: arrays = [np.array(['bar', 'bar', 'bar', 'baz', 'baz', 'foo']),
np.array([1, 2, 1, 2, 1, 2])] |
The other thing we need to handle (eventually) are index names and And here are some roundtrip examples that could be useful to explicitly handle, just to make sure you cover what you can in terms of naming. >>> df = DataFrame(list(zip(range(3), range(3), range(3), range(3))))
>>> cols = MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 1, 1, 2]], names=['cat1', 'cat2'])
>>> df1 = df.copy()
>>> df1.columns = cols
>>> df1 # roundtrip this
cat1 a b
cat2 1 1 1 2
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2 Then same thing with rows
And finally both
Pandas will output a blank line, so it's not ambiguous to roundtrip, but some |
I made some progress with the handling of hierarchical rows. For this example: import pandas as pd
import numpy as np
arrays = [np.array(['bar', 'bar', 'bar', 'baz', 'baz', 'foo']),
np.array(['one', 'two', 'three', 'one', 'two', 'one'])]
df = pd.DataFrame(np.random.randn(6, 4), index=arrays)
df.to_excel('merge7.xls')
df.to_excel('merge8.xlsx')
df.to_excel('merge9.xlsx', engine='xlsxwriter') The output looks like this for all three Excel writers: I'll start writing some tests next and then look at the index name handling. |
@jtratner @jreback A question. With the Excel MultiIndex prototype I have some failing test cases like this:
Which gives:
The equivalent Excel file is: I'm not sure how to handle this through the current interface but I'm working around it like this:
Does this seem okay? Am I missing something more obvious. |
@jreback will know - we handle the same thing in csv. If not I can look it up tomorrow. Where are you in terms of this right now? Could you put the writer functionality under a keyword argument so we could put it in 0.13? [i.e., doesn't do it by default but with kwarg will output nicely] Or do you want to wait for 0.14? |
you have to specify header=[0,1] in that case |
In the Excel Digging into the other parsers I see that there is usually a I have another question. :-) For the MI I'm using the import pandas as pd
from pandas.core.format import _get_level_lengths
cols = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 1, 1, 2]], names=['cat1', 'cat2'])
levels = cols.format(sparsify=True, adjoin=False, names=False)
level_lengths = _get_level_lengths(levels)
# Then do something with the spans and values:
for lnum, (spans, values) in enumerate(zip(level_lengths, levels)):
print spans, values
# Output:
{0: 2, 2: 2} (u'a', '', u'b', '')
{0: 1, 1: 1, 2: 1, 3: 1} ('1', '1', '1', '2') This works fine but there is a problem in that the >>> cols
MultiIndex(levels=[[u'a', u'b'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 0, 0, 1]],
names=[u'cat1', u'cat2']) Is there a clean or simple way to access the the non-stringifed non-compact levels? |
@jmcnamara can you tell me what you want to get out? One thing you can do is just: for lev, lab in zip(mi.levels, mi.labels):
vals = lev.take(lab) Labels are the integer positions of the values. There might be a faster way. Example of take for reference: In [2]: import numpy as np
In [3]: arr = np.array([0, 1, 2, 3, 1, 1])
In [4]: arr2 = np.array(['a', 'b', 'c', 'd'])
In [6]: arr2.take(arr)
Out[6]:
array(['a', 'b', 'c', 'd', 'b', 'b'],
dtype='|S1') |
What we need is a way to get a set of values like: (value, span) (instead of the spaces), which should be a small modification of sparsify. That would make it dead simple to get the length of the merged cell. |
I have it mainly working for all the combinations that you listed above. I'm also starting to understand what is going on. :-) It will probably take me another week to finish this. I don't currently have it under a keyword arg but I can do that. When is the target release date for 0.13? |
@jmcnamara all I mean is: if you can put it under a keyword, we can release it as experimental for 0.13. If you can't, then needs to go to 0.14. Check out |
0.13 is supposed to be this week / next few days. |
@jtratner Thanks. That looks like what I need. P.S., you guys are quick. This almost feels like IRC sometimes. |
We were using hipchat for a bit - might be interesting to use IRC - GH's interface is really easy to use like chat, dunno. |
Here is a first pass that meets the criteria laid out above and still can be backward compatible with the existing version (it isn't backward compatible yet but I'll add the keyword option discussed above next). Here is an example based on @jtratner code above: import pandas as pd
df = pd.DataFrame(list(zip(range(3), range(3), range(3), range(3))))
cols = pd.MultiIndex.from_arrays([['A', 'A', 'B', 'B'],
[1, 1, 1, 2],
['G', 'H', 'I', 'J']],
names=['Cat1', 'Cat2', 'Cat3'])
df1 = df.copy()
df1.columns = cols
df2 = df.copy()
ind = pd.MultiIndex.from_arrays([['X', 'Y', 'Z'],
['One', 'One', 'Two']],
names=['Ind1', 'Ind2'])
df2.index = ind
df3 = df2.copy()
df3.columns = cols
print "==========================="
print df1
print "==========================="
print df2
print "==========================="
print df3
print "==========================="
df1.to_excel('merge17.xlsx', engine='xlsxwriter')
df2.to_excel('merge18.xlsx', engine='xlsxwriter')
df3.to_excel('merge19.xlsx', engine='xlsxwriter') The text output is:
The Excel output is: Note, in the last example the |
I agree - looks much better. @jreback you okay with including this as experimental in 0.13 if it's done? |
that's fine do we even write a multi index column now? |
It ends up in this weird format where sublevels are represented as dots...doesn't actually read in correclty, just ends up with dots. So something like this:
Ends up as the columns: |
Literal |
then @jmcnamara I think is a fix then ? what do u need a kw arg for? |
I don't think it can read in the format that it puts out... |
if it does, then doesn't need a kwarg |
Possibly, but it matches the repl (above) and html output:
Is it due to something like a |
Yes, it is really a fix since there isn't real MI support currently, just the It would be easier to just add this as a fix but there is a possibility that it would break existing code that was writing and reading MI with the current @jtratner's idea was to put the new One other issue here is round-tripping. For simple cases like in the test suite it is possible to roundtrip with the current Excel reader and the new And finally, in summary, even though I would love this functionality to go out as the default in 0.13 I think that there is too much of a risk if you intend to ship in the next few days. So I think having it in under an optional keyword is probably the way to go. |
We'd just remove the And yes once you get to multiple name levels, harder to round trip. |
I'm going to put something together this weekend, even if I can only make it work with one engine. The current MI output for columns is too hard to use. Would be really nice to round-trip it, but I'm less concerned about that.
@jmcnamara if you have any tips they'd certainly be appreciated - I may only do it for Xlsxwriter this weekend. (existing MI summary form can judge where to put whitespace to show that one index covers multiple values. Just need to do the same thing but rather than blank cells, should be a merged cell).
@cancan101 were you interested in this too?
The text was updated successfully, but these errors were encountered: