Regression: pandas 0.10 fails to read an excel file that pandas 0.9 can read #2651

Closed
lexual opened this issue Jan 7, 2013 · 7 comments
Labels
Bug IO Data IO issues that don't fit into a more specific label

@lexual
Contributor

lexual commented Jan 7, 2013

I have an Excel file that reads fine in pandas 0.9 but fails to read in pandas 0.10.

  import sys
  import pandas as pd

  fname = sys.argv[1]
  xlsx = pd.ExcelFile(fname)
  sheet = xlsx.sheet_names[0]

  # this works in pandas 0.9, fails in pandas 0.10.0
  df = xlsx.parse(sheet, index_col=[1, 2])

Triggers this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/IPython/utils/py3compat.pyc in execfile(fname, *where)
    176             else:
    177                 filename = fname
--> 178             __builtin__.execfile(filename, *where)

/tmp/X/pandas_error_test.py in <module>()
     15 
     16 if __name__ == '__main__':
---> 17     main()

/tmp/X/pandas_error_test.py in main(files)
     11 
     12     # this works in pandas 0.9, fails in pandas 0.10.0
---> 13     df = xlsx.parse(sheet, index_col=[1, 2])
     14 
     15 

/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in parse(self, sheetname, header, skiprows, skip_footer, index_col, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize, **kwds)
   1866                                      thousands=thousands,
   1867                                      chunksize=chunksize,
-> 1868                                      skip_footer=skip_footer)
   1869 
   1870     def _should_parse(self, i, parse_cols):

/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _parse_xlsx(self, sheetname, header, skiprows, skip_footer, index_col, has_index_names, parse_cols, parse_dates, date_parser, na_values, thousands, chunksize)
   1934                             chunksize=chunksize)
   1935 
-> 1936         return parser.read()
   1937 
   1938     def _parse_xls(self, sheetname, header=0, skiprows=None,

/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, nrows)
    622             #     self._engine.set_error_bad_lines(False)
    623 
--> 624         ret = self._engine.read(nrows)
    625 
    626         if self.options.get('as_recarray'):

/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in read(self, rows)
   1245             content = content[1:]
   1246 
-> 1247         alldata = self._rows_to_cols(content)
   1248         data = self._exclude_implicit_index(alldata)
   1249 

/home/vagrant/.virtualenvs/foo/local/lib/python2.7/site-packages/pandas/io/parsers.pyc in _rows_to_cols(self, content)
   1461             msg = ('Expected %d fields in line %d, saw %d' %
   1462                    (col_len, row_num + 1, zip_len))
-> 1463             raise ValueError(msg)
   1464 
   1465         return zipped_content

ValueError: Expected 10 fields in line 2, saw 11
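
For readers unfamiliar with this message: it is raised when a data row holds more cells than the parser expects. A minimal sketch of that kind of row-width check is given here. The function `check_row_widths` and the sample rows are hypothetical illustrations, not pandas's actual implementation; only the message format is taken from the traceback.

```python
def check_row_widths(rows, expected_fields):
    """Raise ValueError for the first row whose width differs from expected_fields."""
    for row_num, row in enumerate(rows):
        if len(row) != expected_fields:
            raise ValueError(
                "Expected %d fields in line %d, saw %d"
                % (expected_fields, row_num + 1, len(row))
            )


# Hypothetical sheet contents: 10 header columns, second data row has 11 cells.
header = ["c%d" % i for i in range(10)]
rows = [
    list(range(10)),  # line 1: matches the 10 header columns
    list(range(11)),  # line 2: one extra cell
]

try:
    check_row_widths(rows, len(header))
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

A stray value in a cell to the right of the table is one way a single row can end up wider than the header, so that is worth checking in the spreadsheet itself.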

I suspect a similar error happened when building the docs on the pandas site. See:

http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#v0-10-0-december-17-2012

Search that page for "expected" and you'll see that a similar error is being triggered. See the screenshot for details.
[Screenshot: Screen Shot 2013-01-07 at 2 10 20 PM]

I haven't attached the file, as it contains a client's data.

@ghost

ghost commented Jan 8, 2013

If you can provide an Excel file that reproduces the problem, I'll take a look.
Maybe you can replace the data with random data, or perhaps narrow it
down to a couple of offending lines?
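
One way to act on this suggestion is to scrub the cell values while keeping each row's width and its blank cells, since the error is about field counts rather than cell contents. The sketch below works on plain lists of rows; `scrub_rows` and the sample data are hypothetical, and reading the rows from the workbook and writing them back is left to whichever Excel library is at hand.

```python
import random
import string


def scrub_rows(rows, seed=0):
    """Replace cell values with random data, keeping row widths and blank cells."""
    rng = random.Random(seed)
    scrubbed = []
    for row in rows:
        new_row = []
        for cell in row:
            if cell is None or cell == "":
                new_row.append(cell)  # keep blanks exactly where they were
            elif isinstance(cell, bool):
                new_row.append(rng.choice([True, False]))
            elif isinstance(cell, (int, float)):
                new_row.append(round(rng.uniform(0, 1000), 2))
            else:
                # Any other value (text, dates, ...) becomes 8 random letters.
                new_row.append(
                    "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
                )
        scrubbed.append(new_row)
    return scrubbed


# Made-up rows standing in for a sheet; the last row has one extra cell.
original = [
    ["Client", "Region", "Total"],
    ["Acme Pty Ltd", "", 1250.0],
    ["Initech", "West", 310, "stray note"],
]
scrubbed = scrub_rows(original)
```

Because the shape of the sheet is preserved, a file rebuilt from the scrubbed rows should still show the same mismatch in field counts, assuming the problem does not depend on the values themselves.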

@ghost ghost assigned wesm Jan 21, 2013
wesm added a commit that referenced this issue Jan 21, 2013
@wesm
Member

wesm commented Jan 21, 2013

Pushing this issue to 0.10.2 until a reproduction is found. The docs issue was just a bug in the code.

@lexual
Contributor Author

lexual commented Jan 21, 2013

Is it possible to send a file that reproduces the problem directly to one of the devs, offline? The file I have contains a client's data which, while not particularly sensitive, would be best kept out of the public domain.

@wesm
Member

wesm commented Jan 21, 2013

No problem. Send it to my e-mail address (found in my profile).

@wesm
Member

wesm commented Jan 22, 2013

That it parsed before was basically luck and not by design. I can't fix this in time for 0.10.1, if it can be fixed at all.

yarikoptic added a commit to neurodebian/pandas that referenced this issue Jan 23, 2013
Version 0.10.1

* tag 'v0.10.1': (195 commits)
  RLS: set released to true
  RLS: Version 0.10.1
  TST: skip problematic xlrd test
  Merging in MySQL support pandas-dev#2482
  Revert "Merging in MySQL support pandas-dev#2482"
  BUG: don't let np.prod overflow int64
  RLS: note changed return type in DatetimeIndex.unique
  RLS: more what's new for 0.10.1
  RLS: some what's new for 0.10.1
  API: restore inplace=TRue returns self, add FutureWarnings. re pandas-dev#1893
  Merging in MySQL support pandas-dev#2482
  BUG: fix python 3 dtype issue
  DOC: fix what's new 0.10 doc bug re pandas-dev#2651
  BUG: fix C parser thread safety. verify gil release close pandas-dev#2608
  BUG: usecols bug with implicit first index column. close pandas-dev#2654
  BUG: plotting bug when base is nonzero pandas-dev#2571
  BUG: period resampling bug when all values fall into a single bin. close pandas-dev#2070
  BUG: fix memory error in sortlevel when many multiindex levels. close pandas-dev#2684
  STY: CRLF
  BUG: perf_HEAD reports wrong vbench name when an exception is raised
  ...
@lexual
Contributor Author

lexual commented Jan 24, 2013

FYI, this looks to be specific to using index_col.

i.e.

# This works
df = xlsx.parse(sheet)
# This fails
df = xlsx.parse(sheet, index_col=[1, 2])

@lexual
Contributor Author

lexual commented Jan 24, 2013

Calling without index_col is allowing me to parse the file now. Thanks again for looking into this one, Wes.
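
If the two index columns are still wanted, they can be set after parsing. The sketch below assumes the frame returned by `xlsx.parse(sheet)` has its columns in the same positions that `index_col=[1, 2]` referred to; `set_index_by_position` is a hypothetical helper, and the small in-memory frame stands in for the parsed sheet.

```python
import pandas as pd


def set_index_by_position(df, positions):
    """Return a new frame indexed by the columns at the given positions."""
    index_names = [df.columns[i] for i in positions]
    return df.set_index(index_names)


# Stand-in for the result of xlsx.parse(sheet) called without index_col.
df = pd.DataFrame({
    "id": [1, 2],
    "region": ["north", "south"],
    "year": [2011, 2012],
    "value": [10.5, 20.25],
})

# Use the columns at positions 1 and 2 as a two-level index.
indexed = set_index_by_position(df, [1, 2])
```

Whether this gives exactly the frame that `index_col=[1, 2]` produced under pandas 0.9 has not been verified against the original file.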

@lexual lexual closed this as completed Jan 24, 2013