Cannot print DataFrame with unicode columns in IPython #680
Comments
We've seen this bug in DataFrame.to_string as well. |
@takluyver do you know if there are any workarounds for Python 2.x? Basically
|
Sorry for the premature post, this is the code that produces the original issue I intended to report. Let me know if you want a new issue. from pandas import DataFrame |
If you've got unicode on Python 2, you can encode it just before you return from The best options are: 1) check |
Wow, thanks @takluyver as I don't think I would have had the patience to figure that all out on my own. The referenced commit with changes and unit tests makes all the above work fine and typing |
Somehow this does not work for me; see the log of failing unit tests at the bottom. I also tried it interactively:
In [1]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:df = DataFrame({u'c/\u03c3':[1,2,3]})
:--
In [2]: print df.to_string()
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)
In [3]: df
Out[3]: ---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)
In [4]: print u'c/\u03c3'
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
/projects/hardware/users/wovermei/sandbox/pandas/sandbox/<ipython-input-4-45529fc7c4b5> in <module>()
----> 1 print u'c/\u03c3'
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)
In [5]: import sys
In [6]: sys.stdin.encoding
Out[6]: 'ISO-8859-1'
ERROR: test_to_string_repr_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1823, in test_to_string_repr_unicode
df.to_string(col_space=10, buf=buf)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
formatter.to_string()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 195, in to_string
fmt_values = self._format_col(c)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 248, in _format_col
formatter)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 429, in _format_fixed_width
formatted = [formatter(x) for x in values]
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 226, in formatter
col_width=col_width)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 603, in _format
return _just_help('%s' % _stringify(s))
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
return '%s' % console_encode(col)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256)
======================================================================
ERROR: test_to_string_unicode_columns (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1832, in test_to_string_unicode_columns
df.to_string(buf=buf)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
formatter.to_string()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 189, in to_string
str_columns = self._get_formatted_column_labels()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 370, in _get_formatted_column_labels
fmt_columns = self.columns.format()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in format
result.extend(_stringify(x) for x in self)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in <genexpr>
result.extend(_stringify(x) for x in self)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
return '%s' % console_encode(col)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256)
======================================================================
ERROR: test_to_string_with_formatters_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1841, in test_to_string_with_formatters_unicode
result = df.to_string(formatters={u'c/\u03c3': lambda x: '%s' % x})
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
formatter.to_string()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 189, in to_string
str_columns = self._get_formatted_column_labels()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 370, in _get_formatted_column_labels
fmt_columns = self.columns.format()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in format
result.extend(_stringify(x) for x in self)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in <genexpr>
result.extend(_stringify(x) for x in self)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
return '%s' % console_encode(col)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)
======================================================================
ERROR: test_repr_unicode (pandas.tests.test_series.TestSeries)
----------------------------------------------------------------------
Traceback (most recent call last):
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_series.py", line 752, in test_repr_unicode
repr(s)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/series.py", line 558, in __repr__
name=True)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/series.py", line 596, in _get_repr
return formatter.to_string()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 118, in to_string
fmt_values = self._get_formatted_values()
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 105, in _get_formatted_values
fmt_values.append(' %s' % self.formatter(v))
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 58, in formatter
col_width=col_width)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 603, in _format
return _just_help('%s' % _stringify(s))
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
return '%s' % console_encode(col)
File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256) |
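The failure at the bottom of each traceback can be reproduced without pandas. This is a minimal sketch, written in Python 3 syntax so it runs on a current interpreter (the thread itself is on Python 2.7); the variable names are illustrative and not taken from pandas:

```python
# The column label from the report: 'c', '/', then GREEK SMALL LETTER SIGMA.
label = "c/\u03c3"

# ISO-8859-1 (Latin-1) only covers code points 0-255, so sigma (U+03C3)
# cannot be encoded and a strict encode raises UnicodeEncodeError.
try:
    label.encode("ISO-8859-1")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# UTF-8 can encode every Unicode code point, so the same call succeeds.
utf8_bytes = label.encode("utf-8")

print(failure)
print(utf8_bytes)
```

The error position depends on where the sigma sits in the string being encoded, which would explain why some tracebacks report position 0 and others position 2.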
I guess that gives the lie to my assertion that all modern Linux uses UTF-8 as the terminal encoding. What distro is that, @lodagro ? It should be an easy fix - see my comments on b4ca18b. |
Linux version 2.6.18-238.5.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Mon Feb 21 05:52:39 EST 2011 |
RHEL 5.1? |
no 5.1 available, but 5.6, same issue.
|
I changed my Linux LANG environment variable. The pandas tests run fine now, but the on-screen output is not as expected (the \sigma is not shown).
In [1]: import sys
In [2]: sys.stdin.encoding
Out[2]: 'UTF-8'
In [3]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:df = DataFrame({u'c/\u03c3':[1,2,3]})
:--
In [4]: df
Out[4]:
c/Ï
0 1
1 2
2 3 |
Python is now able to encode it (any character can be encoded to UTF-8), but I guess your terminal is still expecting ISO-8859-1 encoded text, so it doesn't display properly. If what your terminal expects doesn't match the encoding Python thinks it should be using, output will inevitably get messed up. |
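The mismatch can be simulated in a few lines. This is a sketch in Python 3 syntax, assuming the terminal interprets incoming bytes as ISO-8859-1 while Python writes UTF-8; the variable names are illustrative:

```python
label = "c/\u03c3"

# What Python writes once it believes the terminal encoding is UTF-8:
# sigma becomes the two bytes 0xCF 0x83.
written = label.encode("utf-8")

# What a terminal still configured for ISO-8859-1 makes of those bytes:
# every byte is treated as one character of its own.
shown = written.decode("iso-8859-1")

print(repr(written))
print(repr(shown))
```

Byte 0xCF is "Ï" in ISO-8859-1 and 0x83 is a control character that is normally not drawn, which is consistent with the "c/Ï" header in the output above.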
Indeed, if I also change the terminal encoding, it works fine.
In [1]: print u'c/\u03c3'
c/σ |
I tried running the unit tests again, including @adamklein's latest commits (SHA 41e6083 and SHA c52dd87). One unit test fails with Latin-1 encoding (see below); it works fine with UTF-8.
FAIL: test_to_string_with_formatters_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
File "<...>/lib/python2.7/site-packages/pandas-0.7.0.dev_c52dd87-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1842, in test_to_string_with_formatters_unicode
self.assertEqual(result, ' c/\xcf\x83\n0 1 \n1 2 \n2 3 ')
AssertionError: ' c/?\n0 1 \n1 2 \n2 3 ' != ' c/\xcf\x83\n0 1 \n1 2 \n2 3 ' |
This no longer works after the latest fix: import StringIO Raises: |
@craustin : Do you have a traceback with that? Also, what is |
@adamklein Indeed, the unit test assumes UTF-8 output. pandas does: return value.encode(sys.stdin.encoding, 'replace'). The 'replace' handler is documented as: replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'. So when Linux is set up with a non-UTF-8 codec and Python has to encode characters that codec cannot represent, replacement markers are inserted instead of a UnicodeError being raised, and the encoded result differs from what the UTF-8 codec would produce. The test needs to check for the UTF-8 string or the alternatives. In the example above, ? replaces the character that could not be encoded. The test runs fine if I set the UTF-8 codec. |
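The effect of the 'replace' handler on the test's expected value can be sketched as follows. This is Python 3 syntax, and `console_encode` here is a hypothetical stand-in that takes the encoding as an argument rather than reading `sys.stdin.encoding`; it is not the pandas function itself:

```python
def console_encode(value, encoding):
    """Hypothetical stand-in: encode with the 'replace' error handler."""
    return value.encode(encoding, "replace")


label = "c/\u03c3"

# With a UTF-8 console, sigma is encoded as the two bytes 0xCF 0x83.
as_utf8 = console_encode(label, "utf-8")

# With a Latin-1 console, sigma has no representation and becomes '?'.
as_latin1 = console_encode(label, "latin-1")

print(as_utf8)
print(as_latin1)
```

One possible way to make such a test independent of the console would be to build the expected string with the same encoding and error handler the code under test uses, instead of hard-coding the UTF-8 bytes.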
sys.stdin.encoding is 'cp437'
Traceback:
pandas\core\format.pyc in to_string(self)
Lib\site-packages\numpy\core\numeric.pyc in array_repr(arr, max_line_width, precision, suppress_small)
Lib\site-packages\numpy\core\arrayprint.pyc in array2string(a, max_line_width, precision, suppress_small, separator,
Lib\site-packages\numpy\core\arrayprint.pyc in _array2string(a, max_line_width, precision, suppress_small, separator, prefix)
Lib\site-packages\numpy\core\arrayprint.pyc in _formatArray(a, format_function, rank, max_line_len, next_line_prefix, separator, edge_items, summary_insert)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 2: ordinal not in range(128) |
all set it looks like |
My changes in PR #685 are relevant to this code on Python 3. |
I'll merge and test that in the next day or two. thanks a lot for chasing down those issues |
Another repro - let me know if you want a new issue: from pandas import Series, DataFrame |
That one can't be fixed. You can't mix byte strings and unicode in a StringIO |
This works without the reindex(['test']), so I suppose that is causing pandas to write ascii instead of unicode. If this behavior is by design, I'll handle it on my end. |
It looks like when In [55]: dm
Out[55]:
Empty DataFrame
Columns: array([u'c/\u03c3'],
dtype='<U3')
Index: array([], dtype=object)
In [56]: dm.reindex(['test'])
Out[56]:
c/σ
test NaN |
I suspect that in both cases 8-bit strings are written, but in the second case you have a non-ascii character (a byte > 127). Trying to combine that with unicode causes the problem. From the StringIO docs: "The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called." |
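The rule quoted from the StringIO docs comes down to an implicit ASCII decode of the byte strings when they are combined with unicode. Python 3 no longer performs that coercion, so the sketch below emulates it explicitly; `combine_like_python2` is a hypothetical helper written only for illustration:

```python
def combine_like_python2(unicode_part, byte_part):
    """Emulate Python 2's implicit coercion: bytes are decoded as ASCII."""
    return unicode_part + byte_part.decode("ascii")


# A pure 7-bit byte string mixes with unicode without trouble.
ok = combine_like_python2("index: ", b"test")

# A byte string that uses the 8th bit (here UTF-8 encoded sigma) cannot be
# decoded as ASCII, so combining it with unicode fails.
try:
    combine_like_python2("column: ", b"c/\xcf\x83")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(ok)
print(error)
```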
Wouter, you are right. I will attempt to make it print unicode in both cases, which is consistent. |
After some discussion, I think the best solution is to produce a unicode string containing the formatted DataFrame or Series and then attempt to convert it to a byte string (which will need to be a no-op in Python 3); if that conversion fails, the unicode string is returned instead. This is essentially the same behavior as in pandas <= 0.6.1, except that encodings will be handled properly when returning non-ASCII from |
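The fallback described here could look roughly like the sketch below. It uses Python 3 syntax to emulate the Python 2 behavior, where converting unicode to `str` means encoding to bytes; the function name and the `encoding` parameter are assumptions made for illustration, not the code that was committed:

```python
def to_str_or_unicode(formatted, encoding="ascii"):
    """Try to return an encoded byte string; fall back to the unicode text.

    Hypothetical helper: `formatted` is the already-formatted frame or
    series as text, `encoding` is the codec to attempt.
    """
    try:
        return formatted.encode(encoding)
    except UnicodeEncodeError:
        # The text cannot be represented in `encoding`, so hand back the
        # unicode string unchanged rather than raising.
        return formatted


plain = to_str_or_unicode("   a\n0  1")
greek = to_str_or_unicode("   c/\u03c3\n0  1")

print(repr(plain))
print(repr(greek))
```

Under this scheme the caller gets bytes whenever the content allows it and unicode otherwise, so code consuming the result has to accept either type.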
* commit 'v0.7.0rc1-73-g69d5bd8': (44 commits)
BUG: integer slices should never access label-indexing, GH pandas-dev#700
BUG: pandas-dev#680 clean up with check for py3compat
BUG: pandas-dev#680 rears again. cut off another hydra head
ENH: change to tree-like MultiIndex output with > 2 levels, GH pandas-dev#689
TST: added a test related to pandas-dev#680
BUG: related to closes pandas-dev#691, removed cruft
BUG: closes pandas-dev#691, assignment with ix and mixed dtypes
BUG: handle incomparable values when creating Factor, caused bug in py3
TST: Fixes for tests on Python 3.
BUG: pandas-dev#680, print consistently when dataframe is empty
TST: unit test for PR pandas-dev#684
ENH: Allow Series.to_csv to ignore the index.
BUG: raise exception in DateRange with MonthEnd(0) instead of infinite loop, GH pandas-dev#683
BUG: unbox 0-dimensional arrays in map_infer, GH pandas-dev#690
updated license and credits for overview
ENH: cythonize timestamp conversion in HDFStore
TST: ok, this appears to work GH pandas-dev#680
TST: even more woes GH pandas-dev#680
TST: unicode woes on windoze GH pandas-dev#680
TST: unicode codec test issue, GH pandas-dev#680
...
from pandas import DataFrame
df = DataFrame({u'c/\u03c3':[1,2,3]})
Try typing 'df' in IPython:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 5: ordinal not in range(128)