Cannot print DataFrame with unicode columns in IPython #680

Closed
craustin opened this issue Jan 25, 2012 · 29 comments
Labels
Bug Unicode Unicode strings
Milestone

Comments

@craustin

from pandas import DataFrame
df = DataFrame({u'c/\u03c3':[1,2,3]})

Try typing 'df' in IPython:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 5: ordinal not in range(128)

@craustin
Author

We've seen this bug in DataFrame.to_string as well.

@wesm
Member

wesm commented Jan 25, 2012

@takluyver do you know if there are any workarounds for Python 2.x? Basically, __repr__ can't return unicode in Python 2.x. This works, though:

In [7]: print df.to_string()
   c/σ
0  1  
1  2  
2  3  

@craustin
Author

Sorry for the premature post, this is the code that produces the original issue I intended to report. Let me know if you want a new issue.

from pandas import DataFrame
df = DataFrame({u'c/\u03c3':[1,2,3]})
df.to_string(formatters={u'c/\u03c3': lambda x: '%s' % x})

@takluyver
Contributor

If you've got unicode on Python 2, you can encode it just before you return from __repr__. Then you run into the question of picking an encoding. On modern Linux & OS X, the terminal encoding is usually UTF-8. On Windows, it's the current code page. IPython has code to try to detect the terminal encoding, but there's no guarantee that you're running in a terminal, so there might not be a 'correct' choice.

The best options are: 1) check sys.stdin.encoding, being aware that stdin can be None, and encoding can be None, or 2) avoid the whole question, and .encode('unicode-escape') or .encode('ascii', 'replace').
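
A minimal sketch of option 1 with option 2 as the fallback. The helper name `console_encode_sketch` and its `stream` parameter are hypothetical and only for illustration; this is not the code that ended up in pandas, and it is written so that it also runs on Python 3.

```python
import sys


def console_encode_sketch(value, stream=None):
    """Encode text for display, guessing the encoding from a stream.

    Hypothetical helper for illustration; not the pandas implementation.
    """
    if stream is None:
        # sys.stdin can itself be None, e.g. when no console is attached.
        stream = sys.stdin
    # Both the stream and its `encoding` attribute may be None.
    encoding = getattr(stream, 'encoding', None)
    if encoding is None:
        # Nothing to detect: fall back to an ASCII-safe escaped form.
        return value.encode('unicode-escape')
    # 'replace' substitutes '?' for characters the codec cannot represent
    # instead of raising UnicodeEncodeError.
    return value.encode(encoding, 'replace')
```

With a stream reporting UTF-8 the sigma survives, with ISO-8859-1 it turns into `?`, and with no detectable encoding the escaped form `c/\u03c3` is produced.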

@wesm
Member

wesm commented Jan 25, 2012

Wow, thanks @takluyver as I don't think I would have had the patience to figure that all out on my own. The referenced commit with changes and unit tests makes all the above work fine and typing df into the console works now (prints the \sigma). I need to make sure that it works alright on Windows too whenever I have a chance but I suspect so. Also may be able to close #340 once I check it out

@lodagro
Contributor

lodagro commented Jan 25, 2012

Somehow this does not work for me, see log of failing unittests at the bottom.

I tried it interactively also:

In [1]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:df = DataFrame({u'c/\u03c3':[1,2,3]})
:--

In [2]: print df.to_string()
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)

In [3]: df
Out[3]: ---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
...
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)

In [4]: print u'c/\u03c3'
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
/projects/hardware/users/wovermei/sandbox/pandas/sandbox/<ipython-input-4-45529fc7c4b5> in <module>()
----> 1 print u'c/\u03c3'

UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)

In [5]: import sys

In [6]: sys.stdin.encoding
Out[6]: 'ISO-8859-1'

ERROR: test_to_string_repr_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1823, in test_to_string_repr_unicode
    df.to_string(col_space=10, buf=buf)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
    formatter.to_string()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 195, in to_string
    fmt_values = self._format_col(c)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 248, in _format_col
    formatter)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 429, in _format_fixed_width
    formatted = [formatter(x) for x in values]
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 226, in formatter
    col_width=col_width)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 603, in _format
    return _just_help('%s' % _stringify(s))
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
    return '%s' % console_encode(col)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
    return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256)

======================================================================
ERROR: test_to_string_unicode_columns (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1832, in test_to_string_unicode_columns
    df.to_string(buf=buf)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
    formatter.to_string()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 189, in to_string
    str_columns = self._get_formatted_column_labels()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 370, in _get_formatted_column_labels
    fmt_columns = self.columns.format()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in format
    result.extend(_stringify(x) for x in self)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in <genexpr>
    result.extend(_stringify(x) for x in self)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
    return '%s' % console_encode(col)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
    return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256)

======================================================================
ERROR: test_to_string_with_formatters_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1841, in test_to_string_with_formatters_unicode
    result = df.to_string(formatters={u'c/\u03c3': lambda x: '%s' % x})
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 969, in to_string
    formatter.to_string()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 189, in to_string
    str_columns = self._get_formatted_column_labels()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 370, in _get_formatted_column_labels
    fmt_columns = self.columns.format()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in format
    result.extend(_stringify(x) for x in self)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/index.py", line 269, in <genexpr>
    result.extend(_stringify(x) for x in self)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
    return '%s' % console_encode(col)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
    return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 2: ordinal not in range(256)

======================================================================
ERROR: test_repr_unicode (pandas.tests.test_series.TestSeries)
----------------------------------------------------------------------
Traceback (most recent call last):
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/tests/test_series.py", line 752, in test_repr_unicode
    repr(s)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/series.py", line 558, in __repr__
    name=True)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/series.py", line 596, in _get_repr
    return formatter.to_string()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 118, in to_string
    fmt_values = self._get_formatted_values()
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 105, in _get_formatted_values
    fmt_values.append(' %s' % self.formatter(v))
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/format.py", line 58, in formatter
    col_width=col_width)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 603, in _format
    return _just_help('%s' % _stringify(s))
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 512, in _stringify
    return '%s' % console_encode(col)
  File ".../lib/python2.7/site-packages/pandas-0.7.0.dev_b4ca18b-py2.7-linux-x86_64.egg/pandas/core/common.py", line 823, in console_encode
    return value.encode(sys.stdin.encoding)
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u03c3' in position 0: ordinal not in range(256)

@takluyver
Contributor

I guess that gives the lie to my assertion that all modern Linux uses UTF-8 as the terminal encoding. What distro is that, @lodagro ?

It should be an easy fix - see my comments on b4ca18b.

@lodagro
Contributor

lodagro commented Jan 25, 2012

Linux version 2.6.18-238.5.1.el5 (mockbuild@x86-002.build.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Mon Feb 21 05:52:39 EST 2011

@takluyver
Contributor

RHEL 5.1?

@lodagro
Contributor

lodagro commented Jan 25, 2012

no 5.1 available, but 5.6, same issue.

LSB Version:    :core-4.0-amd64:core-4.0-ia32:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-ia32:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-ia32:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 5.6 (Tikanga)
Release:        5.6
Codename:       Tikanga

adamklein added a commit that referenced this issue Jan 25, 2012
@lodagro
Contributor

lodagro commented Jan 25, 2012

I changed my linux LANG environment variable. pandas testing runs fine now, print on screen is not as expected though (\sigma is not shown).

[1115][i] more /etc/sysconfig/i18n 
SYSFONT="latarcyrheb-sun16"
LANG="en_US"
LC_COLLATE="C"

[1116][i] echo $LANG  
en_US

[1117][i] export LANG=en_US.UTF-8 

[1118][i] echo $LANG 
en_US.UTF-8

In [1]: import sys

In [2]: sys.stdin.encoding
Out[2]: 'UTF-8'

In [3]: cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:from pandas import DataFrame
:df = DataFrame({u'c/\u03c3':[1,2,3]})
:--

In [4]: df
Out[4]:
   c/Ï
0  1
1  2
2  3

@takluyver
Contributor

Python is now able to encode it (any character can be encoded to UTF-8), but I guess your terminal is still expecting ISO-8859-1 encoded text, so it doesn't display properly. If what your terminal expects doesn't match the encoding Python thinks it should be using, output will inevitably get messed up.
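
The mismatch can be reproduced without a terminal by decoding the UTF-8 bytes the way an ISO-8859-1 terminal would. This is only an illustration of the effect; the variable names are made up for the example.

```python
text = u'c/\u03c3'

# Python encodes to UTF-8: sigma becomes the two bytes 0xCF 0x83.
utf8_bytes = text.encode('utf-8')

# A terminal expecting ISO-8859-1 treats every byte as one character,
# so 0xCF is shown as 'Ï' and 0x83 as an invisible control character.
seen_by_latin1_terminal = utf8_bytes.decode('latin-1')

# When both sides agree on UTF-8, the round trip is lossless.
seen_by_utf8_terminal = utf8_bytes.decode('utf-8')
```

This matches the `c/Ï` column header in the output above.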

@lodagro
Contributor

lodagro commented Jan 25, 2012

Indeed, if I also change the terminal encoding, it works fine.

In [1]: print u'c/\u03c3'
c/σ

@lodagro
Contributor

lodagro commented Jan 25, 2012

I ran the unit tests again with @adamklein's latest commits (SHA 41e6083 and SHA c52dd87). One unit test fails with Latin-1 encoding (see below); it works fine with UTF-8.

FAIL: test_to_string_with_formatters_unicode (pandas.tests.test_frame.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "<...>/lib/python2.7/site-packages/pandas-0.7.0.dev_c52dd87-py2.7-linux-x86_64.egg/pandas/tests/test_frame.py", line 1842, in test_to_string_with_formatters_unicode
    self.assertEqual(result, '  c/\xcf\x83\n0 1   \n1 2   \n2 3   ')
AssertionError: '  c/?\n0 1  \n1 2  \n2 3  ' != '  c/\xcf\x83\n0 1   \n1 2   \n2 3   '

@craustin
Author

This no longer works after the latest fix:

import StringIO
from pandas import DataFrame
buf = StringIO.StringIO()
dm = DataFrame({u'c/\u03c3': []})
dm.to_string(buf)

Raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 2: ordinal not in range(128)

@takluyver
Contributor

@craustin : Do you have a traceback with that? Also, what is sys.stdin.encoding set to?

@adamklein
Contributor

@craustin looking into this, it's actually in the numpy repr method where it's throwing, ugh
@lodagro hmm, so the unit test assumes UTF-8 and your terminal is latin1

@lodagro
Contributor

lodagro commented Jan 25, 2012

@adamklein Indeed, the unit test assumes UTF-8 will be generated.

pandas does: return value.encode(sys.stdin.encoding, 'replace')

'replace': replace malformed data with a suitable replacement marker, such as '?' or '\ufffd'

So when Linux is set up with a non-UTF-8 codec and Python encodes characters that codec cannot represent, replacement markers are inserted instead of a UnicodeError being raised, so the encoded result differs from what the UTF-8 codec would produce. The test needs to check for the UTF-8 string or the alternatives. In the example above, ? replaces the characters that cannot be encoded.

Test runs fine if i set UTF-8 codec.
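
A small sketch of the difference, and of a check that accepts either outcome. The function `encode_for_console` is a stand-in that takes the encoding as an argument instead of reading sys.stdin.encoding, and the `acceptable` set is one possible way to write such a test, not the one pandas adopted.

```python
def encode_for_console(value, encoding):
    # Same call as quoted above, but with the encoding passed explicitly
    # so that both situations can be compared side by side.
    return value.encode(encoding, 'replace')


label = u'c/\u03c3'

utf8_result = encode_for_console(label, 'utf-8')      # sigma is preserved
latin1_result = encode_for_console(label, 'latin-1')  # sigma becomes '?'

# A test that does not assume a UTF-8 console can accept either form.
acceptable = {b'c/\xcf\x83', b'c/?'}
both_accepted = utf8_result in acceptable and latin1_result in acceptable
```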

adamklein added a commit that referenced this issue Jan 25, 2012
@craustin
Author

sys.stdin.encoding is 'cp437'

Traceback:
pandas\core\frame.pyc in to_string(self, buf, columns, col_space, colSpace, header, index, na_rep, formatters, float_format, sparsify, nanRep, index_names, justify)
967 index_names=index_names,
968 header=header, index=index)
--> 969 formatter.to_string()
970
971 if buf is None:

pandas\core\format.pyc in to_string(self)
182 info_line = 'Empty %s\nColumns: %s\nIndex: %s'
183 to_write.append(info_line % (type(self.frame).__name__,
--> 184 repr(frame.columns),
185 repr(frame.index)))
186 else:

Lib\site-packages\numpy\core\numeric.pyc in array_repr(arr, max_line_width, precision, suppress_small)
1348 if arr.size > 0 or arr.shape==(0,):
1349 lst = array2string(arr, max_line_width, precision, suppress_small,
-> 1350 ', ', "array(")
1351 else: # show zero-length shape unless it is (0,)
1352 lst = "[], shape=%s" % (repr(arr.shape),)

Lib\site-packages\numpy\core\arrayprint.pyc in array2string(a, max_line_width, precision, suppress_small, separator,
prefix, style)
298 else:
299 lst = _array2string(a, max_line_width, precision, suppress_small,
--> 300 separator, prefix)
301 return lst
302

Lib\site-packages\numpy\core\arrayprint.pyc in _array2string(a, max_line_width, precision, suppress_small, separator, prefix)
220 lst = _formatArray(a, format_function, len(a.shape), max_line_width,
221 next_line_prefix, separator,
--> 222 _summaryEdgeItems, summary_insert)[:-1]
223
224 return lst

Lib\site-packages\numpy\core\arrayprint.pyc in _formatArray(a, format_function, rank, max_line_len, next_line_prefix, separator, edge_items, summary_insert)
344 s, line = _extendLine(s, line, word, max_line_len, next_line_prefix)
345
--> 346 word = format_function(a[-1])
347 s, line = _extendLine(s, line, word, max_line_len, next_line_prefix)
348 s += line + "]\n"

UnicodeEncodeError: 'ascii' codec can't encode character u'\u03c3' in position 2: ordinal not in range(128)

wesm added a commit that referenced this issue Jan 25, 2012
wesm added a commit that referenced this issue Jan 25, 2012
wesm added a commit that referenced this issue Jan 25, 2012
wesm added a commit that referenced this issue Jan 25, 2012
@wesm
Member

wesm commented Jan 25, 2012

all set it looks like

@wesm wesm closed this as completed Jan 25, 2012
@takluyver
Contributor

My changes in PR #685 are relevant to this code on Python 3.

@wesm
Member

wesm commented Jan 25, 2012

I'll merge and test that in the next day or two. thanks a lot for chasing down those issues

@craustin
Author

Another repro - let me know if you want a new issue:

from pandas import Series, DataFrame
from StringIO import StringIO
buf = StringIO()
dm = DataFrame({u'c/\u03c3': Series({})}).reindex(['test']) # repro requires this reindex
dm.to_string(buf)
print >>buf, u"""\u03c3""" # repro requires this print
print buf.getvalue()

@wesm
Member

wesm commented Jan 26, 2012

That one can't be fixed. You can't mix byte strings and unicode in a StringIO

@craustin
Author

This works without the reindex(['test']), so I suppose that is causing pandas to write ascii instead of unicode. If this behavior is by design, I'll handle it on my end.

@lodagro
Contributor

lodagro commented Jan 26, 2012

It looks like when to_string(buffer) is called on an empty DataFrame, strings are written to the buffer. But when the DataFrame is not empty unicode is written to the buffer.

In [55]: dm
Out[55]:
Empty DataFrame
Columns: array([u'c/\u03c3'],
      dtype='<U3')
Index: array([], dtype=object)

In [56]: dm.reindex(['test'])
Out[56]:
      c/σ
test  NaN

@takluyver
Contributor

I suspect that in both cases 8-bit strings are written, but in the second case you have a non-ascii character (a byte > 127). Trying to combine that with unicode causes the problem.

From the StringIO docs: "The StringIO object can accept either Unicode or 8-bit strings, but mixing the two may take some care. If both are used, 8-bit strings that cannot be interpreted as 7-bit ASCII (that use the 8th bit) will cause a UnicodeError to be raised when getvalue() is called."
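
The failure comes from the ASCII decode that Python 2 applies implicitly when 8-bit strings are joined with unicode. The sketch below makes that decode explicit so it behaves the same on Python 2 and 3; `can_mix_with_unicode` is a hypothetical name used only for this illustration.

```python
# 8-bit strings of the two kinds discussed above.
ascii_bytes = u'Empty DataFrame'.encode('ascii')  # every byte is below 128
utf8_bytes = u'c/\u03c3'.encode('utf-8')          # contains 0xCF and 0x83


def can_mix_with_unicode(data):
    """Return True if `data` survives an ASCII decode.

    This is the decode Python 2 performs implicitly when an 8-bit string
    is combined with unicode, for example inside StringIO.getvalue().
    """
    try:
        data.decode('ascii')
    except UnicodeDecodeError:
        return False
    return True
```

Only the pure-ASCII output of the empty frame can be combined with unicode; the bytes for the sigma column cannot.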

@adamklein
Contributor

Wouter, you are right. I will attempt to make it print unicode in both cases, which is consistent.

@wesm
Member

wesm commented Jan 27, 2012

After some discussion, I think the best solution is to produce a unicode string containing the formatted DataFrame or Series and then attempt to convert it to a byte string (this step will need to be a no-op in Python 3). If the conversion fails, unicode is returned. This is essentially the same behavior as in pandas <= 0.6.1, except that encodings will be handled properly when returning non-ASCII from __repr__ in the console. This has been a fun one, thanks for all the help, guys.
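
A rough sketch of that try-then-fall-back idea. The name `to_console_str_sketch` and the default 'ascii' encoding are assumptions made for the example; the actual pandas change may differ in both naming and details.

```python
import sys


def to_console_str_sketch(text, encoding='ascii'):
    """Try to turn formatted unicode output into a byte string.

    Hypothetical helper: returns the unicode text unchanged when the
    conversion fails, and does nothing at all on Python 3.
    """
    if sys.version_info[0] >= 3:
        # str is already unicode on Python 3, so this is a no-op.
        return text
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # Non-encodable characters: hand back unicode instead of raising.
        return text
```

Plain ASCII output comes back as an ordinary string, while output containing the sigma column is returned as unicode rather than raising.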

yarikoptic added a commit to neurodebian/pandas that referenced this issue Feb 10, 2012
* commit 'v0.7.0rc1-73-g69d5bd8': (44 commits)
  BUG: integer slices should never access label-indexing, GH pandas-dev#700
  BUG: pandas-dev#680 clean up with check for py3compat
  BUG: pandas-dev#680 rears again. cut off another hydra head
  ENH: change to tree-like MultiIndex output with > 2 levels, GH pandas-dev#689
  TST: added a test related to pandas-dev#680
  BUG: related to closes pandas-dev#691, removed cruft
  BUG: closes pandas-dev#691, assignment with ix and mixed dtypes
  BUG: handle incomparable values when creating Factor, caused bug in py3
  TST: Fixes for tests on Python 3.
  BUG: pandas-dev#680, print consistently when dataframe is empty
  TST: unit test for PR pandas-dev#684
  ENH: Allow Series.to_csv to ignore the index.
  BUG: raise exception in DateRange with MonthEnd(0) instead of infinite loop, GH pandas-dev#683
  BUG: unbox 0-dimensional arrays in map_infer, GH pandas-dev#690
  updated license and credits for overview
  ENH: cythonize timestamp conversion in HDFStore
  TST: ok, this appears to work GH pandas-dev#680
  TST: even more woes GH pandas-dev#680
  TST: unicode woes on windoze GH pandas-dev#680
  TST: unicode codec test issue, GH pandas-dev#680
  ...
dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019