UnicodeDecodeError with html.table_schema = True #16848

Open
dhirschfeld opened this issue Jul 7, 2017 · 1 comment
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap IO JSON read_json, to_json, json_normalize Unicode Unicode strings

Comments

@dhirschfeld
Contributor

I have a DataFrame with a column of binary data which has an object dtype.

If display.html.table_schema is False then this displays fine: the column just shows the repr of the byte string.

If I set display.html.table_schema = True then attempting to display the DataFrame throws a UnicodeDecodeError from the json conversion.

It would be good if the display also worked when using the table_schema option.

In [1]: pd.Series([b'\x00\x00\x00\x00\x00\x01\x82S'])
Out[1]: 
0    b'\x00\x00\x00\x00\x00\x01\x82S'
dtype: object

In [2]: pd.options.display.html.table_schema = True

In [3]: pd.Series([b'\x00\x00\x00\x00\x00\x01\x82S'])
Traceback (most recent call last):

  File "C:\Miniconda3\lib\site-packages\IPython\core\formatters.py", line 336, in __call__
    return method()

  File "C:\Miniconda3\lib\site-packages\pandas\core\generic.py", line 141, in _repr_data_resource_
    payload = json.loads(data.to_json(orient='table'),

  File "C:\Miniconda3\lib\site-packages\pandas\core\generic.py", line 1245, in to_json
    lines=lines)

  File "C:\Miniconda3\lib\site-packages\pandas\io\json\json.py", line 46, in to_json
    date_unit=date_unit, default_handler=default_handler).write()

  File "C:\Miniconda3\lib\site-packages\pandas\io\json\json.py", line 166, in write
    data = super(JSONTableWriter, self).write()

  File "C:\Miniconda3\lib\site-packages\pandas\io\json\json.py", line 90, in write
    default_handler=self.default_handler

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 58: invalid start byte

Out[3]: 
0    b'\x00\x00\x00\x00\x00\x01\x82S'
dtype: object

In [4]: pd.get_option('display.encoding')
Out[4]: 'utf-8'

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.5.0a1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
@TomAugspurger TomAugspurger added IO JSON read_json, to_json, json_normalize Unicode Unicode strings Difficulty Intermediate labels Jul 7, 2017
@TomAugspurger TomAugspurger modified the milestones: 0.21.0, Next Major Release Jul 7, 2017
@TomAugspurger
Contributor

The underlying problem is in the call to to_json:

In [4]: pd.Series([b'\x00\x00\x00\x00\x00\x01\x82S']).to_json()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-4-13bdc7ec1cc6> in <module>()
----> 1 pd.Series([b'\x00\x00\x00\x00\x00\x01\x82S']).to_json()

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/generic.py in to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
   1250                             force_ascii=force_ascii, date_unit=date_unit,
   1251                             default_handler=default_handler,
-> 1252                             lines=lines)
   1253
   1254     def to_hdf(self, path_or_buf, key, **kwargs):

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/io/json/json.py in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines)
     46         obj, orient=orient, date_format=date_format,
     47         double_precision=double_precision, ensure_ascii=force_ascii,
---> 48         date_unit=date_unit, default_handler=default_handler).write()
     49
     50     if lines:

~/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/io/json/json.py in write(self)
     90             date_unit=self.date_unit,
     91             iso_dates=self.date_format == 'iso',
---> 92             default_handler=self.default_handler
     93         )
     94

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 42: invalid start byte
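
The same error can be reproduced without pandas. A minimal sketch, assuming the serializer tries to decode the raw bytes as UTF-8 (which the error message suggests): byte 0x82 is a UTF-8 continuation byte, so it can never start a character. The positions reported in the tracebacks (58 and 42) differ from the one here, presumably because they index into a larger internal buffer.

```python
# The byte string from the report.
payload = b'\x00\x00\x00\x00\x00\x01\x82S'

try:
    payload.decode('utf-8')
except UnicodeDecodeError as err:
    decode_error = err
    print(err.reason)                # invalid start byte
    print(hex(payload[err.start]))   # 0x82
else:
    decode_error = None
```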

I'm not sure how tricky it would be to pass through. The standard library doesn't even try to serialize bytes:

import json
In [9]: json.dumps(s[0])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-4b0d6b435871> in <module>()
----> 1 json.dumps(s[0])

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    229         cls is None and indent is None and separators is None and
    230         default is None and not sort_keys and not kw):
--> 231         return _default_encoder.encode(obj)
    232     if cls is None:
    233         cls = JSONEncoder

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py in encode(self, o)
    197         # exceptions aren't as detailed.  The list call should be roughly
    198         # equivalent to the PySequence_Fast that ''.join() would do.
--> 199         chunks = self.iterencode(o, _one_shot=True)
    200         if not isinstance(chunks, (list, tuple)):
    201             chunks = list(chunks)

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py in iterencode(self, o, _one_shot)
    255                 self.key_separator, self.item_separator, self.sort_keys,
    256                 self.skipkeys, _one_shot)
--> 257         return _iterencode(o, 0)
    258
    259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/encoder.py in default(self, o)
    178         """
    179         raise TypeError("Object of type '%s' is not JSON serializable" %
--> 180                         o.__class__.__name__)
    181
    182     def encode(self, o):

TypeError: Object of type 'bytes' is not JSON serializable
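
The standard library does provide a hook for this case: json.dumps accepts a `default` callable that is invoked for objects it cannot serialize. A minimal sketch that falls back to the repr of the byte string, mirroring what the plain display shows. The helper name `bytes_to_repr` is made up for illustration, and this is not a description of what the pandas serializer does.

```python
import json


def bytes_to_repr(obj):
    # Hypothetical helper: represent binary values by their repr,
    # and refuse anything else the encoder could not handle.
    if isinstance(obj, (bytes, bytearray)):
        return repr(obj)
    raise TypeError(
        f"Object of type {type(obj).__name__} is not JSON serializable"
    )


payload = b'\x00\x00\x00\x00\x00\x01\x82S'

# json.dumps only calls `default` for values it cannot encode itself.
encoded = json.dumps({"0": payload}, default=bytes_to_repr)
decoded = json.loads(encoded)
print(decoded["0"])
```

The result is a display string, not something that round-trips back to the original bytes. If the data had to be recovered, an encoding such as base64 would be a more suitable fallback.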

Either way, we don't want _repr_table_schema_ failing here...
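
One way to keep the repr from failing, sketched under assumptions: wrap the serialization step and return None when it raises. As far as I know, IPython skips a format whose repr method returns None and still shows the plain-text repr. `safe_table_payload` and its `serialize` argument are hypothetical stand-ins for the `json.loads(data.to_json(orient='table'))` step seen in the traceback, not pandas API.

```python
import json


def safe_table_payload(serialize):
    """Return the parsed JSON payload, or None if serialization fails.

    `serialize` is a hypothetical zero-argument callable that returns a
    JSON string, standing in for data.to_json(orient='table').
    """
    try:
        return json.loads(serialize())
    except (UnicodeDecodeError, TypeError, ValueError):
        # Signal "no representation available" instead of raising.
        return None


def failing_serialize():
    # Simulates the reported failure: 0x82 is not valid UTF-8.
    return b'\x82'.decode('utf-8')


def working_serialize():
    return '{"data": [{"index": 0, "values": 1}]}'


bad = safe_table_payload(failing_serialize)
good = safe_table_payload(working_serialize)
```

This only hides the traceback. Binary columns would still be missing from the table schema output unless the serializer itself learns to handle them.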

@jbrockmendel jbrockmendel added IO HTML read_html, to_html, Styler.apply, Styler.applymap and removed Effort Medium labels Oct 16, 2019
@mroeschke mroeschke added the Bug label Apr 14, 2020
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022