UnicodeDecodeError with Latin-1 characters in Stata files #23736

Closed
@yatharth

Description

Steps to reproduce

import pandas as pd

df = pd.read_stata('buggy_file.dta')

Expected behaviour

Pandas reads the Stata file without error.

Actual behaviour

Pandas raises a UnicodeDecodeError, traceable back to this line:

Diagnosis

The error is caused by a “smart quote” character, which is encoded in Latin-1 in the Stata .dta file but is not a valid byte sequence in UTF-8.

The error originates in the StataReader class in io/stata.py:

    def _decode(self, s):
        s = s.partition(b"\0")[0]
        return s.decode('utf-8')

Instead of the hard-coded 'utf-8', Pandas should use self._encoding or self._default_encoding, as other parts of the code do when reading from the input buffer/file. Making that change on my machine makes the issue go away.
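To illustrate the failure outside of Pandas, here is a minimal sketch of the `_decode` logic quoted above next to a variant that takes the encoding as a parameter. The function names `decode_utf8` and `decode_with_encoding` are hypothetical and are not part of Pandas, and the sample bytes use "é" (0xE9 in Latin-1) as a stand-in for the character in my file, which I can't attach here.

```python
# A null-padded fixed-width string field, similar to how strings are stored
# in a .dta file. 0xE9 is "é" in Latin-1 but is not valid UTF-8 on its own.
raw = b"caf\xe9\x00\x00\x00"


def decode_utf8(s):
    # Mirrors the hard-coded behaviour quoted above.
    s = s.partition(b"\0")[0]
    return s.decode("utf-8")


def decode_with_encoding(s, encoding="latin-1"):
    # Hypothetical variant: the encoding is supplied by the caller
    # (in StataReader this would presumably come from self._encoding).
    s = s.partition(b"\0")[0]
    return s.decode(encoding)


try:
    decode_utf8(raw)
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

decoded = decode_with_encoding(raw)
print(utf8_failed, decoded)
```

The UTF-8 path raises because 0xE9 announces a multi-byte sequence that never completes, while the Latin-1 path maps every byte to a character and so cannot fail on this input.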

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
