
BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571


Closed
62 commits
d485c4a
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
ae62350
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
36bcdd8
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 6, 2016
285ccf9
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
173c38b
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
78d46d6
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
35dfb13
chore: matched master
nateGeorge Jul 12, 2016
71f084e
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
da8fce4
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1825486
TST: conform to PEP8
nateGeorge Jul 12, 2016
1d30333
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
4f680d7
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 12, 2016
b582195
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
e26c92a
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
d14b69e
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
eeb7011
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Jul 13, 2016
b8d78c4
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 5, 2016
75869f4
BUG: `read_csv` throws UnicodeDecodeError with unicode
nateGeorge Jul 6, 2016
9c88919
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
6725536
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
671ad41
BUG: `read_csv` throws UnicodeDecodeError with unicode aliases
nateGeorge Jul 6, 2016
3c4a798
BUG: Groupby.nth includes group key inconsistently #12839
adneu Jul 6, 2016
5675b82
In gbq, use googleapiclient instead of apiclient #13454 (#13458)
parthea Jul 7, 2016
ff6117e
RLS: switch master from 0.18.2 to 0.19.0 (#13586)
jorisvandenbossche Jul 8, 2016
b983957
BUG: Datetime64Formatter not respecting ``formatter``
haleemur Jul 8, 2016
451c054
BUG: Fix TimeDelta to Timedelta (#13600)
yui-knk Jul 9, 2016
33278a9
COMPAT: 32-bit compat fixes mainly in testing
jreback Jul 7, 2016
181cecd
BUG: DatetimeIndex - Period shows ununderstandable error
sinhrks Jul 10, 2016
a2e5d54
ENH: add downcast to pd.to_numeric
gfyoung Jul 10, 2016
6c8b21b
CLN: remove radd workaround in ops.py
sinhrks Jul 10, 2016
5d99cff
DEPR: rename Timestamp.offset to .freq
sinhrks Jul 10, 2016
8e7904f
CLN: Remove the engine parameter in CSVFormatter and to_csv
gfyoung Jun 10, 2016
a07b5d3
BUG: Block/DTI doesnt handle tzlocal properly
sinhrks Jul 10, 2016
ff2a335
BUG: Series contains NaT with object dtype comparison incorrect (#13592)
sinhrks Jul 11, 2016
1f8cc7f
CLN/TST: Add tests for nan/nat mixed input (#13477)
sinhrks Jul 11, 2016
f743eb3
BUG: groupby apply on selected columns yielding scalar (GH13568) (#13…
jorisvandenbossche Jul 11, 2016
e161699
TST: Clean up tests of DataFrame.sort_{index,values} (#13496)
IamJeffG Jul 11, 2016
5765b92
DOC: add pd.read_csv bug #13549
nateGeorge Jul 12, 2016
ac18b36
TST: out-> result and tm.ensure_clean
nateGeorge Jul 12, 2016
1fc6b90
TST: conform to PEP8
nateGeorge Jul 12, 2016
6b0e2ca
TST: condense test_read_utf_aliases test
nateGeorge Jul 12, 2016
41a6fae
DOC: asfreq clarify original NaNs are not filled (GH9963) (#13617)
jorisvandenbossche Jul 12, 2016
f730e60
BUG: Invalid Timedelta op may raise ValueError
sinhrks Jul 12, 2016
05a2d04
CLN: Cleanup ops.py
sinhrks Jul 12, 2016
c4e93bd
CLN: Removed outtype in DataFrame.to_dict (#13627)
gfyoung Jul 12, 2016
430273d
CLN: Fix compile time warnings
yui-knk Jul 13, 2016
1fa91b9
CLN: remove unnecessary BytesIO import
nateGeorge Jul 13, 2016
e379e9f
CLN: remove unnecessary csv write line
nateGeorge Jul 13, 2016
a35521e
Pin IPython for doc build to 4.x (see #13639)
jorisvandenbossche Jul 13, 2016
6c09821
CLN: reorg type inference & introspection
jreback Jul 13, 2016
5584dff
BLD: included pandas.api.* in setup.py (#13640)
gfyoung Jul 13, 2016
9463dee
docs: add note about read_csv() bug
nateGeorge Aug 15, 2016
5198179
cln: trying to merge with master
nateGeorge Aug 15, 2016
3c30cd0
CLN: merge with master
nateGeorge Aug 15, 2016
e77ac2d
Merge branch 'fix/read_csv-utf-aliases' of github.com:nateGeorge/pand…
nateGeorge Aug 19, 2016
69ab536
CLN: reset to master branch
nateGeorge Aug 19, 2016
1eb478d
Merge branch 'master' of github.com:pydata/pandas into fix/read_csv-u…
nateGeorge Aug 19, 2016
a2f178f
CLN: fix small diff from upstream/master
nateGeorge Aug 19, 2016
8e05f7e
BUG: _read encoding fix
nateGeorge Aug 19, 2016
ab153d5
DOC: add note on read_csv bug
nateGeorge Aug 19, 2016
0c1de9f
TST: add test for read_csv with unicode bug
nateGeorge Aug 19, 2016
77ec966
CLN: fix indents and spacings
nateGeorge Aug 19, 2016
CLN: merge with master
nateGeorge committed Aug 15, 2016
commit 3c30cd084a82a05e4ff8a38a8a7202d8fd97154f
56 changes: 56 additions & 0 deletions doc/source/whatsnew/v0.19.0.txt
@@ -233,11 +233,52 @@ New behaviour:
In [2]: pd.read_csv(StringIO(data), names=names)


<<<<<<< HEAD
New behaviour:

.. ipython :: python

In [2]: pd.read_csv(StringIO(data), names=names)
=======
.. _whatsnew_0190.enhancements.read_csv_categorical:

:func:`read_csv` supports parsing ``Categorical`` directly
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The :func:`read_csv` function now supports parsing a ``Categorical`` column when
specified as a dtype (:issue:`10153`). Depending on the structure of the data,
this can result in a faster parse time and lower memory usage compared to
converting to ``Categorical`` after parsing. See the io :ref:`docs here <io.categorical>`

.. ipython:: python

   data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3'

   pd.read_csv(StringIO(data))
   pd.read_csv(StringIO(data)).dtypes
   pd.read_csv(StringIO(data), dtype='category').dtypes

Individual columns can be parsed as a ``Categorical`` using a dict specification

.. ipython:: python

   pd.read_csv(StringIO(data), dtype={'col1': 'category'}).dtypes

.. note::

   The resulting categories will always be parsed as strings (object dtype).
   If the categories are numeric they can be converted using the
   :func:`to_numeric` function, or as appropriate, another converter
   such as :func:`to_datetime`.

   .. ipython:: python

      df = pd.read_csv(StringIO(data), dtype='category')
      df.dtypes
      df['col3']
      df['col3'].cat.categories = pd.to_numeric(df['col3'].cat.categories)
      df['col3']
>>>>>>> f93ad1ca828dc70a865445f1555958acbf132af1

.. _whatsnew_0190.enhancements.semi_month_offsets:

@@ -968,5 +1009,20 @@ Bug Fixes

- Bug in the CSS classes assigned to ``DataFrame.style`` for index names. Previously they were assigned ``"col_heading level<n> col<c>"`` where ``n`` was the number of levels + 1. Now they are assigned ``"index_name level<n>"``, where ``n`` is the correct level for that MultiIndex.
- Bug where ``pd.read_gbq()`` could throw ``ImportError: No module named discovery`` as a result of a naming conflict with another python package called apiclient (:issue:`13454`)
- Bug in ``Index.union`` returns an incorrect result with a named empty index (:issue:`13432`)
- Bugs in ``Index.difference`` and ``DataFrame.join`` raise in Python3 when using mixed-integer indexes (:issue:`13432`, :issue:`12814`)
- Bug in ``.to_excel()`` when DataFrame contains a MultiIndex which contains a label with a NaN value (:issue:`13511`)
- Bug in invalid frequency offset string like "D1", "-2-3H" may not raise ``ValueError`` (:issue:`13930`)
- Bug in ``concat`` and ``groupby`` for hierarchical frames with ``RangeIndex`` levels (:issue:`13542`).

- Bug in ``agg()`` function on groupby dataframe changes dtype of ``datetime64[ns]`` column to ``float64`` (:issue:`12821`)

- Bug in operations on ``NaT`` returning ``float`` instead of ``datetime64[ns]`` (:issue:`12941`)

- Bug in ``pd.read_csv`` in Python 2.x with non-UTF8 encoded, multi-character separated data (:issue:`3404`)

- Bug in ``Index`` raises ``KeyError`` displaying incorrect column when column is not in the df and columns contains duplicate values (:issue:`13822`)
- Bug in ``Period`` and ``PeriodIndex`` creating wrong dates when frequency has combined offset aliases (:issue:`13874`)
- Bug in ``pd.to_datetime()`` did not cast floats correctly when ``unit`` was specified, resulting in truncated datetime (:issue:`13845`)
- Bug in ``.to_string()`` when called with an integer ``line_width`` and ``index=False`` raises an ``UnboundLocalError`` because ``idx`` is referenced before assignment.
- Bug in ``pd.read_csv()``, where aliases for utf-xx (e.g. UTF-xx, UTF_xx, utf_xx) raised ``UnicodeDecodeError`` (:issue:`13549`)
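
The alias entry above comes down to spelling: Python's codec registry treats `utf-8`, `UTF-8`, `UTF_8` and `utf_8` as the same codec. A minimal standard-library sketch of that normalization follows. `canonical_encoding` is a hypothetical helper name, and this only illustrates the idea; it is not necessarily how this PR changes pandas.

```python
import codecs


def canonical_encoding(name):
    """Return the canonical codec name that Python resolves ``name`` to.

    Hypothetical helper for illustration; not part of pandas.
    """
    return codecs.lookup(name).name


# The spellings named in the whatsnew entry, for UTF-8 and UTF-16.
aliases = [fmt.format(bits)
           for bits in (8, 16)
           for fmt in ('utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}')]

# Map every alias to the name the codec registry reports for it.
resolved = {alias: canonical_encoding(alias) for alias in aliases}
for alias, name in resolved.items():
    print('{0} -> {1}'.format(alias, name))
```

All eight spellings collapse to two canonical names, so a parser that compares the normalized name instead of the raw string would presumably treat the aliases identically.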
82 changes: 82 additions & 0 deletions pandas/io/tests/parser/common.py
@@ -1513,3 +1513,85 @@ def test_read_csv_utf_aliases(self):
data = 'mb_num,multibyte\n4.8,test'.encode(encoding)
result = self.read_csv(BytesIO(data), encoding=encoding)
tm.assert_frame_equal(result, expected)

    def test_null_byte_char(self):
        # see gh-2741
        data = '\x00,foo'
        cols = ['a', 'b']

        expected = DataFrame([[np.nan, 'foo']],
                             columns=cols)

        if self.engine == 'c':
            out = self.read_csv(StringIO(data), names=cols)
            tm.assert_frame_equal(out, expected)
        else:
            msg = "NULL byte detected"
            with tm.assertRaisesRegexp(csv.Error, msg):
                self.read_csv(StringIO(data), names=cols)

    def test_utf8_bom(self):
        # see gh-4793
        bom = u('\ufeff')
        utf8 = 'utf-8'

        def _encode_data_with_bom(_data):
            bom_data = (bom + _data).encode(utf8)
            return BytesIO(bom_data)

        # basic test
        data = 'a\n1'
        expected = DataFrame({'a': [1]})

        out = self.read_csv(_encode_data_with_bom(data),
                            encoding=utf8)
        tm.assert_frame_equal(out, expected)

        # test with "regular" quoting
        data = '"a"\n1'
        expected = DataFrame({'a': [1]})

        out = self.read_csv(_encode_data_with_bom(data),
                            encoding=utf8, quotechar='"')
        tm.assert_frame_equal(out, expected)

        # test in a data row instead of header
        data = 'b\n1'
        expected = DataFrame({'a': ['b', '1']})

        out = self.read_csv(_encode_data_with_bom(data),
                            encoding=utf8, names=['a'])
        tm.assert_frame_equal(out, expected)

        # test in empty data row with skipping
        data = '\n1'
        expected = DataFrame({'a': [1]})

        out = self.read_csv(_encode_data_with_bom(data),
                            encoding=utf8, names=['a'],
                            skip_blank_lines=True)
        tm.assert_frame_equal(out, expected)

        # test in empty data row without skipping
        data = '\n1'
        expected = DataFrame({'a': [np.nan, 1.0]})

        out = self.read_csv(_encode_data_with_bom(data),
                            encoding=utf8, names=['a'],
                            skip_blank_lines=False)
        tm.assert_frame_equal(out, expected)

    def test_temporary_file(self):
        # see gh-13398
        data1 = "0 0"

        from tempfile import TemporaryFile
        new_file = TemporaryFile("w+")
        new_file.write(data1)
        new_file.flush()
        new_file.seek(0)

        # raw string so that \s is passed to the regex engine unchanged
        result = self.read_csv(new_file, sep=r'\s+', header=None)
        new_file.close()
        expected = DataFrame([[0, 0]])
        tm.assert_frame_equal(result, expected)
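
The `test_utf8_bom` cases above depend on how a UTF-8 byte order mark appears in encoded input. As a standard-library sketch (no pandas involved, and the variable names are illustrative only), the following shows the prefix that `(bom + data).encode('utf-8')` produces, and that decoding with plain `utf-8` keeps U+FEFF while the `utf-8-sig` codec strips it.

```python
import codecs

bom = u'\ufeff'
data = u'a\n1'

# Same construction as the _encode_data_with_bom helper in the test.
encoded = (bom + data).encode('utf-8')

# The encoded payload starts with the three-byte UTF-8 BOM.
has_bom_prefix = encoded.startswith(codecs.BOM_UTF8)

# Plain utf-8 keeps the BOM as a leading U+FEFF character ...
decoded_plain = encoded.decode('utf-8')
# ... while utf-8-sig removes it during decoding.
decoded_sig = encoded.decode('utf-8-sig')

print(has_bom_prefix)
print(repr(decoded_plain))
print(repr(decoded_sig))
```

A reader that decodes with plain `utf-8` therefore has to drop the leading U+FEFF itself before interpreting the first header or data field, which is the situation the test cases exercise.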
You are viewing a condensed version of this merge commit. You can view the full changes here.