-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
BUG: read_csv throws UnicodeDecodeError with unicode aliases #13571
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
d485c4a
ae62350
36bcdd8
285ccf9
173c38b
78d46d6
35dfb13
71f084e
da8fce4
1825486
1d30333
4f680d7
b582195
e26c92a
d14b69e
eeb7011
b8d78c4
75869f4
9c88919
6725536
671ad41
3c4a798
5675b82
ff6117e
b983957
451c054
33278a9
181cecd
a2e5d54
6c8b21b
5d99cff
8e7904f
a07b5d3
ff2a335
1f8cc7f
f743eb3
e161699
5765b92
ac18b36
1fc6b90
6b0e2ca
41a6fae
f730e60
05a2d04
c4e93bd
430273d
1fa91b9
e379e9f
a35521e
6c09821
5584dff
9463dee
5198179
3c30cd0
e77ac2d
69ab536
1eb478d
a2f178f
8e05f7e
ab153d5
0c1de9f
77ec966
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1469,3 +1469,23 @@ def test_memory_map(self): | |
|
||
out = self.read_csv(mmap_file, memory_map=True) | ||
tm.assert_frame_equal(out, expected) | ||
|
||
def test_read_csv_utf_aliases(self): | ||
# see gh issue 13549 | ||
path = 'test.csv' | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. use the context manager
remove |
||
expected = DataFrame({'A': [0, 1], 'B': [2, 3], | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I suppose we could do one row as expected = pd.DataFrame({'mb_num': [4.8], 'multibyte': ['test']}) I used BytesIO because I don't think StringIO can support different encodings (I tried and wasn't able to get StringIO to work). |
||
'multibyte_test': ['testing123', 'bananabis'], | ||
'mb_nums': [154.868, 457.8798]}) | ||
|
||
for byte in [8, 16]: | ||
expected.to_csv(path, encoding='utf-' + str(byte), index=False) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You don't need this anymore. But much cleaner! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yeah, whoops. I'm testing the test in a separate file because I haven't figured out how to run the nosetests in common.py yet. |
||
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']: | ||
encoding = fmt.format(byte) | ||
for engine in ['c', 'python', None]: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. you don't iterate thru the engines, this is what There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So should I just write it like for byte in [8, 16]:
expected.to_csv(path, encoding='utf-' + str(byte), index=False)
for fmt in ['utf-{0}', 'utf_{0}', 'UTF-{0}', 'UTF_{0}']:
encoding = fmt.format(byte)
result = self.read_csv(
path,
encoding=encoding)
tm.assert_frame_equal(result, expected) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's correct (I didn't see this before because it was hidden away), see below. |
||
out = self.read_csv( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. out->result There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you mean change variable 'out' to 'result'? There are at least 8 other test functions in that file that use 'out' and 'expected' and do tm.assert_numpy_array_equal(out, expected) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oh i c. we almost always use we NEVER use an There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the prior function is explicity comparing for rec-arrays There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, I'll change the 'out's to 'result's in the file for consistency. Is that ok to include in this pull? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Surprised no one responded to this, but as that change is NOT directly related to your main fix, it would have been preferable to put in a separate PR (or at least in an individual commit SEPARATE from anything else). So don't undo what you've done, but just for future reference. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ok, thanks. |
||
path, | ||
engine=engine, | ||
encoding=encoding) | ||
tm.assert_frame_equal(out, expected) | ||
|
||
os.remove("test.csv") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put in 0.18.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double backticks around
UnicodeDecodeError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pd.read_csv()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like 0.18.2 was moved to 0.19.0 this weekend