read_excel with dtype=str converts empty cells to np.nan #20429

Closed
39 commits
dd53df8
TST: Test for astype_nansafe. Modified test for astype
nikoskaragiannakis Mar 20, 2018
6f771fb
BUG: np.nan should stay as it is when we cast to str/basestring
nikoskaragiannakis Mar 20, 2018
37f00ad
BUG: revert change in lib.pyx. modify excel functionality directly
nikoskaragiannakis Mar 20, 2018
f194b70
TST: revert changes in dtypes/test_cast. test excel functionality
nikoskaragiannakis Mar 20, 2018
eb8f4c5
DOC: added description
nikoskaragiannakis Mar 20, 2018
ac6a409
TST: correction and pep8
nikoskaragiannakis Mar 20, 2018
6994bb0
BUG: pep8
nikoskaragiannakis Mar 20, 2018
40a563f
TST: remove unused import
nikoskaragiannakis Mar 20, 2018
9858259
DOC: resolved conflict
nikoskaragiannakis Mar 20, 2018
5f71a99
Update v0.23.0.txt
nikoskaragiannakis Mar 20, 2018
0a93b60
conflict again
nikoskaragiannakis Mar 20, 2018
f296f9a
arghh
nikoskaragiannakis Mar 20, 2018
7c0af1f
DOC: add disallowing of Series construction of len-1 list with index …
jorisvandenbossche Mar 19, 2018
f0fd0a7
Bug: Allow np.timedelta64 objects to index TimedeltaIndex (#20408)
mroeschke Mar 19, 2018
61e0519
DOC: Only use ~ in class links to hide prefixes. (#20402)
dukebody Mar 19, 2018
9fdac27
DOC: update the pandas.DataFrame.plot.hist docstring (#20155)
liopic Mar 19, 2018
ddb904f
DOC" update the Pandas core window rolling count docstring" (#20264)
tommy-stone Mar 19, 2018
694849d
BUG: astype_unicode astype_str turn a np.nan to empty string (#20377)
nikoskaragiannakis Mar 24, 2018
5ba95a1
TST: added unitest for read_excel and modified series/test_dtypes for…
nikoskaragiannakis Mar 24, 2018
d3ceec3
TST: added unitest for read_csv (#20377)
nikoskaragiannakis Mar 25, 2018
ea1d73a
BUG: patched TextReader to turn np.nan to empty string if dtype=str (…
nikoskaragiannakis Mar 25, 2018
c1376a5
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
3103811
DOC: updated IO section (#20377)
nikoskaragiannakis Mar 25, 2018
7d5f6b2
pull from master
nikoskaragiannakis Mar 25, 2018
478d08d
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
edb26d7
BUG: np.nan stays as np.nan (#20377)
nikoskaragiannakis Apr 2, 2018
c3ab9cb
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
69f6c95
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
97a345a
TST: pep8 (#20377)
nikoskaragiannakis Apr 2, 2018
8b2fb0b
TXT: Moved test from series.test_io to io.parser.na_values. Corrected…
nikoskaragiannakis Apr 2, 2018
c9f5120
DOC: updated IO section (#20377)
nikoskaragiannakis Apr 2, 2018
fab0b27
resolve conflict
nikoskaragiannakis Apr 2, 2018
571d5c4
pep8 correction
nikoskaragiannakis Apr 2, 2018
0712392
Merge remote-tracking branch 'upstream/master' into nikoskaragiannaki…
TomAugspurger Apr 3, 2018
47bc105
DOC: Better explanation (#20377)
nikoskaragiannakis Apr 5, 2018
3740dfe
BUG: use checknull (#20377)
nikoskaragiannakis Apr 5, 2018
7d453bb
TST: update tests (#20377)
nikoskaragiannakis Apr 8, 2018
bcd739d
BUG: string nans to np.nan in Series for list data (#20377)
nikoskaragiannakis Apr 8, 2018
7341cd1
sync
nikoskaragiannakis Apr 8, 2018
pep8 correction
nikoskaragiannakis committed Apr 2, 2018
commit 571d5c473b0a9d082426f99a0165f285d071f91e
10 changes: 8 additions & 2 deletions pandas/_libs/lib.pyx
@@ -466,7 +466,10 @@ cpdef ndarray[object] astype_unicode(ndarray arr):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
arr_i = arr[i]
Contributor

Is arr_i declared in the cdef?

Contributor Author

d'oh!

-        util.set_value_at_unsafe(result, i, unicode(arr_i) if arr_i is not np.nan else np.nan)
+        util.set_value_at_unsafe(
+            result,
+            i,
+            unicode(arr_i) if arr_i is not np.nan else np.nan)
Member @gfyoung, Apr 3, 2018

Interesting spacing...maybe we should do this instead:

uni_arr_i = unicode(arr_i) if arr_i is not np.nan else np.nan
util.set_value_at_unsafe(result, i, uni_arr_i)

Contributor

Use np.isnan here.

Contributor Author

Using np.isnan here raises:

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
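
The error above can be reproduced outside Cython. The following is a minimal sketch, not code from this PR: the helper name `isnan_or_error` is hypothetical, and `pd.isna` is used here only as a publicly available null check that also accepts strings.

```python
import numpy as np
import pandas as pd


def isnan_or_error(value):
    """Return np.isnan(value) as a bool, or the name of the exception raised."""
    try:
        return bool(np.isnan(value))
    except TypeError as exc:
        # np.isnan is a ufunc defined for numeric input; a str cannot be coerced
        return type(exc).__name__


float_result = isnan_or_error(np.nan)   # numeric input is supported
string_result = isnan_or_error("abc")   # string input triggers the TypeError

# pd.isna accepts arbitrary Python objects, so it does not raise here
pandas_result = [bool(pd.isna(v)) for v in (np.nan, "abc", None)]
```

Since the arrays handled in these functions can hold strings, a null check that tolerates non-numeric objects is needed.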

Contributor

Yeah, that is not friendly to strings, OK.
Use checknull (it should already be imported from util).

Contributor Author

When I use that, all hell breaks loose. I get errors in tests like this one: https://github.com/pandas-dev/pandas/blob/master/pandas/tests/frame/test_dtypes.py#L533

Is it because they use np.NaN? It looks like checknull checks both np.NaN and np.nan, while before the change I used to check only np.nan.
If that's the case, then I have to modify more tests.

Contributor Author

@gfyoung are you sure this indentation is a big problem? If I do what you suggest, how should I declare uni_arr_i (and str_arr_i) in the cdef?
Would it be OK if I changed it to something like

util.set_value_at_unsafe(
    ...
)

(with the closing bracket moved to the next line)?

Member

That would work as well.

Contributor

The NaNs are the same; in other words, they point to the same object.
Go ahead and change tests if need be and I will have a look.
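
A small sketch of why an identity test and a null check can disagree. It assumes the public `pd.isna` behaves like the internal `checknull` for these scalar inputs, which this thread does not confirm, and it avoids the `np.NaN` alias because that name is not available in NumPy 2.x.

```python
import numpy as np
import pandas as pd

# A NaN created independently is a different Python object from np.nan
other_nan = float("nan")

identity_matches_np_nan = np.nan is np.nan     # same object, so identity holds
identity_matches_other = other_nan is np.nan   # different object, identity fails

# A value-based null check treats both NaNs, and None, as missing
isna_matches_other = bool(pd.isna(other_nan))
isna_matches_none = bool(pd.isna(None))
```

An `is not np.nan` test therefore only catches the one shared `np.nan` object, while a null check also catches other NaN floats and `None`, which may explain why switching the check changes test outcomes.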


return result

@@ -480,7 +483,10 @@ cpdef ndarray[object] astype_str(ndarray arr):
# we can use the unsafe version because we know `result` is mutable
# since it was created from `np.empty`
arr_i = arr[i]
-        util.set_value_at_unsafe(result, i, str(arr_i) if arr_i is not np.nan else np.nan)
+        util.set_value_at_unsafe(
+            result,
+            i,
+            str(arr_i) if arr_i is not np.nan else np.nan)
Contributor

same


return result
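
For readers unfamiliar with the Cython helpers, the intent of the two hunks can be approximated in plain Python. This is an illustrative sketch only: `astype_str_keep_missing` is a hypothetical name, and it uses `pd.isna` in place of the `is not np.nan` test shown in the diff, following the review suggestion to use a null check.

```python
import numpy as np
import pandas as pd


def astype_str_keep_missing(values):
    """Convert each element to str, but leave missing values as np.nan."""
    result = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        # Missing values stay missing instead of becoming the string "nan"
        result[i] = np.nan if pd.isna(value) else str(value)
    return result


converted = astype_str_keep_missing([1, "a", np.nan, 2.5])
```

The sketch expects scalar elements; nested lists or arrays would need separate handling because `pd.isna` returns an array for array-like input.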
