Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: float conversion in read_csv is inaccurate for precise input #8002

Closed
mdmueller opened this issue Aug 12, 2014 · 7 comments · Fixed by #8044
Closed

BUG: float conversion in read_csv is inaccurate for precise input #8002

mdmueller opened this issue Aug 12, 2014 · 7 comments · Fixed by #8044
Labels
IO CSV read_csv, to_csv Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@mdmueller
Copy link
Contributor

This is coming from #2566, which noted that the xstrtod() method used by read_csv doesn't agree with standard numpy float conversion. I see that the priority is speed of parsing over complete accuracy (within 0.5 units in the last place, or ULP), but there are two issues here that (at least to me) actually appear to be buggy:

  1. Low-precision values (i.e. less than about 15 significant figures) should be guaranteed to be within 0.5 ULP of the correct result, or at least within 1 ULP. However, I have found cases in which xstrtod() is off by more than 1 ULP, although these don't come up often.
  2. High-precision values of course can't be guaranteed to be within 0.5 ULP without a costly correction loop as in the ordinary strtod(), but the error in conversion increases linearly as the number of supplied significant figures increases. With 30 significant figures, the error in conversion can potentially be over 7 ULP.

Here is an IPython notebook analyzing the accuracy of xstrtod(). I think there are two problems here: xstrtod() keeps reading digits after the 17th, none of which should matter for conversion, and the scaling step at the end produces a compounded error by repeatedly multiplying/dividing by powers of 10. I have a solution for AstroPy that seems to fix these issues, so I can open a PR if it's agreed that xstrtod() should be changed.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2014

how about an option for this (default is turned off), may float_precision=None|high ?

@mdmueller
Copy link
Contributor Author

That could work--our decision in AstroPy was also to use a parameter named use_fast_converter, although we have it defaulting to False. Then should the updated xstrtod() be used if float_precision='high', or just strtod() (which is guaranteed to be within 0.5 ULP)?

@jreback
Copy link
Contributor

jreback commented Aug 12, 2014

I would like to leave the existing code in-place. No objection to adding new options like this though. I don't think the intermediate (using strtod()) is that useful, while a higher-precision prob is. The tradeoff of course is speed/accuracy.

@mdickinson
Copy link

Just to add another voice: I also ran into this issue recently. We were serialising data to disk via DataFrame.to_csv and then re-reading with read_csv, and were surprised to discover that the re-read data didn't exactly match the original. While the errors are small, and not an issue for calculation, it's important to us that the serialisation round-trips correctly. So a float_precision='high' option would be useful.

@jreback jreback added this to the 0.15.1 milestone Aug 12, 2014
@jreback
Copy link
Contributor

jreback commented Aug 12, 2014

ok @amras1 we'll put this on the table for a new option.

@mdmueller
Copy link
Contributor Author

Sounds good--I have the altered xstrtod() ready, so let me know if I should open a PR. However, the new xstrtod() isn't guaranteed to fix the round-trip issue @mdickinson noticed; strtod() is probably the best solution for that case.

@jreback
Copy link
Contributor

jreback commented Aug 12, 2014

I suppose you could have 3 levels for float_precision=None|high|round_trip or something where you either: use current xstrtod, new one, strtod

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants