
Make pandas C parser xstrtod match float/np.float64 internal routine #2566

Closed
yarivm opened this issue Dec 19, 2012 · 12 comments

@yarivm

yarivm commented Dec 19, 2012

read_csv() converts the string '0.011277' into an np.float64 whose repr() is:
'0.011276999999999999'.
However, repr(np.float64('0.011277')) returns:
'0.011277000000000001'
Also, in pandas v.0.9.0 read_csv() produced '0.011276999999999999', while in pandas v.0.10.0 read_csv() produces '0.011277000000000001'.

This problem showed up when I truncated the number (with 1e-6 precision).

System settings:
pandas v.0.10.0
NumPy v.1.7.0b2
Windows 7

@wesm
Member

wesm commented Dec 19, 2012

This is a well-known "Python problem":

In [21]: from StringIO import StringIO

In [22]: read_csv(StringIO('0.011277'), header=None)
Out[22]: 
          0
0  0.011277

In [23]: read_csv(StringIO('0.011277'), header=None)[0][0]
Out[23]: 0.011276999999999999

In [24]: str(read_csv(StringIO('0.011277'), header=None)[0][0])
Out[24]: '0.011277'

It varies by platform; here is the same value cast with np.float64:

In [29]: read_csv(StringIO('0.011277'), header=None)[0][0].view(np.uint64)
Out[29]: 4577654369684779146

In [30]: np.float64('0.011277')
Out[30]: 0.011277000000000001

In [31]: np.float64('0.011277')
Out[31]: 0.011277000000000001

In [32]: np.float64('0.011277').view(np.uint64)
Out[32]: 4577654369684779147

Looks like the two conversion algorithms differ by one bit in the mantissa.
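To see what "one bit in the mantissa" means concretely, here is a small NumPy sketch (my illustration, not from the thread): take NumPy's parse of the string and its immediate floating-point neighbor toward zero, and compare their bit patterns.

```python
import numpy as np

# NumPy's parse of the string, and the adjacent double one ULP below it
a = np.float64('0.011277')
b = np.nextafter(a, 0.0)

# viewing the doubles as uint64 exposes the raw IEEE-754 bit patterns;
# adjacent doubles differ by exactly 1 in this representation
print(int(a.view(np.uint64)) - int(b.view(np.uint64)))  # 1
print(a == b)   # False, even though the gap is only ~1.7e-18
```

This is exactly the relationship between the two `.view(np.uint64)` values in the session above (…146 vs …147): the two parsers land on adjacent doubles.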

@yarivm
Author

yarivm commented Dec 19, 2012

That was a quick response :)

  1. I just added to the issue's text above that the behavior of Pandas changed from v.0.9.0 to v.0.10.0.
  2. I don't understand how it's a Python problem. Isn't it the code in read_csv() which produces this behavior? Don't we want read_csv() to conform to Numpy's string parsing?

@wesm
Member

wesm commented Dec 19, 2012

The parsed numbers are < 1e-17 apart, which is within the acceptable margin of error (typically 1e-14 or 1e-15) for double-precision floating-point numbers. I'm happy to make the results consistent, but it will require some digging to modify the new parser's C string-to-double conversion to exactly match Python's internal version.
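The claim about the margin of error is easy to check directly (a small sketch of mine, using the two reprs from the original report):

```python
import numpy as np

# the two parse results reported above; they are adjacent doubles
x = np.float64('0.011277000000000001')
y = np.float64('0.011276999999999999')

print(abs(x - y))  # ~1.7e-18, far below the 1e-14/1e-15 tolerances mentioned
print(np.isclose(x, y, rtol=1e-14, atol=0.0))  # True
```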

@yarivm
Author

yarivm commented Dec 19, 2012

How can I write code which gives back '0.011277' from both floats? Should I use np.round(x,14)?

@wesm
Member

wesm commented Dec 19, 2012

Use str instead of repr.
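A note on that suggestion (mine, not from the thread): it relies on Python 2's str(), which rounded floats to 12 significant digits. On Python 3, str() and repr() produce the same shortest round-tripping string, so explicit formatting is the portable equivalent:

```python
x = 0.011276999999999999  # one parser's result
y = 0.011277000000000001  # the other's

# fixed-precision formatting hides the last-bit noise on any Python version
print('%.6f' % x, '%.6f' % y)   # 0.011277 0.011277

# '.12g' reproduces the old Python 2 str() behavior (12 significant digits)
print(format(x, '.12g'))        # 0.011277
```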

@yarivm
Author

yarivm commented Dec 19, 2012

Thanks.

@yarivm
Author

yarivm commented Dec 23, 2012

The biggest issue here for me is that df.astype('i') doesn't work as before/as expected, because truncation now sometimes returns a value that is smaller by 1. My ad-hoc solution is to round before truncating: np.rint(df.acolumn).astype('i').
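The off-by-one truncation can be reproduced with plain NumPy (my sketch of the scenario described here, using the parse result reported earlier in the thread):

```python
import numpy as np

# the value the parser produced for '0.011277': one ULP below
# what np.float64('0.011277') would give
a = np.array([0.011276999999999999])

scaled = a * 1e6                     # lands just under 11277.0
print(scaled.astype('i'))            # truncation: [11276]
print(np.rint(scaled).astype('i'))   # round-then-cast: [11277]
```

Rounding to the nearest integer before the cast, as in the workaround above, absorbs the one-ULP parsing difference.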

@wesm
Member

wesm commented Dec 24, 2012

Are you multiplying by a number and then converting to integer? How does df.astype('i') lead to a wrong answer (maybe you have data where the value becomes slightly less than a whole number)? Generally speaking, relying on a float->int cast without explicit handling of floating-point error is not advisable.

@yarivm
Author

yarivm commented Dec 25, 2012

  1. Yes, I multiply by 1e6 before converting, in an attempt to retrieve the digits.
  2. You are right; float->int is problematic in general.

@jreback
Contributor

jreback commented Sep 21, 2013

Closing as not a bug.

@jreback jreback closed this as completed Sep 21, 2013
@mdmueller
Contributor

I might be missing something here (no expert on floating-point storage), but isn't this still an issue? For highly precise values, read_csv() doesn't seem to produce the closest approximation:

In [3]: df = pd.read_csv(StringIO('1.2345678901234567890'), header=None)

In [4]: df[0][0]
Out[4]: 1.2345678901234569

In [5]: df[0][0].hex()
Out[5]: '0x1.3c0ca428c59fcp+0'

In [6]: val = np.float64('1.2345678901234567890')

In [7]: val
Out[7]: 1.2345678901234567

In [8]: val.hex()
Out[8]: '0x1.3c0ca428c59fbp+0'

I guess I'm just wondering what the rationale is behind using xstrtod() rather than strtod() (which agrees exactly with numpy's conversion). I've noticed that xstrtod() is quite a bit faster than strtod() for high-precision values, but this seems to be because xstrtod() bypasses the correction loop in strtod() and therefore isn't guaranteed to be within 0.5 ULP.
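To make the speed/accuracy trade-off concrete, here is a rough Python sketch (my illustration, not pandas' actual C code) of an xstrtod-style conversion: accumulate the digits into a double, then scale by a power of ten. Each floating-point operation can round, and there is no correction loop, so the result is not guaranteed to be the nearest double, whereas Python's float() uses a correctly rounded conversion.

```python
def naive_strtod(s):
    """xstrtod-style parse: digit accumulation plus one division.

    Every multiply/add below can introduce rounding error, so the
    result may end up a few ULP away from the correctly rounded value.
    """
    int_part, _, frac_part = s.partition('.')
    value = 0.0
    for ch in int_part + frac_part:
        value = value * 10.0 + (ord(ch) - ord('0'))
    return value / 10.0 ** len(frac_part)

s = '1.2345678901234567890'
print(naive_strtod(s).hex())  # may differ in the last bits...
print(float(s).hex())         # ...from the correctly rounded result
```

For short, exactly representable inputs the two agree; the divergence only shows up when the decimal string carries more precision than a double, which is exactly the case in the session above.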

@jreback
Contributor

jreback commented Jul 25, 2014

@amras1 I think that is exactly the reason: speed of parsing for floats. The difference is immaterial, as it's below the precision of floats anyhow.
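For readers arriving at this thread later: pandas eventually exposed this trade-off through read_csv's float_precision parameter, where 'round_trip' selects Python's correctly rounded string-to-double conversion (worth verifying against the current docs for your version):

```python
import pandas as pd
from io import StringIO

s = '1.2345678901234567890'

# default fast parser (xstrtod-style) vs. correctly rounded conversion
fast = pd.read_csv(StringIO(s), header=None)[0][0]
exact = pd.read_csv(StringIO(s), header=None,
                    float_precision='round_trip')[0][0]

print(fast.hex(), exact.hex())  # may differ in the last hex digit
print(exact == float(s))        # True: round_trip matches float()
```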
