-
-
Notifications
You must be signed in to change notification settings - Fork 5.6k
fix spurious overflow for Float16(::Rational) #52395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3c40acc
to
4270c36
Compare
4270c36
to
c34a75d
Compare
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Constructing a floating-point number from a `Rational` should now be correctly rounded. Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Constructing a floating-point number from a `Rational` should now be correctly rounded. Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Float16(x::Rational{<:Union{Int128,UInt128}}) = | ||
Float16(Float64(x)) # UInt128 overflows Float32, include Int128 for consistency | ||
Float32(x::Rational{<:Union{Int128,UInt128}}) = | ||
Float32(Float64(x)) # UInt128 overflows Float32, include Int128 for consistency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't these suffer from double-rounding accuracy loss? @oscardssmith
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they would. Doing this properly is fairly hard though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also worth noting that Float16(num)/Float16(den)
also potentially has triple rounding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question is: is this problem worse than the problem this PR fixes? I.e. should we merge and improve this later, or should this wait until this PR does it "right" (whatever that means?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should merge and improve later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my thought as well, but want to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would doing it "properly" be something like:
a = Float64(x)
b = Float32(a)
# check if we double-rounded in the wrong direction
if x > a && b < a && nextfloat(b) < a
b = nextfloat(b)
elseif x < a && b > a && prevfloat(b) > a
b = prevfloat(b)
end
return b
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A example of a test case where double rounding gives a different result is:
julia> r = 12928845309018111//18014398509481984
12928845309018111//18014398509481984
julia> Float32(r) == Float32(Float64(r))
false
@vtjnash's code doesn't fix this, however — its output matches Float32(Float64(r))
on this example.
(Example constructed by tweaking a number almost exactly halfway between two Float32
values.)
Was there a point to merging this? This PR was just a quick fix compared to the thorough fix of #49749. |
Primarily that it was a quick fix, compared to the thorough fix of #49749 :). To be slightly more serious, I think we should merge the more thorough fix, but this was an improvement that was ready to merge first. |
Constructing a floating-point number from a `Rational` should now be correctly rounded. Implementation approach: 1. Convert the (numerator, denominator) pair to a (sign bit, integral significand, exponent) triplet using integer arithmetic. The integer type in question must be wide enough. 2. Convert the above triplet into an instance of the chosen FP type. There is special support for IEEE 754 floating-point and for `BigFloat`, otherwise a fallback using `ldexp` is used. As a bonus, constructing a `BigFloat` from a `Rational` should now be thread-safe when the rounding mode and precision are provided to the constructor, because there is no access to the global precision or rounding mode settings. Updates JuliaLang#45213 Updates JuliaLang#50940 Updates JuliaLang#52507 Fixes JuliaLang#52394 Closes JuliaLang#52395 Fixes JuliaLang#52859
Fixes #52394.
Update: also fixes
Float32
forUInt128
, since currentlyFloat32((typemax(UInt128)-0x01) // typemax(UInt128))
givesNan32
.