Skip to content

Fix fptrunc Float64 -> Float16 rounding through Float32 #57809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Mar 18, 2025

Conversation

xal-0
Copy link
Member

@xal-0 xal-0 commented Mar 17, 2025

Widening from Float32 to Float64 and then rounding to Float16 will not introduce any error, but going from Float64 -> Float32 -> Float16 will round incorrectly if the intermediate Float32 is halfway between two Float16s.

Fixes #57805.

Thanks to @vtjnash for suggesting the fix.

xal-0 and others added 2 commits March 17, 2025 16:24
Widening from Float32 to Float64 and then rounding (by any method) to Float16
will not introduce any error, but going from Float64 -> Float32 -> Float16 will
round incorrectly if the intermediate Float32 is halfway between two Float16s.

Fixes JuliaLang#57805.

Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
@giordano giordano added bugfix This change fixes an existing bug float16 backport 1.10 Change should be backported to the 1.10 release backport 1.11 Change should be backported to release-1.11 backport 1.12 Change should be backported to release-1.12 labels Mar 18, 2025
@oscardssmith oscardssmith merged commit a676b12 into JuliaLang:master Mar 18, 2025
9 of 12 checks passed
#define fintrinsic_write_float16(p, x) *(uint16_t *)p = float_to_half(x)
#define fintrinsic_write_bfloat16(p, x) *(uint16_t *)p = float_to_bfloat(x)
#define fintrinsic_write_float16(p, x) *(uint16_t *)p = double_to_half(x)
#define fintrinsic_write_bfloat16(p, x) *(uint16_t *)p = double_to_bfloat(x)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the same logic applies to bfloat16. Since bfloat16 is just a truncated float32, shouldn't it always be legal to go through Float32? If not, it'd be good to have a test.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the double_to_bfloat code, the same correction to the rounding direction is required whenever the float32 (after truncation from float64) is exactly halfway between 2 bfloat16 values, but wasn't required for the subnormal case (as was demonstrated in the original issue for float16)

KristofferC pushed a commit that referenced this pull request Mar 20, 2025
Widening from Float32 to Float64 and then rounding to Float16 will not
introduce any error, but going from Float64 -> Float32 -> Float16 will
round incorrectly if the intermediate Float32 is halfway between two
Float16s.

Fixes #57805.

Thanks to @vtjnash for suggesting the fix.

Co-authored-by: Jameson Nash <jameson@juliacomputing.com>
(cherry picked from commit a676b12)
@KristofferC KristofferC removed the backport 1.12 Change should be backported to release-1.12 label Mar 24, 2025
@KristofferC KristofferC mentioned this pull request Mar 31, 2025
71 tasks
@KristofferC KristofferC mentioned this pull request Apr 25, 2025
71 tasks
@KristofferC KristofferC mentioned this pull request Jun 4, 2025
75 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.10 Change should be backported to the 1.10 release backport 1.11 Change should be backported to release-1.11 bugfix This change fixes an existing bug float16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Float16 runtime and compiler intrinsics models do not match
6 participants