Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix horrifying bug in lossless_cast of a subtract #8155

Merged
merged 40 commits into from
Jun 26, 2024

Conversation

abadams
Copy link
Member

@abadams abadams commented Mar 14, 2024

This is one thing breaking the SVE2 pr.

Copy link
Member

@vksnk vksnk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof.

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

owww

@zvookin
Copy link
Member

zvookin commented Mar 14, 2024

Please sir, may I have a test?

@mcourteaux
Copy link
Contributor

Hahahaha, I'm giggling for minutes now... 😆

@steven-johnson
Copy link
Contributor

Is this a recent injection? If not, might want to backport to 17...

@steven-johnson steven-johnson added the backport me This change should be backported to release versions label Mar 15, 2024
@mcourteaux
Copy link
Contributor

Is this a recent injection? If not, might want to backport to 17...

No, 4 years ago:
394536f

@abadams
Copy link
Member Author

abadams commented Mar 15, 2024

I added a fuzz test, and there are more bugs :(

I'm pretty ready to state that all new features or algorithms must come with a fuzz tester.

@zvookin
Copy link
Member

zvookin commented Mar 15, 2024

This is why people measure test coverage. It's kinda annoying, but probably a thing to consider.

TODO:

- Dedup the constant integer code with the same code in the simplifier.
- Move constant interval arithmetic operations out of the class.
- Make the ConstantInterval part of the return type of lossless_cast
(and turn it into an inner helper) so that it isn't constantly
recomputed.
@steven-johnson
Copy link
Contributor

where do we stand on this -- do we need to do more work (tests, etc) before landing?

@abadams
Copy link
Member Author

abadams commented Mar 18, 2024

Still working on it. I couldn't figure out how to fix the remaining bugs using top-down reasoning about the types like what was there already, so I rewrote it to use bottom-up constant-integer bounds analysis, using code lifted out of the simplifier. It's working but there are inefficiencies to fix and some code deduplication to do.

@mcourteaux
Copy link
Contributor

This is too much of a change to backport now

I don't know how important it is, but changing + to - as a backported hotfix seems easy.

@abadams
Copy link
Member Author

abadams commented Jun 5, 2024

Sure we could backport that. It was just the tip of the iceberg when it came to bugs in lossless_cast though, and fixing the other ones is what required all the work.

@abadams
Copy link
Member Author

abadams commented Jun 5, 2024

#8264

@abadams
Copy link
Member Author

abadams commented Jun 5, 2024

Tests are breaking because vulkan is busted on main, and because the fuzzer and asan builds are still busted on linux-worker-1. This is ready for review.

@steven-johnson
Copy link
Contributor

Tests are breaking because vulkan is busted on main, and because the fuzzer and asan builds are still busted on linux-worker-1. This is ready for review.

@derek-gerstmann is working on the Vulkan bug, but one of us should figure out the fuzzer/asan issue

@@ -235,6 +231,25 @@ Expr random_expr(std::mt19937 &rng) {
}
}

bool definitely_has_ub(Expr e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what, no comment about nasal demons?

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but @vksnk should probably review too and he's out most of this week.

I'll do a full test of this inside Google later today.

if (target.has_feature(Target::AVX512) && op->type.lanes() >= 32) {
ConstantInterval ca = constant_integer_bounds(a);
ConstantInterval cb = constant_integer_bounds(b);
if (!ca.contains(-32768) || !cb.contains(-32768)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-0x8000 is probably clearer?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyone reading this code who doesn't know the bit layout of -32768 should not be reading this code :-)

@abadams
Copy link
Member Author

abadams commented Jun 11, 2024

Is this now good to merge? I think it was contingent on the results of tests inside Google.

@steven-johnson
Copy link
Contributor

Not yet -- need to retest in Google with the Xtensa codegen fix in place, but I'm OOO for at least another day or two (out of town due to family medical emergency. Can it wait till then?

@abadams
Copy link
Member Author

abadams commented Jun 11, 2024

Yes, there's no rush, I just wanted to make sure it's not stalled on me.

@vksnk
Copy link
Member

vksnk commented Jun 11, 2024

Xtensa codegen is fixed now, so I re-run tests in Google codebase: there are a couple of tests which now produce mismatch with thegoldens (in one of the tests, the max diff is 6 out of 255, which seems fairly significant), but I guess this is somewhat expected?

@abadams
Copy link
Member Author

abadams commented Jun 11, 2024

Are the tests that differ using floats in some way that would be unstable? (e.g. using a floating-point comparison to decide on an interpolation direction in a demosaicking algorithm)

If they're fixed-point algorithms that's definitely not expected.

@vksnk
Copy link
Member

vksnk commented Jun 11, 2024

It's hard to tell for sure, because it's one of those large end-to-end tests which can call 10 different Halide functions deep inside, but I'd guess it does use some floating-point math (I'm not 100%, but I think one of these tests was failing before due to some FP differences before, so we needed to regenerate the goldens; although, the diff was smaller I think).

@abadams
Copy link
Member Author

abadams commented Jun 11, 2024

Do you think it's worth testing if it passes on main under strict_float, and fails on this branch under strict_float? If that's the case there's probably something to worry about.

@vksnk
Copy link
Member

vksnk commented Jun 11, 2024

Sure, I'll see if it can be narrowed down.

@abadams abadams added this to the v18.0.0 milestone Jun 23, 2024
@steven-johnson
Copy link
Contributor

Update on Google: it looks like there are no real red flags here -- just goldens to be regenerated -- so I'm hoping to be able to approve this for landing later today pending confirmation.

@steven-johnson
Copy link
Contributor

Approved and landed in Google, so I'm landing this here now.

@steven-johnson steven-johnson merged commit cab27d8 into main Jun 26, 2024
19 checks passed
@steven-johnson steven-johnson deleted the abadams/fix_lossless_cast_of_sub branch June 26, 2024 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants