-
Notifications
You must be signed in to change notification settings - Fork 1.6k
restore topk pre-filtering of batches and make sort query fuzzer less sensitive to expected non determinism #16501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…)" This reverts commit 5ca4ff0.
We have the test in https://github.com/apache/datafusion/pull/16465/files#diff-f38cac7a9ac55c93d71632c96d6d2afa219cfb07351125a349099c86df859446 which seems to be passing. I'm running a local 1200 run iteration to confirm. |
Sadly the 1200 run still reports failures 😭 I feel like @AdamGS 's original intuition that it's something about sort stabilit with nulls is correct. I'll see if I can find a fix... |
Thanks @adriangb ! |
So from my investigation what I think is happening is that #15770 fundamentally converted the TopK operation from being isolated per partition to having shared state via the dynamic filter. This causes some non-determinism with test runs since partitions can interact. I think this doesn't cause actual issues with queries, but the tests are picking it up. But I'm not 100% sure about that. @Dandandan and I were already talking about having a shared TopK heap between partitions, I think that would resolve the issue. But otherwise more investigation is needed. FWIW the TopK dynamic filters still work without this code - it's just using the filter to filter rows in the TopK operator itself that doesn't work. This is all I had time for today. I think more work is needed before we can merge this PR in the current state. |
This sounds like we need to update the tests to be deterministic then somehow (or ignore results that are not deterministic |
Would love to give a hand with that, I have some thoughts I can try and put into a preliminary PR. |
Thank you @AdamGS! It would be super helpful if we could first determine if the test is being overly sensitive to non-determinism (query results are exactly the same across runs and correct but test still fails) or if the issue is actually reflecting incorrect query results or non-deterministic query results (e.g. the query is correct according to the sort order but the actual order of rows is different across runs). |
Wondering if you've had a chance to look into this? I am trying to decide how much time I should allocate to this next week and make sure I don't overlap your work. Thanks! |
Had a pretty busy week and this is mostly extra-curricular work for me so I didn't get anything done, planning to spend some time over the weekend digging into it. If I get anywhere worthwhile I'll send it your way sunday evening/monday morning? |
Amazing great! |
I've pushed a series of commits that simplify the test down to a trivial example:
SELECT * FROM t ORDER BY u32 LIMIT 1 Now it's clear what is happening: previously each partition had So nothing to do with nulls, nulls just happened to make it much more likely that the sort values would be the same when using random data. With the addition of pre-filtering rows in the TopK operator, since the filter is updated across partitions, as soon as 1 partition sees the value for Options I can see:
|
Amazing, you found it!
I think this is the way to go because the engine is correct in this case (sort doesn't care about partition/scan order, should just output top n rows in database). |
Sounds good, but how do we go about that? It's a pretty big change from "check results match row by row" to "check results are semantically correct". |
I think a relative simple solution might be only checking columns from the generated sort expressions to be have equal values (i.e. unstable sorting). |
I agree this is the right way to go |
Since there seems to be agreement on the path forward I pushed 514ab74 which I think achieves the goal by simply changing |
@Dandandan @alamb @AdamGS can you review and verify you agree with the proposed change to the tests? Would love to merge this and close this chapter in preparation for the next release 😄 |
Since it seemed we came to a clear repro and conclusion and the best way to know if the fuzzer is still flakey or not is to run it a lot I went ahead and merged this so it gets shaken out a lot in CI before the next release. |
Amazing -- thank you @adriangb and @Dandandan |
Which issue does this PR close?
Rationale for this change
sort_query_fuzzer_runner
#16491What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?