Remove implicit determinization from WildcardQuery#15961
Remove implicit determinization from WildcardQuery#15961drempapis wants to merge 8 commits intoapache:mainfrom
Conversation
|
@rmuir can you please have a look? |
| Query query, String field, Supplier<ByteRunAutomaton> automaton) { | ||
| runAutomata.add(LabelledCharArrayMatcher.wrap(query.toString(), automaton.get())); | ||
| } | ||
|
|
There was a problem hiding this comment.
I think we can just replace the Supplier<ByteRunAutomaton> with a Supplier<ByteRunnable> in the signature of consumeTermsMatching? It's a minimal change, replacing an implementation with an interface, and keeps the API surface on QueryVisitor low.
There's a further question around whether this should be taking something as complicated as a ByteRunnable in the first place, and should instead take a Predicate<BytesRef> but that's a bigger change and one that we can keep separate.
There was a problem hiding this comment.
That's a good point. I updated the consumeTermsMatching signature applying Supplier<ByteRunnable>.
Regarding the second part, I agree that using a Predicate<BytesRef> should be handled in a separate PR.
| int result = 1; | ||
| result = prime * result + ((runAutomaton == null) ? 0 : runAutomaton.hashCode()); | ||
| result = prime * result + ((nfaRunAutomaton == null) ? 0 : nfaRunAutomaton.hashCode()); | ||
| result = prime * result + normalAutomatonHashCode(); |
There was a problem hiding this comment.
Without that change I am getting errors of the like
java.lang.AssertionError: expected: org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(*.txt)> but was: org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(*.txt)>
at __randomizedtesting.SeedInfo.seed([CA4FA9AC9103B4C8:8C2346BE162DAD34]:0)
at junit@4.13.2/org.junit.Assert.fail(Assert.java:89)
at junit@4.13.2/org.junit.Assert.failNotEquals(Assert.java:835)
at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:120)
at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:146)
at org.apache.lucene.queries.intervals.TestIntervalQuery.testEquality(TestIntervalQuery.java:485)
in tests
- org.apache.lucene.queries.intervals.TestIntervalQuery.testEquality (:lucene:queries)
- org.apache.lucene.queries.intervals.TestIntervals.testEquality (:lucene:queries)
- org.apache.lucene.search.TestWildcardQuery.testEquals (:lucene:core)
CompiledAutomaton.equals() was updated so two NFA-backed NORMAL automata are considered equal when their automaton graphs are the same, even if they are different Java objects. That means hashCode() also had to change.
Previously, NFA hash code came from nfaRunAutomaton.hashCode(). So two objects that are now equals() could still produce different hash codes, which breaks the Java equals/hashCode contract.
The current update keeps existing DFA behavior (runAutomaton.hashCode()), and for NFA computes a structural hash from the underlying automaton graph using AutomatonStructuralComparator.structuralAutomatonHashCode()
There was a problem hiding this comment.
Yes but I am trying to avoid the insertion of a whole new automaton class/java file. I can see where the old logic may have bugs, but maybe we can fix them and avoid that?
There was a problem hiding this comment.
I've simplified the code by removing the new standalone comparator class and keeping the fix inside existing automaton classes. This update
- Keeps
CompiledAutomatonequality/hash behavior aligned with the original shape. - Adds structural equals/hashCode to
Automaton. - Implements
NFARunAutomaton.equals/hashCodeto delegate to its wrapped Automaton.
Is there other way to fix equality-contract issue properly?
There was a problem hiding this comment.
Its good. I like how the new hashcode/equals are defined... as "same automaton".
There were historical problems here around hashcode/equals being defined as "accepting same language"... which led to trouble.
|
I'm a fan of this change: I feel like the hashcode/equals is the way it should have always worked! I'll give it a few days for more feedback. Thank you for doing this work @drempapis |
This change continues the determinization cleanup started for regexp queries (#15939) by applying the same model to wildcard queries.
Previously, wildcard automata were implicitly determinized up front. After removing that implicit determinization, some code paths (especially query visiting/highlighting and equality checks) still assumed a DFA-only representation.
QueryVisitor.consumeTermsMatching()accepted onlyByteRunAutomaton(DFA). After removing implicit wildcard determinization, many valid query automata are NFA-backed. Forcing them back intoByteRunAutomatonat visit time would reintroduce determinization cost and defeat the goal of this work.This PR adds
QueryVisitor.consumeTermsMatchingRunnable(), allowing visitor consumers (e.g. highlighter) to work with either DFA or NFA-backed execution viaByteRunnable, while keeping the existingconsumeTermsMatching()API for compatibility.