Skip to content

Remove implicit determinization from WildcardQuery#15961

Open
drempapis wants to merge 8 commits intoapache:mainfrom
drempapis:wildcard-query-remove-det
Open

Remove implicit determinization from WildcardQuery#15961
drempapis wants to merge 8 commits intoapache:mainfrom
drempapis:wildcard-query-remove-det

Conversation

@drempapis
Copy link
Copy Markdown
Contributor

This change continues the determinization cleanup started for regexp queries (#15939) by applying the same model to wildcard queries.

Previously, wildcard automata were implicitly determinized up front. After removing that implicit determinization, some code paths (especially query visiting/highlighting and equality checks) still assumed a DFA-only representation.

QueryVisitor.consumeTermsMatching() accepted only ByteRunAutomaton (DFA). After removing implicit wildcard determinization, many valid query automata are NFA-backed. Forcing them back into ByteRunAutomaton at visit time would reintroduce determinization cost and defeat the goal of this work.

This PR adds QueryVisitor.consumeTermsMatchingRunnable(), allowing visitor consumers (e.g. highlighter) to work with either DFA or NFA-backed execution via ByteRunnable, while keeping the existing consumeTermsMatching() API for compatibility.

@github-actions github-actions bot added this to the 11.0.0 milestone Apr 15, 2026
@drempapis
Copy link
Copy Markdown
Contributor Author

@rmuir can you please have a look?

Query query, String field, Supplier<ByteRunAutomaton> automaton) {
runAutomata.add(LabelledCharArrayMatcher.wrap(query.toString(), automaton.get()));
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just replace the Supplier<ByteRunAutomaton> with a Supplier<ByteRunnable> in the signature of consumeTermsMatching? It's a minimal change, replacing an implementation with an interface, and keeps the API surface on QueryVisitor low.

There's a further question around whether this should be taking something as complicated as a ByteRunnable in the first place, and should instead take a Predicate<BytesRef> but that's a bigger change and one that we can keep separate.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I updated the consumeTermsMatching signature applying Supplier<ByteRunnable>.

Regarding the second part, I agree that using a Predicate<BytesRef> should be handled in a separate PR.

int result = 1;
result = prime * result + ((runAutomaton == null) ? 0 : runAutomaton.hashCode());
result = prime * result + ((nfaRunAutomaton == null) ? 0 : nfaRunAutomaton.hashCode());
result = prime * result + normalAutomatonHashCode();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this code change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without that change I am getting errors of the like

java.lang.AssertionError: expected: org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(*.txt)> but was: org.apache.lucene.queries.intervals.IntervalQuery<f:MultiTerm(*.txt)>
	at __randomizedtesting.SeedInfo.seed([CA4FA9AC9103B4C8:8C2346BE162DAD34]:0)
	at junit@4.13.2/org.junit.Assert.fail(Assert.java:89)
	at junit@4.13.2/org.junit.Assert.failNotEquals(Assert.java:835)
	at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:120)
	at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:146)
	at org.apache.lucene.queries.intervals.TestIntervalQuery.testEquality(TestIntervalQuery.java:485)

in tests

  - org.apache.lucene.queries.intervals.TestIntervalQuery.testEquality (:lucene:queries)
  - org.apache.lucene.queries.intervals.TestIntervals.testEquality (:lucene:queries)
  - org.apache.lucene.search.TestWildcardQuery.testEquals (:lucene:core)

CompiledAutomaton.equals() was updated so two NFA-backed NORMAL automata are considered equal when their automaton graphs are the same, even if they are different Java objects. That means hashCode() also had to change.

Previously, NFA hash code came from nfaRunAutomaton.hashCode(). So two objects that are now equals() could still produce different hash codes, which breaks the Java equals/hashCode contract.

The current update keeps existing DFA behavior (runAutomaton.hashCode()), and for NFA computes a structural hash from the underlying automaton graph using AutomatonStructuralComparator.structuralAutomatonHashCode()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but I am trying to avoid the insertion of a whole new automaton class/java file. I can see where the old logic may have bugs, but maybe we can fix them and avoid that?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've simplified the code by removing the new standalone comparator class and keeping the fix inside existing automaton classes. This update

  • Keeps CompiledAutomaton equality/hash behavior aligned with the original shape.
  • Adds structural equals/hashCode to Automaton.
  • Implements NFARunAutomaton.equals/hashCode to delegate to its wrapped Automaton.

Is there other way to fix equality-contract issue properly?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its good. I like how the new hashcode/equals are defined... as "same automaton".

There were historical problems here around hashcode/equals being defined as "accepting same language"... which led to trouble.

@rmuir rmuir requested review from dweiss and mikemccand April 16, 2026 13:46
@rmuir
Copy link
Copy Markdown
Member

rmuir commented Apr 16, 2026

I'm a fan of this change: I feel like the hashcode/equals is the way it should have always worked!

I'll give it a few days for more feedback. Thank you for doing this work @drempapis

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants