Closed
Description
I have a stack trace that looks like:
Caused by: java.lang.IllegalArgumentException: Fractional absolute document frequencies are not allowed
at org.apache.lucene.search.spell.DirectSpellChecker.setThresholdFrequency(DirectSpellChecker.java:182)
at org.elasticsearch.search.suggest.phrase.DirectCandidateGenerator.drawCandidates(DirectCandidateGenerator.java:131)
at org.elasticsearch.search.suggest.phrase.MultiCandidateGeneratorWrapper.drawCandidates(MultiCandidateGeneratorWrapper.java:52)
I do not have and cannot get the index that causes this failure. But it looks to me like the failure is caused by this series of events:
1. `DirectCandidateGenerator#thresholdFrequency` spits out a frequency that is bigger than `Integer.MAX_VALUE`. This looks to be possible using the default configuration for common words like "the" when the corpus is a couple of million documents and each document is large, like, say, as big as a Wikipedia page.
2. We call `DirectSpellChecker#setThresholdFrequency` with that number. The JVM helpfully casts the `long` returned by step 1 into a `float`, losing precision but keeping the magnitude of the number largely intact.
3. Lucene attempts to validate that the `float` is either less than 1 or a whole number. The "is it a whole number" check looks like `thresholdFrequency != (int) thresholdFrequency`. That will consider floats that don't fit into `int`s as not whole numbers. Most of the time, anyway.
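The cast behavior in these steps can be reproduced in isolation. This is a minimal sketch, not the Lucene source: `failsWholeNumberCheck` is a hypothetical helper that contains only the comparison quoted above, and `3_000_000_000L` is a made-up frequency chosen to exceed `Integer.MAX_VALUE`.

```java
public class ThresholdCheckDemo {
    // Hypothetical helper: only the whole-number comparison quoted in the issue,
    // not the full Lucene validation.
    static boolean failsWholeNumberCheck(float thresholdFrequency) {
        return thresholdFrequency != (int) thresholdFrequency;
    }

    public static void main(String[] args) {
        long hypotheticalFrequency = 3_000_000_000L; // made-up value above Integer.MAX_VALUE
        float widened = hypotheticalFrequency;       // implicit long -> float widening

        // Narrowing a too-large float to int saturates at Integer.MAX_VALUE.
        System.out.println((int) widened);                                  // 2147483647

        // The saturated int no longer equals the float, so the value looks "fractional".
        System.out.println(failsWholeNumberCheck(widened));                 // true

        // The "most of the time" exception: 2^31 as a float compares equal to
        // Integer.MAX_VALUE once the int is promoted back to float.
        System.out.println(failsWholeNumberCheck((float) (1L << 31)));      // false
    }
}
```

The last line is why the check only rejects out-of-range floats "most of the time": `Integer.MAX_VALUE` rounds up to 2^31 when converted back to `float`.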
There is a workaround: set `"suggest_mode": "always"`. We'll skip the math and just pick 0 for the frequency, which is both less than one and a whole number, so Lucene is quite happy with it.
It looks like we should either clamp the value to `Integer.MAX_VALUE` in Elasticsearch or Lucene should use something else to check for fractional numbers.
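Both options could look roughly like the following. This is only a sketch under assumptions: `clampToInt` and `isWholeNumber` are hypothetical names, and neither is taken from the Elasticsearch or Lucene code base.

```java
public class ThresholdFixSketch {
    // Option 1 (caller side): clamp the long frequency before it is widened to float.
    // Hypothetical helper, not an existing Elasticsearch method.
    static float clampToInt(long frequency) {
        return (float) Math.min(frequency, (long) Integer.MAX_VALUE);
    }

    // Option 2 (validation side): a whole-number test that does not depend on
    // the value fitting into an int. Hypothetical helper, not an existing Lucene method.
    static boolean isWholeNumber(float value) {
        return Float.isFinite(value) && value == (float) Math.floor(value);
    }

    public static void main(String[] args) {
        float clamped = clampToInt(3_000_000_000L);
        // The clamped value survives the int round trip used by the existing check.
        System.out.println(clamped == (int) clamped);      // true
        // The alternative check accepts large whole floats and rejects fractions.
        System.out.println(isWholeNumber(3.0e9f));         // true
        System.out.println(isWholeNumber(1.5f));           // false
    }
}
```

Clamping keeps the fix local to the caller, while changing the check would also cover any other caller that passes a large whole-number `float`.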