Commit

Use Randomness#get()
Signed-off-by: Rishabh Maurya <rishabhmaurya05@gmail.com>
rishabhmaurya committed May 16, 2024
1 parent b16cdfc commit f6ed5e1
Showing 1 changed file with 5 additions and 4 deletions.
@@ -10,6 +10,7 @@

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
+ import org.opensearch.common.Randomness;
import org.opensearch.common.xcontent.XContentFactory;
import org.opensearch.common.xcontent.json.JsonXContent;
import org.opensearch.core.common.bytes.BytesReference;
@@ -32,12 +33,12 @@
* <p>The sample size should be chosen carefully to ensure a high probability of selecting at least one document where the field is present.
* However, it's essential to strike a balance because a large sample size can lead to performance issues since each sample document's _source field is loaded and examined until the field is found.
*
- * <p>Determining the sample size (<var>S</var>) is akin to deciding how many balls to draw from a bin, ensuring a high probability ((<var>&gt;=P</var>)) of drawing at least one green ball (documents with the field) from a mixture of <var>R</var> red balls (documents without the field) and <var>G</var> green balls:
+ * <p>Determining the sample size ({@code S}) is akin to deciding how many balls to draw from a bin, ensuring a high probability ({@code >=P}) of drawing at least one green ball (documents with the field) from a mixture of {@code R } red balls (documents without the field) and {@code G } green balls:
* <pre>{@code
* P >= 1 - C(R, S) / C(R + G, S)
* }</pre>
- * Here, <var>C()</var> represents the binomial coefficient.
- * For a high confidence level, we aim for <var>P &gt;= 0.95</var>. For example, with 10^7 documents where the field is present in 2% of them, the sample size <var>S</var> should be around 149 to achieve a probability of 0.95.
+ * Here, {@code C()} represents the binomial coefficient.
+ * For a high confidence level, we aim for {@code P >= 0.95 }. For example, with {@code 10^7 } documents where the field is present in {@code 2% } of them, the sample size {@code S } should be around 149 to achieve a probability of {@code 0.95}.
*/
public class FieldTypeInference {
private final IndexReader indexReader;
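
Note on the Javadoc above: the figure of 149 can be checked numerically. The following is a minimal standalone sketch, not part of this commit; the class name `SampleSizeSketch` and the method `minSampleSize` are hypothetical names chosen for illustration. It searches for the smallest `S` that satisfies `P >= 1 - C(R, S) / C(R + G, S)`, updating the ratio of binomial coefficients incrementally so that no large factorials are needed.

```java
// Hypothetical helper, not part of FieldTypeInference.
class SampleSizeSketch {

    /**
     * Returns the smallest S such that 1 - C(R, S) / C(R + G, S) >= p,
     * or -1 if no sample size can reach p (for example when G == 0).
     */
    static long minSampleSize(long red, long green, double p) {
        long total = red + green;
        // Probability that S draws contain no green ball; equals 1 for S = 0.
        double missProbability = 1.0;
        for (long s = 0; s <= total; s++) {
            if (1.0 - missProbability >= p) {
                return s;
            }
            if (s == total) {
                break;
            }
            // C(R, s + 1) / C(N, s + 1) = C(R, s) / C(N, s) * (R - s) / (N - s)
            missProbability *= (double) (red - s) / (double) (total - s);
        }
        return -1;
    }

    public static void main(String[] args) {
        long total = 10_000_000L;  // 10^7 documents, as in the Javadoc example
        long green = total / 50;   // field present in 2% of them
        System.out.println(minSampleSize(total - green, green, 0.95));
    }
}
```

For a corpus this large the result is close to the simpler bound `0.98^S <= 0.05`, since removing a few hundred documents barely changes the proportion of documents that lack the field.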
@@ -156,7 +157,7 @@ private void setNextLeaf() {

private static int[] getSortedRandomNum(int sampleSize, int upperBound, int attempts) {
Set<Integer> generatedNumbers = new HashSet<>();
- Random random = new Random();
+ Random random = Randomness.get();
int itr = 0;
while (generatedNumbers.size() < sampleSize && itr++ < attempts) {
int randomNumber = random.nextInt(upperBound);
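
Note on the change above: routing the sampler through a shared source of randomness instead of `new Random()` makes the sampled positions controllable by the caller, presumably so that they can be reproduced under a fixed seed in tests. The sketch below illustrates the idea in plain Java, with a `Random` passed in as a parameter in place of `Randomness.get()`. The class and method names are hypothetical, and the part of the method that is collapsed in this diff is assumed to add each number to the set and return the values sorted.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical standalone sketch, not the code from this commit.
class SortedSampleSketch {

    /**
     * Draws up to sampleSize distinct ints from [0, upperBound), giving up
     * after the given number of attempts, and returns them in ascending order.
     */
    static int[] sortedUniqueSample(Random random, int sampleSize, int upperBound, int attempts) {
        Set<Integer> generated = new HashSet<>();
        int itr = 0;
        while (generated.size() < sampleSize && itr++ < attempts) {
            // Duplicates are ignored by the set but still consume an attempt.
            generated.add(random.nextInt(upperBound));
        }
        return generated.stream().mapToInt(Integer::intValue).sorted().toArray();
    }

    public static void main(String[] args) {
        // The same seed yields the same sample, which keeps a failing test reproducible.
        int[] first = sortedUniqueSample(new Random(42L), 5, 100, 50);
        int[] second = sortedUniqueSample(new Random(42L), 5, 100, 50);
        System.out.println(java.util.Arrays.equals(first, second));
    }
}
```

The attempts cap bounds the loop when `sampleSize` is close to or larger than `upperBound`, at the cost of occasionally returning fewer than `sampleSize` values.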