Skip to content

bugfix: prevent background merges on the realtime lucene index#13050

Merged
Jackie-Jiang merged 1 commit intoapache:masterfrom
itschrispeck:concurrent_merges_fix
May 3, 2024
Merged

bugfix: prevent background merges on the realtime lucene index#13050
Jackie-Jiang merged 1 commit intoapache:masterfrom
itschrispeck:concurrent_merges_fix

Conversation

@itschrispeck
Copy link
Contributor

#12744 changed the Lucene index build process to commit the realtime Lucene index, and reuse it in the immutable segment.

.commit() is called at the beginning of segment conversion, and the Lucene index directory is copied to another location. Since IndexWriter can merge segments in the background even without insertions/updates to the index, these merges non-deterministically can trigger a FileNotFoundException:

java.lang.RuntimeException: Caught exception while instantiating the LuceneTextIndexCreator for column: __mergedTextIndex 
     at org.apache.pinot.segment.local.segment.creator.impl.text.LuceneTextIndexCreator.<init>(LuceneTextIndexCreator.java:149) 
<...>
Caused by: java.lang.RuntimeException: Failed to copy the mutable lucene index: java.io.FileNotFoundException: File system element for parameter 'source' does not exist: '/.../dataDir/table_REALTIME/consumers/table__0__3170__20240430T2313Z/col.lucene.v9.index/_3r.si' 
     at org.apache.pinot.segment.local.segment.creator.impl.text.LuceneTextIndexCreator.convertMutableSegment(LuceneTextIndexCreator.java:192) ...

By using NoMergeScheduler, the async merges will never occur. This patch was tested in our staging/prod env and the intermittent segment build failures disappeared.

tag: bugfix

@codecov-commenter
Copy link

codecov-commenter commented May 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.17%. Comparing base (59551e4) to head (ef290b9).
Report is 396 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #13050      +/-   ##
============================================
+ Coverage     61.75%   62.17%   +0.41%     
+ Complexity      207      198       -9     
============================================
  Files          2436     2502      +66     
  Lines        133233   136658    +3425     
  Branches      20636    21179     +543     
============================================
+ Hits          82274    84961    +2687     
- Misses        44911    45397     +486     
- Partials       6048     6300     +252     
Flag Coverage Δ
custom-integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration <0.01% <0.00%> (-0.01%) ⬇️
integration1 <0.01% <0.00%> (-0.01%) ⬇️
integration2 0.00% <0.00%> (ø)
java-11 62.13% <100.00%> (+0.42%) ⬆️
java-21 62.05% <100.00%> (+0.42%) ⬆️
skip-bytebuffers-false 62.15% <100.00%> (+0.40%) ⬆️
skip-bytebuffers-true 62.00% <100.00%> (+34.27%) ⬆️
temurin 62.17% <100.00%> (+0.41%) ⬆️
unittests 62.16% <100.00%> (+0.41%) ⬆️
unittests1 46.69% <0.00%> (-0.20%) ⬇️
unittests2 27.96% <100.00%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Jackie-Jiang
Copy link
Contributor

I saw this test being flaky. Do you think it is related?

Error:  Tests run: 2, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 6.580 s <<< FAILURE! -- in org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest
Error:  org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueries -- Time elapsed: 0.289 s
Error:  org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueryCancellationIsSuccessful -- Time elapsed: 0.032 s <<< FAILURE!
org.testng.TestException: 

The exception was thrown with the wrong message: expected ".*TEXT_MATCH query timeout on realtime consuming segment.*" but got "java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException"
	at org.testng.internal.invokers.ExpectedExceptionsHolder.wrongException(ExpectedExceptionsHolder.java:69)
	at org.testng.internal.invokers.TestInvoker.considerExceptionsInternal(TestInvoker.java:885)
	at org.testng.internal.invokers.TestInvoker.considerExceptions(TestInvoker.java:860)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:741)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethod(TestInvoker.java:230)
	at org.testng.internal.invokers.MethodRunner.runInSequence(MethodRunner.java:63)
	at org.testng.internal.invokers.TestInvoker$MethodInvocationAgent.invoke(TestInvoker.java:992)
	at org.testng.internal.invokers.TestInvoker.invokeTestMethods(TestInvoker.java:203)
	at org.testng.internal.invokers.TestMethodWorker.invokeTestMethods(TestMethodWorker.java:154)
	at org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:134)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at org.testng.TestRunner.privateRun(TestRunner.java:739)
	at org.testng.TestRunner.run(TestRunner.java:614)
	at org.testng.SuiteRunner.runTest(SuiteRunner.java:421)
	at org.testng.SuiteRunner.runSequentially(SuiteRunner.java:413)
	at org.testng.SuiteRunner.privateRun(SuiteRunner.java:373)
	at org.testng.SuiteRunner.run(SuiteRunner.java:312)
	at org.testng.SuiteRunnerWorker.runSuite(SuiteRunnerWorker.java:52)
	at org.testng.SuiteRunnerWorker.run(SuiteRunnerWorker.java:95)
	at org.testng.TestNG.runSuitesSequentially(TestNG.java:1274)
	at org.testng.TestNG.runSuitesLocally(TestNG.java:1208)
	at org.testng.TestNG.runSuites(TestNG.java:1112)
	at org.testng.TestNG.run(TestNG.java:1079)
	at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:155)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:102)
	at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:91)
	at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:137)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)
Caused by: java.util.concurrent.ExecutionException: java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.testQueryCancellationIsSuccessful(LuceneMutableTextIndexTest.java:105)
	at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
	at java.base/java.lang.reflect.Method.invoke(Method.java:580)
	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:141)
	at org.testng.internal.invokers.TestInvoker.invokeMethod(TestInvoker.java:686)
	... 27 more
Caused by: java.util.concurrent.RejectedExecutionException: java.lang.InterruptedException
	at org.apache.pinot.common.utils.ScalingThreadPoolExecutor$ForceQueuePolicy.rejectedExecution(ScalingThreadPoolExecutor.java:93)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:145)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.RealtimeLuceneTextIndex.getDocIds(RealtimeLuceneTextIndex.java:149)
	at org.apache.pinot.segment.local.realtime.impl.invertedindex.LuceneMutableTextIndexTest.lambda$testQueryCancellationIsSuccessful$0(LuceneMutableTextIndexTest.java:103)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
	at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.InterruptedException
	at java.base/java.util.concurrent.locks.ReentrantLock$Sync.lockInterruptibly(ReentrantLock.java:159)
	at java.base/java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:372)
	at java.base/java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:332)
	at org.apache.pinot.common.utils.ScalingThreadPoolExecutor$ForceQueuePolicy.rejectedExecution(ScalingThreadPoolExecutor.java:90)
	... 9 more

@itschrispeck
Copy link
Contributor Author

itschrispeck commented May 3, 2024

I don't think it's related, but is probably caused by this behavior: https://julien.ponge.org/blog/not-all-java-single-threaded-executors-are-created-equal-a-java-finalizer-horror-story/

Did you see the flakiness is Java 21? It should be fixed there, but could misbehave in the suites run with 11/17.

Do you mind if I put the fix into another PR since it's not related to the fix for the reuse feature, or want me to bundle it into this one? (there's a couple other nits I'd rather bundle together if no preference).

@chenboat
Copy link
Contributor

chenboat commented May 3, 2024

A general question: if we turn off merge, will it result in too many Lucene segments for realtime part and cause query perf issues?

@itschrispeck
Copy link
Contributor Author

A general question: if we turn off merge, will it result in too many Lucene segments for realtime part and cause query perf issues?

In theory it could, but in practice Pinot's typical segment sizes prevent this (>5-10GB segments isn't normal), and we don't see a query perf regression with even a small 10MB buffer size and 1.5GB segment size.

FDs are probably more of a concern for systems w/ low limits, but I think can be addressed if needed w/ lucene buffer size/compound file format/Pinot segment size tuning - all of which have existing configs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants