-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-30345][SQL] Fix intermittent test failure (ConnectException) on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite #27001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…n ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite
Here's the diff file to reproduce the issue: |
This patch also resolve SPARK-28883 which error message is reported to ConnectException, but given there might be some failures not relevant to this, I'd defer to reviewers judging whether this patch also resolves SPARK-28883 entirely or partially. |
Test build #115730 has finished for PR 27001 at commit
|
ThriftServerQueryTestSuite has been touched from various people, but let me first cc.ing some committers. |
sqlContext.setConf(ConfVars.HIVE_SERVER2_THRIFT_PORT.varname, port.toString) | ||
hiveServer2 = HiveThriftServer2.startWithContext(sqlContext) | ||
|
||
eventually(timeout(30.seconds), interval(1.seconds)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you leave some comments here?
import org.apache.spark.sql.QueryTest | ||
import org.apache.spark.sql.test.SharedSparkSession | ||
|
||
trait SharedThriftServer extends QueryTest with SharedSparkSession { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just moving QueryTest
into each test suite since they are not dependent between each other?
The change looks reasonable to me. |
b5c41d2
to
160ca1a
Compare
Thanks for the reviewing, I've addressed review comments. Please take a look again. Thanks in advance! |
Test build #115810 has finished for PR 27001 at commit
|
The fix itself LGTM, but do you know why Thrift(Binary|Http)CLIService prepare and launch the service asynchronously? Seems end-users can also hit this problem if they run queries right after the server is started. |
I don't know about history, but if I get the commit history correctly, codebase of Here we're embedding Hive server in test suites hence get a chance to encounter race condition, but most likely it will run as standalone app which end users would just think server is not ready and rerun the query when query fails for such reason. (And the chance is rare, as it's even hard to reproduce without adding artificial sleep. It might be under 1 second.) Btw, there's another test suites which runs Hive server as external standalone app - it shouldn't encounter this issue but it would show longer startup and shutdown. Lines 986 to 1215 in 9c046dc
|
thanks, merging to master! |
Thanks all for reviewing and merging! |
Thank you very much @HeartSaVioR for this trait cleanup! I missed this PR over Christmas and just now saw the new code. This trait that starts the Thriftserver with existing SparkContext instead of as a separate process can be very useful to increase thriftserver test coverage in other places - e.g. add tests validating the state of the thriftserver listener, session manager, query cancellation etc... - things to assert which the test needs to touch thriftserver internals to validate. |
…n ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite ### What changes were proposed in this pull request? This patch fixes the intermittent test failure on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, getting ConnectException when querying to thrift server. (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/) The relevant unit test log messages are following: ``` 19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:ThriftBinaryCLIService is started. 19/12/23 13:33:01.875 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is started. ... 19/12/23 13:33:01.888 pool-1-thread-1 INFO ThriftServerWithSparkContextSuite: HiveThriftServer2 started successfully ... 19/12/23 13:33:01.909 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite: ===== TEST OUTPUT FOR o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' ===== ... 19/12/23 13:33:02.017 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Supplied authorities: localhost:15441 19/12/23 13:33:02.018 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO Utils: Resolved authority: localhost:15441 19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer. 19/12/23 13:33:02.078 HiveServer2-Background-Pool: Thread-213 INFO BaseSessionStateBuilder$$anon$2: Optimization rule 'org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation' is excluded from the optimizer. 19/12/23 13:33:02.121 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite WARN HiveConnection: Failed to connect to localhost:15441 19/12/23 13:33:02.124 pool-1-thread-1-ScalaTest-running-ThriftServerWithSparkContextSuite INFO ThriftServerWithSparkContextSuite: ===== FINISHED o.a.s.sql.hive.thriftserver.ThriftServerWithSparkContextSuite: 'SPARK-29911: Uncache cached tables when session closed' ===== 19/12/23 13:33:02.143 Thread-35 INFO ThriftCLIService: Starting ThriftBinaryCLIService on port 15441 with 5...500 worker threads 19/12/23 13:33:02.327 pool-1-thread-1 INFO HiveServer2: Shutting down HiveServer2 19/12/23 13:33:02.328 pool-1-thread-1 INFO ThriftCLIService: Thrift server has stopped ``` (Here the error is logged as `WARN HiveConnection: Failed to connect to localhost:15441` - the actual stack trace can be seen on Jenkins test summary.) The reason of test failure: Thrift(Binary|Http)CLIService prepare and launch the service asynchronously (in new thread), which suites are not waiting for completion and just start running tests, ends up with race condition. That can be easily reproduced, via adding artificial sleep in `ThriftBinaryCLIService.run()` here: https://github.com/apache/spark/blob/ba3f6330dd2b6054988f1f6f0ffe014fc4969088/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java#L49 (Note that `sleep` should be added before initializing server socket. E.g. Line 57) This patch changes the test initialization logic to try executing simple query to wait until the service is available. The patch also refactors the code to apply the change both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite easily. ### Why are the changes needed? This patch fixes the intermittent failure observed here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/ ### Does this PR introduce any user-facing change? No ### How was this patch tested? Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test. Closes apache#27001 from HeartSaVioR/SPARK-30345. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
This patch fixes the intermittent test failure on ThriftServerQueryTestSuite/ThriftServerWithSparkContextSuite, getting ConnectException when querying to thrift server.
(https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/)
The relevant unit test log messages are following:
(Here the error is logged as
WARN HiveConnection: Failed to connect to localhost:15441
- the actual stack trace can be seen on Jenkins test summary.)The reason of test failure: Thrift(Binary|Http)CLIService prepare and launch the service asynchronously (in new thread), which suites are not waiting for completion and just start running tests, ends up with race condition.
That can be easily reproduced, via adding artificial sleep in
ThriftBinaryCLIService.run()
here:spark/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java
Line 49 in ba3f633
(Note that
sleep
should be added before initializing server socket. E.g. Line 57)This patch changes the test initialization logic to try executing simple query to wait until the service is available. The patch also refactors the code to apply the change both ThriftServerQueryTestSuite and ThriftServerWithSparkContextSuite easily.
Why are the changes needed?
This patch fixes the intermittent failure observed here:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/115646/testReport/
Does this PR introduce any user-facing change?
No
How was this patch tested?
Artificially made the test fail consistently (by the approach described above), and confirmed the patch fixed the test.