-
Notifications
You must be signed in to change notification settings - Fork 913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KYUUBI #2642] Fix flaky test - JpsApplicationOperation with spark local mode #2669
Conversation
Also cc @cxzl25 |
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/JpsApplicationOperation.scala
Outdated
Show resolved
Hide resolved
It seems that this PR can not fix the issue. It is confused that, we can get engineId when invoking
|
Codecov Report
@@ Coverage Diff @@
## master #2669 +/- ##
============================================
- Coverage 64.29% 64.24% -0.05%
- Complexity 82 84 +2
============================================
Files 385 385
Lines 18656 18673 +17
Branches 2529 2531 +2
============================================
+ Hits 11994 11996 +2
- Misses 5515 5531 +16
+ Partials 1147 1146 -1
Continue to review full report at Codecov.
|
seems still flaky ..
|
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/JpsApplicationOperation.scala
Show resolved
Hide resolved
|
It seems that the process should not terminate itself before being killed. |
the UT passed. Let me re-run them again. |
the engine log looks pretty normal |
kyuubi-server/src/test/scala/org/apache/kyuubi/engine/JpsApplicationOperationSuite.scala
Outdated
Show resolved
Hide resolved
|
cc @wForget |
trigger re-run again. |
kyuubi-server/src/main/scala/org/apache/kyuubi/engine/JpsApplicationOperation.scala
Show resolved
Hide resolved
please update the PR desc |
updated |
re-run again and succeed. Thanks for your review, merging to master |
@@ -81,16 +81,18 @@ class JpsApplicationOperationSuite extends KyuubiFunSuite { | |||
assert(desc1.contains("id")) | |||
assert(desc1("name").contains(id)) | |||
assert(desc1("state") === "RUNNING") | |||
val response = jps.killApplicationByTag(id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root cause of this problem is that spark-submit
will generate two processes.
First start org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit
to generate submission parameters,
Then start org.apache.spark.deploy.SparkSubmit
to execute the main logic.
So the call to getApplicationInfoByTag
is successful, it sees process 1, but sometimes killApplicationByTag
happens to be executed when the two processes are being switched. At this time, jps cannot see the process of the relevant tag.
Now add in the eventually block and it seems to work ok.
Is there some possibility, such as the engine startup is slow, more than 10s, or the kill command fails?
However, after this investigation, the next time we encounter this case, it should be quickly fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the comments @cxzl25
cc @cfmcgrady
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some possibility, such as the engine startup is slow, more than 10s, or the kill command fails?
It seems difficult for JpsApplicationOperation to handle it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the JpsApplicationOperation.getApplicationInfoByTag
should make sure the started JVM application is org.apache.spark.deploy.SparkSubmit
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also cc @yaooqinn
…cal mode # 🔍 Description ## Issue References 🔗 This pull request fixes #2642 ## Describe Your Solution 🔧 #2669 (comment) ```bash jps -ml ``` ``` 99427 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.abc=123 ``` ``` 99416 org.apache.spark.deploy.SparkSubmit --conf spark.abc=123 --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal ``` ## Types of changes 🔖 - [x] Bugfix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 #### Behavior Without This Pull Request ⚰️ https://github.com/apache/kyuubi/actions/runs/6981210739/job/18997901268?pr=5773 ``` - JpsApplicationOperation with spark local mode *** FAILED *** The code passed to eventually never returned normally. Attempted 41 times over 10.235822332 seconds. Last failure message: "23463" did not equal null. (JpsApplicationOperationSuite.scala:92) ``` #### Behavior With This Pull Request 🎉 GA #### Related Unit Tests JpsApplicationOperationSuite "JpsApplicationOperation with spark local mode" --- # Checklists ## 📝 Author Self Checklist - [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project - [x] I have performed a self-review - [ ] I have commented my code, particularly in hard-to-understand areas - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] New and existing unit tests pass locally with my changes - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) ## 📝 Committer Pre-Merge Checklist - [ ] Pull request title is okay. - [ ] No license issues. - [ ] Milestone correctly set? - [ ] Test coverage is ok - [ ] Assignees are selected. - [ ] Minimum number of approvals - [ ] No changes are requested **Be nice. Be informative.** Closes #5785 from cxzl25/test_jps_local. Closes #5785 bb456a6 [sychen] Fix flaky test - JpsApplicationOperation with spark local mode Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Cheng Pan <chengpan@apache.org>
…cal mode # 🔍 Description ## Issue References 🔗 This pull request fixes #2642 ## Describe Your Solution 🔧 #2669 (comment) ```bash jps -ml ``` ``` 99427 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.abc=123 ``` ``` 99416 org.apache.spark.deploy.SparkSubmit --conf spark.abc=123 --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal ``` ## Types of changes 🔖 - [x] Bugfix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 #### Behavior Without This Pull Request ⚰️ https://github.com/apache/kyuubi/actions/runs/6981210739/job/18997901268?pr=5773 ``` - JpsApplicationOperation with spark local mode *** FAILED *** The code passed to eventually never returned normally. Attempted 41 times over 10.235822332 seconds. Last failure message: "23463" did not equal null. (JpsApplicationOperationSuite.scala:92) ``` #### Behavior With This Pull Request 🎉 GA #### Related Unit Tests JpsApplicationOperationSuite "JpsApplicationOperation with spark local mode" --- # Checklists ## 📝 Author Self Checklist - [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project - [x] I have performed a self-review - [ ] I have commented my code, particularly in hard-to-understand areas - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] New and existing unit tests pass locally with my changes - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) ## 📝 Committer Pre-Merge Checklist - [ ] Pull request title is okay. - [ ] No license issues. - [ ] Milestone correctly set? - [ ] Test coverage is ok - [ ] Assignees are selected. - [ ] Minimum number of approvals - [ ] No changes are requested **Be nice. Be informative.** Closes #5785 from cxzl25/test_jps_local. Closes #5785 bb456a6 [sychen] Fix flaky test - JpsApplicationOperation with spark local mode Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Cheng Pan <chengpan@apache.org> (cherry picked from commit 19ae399) Signed-off-by: Cheng Pan <chengpan@apache.org>
…cal mode # 🔍 Description ## Issue References 🔗 This pull request fixes #2642 ## Describe Your Solution 🔧 #2669 (comment) ```bash jps -ml ``` ``` 99427 org.apache.spark.launcher.Main org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver --conf spark.abc=123 ``` ``` 99416 org.apache.spark.deploy.SparkSubmit --conf spark.abc=123 --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal ``` ## Types of changes 🔖 - [x] Bugfix (non-breaking change which fixes an issue) - [ ] New feature (non-breaking change which adds functionality) - [ ] Breaking change (fix or feature that would cause existing functionality to change) ## Test Plan 🧪 #### Behavior Without This Pull Request ⚰️ https://github.com/apache/kyuubi/actions/runs/6981210739/job/18997901268?pr=5773 ``` - JpsApplicationOperation with spark local mode *** FAILED *** The code passed to eventually never returned normally. Attempted 41 times over 10.235822332 seconds. Last failure message: "23463" did not equal null. (JpsApplicationOperationSuite.scala:92) ``` #### Behavior With This Pull Request 🎉 GA #### Related Unit Tests JpsApplicationOperationSuite "JpsApplicationOperation with spark local mode" --- # Checklists ## 📝 Author Self Checklist - [x] My code follows the [style guidelines](https://kyuubi.readthedocs.io/en/master/contributing/code/style.html) of this project - [x] I have performed a self-review - [ ] I have commented my code, particularly in hard-to-understand areas - [x] I have made corresponding changes to the documentation - [x] My changes generate no new warnings - [ ] I have added tests that prove my fix is effective or that my feature works - [ ] New and existing unit tests pass locally with my changes - [x] This patch was not authored or co-authored using [Generative Tooling](https://www.apache.org/legal/generative-tooling.html) ## 📝 Committer Pre-Merge Checklist - [ ] Pull request title is okay. - [ ] No license issues. - [ ] Milestone correctly set? - [ ] Test coverage is ok - [ ] Assignees are selected. - [ ] Minimum number of approvals - [ ] No changes are requested **Be nice. Be informative.** Closes #5785 from cxzl25/test_jps_local. Closes #5785 bb456a6 [sychen] Fix flaky test - JpsApplicationOperation with spark local mode Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Cheng Pan <chengpan@apache.org> (cherry picked from commit 19ae399) Signed-off-by: Cheng Pan <chengpan@apache.org>
Why are the changes needed?
To close #2642
In this PR, we put the
getApplicationInfoByTag
andkillApplicationByTag
into the sameeventually
block to try to fix the confused flaky test.Since we kill the application with signal
SIGTERM
, so the process is not terminated immediately.SIGTERM vs SIGKILL: What's the Difference?
So we have to wrap the following
getApplicationInfoByTag
block(assert the application is finished) witheventually
afterkillApplicationByTag
.BTW, we fix a potential issue in case that the
jps
command is not in the PATH.How was this patch tested?
Add some test cases that check the changes thoroughly including negative and positive cases if possible
Add screenshots for manual tests if appropriate
Run test locally before make a pull request