-
-
Notifications
You must be signed in to change notification settings - Fork 758
Fix subprocess handling when killed on timeout #5220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix subprocess handling when killed on timeout #5220
Conversation
characters are handled correctly.
This reverts commit 4e0bb04.
where the argument contains unicode character.
easier testability, etc. Add unit tests for that function.
Keep in mind that API tests are problematic - we don't exercise exactly the same code paths as we do when running outside tests because the tests itself utilize webtest module which has the same bug as webob. This means I needed to patch the code so it utilizes the same version of the functions as we do in prod. Not ideal and also shows why it's importat we also have actual end to end tests for the api.
by end to end tests.
the request URL contains invalid or incorrectly URL encoded characters. Previously we didn't handle this scenario which means original UnicodeDecodeError error with the stack trace got propagated to the user which is a no go. Related issue - GoogleCloudPlatform/webapp2#152.
tests on Ubuntu 16.04. Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working on a test workaround since it won't be needed in the near future anyway once we remove support for 16.04.
Thanks for the contribution. That change will likely take a while to review since I remember we had edge cases in that code before so we need to make sure it's reviewed and tested well (and we also need a regression test case for this issue). |
Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv)
@@ -165,17 +165,14 @@ def on_timeout_expired(timeout): | |||
# Note: We explicitly set the returncode to indicate the timeout. | |||
LOG.debug("Command execution timeout reached.") | |||
|
|||
# NOTE: It's important we set returncode twice - here and below to avoid race in this | |||
# function because "kill_func()" is async and "process.kill()" is not. | |||
process.returncode = TIMEOUT_EXIT_CODE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm looking into this tonight.
I started by adding a test case and I was indeed able to reproduce the zombie process test issue with a local command runner like setup where we use a custom kill_func()
.
It seems that adding process.wait()
after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, just curious where you encountered this issue?
Was it via local shell command action runner or some other code path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And under which Python version you encountered this issue?
Since technically now that only support Python >= 3.6, we could also simplify that function a bit and utilize timeout
parameter for communicate()
method which we couldn't do when we still supported Python 2.7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, just curious where you encountered this issue?
That was long-running Python actions, but next morning found zombies and started to investigate root cause of them. Finally I realized those processes just never receive kill signal, checked source of kill
/communicate
/wait
and here we are :)
It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.
No, wait()
won't work in both cases (eventlet and gevent) because this method checks returncode
against None
to proceed, otherwise it will just return existing returncode
. Also it should not cause any mess with timeout detection logic becuse it will be set to -9
right after wait()
.
And under which Python version you encountered this issue?
Python 3.6. I'll be able to look into using communicate()
here tonight.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using communicate()
is actually not a blocker, just a potential future simplification :)
And thanks for answering those questions.
I think the change which was merged should cover all those scenarios and I believe we now also have integration tests for most of those situations (shell=True as is the case for local runner, shell=False for python action runner, etc. Only thing we are missing is tests for read_stdout_func + read_stdout_func which verify threads which call those functions are correctly stopped / killed on timeout.).
Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using communicate() is actually not a blocker, just a potential future simplification :)
I see :)
Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?
Just tried to reproduce the issue and everything seem to be fine:
- subprocess is getting killed on timeout
- it lefts no zombie process after kill process
- sure, subprocess might be tracked down within ST2 as timed out process
Thank you for helping with my PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great. Thanks for confirming and submitting this bug fix.
…r0m4n-z/st2 into r0m4n-z-fix_zombie_spawning_green_shell
determine if the process has timed out, utilize a special _timed_out attribute instead. This way we don't interfere with upstream logic which may base decisions based on the value of the process returncode value.
function, including regression test case for zombie / stray process issue.
I added a integration regression test cases (39e95b4) and made a change to utilize a different variable on the process object to signal if a timeout has occurred (a4e3c98) - this should indeed be safer since it won't affect any upstream or other code which may be basing behavior of the value of that attribute. In the future when I get a chance, I also plan to simplify that function and utilize timeout parameter of the Thanks again for reporting this issue. |
Problem:
Running ST2 actions as subprocesses (
eventlet.green
) spawns zombies on timeout.Reproduce:
process.kill()
will be called.Method
process.kill()
is never handled as well asprocess.communicate()
, which callsprocess.wait()
- all those methods checkprocess.returncode
against None and ifreturncode
already set, it just will be returned.Why do I suggest to remove line 170: it doesn't seem to be necessary to set returncode here to avoid race condition for some
kill_func
. Probably additional custom attribute could be used here onprocess
to provide that information if it's really necessary, but I don't think ST2 should do that.