Skip to content

Fix subprocess handling when killed on timeout #5220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Changes from 2 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
b367ef2
Add WIP fix for handling urls with unicode characters.
Kami Mar 13, 2021
4e0bb04
Add WIP fix for the client side and make sure urls with unicode
Kami Mar 13, 2021
20e9b81
Revert "Add WIP fix for the client side and make sure urls with unicode"
Kami Mar 13, 2021
cab8b52
Add a fix for CLI which would result in exceptions in various scenarios
Kami Mar 13, 2021
e08f9f1
Move functionality for surrogate re-encoding to a uility function for
Kami Mar 13, 2021
2f06857
Add additional unit tests for the utf-8 bug.
Kami Mar 13, 2021
3ef58c9
Add changelog entry.
Kami Mar 13, 2021
db8e85e
Add an example rule fixture for rule with unicode name which can be used
Kami Mar 13, 2021
f3a92b1
Use a different more unique name to avoid breaking other test with a
Kami Mar 13, 2021
04e58b6
Update API router code to throw a more user-friendly exception in case
Kami Mar 13, 2021
15274ea
Add a test case for it.
Kami Mar 13, 2021
ae42830
fix subprocess handling when killed on timeout
engineer-roman Apr 3, 2021
fde03cc
Merge branch 'master' into url_path_unicode_fix
Kami Apr 5, 2021
c0ed21b
For now, remove test fixture which is causing failure in end to end
Kami Apr 5, 2021
b475c64
Merge branch 'master' into fix_zombie_spawning_green_shell
Kami Apr 5, 2021
c687259
Merge pull request #5189 from StackStorm/url_path_unicode_fix
Kami Apr 5, 2021
d1912e1
Merge branch 'fix_zombie_spawning_green_shell' of https://github.com/…
Kami Apr 6, 2021
a4e3c98
Instead of overriding process.returncode and utilizing that attribute to
Kami Apr 6, 2021
708fa44
Update docstring.
Kami Apr 6, 2021
39e95b4
Add integration tests for st2common.util.green.shell.run_command()
Kami Apr 6, 2021
927fe07
Add TODO comment.
Kami Apr 6, 2021
0770969
Add changelog entry.
Kami Apr 6, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 1 addition & 4 deletions st2common/st2common/util/green/shell.py
Original file line number Diff line number Diff line change
Expand Up @@ -165,17 +165,14 @@ def on_timeout_expired(timeout):
# Note: We explicitly set the returncode to indicate the timeout.
LOG.debug("Command execution timeout reached.")

# NOTE: It's important we set returncode twice - here and below to avoid race in this
# function because "kill_func()" is async and "process.kill()" is not.
process.returncode = TIMEOUT_EXIT_CODE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into this tonight.

I started by adding a test case and I was indeed able to reproduce the zombie process test issue with a local command runner like setup where we use a custom kill_func().

It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just curious where you encountered this issue?

Was it via local shell command action runner or some other code path?

Copy link
Member

@Kami Kami Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And under which Python version you encountered this issue?

Since technically now that only support Python >= 3.6, we could also simplify that function a bit and utilize timeout parameter for communicate() method which we couldn't do when we still supported Python 2.7.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kami

Also, just curious where you encountered this issue?

That was long-running Python actions, but next morning found zombies and started to investigate root cause of them. Finally I realized those processes just never receive kill signal, checked source of kill/communicate/wait and here we are :)

It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.

No, wait() won't work in both cases (eventlet and gevent) because this method checks returncode against None to proceed, otherwise it will just return existing returncode. Also it should not cause any mess with timeout detection logic becuse it will be set to -9 right after wait().

And under which Python version you encountered this issue?

Python 3.6. I'll be able to look into using communicate() here tonight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using communicate() is actually not a blocker, just a potential future simplification :)

And thanks for answering those questions.

I think the change which was merged should cover all those scenarios and I believe we now also have integration tests for most of those situations (shell=True as is the case for local runner, shell=False for python action runner, etc. Only thing we are missing is tests for read_stdout_func + read_stdout_func which verify threads which call those functions are correctly stopped / killed on timeout.).

Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

Copy link
Contributor Author

@engineer-roman engineer-roman Apr 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using communicate() is actually not a blocker, just a potential future simplification :)

I see :)

Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

Just tried to reproduce the issue and everything seem to be fine:

  1. subprocess is getting killed on timeout
  2. it lefts no zombie process after kill process
  3. sure, subprocess might be tracked down within ST2 as timed out process

Thank you for helping with my PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Thanks for confirming and submitting this bug fix.


if kill_func:
LOG.debug("Calling kill_func.")
kill_func(process=process)
else:
LOG.debug("Killing process.")
process.kill()

process.wait()
# NOTE: It's imporant to set returncode here as well, since call to process.kill() sets
# it and overwrites it if we set it earlier.
process.returncode = TIMEOUT_EXIT_CODE
Expand Down