Fix subprocess handling when killed on timeout #5220

engineer-roman · 2021-04-03T23:20:07Z

Problem:
Running ST2 actions as subprocesses (eventlet.green) spawns zombies on timeout.

Reproduce:

1. Spawn a process whose execution takes longer than the command's timeout.
2. process.kill() is called when the timeout expires.
3. The spawned process becomes a zombie.

Neither process.kill() nor process.communicate() (which calls process.wait()) actually does its work in this case: all of these methods check process.returncode against None, and if returncode is already set they return immediately (wait() simply returns the existing value).

Why I suggest removing line 170: setting returncode there does not seem to be necessary to avoid a race condition with a kill_func. If that information is really needed, an additional custom attribute on the process could carry it, but I don't think ST2 should do that.
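The early-return behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, assuming a POSIX system and the plain subprocess module rather than eventlet.green.subprocess; the variable names are illustrative, and -9 is the timeout exit code mentioned later in this thread.

```python
import os
import subprocess
import sys

TIMEOUT_EXIT_CODE = -9  # assumed value, taken from the discussion in this thread

# Start a child that would run far longer than any timeout we care about.
process = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Setting returncode up front makes Popen believe the child has already exited.
process.returncode = TIMEOUT_EXIT_CODE

process.kill()                 # returns without sending a signal
early_result = process.wait()  # returns the stored value without waiting

# Ask the OS directly: waitpid() with WNOHANG reports pid 0 while the
# child is still running.
pid, _status = os.waitpid(process.pid, os.WNOHANG)
still_running = pid == 0

# Clean up: clear the fake value so kill() and wait() do real work
# and the child is reaped.
process.returncode = None
process.kill()
final_code = process.wait()

print(early_result, still_running, final_code)
```

In this sketch the child is expected to still be running after the first kill()/wait() pair, which is the situation that leaves a stray process behind once the parent stops tracking it.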

Keep in mind that API tests are problematic - we don't exercise exactly the same code paths as we do when running outside tests because the tests themselves utilize the webtest module, which has the same bug as webob. This means I needed to patch the code so it utilizes the same version of the functions as we do in prod. Not ideal, and it also shows why it's important that we also have actual end to end tests for the API.

the request URL contains invalid or incorrectly URL encoded characters. Previously we didn't handle this scenario, which meant the original UnicodeDecodeError with the stack trace got propagated to the user, which is a no go. Related issue - GoogleCloudPlatform/webapp2#152.

CLAassistant · 2021-04-03T23:20:10Z

All committers have signed the CLA.

Kami · 2021-04-05T12:18:19Z

Thanks for the contribution.

That change will likely take a while to review since I remember we had edge cases in that code before so we need to make sure it's reviewed and tested well (and we also need a regression test case for this issue).

Kami · 2021-04-06T19:56:53Z

st2common/st2common/util/green/shell.py

@@ -165,17 +165,14 @@ def on_timeout_expired(timeout):
            # Note: We explicitly set the returncode to indicate the timeout.
            LOG.debug("Command execution timeout reached.")

-            # NOTE: It's important we set returncode twice - here and below to avoid race in this
-            # function because "kill_func()" is async and "process.kill()" is not.
-            process.returncode = TIMEOUT_EXIT_CODE


I'm looking into this tonight.

I started by adding a test case and I was indeed able to reproduce the zombie process issue with a setup similar to the local command runner, where we use a custom kill_func().

It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it, the process will have a different exit code set, which would break our timeout detection logic in some places.

Also, just curious where you encountered this issue?

Was it via local shell command action runner or some other code path?

And under which Python version did you encounter this issue?

Since we now only support Python >= 3.6, we could also simplify that function a bit and utilize the timeout parameter of the communicate() method, which we couldn't do when we still supported Python 2.7.
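The simplification suggested here could look roughly like the following. This is a sketch using the standard library subprocess module, not the actual ST2 function; run_with_timeout and its return shape are hypothetical names chosen for illustration, and whether the green (eventlet) variant behaves identically is an assumption.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run args; kill and reap the child if it exceeds timeout seconds."""
    process = subprocess.Popen(
        args, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    timed_out = False
    try:
        stdout, stderr = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        timed_out = True
        process.kill()
        # A second communicate() collects any remaining output and reaps
        # the child, so no zombie is left behind.
        stdout, stderr = process.communicate()
    return process.returncode, stdout, stderr, timed_out


if __name__ == "__main__":
    code, out, err, timed_out = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(60)"], timeout=0.5
    )
    print(code, timed_out)
```

Because returncode is never overwritten by hand, kill() and the second communicate() do real work, and the timeout is reported through a separate flag.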

engineer-roman · Apr 8, 2021

@Kami

> Also, just curious where you encountered this issue?

That was with long-running Python actions: the next morning I found zombies and started to investigate their root cause. Finally I realized those processes just never receive the kill signal, checked the source of kill/communicate/wait, and here we are :)

> It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it, the process will have a different exit code set, which would break our timeout detection logic in some places.

No, wait() won't work in either case (eventlet or gevent) because this method checks returncode against None before proceeding; otherwise it just returns the existing returncode. Also, removing the assignment should not interfere with the timeout detection logic, because returncode will be set to -9 right after wait().

> And under which Python version did you encounter this issue?

Python 3.6. I'll be able to look into using communicate() here tonight.

Kami · Apr 9, 2021

Using communicate() is actually not a blocker, just a potential future simplification :)

And thanks for answering those questions.

I think the change which was merged should cover all those scenarios, and I believe we now also have integration tests for most of those situations (shell=True as is the case for the local runner, shell=False for the Python action runner, etc.). The only thing we are missing is tests for read_stdout_func + read_stderr_func which verify that the threads which call those functions are correctly stopped / killed on timeout.

Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

engineer-roman · Apr 11, 2021

> Using communicate() is actually not a blocker, just a potential future simplification :)

I see :)

> Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

I just tried to reproduce the issue and everything seems to be fine:

- the subprocess is killed on timeout
- no zombie process is left after the kill
- the subprocess can still be tracked within ST2 as a timed out process

Thank you for helping with my PR

Kami · Apr 11, 2021

Great. Thanks for confirming and submitting this bug fix.

Kami · 2021-04-06T21:11:56Z

I added integration regression test cases (39e95b4) and made a change to utilize a different variable on the process object to signal if a timeout has occurred (a4e3c98) - this should indeed be safer since it won't affect any upstream or other code which may be basing its behavior on the value of the returncode attribute.
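The idea of signalling the timeout through a separate attribute can be sketched as follows. This is not the merged ST2 code: it uses the standard library subprocess module and threading.Timer instead of the green variants, and run_with_timeout_flag is a hypothetical helper. Only the _timed_out attribute name comes from the commit message for a4e3c98.

```python
import subprocess
import sys
import threading


def run_with_timeout_flag(args, timeout):
    """Hypothetical helper: returns (returncode, timed_out)."""
    process = subprocess.Popen(args)
    # Custom attribute used as the timeout marker; returncode is left alone.
    process._timed_out = False

    def on_timeout_expired():
        process._timed_out = True
        # returncode is still None here, so the signal is really sent.
        process.kill()

    timer = threading.Timer(timeout, on_timeout_expired)
    timer.start()
    try:
        # wait() reaps the child, whether it exited on its own or was killed.
        process.wait()
    finally:
        timer.cancel()
    return process.returncode, process._timed_out
```

Callers then check the flag rather than comparing returncode against a special value, so the exit code reported by the operating system stays intact.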

In the future when I get a chance, I also plan to simplify that function and utilize timeout parameter of the communicate() method.

Thanks again for reporting this issue.

Kami and others added 12 commits March 13, 2021 12:21

Add WIP fix for handling urls with unicode characters.

b367ef2

Add WIP fix for the client side and make sure urls with unicode

4e0bb04

characters are handled correctly.

Revert "Add WIP fix for the client side and make sure urls with unicode"

20e9b81

This reverts commit 4e0bb04.

Add a fix for CLI which would result in exceptions in various scenarios

cab8b52

where the argument contains unicode character.

Move functionality for surrogate re-encoding to a utility function for

e08f9f1

easier testability, etc. Add unit tests for that function.

Add changelog entry.

3ef58c9

Add an example rule fixture for rule with unicode name which can be used

db8e85e

by end to end tests.

Use a different more unique name to avoid breaking other test with a

f3a92b1

similar name.

Add a test case for it.

15274ea

fix subprocess handling when killed on timeout

ae42830

pull-request-size bot added the size/XS PR that changes 0-9 lines. Quick fix/merge. label Apr 3, 2021

Kami added 2 commits April 5, 2021 12:20

Merge branch 'master' into url_path_unicode_fix

fde03cc

For now, remove test fixture which is causing failure in end to end

c0ed21b

tests on Ubuntu 16.04. Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working on a test workaround since it won't be needed in the near future anyway once we remove support for 16.04.

Merge branch 'master' into fix_zombie_spawning_green_shell

b475c64

Kami added this to the 3.5.0 milestone Apr 5, 2021

Kami added bug fix service: action runner labels Apr 5, 2021

Merge pull request StackStorm#5189 from StackStorm/url_path_unicode_fix

c687259

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv)

Kami self-assigned this Apr 6, 2021

Kami reviewed Apr 6, 2021

Kami added 5 commits April 6, 2021 22:35

Merge branch 'fix_zombie_spawning_green_shell' of https://github.com/…

d1912e1

…r0m4n-z/st2 into r0m4n-z-fix_zombie_spawning_green_shell

Instead of overriding process.returncode and utilizing that attribute to

a4e3c98

determine if the process has timed out, utilize a special _timed_out attribute instead. This way we don't interfere with upstream logic which may base decisions based on the value of the process returncode value.

Update docstring.

708fa44

Add integration tests for st2common.util.green.shell.run_command()

39e95b4

function, including regression test case for zombie / stray process issue.

Add TODO comment.

927fe07

pull-request-size bot removed the size/XS PR that changes 0-9 lines. Quick fix/merge. label Apr 6, 2021

pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Apr 6, 2021

Add changelog entry.

0770969

Kami approved these changes Apr 6, 2021

Kami merged commit 4f28540 into StackStorm:master Apr 7, 2021

MichaelMcClure mentioned this pull request Jul 9, 2021

Stackstorm action runner does not return / complete - hangs and process is defunct #5300

Open

engineer-roman commented Apr 3, 2021

CLAassistant commented Apr 3, 2021

Kami commented Apr 5, 2021

Kami Apr 6, 2021

Kami Apr 6, 2021

Kami Apr 6, 2021

engineer-roman Apr 8, 2021

Kami Apr 9, 2021

engineer-roman Apr 11, 2021

Kami Apr 11, 2021

Kami commented Apr 6, 2021