Skip to content

Fix subprocess handling when killed on timeout #5220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

engineer-roman
Copy link
Contributor

Problem:
Running ST2 actions as subprocesses (eventlet.green) spawns zombies on timeout.

Reproduce:

  1. Spawn process which execution is going to take more than command's timeout.
  2. process.kill() will be called.
  3. Spawned process becomes zombie.

Method process.kill() is never handled as well as process.communicate(), which calls process.wait() - all those methods check process.returncode against None and if returncode already set, it just will be returned.

Why do I suggest to remove line 170: it doesn't seem to be necessary to set returncode here to avoid race condition for some kill_func. Probably additional custom attribute could be used here on process to provide that information if it's really necessary, but I don't think ST2 should do that.

Kami and others added 12 commits March 13, 2021 12:21
where the argument contains unicode character.
easier testability, etc.

Add unit tests for that function.
Keep in mind that API tests are problematic - we don't exercise exactly
the same code paths as we do when running outside tests because the
tests itself utilize webtest module which has the same bug as webob.

This means I needed to patch the code so it utilizes the same version of
the functions as we do in prod.

Not ideal and also shows why it's importat we also have actual end to
end tests for the api.
the request URL contains invalid or incorrectly URL encoded characters.

Previously we didn't handle this scenario which means original
UnicodeDecodeError error with the stack trace got propagated to the user
which is a no go.

Related issue -
GoogleCloudPlatform/webapp2#152.
@CLAassistant
Copy link

CLAassistant commented Apr 3, 2021

CLA assistant check
All committers have signed the CLA.

@pull-request-size pull-request-size bot added the size/XS PR that changes 0-9 lines. Quick fix/merge. label Apr 3, 2021
Kami added 2 commits April 5, 2021 12:20
tests on Ubuntu 16.04.

Ubuntu 16.04 is EOL and it makes no sense to spend a lot of time working
on a test workaround since it won't be needed in the near future anyway
once we remove support for 16.04.
@Kami
Copy link
Member

Kami commented Apr 5, 2021

Thanks for the contribution.

That change will likely take a while to review since I remember we had edge cases in that code before so we need to make sure it's reviewed and tested well (and we also need a regression test case for this issue).

Correctly handle unicode characters in the URL path names and the CLI arguments (sys.argv)
@Kami Kami self-assigned this Apr 6, 2021
@@ -165,17 +165,14 @@ def on_timeout_expired(timeout):
# Note: We explicitly set the returncode to indicate the timeout.
LOG.debug("Command execution timeout reached.")

# NOTE: It's important we set returncode twice - here and below to avoid race in this
# function because "kill_func()" is async and "process.kill()" is not.
process.returncode = TIMEOUT_EXIT_CODE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking into this tonight.

I started by adding a test case and I was indeed able to reproduce the zombie process test issue with a local command runner like setup where we use a custom kill_func().

It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, just curious where you encountered this issue?

Was it via local shell command action runner or some other code path?

Copy link
Member

@Kami Kami Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And under which Python version you encountered this issue?

Since technically now that only support Python >= 3.6, we could also simplify that function a bit and utilize timeout parameter for communicate() method which we couldn't do when we still supported Python 2.7.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Kami

Also, just curious where you encountered this issue?

That was long-running Python actions, but next morning found zombies and started to investigate root cause of them. Finally I realized those processes just never receive kill signal, checked source of kill/communicate/wait and here we are :)

It seems that adding process.wait() after kill fixes the zombie process issue and we can leave that exit code assignment in place - I think that's important since if we remove it process will have a different exit code set which would break our timeout detection logic in some places.

No, wait() won't work in both cases (eventlet and gevent) because this method checks returncode against None to proceed, otherwise it will just return existing returncode. Also it should not cause any mess with timeout detection logic becuse it will be set to -9 right after wait().

And under which Python version you encountered this issue?

Python 3.6. I'll be able to look into using communicate() here tonight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using communicate() is actually not a blocker, just a potential future simplification :)

And thanks for answering those questions.

I think the change which was merged should cover all those scenarios and I believe we now also have integration tests for most of those situations (shell=True as is the case for local runner, shell=False for python action runner, etc. Only thing we are missing is tests for read_stdout_func + read_stdout_func which verify threads which call those functions are correctly stopped / killed on timeout.).

Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

Copy link
Contributor Author

@engineer-roman engineer-roman Apr 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using communicate() is actually not a blocker, just a potential future simplification :)

I see :)

Can you please confirm the change which has been merged indeed fixes the issue (I believe it should since it's based on your changes)?

Just tried to reproduce the issue and everything seem to be fine:

  1. subprocess is getting killed on timeout
  2. it lefts no zombie process after kill process
  3. sure, subprocess might be tracked down within ST2 as timed out process

Thank you for helping with my PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. Thanks for confirming and submitting this bug fix.

Kami added 5 commits April 6, 2021 22:35
determine if the process has timed out, utilize a special _timed_out
attribute instead.

This way we don't interfere with upstream logic which may base decisions
based on the value of the process returncode value.
function, including regression test case for zombie / stray process
issue.
@pull-request-size pull-request-size bot removed the size/XS PR that changes 0-9 lines. Quick fix/merge. label Apr 6, 2021
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Apr 6, 2021
@Kami
Copy link
Member

Kami commented Apr 6, 2021

I added a integration regression test cases (39e95b4) and made a change to utilize a different variable on the process object to signal if a timeout has occurred (a4e3c98) - this should indeed be safer since it won't affect any upstream or other code which may be basing behavior of the value of that attribute.

In the future when I get a chance, I also plan to simplify that function and utilize timeout parameter of the communicate() method.

Thanks again for reporting this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug fix service: action runner size/L PR that changes 100-499 lines. Requires some effort to review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants