Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Terminate analyzer invocation once a timeout passes to prevent hanging #1203

Merged
merged 1 commit into from
Jan 23, 2018

Conversation

whisperity
Copy link
Contributor

Resolves #1168.

CodeChecker analyze [...] --timeout 600 will set a 600 seconds (10 minutes, this is the default value!) timeout for each and every analyzer invocation made by CodeChecker.

If the analysis takes longer than this timeout, the process is killed, and the analysis is considered a failed one, with all the "failure-ZIP-creation" and other code executing.

How to spot a timed-out analysis in the failure ZIP?

  • The return-code will be -1.
  • The stderr will contain the following prefix before the analyzer's real stderr:
    >>> CodeChecker: Analysis timed out after 600 seconds. <<<

@sylvestre This method sounds good to you? This will also, hopefully, make it easier to debug why a particular analysis invocation failed.

@whisperity whisperity added enhancement 🌟 analyzer 📈 Related to the analyze commands (analysis driver) labels Dec 1, 2017
@whisperity whisperity added this to the release 6.3 milestone Dec 1, 2017
@sylvestre
Copy link
Contributor

Looks great, thanks!

# Need to capture the "function pointer" returned by
# setup_process_timeout as reference, so that we may call it later.
# (By default, let's say that the process finished gracefully.)
timeout_cleanup = [lambda _: False]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for this being a list?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment explains it right above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setup_process_timeout returns a function pointer, but the function which calls setup_... is called through a callback. We need to capture this function pointer so we may call it later when the analyzer finishes (a few lines after this.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we'd simply say timeout_cleanup = setup_... in the embedded function callback of run_analyzer, that assignment creates a local scope and shadows this variable, there is no way to retrieve the actual function pointer.

But lists and dicts are captured by reference, so we can assign to it and still use the value later on. By the time run_analyzer() returns, the analyzer has finished: either via exiting by itself, or the watcher killing it. So the callback has executed (because it executes at the creation of the subprocess), which means this function points to an actual closure of setup_process_timeout.__cleanup_timeout(). Which we can use to figure out if the exit of the analyzer (and return from run_analyzer()) happened from the analyzer quitting, or we murdering it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Okay, I put a better explanation about this into the code too.)

@Xazax-hun
Copy link
Contributor

I disagree with the default value. I have seen translation units running longer than that in real-world projects. I think by default we should not have any timeout and the correct value should be decided on a per-project basis.

@Xazax-hun
Copy link
Contributor

Do we have a debugging guide? It would be great to have the info about how to spot these problems elsewhere, not just the PR description.

@whisperity
Copy link
Contributor Author

@gyorb, @dkrupp Input on the timeout thing? Is it indeed better to have no timeouts by default (I think it is dangerous. The inverse of the argument can also be applied: if you think the timeout is too strict, you can always increase it.)?

@whisperity
Copy link
Contributor Author

@Xazax-hun I don't think there is a debugging guide per se. @martong How much do you rely on the format of stderr in your recently introduced scripts? I'm not sure if this extra line breaks them or not.

@martong
Copy link
Contributor

martong commented Dec 1, 2017

@whisperity The debug scripts under scripts/debug_tools do not depend on the stderr. (They depend on the analyzer-command and the sources_root only from the zip)

@gyorb
Copy link
Contributor

gyorb commented Dec 1, 2017

We talked about higher default values too like 30min for a translation unit analysis. What were the analysis times you experienced @Xazax-hun?

Could we extend the debug tools to check the analyzer return values and give some feedback if the analysis failed because of the timeout and not because some failure during the analysis?

@Xazax-hun
Copy link
Contributor

@gyorb I cannot recall the exact values but the deviation of the times to analyzer the TUs can be quite big. So I think we should measure the distribution on some non-trivial projects before setting a default timeout. In case, we do not want to do this now, I would go with no default timeout and having a ticket to do the measurement later and set a reasonable default based on that.

The other problem is while the analyzer is evolving this timeout might need to be reevaluated later on.

@sylvestre
Copy link
Contributor

As I run codechecker with all checkers enabled on Firefox code, I think I have a good use case ;)
So, I volunteer to test the commit once it landed!

@whisperity whisperity force-pushed the kill-hung-analyzers branch 2 times, most recently from b6ddd0d to 3528366 Compare December 2, 2017 11:11
@whisperity
Copy link
Contributor Author

I've updated the code so that "No timeout" is the default behaviour. If we intend to do a default later on, this will be now easier, as the "No default" is handled well in the code, instead of having None as timeout raising an error.

@whisperity whisperity force-pushed the kill-hung-analyzers branch 2 times, most recently from 9b8a29f to f2ffb8d Compare December 2, 2017 11:44
called. Set up a timeout for the analysis.
"""
timeout_cleanup[0] = util.setup_process_timeout(
analyzer_process, analysis_timeout, signal.SIGKILL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A sigterm would not be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Processes can trap SIGTERMs and decide not to give a damn about them, which will result in the process' stop never happening. (The default value for setup_process_timeout is indeed SIGTERM however.)

@gyorb
Copy link
Contributor

gyorb commented Dec 4, 2017

@sylvestre it would be great if you could try it out!

@gyorb gyorb removed this from the release 6.3 milestone Dec 6, 2017
Copy link
Contributor

@gyorb gyorb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's start with a 30min default timeout for an analysis. We can fine tune it later if needed. Otherwise LGTM.

@whisperity
Copy link
Contributor Author

@gyorb What is the state of this patch right now? Can this go?

type=int,
dest='timeout',
required=False,
default=1800, # 30 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be minutes right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, typo, will fix.

@Xazax-hun
Copy link
Contributor

I am still opposed to having a default timeout otherwise looks good.

@gyorb
Copy link
Contributor

gyorb commented Jan 16, 2018

@Xazax-hun we discussed that a larger default value would be better maybe 1 hour?

@whisperity do we print out something if the analysis was stopped because of the timeout (some type of warning?) so the user knows that the timeout should be incremented or the analyzer configuration should be changed?

@whisperity
Copy link
Contributor Author

@gyorb See analysis_manager.py:297

image

The user is given a warning about the timeout, a clarification that this is counted as a failed analysis (with all its proceedings, such as creation of failure ZIP), and the analyzer STDERR is also prepended with a token, which is then printed to the user, and written into the ZIP.

@Xazax-hun
Copy link
Contributor

The problem is that the timeot is very specific to a project. If a project has big branching factor or using unity build I can imagine multiple hour long analysis actions. It is very rare to have infinite loops in the checks, I have yet to see one. So I feel like we are trying to solve a problem that does not really exists at the cost of making the default settings unsuitable for some legitimate use cases. Or is this a workaround for performance?

@whisperity
Copy link
Contributor Author

I have added a small change so that only greater than 0 timeouts actually create a timeout, specifying 0 explicitly disables it instead of meaning "analysis must happen instantly or failure".

@whisperity
Copy link
Contributor Author

@Xazax-hun So the story of the bug, read #1168. @sylvestre was having his CI loops for Firefox hanging seemingly indefinitely. At first we thought that it was clang hangs or some checker messes the whole thing up, so this patch was implemented. I think it was either @dkrupp's or @gyorb's idea.

We implemented this patch and gave @sylvestre the initial version, with which he tested his build, but it was still hanging. I was closely investigating his Jenkins output, and it turned out that no individual Clang invocation hung in his case (with 10 min timeout if I recall correctly...), but the whole CodeChecker analyze did.

The reason behind the hang was that he is running the static analysis with debug and alpha checkers enabled. Looking at his log files, I'm seeing function traces and a lot of other stack traces. Now CodeChecker's failure ZIP generator was not equipped to handle the format these special internal output generators make (I vaugely remember @martong was looking into it?), so we dropped the whole idea and told @sylvestre to turn off the debug checkers in the nightly build.

But because this patch was already implemented, we did not decide to scrap it, considering it useful to other people who really want to limit their analyses into some timeframe for any reason whatsoever.

@gyorb
Copy link
Contributor

gyorb commented Jan 18, 2018

If it was mainly a configuration problem, maybe we should disable the debug checkers by default because they provide output to debug the compiler and the checkers which we do not plan to parse up now.

Let's use 0 than as the default value so it will not be limited. If we run into this problem again we can change it.

@whisperity
Copy link
Contributor Author

I think forcibly disabling the debug checkers is a bad idea. Currently, even --enable-all does not enable them, so the user must explicitly say --enable debug to have them enabled.

However, it is a valid use case to run CodeChecker with the debug checkers enabled to generate output to help developing a checker, or Clang itself.

I have changed the default to be "No timeout". If a user specifies a positive integer as parameter, it will be the timeout. 0 and negative numbers behave the same way as if no timeout was given.

@gyorb gyorb merged commit c3fbecd into Ericsson:master Jan 23, 2018
@whisperity whisperity deleted the kill-hung-analyzers branch January 30, 2018 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
analyzer 📈 Related to the analyze commands (analysis driver) enhancement 🌟
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a watcher to kill stuck jobs
5 participants