Terminate analyzer invocation once a timeout passes to prevent hanging #1203

whisperity · 2017-12-01T13:56:51Z

Resolves #1168.

CodeChecker analyze [...] --timeout 600 will set a 600-second timeout (10 minutes; this is the default value!) for each and every analyzer invocation made by CodeChecker.

If the analysis takes longer than this timeout, the process is killed, and the analysis is considered a failed one, with all the "failure-ZIP-creation" and other code executing.
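A rough sketch of the idea, for readers who want to see the shape of such a watchdog. This is not the code from the patch: `start_watchdog` and `run_with_timeout` are hypothetical names, and only the behaviour (kill on timeout, report -1, have a cleanup function that tells whether the kill happened) is taken from this thread. It assumes a POSIX system, since `signal.SIGKILL` does not exist on Windows.

```python
import signal
import subprocess
import sys
import threading


def start_watchdog(proc, timeout_sec, signum=signal.SIGTERM):
    """Hypothetical helper: send `signum` to `proc` after `timeout_sec`.

    Returns a cleanup function that cancels the watchdog and reports
    whether the watchdog fired (True) or the process ended by itself (False).
    """
    fired = threading.Event()

    def on_timeout():
        # Only signal a process that is still running.
        if proc.poll() is None:
            fired.set()
            try:
                proc.send_signal(signum)
            except OSError:
                pass  # The process exited in the meantime.

    timer = threading.Timer(timeout_sec, on_timeout)
    timer.daemon = True
    timer.start()

    def cleanup():
        timer.cancel()
        return fired.is_set()

    return cleanup


def run_with_timeout(cmd, timeout_sec):
    """Run `cmd`; return (timed_out, return_code, stderr_text)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    cleanup = start_watchdog(proc, timeout_sec, signal.SIGKILL)
    _, err = proc.communicate()
    timed_out = cleanup()
    # A timed-out run is reported with -1, as described above.
    return timed_out, (-1 if timed_out else proc.returncode), err
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 0.5)` should come back after roughly half a second with `timed_out` set.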

How to spot a timed-out analysis in the failure ZIP?

The return-code will be -1.
The stderr will contain the following prefix before the analyzer's real stderr:
>>> CodeChecker: Analysis timed out after 600 seconds. <<<
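The two markers above could be checked along these lines. This is only a sketch: how the return code and stderr are stored inside the failure ZIP is not shown in this thread, so the helpers (hypothetical names `mark_timed_out` and `looks_timed_out`) take plain values instead of reading the archive.

```python
# Prefix taken from the message format quoted above.
TIMEOUT_PREFIX = ">>> CodeChecker: Analysis timed out after"


def mark_timed_out(stderr_text, timeout_sec):
    """Prepend the timeout banner to the analyzer's real stderr."""
    banner = "{} {} seconds. <<<".format(TIMEOUT_PREFIX, timeout_sec)
    return banner + "\n" + stderr_text


def looks_timed_out(return_code, stderr_text):
    """True if both markers of a timed-out analysis are present."""
    return return_code == -1 and stderr_text.startswith(TIMEOUT_PREFIX)
```

Checking both markers together avoids misreading an unrelated failure that merely happens to return -1.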

@sylvestre This method sounds good to you? This will also, hopefully, make it easier to debug why a particular analysis invocation failed.

sylvestre · 2017-12-01T14:01:21Z

Looks great, thanks!

Xazax-hun · 2017-12-01T14:16:39Z

libcodechecker/analyze/analysis_manager.py

+            # Need to capture the "function pointer" returned by
+            # setup_process_timeout as reference, so that we may call it later.
+            # (By default, let's say that the process finished gracefully.)
+            timeout_cleanup = [lambda _: False]


Any reason for this being a list?

The comment explains it right above.

setup_process_timeout returns a function pointer, but the function which calls setup_... is called through a callback. We need to capture this function pointer so we may call it later when the analyzer finishes (a few lines after this.)

If we simply said timeout_cleanup = setup_... in the embedded function callback of run_analyzer, that assignment would create a new local variable that shadows this one, so there would be no way to retrieve the actual function pointer.

But lists and dicts are captured by reference, so we can assign to an element of the list and still use the value later on. By the time run_analyzer() returns, the analyzer has finished: either by exiting by itself, or by the watcher killing it. So the callback has executed (because it executes at the creation of the subprocess), which means this element points to an actual closure of setup_process_timeout.__cleanup_timeout(). We can use that to figure out whether the exit of the analyzer (and the return from run_analyzer()) happened because the analyzer quit, or because we murdered it.

(Okay, I put a better explanation about this into the code too.)
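To make the scoping point concrete, here is a stripped-down sketch. All names are made up for illustration (`run_with_callback` stands in for run_analyzer, and the lambdas stand in for the cleanup function); it is not the code from the patch.

```python
def run_with_callback(on_start):
    """Stand-in for run_analyzer(): calls the callback at 'process creation'."""
    on_start("fake-process")


def broken():
    cleanup = lambda: False  # Default: the process finished gracefully.

    def on_start(proc):
        # This binds a NEW local name inside on_start; the outer
        # `cleanup` is left untouched.
        cleanup = lambda: True

    run_with_callback(on_start)
    return cleanup()


def working():
    cleanup = [lambda: False]  # One-element list used as a mutable cell.

    def on_start(proc):
        # Item assignment mutates the shared list instead of rebinding a name.
        cleanup[0] = lambda: True

    run_with_callback(on_start)
    return cleanup[0]()
```

Here `broken()` still returns the default `False`, while `working()` returns `True`. In Python 3, declaring `nonlocal cleanup` inside the callback is another way to get the same effect without the list.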

Xazax-hun · 2017-12-01T14:21:46Z

I disagree with the default value. I have seen translation units running longer than that in real-world projects. I think by default we should not have any timeout and the correct value should be decided on a per-project basis.

Xazax-hun · 2017-12-01T14:23:07Z

Do we have a debugging guide? It would be great to have the info about how to spot these problems elsewhere, not just the PR description.

whisperity · 2017-12-01T14:53:44Z

@gyorb, @dkrupp Input on the timeout thing? Is it indeed better to have no timeout by default? I think that is dangerous. The inverse of the argument also applies: if you think the timeout is too strict, you can always increase it.

whisperity · 2017-12-01T14:57:03Z

@Xazax-hun I don't think there is a debugging guide per se. @martong How much do you rely on the format of stderr in your recently introduced scripts? I'm not sure if this extra line breaks them or not.

martong · 2017-12-01T15:26:38Z

@whisperity The debug scripts under scripts/debug_tools do not depend on the stderr. (They depend on the analyzer-command and the sources_root only from the zip)

gyorb · 2017-12-01T15:51:00Z

We talked about higher default values too like 30min for a translation unit analysis. What were the analysis times you experienced @Xazax-hun?

Could we extend the debug tools to check the analyzer return values and give some feedback if the analysis failed because of the timeout and not because some failure during the analysis?

Xazax-hun · 2017-12-01T15:56:18Z

@gyorb I cannot recall the exact values, but the deviation of the times to analyze the TUs can be quite big. So I think we should measure the distribution on some non-trivial projects before setting a default timeout. In case we do not want to do this now, I would go with no default timeout and a ticket to do the measurement later and set a reasonable default based on that.

The other problem is that, as the analyzer evolves, this timeout might need to be reevaluated later on.

sylvestre · 2017-12-01T15:57:56Z

As I run codechecker with all checkers enabled on Firefox code, I think I have a good use case ;)
So, I volunteer to test the commit once it has landed!

whisperity · 2017-12-02T11:11:32Z

I've updated the code so that "No timeout" is the default behaviour. If we intend to set a default later on, this will now be easier, as the "No default" case is handled well in the code, instead of None as the timeout raising an error.

gyorb · 2017-12-04T08:55:55Z

libcodechecker/analyze/analysis_manager.py

+                    called. Set up a timeout for the analysis.
+                    """
+                    timeout_cleanup[0] = util.setup_process_timeout(
+                        analyzer_process, analysis_timeout, signal.SIGKILL)


A sigterm would not be enough?

Processes can trap SIGTERMs and decide not to give a damn about them, which would result in the process never stopping. (The default signal for setup_process_timeout is indeed SIGTERM, however.)
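A small, self-contained demonstration of that difference, as a sketch only (POSIX-only; the child program and the `demo` function are made up for illustration). The child ignores SIGTERM, so only SIGKILL, which cannot be caught or ignored, ends it.

```python
import signal
import subprocess
import sys
import time

# Child program: ignore SIGTERM, announce readiness, then sleep.
CHILD = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready'); sys.stdout.flush()\n"
    "time.sleep(60)\n"
)


def demo():
    """Return (survived_sigterm, return_code_after_sigkill)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # Wait until the child has set SIGTERM to ignored.

    proc.send_signal(signal.SIGTERM)
    time.sleep(0.5)
    survived_sigterm = proc.poll() is None

    proc.send_signal(signal.SIGKILL)
    proc.wait()
    proc.stdout.close()
    return survived_sigterm, proc.returncode
```

On POSIX, `subprocess` reports a child killed by a signal with the negated signal number as its return code, so the second element is expected to be `-signal.SIGKILL`.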

gyorb · 2017-12-04T09:08:16Z

@sylvestre it would be great if you could try it out!

gyorb

Let's start with a 30min default timeout for an analysis. We can fine tune it later if needed. Otherwise LGTM.

whisperity · 2018-01-16T13:22:25Z

@gyorb What is the state of this patch right now? Can this go?

gyorb · 2018-01-16T14:41:46Z

libcodechecker/libhandlers/analyze.py

+                               type=int,
+                               dest='timeout',
+                               required=False,
+                               default=1800,  # 30 seconds.


It should be minutes right?

Yeah, typo, will fix.

Xazax-hun · 2018-01-16T14:42:48Z

I am still opposed to having a default timeout; otherwise, it looks good.

gyorb · 2018-01-16T14:48:49Z

@Xazax-hun we discussed that a larger default value would be better, maybe 1 hour?

@whisperity do we print out something if the analysis was stopped because of the timeout (some type of warning?) so the user knows that the timeout should be increased or the analyzer configuration should be changed?

whisperity · 2018-01-16T15:09:00Z

@gyorb See analysis_manager.py:297

The user is given a warning about the timeout, a clarification that this is counted as a failed analysis (with all its proceedings, such as creation of failure ZIP), and the analyzer STDERR is also prepended with a token, which is then printed to the user, and written into the ZIP.

Xazax-hun · 2018-01-16T15:10:46Z

The problem is that the timeout is very specific to a project. If a project has a big branching factor or uses unity builds, I can imagine analysis actions that take multiple hours. It is very rare to have infinite loops in the checks; I have yet to see one. So I feel like we are trying to solve a problem that does not really exist, at the cost of making the default settings unsuitable for some legitimate use cases. Or is this a workaround for performance?

whisperity · 2018-01-16T15:28:18Z

I have added a small change so that only timeouts greater than 0 actually create a timeout; specifying 0 explicitly disables it instead of meaning "analysis must happen instantly or fail".

whisperity · 2018-01-16T15:40:21Z

@Xazax-hun For the story of the bug, read #1168. @sylvestre was having his CI loops for Firefox hanging seemingly indefinitely. At first we thought that Clang hung or that some checker messed the whole thing up, so this patch was implemented. I think it was either @dkrupp's or @gyorb's idea.

We implemented this patch and gave @sylvestre the initial version, with which he tested his build, but it was still hanging. I was closely investigating his Jenkins output, and it turned out that no individual Clang invocation hung in his case (with a 10 min timeout, if I recall correctly...), but the whole CodeChecker analyze did.

The reason behind the hang was that he is running the static analysis with debug and alpha checkers enabled. Looking at his log files, I'm seeing function traces and a lot of other stack traces. Now, CodeChecker's failure ZIP generator was not equipped to handle the format these special internal output generators make (I vaguely remember @martong was looking into it?), so we dropped the whole idea and told @sylvestre to turn off the debug checkers in the nightly build.

But because this patch was already implemented, we decided not to scrap it, considering it useful to other people who really want to limit their analyses to some time frame for any reason whatsoever.

gyorb · 2018-01-18T15:58:52Z

If it was mainly a configuration problem, maybe we should disable the debug checkers by default, because they provide output for debugging the compiler and the checkers, which we do not plan to parse for now.

Let's use 0 then as the default value so it will not be limited. If we run into this problem again we can change it.

whisperity · 2018-01-18T16:09:22Z

I think forcibly disabling the debug checkers is a bad idea. Currently, even --enable-all does not enable them, so the user must explicitly say --enable debug to have them enabled.

However, it is a valid use case to run CodeChecker with the debug checkers enabled to generate output to help developing a checker, or Clang itself.

I have changed the default to be "No timeout". If a user specifies a positive integer as parameter, it will be the timeout. 0 and negative numbers behave the same way as if no timeout was given.
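The rule in the last paragraph can be sketched as follows. This is an illustration, not the code from the patch: the parser is a stand-alone toy, `effective_timeout` is a hypothetical helper, and using `None` as the "not given" default is an assumption.

```python
import argparse


def build_parser():
    """Toy parser with a --timeout option (value in seconds)."""
    parser = argparse.ArgumentParser(prog="analyze-sketch")
    parser.add_argument("--timeout",
                        type=int,
                        dest="timeout",
                        required=False,
                        default=None,  # Assumption: None means "not given".
                        help="Per-invocation timeout in seconds.")
    return parser


def effective_timeout(value):
    """Return the timeout in seconds, or None if no timeout applies.

    Only positive integers create a timeout; 0, negative numbers and a
    missing value all mean "no timeout".
    """
    if value is None or value <= 0:
        return None
    return value
```

With this, `--timeout 1800` yields 1800, while `--timeout 0`, `--timeout=-5` and omitting the option all yield `None`.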

whisperity added the enhancement 🌟 and analyzer 📈 labels Dec 1, 2017

whisperity added this to the release 6.3 milestone Dec 1, 2017

whisperity requested review from Xazax-hun and gyorb December 1, 2017 13:56

Xazax-hun reviewed Dec 1, 2017

whisperity force-pushed the kill-hung-analyzers branch 2 times, most recently from b6ddd0d to 3528366 Compare December 2, 2017 11:11

whisperity force-pushed the kill-hung-analyzers branch 2 times, most recently from 9b8a29f to f2ffb8d Compare December 2, 2017 11:44

gyorb reviewed Dec 4, 2017

gyorb removed this from the release 6.3 milestone Dec 6, 2017

gyorb suggested changes Dec 7, 2017

whisperity force-pushed the kill-hung-analyzers branch from f2ffb8d to 0211314 Compare December 7, 2017 14:58

whisperity requested a review from gyorb December 7, 2017 14:59

gyorb approved these changes Dec 7, 2017

gyorb reviewed Jan 16, 2018

whisperity force-pushed the kill-hung-analyzers branch from 0211314 to 0dab71a Compare January 16, 2018 15:10

whisperity force-pushed the kill-hung-analyzers branch from 0dab71a to b464cfd Compare January 16, 2018 15:27

Terminate analyzer invocation once a timeout passes to prevent hanging

1cd3472

whisperity force-pushed the kill-hung-analyzers branch from b464cfd to 1cd3472 Compare January 18, 2018 16:08

Xazax-hun approved these changes Jan 18, 2018

gyorb merged commit c3fbecd into Ericsson:master Jan 23, 2018

whisperity mentioned this pull request Jan 23, 2018

Don't collect failure ZIP if "debug" checkers were enabled #1311

Open

whisperity deleted the kill-hung-analyzers branch January 30, 2018 18:35
