-
Notifications
You must be signed in to change notification settings - Fork 379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Terminate analyzer invocation once a timeout passes to prevent hanging #1203
Conversation
Looks great, thanks! |
# Need to capture the "function pointer" returned by | ||
# setup_process_timeout as reference, so that we may call it later. | ||
# (By default, let's say that the process finished gracefully.) | ||
timeout_cleanup = [lambda _: False] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason for this being a list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment explains it right above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
setup_process_timeout
returns a function pointer, but the function which calls setup_...
is called through a callback. We need to capture this function pointer so we may call it later when the analyzer finishes (a few lines after this.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we'd simply say timeout_cleanup = setup_...
in the embedded function callback of run_analyzer
, that assignment creates a local scope and shadows this variable, there is no way to retrieve the actual function pointer.
But lists and dicts are captured by reference, so we can assign to it and still use the value later on. By the time run_analyzer()
returns, the analyzer has finished: either via exiting by itself, or the watcher killing it. So the callback has executed (because it executes at the creation of the subprocess), which means this function points to an actual closure of setup_process_timeout.__cleanup_timeout()
. Which we can use to figure out if the exit of the analyzer (and return from run_analyzer()
) happened from the analyzer quitting, or we murdering it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Okay, I put a better explanation about this into the code too.)
I disagree with the default value. I have seen translation units running longer than that in real-world projects. I think by default we should not have any timeout and the correct value should be decided on a per-project basis. |
Do we have a debugging guide? It would be great to have the info about how to spot these problems elsewhere, not just the PR description. |
@Xazax-hun I don't think there is a debugging guide per se. @martong How much do you rely on the format of |
@whisperity The debug scripts under |
We talked about higher default values too like 30min for a translation unit analysis. What were the analysis times you experienced @Xazax-hun? Could we extend the debug tools to check the analyzer return values and give some feedback if the analysis failed because of the timeout and not because some failure during the analysis? |
@gyorb I cannot recall the exact values but the deviation of the times to analyzer the TUs can be quite big. So I think we should measure the distribution on some non-trivial projects before setting a default timeout. In case, we do not want to do this now, I would go with no default timeout and having a ticket to do the measurement later and set a reasonable default based on that. The other problem is while the analyzer is evolving this timeout might need to be reevaluated later on. |
As I run codechecker with all checkers enabled on Firefox code, I think I have a good use case ;) |
b6ddd0d
to
3528366
Compare
I've updated the code so that "No timeout" is the default behaviour. If we intend to do a default later on, this will be now easier, as the "No default" is handled well in the code, instead of having |
9b8a29f
to
f2ffb8d
Compare
called. Set up a timeout for the analysis. | ||
""" | ||
timeout_cleanup[0] = util.setup_process_timeout( | ||
analyzer_process, analysis_timeout, signal.SIGKILL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A sigterm would not be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Processes can trap SIGTERM
s and decide not to give a damn about them, which will result in the process' stop never happening. (The default value for setup_process_timeout
is indeed SIGTERM
however.)
@sylvestre it would be great if you could try it out! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start with a 30min default timeout for an analysis. We can fine tune it later if needed. Otherwise LGTM.
f2ffb8d
to
0211314
Compare
@gyorb What is the state of this patch right now? Can this go? |
type=int, | ||
dest='timeout', | ||
required=False, | ||
default=1800, # 30 seconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be minutes right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, typo, will fix.
I am still opposed to having a default timeout otherwise looks good. |
@Xazax-hun we discussed that a larger default value would be better maybe 1 hour? @whisperity do we print out something if the analysis was stopped because of the timeout (some type of warning?) so the user knows that the timeout should be incremented or the analyzer configuration should be changed? |
@gyorb See The user is given a warning about the timeout, a clarification that this is counted as a failed analysis (with all its proceedings, such as creation of failure ZIP), and the analyzer STDERR is also prepended with a token, which is then printed to the user, and written into the ZIP. |
0211314
to
0dab71a
Compare
The problem is that the timeot is very specific to a project. If a project has big branching factor or using unity build I can imagine multiple hour long analysis actions. It is very rare to have infinite loops in the checks, I have yet to see one. So I feel like we are trying to solve a problem that does not really exists at the cost of making the default settings unsuitable for some legitimate use cases. Or is this a workaround for performance? |
0dab71a
to
b464cfd
Compare
I have added a small change so that only greater than 0 timeouts actually create a timeout, specifying |
@Xazax-hun So the story of the bug, read #1168. @sylvestre was having his CI loops for Firefox hanging seemingly indefinitely. At first we thought that it was clang hangs or some checker messes the whole thing up, so this patch was implemented. I think it was either @dkrupp's or @gyorb's idea. We implemented this patch and gave @sylvestre the initial version, with which he tested his build, but it was still hanging. I was closely investigating his Jenkins output, and it turned out that no individual Clang invocation hung in his case (with 10 min timeout if I recall correctly...), but the whole The reason behind the hang was that he is running the static analysis with debug and alpha checkers enabled. Looking at his log files, I'm seeing function traces and a lot of other stack traces. Now CodeChecker's failure ZIP generator was not equipped to handle the format these special internal output generators make (I vaugely remember @martong was looking into it?), so we dropped the whole idea and told @sylvestre to turn off the debug checkers in the nightly build. But because this patch was already implemented, we did not decide to scrap it, considering it useful to other people who really want to limit their analyses into some timeframe for any reason whatsoever. |
If it was mainly a configuration problem, maybe we should disable the Let's use |
b464cfd
to
1cd3472
Compare
I think forcibly disabling the debug checkers is a bad idea. Currently, even However, it is a valid use case to run CodeChecker with the debug checkers enabled to generate output to help developing a checker, or Clang itself. I have changed the default to be "No timeout". If a user specifies a positive integer as parameter, it will be the timeout. |
CodeChecker analyze [...] --timeout 600
will set a 600 seconds (10 minutes, this is the default value!) timeout for each and every analyzer invocation made by CodeChecker.If the analysis takes longer than this timeout, the process is killed, and the analysis is considered a failed one, with all the "failure-ZIP-creation" and other code executing.
How to spot a timed-out analysis in the failure ZIP?
return-code
will be-1
.stderr
will contain the following prefix before the analyzer's real stderr:>>> CodeChecker: Analysis timed out after 600 seconds. <<<
@sylvestre This method sounds good to you? This will also, hopefully, make it easier to debug why a particular analysis invocation failed.