-
Notifications
You must be signed in to change notification settings - Fork 807
[tmpnet] Watch for and report FATAL log entries on node startup #3535
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
92d85bc
to
689167d
Compare
tests/fixture/tmpnet/node_process.go
Outdated
// Read a line from the file | ||
line, err := reader.ReadString('\n') | ||
if err != nil { | ||
if err.Error() == "EOF" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err == io.EOF ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
time.Sleep(waitInterval) | ||
continue | ||
} else { | ||
_, _ = fmt.Fprintf(w, "error reading %s: %v\n", path, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why doesn't this fail the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is a best-effort attempt to try to provide a reason why a node can't start. I don't think it needs to be as strict as possible, it may well be that the node starts fine and this error is unrelated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e. the function fails startup only when it has a positive indication of failure (FATAL
) rather than potential indication of failure (e.g. can't read file for whatever reason).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added mention of error-handling strategy to docstring.
return | ||
} | ||
} | ||
if strings.Contains(line, "FATAL") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should probably consider adding a flag that would fail the test by default unless specified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate as to why this would be desirable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because if we have a fatal in the logs and the test passes silently, we might have a bug we're overlooking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How could a node FATAL and have the test pass silently? A FATAL always results in node exit, and as per your suggestion, there is now a check at the end of every test that checks that all validators are healthy (i.e. have not FATAL'd).
Note also that this PR is only intended to ensure a useful error when a node fails to start, it does not keep watching the logs during a test run. Doing so would require watching all logs (not just main.log) and to be able to start watching the logs of nodes that were started before a test run (i.e. when reusing a network).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I forgot about that addition, which I also code reviewed myself ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, see if you agree with my other two comments
Why this should be merged
Previously, tmpnet would not report FATAL errors that prevented node startup - the only indication of failure was a context timeout. This change ensures that such errors are reported.
How this works
This solution is a bit of a compromise, as there are some errors - e.g. a config file containing invalid json - that preclude the node from writing to main.log. Reading stdout and stderr directly from the child process in golang seems to prevent the child process from outliving the parent (which is required to start a network that will be reused). A future improvement could include redirecting stderr and stdout to a file (i.e. by running the avalanchego binary with
bash -c "avalanchego --config-file=..." &> out.txt &
) and similarly watching for text indicating node startup failure so that it could be reported to the user.How this was tested
Manually. Failure output:
Need to be documented in RELEASES.md?
N/A