Skip to content

Conversation

marun
Copy link
Contributor

@marun marun commented Nov 11, 2024

Why this should be merged

Previously, tmpnet would not report FATAL errors that prevented node startup - the only indication of failure was a context timeout. This change ensures that such errors are reported.

How this works

  • adds a goroutine to watch for FATAL in a node's main.log until the node reports healthy (or times out waiting for health)
  • adds a shared context that the goroutine can use to signal the loop waiting for the process context file so that the error can be propogated

This solution is a bit of a compromise, as there are some errors - e.g. a config file containing invalid json - that preclude the node from writing to main.log. Reading stdout and stderr directly from the child process in golang seems to prevent the child process from outliving the parent (which is required to start a network that will be reused). A future improvement could include redirecting stderr and stdout to a file (i.e. by running the avalanchego binary with bash -c "avalanchego --config-file=..." &> out.txt &) and similarly watching for text indicating node startup failure so that it could be reported to the user.

How this was tested

Manually. Failure output:

Error: failed to start local node: failed to load process context for node "NodeID-3wZBgT2xsNjA3A4eJt5ahA578y9oGiFhy": [11-11|18:51:19.477] FATAL app/app.go:75 failed to initialize node {"error": "couldn't initialize API server: listen tcp 127.0.0.1:9650: bind: address already in use"}

Need to be documented in RELEASES.md?

N/A

@marun marun added the testing This primarily focuses on testing label Nov 11, 2024
@marun marun self-assigned this Nov 11, 2024
@marun marun force-pushed the tmpnet-report-fatals branch from 92d85bc to 689167d Compare November 11, 2024 17:53
// Read a line from the file
line, err := reader.ReadString('\n')
if err != nil {
if err.Error() == "EOF" {
Copy link
Contributor

@yacovm yacovm Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err == io.EOF ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

time.Sleep(waitInterval)
continue
} else {
_, _ = fmt.Fprintf(w, "error reading %s: %v\n", path, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why doesn't this fail the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is a best-effort attempt to try to provide a reason why a node can't start. I don't think it needs to be as strict as possible, it may well be that the node starts fine and this error is unrelated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i.e. the function fails startup only when it has a positive indication of failure (FATAL) rather than potential indication of failure (e.g. can't read file for whatever reason).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added mention of error-handling strategy to docstring.

return
}
}
if strings.Contains(line, "FATAL") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably consider adding a flag that would fail the test by default unless specified

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate as to why this would be desirable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because if we have a fatal in the logs and the test passes silently, we might have a bug we're overlooking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could a node FATAL and have the test pass silently? A FATAL always results in node exit, and as per your suggestion, there is now a check at the end of every test that checks that all validators are healthy (i.e. have not FATAL'd).

Note also that this PR is only intended to ensure a useful error when a node fails to start, it does not keep watching the logs during a test run. Doing so would require watching all logs (not just main.log) and to be able to start watching the logs of nodes that were started before a test run (i.e. when reusing a network).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I forgot about that addition, which I also code reviewed myself ;-)

Copy link
Contributor

@yacovm yacovm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, see if you agree with my other two comments

@marun marun added this pull request to the merge queue Nov 14, 2024
Merged via the queue into master with commit aaedc23 Nov 14, 2024
23 checks passed
@marun marun deleted the tmpnet-report-fatals branch November 14, 2024 18:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing This primarily focuses on testing
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants