-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Heartbeat] Refactor TCP Monitor #17549
Conversation
Pinging @elastic/uptime (:uptime) |
Socks5 tests added
Loads of complexity comes from the amount of jobs (permutations) created per monitor configuration. Settings contributing to this are
|
if err != nil { | ||
return nil, err | ||
} | ||
jobs = append(jobs, wrappers.WithURLField(url, endpointJob)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this is a change in semantics introduced here or a while ago already. By creating a job per complete URL, we will have each Job do it's own DNS request. Originally only one job was created to do the initial DNS lookup, with follow up tasks per IP:Port combination. On Windows/MacOS DNS is cached automatically, but we have had incidents on Linux overloading DNS servers, due to missing caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at the code here, and I do think we can fix this so we restore the behavior before this patch. However, I agree with you that it's not a clear win for ports
because that code is tricky. I think for mode: all
it still makes sense to keep the behavior, because more people use that (and it provides benefits for icmp/tcp/http not just tcp).
However, for TCP, given that this is a very rarely used feature AFAICT I don't think the code complexity is worth maintaining the single DNS lookup. Additionally, there are workarounds in the rare case where DNS load is too high. One could always install a caching resolver on the same box as heartbeat in the worst case.
heartbeat/monitors/resolver.go
Outdated
mapping map[string][]net.IP | ||
} | ||
|
||
func CreateStaticResolver() staticResolver { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not return private types from an exported function/method. Why do you need CreateStaticResolver to be exported? If it is only for testing, how about having a special package with testing utilities, to make the intent more clear?
When introducing something like this resolver (so we don't have to modify the /etc/hosts file), then the resolver should become 'stackable', so we can fallback to another resolver if the hostname has not been configured.
If we use this for testing, how about accepting a map[string][]net.IP
as parameter to CreateStaticResolver
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't need to be exported, just an oversight :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it back, it does need to be exported because it's used by the tcp monitor package tests. I could move it there, but I think it makes more sense to keep it here since we'll eventually use it in tests for http/icmp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively I could extract all this resolver stuff to its own package, but I'm indifferent on that matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's purpose is to be used for testing only, then we should at least have this reflected in the package name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, StdResolver
is used for production, StaticResolver
is used in another package for test purposes. I could break it out to its own package, but that seems a bit excessive, since that package would have to export it.
We could put StaticResolver
in a test file in the tcp
package, but in the future we'll probably have tests that want to reuse it from the http
and icmp
monitors.
What do you think the best way forward is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we think of other testing helpers? E.g. some helper that would drive a 'job' with it's followup tasks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been moved to the tcp test package for now.
@urso thanks for the review. I'd rather not make too many changes in this PR re mode/ports/hosts . The original purpose here was fixing the TLS hostname bug, and we've already gone far past that point, and cleaned up the code substantially. I agree that it's worth considering deprecating |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only two nit-picks. Ship it!
@@ -45,7 +45,7 @@ import ( | |||
// } | |||
// } | |||
func TCPDialer(to time.Duration) NetDialer { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I see that the file net.go
contains only dialers. Do you think we can rename it to dialers.go
?
Also, there are some functions in util.go
that refer to transport.Dialer
, I wonder if we can divide the stuff in these two files into dialers.go
and layers.go
. We could get rid of utils ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, good call.
"strconv" | ||
"testing" | ||
"time" | ||
|
||
"github.com/elastic/go-lookslike/validator" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please sort imports
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, I reset my IDE and lost my auto-sorter.
Refactors the TCP monitor to make the code easier to follow, more testable, and fixes elastic#17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging elastic#17123 I frequently found myself lost tracing the execution. This new code should be simpler to understand for a few reasons: Less code (almost a 2x reduction) and fewer, simpler, better organized functions Less variable passing/renaming due to use of struct for key config variables More consistent and descriptive variable names Creation of the dialer as late as possible, to remove the confusing partial states, and clarity as to when which dialer layers are used. Adds (frustratingly tricky) integration tests for elastic#17123 using mismatched TLS certs, and also against a real SOCKS5 proxy Adds, for testing only, the ability to override the real network resolver for TCP checks which is invaluable in debugging TLS checks that depend on setting hostnames correctly. In the future if we decide to let users use a custom DNS resolver this will be nice. Reorganized giant TCP test file into multiple files (cherry picked from commit dfe8c4b)
Refactors the TCP monitor to make the code easier to follow, more testable, and fixes #17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging #17123 I frequently found myself lost tracing the execution. This new code should be simpler to understand for a few reasons: Less code (almost a 2x reduction) and fewer, simpler, better organized functions Less variable passing/renaming due to use of struct for key config variables More consistent and descriptive variable names Creation of the dialer as late as possible, to remove the confusing partial states, and clarity as to when which dialer layers are used. Adds (frustratingly tricky) integration tests for #17123 using mismatched TLS certs, and also against a real SOCKS5 proxy Adds, for testing only, the ability to override the real network resolver for TCP checks which is invaluable in debugging TLS checks that depend on setting hostnames correctly. In the future if we decide to let users use a custom DNS resolver this will be nice. Reorganized giant TCP test file into multiple files (cherry picked from commit dfe8c4b)
What does this PR do?
Refactors the TCP monitor to make the code easier to follow, more testable, and fixes #17123 where TLS server name was not correctly sent. This is important because the code had accrued a lot of cruft and become very hard to follow. There were many wrappers and intermediate variable names that often subtly changed names as they crossed various functions. When debugging #17123 I frequently found myself lost tracing the execution.
This new code should be simpler to understand for a few reasons:
Checklist
CHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
Run the sample config in #17123