-
Notifications
You must be signed in to change notification settings - Fork 8.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TCP Proxy: Fix corrupted hostname from partial connection read. #10454
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for kubernetes-ingress-nginx canceled.
|
Hi @chotiwat. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
@chotiwat: Cannot trigger testing until a trusted user reviews the PR and leaves an In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Hi @cpanato @tao12345666333, is there anything I need to do on my end to get this PR reviewed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm not familiar with this part of the code
maybe @rikatz can help
Hi @rikatz, would you be able to take a look at this PR when you get a chance? |
Hi @rikatz, ping on this issue 🙏 |
/retest |
|
Hi @longwuyuan, I've answered your questions below:
None. I couldn't find any issue describing the same problem.
Yes, as described in the PR body.
I wonder that myself. I can hazard a few guesses:
This doesn't mean the problem doesn't occur for other users of SSL passthrough though.
The connection handler for the proxy falls back to the default NGINX backend ( When this happens, the default backend will terminate the TLS and send a new request to the upstream per its generated NGINX configuration block, hence the HTTP-sent-to-HTTPS error. Setting backend-protocol to ingress-nginx/pkg/tcpproxy/tcp.go Lines 71 to 76 in 86f3af8
ingress-nginx/pkg/tcpproxy/tcp.go Lines 43 to 56 in 86f3af8
No, there isn't currently. I could add some though (see below).
I could modify the existing test suite if that sounds good to you. I didn't see it initially. It took some digging around the docs. |
|
|
Ok, I'll go ahead and try to add an e2e test for this.
I believe I've explained as clearly as I could in this PR. If you insist, I can create an issue, which would be a copy of this PR body and my previous answers to your questions, but I personally don't see any point in doing so. The mistake in the code should also be pretty clear from the developer's point of view as well. Hopefully, I'd be able to come up with an e2e test that would fail against the current
Yes, let's do that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the e2e tests that would have failed without this fix, and cherry-picked just the tests to #10988.
Unfortunately, it seems that this issue is harder to reproduce on a kind
cluster, perhaps because the traffic doesn't go through many hops and there are no other workloads contending for I/O, so I had to introduce virtual throttling to the tests.
The throttling is done by wrapping the connection returned from the client's DialContext
with a net.Conn
that writes to the connection at most chunkSize
bytes.
These tests would sometimes pass without the fix from this PR so I added retries with as well. Alternatively, we could make the ginkgo.MustPassRepeatedly
decoratorchunkSize
less than len(host)
but I don't know if it would make the tests significantly slower.
@@ -527,7 +527,7 @@ func newDeployment(name, namespace, image string, port int32, replicas int32, co | |||
{ | |||
Name: name, | |||
Image: image, | |||
Env: []corev1.EnvVar{}, | |||
Env: env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a bug in the e2e framework as well. This was preventing httpbun from getting the environment variables so it always ran as an HTTP server even though HTTPBUN_SSL_CERT
and HTTPBUN_SSL_KEY
are specified.
@@ -75,114 +75,211 @@ var _ = framework.IngressNginxDescribe("[Flag] enable-ssl-passthrough", func() { | |||
Status(http.StatusNotFound) | |||
}) | |||
|
|||
ginkgo.It("should pass unknown traffic to default backend and handle known traffic", func() { | |||
ginkgo.Context("when handling traffic", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I grouped these tests with ginkgo.Context()
so that we can reuse some of the common setup code.
"nginx.ingress.kubernetes.io/ssl-passthrough": "true", | ||
} | ||
|
||
ingressDef := framework.NewSingleIngressWithTLS(host, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When SSL passthrough is working correctly, we shouldn't need TLS settings on the NGINX side. The upstream server should be the one who terminates the TLS.
|
||
f.NewDeploymentWithOpts("echopass", | ||
framework.HTTPBunImage, | ||
80, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tripped me up a bit and caused me to spent some time trying to add an HTTPS port to the deployment. It turns out the httpbun image binds to port 80 by default no matter what the protocol it's listening for.
ENV HTTPBUN_BIND=0.0.0.0:80 |
So this sets up an HTTPS server listening on port 80 which technically works for the purpose of the tests.
test/e2e/settings/ssl_passthrough.go
Outdated
ForceResolve(f.GetNginxIP(), 443). | ||
Expect(). | ||
Status(http.StatusOK) | ||
ginkgo.It("should handle known traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com
Reason:
The proxy directs the traffic to NGINX. The configuration for the host specified in the HOST
header doesn't specify any TLS certificate, so NGINX tries to terminate TLS with the default certificate.
test/e2e/settings/ssl_passthrough.go
Outdated
ForceResolve(f.GetNginxIP(), 443). | ||
Expect(). | ||
Status(http.StatusNotFound) | ||
ginkgo.It("should handle insecure traffic with Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure without the fix:
400 Bad Request
Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic with Host header
case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true
) and is able to terminate TLS, but it tries to send the plain HTTP request to the HTTPS server because the configuration for the host specified in the HOST
header doesn't specify HTTPS
as the upstream protocol.
test/e2e/settings/ssl_passthrough.go
Outdated
} | ||
tries := 3 | ||
|
||
ginkgo.It("should handle known traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure without the fix:
Get "[https://testpassthrough.com:443/](https://testpassthrough.com/)": tls: failed to verify certificate: x509: certificate is valid for ingress.local, not testpassthrough.com
Reason:
The proxy directs the traffic to NGINX. There's no HOST
header so NGINX tries to terminate TLS with the default certificate because it doesn't know what server block to look for.
test/e2e/settings/ssl_passthrough.go
Outdated
f.WaitForNginxServer(hostBad, | ||
func(server string) bool { | ||
return strings.Contains(server, "listen 442") | ||
ginkgo.It("should handle insecure traffic without Host header", ginkgo.MustPassRepeatedly(tries), func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failure without the fix:
404 Not Found
Reason:
The proxy directs the traffic to NGINX. Unlike the should handle known traffic without Host header
case above, NGINX doesn't verify certificate validity (InsecureSkipVerify: true
) and is able to terminate TLS, but it doesn't have any hostname for routing and returns 404.
Evidence:
There are no instances of this error in this workflow run so I ran it locally with make kind-e2e-test FOCUS='insecure traffic without Host' SKIP_INGRESS_IMAGE_CREATION=false SKIP_E2E_IMAGE_CREATION=false
[FAILED]
Error Trace: /go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/reporter.go:47
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/chain.go:37
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:263
/go/src/k8s.io/ingress-nginx/test/e2e/framework/httpexpect/response.go:79
/go/src/k8s.io/ingress-nginx/test/e2e/settings/ssl_passthrough.go:241
/go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/v2@v2.15.0/internal/node.go:463
/go/src/k8s.io/ingress-nginx/.modcache/github.com/onsi/ginkgo/v2@v2.15.0/internal/suite.go:889
/usr/local/go/src/runtime/asm_arm64.s:1197
Error:
expected status equal to:
"200 OK"
but got:
"404 Not Found"
Test: [Flag] enable-ssl-passthrough With enable-ssl-passthrough enabled when handling traffic on throttled connections should handle insecure traffic without Host header
} | ||
} | ||
|
||
ginkgo.It("should handle known traffic without Host header", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
f.WaitForNginxServer(hostBad, | ||
func(server string) bool { | ||
return strings.Contains(server, "listen 442") | ||
ginkgo.It("should handle insecure traffic without Host header", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ForceResolve(f.GetNginxIP(), 443). | ||
Expect(). | ||
Status(http.StatusOK) | ||
ginkgo.It("should handle known traffic with Host header", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ForceResolve(f.GetNginxIP(), 443). | ||
Expect(). | ||
Status(http.StatusNotFound) | ||
ginkgo.It("should handle insecure traffic with Host header", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chotiwat thanks for the contribution. I am copy/pasting my comments on the other PR here as well
|
Of course. I can do that once the PR has been reviewed and I address all the comments.
Our use case is not unique at all. In fact, I believe it's one of the main reasons this SSL passthrough feature exists. Tools like
Here are a few resources that might be helpful for more productive conversations in the future:
|
608f37f
to
6d8a420
Compare
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: chotiwat The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/cherry-pick release-1.12 |
/cherry-pick release-1.11 |
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.12 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/cherry-pick release-1.10 |
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.11 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
@Gacko: once the present PR merges, I will cherry-pick it on top of release-1.10 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
4aa461b
to
1069c21
Compare
Co-authored-by: Marco Ebert <marco_ebert@icloud.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/triage accepted
/kind bug
/priority backlog
/ok-to-test |
What this PR does / why we need it:
We have run into intermittent "HTTP request sent to an HTTPS server" errors from our SSL passthrough ingresses. We found that the passthrough proxy would sometimes read incomplete data from the connection and cause corrupted hostname.
When this happens, the connection handler would fall back to the default server at
127.0.0.1:442
and cause the error if thenginx.ingress.kubernetes.io/backend-protocol: HTTPS
annotation is not specified.This PR fixes the issue by making sure that we fully read the Client Hello data from the connection by getting the total length from the TLS header.
Types of changes
Which issue/s this PR fixes
fixes #11491
fixes #11424
How Has This Been Tested?
We built a custom image to debug the issue and added some debug logs. For example, we updated
ingress-nginx/pkg/tcpproxy/tcp.go
Line 74 in 4bac120
to
Example log of corrupted data before the fix:
We can reproduce the issue by running
curl
in a loop. I've seen around 5% of errors from my machine with the number of requests as low as 10 enough to trigger the error. This fix eliminates the errors completely for us.Checklist: