Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing connection reset on reconfiguration with wildcard domain #6008

Open
PSanetra opened this issue Dec 11, 2023 · 10 comments
Open

Missing connection reset on reconfiguration with wildcard domain #6008

PSanetra opened this issue Dec 11, 2023 · 10 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.

Comments

@PSanetra
Copy link

PSanetra commented Dec 11, 2023

What steps did you take and what happened:

  • Configure *.example.com Ingress which routes to default-app
  • Open website with iframe to specific.example.com in chrome 119.0.6045.199 on linux
  • Configure specific.example.com Ingress which routes to my-app
  • Reload the iframe on open page (specific.example.com)

What did you expect to happen:

After reload I expected the iframe to show the content, served from the my-app service.

What did actually happen:

The original default-app was shown also after several iframe reloads. Opening the page in an incognito tab results in loading the actual my-app page.

Anything else you would like to add:

It seems like the browser is keeping the connection (probably HTTP 2) open. I would expect envoy to reset the connection if its routing is affected by a new configuration.

The following image shows the dag, when both ingresses are configured (I edited the image, but should show the relevant information)

contour-dag small

The following screenshots show the request timings of the two requests (first load and then reload) in chrome.

Initial request: Includes overhead for intial connection setup

Request on reload: Overhead for connection setup is missing

You see that the first request included additional overhead for Initial connection and SSL. I would expect the same on the second request as the second request should create a new connection for the new underlying service. The previous connection should have been reset.

Environment:

  • Contour version: 1.27.0
  • Envoy version: 1.27.2
  • Kubernetes version: (use kubectl version): v1.26.3
  • Cloud provider or hardware configuration: Azure

Probably related: envoyproxy/envoy#6767

@PSanetra PSanetra added kind/bug Categorizes issue or PR as related to a bug. lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Dec 11, 2023
@PSanetra
Copy link
Author

Note: The issue seems to appear with HTTP 1.1, too. So setting default-http-versions to support HTTP 1.1 exclusively does not work as a workaround.

@sunjayBhatia
Copy link
Member

sunjayBhatia commented Dec 11, 2023

Contour should be detecting mismatched SNI and Host headers (and send 421 responses), we use some custom Lua to configure Envoy to catch this, see:

func FilterMisdirectedRequests(fqdn string) *http.HttpFilter {
var target string
// fqdn can be "*" to match all hostnames or a wildcard prefix
// e.g. "*.foo"
if strings.HasPrefix(fqdn, "*") {
// When we have a wildcard hostname, we will have already matched
// the filter chain on an SNI that falls under the wildcard so we
// retrieve that and make sure the :authority header matches.
// See: https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/lua_filter#requestedservername
target = "request_handle:streamInfo():requestedServerName()"
} else {
// For specific hostnames we know the SNI we need to match the
// :authority header against so we can simplify the code.
target = `"` + strings.ToLower(fqdn) + `"`
}
code := `
function envoy_on_request(request_handle)
local headers = request_handle:headers()
local host = string.lower(headers:get(":authority"))
local target = %s
s, e = string.find(host, ":", 1, true)
if s ~= nil then
host = string.sub(host, 1, s - 1)
end
if host ~= target then
request_handle:respond(
{[":status"] = "421"},
string.format("misdirected request to %%q", host)
)
end
end
`
return &http.HttpFilter{
Name: "envoy.filters.http.lua",
ConfigType: &http.HttpFilter_TypedConfig{
TypedConfig: protobuf.MustMarshalAny(&lua.Lua{
DefaultSourceCode: &envoy_core_v3.DataSource{
Specifier: &envoy_core_v3.DataSource_InlineString{
InlineString: fmt.Sprintf(code, target),
},
},
}),
},
}
}

Mentioning this since it is the main focus of the linked Envoy issue

@PSanetra
Copy link
Author

@sunjayBhatia Can I support you in debugging this issue? I definitely get responses from the wrong backend after an initial connection. I even still receive responses from the wrong backend while I get responses from the correct backend in an incognito window.

I am not sure what you mean with mismatched SNI and Host headers. I guess there will be never a mismatch, as the wildcard domain is also valid for the specific domain, right? The issue is that the target backend changes when a new specific ingress is configured. So as soon as the specific ingress is configured, the old wildcard backend becomes invalid and the new specific backend becomes the only valid backend.

@sunjayBhatia
Copy link
Member

(I only mentioned the mismatched SNI and Host feature to make sure that was covered for posterity, since the linked Envoy issue talks about that case heavily)

The issue here generally is due to connection reuse/coalescing from the browser, I can reproduce this in chrome on mac with HTTP/2 and HTTP/1.1 (HTTP/2 disabled in contour).

If a browser reuses an existing connection for additional requests, Envoy does not "re-triage" further requests on the connection since by design this connection is already tied to a Listener filter chain. In this case that means that subsequent requests on that same connection will be processed by the wildcard *.example.com filter chain and routed to the default app rather than flow through the new specific.example.com filter chain.

By default we have an idle timeout on downstream connections set to 60s so you can observe the requests being routed to the correct backends if you let the connection get timed out.

As a workaround as well, you could use the max-requests-per-connection Listener setting https://projectcontour.io/docs/1.27/configuration/#listener-configuration (though this is a global setting so maybe not ideal)

@sunjayBhatia
Copy link
Member

The structure of our Envoy configuration in this situation is set up as the following hierarchy:

  • HTTPS Listener on secure port (default 8443)
  • Listener has multiple filter chains for different domains, in this case *.example.com and specific.example.com
  • each of these filter chains will match connections with the appropriate SNI
  • each filter chain has an HTTP filter chain that points to a route config
  • Each route config has a list of domain matches and a list of routes with matches

One thing we could do to help with this case is for each specific FQDN, add a domain match and route (that would match the specific host header) to any wildcard routeconfigs that exist (and match the specific route) as a catch all

@sunjayBhatia
Copy link
Member

cc @projectcontour/maintainers @projectcontour/contour-reviewers for any thoughts here

@sunjayBhatia
Copy link
Member

e.g. here is an example wildcard routeconfig

    {
     "version_info": "4",
     "route_config": {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "https/*.example.com",
      "virtual_hosts": [
       {
        "name": "*.example.com",
        "domains": [
         "*.example.com"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/",
           "headers": [
            {
             "name": ":authority",
             "string_match": {
              "safe_regex": {
               "regex": "^[a-z0-9]([-a-z0-9]*[a-z0-9])?\\.example\\.com(:[0-9]+)?"
              }
             }
            }
           ]
          },
          "route": {
           "cluster": "default/echo-1/80/da39a3ee5e"
          },
          "metadata": {
           "filter_metadata": {
            "envoy.access_loggers.file": {
             "io.projectcontour.name": "echo-wildcard",
             "io.projectcontour.namespace": "default",
             "io.projectcontour.kind": "HTTPProxy"
            }
           }
          }
         }
        ]
       }
      ],
      "request_headers_to_add": [
       {
        "header": {
         "key": "x-request-start",
         "value": "t=%START_TIME(%s.%3f)%"
        }
       }
      ],
      "ignore_port_in_host_matching": true
     },
     "last_updated": "2023-12-11T21:30:04.295Z"
    },

in this case, we could duplicate each route on specific.example.com into this config, each of which is augmented with a host/:authority header match on specific.example.com

@PSanetra
Copy link
Author

PSanetra commented Dec 12, 2023

As a workaround as well, you could use the max-requests-per-connection Listener setting https://projectcontour.io/docs/1.27/configuration/#listener-configuration (though this is a global setting so maybe not ideal)

This works for us as a workaround. Thank you!

Edit: We could further improve performance by disabling HTTP/2 and only supporting HTTP/1.1. I guess chrome is expecting short-living connections more on HTTP/1.1 than on HTTP/2.

@PSanetra
Copy link
Author

@sunjayBhatia as expected the workaround is really adding a lot of overhead, especially on many small requests, especially as chrome allows only 6 concurrent connections per origin. So it would still be important for us to get a fix without the need to set max-requests-per-connection to 1.

@sunjayBhatia sunjayBhatia added help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/needs-triage Indicates that an issue needs to be triaged by a project contributor. labels Dec 12, 2023
@sunjayBhatia
Copy link
Member

yep, only allowing one request per connection is definitely not a full solution

we'll have to do some work on an appropriate solution to this, my suggestion above on adding routes to the wildcard routeconfig might work, but needs more thinking/testing

contributions of course welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

2 participants