Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Per-Store TLS configuration in Thanos Query #977

Open
daviddyball opened this issue Mar 26, 2019 · 74 comments
Open

[Feature] Per-Store TLS configuration in Thanos Query #977

daviddyball opened this issue Mar 26, 2019 · 74 comments

Comments

@daviddyball
Copy link

Request

Feature request to add support for per-store tls configuration.

Use Case

I have store running alongside query on my main "monitoring" cluster and we don't use TLS for comms between pods on the same cluster. I also want to pull metrics from my remote Prometheus instances which exist on different clusters, but these endpoints are TLS encrypted because they are cross-cluster

Suggestion

Change the --store argument to use a more definitive style for remote endpoints e.g.:

  • grpc://<ip-address>:<port>
  • grpc+tls://<ip-address>:<port>

This would allow specifying per-store TLS settings and allow a combination of secure and insecure comms for store endpoints.

Related PR from @adrien-f : #899

@daviddyball
Copy link
Author

Definitely going to try and throw something together for this.

Ideas so far are:

a) use JDBC-like URLs to configure individual stores (e.g. grpc+tls://<addr>:<port>/?Option1=y&Option2=n)

or

b) use file-based configuration like the object store.

@bwplotka If I go for option "b" should I re-use the object-store file config stuff, or are you happy for me to implement a query-store specific config file?

@bwplotka
Copy link
Member

So I think we talked about it here: #899

I would say for this specifc use case, let's start with option B and essentially limit, by saying --store are explicitly out of this feature.

I re-use the object-store file config stuff

What? We should never mix object store stuff with discovery stuff. In fact you have ready FileSD logic which is compatible with Prometheus FileSD. It allows arbitrary labels, so we can implement easily this:

One solution would be to do that work in the FileSD config and have labels such as tls_server_name to override TLS config.

I would say this is the way to go. What do you think? (:

Thanks for proposing this and sorry for massive delay!

@daviddyball
Copy link
Author

@bwplotka Apologies, I meant I can look at writing some similar code to what is in place for FileSD to implement FileStoreConfig or something :-P Definitely don't want to mix the two 😆

@bwplotka
Copy link
Member

Yea, I would say no specifc config just literally special labels we document properly in our docs

@povilasv
Copy link
Member

IMO people just coud just run envoy/nginx/etc proxy and handle TLS outside of Thanos. As then we don't have to work on cert reloading all the different options, etc.

@bwplotka wdyt?

@mleklund
Copy link

mleklund commented Apr 16, 2019

I attempted to use nginx, but it appears that the grpc call is not sending a host header. Can anyone confirm this is the case? (Or at least anyone with better knowledge of the code and go than myself)

@povilasv
Copy link
Member

povilasv commented Apr 16, 2019

Why does it need that?

I think you should be able to proxy all the traffic to GPRC backend.

Something like:

http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent"';
 
    server {
        listen 80 http2;
 
        access_log logs/access.log main;
 
        location / {
            # Replace localhost:50051 with the address and port of your gRPC server
            # The 'grpc://' prefix is optional; unencrypted gRPC is the default
            grpc_pass grpc://localhost:50051;
        }
    }
}

@mleklund
Copy link

In my case I would like to park the grpc on a virtual host, instead of running a special nginx instance just for grpc.

@povilasv
Copy link
Member

I don't think we pass host headers nor we have an option to configure them. So right now I don't think it will work

@daviddyball
Copy link
Author

@mleklund Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@povilasv I'm not sure I understand your objection to adding this feature. Do you think it serves no purpose?

@povilasv
Copy link
Member

povilasv commented Apr 17, 2019

I'ts not an objection, I just wanted to point out that it won't work right now.

Whether we want to add host header or no is a seperate discussion and it will take a bit of time.

BTW a lot of people right now run isitio / linkerd, the sidecar per service approach, so I think it's not a wide problem.

@mleklund
Copy link

mleklund commented Apr 17, 2019

Up to this point nginx ingress has worked for us and I would rather not complicate things any more then I need to. I have been looking at envoy and ambassador, and it may be a good fit down the road. For the moment I am using port based tcp for nginx ingress since most of our use cases are behind vpn or firewall.

I was curious if I was missing something or if there were plans to pass header info since grpc is built on top of http(2).

@benjaminhuo
Copy link
Contributor

benjaminhuo commented May 8, 2019

@daviddyball , I'm not sure if I understand your use case correctly, but does this pr #508 solve your problem?

There are flags like --grpc-server-tls-cert/--grpc-client-tls-cert added by this pr in query doc and other components' docs

@daviddyball
Copy link
Author

@benjaminhuo I don't think it does solve the problem here. The issue is that I'd like query to talk to different stores in different ways (e.g. TLS vs non-TLS) and there is no way to configure per-store endpoint TLS configuration

@povilasv
Copy link
Member

povilasv commented May 8, 2019

For now we won't be implementining Per Store TLS configuration. Our general suggestion is run sidecar proxy envoy / nginx / etc.

Hopefully we will update docs soon. We will close the ticket once the docs are there.

@caarlos0
Copy link

I would also like something like this.

My idea, which may be wrong, is:

  • a "central" cluster which gathers metrics from all other clusters
  • query instances on the central cluster talk to query instances in other clusters over TLS
  • query instances on the central cluster also talks to local cluster's store and sidecars
    • right now this needs to be over TLS as well, and needs to have one endpoint for each sidecar, an extra hop et al

ideally, I would like to define something like:

    - query.cluster1.foobar.com:443
    - query.cluster2.foobar.com:443
    - query.cluster3.foobar.com:443
    - query.cluster4.foobar.com:443
    - dnssrv+_grpc._tcp.thanos-sidecar-grpc.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-query-grpc.thanos.svc.cluster.local

so, it seems that without setting --grpc-client-tls-secure the secure endpoints won't work, and if I enable it the dnssrv records won't work :(

as far as I understand there is no way of doing this right now.

@benjaminhuo
Copy link
Contributor

benjaminhuo commented Jul 16, 2019

Try getambassador.io for a nice GRPC proxy. It's what we ended up using to get around the problem of not being able to mix TLS and non-TLS stores.

@daviddyball I'm considering using ambassador as an edge proxy in each cluster too.
Ambassador has tls termination ability, and we can use it to accept tls connection from central monitoring cluster.

But for the central monitoring cluster, is it possible for ambassador to originate tls connection to the remote cluster(egress grpc tls traffic)?

Would you share your config? thanks.

@bwplotka @caarlos0 , I found a blog post https://improbable.io/blog/thanos-architecture-at-improbable saying

In terms of requests to a remote cluster, Envoy has been used securely to proxy our request between many clusters; meaning that a request will go via an Envoy sidecar, an edge Envoy egress proxy, and over the public internet to an edge Envoy ingress proxy (all over a secure connection). This latter will then forward the request onto the Thanos Sidecar to retrieve the data for the time period and labels specified. All of this is done using server streaming gRPC.

Just wondering where I can find envoy config examples like this for ingress and especially egress tls traffic?

Thanks very much
Ben

@daviddyball
Copy link
Author

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

@benjaminhuo
Copy link
Contributor

benjaminhuo commented Jul 16, 2019

@benjaminhuo I had thanos-query talking directly to ambassador directly, so didn't use Ambassador for outbound. You'd probably want to use Envoy directly for that part, no?

Yeah, I'm considering using envoy to manage outbound tls.

Then you've the same tls client config for all the store specified in thanos query including the one in the same cluster as thanos query?

@daviddyball
Copy link
Author

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints 👍

@benjaminhuo
Copy link
Contributor

We're not validating TLS at the moment @benjaminhuo , so there's no TLS config on query, just a list of store endpoints 👍

I see :)

@bwplotka
Copy link
Member

It looks like there are mix of things in this feature requests.

So is it about host header or per store TLS config?

@daviddyball
Copy link
Author

Per-store TLS config. The ability to specify TLS for some stores and disable it for others. We've since worked around the issue, so if it doesn't fit with the intended design of Thanos, then it can be closed as Won't Fix 👍

@max77p
Copy link

max77p commented May 12, 2021

Has anyone been assigned to this? Would love this feature. At the moment we can't connect http store if grpc tls config is passed in

@jppitout
Copy link

Just wondering if there has been any more progress?

@max77p
Copy link

max77p commented Jun 21, 2021

@jppitout i am waiting for the same. @bwplotka any update on this?

@Namanl2001
Copy link
Contributor

I have started working and currently designing the proposal for this issue.

At the moment we can't connect http store if grpc tls config is passed in

@max77p which store endpoints are you referring to here? endpoints on local instance/remote instances/both?

@max77p
Copy link

max77p commented Jun 22, 2021

@Namanl2001 by local instance do you mean side car that is in same namespace? and by remote do you mean side car in different cluster altogether?

To clarify from my side, I can add multiple sidecars to thanos query IF I use service name in my kubernetes cluster. But the moment I add a sidecar which is behind https and the respective TLS certs, I am unable to add any other side car.

@stale
Copy link

stale bot commented Sep 3, 2021

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Sep 3, 2021
@Namanl2001
Copy link
Contributor

Working PR #4389

@stale stale bot removed the stale label Sep 3, 2021
@stale
Copy link

stale bot commented Jan 9, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label Jan 9, 2022
@stale
Copy link

stale bot commented Mar 2, 2022

Closing for now as promised, let us know if you need this to be reopened! 🤗

@stale stale bot closed this as completed Mar 2, 2022
@matej-g
Copy link
Collaborator

matej-g commented Mar 3, 2022

Looks like this is still unresolved

@matej-g matej-g reopened this Mar 3, 2022
@stale stale bot removed the stale label Mar 3, 2022
@stale
Copy link

stale bot commented May 2, 2022

Hello 👋 Looks like there was no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next two weeks, this issue will be closed (we can always reopen an issue if we need!). Alternatively, use remind command if you wish to be reminded at some point in future.

@stale stale bot added the stale label May 2, 2022
@GiedriusS GiedriusS added dont-go-stale Label for important issues which tells the stalebot not to close them and removed stale labels May 2, 2022
@tal-ayalon
Copy link

What is the status?

@MysterioO
Copy link

Hi all, is somebody working on this? We also have indications that this issue is not resolved.

@rgarcia89
Copy link
Contributor

This is a very necessary feature

@brandonjwk
Copy link

New to setting up Thanos and I've just spent a good amount of time trying to debug my ingress and TLS setup only to discover this issue and realising that you can only either have TLS-enabled stores with the --grpc-client-tls-secure flag or only non-TLS stores. Definitely a gotcha, probably deserving of a troubleshooting entry in the wiki.

@doctorpangloss
Copy link

I also got burned by this.

@doctorpangloss
Copy link

This is the extremely painful workaround: we're going to set up an envoyproxy proxy which forwards plaintext http/2 grpc traffic to your TLS grpc thanos-sidecar backend, no matter where it is.

In your cluster, you can create the envoy proxy this way:

# the purpose of this manifest is to convert
# https://some-address.example.com with a given Basic auth
# into a service http://some-thanos-sidecar.monitoring.svc.cluster.local with no SSL
# this works around https://github.com/thanos-io/thanos/issues/977
apiVersion: v1
kind: ConfigMap
metadata:
  name: some-thanos-sidecar-envoy
  namespace: monitoring
data:
  envoy.yaml: |
    static_resources:
      listeners:
        - name: listener_0
          address:
            socket_address: { address: 0.0.0.0, port_value: 8080 }
          filter_chains:
            - filters:
                - name: envoy.filters.network.http_connection_manager
                  typed_config:
                    "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                    stat_prefix: ingress_http
                    codec_type: AUTO
                    route_config:
                      name: local_route
                      virtual_hosts:
                        - name: local_service
                          domains: [ "*" ]
                          routes:
                            - match: { prefix: "/" }
                              route:
                                cluster: service_cluster
                                timeout: 0s
                                host_rewrite_literal: some-address.example.com
                          request_headers_to_add:
                            - header:
                                key: "Authorization"
                                # use the following to encode the value after Basic 
                                # echo -n "username:_some_password_" | base64
                                value: "Basic abcdefgxyz=="
                    http_filters:
                      - name: envoy.filters.http.router
                        typed_config:
                          "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
        - name: service_cluster
          http2_protocol_options: {}
          connect_timeout: 0.25s
          type: LOGICAL_DNS
          # Comment out the following line to test on v6 networks
          dns_lookup_family: V4_ONLY
          lb_policy: ROUND_ROBIN
          load_assignment:
            cluster_name: service_cluster
            endpoints:
              - lb_endpoints:
                  - endpoint:
                      address:
                        socket_address:
                          address: some-address.example.com
                          port_value: 443
          transport_socket:
            name: envoy.transport_sockets.tls
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
              common_tls_context:
                alpn_protocols: h2
              sni: some-address.example.com
    admin:
      address:
        socket_address: { address: 0.0.0.0, port_value: 8001 }
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: some-thanos-sidecar-envoy-proxy
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: some-thanos-sidecar-envoy-proxy
  template:
    metadata:
      labels:
        app: some-thanos-sidecar-envoy-proxy
    spec:
      containers:
        - name: envoy
          image: envoyproxy/envoy:v1.26.2
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: envoy-config
              mountPath: /etc/envoy
      volumes:
        - name: envoy-config
          configMap:
            name: some-thanos-sidecar-envoy
---
apiVersion: v1
kind: Service
metadata:
  name: some-thanos-sidecar-envoy-proxy
  namespace: monitoring
spec:
  selector:
    app: some-thanos-sidecar-envoy-proxy
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP

Observe an Authorization header is added. This makes sense if you are protecting your thanos sidecar in a remote cluster using authentication.

Now, add the following to your query service:

        - --endpoint=some-thanos-sidecar-envoy-proxy.monitoring.svc.cluster.local:80

Here's an example from my Flux HelmRelease manifest using the bitnami thanos chart:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: thanos
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      version: "12.6.2"
      chart: thanos
      sourceRef:
        kind: HelmRepository
        name: bitnami
      interval: 60m
  install:
    crds: Create
  upgrade:
    crds: CreateReplace
  # https://github.com/prometheus-community/helm-charts/blob/main/charts/kube-prometheus-stack/values.yaml
  values:
    # this is specific to your cluster
    existingObjstoreSecret: thanos-objstore-config
    query:
      enabled: true
      extraFlags:
        # works around https://github.com/thanos-io/thanos/issues/977
        - --endpoint=some-thanos-sidecar-envoy-proxy.monitoring.svc.cluster.local:80
        - --query.timeout=5m
        - --query.lookback-delta=15m
      dnsDiscovery:
        enabled: true
        sidecarsService: kube-prometheus-stack-thanos-discovery
        sidecarsNamespace: monitoring
    queryFrontend:
      enabled: true
      extraFlags:
        - --query-frontend.compress-responses
        - --query-range.split-interval=12h
        - --labels.split-interval=12h
        - --query-range.max-retries-per-request=10
        - --labels.max-retries-per-request=10
        - --query-frontend.log-queries-longer-than=10s
    storegateway:
      enabled: true
    compactor:
      enabled: true

To secure your thanos sidecar with basic auth, use an appropriate ingress configuration.

@sourcehawk
Copy link

sourcehawk commented Apr 15, 2024

I would also like something like this.

My idea, which may be wrong, is:

  • a "central" cluster which gathers metrics from all other clusters

  • query instances on the central cluster talk to query instances in other clusters over TLS

  • query instances on the central cluster also talks to local cluster's store and sidecars

    • right now this needs to be over TLS as well, and needs to have one endpoint for each sidecar, an extra hop et al

ideally, I would like to define something like:

    - query.cluster1.foobar.com:443
    - query.cluster2.foobar.com:443
    - query.cluster3.foobar.com:443
    - query.cluster4.foobar.com:443
    - dnssrv+_grpc._tcp.thanos-sidecar-grpc.thanos.svc.cluster.local
    - dnssrv+_grpc._tcp.thanos-query-grpc.thanos.svc.cluster.local

so, it seems that without setting --grpc-client-tls-secure the secure endpoints won't work, and if I enable it the dnssrv records won't work :(

as far as I understand there is no way of doing this right now.

5 years and still the exact same issue 😢

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet