Add multi tenant query federation #3250

simonswine · 2020-09-29T17:48:34Z

What this PR does:

This intends to allow cross tenant querying. The implemenation follows this proposal

Which issue(s) this PR fixes:

Fixes #923

Checklist/TODO

Merge Add tenant resolver #3486 first
Provide a design document with the details of the the form of the implementation (esp. what this would mean for configured tenant limits and metrics exposed about queries)
Tests updated
Documentation added
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

docs/configuration/config-file-reference.md

integration/querier_tenant_federation_test.go

simonswine · 2020-12-01T09:36:01Z

pkg/api/middlewares.go

+			var cacheGenNumber string
+			// join the strings for multiple tenants
+			// TODO: Figure out how sensible that is
+			for _, tenantID := range tenantIDs {
+				cacheGenNumber += cacheGenNumbersLoader.GetResultsCacheGenNumber(tenantID)
+			}


I am not too sure if I can cacheGenNumber like that

I don't think it's likely it will lead to collisions. However, it will increase the number of characters that prefix the Memcached item key string. This means that for multi-tenant queries across enough tenants caching could break due to the item key being too long. The existing limit is not very high at 150 characters. Multi-tenant queries will already have larger cache item keys since it will contain the concatenation of multiple tenant IDs. The cachegen number will exacerbate this issue further.

There are two options for dealing with this:

Ignore the cache generation for multi-tenant queries in this PR. It's not a commonly used feature and IMO it would be fine to ignore it for this PR. It

Consolidate the cachegen number into a small, consistent, unique string. One option off the top of my head is to convert the strings to integers and adding them together then use a string of the resulting value. The

My preference would be option 1 since it would reduce the scope of the PR a bit more and make it a bit easier to review. Cache generation handling can be added in a future PR.

I have deferred that to a future PR as suggested and added a TODO comment

pkg/cortex/cortex.go

jtlisi

This looks great, I added a few questions and nits. I want to go through it one more time before I add an approval. However, I didn't any major issues in my first pass.

docs/configuration/config-file-reference.md

pkg/querier/merge_queryable.go

pkg/cortex/cortex.go

pkg/querier/merge_queryable.go

jtlisi

LGTM

Most of the nits/suggestions are very minor except for the cachegen topic. Once that is sorted I'll approve and see if we can get a second reviewer on this.

jtlisi · 2020-12-04T16:54:38Z

pkg/api/middlewares.go

+			var cacheGenNumber string
+			// join the strings for multiple tenants
+			// TODO: Figure out how sensible that is
+			for _, tenantID := range tenantIDs {
+				cacheGenNumber += cacheGenNumbersLoader.GetResultsCacheGenNumber(tenantID)
+			}


I don't think it's likely it will lead to collisions. However, it will increase the number of characters that prefix the Memcached item key string. This means that for multi-tenant queries across enough tenants caching could break due to the item key being too long. The existing limit is not very high at 150 characters. Multi-tenant queries will already have larger cache item keys since it will contain the concatenation of multiple tenant IDs. The cachegen number will exacerbate this issue further.

There are two options for dealing with this:

Ignore the cache generation for multi-tenant queries in this PR. It's not a commonly used feature and IMO it would be fine to ignore it for this PR. It

Consolidate the cachegen number into a small, consistent, unique string. One option off the top of my head is to convert the strings to integers and adding them together then use a string of the resulting value. The

My preference would be option 1 since it would reduce the scope of the PR a bit more and make it a bit easier to review. Cache generation handling can be added in a future PR.

pkg/cortex/cortex.go

jtlisi · 2020-12-04T16:58:54Z

integration/querier_tenant_federation_test.go

+		"-querier.split-queries-by-interval": "24h",
+		"-querier.query-ingesters-within":    "12h", // Required by the test on query /series out of ingesters time range
+		"-frontend.memcached.addresses":      "dns+" + memcached.NetworkEndpoint(e2ecache.MemcachedPort),
+		"-tenant-federation.enabled":         "true",


thought: We should be able to enable this feature for every e2e test and not have any failures. Doing so could provide a good smoke test for the feature.

@jtlisi Not entirely sure if you meant to do this locally once or do you want to have that as part of the integration test run by CI?

This is still a TODO, will look into this tomorrow

No need to commit any changes here. Just smoke testing that the tests pass if you enable it for globally on your local should be enough.

I have done that today and it didn't not come up with any unexpected failure (apart from the Backward_compatibility tests, which is expected as the feature flag is not recognised by them)

pkg/querier/tenantfederation/tenant_federation.go

csmarchbanks

Nice work!

Do we want to have an entirely new package for tenantfederation? It seems like all that is there is a new type of queryable which could also just be added to the querier package. Similarly, should the configuration be a new top level, or nested under querier? One thing that would help me decide is if there is more cross cutting flags that will be needed? Right now the flag is only used in querier related code.

pkg/frontend/v1/frontend.go

pstibrany

Thanks for addressing my feedback. LGTM with few nit comments left.

pkg/querier/tenantfederation/merge_queryable.go

pstibrany · 2020-12-17T11:27:37Z

pkg/querier/tenantfederation/tenant_federation.go

+}
+
+func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
+	f.BoolVar(&cfg.Enabled, "tenant-federation.enabled", false, "If enabled on all components, queries can be federated across multiple tenants. The tenant IDs involved need to be specified separated by a `|` character in the `X-Scope-OrgID` header (experimental).")


If enabled on all components

All components is somewhat wide... this is only required in query-frontend and querier, right? No other components can use this at the moment. I think we should be bit more specific, to allow people configure it correctly and also not give them false impression that other components use it somehow.

I have aligned that to the CHANGELOG, see #3250 (comment)

pstibrany · 2020-12-17T11:31:20Z

pkg/util/validation/limits.go

+// given tenant IDs. A value of 0 means unlimted queriers, so the method will
+// return 0 only if all inputs have a limit of 0. Otherwise it is the minimum
+// of > 0 values.
+func MaxQueriersPerTenants(tenantIDs []string, maxQueriersPerTenant func(string) int) int {


I would suggest different name: MinimumOfMaxQueriersPerTenants or SmallestMaxQueriersPerTenant. Now it looks like returning "max" value.

pstibrany · 2020-12-17T11:33:00Z

CHANGELOG.md

@@ -5,6 +5,7 @@
 * [CHANGE] Querier: it's not required to set `-frontend.query-stats-enabled=true` in the querier anymore to enable query statistics logging in the query-frontend. The flag is now required to be configured only in the query-frontend and it will be propagated to the queriers. #3595
 * [CHANGE] Blocks storage: compactor is now required when running a Cortex cluster with the blocks storage, because it also keeps the bucket index updated. #3583
 * [CHANGE] Blocks storage: block deletion marks are now stored in a per-tenant global markers/ location too, other than within the block location. The compactor, at startup, will copy deletion marks from the block location to the global location. This migration is required only once, so you can safely disable it via `-compactor.block-deletion-marks-migration-enabled=false` once new compactor has successfully started once in your cluster. #3583
+* [FEATURE] Querier: Queries can be federated across multiple tenants. The tenants IDs involved need to be specified separated by a `|` character in the `X-Scope-OrgID` request header. This is an experimental feature, which can be enabled by setting `-tenant-federation.enabled=true` on all Cortex services. #3250


Why all? Let's be specific to avoid confusing users.

I think it needs to be enabled on all Cortex services, because otherwise e.g. the write path would interpret multiple tenants as a single new tenant. Once it is enabled the write path fails as it only supports a single tenant. I have aligned the CHANGELOG with the flags description.

This experimental feature allows queries to cover data from more than a single Cortex tenant and can be activated by supplying `-tenant-federation.enabled` to all cortex instances. To query multiple tenants a `|` separated tenant list can be specified in the `X-Scope-OrgID` header. The source tenant of a metric will be exposed in the label `__tenant_id__`. Signed-off-by: Christian Simon <simon@swine.de>

This ensures the limit is aggregated correctly of the setting of each individual tenant. It also implements the logic for the v2 query frontend. Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

Regexp matcher need to be initialized, this adds a test for regexp matcher and fixes the underlying issue. Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

…ication Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

To avoid any future regressions in the multi tenant support within the query path, a faillint command tests if TenantID is used and if it finds one, it suggestest using TenantIDs instead> Signed-off-by: Christian Simon <simon@swine.de>

Signed-off-by: Christian Simon <simon@swine.de>

simonswine · 2020-12-23T12:32:22Z

* `pkg/frontend/transport/handler.go` is still usig `tenant.TenantID`. How should it behave there?

* What do you think about adding a `faillint` rule to fail on `tenant.TenantID` used in `pkg/frontend/...`, `pkg/scheduler/...` and `pkg/querier/tenantfederation/...` to make sure we handle it everywhere (in the future)?

* What's about `tenant.TenantID`  usage in `pkg/querier/queryrange`? Have you checked where we should handle it? For example, I think `limitsMiddleware` should be tenants federation aware at least. This is another package where having a `faillint` may help to prevent future regressions.

@pracucci I think I have addressed all of the remaining points (I think 🙂 ).

Esp those ones had the biggest changes, so certainly are in need for another review:

The result cache gets skipped once there are more than one tenant b41a4df
limitsMiddleware is now using the "strictest" of the tenant's limits 0667242
Fail lint is checking the packages in the query path: e312687

pracucci · 2020-12-23T14:43:14Z

Thanks a lot @simonswine for your hard work and patiently address all comments! I agree on skipping the query results caching in this PR and address it in a follow up PR (given we have to design how to deal with cache invalidation done for deleted series).

This PR looks in a good shape, considering it's an experimental feature. We can merge this PR, so that we begin testing it, and in the meanwhile you can work on follow up work (eg. query results caching).

Signed-off-by: Christian Simon <simon@swine.de>

bhiravabhatla · 2021-01-20T10:35:26Z

Is this released as part of 1.6. Could not find this in release notes. Its already present in configuration documentation though.

pstibrany · 2021-01-20T10:36:52Z

Is this released as part of 1.6. Could not find this in release notes. Its already present in configuration documentation though.

It will be part of Cortex 1.7.

Documentation on the website currently reflects master branch, with latest (unreleased) changes.

cabrinha · 2021-02-11T02:24:20Z

In order to add Cortex as a datasource for grafana, when using auth and tenants... Whats the config for the datasource need to look like in order to see every tenant in one datasource?

simonswine · 2021-02-22T12:15:52Z

In order to add Cortex as a datasource for grafana, when using auth and tenants... Whats the config for the datasource need to look like in order to see every tenant in one datasource?

@cabrinha unfortunately there is no way to specify this currently. You need to list the tenants explicitly (separated by a |). If you care about this enough it would be good to create a new issue with a feature request and detailing your use case a bit more.

* Add tenant query federation This experimental feature allows queries to cover data from more than a single Cortex tenant and can be activated by supplying `-tenant-federation.enabled` to all cortex instances. To query multiple tenants a `|` separated tenant list can be specified in the `X-Scope-OrgID` header. The source tenant of a metric will be exposed in the label `__tenant_id__`. Signed-off-by: Christian Simon <simon@swine.de> * Aggregate the limit of maxQueriers correctly This ensures the limit is aggregated correctly of the setting of each individual tenant. It also implements the logic for the v2 query frontend. Signed-off-by: Christian Simon <simon@swine.de> * Fix tenant labels and make LabelNames more efficient Signed-off-by: Christian Simon <simon@swine.de> * Use tsdb_errors for error handling Signed-off-by: Christian Simon <simon@swine.de> * Fix uninitialized label matcher Regexp matcher need to be initialized, this adds a test for regexp matcher and fixes the underlying issue. Signed-off-by: Christian Simon <simon@swine.de> * Use map for filterValuesByMatchers to reduce allocations Signed-off-by: Christian Simon <simon@swine.de> * Address review suggestions Signed-off-by: Christian Simon <simon@swine.de> * Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code duplication Signed-off-by: Christian Simon <simon@swine.de> * Add limitations and experimental status to docs Signed-off-by: Christian Simon <simon@swine.de> * Remove unnecessary cast of defaultTenantLabel Signed-off-by: Christian Simon <simon@swine.de> * Handle query-range limits for multi tenant queries Signed-off-by: Christian Simon <simon@swine.de> * Skip results cache, if multi tenants query Signed-off-by: Christian Simon <simon@swine.de> * Add failint to ensure query path supports multiple tenants To avoid any future regressions in the multi tenant support within the query path, a faillint command tests if TenantID is used and if it finds one, it suggestest using TenantIDs instead> Signed-off-by: Christian Simon <simon@swine.de> * Align CHANGELOG line with the flag description Signed-off-by: Christian Simon <simon@swine.de> * Add a limitation about missing results cache Signed-off-by: Christian Simon <simon@swine.de>

* Add tenant resolver (cortexproject/cortex#3486) * Add tenant resolver package This implements the multi tenant resolver as described by the [proposal] for multi tenant query-federation. By default it behaves like before, but it's implementation can be swapped out. [proposal]: cortexproject/cortex#3364 Signed-off-by: Christian Simon <simon@swine.de> * Replace usages of `ExtractOrgID` Use TenantID or UserID depending on which of the methods are meant to be used. Signed-off-by: Christian Simon <simon@swine.de> * Replace usages of `ExtractOrgIDFromHTTPRequest` This is replaced by ExtractTenantIDFromHTTPRequest, which makes sure that exactly one tenant ID is set. Signed-off-by: Christian Simon <simon@swine.de> * Add methods to `tenant` package to use resolver directly Signed-off-by: Christian Simon <simon@swine.de> * Remove UserID method from Resolver interface We need a better definition for what we are trying to achieve with UserID before we can add it to the interface Signed-off-by: Christian Simon <simon@swine.de> * Update comment on the TenantID/TenantIDs Signed-off-by: Christian Simon <simon@swine.de> * Improve performance of NormalizeTenantIDs - reduce allocations by reusing the input slice during de-duplication Signed-off-by: Christian Simon <simon@swine.de> * Add multi tenant query federation (cortexproject/cortex#3250) * Add tenant query federation This experimental feature allows queries to cover data from more than a single Cortex tenant and can be activated by supplying `-tenant-federation.enabled` to all cortex instances. To query multiple tenants a `|` separated tenant list can be specified in the `X-Scope-OrgID` header. The source tenant of a metric will be exposed in the label `__tenant_id__`. Signed-off-by: Christian Simon <simon@swine.de> * Aggregate the limit of maxQueriers correctly This ensures the limit is aggregated correctly of the setting of each individual tenant. It also implements the logic for the v2 query frontend. Signed-off-by: Christian Simon <simon@swine.de> * Fix tenant labels and make LabelNames more efficient Signed-off-by: Christian Simon <simon@swine.de> * Use tsdb_errors for error handling Signed-off-by: Christian Simon <simon@swine.de> * Fix uninitialized label matcher Regexp matcher need to be initialized, this adds a test for regexp matcher and fixes the underlying issue. Signed-off-by: Christian Simon <simon@swine.de> * Use map for filterValuesByMatchers to reduce allocations Signed-off-by: Christian Simon <simon@swine.de> * Address review suggestions Signed-off-by: Christian Simon <simon@swine.de> * Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code duplication Signed-off-by: Christian Simon <simon@swine.de> * Add limitations and experimental status to docs Signed-off-by: Christian Simon <simon@swine.de> * Remove unnecessary cast of defaultTenantLabel Signed-off-by: Christian Simon <simon@swine.de> * Handle query-range limits for multi tenant queries Signed-off-by: Christian Simon <simon@swine.de> * Skip results cache, if multi tenants query Signed-off-by: Christian Simon <simon@swine.de> * Add failint to ensure query path supports multiple tenants To avoid any future regressions in the multi tenant support within the query path, a faillint command tests if TenantID is used and if it finds one, it suggestest using TenantIDs instead> Signed-off-by: Christian Simon <simon@swine.de> * Align CHANGELOG line with the flag description Signed-off-by: Christian Simon <simon@swine.de> * Add a limitation about missing results cache Signed-off-by: Christian Simon <simon@swine.de> * Restrict path segments in TenantIDs (cortexproject/cortex#4375) (cortexproject/cortex#4376) This prevents paths generated from TenantIDs to become vulnerable to path traversal attacks. CVE-2021-36157 Signed-off-by: Christian Simon <simon@swine.de> * Update nolint statement Co-authored-by: Bryan Boreham <bjboreham@gmail.com>

pull-request-size bot added the size/L label Sep 29, 2020

simonswine force-pushed the crosstenant branch from a72bd88 to 761a1d4 Compare October 23, 2020 11:03

pull-request-size bot added size/XXL and removed size/L labels Oct 23, 2020

simonswine force-pushed the crosstenant branch 3 times, most recently from fc84860 to 023f1f1 Compare October 23, 2020 11:29

simonswine mentioned this pull request Oct 23, 2020

Support multiple org IDs weaveworks/common#201

Closed

simonswine force-pushed the crosstenant branch 2 times, most recently from 1d43854 to 3ceeb5e Compare October 26, 2020 16:07

simonswine force-pushed the crosstenant branch from 3ceeb5e to 62363b0 Compare November 13, 2020 17:52

simonswine changed the title ~~Multi tenant queries proof-of-concept~~ WIP: Multi tenant query federation Nov 13, 2020

simonswine force-pushed the crosstenant branch 2 times, most recently from 2a41025 to d815857 Compare November 26, 2020 18:06

pull-request-size bot added size/XL and removed size/XXL labels Nov 26, 2020

simonswine force-pushed the crosstenant branch from d815857 to c3dfd31 Compare November 27, 2020 17:19

simonswine commented Nov 27, 2020

View reviewed changes

docs/configuration/config-file-reference.md Outdated Show resolved Hide resolved

simonswine commented Dec 1, 2020

View reviewed changes

integration/querier_tenant_federation_test.go Show resolved Hide resolved

simonswine commented Dec 1, 2020

View reviewed changes

pkg/cortex/cortex.go Outdated Show resolved Hide resolved

simonswine changed the title ~~WIP: Multi tenant query federation~~ Add multi tenant query federation Dec 1, 2020

simonswine force-pushed the crosstenant branch from c3dfd31 to c3432b1 Compare December 1, 2020 10:10

jtlisi reviewed Dec 2, 2020

View reviewed changes

simonswine force-pushed the crosstenant branch from c3432b1 to a6211ef Compare December 3, 2020 18:15

simonswine marked this pull request as ready for review December 3, 2020 18:20

jtlisi reviewed Dec 4, 2020

View reviewed changes

csmarchbanks reviewed Dec 4, 2020

View reviewed changes

pkg/frontend/v1/frontend.go Outdated Show resolved Hide resolved

simonswine force-pushed the crosstenant branch from a6211ef to 3bc1204 Compare December 8, 2020 17:51

pull-request-size bot removed the size/XL label Dec 8, 2020

pstibrany approved these changes Dec 17, 2020

View reviewed changes

simonswine added 14 commits December 23, 2020 12:18

Aggregate the limit of maxQueriers correctly

f047968

This ensures the limit is aggregated correctly of the setting of each individual tenant. It also implements the logic for the v2 query frontend. Signed-off-by: Christian Simon <simon@swine.de>

Fix tenant labels and make LabelNames more efficient

64c3918

Signed-off-by: Christian Simon <simon@swine.de>

Use tsdb_errors for error handling

42663d1

Signed-off-by: Christian Simon <simon@swine.de>

Fix uninitialized label matcher

7b94e53

Regexp matcher need to be initialized, this adds a test for regexp matcher and fixes the underlying issue. Signed-off-by: Christian Simon <simon@swine.de>

Use map for filterValuesByMatchers to reduce allocations

b22a22c

Signed-off-by: Christian Simon <simon@swine.de>

Address review suggestions

52b53a6

Signed-off-by: Christian Simon <simon@swine.de>

Add validation.SmallestPositiveNonZeroIntPerTenant to avoid code dupl…

9826281

…ication Signed-off-by: Christian Simon <simon@swine.de>

Add limitations and experimental status to docs

9bfc3af

Signed-off-by: Christian Simon <simon@swine.de>

Remove unnecessary cast of defaultTenantLabel

e51291b

Signed-off-by: Christian Simon <simon@swine.de>

Handle query-range limits for multi tenant queries

0667242

Signed-off-by: Christian Simon <simon@swine.de>

Skip results cache, if multi tenants query

b41a4df

Signed-off-by: Christian Simon <simon@swine.de>

Align CHANGELOG line with the flag description

54b9981

Signed-off-by: Christian Simon <simon@swine.de>

simonswine force-pushed the crosstenant branch from 891e692 to 54b9981 Compare December 23, 2020 12:20

Add a limitation about missing results cache

298c214

Signed-off-by: Christian Simon <simon@swine.de>

pracucci merged commit 6614def into cortexproject:master Dec 23, 2020

simonswine mentioned this pull request Dec 31, 2020

Implement result cache for tenant query federation #3640

Merged

3 tasks

midnightconman mentioned this pull request Mar 10, 2021

Tenant Federation grafana/loki#3462

Closed

simonswine mentioned this pull request Mar 22, 2022

Add tenant package grafana/dskit#143

Merged

Add multi tenant query federation #3250

Add multi tenant query federation #3250

Uh oh!

Conversation

simonswine commented Sep 29, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jtlisi left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jtlisi left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

csmarchbanks left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pstibrany left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

simonswine commented Dec 23, 2020

Uh oh!

pracucci commented Dec 23, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bhiravabhatla commented Jan 20, 2021

Uh oh!

pstibrany commented Jan 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cabrinha commented Feb 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonswine commented Feb 22, 2021

Uh oh!

Uh oh!

simonswine commented Sep 29, 2020 •

edited

Loading

pracucci commented Dec 23, 2020 •

edited

Loading

pstibrany commented Jan 20, 2021 •

edited

Loading

cabrinha commented Feb 11, 2021 •

edited

Loading