
[Feature Request][RFC] Multi-tenancy as a construct in OpenSearch #13341

@msfroh

Description

Is your feature request related to a problem? Please describe

I've been involved with multiple projects and issues recently that try to deal with the notion of "multi-tenancy" in OpenSearch. That is, they are concerned with identifying, categorizing, managing, and analyzing subsets of traffic hitting an OpenSearch cluster -- usually based on some information about the source of the traffic (a particular application, a user, a class of users, a particular workload).

Examples include:

  1. [PROPOSAL][Query Sandboxing] Query Sandboxing high level approach. #11173 -- Query sandboxing aims to prevent some subset(s) of traffic from overwhelming the cluster or individual nodes, by imposing resource limits.
  2. [RFC] Query insights framework  #11429 -- Query insights can already capture the top N most expensive queries, but it will be more helpful once we can associate queries with a source. (This may also become an input to sandboxing decisions down the line.)
  3. [Search Pipelines] Add a processor that provides fine-grained control over what queries are allowed #10938 -- Some OpenSearch administrators have asked for a feature that would let them restrict what kind of queries specific users/groups are allowed to send. In order to do that, we first need to identify the users/groups.
  4. [RFC] User Behavior Insights #12084 -- User behavior logging is a little different, because we expect it to deal with a large number of users who probably aren't hitting the cluster directly. This is aimed at users of a search application, where the search application hits the cluster.
  5. Slow logs -- While we don't have an open issue for it (I think), it would be worthwhile to include information about the source of a query in slow logs.

Across these different areas, we've proposed various slightly different ways of identifying the source of traffic to feed the proposed features.

I would like to propose that we first solve the problem of associating traffic with a specific user/workload (collectively "source").

Describe the solution you'd like

We've discussed various approaches to labeling the source of search traffic.

Let the client do it

In user behavior logging, the traffic is coming from a search application, which can presumably identify a user that has logged in to the search application. The application can provide identifying information in the body of a search request. This would also work for any other workload where the administrator has control over the clients that call OpenSearch.

Pros:

  • Very easy to implement in OpenSearch. We just need to add a new property to SearchRequest (or more likely SearchSourceBuilder, since it probably belongs in the body). In the simplest case, this property could just be a string. For more flexibility (e.g. to support the full suite of attributes in the UBI proposal), the property could be an object sent as JSON. Of course, as these labeling properties grow more complex, it also becomes harder for downstream consumers (like query insights) to know which object fields are relevant for categorization.
  • For application builders, it provides a lot of flexibility. Application builders know how they want to categorize workloads for later analysis.

Cons:

  • Doesn't work in an environment where administrators have granted direct cluster access to many users, since they can't assume that users will provide accurate identifying labels.
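To make the "let the client do it" option concrete, here is a minimal sketch of a client attaching identifying information to the search request body. The `tenant_label` field name is purely illustrative -- it is an assumption, not an existing OpenSearch property, and the real name/shape would be decided when the `SearchSourceBuilder` property is added.

```python
import json

def build_labeled_search(query: dict, label: dict) -> str:
    """Return a search request body with a client-supplied source label.

    The "tenant_label" key is hypothetical: in the simplest case it could
    be a plain string; here it is an object to show the more flexible
    variant (e.g. the full suite of UBI attributes).
    """
    body = {
        "query": query,
        "tenant_label": label,  # hypothetical property on the request body
    }
    return json.dumps(body)

body = build_labeled_search(
    {"match": {"title": "shoes"}},
    {"application": "storefront", "user_group": "premium"},
)
```

Downstream consumers (query insights, slow logs, sandboxing) would then read the label back off the parsed request rather than re-deriving the source themselves.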

Rule-based labeling

This is the approach that @kaushalmahi12 proposed in his query sandboxing RFC. A component running on the cluster will inspect the incoming search request and assign a label. (Okay, in that proposal, it would assign a sandbox, but it's the same idea targeted to that specific feature.)

Pros:

  • Does not require changes to application code.
  • Does not rely on trusting clients of the cluster to do the right thing.
  • We could provide some sensible defaults. The coordinator node could tag the request with the source IP, for example. It would be great if we could take user identity information from the security plugin, but @peternied keeps telling me that's harder than it sounds (since the identity information might be a monstrous certificate).

Cons:

  • Developing a rule engine (however simple) is more complicated than just adding a property to a search request.
  • Need to worry about rule precedence.
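To illustrate the precedence concern, here is a hedged sketch of a first-match-wins rule engine: rules carry an explicit priority, and evaluation order is defined by sorting on it. The rule shapes and request fields are illustrative assumptions, not the sandboxing RFC's actual design.

```python
# Hypothetical sketch of rule-based labeling with explicit precedence.
from dataclasses import dataclass, field
from typing import Callable

@dataclass(order=True)
class LabelingRule:
    priority: int  # lower value = higher precedence
    label: str = field(compare=False)
    matches: Callable[[dict], bool] = field(compare=False)

def assign_label(request: dict, rules: list[LabelingRule],
                 default: str = "unlabeled") -> str:
    """Evaluate rules in priority order; the first match wins."""
    for rule in sorted(rules):  # precedence resolved by sorting on priority
        if rule.matches(request):
            return rule.label
    return default

rules = [
    LabelingRule(10, "dashboards",
                 lambda r: r.get("source_ip", "").startswith("10.0.")),
    LabelingRule(20, "batch-jobs",
                 lambda r: r.get("index", "").startswith("logs-")),
]
assign_label({"source_ip": "10.0.3.7", "index": "logs-2024"}, rules)  # -> "dashboards"
```

Even this toy version shows why precedence has to be explicit: the second request attribute also matches a rule, and only the priority ordering makes the outcome deterministic.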

Custom endpoints for different workloads

This is an evolution of what @peternied proposed in his RFC on views. Over on opensearch-project/security#4069, I linked a Google doc with my proposal for an entity that combines authorization (to the entity), access-control (to a set of indices or index patterns), document-level security (via a filter query), query restrictions, sandbox association, and more.

Pros:

  • Easy to administer. For a given tenant, everything is defined in one place.
  • No ambiguity. No conflicts. When you search via the given endpoint, the specified behavior is exactly what runs.
  • Reduces authorization load for the security plugin. No need to check every single index/alias when you're defining access control on the endpoint.

Cons:

  • Requires an endpoint per differentiated workload. Obviously won't work for something like UBI, where we're trying to learn from a larger user base. Gives a lot of power to administrators, but also a lot of responsibility. Assuming we store these endpoint definitions in cluster state (or any structure available to all coordinator nodes), you probably can't configure more than a few dozen or hundreds of them.
  • Administrators may need to worry about index patterns "accidentally" picking up access to new indices.
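As a sketch of what "everything is defined in one place" could look like, here is a hypothetical endpoint definition combining the pieces listed above. Every field name is an assumption for illustration, not a committed schema from the linked proposal.

```python
import json

# Hypothetical endpoint entity: one definition per differentiated workload.
endpoint = {
    "name": "tenant-a-search",
    "allowed_roles": ["tenant_a_users"],           # authorization to the entity
    "index_patterns": ["tenant-a-*"],              # access control
    "filter_query": {"term": {"tenant_id": "a"}},  # document-level security
    "query_restrictions": {"deny": ["script", "wildcard"]},
    "sandbox": "tenant-a-sandbox",                 # resource-limit association
}

# Stored in cluster state (or similar), it would need to be serializable
# and small -- consistent with the "few dozen or hundreds" scaling limit.
serialized = json.dumps(endpoint)
```

Because the definition itself carries the index patterns and filter query, the coordinator resolving a request to this endpoint would not need a per-index authorization check -- which is the "reduces authorization load" pro above.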

What do I recommend?

All of it! Or at least all of it in a phased approach, where we learn as we go. The above proposals are not mutually exclusive, and I can easily imagine scenarios where each is the best option. In particular, if we deliver the "Let the client do it" solution, we immediately unblock all the downstream projects, since all of the proposed options essentially boil down to reacting to labels attached to the SearchRequest (or more likely SearchSourceBuilder).

I think we should start with the first one (Let the client do it), since it's easy to implement. The rule-based approach can coexist, since it runs server-side and can override any client-provided information (or fail the request if the client is trying to be sneaky). I would recommend that as a fast-follow.

The last option is (IMO) nice to have, but limited to a somewhat niche set of installations. It's probably overkill for a small cluster with a few different sources of traffic, but it would be helpful for enterprise use-cases, where it's important to know exactly how a given tenant workload will behave.

Related component

Search

Describe alternatives you've considered

The above discussion covers three alternatives and suggests doing all three. If anyone else has suggestions for other alternatives, please comment!

What about indexing?

I only covered searches above, but there may be some value in applying the same logic to indexing, to identify workloads that are putting undue load on the cluster by sending too many and/or excessively large documents. My preferred approach to avoiding load from indexing is flipping the model from push-based to pull-based (so indexers manage their own load), but that's probably not going to happen any time soon. Also, a pull-based approach means that excessive traffic leads to indexing delays instead of indexers collapsing under load -- you still want to find out who is causing the delays.

@Bukhtawar, you're our resident indexing expert. Do you think we might be able to apply any of the approaches above to indexing? Ideally whatever we define for search would have a clear mirror implementation on the indexing side to provide a consistent user experience.
