-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Add Search to Remote-backed Storage #6528
Comments
Would particularly appreciate feedback from @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, and @reta, so tagging for visibility. Thanks! |
Not very clear on this. With your reference to searchable remote index being successor of searchable snapshot is the idea that for rolled over remote backed index, user will need to basically follow the same mechanism to delete the rollover over index and re-open using some API (similar to searchable snapshot) to make this index searchable directly from remote store ? Also it seems like the RFC is tying the concept of a node (search/data node) with the type of read capabilities supported for an index. Instead I think we should have different shard types something like local reader/remote reader shard. Then these shards can be allocated to any types of node configuration data only, search only, data+search to provide more flexibility. With this for use case 1: We can have all the shards as remote reader shard for an index and it may be allocated to data/search nodes based on cluster configuration. For use case 2: An index can have write shards and remote reader shards but both on data nodes or split between data+search node |
I think we're on the same page here as this is what I'm trying (but not doing well!) to describe here. For use case 1 I'm thinking something like the following (specific names/structure just used for illustration): Step 1:
Step 2:
The allocator would allocate these shards as appropriate across nodes (e.g. a remote_replica would require a node with the "search" role with a file cache configured). There would be no explicit delete/remount for the first index across steps, and most users would use ISM for orchestrating this similar to how they do today. |
Thanks for sharing the example, couple of points:
Step 1_2:
|
Yes, definitely. And I think we'll want future investment here to allow for configurations where the writer in that example can be lighter weight and maybe not require a full copy of the index data in local storage.
Yes, there are cost and performance considerations, but both configurations are possible. |
Thanks @andrross for the proposal. Regarding
We definitely need to understand the write use case better. Given the immutable property of segments, we can't modify them.
|
Thanks @andrross for the proposal. Looks real good. Few more questions which we need to answer in design doc :
|
Great proposal @andrross , it really make sense at high level, just one question on the subject which you have briefly mentioned
This is really useful feature, but it may introduce the need for capacity planning for search replicas, for example having a large index spread over 5 data nodes and trying to fit the searchable replica into 1 search node may not be practical. Any constraints we could identify here upfront? (somewhat related to @gbbafna comments). Thank you. |
@sachinpkale @gbbafna re: Segment replication There is definitely overlap in some of the use cases here and segment replication. The fundamental difference as I see it, is that remote searchers fetch data as needed, whereas a segment replication replica is proactively keeping an up-to-date copy of all data local. A segment replication replica is also able to be be promoted to act as the primary, whereas a node with a remote search shard is inherently only ever serving reads. To answer the question about push vs pull, I think the remote searchers will be implemented to poll the remote object store in order to decouple them from the primaries.
These are great questions! These will need to be solved if we do want to make the writer not require a full copy of the index data, but that is something to be designed and built later and not strictly needed as a part of this design.
I think once you factor in the polling model plus any eventual consistency delays in the remote object store, the answer is that new writes will probably take longer to become searchable than in the cluster-based node-to-node replication model. This is likely one of the tradeoffs involved with this architecture.
This is a great question. My inclination is to by default send searches to the search node/shards, but I suspect there may be need to allow for overriding that behavior and directing a search directly to a primary/replica.
@reta Do you have anything specific in mind here? Since the searchers are relatively stateless with the remote object store holding the authoritative state of the data, they can potentially be more easily added and removed allowing for a cluster admin or managed service operator to offer more autoscaling-like features, but do you have something in mind for building directly into the OpenSearch software itself? |
Thanks @andrross, I think the general question which I have no clear answer yet: how many nodes with |
Thanks @andrross for the proposal. I see two ideas clubbed in this proposal:
I feel Reader/ Writer separation shouldn't be tied to remote store only. In segment replication, most of the work (in terms of indexing) is done by primary and replicas are copying the segments over. Now, user might want to still have Reader/ Writer separation with segrep indices backed by local storage. One of the key reason for the same could be to get predictable performance for both type of workloads and sudden surge in one shouldn’t impact the other. This way user can still ensure search-only replica will never be promoted to primary. Reader/ writer should be thought as roles and user can decide if he/ she wants to co-locate the roles on same node (same jvm) vs different nodes.
This could be request param where user can choose to always use read_replica or fallback to primary in case no read_replica is available.
+1, Earlier in-sync allocation used to keep track of all in sync copies. Here also, there should be a way to detect and fail deep read replicas if they are not catching up due to some reason. Also, same search request results could vary significantly across requests depending on which read_replica the request has landed and how far that replica has caught up with primary. Basically, the eventual consistency would be more visible now. Btw, this might require Cluster State changes to track read-replicas and also they will not be accounted in in-sync allocations as well. |
@shwetathareja Definitely! I think there are three dimensions here:
Not every permutation across these options make sense though. For example, the "partial" data loading strategy likely only makes sense when pulling from a remote store with a durable copy of the data, and I doubt we would ever implement a version of that functionality that would pull from a peer. Similarly, "fetch from peer nodes" is probably only desirable when not using a remote store where you could get the data without burdening a peer. Expanding on that and to make it more concrete, the following are some sketches (some details elided for clarity) of what different index configurations could look like: Seg-rep only, no remote, separate searchers:
Read-only warm-like tier:
Writable warm-like tier:
Deep replicas:
Doing all of this is a lot of work and may take a long time, but getting the abstractions right is important. Also this gets quite complicated, so we'd want to be able to define a simple orchestration policy with something like ISM. What do you think?
For sure. This is a problem today for segment replication, and some strategies for dealing with this have been implemented. I think we'll want similar strategies here to fail replicas when the fall behind and can't catch up. |
Yes, agreed there could be multiple permutations and we need to decide which ones we should focus in Phase1. I don't think we can offload the responsibility to ISM completely but it can definitely handle scenarios like when moving hot to warm (read only), it can clear the write copies to 0. Also it can handle the loading_strategy based on customer policy configuration e.g. on-demand load segments (partial) vs always on disk (full). Customer can also define defaults counts/ loading strategy in templates for write and search copies which are picked up during index creation else fall back to system defaults. nitpick on the naming: Don't overload the existing terminology like using primaries for both write copies (primary & replica). It will create confusion. |
I think the role needs to be extended to both Local store based seg-rep index. For each shard there will be 2 copies. Both copies will server both the roles so they can participate in recovery.
Local store backed index with writer & searchers separation and HA. Each shard has 4 copies with 2 writers and 2 searchers in it.
|
@sohami : As a user either I am looking for writer/ search separation or not irrespective of local or remote storage. If you need separation you will configure no. of writers and searchers separately else you only need to configure writers which will serve read and write both. Now, the question comes for read-only indices, for that, you only need to configure searchers explicitly or ISM policy can do it during hot to warm migration. In case writers are configured for read-only, should that be treated as no-op? (more like an optimization) Do you still feel the need for separate role at shard level as well. I am trying to simplify for the user where they don't have to configure or understand too many configurations. |
This was my thinking as well, and the reason I used the (admittedly bad) "primaries" name above because writers can implicitly do both reads and writes. The new concept being introduced here is the "searcher", and the other "writer" is essentially the same that exists today in that it is capable of doing both reads and writes. As @shwetathareja said, provisioning only "writers" gives you the same behavior today of serving both reads and writes. Adding "searchers" gives you read/write separation by default (with the likely option to direct searches to the writers via a request-level override parameter). All of this is irrespective of local vs remote and even full vs partial data loading, though we'll need to prioritize the work here. I do wonder if this would box us in by codifying the behavior that writers are capable of doing reads as well. Given that update and updateByQuery exist in the write API, it seems reasonable to me to bake in the behavior that the writers are capable of serving reads as well. If, however, we envision a future capability where writers do not have the ability to do reads, then we would need the flexibility that @sohami described above. |
Sorry for the long delay here as I'm just getting time to start reviewing this RFC. First, thanks for starting this discussion! We've needed a place to start pulling a part these mechanisms. I haven't read through all comments here so I'll start with some meta comments / questions.
That's my round 1. I'll look at comments and add more color as this marinates. |
I am treating writer and searchers as shard level roles here (different from primary and replica). "Writer" shard role is what we have today. "searcher" shard role can be a new shard type which can be added. For default case where separation is not needed user do not have to configure anything specially, we may use writer shard role which will serve both write and search traffic.
I was originally confused by the "primaries" being used for "writers", so was calling out that we need some mechanism to define shard level role (or different shard types) like writer/searcher separately from "primary" and "replica". Essentially the main callout is failover will happen within same shard role (or type) not across shard role. So primary/replica or similar concept will exist within a role group.
I think writer shard should be able to perform both write and read of the data to serve different APIs like bulk, update or updateByQuery using its view of data. Being able to read the data doesn't mean it has to serve search API as well or being writer doesn't mean it cannot serve search API. In the complete separate writer/searcher world, we can make writers shards not serve the search API if separate searcher shards are available by controlling the shard resolution at coordinator. This shard resolution can be a pluggable rule with some defaults |
Why? You're not going to write to a replica. I think what you're referring to is a "Replica only" Data node. This is highly discouraged because it flies in the face of read/write scalability (e.g., not all nodes created equally). I think we should pin this mechanism to a serverless plugin and discourage on-prem users from scaling up "read only" nodes unless they are operating in a serverless environment? |
Replica will be used for availability of a shard, so that it can be promoted to primary during failover and start accepting new writes. Meanwhile a new replica will be assigned to serve the search traffic. So for a writer shard (default or what exists today) we can have pri + rep, where pri serves write&search and replica serves search. Failover will affect both write & search traffic. If I need complete separation of writer/searcher (irrespective of separate data node with different roles) with high availability of both i.e. during failover as well my searcher shard should not get affected by writer failover. Then the replica of the writer (if available) will be used for failover instead of searcher shard. There could be other reasons like my searcher shard could be using
Sorry I didn't understand this part |
This specific issue is meant to get at the feature to "add replicas to a remote(i.e. repository)-backed index that can fetch data on-demand at search time via the repository interface". What this exposes is that the current index settings API, with its Also totally agree that the default behavior all nodes being equal should not change, nor should the experience of configuring or operating such a cluster be made any more complicated. |
What specifically is it lacking? Keep this simple and don't worry about the separation of reader/writers in this issue. |
@sohami : Shard level role could be inherent and implementation details how we implement shard allocation during role separation. @andrross : To your point where if we foresee writers not doing reads at all. I feel we shouldn't add the restriction to start with and let this feature evolve over time. Customers could have use cases where they want more latest copy to serve their request (writers copy) and may want to decide at request level. As a platform, we shouldn't restrict that capability.
@nknize : I would agree to keep this discussion separate as reader/ writer separation is independent of underlying storage mechanism.
@nknize : Once we start diving more into reader/ writer separation. We would evaluate if overloading existing settings are enough vs addition of new settings would help users configure better. I am planning to create the separate RFC for Reader/ Writer separation, lets debate more there. |
💯 Thanks 🙏🏻 @shwetathareja! |
Just dropping this brainstorm here until @shwetathareja opens the issue, then I'll move it over. Another thought I've been mulling over is (in a cloud hosted or on-prem serverless environment) how we could parallelize primaries to achieve what (I think) is being described here with the read/write separation. In the face of write congestion can we auto-scale up additional "primary" writer nodes that concurrently write to the "primary"! In this scenario we would effectively have multiple primaries writing to the same shard. We could logically think of this as sharding a shard. That is, keep the same shard id range for the shard but split it into a "super shard set". Add a new |
Let me try to clarify my thinking here and why I'm (over?) complicating this issue with reader/writer separation. First, let's assuming the following definitions:
For this feature, I'm envisioning a third case, with essentially a sole responsibility of "handle searches". These pure searchers can be scaled independently, and don't have to worry about participating in the write path at all, nor maintain any state that allows them to be quickly promoted to act as primary. They can exist by themselves for an isolated "warm" search experience (e.g. the "time series" use case above) or be used alongside a |
I don't think that's a third case. That's what replicas are for. Are you just saying you want a replica to never be primary-eligible, and only pull/cache on-demand (at search time)? If that's the case that's a variant on what I'm describing and I'd separate it from the remote storage mechanism and outline that use case under the issue @shwetathareja is going to open. There's no reason we couldn't add a flag to prevent a replica node from being primary eligible - though I think that's a cloud hosted / serverless feature where you have "unlimited" resources because I'm not sure I'd penalize an active node from becoming primary if writes are bottlenecked. |
Well, it's slightly more complicated than that. Listing the cases detailed in the use cases above:
I do think these cases have merit (though likely mainly for large clusters where you can specialize hardware for each role). They are also not strictly required for pull/cache on-demand storage feature, though that capability slots in very nicely for the searchers in an architecture where you have differentiated nodes. We can certainly make the pull/cache on-demand an index-level property of a remote-backed index. The initial phase would likely not allow writes for such an index and provide the read-only "warm" experience. The next phase would be to implement the logic to support pulling/caching on-demand in the write path (for updates/update-by-query, etc) while also offloading the newly written segments from local storage when required (which is briefly mentioned in the "data nodes" section above). |
Now that the reader/writer separation has been pulled out to a separate issue (#7258), here is an trimmed down proposal for just the storage-related portion of this original RFC. @nknize @shwetathareja Let me know what you think. If we get alignment on this after some discussion and feedback I'll update the main content on this GitHub issue and close it and we'll proceed with the design and implementation for this effort. GoalThe goal of this effort is to enable a node to read and write from a remote-backed index that is larger than its local storage. This feature will be part of enabling an "initite storage" user experience where the host resources required to host an index are driven by the usage and access pattern rather than by the total size of the data in the index. TerminologyLocal storage: The storage available to an OpenSearch node via the operating system's file system. BackgroundOpenSearch nodes read and write index data via the operating system's file system, and therefore a node must have sufficient disk space to store all the necessary data. OpenSearch implements three "disk watermark" thresholds which, when breached, will stop allocating new shards to a node, attempt to allocate existing shards away, and as a last resort block writes in order to prevent the node from running out of disk space. The remote-backed storage feature introduced a new capability to automatically back up data in a durable remote store. However, there is not yet the capability for an index to grow beyond the size of the cluster's local as remote-backed storage relies on having a full copy of the data in local storage for serving read and write requests. The searchable snapshot feature has introduced the capability to pull data from the remote store into local storage on-demand in order to serve search queries. The remote store remains the authoritative store of the data, while local storage is used as an ephemeral "cache" for the (parts of) files that are necessary for the given workload. ProposalThe goal of this feature is add an option to remote-backed indices that enables the system to intelligently offload index data before hitting disk watermarks, allowing more frequently used and more-likely-to-be-used data to be stored in faster local storage, while less frequently used data can be removed from local storage since the authoritative copy lives in the remote store. Any time a request is encountered for data not in local storage it will be re-hydrated into local storage on-demand in order to serve the request. The first phase of this proposal will implement the pull data on-demand logic for remote-backed indices, but with the constraint that these indices are read-only. This will solve a common use case for time-series workloads where indexes are rolled over and not written to again. Users can modify their ISM policies to apply this new index setting to implement a type of "warm" tier where the full dataset of these indexes can be offloaded from local storage and only the data that is needed will be hydrated into the "cache" on demand at search time. No complex workflows or data snapshotting is necessary here as the data already exists in remote storage. This work will leverage the functionality implemented in searchable snapshots to pull parts of files into local storage on demand at search time. The next phase will remove the read-only constraint. This will involve implementing a composite Directory implementation that can combine the functionality of the normal file system-based directories (from Lucene), the RemoteDirectory implementation that mirrors data into the remote store, and the on demand block-based directory added in searchable snapshots. This will involve some significant design and prototyping, but the ideal goal is to provide an implementation that can intelligently manage resources and provide a performant "warm" tier that can serve both reads and writes. |
Thanks @andrross for the concise proposal and capturing the first phase clearly.
There will be an API provided to users to migrate their indices to warm tier. ISM would automate this warm migration. Is that right? Couple of questions:
|
@shwetathareja Thanks for the comments!
With this capability, I see two states that the user can choose from for a remote-backed index:
Let's call these "hot" and "warm" to use common terms, but I'm not committing to settings or API names here. So given this, there will be a new index setting, say
Replicas will still work the same as they do today. The warm replicas would pull data on demand and cache as necessary. I think a common use case, particularly for a read-only warm experience, will be to reduce replicas to zero upon configuring the index to warm. A replica could still be added here for a faster fail over experience, but we'll probably want a default search routing policy to send searches to the primary to get the full benefit of the cache. Longer term, I think we'll need smarter strategies for replicas to pro-actively hydrate data when storage is available if warm performance is to be on par with hot.
Indexes that are never queried may eventually have most of their data evicted from storage, but would still be tracked in cluster state and still consume some resources. We'll need an archival/cold tier as there will definitely be limits here. I've kept that out of scope here. Do you think an archival tier needs to be a part of this proposal?
There shouldn't be any user-facing changes, but the system would be able to remove unreferenced data files from local storage from warm indexes before it had to take more aggressive action like block allocations, re-allocate shards away, or block writes.
Yes, users should be able to dynamically change the setting in either direction. |
@andrross Lets say a customer who is rolling over on daily basis and moving older indices to warm. Over time, it can accumulate lot of indices considering warm is cheaper. It would be good to know per shard memory footprint in terms of shard lucene metadata (irrespective of cluster state). For example, if per shard memory footprint is 50MB and per index there are 10 shards, then over a period of 10 days, it can start taking 50MB * 10 *10 = 5GB heap un-necessarily. It is something to POC and think about and based on the impact we can decide whether it should be scoped in. |
@andrross Thanks for the proposal
Nice to see this. +1 |
To add to @sohami point:
This should consider memory footprint per shard in this calculation. I think we should give a reference calculation to user so that they can estimate properly. @sohami can we give this information in API per shard heap usage? |
@shwetathareja @sohami Thanks for the feedback! I think I've answered your questions with some more details below. Please let me know what you think. User ExperienceUser can tag a remote-backed index as "warm", allowing the system to offload unused data from local storage Detailed BehaviorAdding/removing tagFor all cases, the index should remain usable during transitions. There is modality here through: for cases where file cache-capable nodes are separate, then state transitions will be shard relocations. For a homogenous fleet, state transitions will happen in-place. The in-place hot->warm transition (while likely technically complex) should be minimally invasive from a user's perspective as it is essentially an accounting change for existing data on disk: All files on disk are logically moved to the file cache, and are subject to eviction if not actively referenced. The in-place warm->hot transition may require a large data transfer from remote. Query routingRouting queries to the primary by default is likely the right initial default in order to maximize cache re-use. Replicas can still exist for faster failover, but may have a cold cache upon being promoted. ReplicationThis should be a variant of segment replication-via-remote-store, except that as a "warm" replica it would not need to proactively pull the latest data into local storage. It can use the same mechanism to become aware of new segments and the pull the data on-demand. (Note: to get to the warm-by-default goal we probably won't be able to rely purely on pull on-demand and will need more sophisticated strategies for keeping likely-to-be-accessed data in local storage.)
|
@shwetathareja I was thinking this to only consider storage and not memory footprint. The setting |
@andrross For my understanding, does this mean that the local files will move to the FileCache as part of transition ? But isn't FileCache just to keep track of block files instead of full segment files and also in different location ? Seems to me that transition should essentially remove the local lucene files and start using the block mechanism. Also just calling out that given we will need to enforce some limits or back-pressure mechanism as part of transition, we may need new API instead of just being able to use the existing update setting API mechanism.
|
Thanks everybody for the feedback here. I'm going to close this issue, but please do continue to follow along the work in this area in the linked issues. |
Overview
The storage vision document describes the high level thinking for storage in OpenSearch, and the roadmap introduced the first phases of that work. Today, remote-backed storage and searchable snapshots have been implemented and are nearing general availability. The next phases of the roadmap describe the concept of a “searchable remote index” which will add functionality to natively search data in the object store for remote-backed indexes. To that end, this proposal builds on the search node functionality introduced in searchable snapshots to allow for searching data natively in remote-backed storage.
Proposed Solution
This document details a proposal to add search replicas to remote-backed indexes. In this context, a search replica is a host in the cluster that can serve search queries from the data in the remote object store without fully copying the data to disk ahead of time. A node in the cluster can be dedicated to performing the search role, but need not be. From a 10,000ft view, a remote-backed index with search replicas will behave the same as any other index: a user can submit reads and writes and the cluster will behave as expected. Zoom in a bit and there are cost and performance considerations. Below are two primary use cases:
Time Series Data
This is the canonical use case in log analytics workloads. A user starts with a regular remote-backed index serving both reads and writes while seamlessly mirroring the data into the remote object store. When this index is rolled over it will no longer be written to but can still be searched. The searchable remote index allows for keeping the ability to search this index while offloading the full data storage to the remote object store. Just as remote-backed indexes are envisioned as a successor to snapshots, the searchable remote index is a successor to searchable snapshots as the continuous snapshot-like feature of a remote-backed index allow for remote searching without additional workflow steps or orchestration.
Step 1:
Step 2:
Deep Replicas
This use case essentially combines the two phases of the time series use case into a single index: a remote-backed index is serving writes and mirroring the data to the remote object store, while search replicas are serving search requests from the remote data. This configuration can solve the use case for very high search throughput workloads, as the remote search replicas can be scaled independently from the data nodes. The tradeoffs with an all data node architecture is that there will likely be increased replication delay for search replicas to become aware of new writes, and individual search requests can see higher latencies depending on workload (see the “search node” section below for future performance work). There is also an entire future work stream that will benefit this architecture mentioned in the “data node” section below for changes to make the writers more remote native.
Automation
The introduction of index configurations with cost and performance benefits and tradeoffs will be a natural fit for automation with ISM. This document is focused on defining the right core concepts and features, but ISM integration is a priority and will be implemented.
Design Considerations
Search Node
The search node will continue to build on the functionality introduced in searchable snapshots. It will use the block-based fetching strategy to retrieve the necessary data to satisfy search requests on-demand, and the file cache logic to keep frequently accessed blocks on disk for quick access.
From the perspective of the OpenSearch engine, Lucene’s Directory abstraction is the key interface point that the complexities of different storage implementations are hidden behind. For example, any storage technology that is accessed through the operating system’s file system (such as physically attached disks or cloud-based block stores like EBS), the FSDirectoryFactory is used to create the appropriate Directory implementation for listing and accessing files. For remote-backed storage a RemoteDirectory implementation is used for uploading and downloading files used by data nodes in a remote-backed index. Searchable snapshots implemented a RemoteSnapshotDirectory that provides access to files that are physically represented as a snapshot in a repository (the specific repository implementation is provided via a storage plugin). We will create a similar remote search-focused Directory implementation for searching remote-backed indexes stored in a repository, using the same underpinnings of RemoteSnapshotDirectory to provided the blocked-based fetch and caching functionality. The bulk of the logic for the Lucene abstraction that implements the on-demand fetching and file caching is implemented in the class OnDemandBlockIndexInput and will be reused here.
Some of the changes required (relative to searchable snapshots) are to use the remote-backed storage metadata to identify segment file locations as opposed to snapshot metadata, and the ability to refresh that metadata when data in the remote object store changes. Critical performance work will also continue in this area, such as:
Data Node
As this proposal is focusing on introducing remote search to remote-backed indexes, no changes are necessary for the data nodes in the first steps. The data nodes will continue to keep a full copy of the index data on local disk while mirroring it into the remote object store. Replica data nodes can be provisioned in order to allow for fast fail-over without the need to restore, and node-to-node replication will keep the replica nodes up-to-date. However, there is a lot of potential future work in this area, such as offloading older data to keep the writers lean, mechanisms to perform merges directly in the remote object store by a separate worker fleet, etc. That work is beyond the scope of this phase, but the goal here is to build incremental pieces that are extensible to allow such architectures.
Shard Allocation
Searchable snapshots introduced the concept of remote shards to be allocated to search nodes. This will have be built on and expanded for searchable remote indexes. In particular, the case where some shards are allocated to data nodes (local) and some are allocated to search nodes (remote) within the same index will be novel.
Cluster State
This proposal does not change the cluster state architecture: the cluster state will still be managed by a leader elected cluster manager node. As remote search capabilities remove the size of the local disk as a constraint for holding indexes, keeping the index metadata in the cluster state may well become a bottleneck. This is another area for future work to offload the metadata in order to offer a storage tier that can scale beyond the limits of what the cluster manager can manage.
How Can You Help?
Any general comments about the overall direction are welcome. Some specific questions:
Next Steps
We will incorporate feedback and continue with more concrete designs and prototypes.
The text was updated successfully, but these errors were encountered: