[Feature Proposal] Writable Remote Index #7804

sachinpkale · 2023-05-29T08:43:03Z

This feature proposal is WIP. We will continue to add details to Sections that are marked with ToDo.

Goal

As an extension to the remote store feature, searchable remote index will introduce data tier support in OpenSearch. A hot index has data on local disk as well as in the remote store, whereas a warm index has data only in the remote store. The next step is a writable warm index. This RFC covers the requirements for writable warm, the different approaches to supporting writes, the pros and cons of each approach, and a recommended approach.

Background

This doc assumes the following index structure with data tiers. The example only illustrates a sample pattern and can be changed as per the user’s requirements.

orders - Live index; normal writes go to this index.
order-history-<DATE> - The orders index is rotated on a daily basis, and the rotated index is suffixed with the date.
orders-alias points to the indexes containing the last 30 days of data. orders is added to this alias with is_write_index=true. That means that writes through the alias always go to the orders index.
The last 7 days of data are kept in the hot tier. That means the indexes from order-history-2023-02-22 back to order-history-2023-02-16 are hot indexes and can be written to in the same way we write data to an index today.
Data that is 7 to 30 days old is removed from local nodes, while the index metadata remains part of the cluster state. This data becomes part of the warm tier. In this example, the indexes from order-history-2023-02-15 back to order-history-2023-01-16 are warm indexes.
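
As a rough illustration of the tiering rule in this example, the sketch below classifies a rotated index as hot or warm from its date suffix. The function name `tier_for`, the constant `HOT_DAYS`, and the choice of reference date are assumptions made for illustration; they are not part of OpenSearch.

```python
from datetime import date

# Assumption for illustration: indexes younger than this many days stay hot.
HOT_DAYS = 7
PREFIX = "order-history-"


def tier_for(index_name: str, today: date) -> str:
    """Return 'hot' or 'warm' for a rotated index, based on its date suffix."""
    if not index_name.startswith(PREFIX):
        raise ValueError(f"not a rotated index: {index_name}")
    index_date = date.fromisoformat(index_name[len(PREFIX):])
    age_days = (today - index_date).days
    if age_days < 0:
        raise ValueError("index date is in the future")
    # Hot indexes keep data on local disk and in the remote store;
    # warm indexes keep data only in the remote store.
    return "hot" if age_days < HOT_DAYS else "warm"


if __name__ == "__main__":
    reference_day = date(2023, 2, 22)
    for name in ("order-history-2023-02-16", "order-history-2023-02-15"):
        print(name, tier_for(name, reference_day))
```

With 2023-02-22 as the reference day, this reproduces the boundary used above: order-history-2023-02-16 is hot and order-history-2023-02-15 is warm.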

Requirements

Functional

Support updates to existing documents without any changes at client side
Support appending data to a warm index
Optimise append-only writes based on auto-generated ids/data streams
Refresh data post writes after a configurable period or based on explicitly defined policies

Non-Functional

Shouldn’t interfere with read performance
Impact on write latency should be predictable and/or configurable
Time required to make new changes visible should be configurable
Minimal storage overhead in append/updates

Non-Requirements

Using the same index name (or alias) to write to hot/warm index.
- In phase 1, the user needs to provide the exact index to write data to. For example, writing to the warm index order-history-2023-02-15 would require that exact index name to be provided. Writing to the alias will only write to the live hot index.
- In the next phase, we can support writing to a single index (orders-alias in the example above). Based on a configured field (like timestamp), OpenSearch decides which index to write the data to. Even though this is a valid requirement, it can be built incrementally.
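
To make the later-phase idea concrete, here is a minimal sketch of routing a document to an index based on a timestamp field. The function `resolve_target_index` is hypothetical, and the rule that documents dated today or later go to the live `orders` index is an assumption; the actual routing design is not specified in this proposal.

```python
from datetime import date, datetime


def resolve_target_index(timestamp: str, today: date) -> str:
    """Pick the index for a document from its ISO-8601 timestamp (hypothetical rule)."""
    doc_date = datetime.fromisoformat(timestamp).date()
    if doc_date >= today:
        # Assumption: current data goes to the live index.
        return "orders"
    # Older data goes to the rotated index for that day, hot or warm.
    return f"order-history-{doc_date.isoformat()}"


if __name__ == "__main__":
    print(resolve_target_index("2023-02-10T08:30:00", date(2023, 2, 23)))
```

In phase 1 the client would do this resolution itself and send the write to the exact index name.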

Use Cases

Write New Data

Add new documents to an existing warm index. This use case is mostly driven by back-filling data that was not ingested earlier for some reason. It assumes that the user knows which index to use for writing the new data.

Update Existing Data

To update existing data, we need to fetch the existing document first. To improve latency, we need to perform block-level fetches. Once the document is fetched and the new changes are applied to it, the next step is the same as in Write New Data.

Potential Approaches

These approaches provide a solution for the Write New Data use case only, as the Update Existing Data use case internally depends on Write New Data.

[Recommended]

Once a write request hits the warm index, we open the engine in read-write mode with the metadata from local disk. Alternatively, the warm index could keep its engine open in read-write mode from the start to support writes.
For non-append-only cases, we do a block fetch of the document that needs to be updated, then update the document by writing to the remote translog before we ack back.
For append-only use cases, we can skip the block fetch altogether, since we know it's a new document, and write directly to the remote translog. Based on a configurable delay, we refresh the segments and move the newly created segments and updated bitsets to the remote segment store. More details of this approach will be covered in the design review.
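
The write path described above can be sketched with a toy in-memory model. Everything here (`WarmShardSketch`, its fields and methods) is hypothetical and only mirrors the order of steps: block fetch for updates, write to the remote translog before acknowledging, and a later refresh that moves new segments to the remote segment store. It is not the OpenSearch engine API.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class WarmShardSketch:
    # Each dict stands in for one immutable segment in the remote segment store.
    remote_segments: List[Dict[str, dict]] = field(default_factory=list)
    # Stand-in for the remote translog: (operation, doc_id, document).
    remote_translog: List[Tuple[str, str, dict]] = field(default_factory=list)
    # Writes that were acked but not yet refreshed into a segment.
    pending: Dict[str, dict] = field(default_factory=dict)
    block_fetches: int = 0

    def _block_fetch(self, doc_id: str) -> Optional[dict]:
        """Fetch one document from remote segments; the newest segment wins."""
        self.block_fetches += 1
        for segment in reversed(self.remote_segments):
            if doc_id in segment:
                return segment[doc_id]
        return None

    def append(self, doc_id: str, doc: dict) -> bool:
        """Append-only write: no block fetch, translog write, then ack."""
        self.remote_translog.append(("index", doc_id, doc))
        self.pending[doc_id] = doc
        return True

    def update(self, doc_id: str, changes: dict) -> bool:
        """Update: block fetch the document, apply changes, translog write, then ack."""
        current = self.pending.get(doc_id)
        if current is None:
            current = self._block_fetch(doc_id)
        if current is None:
            raise KeyError(doc_id)
        merged = {**current, **changes}
        self.remote_translog.append(("update", doc_id, merged))
        self.pending[doc_id] = merged
        return True

    def refresh(self) -> None:
        """Called after the configurable delay: publish pending writes as a new segment.

        The newest-segment-wins lookup stands in for the updated bitsets that
        mark older copies of a document as deleted.
        """
        if self.pending:
            self.remote_segments.append(dict(self.pending))
            self.pending.clear()


if __name__ == "__main__":
    shard = WarmShardSketch(remote_segments=[{"1": {"status": "new"}}])
    shard.append("2", {"status": "new"})
    shard.update("1", {"status": "shipped"})
    shard.refresh()
    print(shard.block_fetches, len(shard.remote_segments))
```

In this model only the update triggers a block fetch, which is the latency difference between the append-only and non-append-only cases.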

Alternative Approaches

Download All Data
In this approach, we make the index hot by downloading all data from the remote store to local disk. Once the data is downloaded, new data is ingested into it. As this is a warm index, we can’t keep the data on local disk forever. To avoid downloading the data too frequently, we wait for X mins after the last write, then flush and delete the data (and the metadata, depending on the data tier type) from local disk.
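
A minimal sketch of the idle-timeout behaviour in this alternative is shown below. The class `DownloadAllSketch` and its parameter `idle_timeout_s` (the "X mins" above, in seconds) are hypothetical; timestamps are passed in explicitly so the logic is easy to follow.

```python
class DownloadAllSketch:
    """Toy model: download everything on first write, evict after an idle period."""

    def __init__(self, idle_timeout_s: float):
        self.idle_timeout_s = idle_timeout_s
        self.local = False       # whether the data is currently on local disk
        self.last_write = None   # time of the most recent write
        self.downloads = 0       # how many full downloads were needed

    def write(self, now: float) -> None:
        if not self.local:
            # Stand-in for downloading all data from the remote store.
            self.downloads += 1
            self.local = True
        self.last_write = now

    def maybe_evict(self, now: float) -> bool:
        """Flush and drop local data if no write happened within the timeout."""
        if self.local and now - self.last_write >= self.idle_timeout_s:
            self.local = False
            return True
        return False


if __name__ == "__main__":
    index = DownloadAllSketch(idle_timeout_s=600)
    index.write(now=0)
    index.write(now=100)
    print(index.maybe_evict(now=500), index.maybe_evict(now=700), index.downloads)
```

A longer timeout reduces repeated downloads at the cost of keeping warm data on local disk for longer.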

Comparison

Potential Issues

Both of the above approaches can result in too many small segments, which will hurt query performance. Even with concurrent segment search, a high number of segments would impact overall performance. We need a way to limit the number of segments with the help of a background segment merger.
The time to make documents visible will increase (it would not be the same as the refresh_interval of an index).
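
To illustrate what a background merger might decide, the sketch below groups small segments into merge candidates. The function `plan_merges` and both thresholds are assumptions for illustration only; this is a simple greedy grouping, not Lucene's merge policy and not the design proposed for the remote store.

```python
from typing import List


def plan_merges(segment_sizes_mb: List[float],
                small_mb: float = 50,
                max_merged_mb: float = 500) -> List[List[float]]:
    """Group segments smaller than small_mb into merges of at most max_merged_mb."""
    small = sorted(size for size in segment_sizes_mb if size < small_mb)
    groups: List[List[float]] = []
    current: List[float] = []
    total = 0.0
    for size in small:
        if current and total + size > max_merged_mb:
            # Close the current group; a single segment is not worth merging.
            if len(current) > 1:
                groups.append(current)
            current, total = [], 0.0
        current.append(size)
        total += size
    if len(current) > 1:
        groups.append(current)
    return groups


if __name__ == "__main__":
    print(plan_merges([5, 400, 10, 20]))
```

Each returned group would become one larger segment, which reduces the number of segments a search has to visit.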

Next Steps

POC to check feasibility of using RemoteDirectory instead of FSDirectory in IndexShard.Store
Once concurrent segment search is introduced, we need to understand impact of 1/5/10/100 segments on search and overall node performance (CPU, JVM etc.)

sachinpkale added enhancement Enhancement or improvement to existing feature or request RFC Issues requesting major changes labels May 29, 2023

github-actions bot added the untriaged label May 29, 2023

sachinpkale added distributed framework and removed untriaged labels May 29, 2023

andrross changed the title ~~[Featuer Proposal] Writable Remote Index~~ [Feature Proposal] Writable Remote Index May 30, 2023

sachinpkale mentioned this issue Jun 16, 2023

[Feature Proposal] Merging Segments in Remote Store #8105

Open

Bukhtawar added the Storage Issues and PRs relating to data and metadata storage label Jul 27, 2023

anasalkouz removed the distributed framework label Sep 19, 2023

andrross mentioned this issue Dec 22, 2023

Define a high level plan for remote search index #11460

Closed

andrross mentioned this issue Jan 2, 2024

Introduce a new index setting/property for toggling a remote store index as "writable warm" #11703

Open
