Is your feature request related to a problem? Please describe
Today, OpenSearch exposes an HTTP-based API for indexing, in which users invoke an endpoint to push changes. It also has a “_bulk” API to group multiple operations into a single request.
There are some shortcomings with the push-based API, especially for complex applications at scale:
- Ingestion spikes beyond server capacity can result in request rejections and backpressure on the clients, which adds complexity to the producing applications, since they must handle that backpressure properly.
- In complex applications, not all index requests are equally important, but the current HTTP API offers no good way to differentiate index requests by priority. This is particularly challenging for heavy ingestion workloads.
- Some scenarios require replaying indexing requests. For example, when a cluster is restored from a previous snapshot, the ingestion history since that snapshot needs to be replayed. Another example is live cluster migration, where the same indexing has to be applied to two clusters.
- In segment-replication mode, the translog provides durability for staged index changes before they are committed, but it can grow excessively under heavy ingestion. In document-replication mode, each request has to wait for the document to be replicated synchronously. Neither the translog nor synchronous replication is necessary in streaming ingestion mode, because the streaming buffer itself provides the durability guarantee.
Describe the solution you'd like
In general, a streaming ingestion solution brings several benefits and addresses the challenges above:
- By introducing a streaming buffer, it’s possible to decouple the producers (i.e. the applications that generate index operations) from the consumers (i.e. the OpenSearch servers). With that, ingestion spikes can be absorbed and smoothed by the buffer without blocking the producers.
- Streaming ingestion can be extended with priority-aware ingestion by using separate Kafka topics to isolate events by priority and applying different rate-limit thresholds so that important events are processed first.
- A streaming platform like Kafka stores messages in a durable log and allows consumers to re-read past messages via offsets, which enables reprocessing and replay.
- A durable messaging system like Kafka provides a durability guarantee for messages while maintaining high write throughput, so the ingestion component in OpenSearch can be simplified by delegating message durability to the messaging system.
More details of the solution can be found in this document. The sketches below illustrate the producer-side decoupling, priority routing, and offset-based replay described above.
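As a rough illustration of the decoupling and priority-aware routing, here is a minimal Java sketch of a client-side producer that publishes index operations to priority-specific Kafka topics instead of pushing them through the HTTP _bulk API. The topic names, priority scheme, and JSON payload format are assumptions for illustration only, not part of this proposal’s design:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class PriorityAwareIngestProducer {
    // Hypothetical topic names; the real topic layout would come from the solution design.
    private static final String HIGH_PRIORITY_TOPIC = "opensearch-ingest-high";
    private static final String LOW_PRIORITY_TOPIC = "opensearch-ingest-low";

    private final KafkaProducer<String, String> producer;

    public PriorityAwareIngestProducer(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all so the stream itself provides the durability guarantee for accepted operations.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        this.producer = new KafkaProducer<>(props);
    }

    /**
     * Publishes one index operation (already serialized as JSON) to the topic that
     * matches its priority, instead of calling the HTTP _bulk API.
     */
    public void send(String docId, String indexOperationJson, boolean highPriority) {
        String topic = highPriority ? HIGH_PRIORITY_TOPIC : LOW_PRIORITY_TOPIC;
        // Keying by document id keeps updates to the same document ordered within a partition.
        producer.send(new ProducerRecord<>(topic, docId, indexOperationJson));
    }

    public void close() {
        producer.flush();
        producer.close();
    }
}
```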
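Similarly, the replay capability can be sketched with the standard Kafka consumer API by seeking back to an earlier offset and re-reading the retained log. The topic, partition, starting offset, and the re-apply hook below are placeholders:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class IngestReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ingest-replay");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Hypothetical topic, partition, and offset; in practice the offset would come from
            // snapshot metadata (e.g. the last offset applied before the snapshot was taken).
            TopicPartition partition = new TopicPartition("opensearch-ingest-high", 0);
            long replayFromOffset = 42_000L;

            consumer.assign(List.of(partition));
            consumer.seek(partition, replayFromOffset);

            // Re-read the retained log from that offset and re-apply each index operation.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("replaying offset=%d doc=%s%n", record.offset(), record.key());
                // applyIndexOperation(record.value());  // hypothetical re-apply hook
            }
        }
    }
}
```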
Related component
Indexing
Describe alternatives you've considered
As an alternative, a plugin-based approach runs a streaming-ingester process as a sidecar to the OpenSearch server on the same host, as described in this section.
Additional context
No response