[Feature Request] Support for open/neutral data formats for engine agnostic reads #12948

Bukhtawar · 2024-03-27T18:20:44Z

Is your feature request related to a problem? Please describe

While writing data in Lucene format enables faster queries, it also limits queries to a compatible Lucene query engine. As data grows over time, the need to keep the engine compatible with the older data format imposes another constraint, forcing users to choose between the benefits of newer versions and keeping older-format data readable.
To upgrade the engine, data indexed in older formats then needs to be re-indexed, which requires reading it from the source field with a compatible Lucene engine before individual documents can be re-indexed into the target version.

Describe the solution you'd like

The source field stores the raw document as a special field; however, this field can only be read by a compatible Lucene version. It would be good if we could store this field in an open/neutral format. This would enable users to:

- Use a query engine of their choice to query the original data, even if the Lucene data formats have changed
- Re-index seamlessly without being locked in by the source data format. For instance, check the complexity involved

There could be caveats, though, around query performance when the actual document needs to be returned, depending on the data format; this needs to be evaluated further.
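
To make the idea concrete, here is a minimal sketch, assuming newline-delimited JSON as the neutral format (the proposal does not name one). The function names and the `_id`/`_source` record layout are hypothetical, chosen only for illustration; the point is that the reader needs nothing beyond a JSON parser, so no Lucene version is involved in getting the original documents back.

```python
import io
import json


def write_source_ndjson(docs, sink):
    """Write each (doc_id, source) pair as one JSON object per line."""
    for doc_id, source in docs:
        record = {"_id": doc_id, "_source": source}  # hypothetical layout
        sink.write(json.dumps(record, sort_keys=True) + "\n")


def read_source_ndjson(stream):
    """Yield (doc_id, source) pairs back, using only a JSON parser."""
    for line in stream:
        line = line.strip()
        if line:
            record = json.loads(line)
            yield record["_id"], record["_source"]


docs = [
    ("1", {"title": "hello", "views": 3}),
    ("2", {"title": "world", "views": 7}),
]

buffer = io.StringIO()  # stands in for a file or object-store blob
write_source_ndjson(docs, buffer)
buffer.seek(0)
restored = list(read_source_ndjson(buffer))
```

A columnar format would likely serve analytics engines better than line-oriented JSON; the sketch only shows the decoupling, not a recommended format.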

Related component

Storage

Describe alternatives you've considered

No response

Additional context

No response


backslasht · 2024-03-28T03:08:32Z

@Bukhtawar - Thanks for the proposal. Format could mean two things here: i) the format of the data represented as part of the document, and ii) the format of the data at rest (compressed and stored). Currently the index codec defines both. Are you suggesting changing both, or just the first one?

Bukhtawar · 2024-03-28T08:39:54Z

Thanks @backslasht. Here I intend to keep the data stored at rest in a format that makes it easier to plug in diverse query engines and helps data break free from the Lucene version compatibility constraints as much as possible.

Looping @reta @andrross @msfroh @tharejas @sachinpkale @gbbafna for thoughts

sachinpkale · 2024-03-28T10:09:25Z

Nice proposal!

I am trying to understand the scope of this feature request with the following questions:

For my understanding, is the source field part of a Lucene segment today? If yes, even if we change its type from a special field to a neutral type, say JSON, we still need Lucene to read the field first, right? Or are we proposing to store the source independently of segments?

> Use a query engine of their choice to query original data, even if the Lucene data formats changed

Does querying the original data from another query engine bypass OpenSearch, or does this also mean OpenSearch would support pluggable query engines?

reta · 2024-03-28T13:36:41Z

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

gbbafna · 2024-03-28T13:38:03Z

+1. I like the overall idea of decoupling the source from the engine. A couple of questions/thoughts:

- Would that mean always storing the open data format, as opposed to making it optional? The reason I am asking is that if we were somehow able to reindex without the source itself, at least in theory, that could be a cheaper alternative.
- Do we need to explore multiple/pluggable Lucene engines to get around this incompatibility problem?

andrross · 2024-03-28T16:17:36Z

> it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

@reta @Bukhtawar There indeed seems to be some overlap here with an ingestion tool like Data Prepper, where you can configure another sink alongside OpenSearch and store the data in a neutral, analytics-friendly format. The two use cases listed in this issue ("use any query engine" and "reindex seamlessly") could be solved by ingesting the original data into an additional sink. However, in that case OpenSearch has no knowledge of the other data and cannot use it the way that it uses the source field today. It's an interesting thought to consider whether we can replace the existing source field that OpenSearch knows about and uses with a neutral, more future-proof format and get the best of both worlds.

shwetathareja · 2024-03-28T16:29:34Z

Thanks @Bukhtawar for the proposal.

I definitely see the value of storing the _source field in a data format that is not bound to the Lucene engine version (considering it is just a document blob), especially for re-indexing.

anasalkouz · 2024-03-28T17:24:53Z

> As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

That's true, I don't think you can rely on the _source field, since it can be disabled.
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

Bukhtawar · 2024-03-28T17:59:53Z

> As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

> That's true, I don't think you can rely on the _source field, since it can be disabled. https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

Obviously we are talking about a new data format that would apply from newer versions onwards. Based on how the proposal goes, we can always decide to change that if we see good benefits, especially as OpenSearch has good support for durability but is constrained on data compatibility.

chelma · 2024-04-03T18:05:03Z

A few thoughts:

- Being able to guarantee access to the original documents ingested by a cluster would be awesome for enabling multi-version upgrades going forward, but all storage has a cost. I can see arguments that the feature would need to be optional for that reason.
- If we consider a store other than Lucene segments, we'll also need to tackle some of the things segments currently give us, like handling updates/deletes. Not necessarily an argument against the approach, but I wanted to point out some additional work needed to support it.
- Having the original docs outside of Lucene segments would make parsing/reindexing easier and allow us more flexibility to split up reindexing across many workers, instead of having a strong inclination towards a worker-per-shard because we want to treat the shards as separate Lucene indices in order to extract the docs.
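
As a rough illustration of the last two points, here is a minimal sketch, assuming the side store is an append-only log of `(operation, doc_id, source)` entries. The log layout and the function names are hypothetical, not an existing OpenSearch or Lucene API. Replaying the log with last-write-wins and delete tombstones recovers the live documents, and hashing the document ID then spreads them over any number of reindex workers, independent of the shard count.

```python
import zlib


def replay(log):
    """Collapse an append-only operation log into the live documents."""
    live = {}
    for op, doc_id, source in log:
        if op == "index":
            live[doc_id] = source   # a later write replaces an earlier one
        elif op == "delete":
            live.pop(doc_id, None)  # a tombstone drops any earlier version
        else:
            raise ValueError(f"unknown operation: {op}")
    return live


def assign_worker(doc_id, num_workers):
    # crc32 is stable across processes, unlike Python's salted hash()
    return zlib.crc32(doc_id.encode("utf-8")) % num_workers


def partition(live, num_workers):
    """Split live documents into one batch per reindex worker."""
    batches = [[] for _ in range(num_workers)]
    for doc_id in sorted(live):
        worker = assign_worker(doc_id, num_workers)
        batches[worker].append((doc_id, live[doc_id]))
    return batches


log = [
    ("index", "a", {"v": 1}),
    ("index", "b", {"v": 1}),
    ("index", "a", {"v": 2}),  # update to "a"
    ("delete", "b", None),     # tombstone for "b"
    ("index", "c", {"v": 1}),
]

live = replay(log)
batches = partition(live, num_workers=3)
```

A real store would also need compaction so that superseded versions and tombstones do not accumulate forever, which is part of the extra work mentioned above.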

Question:
@Bukhtawar Do we think that storing the original docs outside of Lucene would enable us to compress them better, reducing the burden of storage?

samuel-oci · 2024-04-29T16:25:02Z

Hi @Bukhtawar that's a very interesting suggestion! Some clarification questions to make sure I get it right:

Today _source is handled by a codec that extends StoredFieldsFormat in Lucene. Are you suggesting moving entirely away from Lucene's StoredFieldsFormat interface to a new interface?
Or are you suggesting keeping the Lucene interface and only extending StoredFieldsFormat with a non-default Lucene codec that can be read more easily by other systems?

Context: I currently have a working POC in which I extended the _source field to work with the Parquet format. I did so by extending Lucene's StoredFieldsFormat interface. I would love to share the pros/cons I have seen.

Bukhtawar added the enhancement and untriaged labels Mar 27, 2024

github-actions bot added the Search label Mar 27, 2024

Bukhtawar added the Storage label and removed the untriaged and Search labels Mar 27, 2024

shwetathareja added the Indexing label Mar 28, 2024

reta mentioned this issue Apr 11, 2024: [Feature] Hybrid Compression #13110 (Open)

Bukhtawar mentioned this issue May 14, 2024: [RFC] Parquet/Avro Storage Extension With External Writer #13668 (Open)
