[Feature Request] Support for open/neutral data formats for engine agnostic reads #12948

Open
Bukhtawar opened this issue Mar 27, 2024 · 11 comments
Labels
enhancement Enhancement or improvement to existing feature or request Indexing Indexing, Bulk Indexing and anything related to indexing Storage Issues and PRs relating to data and metadata storage

Comments

@Bukhtawar
Collaborator

Bukhtawar commented Mar 27, 2024

Is your feature request related to a problem? Please describe

While writing data in Lucene format enables faster queries, it also restricts queries to a compatible Lucene query engine. As data grows over time, the need to keep the engine compatible with older data formats imposes another constraint, forcing users to choose between getting the benefits of newer versions and keeping older-format data readable.
To upgrade the engine, data indexed in older formats needs to be re-indexed, which requires the data to be read from the source field with a compatible Lucene engine before individual documents can be re-indexed into the target version.

Describe the solution you'd like

The source field stores the raw doc as a special field; however, this field can only be read by a compatible Lucene version. It would be good if we could store this field in an open/neutral format. This would enable users to:

  1. Use a query engine of their choice to query original data, even if the Lucene data formats changed
  2. Be able to re-index seamlessly without having to get locked in by the source data format. For instance check the complexity involved

There could be caveats, though, around query performance when the actual doc needs to be returned, depending on the data format; this needs to be evaluated further.
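
To make the idea concrete, here is a minimal sketch of what "source in a neutral format" could look like, assuming newline-delimited JSON as the format. This is not how OpenSearch stores `_source` today; the function names (`write_source_ndjson`, `read_source_ndjson`, `reindex`) and the record layout are hypothetical and used only for illustration. The point is that the reader needs nothing but a JSON parser, so re-indexing does not depend on a compatible Lucene version.

```python
import io
import json


def write_source_ndjson(docs, out):
    """Write each (doc_id, source) pair as one JSON object per line."""
    for doc_id, source in docs:
        out.write(json.dumps({"_id": doc_id, "_source": source}, sort_keys=True))
        out.write("\n")


def read_source_ndjson(inp):
    """Yield (doc_id, source) pairs using only a JSON parser, no Lucene."""
    for line in inp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield record["_id"], record["_source"]


def reindex(inp, index_fn):
    """Replay every stored document into a target index via index_fn."""
    count = 0
    for doc_id, source in read_source_ndjson(inp):
        index_fn(doc_id, source)
        count += 1
    return count


# Hypothetical usage: an in-memory buffer stands in for the neutral store,
# and a plain dict stands in for the target index.
buffer = io.StringIO()
write_source_ndjson([("1", {"title": "hello"}), ("2", {"title": "world"})], buffer)
buffer.seek(0)
target = {}
reindexed = reindex(buffer, target.__setitem__)
```

A columnar format could be swapped in for NDJSON; the trade-off on fetch latency mentioned above would still need to be measured.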

Related component

Storage

Describe alternatives you've considered

No response

Additional context

No response

@Bukhtawar Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 27, 2024
@github-actions github-actions bot added the Search Search query, autocomplete ...etc label Mar 27, 2024
@Bukhtawar Bukhtawar added Storage Issues and PRs relating to data and metadata storage and removed untriaged Search Search query, autocomplete ...etc labels Mar 27, 2024
@backslasht
Contributor

@Bukhtawar - Thanks for the proposal. Format could mean two things here: i) the format of the data represented as part of the document, ii) the format of the data at rest (compressed and stored). Currently the index codec defines both; are you suggesting changing both or just the first one?

@Bukhtawar
Collaborator Author

Bukhtawar commented Mar 28, 2024

Thanks @backslasht. Here I intend to keep the data stored at rest in a format that makes it easier to plug in diverse query engines and helps data break free from Lucene version compatibility constraints as much as possible.

Looping @reta @andrross @msfroh @tharejas @sachinpkale @gbbafna for thoughts

@sachinpkale
Member

sachinpkale commented Mar 28, 2024

Nice proposal!

I am trying to understand the scope of this feature request with the following questions:

For my understanding, is the source field part of the Lucene segment today? If yes, even if we change its type from a special field to a neutral type, say JSON, we still need Lucene to read the field first, right? Or are we proposing to store the source independent of segments?

Use a query engine of their choice to query original data, even if the Lucene data formats changed

Does querying original data from another query engine bypass OpenSearch, or does this also mean OpenSearch would support pluggable query engines?

@shwetathareja shwetathareja added the Indexing Indexing, Bulk Indexing and anything related to indexing label Mar 28, 2024
@reta
Collaborator

reta commented Mar 28, 2024

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

@gbbafna
Collaborator

gbbafna commented Mar 28, 2024

+1. I like the overall idea of decoupling the source from the engine. A couple of questions/thoughts:

  1. Would that mean always storing the open data format, as opposed to making it optional? The reason I am asking is that if we are somehow able, theoretically, to reindex without the source itself, that could be a cheaper alternative.
  2. Do we need to explore multiple/pluggable Lucene engines to get around this problem of incompatibility?

@andrross
Member

it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

@reta @Bukhtawar There indeed seems to be some overlap here with an ingestion tool like Data Prepper, where you can configure another sink alongside OpenSearch and store the data in a neutral, analytics-friendly format. The two use cases listed in this issue ("use any query engine" and "reindex seamlessly") could be solved by ingesting the original data into an additional sink. However, in that case OpenSearch has no knowledge of the other data and cannot use it the way that it uses the source field today. It's an interesting thought to consider whether we can replace the existing source field that OpenSearch knows about and uses with a neutral, more future-proof format and kind of get the best of both worlds.

@shwetathareja
Member

Thanks @Bukhtawar for the proposal.

I definitely see the value of storing the _source field in a data format (considering it is just a document blob) that is not bound to the Lucene engine version, especially for re-indexing.

@anasalkouz
Member

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

That's true, I don't think you can rely on the _source field, since it can be disabled.
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

@Bukhtawar
Collaborator Author

As far as I remember, the source field could be disabled (and often is), so it looks to me the proposal is more about having a side store for raw data that is being ingested into the index?

That's true, I don't think you can rely on the _source field, since it can be disabled. https://www.elastic.co/guide/en/elasticsearch/reference/7.11/mapping-source-field.html#disable-source-field

Obviously we are talking about a new data format that will apply from newer versions onwards. Based on how the proposal goes, we can always decide to change that if we see good benefits, especially as OpenSearch has good support for durability but gets constrained on data compatibility.

@chelma
Member

chelma commented Apr 3, 2024

A few thoughts:

  • Being able to guarantee access to the original documents ingested by a cluster would be awesome for enabling multi-version upgrades going forward, but all storage has a cost. I can see arguments that the feature would need to be optional for that reason.
  • If we consider a store other than Lucene segments, we'll also need to tackle some of the things those currently give us, like handling updates/deletes. Not necessarily an argument against the approach, but just wanted to point out some additional work needed to support it.
  • Having the original docs outside of Lucene segments would make parsing/reindexing easier and allow us more flexibility to split up reindexing across many workers instead of having a strong inclination to have a worker-per-shard due to wanting to treat the shards as separate Lucene indices in order to extract the docs.
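
As a rough illustration of the last two points, the sketch below assumes the side store is an append-only log of operations. That layout, the `op` values, and the helper names (`compact`, `partition`) are hypothetical, not anything OpenSearch provides. Deletes are recorded as tombstone entries, updates as a later `index` entry for the same id, and the surviving docs are then split across workers by a stable hash of the doc id rather than by shard.

```python
import zlib


def compact(log):
    """Collapse an append-only log into the latest live version of each doc."""
    live = {}
    for entry in log:
        if entry["op"] == "delete":
            live.pop(entry["_id"], None)  # tombstone removes earlier versions
        else:  # "index" covers both create and update
            live[entry["_id"]] = entry["_source"]
    return live


def partition(live, num_workers):
    """Assign each doc to a worker by a stable hash of its id."""
    buckets = [dict() for _ in range(num_workers)]
    for doc_id, source in live.items():
        worker = zlib.crc32(doc_id.encode("utf-8")) % num_workers
        buckets[worker][doc_id] = source
    return buckets


# Hypothetical log: doc "a" is updated once, doc "b" is deleted.
log = [
    {"op": "index", "_id": "a", "_source": {"v": 1}},
    {"op": "index", "_id": "b", "_source": {"v": 1}},
    {"op": "index", "_id": "a", "_source": {"v": 2}},
    {"op": "delete", "_id": "b"},
]
live = compact(log)
buckets = partition(live, 3)
```

Because the partitioning depends only on the doc id, the number of workers can be chosen independently of the shard count.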

Question:
@Bukhtawar Do we think that storing the original docs outside of Lucene would enable us to compress them better, reducing the burden of storage?

@samuel-oci
Contributor

Hi @Bukhtawar that's a very interesting suggestion! Some clarification questions to make sure I get it right:

  1. Today _source is a codec that extends StoredFieldsFormat in Lucene. Are you suggesting moving entirely away from Lucene's StoredFieldsFormat interface to a new interface?
  2. Or are you suggesting keeping the Lucene interface and only extending StoredFieldsFormat with a non-default Lucene codec that can be more easily read by other systems?

Context: I currently have a working POC in which I extended the _source field to work with the Parquet format. I have done so by extending StoredFieldsFormat in the Lucene interfaces. I would love to share any cons/pros I have seen.

10 participants