Extending Neural Search pipeline to Named entity recognition and other metadata extracting models #134

Open
navneet1v opened this issue Mar 13, 2023 · 10 comments
Labels: backlog, Enhancements, Features, neural-search

Comments

@navneet1v
Collaborator

navneet1v commented Mar 13, 2023

Copying the customer request from Forum post: https://forum.opensearch.org/t/extending-neural-search-pipeline-to-named-entity-recognition-and-other-metadata-extracting-models/13078

I have a use case that involves running a named entity recognition (NER) model over documents at indexing time and over queries at query time. Documents would then be filtered by whether their extracted entities match the entities extracted from the query. The pipeline would work like the existing neural search pipeline, with one difference: queries and documents are passed through a NER model and enriched with extra metadata such as entities, instead of with vectors from an embedding model.

It would help if the neural-search pipeline could be extended to include model(s) for named entity extraction, embeddings, image segmentation (finding image components for image search), etc., so that the query/document picks up enough metadata from the various models in my neural search pipeline before matching.
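The filtering described in this request could be sketched roughly as follows. This is only an illustration under stated assumptions: `extract_entities` and `KNOWN_ENTITIES` are hypothetical stand-ins for a real NER model, and none of the names below are part of the Neural Search plugin or ml-commons.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for a NER model: a tiny gazetteer lookup.
KNOWN_ENTITIES = {"opensearch", "amazon", "seattle", "kubernetes"}


def extract_entities(text: str) -> Set[str]:
    """Return the known entities mentioned in `text` (toy NER substitute)."""
    tokens = {token.strip(".,!?").lower() for token in text.split()}
    return tokens & KNOWN_ENTITIES


def index_documents(docs: List[str]) -> List[Dict]:
    """Attach extracted entities to each document at index time."""
    return [{"text": doc, "entities": extract_entities(doc)} for doc in docs]


def search(query: str, index: List[Dict]) -> List[str]:
    """Keep only documents that share at least one entity with the query."""
    query_entities = extract_entities(query)
    return [d["text"] for d in index if d["entities"] & query_entities]


index = index_documents([
    "Amazon has offices in Seattle.",
    "Kubernetes schedules containers.",
    "OpenSearch runs well on Kubernetes.",
])
results = search("How do I deploy on Kubernetes?", index)
print(results)
```

The same extractor runs on both sides, which mirrors how the existing pipeline applies one embedding model to documents and queries; only the produced metadata differs.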

Please add a +1 if you are looking for this feature. If possible, leave a comment explaining your use case.

navneet1v added the Enhancements and untriaged labels Mar 13, 2023
@navneet1v
Collaborator Author

@ylwu-amzn does the ML plugin API support named entity recognition models?

@MShyani how do you think this would impact indexing and queries?

@MilindShyani

MilindShyani commented Mar 13, 2023

I am not sure what's the best way to implement this. Perhaps one method would be to use a cross encoder model.

In this architecture, you first retrieve the top k documents d_i for a query q and then pass each pair (q, d_i), where i ranges from 1 to k, to the model. This model, which can be an NER model, can then be used to rerank the passages. I don't think this is straightforward to implement with the current plugins, and it is also computationally expensive (since the transformer makes k passes).
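A rough sketch of that retrieve-then-rerank flow, under loud assumptions: the first stage here is a simple token-overlap ranking and `toy_pair_score` is a Jaccard-overlap placeholder, not a real cross-encoder or NER model. All function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens of `text`."""
    return set(text.lower().split())


def retrieve_top_k(query: str, corpus: List[str], k: int) -> List[str]:
    """First stage: cheap retrieval by number of tokens shared with the query."""
    query_tokens = tokens(query)
    ranked = sorted(corpus, key=lambda doc: len(query_tokens & tokens(doc)), reverse=True)
    return ranked[:k]


def toy_pair_score(query: str, doc: str) -> float:
    """Placeholder for the pair model: Jaccard overlap of (q, d_i)."""
    q, d = tokens(query), tokens(doc)
    return len(q & d) / len(q | d)


def rerank(
    query: str,
    candidates: List[str],
    score_pair: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Second stage: one model call per candidate, so k calls in total."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


corpus = [
    "paris is the capital of france",
    "the capital of texas is austin",
    "france won the world cup",
    "bananas are yellow",
]
query = "capital of france"
candidates = retrieve_top_k(query, corpus, k=3)
reranked = rerank(query, candidates, toy_pair_score)
```

The cost concern is visible in `rerank`: the pair model is invoked once per candidate, so latency grows linearly with k.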

Note that there is another way: a model can read the query, find the named entities, and look for those entities in the document corpus. But this is (almost) exactly what a neural retriever does when it creates a vector for the query and looks for nearest neighbors!

There could be other ways, but I can't think of any off the top of my head.

@navneet1v
Collaborator Author

navneet1v commented Mar 13, 2023

@MilindShyani thanks for the update.

Let me do some research on how NER models work and see if I can come up with a proposed solution that can be added as a feature in the Neural Search plugin.

@ylwu-amzn

ml-commons doesn't support named entity recognition models right now.

@prasadnu

prasadnu commented Mar 14, 2023

To be a bit clearer, I was thinking the neural search pipeline could be extended so that it can be used not only for retrieving vectors from an embedding model, but also for retrieving any other metadata, such as entities (for both docs and queries), from a NER model.

Currently, before creating a neural search pipeline, we have to upload and load an ML model that provides embeddings (refer to the screenshots). This is limited to models that provide embeddings; if it could be extended so that any metadata model, such as NER, can be uploaded and used to create a neural search pipeline, it would be generic.
image
image

navneet1v added the backlog label Mar 22, 2023
@CodeAKrome

CodeAKrome commented Mar 23, 2023

I'm doing NER by putting my OpenSearch data stream through a container that injects the entities during forwarding: [data src] -> [injector] -> [opensearch/_bulk]. Would this be of any use to anyone, do you think? I looked at the PRs and poked around a bit and didn't see anything but this thread. I'm pulling RSS feeds. My goal is to get this working in Kubernetes so I can scale it.
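For readers picturing the injector stage, here is a minimal sketch, not the actual container described above. It assumes a placeholder `extract_entities` (capitalized words standing in for a NER model), a hypothetical `rss-feed` index name, and a hypothetical `title` field; it only builds the newline-delimited body that the `_bulk` endpoint accepts and does not send anything.

```python
import json
from typing import Dict, Iterable, Iterator, List


def extract_entities(text: str) -> List[str]:
    """Hypothetical NER stand-in: treat capitalized words as entities."""
    words = (word.strip(".,!?") for word in text.split())
    return sorted({word for word in words if word[:1].isupper()})


def inject_entities(docs: Iterable[Dict], text_field: str = "title") -> Iterator[Dict]:
    """Copy each document and add an `entities` field before forwarding."""
    for doc in docs:
        yield {**doc, "entities": extract_entities(doc.get(text_field, ""))}


def to_bulk_body(docs: Iterable[Dict], index_name: str) -> str:
    """Build an NDJSON bulk body: one action line, then one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


feed_items = [{"title": "OpenSearch ships on Kubernetes"}]
bulk_body = to_bulk_body(inject_entities(feed_items), "rss-feed")
print(bulk_body)
```

Keeping the injector as a pure transformation over documents makes it easy to run several replicas side by side, which fits the scaling goal mentioned above.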

navneet1v added the Features label Mar 28, 2023
@rs-amundaware

https://www.elastic.co/blog/how-to-deploy-nlp-named-entity-recognition-ner-example
ES provides this solution. Do we, or can we, have this feature in OpenSearch as well? Please let me know if it already exists.

@navneet1v
Collaborator Author

@rs-amundaware I think there was an issue in ML-Commons tracking the addition of new model types via the ML-Commons plugin: opensearch-project/ml-commons#1164

@rs-amundaware

@navneet1v Thanks. Yes, waiting for that feature eagerly.

@q-andy
Contributor

q-andy commented Jan 10, 2025

Hi, could you assign this to me?

heemin32 assigned q-andy and unassigned vibrantvarun Jan 10, 2025
heemin32 moved this from Backlog to Backlog(Hot) in Neural Search RoadMap Jan 10, 2025
heemin32 moved this from Backlog(Hot) to 3.0 in Neural Search RoadMap Jan 11, 2025
Projects
Status: 3.0

9 participants