forked from microsoft/semantic-kernel
Commit
Python: Introducing vector and text search (microsoft#9345)
### Motivation and Context

This PR does the following:

- Introduces TextSearch abstractions, including an implementation for Bing.
  - This consists of the TextSearch class, which implements three public search methods and handles the internals. The search methods are: `search`, which returns a string; `get_text_search_results`, which returns a TextSearchResult object; and `get_search_results`, which returns an object native to the search service (e.g. BingWebPages for Bing).
  - It also has a method called `create_{search_method}`, which returns a KernelFunction based on the search method. The function can be adapted by setting its parameters, and both the functions and the parameters can be given custom names and descriptions, which makes it easy to build a RAG pipeline.
- Introduces VectorSearch abstractions, including an implementation for Azure AI Search.
  - This consists of a VectorSearchBase class, which handles all the internals, and three public interfaces: `vectorized_search` (supply a vector), `vectorizable_text_search` (supply a string that gets vectorized downstream) and `vector_text_search` (supply a string). Each vector store record collection can pick which ones it supports by importing one or more of them next to the VectorSearchBase class.
- Introduces VectorStoreTextSearch as a way to leverage text search against vector stores.
  - Since this builds on TextSearch, it is now the best way to create a powerful RAG setup with your own data model.
- Adds all the related classes, samples and tests for the above.
- Also reorders the data folder, which might cause some slight breaking changes for the few stores that have the new vector store model.
- Adds additional IndexKinds and DistanceFunctions to stay in sync with dotnet.
- Renames VolatileStore and VolatileCollection to InMemoryVectorStore and InMemoryVectorCollection.

Closes microsoft#6832 microsoft#6833

### Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [ ] I didn't break anyone 😄

---------

Co-authored-by: Tao Chen <taochen@microsoft.com>
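The "pick and choose" design described for the vector search interfaces can be pictured as a shared base class plus one small mixin per public search method. The following is a minimal sketch of that pattern only; every name in it (`ToySearchBase`, `TextSearchMixin`, `VectorizedSearchMixin`, `ToyTextOnlyCollection`, `_inner_search`) is hypothetical and is not taken from the semantic-kernel code base.

```python
from abc import ABC, abstractmethod
from typing import Optional


class ToySearchBase(ABC):
    """Hypothetical base class: owns the records and the single internal search hook."""

    def __init__(self, records: dict[str, str]) -> None:
        self.records = records

    @abstractmethod
    def _inner_search(
        self, text: Optional[str] = None, vector: Optional[list[float]] = None
    ) -> list[str]:
        """Each store implements the actual lookup once."""


class TextSearchMixin:
    """Hypothetical public interface: search with a plain string."""

    def text_search(self, search_text: str) -> list[str]:
        return self._inner_search(text=search_text)


class VectorizedSearchMixin:
    """Hypothetical public interface: search with a precomputed vector."""

    def vectorized_search(self, vector: list[float]) -> list[str]:
        return self._inner_search(vector=vector)


class ToyTextOnlyCollection(ToySearchBase, TextSearchMixin):
    """A store that opts into text search only, so it exposes no vectorized_search."""

    def _inner_search(self, text=None, vector=None):
        if text is None:
            raise NotImplementedError("this toy store only supports text search")
        words = text.lower().split()
        # Return the keys of records that contain at least one query word.
        return [
            key
            for key, value in self.records.items()
            if any(word in value.lower() for word in words)
        ]


collection = ToyTextOnlyCollection({"1": "Hotel with a swimming pool", "2": "Budget motel"})
print(collection.text_search("pool"))
```

A store that also supports vector input would add `VectorizedSearchMixin` to its bases and handle the `vector` argument in `_inner_search`; callers can then check for a capability with `hasattr` or `isinstance`.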
1 parent 7ca11a9, commit c8b4094
Showing 128 changed files with 5,020 additions and 2,064 deletions.
63 changes: 63 additions & 0 deletions
python/samples/concepts/memory/azure_ai_search_hotel_samples/step_0_data_model.py
@@ -0,0 +1,63 @@
# Copyright (c) Microsoft. All rights reserved.


from typing import Annotated, Any

from pydantic import BaseModel

from semantic_kernel.connectors.ai.open_ai import OpenAIEmbeddingPromptExecutionSettings
from semantic_kernel.data import (
    VectorStoreRecordDataField,
    VectorStoreRecordKeyField,
    VectorStoreRecordVectorField,
    vectorstoremodel,
)

###
# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
# This is the dataset used in this sample with some modifications.
# This model adds vectors for the 2 descriptions in English and French.
# Both are based on the 1536 dimensions of the OpenAI models.
# You can adjust this at creation time and then make the change below as well.
###


@vectorstoremodel
class HotelSampleClass(BaseModel):
    hotel_id: Annotated[str, VectorStoreRecordKeyField]
    hotel_name: Annotated[str | None, VectorStoreRecordDataField()] = None
    description: Annotated[
        str,
        VectorStoreRecordDataField(
            has_embedding=True, embedding_property_name="description_vector", is_full_text_searchable=True
        ),
    ]
    description_vector: Annotated[
        list[float] | None,
        VectorStoreRecordVectorField(
            dimensions=1536,
            local_embedding=True,
            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
        ),
    ] = None
    description_fr: Annotated[
        str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="description_fr_vector")
    ]
    description_fr_vector: Annotated[
        list[float] | None,
        VectorStoreRecordVectorField(
            dimensions=1536,
            local_embedding=True,
            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
        ),
    ] = None
    category: Annotated[str, VectorStoreRecordDataField()]
    tags: Annotated[list[str], VectorStoreRecordDataField()]
    parking_included: Annotated[bool | None, VectorStoreRecordDataField()] = None
    last_renovation_date: Annotated[str | None, VectorStoreRecordDataField()] = None
    rating: Annotated[float, VectorStoreRecordDataField()]
    location: Annotated[dict[str, Any], VectorStoreRecordDataField()]
    address: Annotated[dict[str, str | None], VectorStoreRecordDataField()]
    rooms: Annotated[list[dict[str, Any]], VectorStoreRecordDataField()]
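The data model above links each text field to its vector field through `Annotated` metadata (`has_embedding=True` plus `embedding_property_name`). As a rough illustration of how such metadata can be read back at runtime, here is a standard-library-only sketch. `DataField`, `VectorField`, `Hotel` and `embedding_pairs` are hypothetical stand-ins invented for this example; how semantic-kernel itself processes these annotations is not shown in this diff and may differ.

```python
from dataclasses import dataclass
from typing import Annotated, Optional, get_type_hints


@dataclass(frozen=True)
class DataField:
    """Hypothetical stand-in for a data-field marker."""

    has_embedding: bool = False
    embedding_property_name: Optional[str] = None


@dataclass(frozen=True)
class VectorField:
    """Hypothetical stand-in for a vector-field marker."""

    dimensions: int


class Hotel:
    """Toy model that mirrors the shape of the sample, with only three fields."""

    hotel_id: Annotated[str, "key"]
    description: Annotated[
        str, DataField(has_embedding=True, embedding_property_name="description_vector")
    ]
    description_vector: Annotated[Optional[list[float]], VectorField(dimensions=1536)] = None


def embedding_pairs(model: type) -> dict[str, tuple[str, int]]:
    """Map each embeddable data field to (vector field name, vector dimensions)."""
    hints = get_type_hints(model, include_extras=True)
    pairs: dict[str, tuple[str, int]] = {}
    for name, hint in hints.items():
        # Annotated types expose their extra arguments via __metadata__.
        for meta in getattr(hint, "__metadata__", ()):
            if isinstance(meta, DataField) and meta.has_embedding:
                target = meta.embedding_property_name
                vector_meta = next(
                    m for m in hints[target].__metadata__ if isinstance(m, VectorField)
                )
                pairs[name] = (target, vector_meta.dimensions)
    return pairs


print(embedding_pairs(Hotel))
```

Keeping the link in the annotation means the model class stays the single source of truth: a helper can discover which text to embed and where to store the result without a separate mapping.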
104 changes: 104 additions & 0 deletions
...ples/concepts/memory/azure_ai_search_hotel_samples/step_1_interact_with_the_collection.py
@@ -0,0 +1,104 @@
# Copyright (c) Microsoft. All rights reserved.

import asyncio

###
# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
# This is the dataset used in this sample with some modifications.
# This model adds vectors for the 2 descriptions in English and French.
# Both are based on the 1536 dimensions of the OpenAI models.
# You can adjust this at creation time and then make the change below as well.
# This sample assumes the index is deployed, the vector fields can be empty.
# If the vector fields are empty, change the first_run parameter to True to add the vectors.
###
from step_0_data_model import HotelSampleClass

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
from semantic_kernel.connectors.memory.azure_ai_search import AzureAISearchCollection
from semantic_kernel.data import (
    VectorSearchOptions,
    VectorStoreRecordUtils,
)

first_run = False


async def add_vectors(collection: AzureAISearchCollection, vectorizer: VectorStoreRecordUtils):
    """This is a simple function that uses the VectorStoreRecordUtils to add vectors to the records in the collection.
    It first uses the search_client within the collection to get a list of ids,
    and then uses the upsert to add the vectors to the records.
    """
    ids: list[str] = [res.get("hotel_id") async for res in await collection.search_client.search(select="hotel_id")]
    print("sample id:", ids[0])

    hotels = await collection.get_batch(ids)
    if hotels is not None and isinstance(hotels, list):
        for hotel in hotels:
            if not hotel.description_vector or not hotel.description_fr_vector:
                hotel = await vectorizer.add_vector_to_records(hotel, HotelSampleClass)
                await collection.upsert(hotel)


async def main(query: str, first_run: bool = False):
    # Create the kernel
    kernel = Kernel()
    # Add the OpenAI text embedding service
    embeddings = OpenAITextEmbedding(service_id="embedding", ai_model_id="text-embedding-3-small")
    kernel.add_service(embeddings)
    # Create the VectorStoreRecordUtils object
    vectorizer = VectorStoreRecordUtils(kernel)
    # Create the Azure AI Search collection
    collection = AzureAISearchCollection[HotelSampleClass](
        collection_name="hotels-sample-index", data_model_type=HotelSampleClass
    )
    # Check if the collection exists.
    if not await collection.does_collection_exist():
        # Note the trailing spaces: adjacent string literals are concatenated as-is.
        raise ValueError(
            "Collection does not exist, please create using the "
            "Azure AI Search portal wizard -> Import Data -> Samples -> hotels-sample. "
            "During creation adapt the schema to add the description_vector and description_fr_vector fields. "
            "Then run this sample with `first_run=True` to add the vectors."
        )

    # If it is the first run and there are no vectors, add them.
    if first_run:
        await add_vectors(collection, vectorizer)

    # Search using just text, by default this will search all the searchable text fields in the index.
    results = await collection.text_search(search_text=query)
    print("Search results using text: ")
    async for result in results.results:
        print(
            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
        )

    print("\n")

    # Generate the vector for the query
    query_vector = (await embeddings.generate_raw_embeddings([query]))[0]

    print("Search results using vector: ")
    # Use vectorized search to search using the vector.
    results = await collection.vectorized_search(
        vector=query_vector,
        options=VectorSearchOptions(vector_field_name="description_vector"),
    )
    async for result in results.results:
        print(
            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
        )

    # Delete the collection object so that the connection is closed.
    del collection
    await asyncio.sleep(2)


if __name__ == "__main__":
    query = "swimming pool and good internet connection"
    asyncio.run(main(query=query, first_run=first_run))
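The sample delegates the vector comparison to Azure AI Search, so the scoring itself is not visible in this file. To give an intuition for what a vectorized search returns, here is a small in-memory sketch that ranks records by cosine similarity to a query vector. It is an illustration under assumptions: `cosine_similarity` and `toy_vector_search` are hypothetical helpers, the two-dimensional vectors are made up, and the real service uses its own index and configured distance function rather than this brute-force loop.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_vector_search(
    records: dict[str, list[float]], query_vector: list[float], top: int = 3
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the best `top` matches."""
    scored = [
        (record_id, cosine_similarity(vector, query_vector))
        for record_id, vector in records.items()
    ]
    # Highest similarity first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top]


# Made-up 2-dimensional "embeddings" keyed by a hypothetical hotel id.
stored = {"hotel-a": [1.0, 0.0], "hotel-b": [0.0, 1.0], "hotel-c": [1.0, 1.0]}
for record_id, score in toy_vector_search(stored, [1.0, 0.0], top=2):
    print(f"{record_id} (score: {score:.3f})")
```

In the sample, the equivalent of `stored` is the `description_vector` field of each indexed hotel, and the query vector comes from `embeddings.generate_raw_embeddings([query])`.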