Python: Introducing vector and text search (microsoft#9345)

### Motivation and Context  This PR does the following things: - Introduces TextSearch abstractions, including implementation for Bing - This consists of the TextSearch class, which implements three public search methods, and handles the internals, the search methods are: 'search' returns a string, 'get_text_search_results' returns a TextSearchResult object and 'get_search_results' returns a object native to the search service (i.e. BingWebPages for Bing) - This also has a method called "create_{search_method}' which returns a KernelFunction based on the search method. This function can be adapted by setting the parameters and has several adaptability options and allows you to create a RAG pipeline easily with custom names and descriptions of both the functions and the parameters! - Introduces VectorSearch abstractions, including implementation for Azure AI Search - This consists of a VectorStoreBase class which handles all the internal and three public interfaces, vectorized_search (supply a vector), vectorizable_text_search (supply a string that get's vectorized downstream), vector_text_search (supply a string), each vector store record collection can pick and choose which ones they need to support by importing one or more next to the VectorSearchBase class. - Introduces VectorStoreTextSearch as a way to leverage text search against vector stores - Since this builds on TextSearch this is now the best way to create a super powerfull RAG setup with your own data model! - Adds all the related classes, samples and tests for the above. - Also reorders the data folder, which might cause some slight breaking changes for the few stores that have the new vector store model. - Adds additional IndexKinds and DistanceFunctions to stay in sync with dotnet. - Renames VolatileStore and VolatileCollection to InMemoryVectorStore and InMemoryVectorCollection. Closes microsoft#6832 microsoft#6833 ### Contribution Checklist  - [x] The code builds clean without any errors or warnings - [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations - [x] All unit tests pass, and I have added new tests where possible - [ ] I didn't break anyone 😄 --------- Co-authored-by: Tao Chen <taochen@microsoft.com>
Richasy · Nov 6, 2024 · c8b4094 · c8b4094
1 parent 7ca11a9
commit c8b4094
Show file tree

Hide file tree

Showing 128 changed files with 5,020 additions and 2,064 deletions.
diff --git a/.github/workflows/python-integration-tests.yml b/.github/workflows/python-integration-tests.yml
diff --git a/python/.cspell.json b/python/.cspell.json
@@ -63,6 +63,14 @@
         "vectorizer",
         "vectorstoremodel",
         "vertexai",
-        "Weaviate"
+        "Weaviate",
+        "qdrant",
+        "huggingface",
+        "pytestmark",
+        "contoso",
+        "opentelemetry",
+        "SEMANTICKERNEL",
+        "OTEL",
+        "vectorizable"
     ]
-}
+}
diff --git a/python/.pre-commit-config.yaml b/python/.pre-commit-config.yaml
@@ -49,7 +49,7 @@ repos:
     - id: mypy
       files: ^python/semantic_kernel/
       name: mypy
-      entry: uv run mypy -p semantic_kernel --config-file python/mypy.ini
+      entry: cd python && uv run mypy -p semantic_kernel --config-file mypy.ini
       language: system
       types: [python]
       pass_filenames: false

diff --git a/python/.vscode/launch.json b/python/.vscode/launch.json
@@ -10,7 +10,7 @@
             "request": "launch",
             "program": "${file}",
             "console": "integratedTerminal",
-            "justMyCode": true
+            "justMyCode": false
         }
     ]
 }
diff --git a/python/pyproject.toml b/python/pyproject.toml
@@ -56,7 +56,7 @@ azure = [
     "azure-cosmos ~= 4.7"
 ]
 chroma = [
-    "chromadb >= 0.4,<0.6"
+    "chromadb >= 0.5,<0.6"
 ]
 google = [
     "google-cloud-aiplatform ~= 1.60",
@@ -79,7 +79,7 @@ milvus = [
     "milvus >= 2.3,<2.3.8; platform_system != 'Windows'"
 ]
 mistralai = [
-    "mistralai >= 0.4,< 2.0"
+    "mistralai >= 0.4,< 1.0"
 ]
 ollama = [
     "ollama ~= 0.2"
@@ -140,6 +140,7 @@ environments = [
 
 [tool.pytest.ini_options]
 addopts = "-ra -q -r fEX"
+asyncio_default_fixture_loop_scope = "function"
 
 [tool.ruff]
 line-length = 120

diff --git a/python/samples/concepts/memory/azure_ai_search_hotel_samples/__init__.py b/python/samples/concepts/memory/azure_ai_search_hotel_samples/__init__.py
diff --git a/python/samples/concepts/memory/azure_ai_search_hotel_samples/step_0_data_model.py b/python/samples/concepts/memory/azure_ai_search_hotel_samples/step_0_data_model.py
@@ -0,0 +1,63 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+
+from typing import Annotated, Any
+
+from pydantic import BaseModel
+
+from semantic_kernel.connectors.ai.open_ai import OpenAIEmbeddingPromptExecutionSettings
+from semantic_kernel.data import (
+    VectorStoreRecordDataField,
+    VectorStoreRecordKeyField,
+    VectorStoreRecordVectorField,
+    vectorstoremodel,
+)
+
+###
+# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
+# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
+# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
+# This is the dataset used in this sample with some modifications.
+# This model adds vectors for the 2 descriptions in English and French.
+# Both are based on the 1536 dimensions of the OpenAI models.
+# You can adjust this at creation time and then make the change below as well.
+###
+
+
+@vectorstoremodel
+class HotelSampleClass(BaseModel):
+    hotel_id: Annotated[str, VectorStoreRecordKeyField]
+    hotel_name: Annotated[str | None, VectorStoreRecordDataField()] = None
+    description: Annotated[
+        str,
+        VectorStoreRecordDataField(
+            has_embedding=True, embedding_property_name="description_vector", is_full_text_searchable=True
+        ),
+    ]
+    description_vector: Annotated[
+        list[float] | None,
+        VectorStoreRecordVectorField(
+            dimensions=1536,
+            local_embedding=True,
+            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
+        ),
+    ] = None
+    description_fr: Annotated[
+        str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name="description_fr_vector")
+    ]
+    description_fr_vector: Annotated[
+        list[float] | None,
+        VectorStoreRecordVectorField(
+            dimensions=1536,
+            local_embedding=True,
+            embedding_settings={"embedding": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},
+        ),
+    ] = None
+    category: Annotated[str, VectorStoreRecordDataField()]
+    tags: Annotated[list[str], VectorStoreRecordDataField()]
+    parking_included: Annotated[bool | None, VectorStoreRecordDataField()] = None
+    last_renovation_date: Annotated[str | None, VectorStoreRecordDataField()] = None
+    rating: Annotated[float, VectorStoreRecordDataField()]
+    location: Annotated[dict[str, Any], VectorStoreRecordDataField()]
+    address: Annotated[dict[str, str | None], VectorStoreRecordDataField()]
+    rooms: Annotated[list[dict[str, Any]], VectorStoreRecordDataField()]
diff --git a/...ples/concepts/memory/azure_ai_search_hotel_samples/step_1_interact_with_the_collection.py b/...ples/concepts/memory/azure_ai_search_hotel_samples/step_1_interact_with_the_collection.py
@@ -0,0 +1,104 @@
+# Copyright (c) Microsoft. All rights reserved.
+
+import asyncio
+
+###
+# The data model used for this sample is based on the hotel data model from the Azure AI Search samples.
+# When deploying a new index in Azure AI Search using the import wizard you can choose to deploy the 'hotel-samples'
+# dataset, see here: https://learn.microsoft.com/en-us/azure/search/search-get-started-portal.
+# This is the dataset used in this sample with some modifications.
+# This model adds vectors for the 2 descriptions in English and French.
+# Both are based on the 1536 dimensions of the OpenAI models.
+# You can adjust this at creation time and then make the change below as well.
+# This sample assumes the index is deployed, the vector fields can be empty.
+# If the vector fields are empty, change the first_run parameter to True to add the vectors.
+###
+from step_0_data_model import HotelSampleClass
+
+from semantic_kernel import Kernel
+from semantic_kernel.connectors.ai.open_ai import OpenAITextEmbedding
+from semantic_kernel.connectors.memory.azure_ai_search import AzureAISearchCollection
+from semantic_kernel.data import (
+    VectorSearchOptions,
+    VectorStoreRecordUtils,
+)
+
+first_run = False
+
+
+async def add_vectors(collection: AzureAISearchCollection, vectorizer: VectorStoreRecordUtils):
+    """This is a simple function that uses the VectorStoreRecordUtils to add vectors to the records in the collection.
+
+    It first uses the search_client within the collection to get a list of ids.
+    and then uses the upsert to add the vectors to the records.
+    """
+    ids: list[str] = [res.get("hotel_id") async for res in await collection.search_client.search(select="hotel_id")]
+    print("sample id:", ids[0])
+
+    hotels = await collection.get_batch(ids)
+    if hotels is not None and isinstance(hotels, list):
+        for hotel in hotels:
+            if not hotel.description_vector or not hotel.description_fr_vector:
+                hotel = await vectorizer.add_vector_to_records(hotel, HotelSampleClass)
+                await collection.upsert(hotel)
+
+
+async def main(query: str, first_run: bool = False):
+    # Create the kernel
+    kernel = Kernel()
+    # Add the OpenAI text embedding service
+    embeddings = OpenAITextEmbedding(service_id="embedding", ai_model_id="text-embedding-3-small")
+    kernel.add_service(embeddings)
+    # Create the VectorStoreRecordUtils object
+    vectorizer = VectorStoreRecordUtils(kernel)
+    # Create the Azure AI Search collection
+    collection = AzureAISearchCollection[HotelSampleClass](
+        collection_name="hotels-sample-index", data_model_type=HotelSampleClass
+    )
+    # Check if the collection exists.
+    if not await collection.does_collection_exist():
+        raise ValueError(
+            "Collection does not exist, please create using the "
+            "Azure AI Search portal wizard -> Import Data -> Samples -> hotels-sample."
+            "During creation adopt the schema to add the description_vector and description_fr_vector fields."
+            "Then run this sample with `first_run=True` to add the vectors."
+        )
+
+    # If it is the first run and there are no vectors, add them.
+    if first_run:
+        await add_vectors(collection, vectorizer)
+
+    # Search using just text, by default this will search all the searchable text fields in the index.
+    results = await collection.text_search(search_text=query)
+    print("Search results using text: ")
+    async for result in results.results:
+        print(
+            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
+            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
+        )
+
+    print("\n")
+
+    # Generate the vector for the query
+    query_vector = (await embeddings.generate_raw_embeddings([query]))[0]
+
+    print("Search results using vector: ")
+    # Use vectorized search to search using the vector.
+    results = await collection.vectorized_search(
+        vector=query_vector,
+        options=VectorSearchOptions(vector_field_name="description_vector"),
+    )
+    async for result in results.results:
+        print(
+            f"    {result.record.hotel_id} (in {result.record.address['city']}, "
+            f"{result.record.address['country']}): {result.record.description} (score: {result.score})"
+        )
+
+    # Delete the collection object so that the connection is closed.
+    del collection
+    await asyncio.sleep(2)
+
+
+if __name__ == "__main__":
+    query = "swimming pool and good internet connection"
+    asyncio.run(main(query=query, first_run=first_run))