This project demonstrates how to build an image similarity search application that uses Azure Cosmos DB for PostgreSQL as a vector database and Azure AI Vision to generate embeddings. It is a starting point for developing more sophisticated vector search solutions.
In this sample application, we will explore image similarity search on Azure Cosmos DB for PostgreSQL using the SemArt Dataset. This dataset contains approximately 21k paintings gathered from the Web Gallery of Art. Each painting comes with various attributes, like a title, description, and the name of the artist.
Before you start, ensure that you have the following prerequisites installed and configured:
- An Azure subscription - Create an Azure free account or an Azure for Students account.
- An Azure AI Vision resource or a multi-service resource for Azure AI services - It is recommended to use the standard tier because the free tier allows only 20 transactions per minute.
  The multi-modal embeddings APIs are available in the following regions: East US, France Central, Korea Central, North Europe, Southeast Asia, West Europe, West US.
- An Azure Storage account - Create an Azure Storage account using the Azure CLI.
- An Azure Cosmos DB for PostgreSQL cluster - Create an Azure Cosmos DB for PostgreSQL cluster in the Azure portal. You should also activate the pgvector extension.
- Python 3.10, Visual Studio Code, Jupyter Notebook, and Jupyter Extension for Visual Studio Code.
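
As a rough illustration of the "activate the pgvector extension" prerequisite, the sketch below prepares connection settings and the statement that enables the extension. It assumes the `psycopg2` driver (this README does not say which driver the notebooks use), the default `citus` user and database names, and the `CREATE_EXTENSION` helper described in the Azure documentation; verify all of these against your own cluster before relying on them.

```python
def cosmos_pg_connection_params(host, password, user="citus", dbname="citus"):
    """Return keyword arguments for a PostgreSQL driver such as psycopg2.

    The user and database defaults are assumptions; adjust them to your cluster.
    """
    return {
        "host": host,
        "port": 5432,
        "dbname": dbname,
        "user": user,
        "password": password,
        "sslmode": "require",  # always connect over an encrypted channel
    }


# Statement assumed for Azure Cosmos DB for PostgreSQL. On a plain PostgreSQL
# server the usual equivalent is `CREATE EXTENSION vector;`.
ENABLE_PGVECTOR_SQL = "SELECT CREATE_EXTENSION('vector');"


def enable_pgvector(params):
    """Connect and enable pgvector. Requires network access and psycopg2."""
    import psycopg2  # third-party driver, imported lazily (assumed installed)

    conn = psycopg2.connect(**params)
    try:
        # The connection context manager commits on success.
        with conn, conn.cursor() as cur:
            cur.execute(ENABLE_PGVECTOR_SQL)
    finally:
        conn.close()
```

Calling `enable_pgvector(cosmos_pg_connection_params(host, password))` once per cluster should be enough; the host and password come from your own deployment.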
Before running the Python scripts and Jupyter Notebooks, you should:
- Clone this repository to have it available locally.
- Download the SemArt Dataset into the semart_dataset directory.
- Create a virtual environment and activate it.
- Install the required Python packages using the following command:
  `pip install -r requirements.txt`
- Create a .env file based on the .env.sample file provided in this repository.
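
To show how the values in the .env file might be consumed, here is a minimal standard-library sketch that parses `KEY=VALUE` lines and reports missing settings. The variable names in the usage note are hypothetical; use the names from .env.sample. The project itself may load the file differently (for example with the `python-dotenv` package).

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove optional surrounding quotes from the value.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(settings, required):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not settings.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()), ["VISION_ENDPOINT", "VISION_KEY"])` returns the settings that still need a value, where both names are placeholders for whatever .env.sample defines.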
Sample | Description |
---|---|
Data Preprocessing | Cleans up the SemArt Dataset and creates the final dataset that is used in our application. |
Embeddings Generation | Generates vector embeddings for the images in the dataset using the Azure AI Vision Vectorize Image API and creates the final dataset that is used in the image search application. |
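
The embeddings step can be pictured roughly as follows. This is a sketch, not the project's script: the route, the `api-version` value, the `modelVersion` parameter, and the `vector` response field are assumptions based on the preview Vectorize Image API and should be checked against the current Azure AI Vision reference. The pacing helper simply restates the free-tier limit of 20 transactions per minute mentioned in the prerequisites.

```python
import json
import urllib.parse
import urllib.request


def build_vectorize_image_request(endpoint, key, image_bytes,
                                  api_version="2023-02-01-preview"):
    """Build (but do not send) a Vectorize Image request for raw image bytes."""
    query = urllib.parse.urlencode(
        {"api-version": api_version, "modelVersion": "latest"}  # assumed parameters
    )
    url = f"{endpoint.rstrip('/')}/computervision/retrieval:vectorizeImage?{query}"
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/octet-stream",
    }
    return urllib.request.Request(url, data=image_bytes, headers=headers, method="POST")


def vectorize_image(request):
    """Send the request and return the embedding (needs network access)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["vector"]  # assumed response field


def min_seconds_between_calls(transactions_per_minute):
    """Smallest delay between calls that stays within a per-minute limit."""
    return 60.0 / transactions_per_minute
```

On the free tier, sleeping `min_seconds_between_calls(20)` seconds between requests keeps a sequential loop within the limit, which is why the standard tier is the more practical choice for roughly 21k images.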
Sample | Description |
---|---|
Upload images to Azure Blob Storage | Creates an Azure Blob Storage container and uploads the paintings' images. |
Insert data to Azure Cosmos DB for PostgreSQL | Creates a table in the Azure Cosmos DB for PostgreSQL cluster and populates it with data from the dataset. |
Insert data to Azure Cosmos DB for PostgreSQL and create IVFFlat index | Creates a table in the Azure Cosmos DB for PostgreSQL cluster, populates it with data from the dataset, and creates an IVFFlat index. |
Insert data to Azure Cosmos DB for PostgreSQL and create HNSW index | Creates a table in the Azure Cosmos DB for PostgreSQL cluster, populates it with data from the dataset, and creates an HNSW index. |
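
The statements behind these three notebooks might look roughly like the SQL built below. The table name, column names, and the 1024-dimension embedding size are assumptions for illustration, as is the choice of cosine distance (`vector_cosine_ops`). The `lists` heuristic (rows / 1000 for up to about a million rows) and the HNSW values `m = 16` and `ef_construction = 64` follow the starting points given in the pgvector README; treat them as defaults to tune.

```python
TABLE = "paintings"   # hypothetical table name
DIMENSIONS = 1024     # assumed size of the Azure AI Vision embeddings


def create_table_sql(table=TABLE, dims=DIMENSIONS):
    """DDL for a table with metadata columns and one pgvector column."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "image_file TEXT PRIMARY KEY, title TEXT, artist TEXT, "
        f"embedding VECTOR({dims}));"
    )


# Parameterized insert in psycopg2 style; the embedding is passed as text.
INSERT_SQL = (
    f"INSERT INTO {TABLE} (image_file, title, artist, embedding) "
    "VALUES (%s, %s, %s, %s::vector);"
)


def to_vector_literal(values):
    """Format a sequence of numbers as a pgvector text literal, e.g. [0.5,1.0]."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"


def suggested_ivfflat_lists(row_count):
    """Starting point for `lists` on tables of up to about one million rows."""
    return max(1, row_count // 1000)


def ivfflat_index_sql(row_count, table=TABLE):
    lists = suggested_ivfflat_lists(row_count)
    return (
        f"CREATE INDEX ON {table} USING ivfflat "
        f"(embedding vector_cosine_ops) WITH (lists = {lists});"
    )


def hnsw_index_sql(table=TABLE, m=16, ef_construction=64):
    return (
        f"CREATE INDEX ON {table} USING hnsw "
        f"(embedding vector_cosine_ops) WITH (m = {m}, ef_construction = {ef_construction});"
    )
```

An IVFFlat index is best created after the table is populated, because its lists are derived from the existing rows; an HNSW index can be created on an empty table but takes longer to build.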
Sample | Description |
---|---|
Exact nearest neighbor search with pgvector | Demonstrates text-to-image and image-to-image search approaches, along with a simple method for metadata filtering. |
Approximate Nearest Neighbor Search with IVFFlat Index | Demonstrates text-to-image and image-to-image search approaches utilizing the IVFFlat index and compares the results with those retrieved through exact search. |
Approximate Nearest Neighbor Search with HNSW Index | Demonstrates text-to-image and image-to-image search approaches utilizing the HNSW index and compares the results with those retrieved through exact search. |
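
The search notebooks can be summarized with the sketch below. It builds a pgvector query ordered by the cosine distance operator `<=>`, with an optional metadata filter, and includes a small pure-Python exact search and a recall measure for comparing approximate results with exact ones. The table and column names are hypothetical, and the use of cosine distance is an assumption about the notebooks.

```python
import math


def build_search_sql(table="paintings", filter_by_artist=False):
    """Nearest neighbor query with psycopg2-style named parameters."""
    where = "WHERE artist = %(artist)s " if filter_by_artist else ""
    return (
        f"SELECT image_file, title, artist FROM {table} "
        f"{where}"
        "ORDER BY embedding <=> %(query)s::vector "
        "LIMIT %(k)s;"
    )


# Session settings that trade speed for recall when an index is used.
SET_IVFFLAT_PROBES_SQL = "SET ivfflat.probes = 10;"
SET_HNSW_EF_SEARCH_SQL = "SET hnsw.ef_search = 40;"


def cosine_distance(a, b):
    """1 - cosine similarity, the quantity computed by the <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, items, k):
    """Brute-force search over (item_id, vector) pairs."""
    ranked = sorted(items, key=lambda pair: cosine_distance(query, pair[1]))
    return [item_id for item_id, _ in ranked[:k]]


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the exact results that the approximate search also returned."""
    if not exact_ids:
        return 1.0
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)
```

The same query serves text-to-image and image-to-image search: only the source of the query vector changes (a text embedding or an image embedding). Raising `ivfflat.probes` or `hnsw.ef_search` generally improves recall at the cost of slower queries.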
Title | Summary |
---|---|
Use the Azure AI Vision multi-modal embeddings API for image retrieval | Explore the basics of vector search and generate vector embeddings for images and text using the Azure AI Vision multi-modal embeddings APIs. |
Generate embeddings with Azure AI Vision multi-modal embeddings API | Discover the art of generating vector embeddings for paintings’ images using the Azure AI Vision multi-modal embeddings APIs in Python. |
Store embeddings in Azure Cosmos DB for PostgreSQL with pgvector | Learn how to configure Azure Cosmos DB for PostgreSQL as a vector database and insert embeddings into a table using the pgvector extension. |
Use pgvector for searching images on Azure Cosmos DB for PostgreSQL | Learn how to write SQL queries to search for and identify images that are semantically similar to a reference image or text prompt using pgvector. |
Use IVFFlat index on Azure Cosmos DB for PostgreSQL for similarity search | Explore vector similarity search using the Inverted File with Flat Compression (IVFFlat) index of pgvector on Azure Cosmos DB for PostgreSQL. |
Use HNSW index on Azure Cosmos DB for PostgreSQL for similarity search | Explore vector similarity search using the Hierarchical Navigable Small World (HNSW) index of pgvector on Azure Cosmos DB for PostgreSQL. |
- Azure AI Vision multi-modal embeddings - Microsoft Docs
- Call the multi-modal embeddings APIs - Microsoft Docs
- How to use pgvector on Azure Cosmos DB for PostgreSQL - Microsoft Docs
- Official GitHub repository of the pgvector extension
- Vector Similarity Search and Faiss Course by James Briggs
Feel free to experiment with the project and modify the code to meet your specific use cases and requirements!