axwhyzee/multi-modal-retrieval

Multi-Modal Retrieval System

This project demonstrates a Multi-Modal Retrieval System in which documents of various modalities (image, text, video, image+text) can be retrieved with natural-language text queries. Possible applications include corporate intranet document lookup, cloud storage search, and enhancing Google search by going beyond simple text matching.

Component: Event Core
Description: Common code for:
  • Domain models
  • Event schemas
  • API clients for:
    • Storage Service
    • Embedding Service
    • Meta Service
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-event-core

Component: Gateway Service
Description:
  • API gateway
  • Coordinate calls to various services to aggregate a response
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-gateway-service

Component: Storage Service
Description: Remote object repository using AWS S3 buckets
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-storage-service

Component: Embedding Service
Description:
  • On ChunkStored events, index the chunk using Pinecone
  • Given a query, fetch most relevant objects in 2-stage retrieval
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-embedding-service

Component: Preprocessor Service
Description: On DocStored events, chunk document, carry out text and image preprocessing on chunks, and generate thumbnails
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-preprocessor-service

Component: Meta Service
Description: Holds mapping of objects to their metadata
GitHub Repo: N/A (using Redis server)

Component: Frontend
Description: GUI
GitHub Repo: https://github.com/axwhyzee/multi-modal-retrieval-frontend

This repo orchestrates the system on a single machine with containerized services. To run a distributed version, each service (each git repo) can run on its own box.


1. Demo

1.1 Sony WH-1000XM4 Help Guide

Use case: Customer support
Dataset:

demo_sony_150_speed.mp4

1.2 Pinecone Python Client Codebase

Use case: Internal technical documentation search
Dataset:

demo_pinecone_150_speed.mp4

2. System

2.1 System Overview

[Figure: system overview diagram]

2.2 Detailed Architecture

The hybrid architecture can be divided into the write and read paths, which are event-driven and request-response based respectively.

2.2.1 Write Path

[Figure: write path diagram]

The write path is event-driven for three reasons: processing bottlenecks such as chunking and embedding can run asynchronously, every step in the write path is idempotent, and eventual consistency is sufficient.
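Idempotency is what makes redelivered events harmless. The sketch below illustrates the idea with an in-memory store; the `Event` shape, `InMemoryIndex`, and `handle` are hypothetical names for illustration and are not taken from the Event Core code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; the real schemas live in Event Core."""
    name: str  # e.g. "DocStored"
    key: str   # storage key of the object the event refers to


class InMemoryIndex:
    """Stand-in for a downstream store, keyed by object key."""

    def __init__(self) -> None:
        self.records: dict[str, str] = {}

    def upsert(self, key: str, value: str) -> None:
        # Writing the same key twice leaves exactly one record,
        # so processing a redelivered event changes nothing.
        self.records[key] = value


def handle(event: Event, index: InMemoryIndex) -> None:
    """Process an event so that repeating it has no extra effect."""
    index.upsert(event.key, f"processed:{event.name}")


index = InMemoryIndex()
event = Event(name="DocStored", key="user1/report.pdf")
handle(event, index)
handle(event, index)  # simulated redelivery of the same event
print(len(index.records))
```

Because the handler overwrites by key instead of appending, a consumer that crashes and replays an event ends up in the same state as one that processed it once.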

2.2.1.1 Write Path: Document Upload

When a document is uploaded to the Gateway Service (API gateway), the Gateway Service stores the document into the Storage Service, which emits a DocStored event.
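As a rough illustration of this step, the sketch below models the upload with in-memory stand-ins. The `DocStored` payload, the `outbox` list that plays the role of the event bus, and the `user/filename` key layout are all assumptions for the example, not the services' actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DocStored:
    """Hypothetical payload; the real event schema is defined in Event Core."""
    key: str


@dataclass
class StorageService:
    """In-memory stand-in for the Storage Service."""
    objects: dict[str, bytes] = field(default_factory=dict)
    outbox: list[DocStored] = field(default_factory=list)  # plays the event bus

    def put_doc(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        # Emit the event only after the object has been stored.
        self.outbox.append(DocStored(key=key))


def upload(storage: StorageService, user: str, filename: str, data: bytes) -> str:
    """Gateway-side handler: store the document and return its key."""
    key = f"{user}/{filename}"  # assumed key layout
    storage.put_doc(key, data)
    return key


storage = StorageService()
doc_key = upload(storage, "user1", "guide.pdf", b"%PDF-")
```

The Gateway Service returns as soon as the document is stored; everything downstream is driven by the emitted event.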

2.2.1.2 Write Path: Pre-processing

[Figure: pre-processing diagram]

DocStored events are received by the Preprocessor Service, which extracts elements such as images, text, plots, and code blocks from the document. It also generates assets such as document thumbnails and element thumbnails. All of these objects are stored in the Storage Service, and metadata is inserted into the Meta Service where applicable. Each element object inserted into the Storage Service triggers an ElementStored event.

2.2.1.3 Write Path: Indexing

[Figure: indexing diagram]

ElementStored events are received by the Embedding Service, which embeds each element using the embedding model for its element type. The embeddings are indexed in the Pinecone vector database under the corresponding {ELEMENT_TYPE}/{USER} namespace; these namespaces are what allow multi-tenancy.
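The namespace scheme can be sketched as follows. This is a minimal illustration with an in-memory dictionary standing in for the vector database; `namespace_for`, `index_element`, and the element type names `TEXT` and `IMAGE` are hypothetical and may not match the Embedding Service's own identifiers.

```python
from collections import defaultdict


def namespace_for(element_type: str, user: str) -> str:
    """Build the {ELEMENT_TYPE}/{USER} namespace string."""
    if not element_type or not user:
        raise ValueError("element_type and user must be non-empty")
    if "/" in element_type:
        raise ValueError("element_type must not contain '/'")
    return f"{element_type}/{user}"


# In-memory stand-in for the vector database: namespace -> {element key -> vector}
vector_db: dict[str, dict[str, list[float]]] = defaultdict(dict)


def index_element(element_type: str, user: str, key: str, vector: list[float]) -> None:
    """Store a vector under the namespace for its element type and user."""
    vector_db[namespace_for(element_type, user)][key] = vector


index_element("TEXT", "user1", "guide.pdf/elem-1", [0.1, 0.9])
index_element("IMAGE", "user1", "guide.pdf/elem-2", [0.7, 0.3])
index_element("TEXT", "user2", "notes.pdf/elem-1", [0.4, 0.6])
```

Keeping one namespace per element type and user means a query only ever searches vectors produced by the matching embedding model and owned by the requesting user.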

2.2.2 Read Path

[Figure: read path diagram]

The read path is synchronous because the user is waiting for the result and the response must come back as quickly as possible. It therefore follows a simple, traditional request-response design.

2.2.2.1 Read Path: Query

The user sends a text query to the Gateway Service, which forwards the request to the Embedding Service.

2.2.2.2 Read Path: Retrieval

[Figure: retrieval diagram]

For each element type, the Embedding Service embeds the text query using the corresponding embedding model. The query embedding is used to search the namespace for that element type and user, fetching the top-k elements most similar to the query. These top-k elements are then reranked by the element-specific reranker, and only the top-n ranked elements are returned in the response.

Note: top_k = top_n * MULTIPLIER, where MULTIPLIER is an int > 1
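The two stages can be sketched for a single element type as below. This is an illustration under stated assumptions, not the Embedding Service's implementation: cosine similarity over an in-memory dictionary stands in for the vector database query, `rerank_score` is a hypothetical stand-in for an element-specific reranker, and `MULTIPLIER = 3` is an arbitrary choice.

```python
import math
from typing import Callable

MULTIPLIER = 3  # assumed value; the note above only requires an int > 1


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    index: dict[str, list[float]],
    rerank_score: Callable[[str], float],
    top_n: int,
) -> list[str]:
    top_k = top_n * MULTIPLIER
    # Stage 1: vector similarity narrows the namespace to top_k candidates.
    candidates = sorted(
        index, key=lambda key: cosine(query_vec, index[key]), reverse=True
    )[:top_k]
    # Stage 2: the reranker orders only those candidates; keep the best top_n.
    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]


index = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}
toy_scores = {"a": 0.2, "b": 0.9, "c": 0.5, "d": 1.0}
result = retrieve([1.0, 0.0], index, toy_scores.__getitem__, top_n=1)
print(result)
```

In this toy run, "d" has the highest reranker score but never reaches the reranker because it falls outside the top-k of the first stage. A larger MULTIPLIER gives the reranker more candidates to recover from first-stage misses, at the cost of more reranking work per query.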

2.2.2.3 Read Path: Transform & Aggregate

On receiving the results from the Embedding Service, the Gateway Service transforms the response and fetches the corresponding asset and element metadata from the Meta Service.
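A minimal sketch of the aggregation step follows. The field names (`key`, `score`, `meta`) and the dictionary standing in for the Meta Service are assumptions for illustration; the actual response format is defined by the Gateway Service.

```python
def aggregate(hits: list[dict], meta_store: dict[str, dict]) -> list[dict]:
    """Attach metadata to each retrieval hit, preserving the ranked order."""
    results = []
    for hit in hits:
        key = hit["key"]
        results.append(
            {
                "key": key,
                "score": hit["score"],
                # Elements without metadata get an empty mapping.
                "meta": meta_store.get(key, {}),
            }
        )
    return results


hits = [
    {"key": "guide.pdf/elem-3", "score": 0.91},
    {"key": "guide.pdf/elem-7", "score": 0.85},
]
meta_store = {"guide.pdf/elem-3": {"doc": "guide.pdf", "page": 4}}
response = aggregate(hits, meta_store)
```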


Setup

  1. Create a .env file with the following env vars:
AWS_S3_BUCKET_ACCESS_KEY=...
AWS_S3_BUCKET_NAME=...
AWS_S3_BUCKET_REGION=...
AWS_S3_BUCKET_SECRET_ACCESS_KEY=...
EMBEDDING_SERVICE_API_URL=http://embedding_service_api-1:5000/  # use generated name of docker container
ENV=DEV                                                         # use local file system instead of S3 for object storage
PINECONE_API_KEY=...
REACT_APP_API_URL=http://localhost:5001                         # URL to Gateway Service, has port forwarding 5001:5000 by default (configure in `docker-compose.yml`)
REACT_APP_USER=...
REDIS_HOST=...
REDIS_PASSWORD=...
REDIS_PORT=...
REDIS_USERNAME=...
STORAGE_SERVICE_API_URL=http://storage_service_api-1:5000/      # use generated name of docker container
  2. Install Docker
  3. Increase the Docker memory limit to at least 12GB
  4. Run source run.sh to clone the service repos and build and/or start the Docker containers
  5. Insert dummy data by running python insert.py
  6. Go to http://localhost:3000 to access the web-based GUI
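Before starting the containers, it can help to check that the .env file defines every variable listed in the first step. The script below is a hypothetical helper, not part of the repo. It assumes all of the listed variables are required (which may not hold when ENV=DEV uses the local file system) and uses a deliberately simple parser that treats everything after a # as a comment.

```python
from pathlib import Path

# Variable names copied from the .env listing in the Setup steps.
REQUIRED_VARS = [
    "AWS_S3_BUCKET_ACCESS_KEY",
    "AWS_S3_BUCKET_NAME",
    "AWS_S3_BUCKET_REGION",
    "AWS_S3_BUCKET_SECRET_ACCESS_KEY",
    "EMBEDDING_SERVICE_API_URL",
    "ENV",
    "PINECONE_API_KEY",
    "REACT_APP_API_URL",
    "REACT_APP_USER",
    "REDIS_HOST",
    "REDIS_PASSWORD",
    "REDIS_PORT",
    "REDIS_USERNAME",
    "STORAGE_SERVICE_API_URL",
]


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines; values containing '#' are not supported."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop inline comments
        if "=" in line:
            name, value = line.split("=", 1)
            values[name.strip()] = value.strip()
    return values


def missing_vars(env: dict[str, str]) -> list[str]:
    """Return the required variables that are absent or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    path = Path(".env")
    text = path.read_text() if path.exists() else ""
    print("Missing variables:", missing_vars(parse_env(text)))
```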

Workers

To scale up a particular service such as embedding_service_event_consumer, change the docker command in run.sh as shown below. This example runs three instances of the consumer:

docker-compose up -d --scale embedding_service_event_consumer=3

Future Work

The system does not require documents of all modalities to live in the same embedding space. New modalities can therefore be introduced as long as a dual-modal text-<NEW MODAL> model exists. For instance, the audio modality can be added given a suitable text-audio embedding model and reranker. This also means that custom document formats, such as proprietary ones, can make use of this search system.
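One way to picture this extensibility is a registry that maps each element type to its own text-query embedder and reranker. The sketch below is purely illustrative: `Modality`, `REGISTRY`, `register`, and the toy audio functions are hypothetical, and a real modality would plug in actual models in their place.

```python
from dataclasses import dataclass
from typing import Callable

Vector = list[float]


@dataclass(frozen=True)
class Modality:
    """The two components a new modality needs, per the paragraph above."""
    embed_text: Callable[[str], Vector]   # maps a text query into this modality's space
    rerank: Callable[[str, str], float]   # scores (query, element key) pairs


REGISTRY: dict[str, Modality] = {}


def register(element_type: str, modality: Modality) -> None:
    """Add a modality; each element type keeps its own embedding space."""
    if element_type in REGISTRY:
        raise ValueError(f"{element_type} is already registered")
    REGISTRY[element_type] = modality


# Toy stand-ins for a text-audio embedding model and reranker.
def toy_audio_embed(text: str) -> Vector:
    return [float(len(text)), 1.0]


def toy_audio_rerank(query: str, element_key: str) -> float:
    return 1.0 if query.lower() in element_key.lower() else 0.0


register("AUDIO", Modality(embed_text=toy_audio_embed, rerank=toy_audio_rerank))
```

Since each element type is queried in its own namespace with its own model, registering a new modality does not require re-embedding existing documents.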
