This project is a demonstration of a Multi-Modal Retrieval System, where documents of various modalities (image, text, video, image+text) can be retrieved using natural-language text queries. It can be used for corporate intranet document lookup, cloud storage search, or even enhancing Google search by going beyond simple text matching.
| Component | Description | GitHub Repo |
|---|---|---|
| Event Core | Common code shared across all services | https://github.com/axwhyzee/multi-modal-retrieval-event-core |
| Gateway Service | API gateway; receives document uploads and text queries and routes them to the backend services | https://github.com/axwhyzee/multi-modal-retrieval-gateway-service |
| Storage Service | Remote object repository using AWS S3 buckets | https://github.com/axwhyzee/multi-modal-retrieval-storage-service |
| Embedding Service | On ElementStored events, embeds elements and indexes them in Pinecone; on queries, embeds the query text and reranks results | https://github.com/axwhyzee/multi-modal-retrieval-embedding-service |
| Preprocessor Service | On DocStored events, chunks documents, carries out text and image preprocessing on chunks, and generates thumbnails | https://github.com/axwhyzee/multi-modal-retrieval-preprocessor-service |
| Meta Service | Holds mappings of objects to their metadata | N/A (uses a Redis server) |
| Frontend | GUI | https://github.com/axwhyzee/multi-modal-retrieval-frontend |
This repo orchestrates the system on a single machine with containerized services. To run a distributed version, each service (each git repo) can run on its own box.
Use case: Customer support
Dataset:
- PDFs scraped from the Sony WH-1000XM4 Help Guide
- Videos from the Sony Europe YouTube channel
Demo video: `demo_sony_150_speed.mp4`
Use case: Internal technical documentation search
Dataset:
Demo video: `demo_pinecone_150_speed.mp4`
The hybrid architecture can be divided into the write path and the read path, which are event-driven and request-response based, respectively.
The write path is designed to be event-driven because processing bottlenecks like chunking and embedding can run asynchronously; all steps within the write path are idempotent; and eventual consistency is sufficient.
When a document is uploaded to the Gateway Service (the API gateway), the Gateway Service stores it in the Storage Service, which emits a DocStored event.
DocStored events are received by the Preprocessor Service, which extracts elements such as images, text, plots, and code blocks from the document. Assets like document thumbnails and element thumbnails are also generated. All these objects are stored in the Storage Service, and metadata is inserted into the Meta Service where applicable. When element objects are inserted into the Storage Service, ElementStored events are emitted.
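To make this flow concrete, here is a minimal, self-contained sketch of the write path's event handling, with in-memory dicts standing in for the Storage and Meta Services. All names (event fields, chunking logic, key scheme) are illustrative assumptions, not the actual repo API:

```python
# Self-contained sketch of the write path's event flow; every name here is
# an illustrative assumption, with in-memory dicts standing in for services.
from dataclasses import dataclass


@dataclass
class DocStored:
    doc_key: str


@dataclass
class ElementStored:
    element_key: str
    element_type: str  # e.g. "TEXT", "IMAGE", "CODE"


storage: dict[str, bytes] = {}  # stand-in for the S3-backed Storage Service
meta: dict[str, dict] = {}      # stand-in for the Redis-backed Meta Service
event_bus: list = []            # stand-in for the message broker


def handle_doc_stored(event: DocStored) -> None:
    doc = storage[event.doc_key]
    # Chunk the document into typed elements (real extraction is format-specific).
    chunks = [(f"{event.doc_key}/chunk-{i}", "TEXT", part)
              for i, part in enumerate(doc.split(b"\n\n"))]
    storage[f"{event.doc_key}/thumbnail"] = b"<thumbnail bytes>"
    for key, etype, data in chunks:
        storage[key] = data
        meta[key] = {"parent": event.doc_key, "type": etype}
        event_bus.append(ElementStored(key, etype))  # consumed by the Embedding Service


storage["user1/guide.pdf"] = b"page one\n\npage two"
handle_doc_stored(DocStored("user1/guide.pdf"))
print([e.element_key for e in event_bus])  # ['user1/guide.pdf/chunk-0', 'user1/guide.pdf/chunk-1']
```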
ElementStored events are received by the Embedding Service, which embeds the elements using the corresponding embedding models. The embeddings are indexed in the corresponding {ELEMENT_TYPE}/{USER} namespace in Pinecone, a vector database that supports multi-tenancy.
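For reference, the namespaced indexing step might look like the following sketch. The upsert call follows the official Pinecone Python client; the index name and embedding stub are assumptions:

```python
# A minimal sketch of namespaced indexing with the Pinecone Python client.
# The index name and the embedding stub are hypothetical.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("multi-modal-retrieval")  # hypothetical index name


def embed(element_type: str, data: bytes) -> list[float]:
    # Stand-in for the element-type-specific embedding model
    # (e.g. an image encoder for IMAGE, a text encoder for TEXT).
    return [0.0] * 512


def handle_element_stored(element_key: str, element_type: str, user: str, data: bytes) -> None:
    index.upsert(
        vectors=[{"id": element_key, "values": embed(element_type, data)}],
        namespace=f"{element_type}/{user}",  # one namespace per element type per user
    )
```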
The read path is synchronous because it must respond to the user as quickly as possible. Hence, it follows a simple, traditional request-response design.
The user sends a text query to the Gateway Service, which forwards the request to the Embedding Service.
For each element type, the Embedding Service embeds the text query using the corresponding embedding model. The text embedding is used to query the namespace corresponding to the element type and user, fetching the top-k elements most similar to the query. These top-k elements are reranked by element-specific rerankers, and only the top-n ranked elements are returned in the response.
Note: `top_k = top_n * MULTIPLIER`, where `MULTIPLIER` is an integer > 1
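Combining the note above with the query flow, a minimal sketch could look like this. The query call follows the official Pinecone Python client; the model stubs, index name, and `MULTIPLIER` value are assumptions:

```python
# Sketch of the per-element-type query flow: over-fetch top_k candidates,
# rerank them, and keep the top_n. Stubs and constants are illustrative.
from pinecone import Pinecone

MULTIPLIER = 4  # any integer > 1


def embed_text(element_type: str, query: str) -> list[float]:
    return [0.0] * 512  # stand-in for the type-specific text encoder


def rerank_score(query: str, match) -> float:
    return 0.0  # stand-in for the element-specific reranker


def search(query: str, user: str, element_type: str, top_n: int) -> list:
    index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("multi-modal-retrieval")
    res = index.query(
        vector=embed_text(element_type, query),
        top_k=top_n * MULTIPLIER,            # over-fetch candidates ...
        namespace=f"{element_type}/{user}",
    )
    ranked = sorted(res.matches, key=lambda m: rerank_score(query, m), reverse=True)
    return ranked[:top_n]                    # ... then keep only the reranked top_n
```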
On receiving the results from the Embedding Service, the Gateway Service transforms the response and fetches the corresponding asset and element metadata from the Meta Service.
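Since the Meta Service is a plain Redis server (see the component table), this lookup might amount to something like the sketch below; the key scheme and hash fields are illustrative assumptions:

```python
# Sketch of the Gateway's metadata lookup against the Redis-backed Meta
# Service. The key scheme and hash fields are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def attach_meta(results: list[dict]) -> list[dict]:
    for result in results:
        # e.g. thumbnail key, parent document, page number
        result["meta"] = r.hgetall(result["element_key"])
    return results
```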
- Create a `.env` file with the following env vars:

  ```
  AWS_S3_BUCKET_ACCESS_KEY=...
  AWS_S3_BUCKET_NAME=...
  AWS_S3_BUCKET_REGION=...
  AWS_S3_BUCKET_SECRET_ACCESS_KEY=...
  EMBEDDING_SERVICE_API_URL=http://embedding_service_api-1:5000/ # use generated name of docker container
  ENV=DEV # use local file system instead of S3 for object storage
  PINECONE_API_KEY=...
  REACT_APP_API_URL=http://localhost:5001 # URL to Gateway Service, has port forwarding 5001:5000 by default (configure in `docker-compose.yml`)
  REACT_APP_USER=...
  REDIS_HOST=...
  REDIS_PASSWORD=...
  REDIS_PORT=...
  REDIS_USERNAME=...
  STORAGE_SERVICE_API_URL=http://storage_service_api-1:5000/ # use generated name of docker container
  ```
- Install Docker
- Increase the Docker memory limit to at least 12GB
- Run `source run.sh` to clone the services and build and/or start the Docker containers
- Insert dummy data by running `python insert.py` (see the sketch below)
- Go to http://localhost:3000 to access the web-based GUI
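For reference, inserting a document through the Gateway Service might look like the following client-side sketch; the endpoint path and form fields are assumptions, not the actual API:

```python
# Hypothetical illustration of uploading a document to the Gateway Service.
# The endpoint path and form fields are assumptions, not the actual API.
import requests

with open("manual.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:5001/upload",  # Gateway Service, per REACT_APP_API_URL
        files={"file": ("manual.pdf", f)},
        data={"user": "demo-user"},
    )
print(resp.status_code)
```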
To scale up a particular service like `embedding_service_event_consumer`, change the docker command in `run.sh` as shown:

```
docker-compose up -d --scale embedding_service_event_consumer=3
```
This system is designed such that documents of all modalities are not required to live in the same embedding space. This means that new modalities can be introduced as long as there exists a dual-modal text-<NEW MODAL> model. For instance, the audio modality can be introduced as long as there exists a suitable text-audio embedding model and reranker. This also means that custom document formats, including proprietary ones, can make use of this search system as well.
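As an illustration, supporting a new modality could be as simple as registering an (embedder, reranker) pair for the new element type; all names in this sketch are hypothetical:

```python
# Hypothetical registry mapping element types to (embedder, reranker) pairs.
from typing import Callable

Embedder = Callable[[bytes], list[float]]  # raw element -> embedding vector
Reranker = Callable[[str, bytes], float]   # (text query, raw element) -> score

MODALITIES: dict[str, tuple[Embedder, Reranker]] = {}


def register_modality(element_type: str, embedder: Embedder, reranker: Reranker) -> None:
    MODALITIES[element_type] = (embedder, reranker)


# Adding audio only requires a dual-modal text-audio embedding model and a reranker:
register_modality(
    "AUDIO",
    embedder=lambda raw: [0.0] * 512,  # e.g. a CLAP-style text-audio encoder
    reranker=lambda query, raw: 0.0,   # e.g. a cross-modal reranker
)
```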