Vision2Audio - Giving the blind an understanding through AI. Uses LLaVA, served through the llama.cpp server, to describe images, and the NVIDIA Riva Speech AI SDK for speech input and output.

shahizat/Vision2Audio

Vision2Audio - Giving the blind an understanding through AI

More details can be found here - Hackster.io

Vision2Audio is a web application designed to enhance the lives of visually impaired and blind individuals by enabling them to capture images, ask questions about them, and receive spoken answers using cutting-edge AI technologies.

The application uses NVIDIA Riva Automatic Speech Recognition (ASR) to convert spoken questions into text. The text, together with the captured image, is passed to the LLaVA (Large Language-and-Vision Assistant) multimodal model, served through the llama.cpp server, which generates a description of the image that answers the question. Finally, NVIDIA Riva Text-to-Speech (TTS) converts the generated text into spoken audio, delivering the answer to the user in an accessible format.
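The three stages described above can be sketched as a single function. This is only an illustration of the data flow, not the project's actual code: `answer_question` and the three callables passed into it are hypothetical names, and in the real application they would wrap Riva ASR, the llama.cpp LLaVA server, and Riva TTS respectively.

```python
from typing import Callable


def answer_question(
    question_audio: bytes,
    image: bytes,
    transcribe: Callable[[bytes], str],
    describe: Callable[[bytes, str], str],
    synthesize: Callable[[str], bytes],
) -> bytes:
    """Run one spoken question through the ASR -> LLaVA -> TTS chain.

    All three callables are hypothetical stand-ins supplied by the caller.
    """
    question = transcribe(question_audio)  # speech -> text (Riva ASR in the app)
    answer = describe(image, question)     # image + question -> text (LLaVA)
    return synthesize(answer)              # text -> speech (Riva TTS in the app)


if __name__ == "__main__":
    # Stub stages so the flow can be exercised without any server running.
    spoken = answer_question(
        question_audio=b"<audio bytes>",
        image=b"<image bytes>",
        transcribe=lambda audio: "What is in front of me?",
        describe=lambda img, q: f"Answer to: {q}",
        synthesize=lambda text: text.encode("utf-8"),
    )
    print(spoken)
```

Passing the stages in as callables keeps the sketch independent of any particular client library and makes each stage easy to replace or test in isolation.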

Usage

For simplicity, this section assumes that everything is already installed. Start the NVIDIA Riva server by running:

bash riva_start.sh

Once the Riva server is up and running, open another terminal and start the LLaVA server via llama.cpp:

./bin/server -m models/llava1.5-13B/ggml-model-q4_k.gguf \
   --mmproj models/llava1.5-13B/mmproj-model-f16.gguf \
   --port 8080 \
   -ngl 35
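Once the server is listening on port 8080, it can be queried over HTTP. The sketch below assumes the `/completion` endpoint and the `image_data` / `[img-N]` prompt convention documented for the LLaVA-era llama.cpp server; newer llama.cpp builds may expose a different interface, and whether Vision2Audio itself calls the server this way is an assumption. The helper names `build_payload` and `ask_llava` are hypothetical.

```python
import base64
import json
import urllib.request

IMAGE_ID = 12  # arbitrary id tying the prompt placeholder to the image entry


def build_payload(image_bytes: bytes, question: str, n_predict: int = 256) -> dict:
    """Build a /completion request body carrying one base64-encoded image."""
    # [img-12] marks where the server substitutes the image embeddings.
    prompt = f"USER:[img-{IMAGE_ID}]{question}\nASSISTANT:"
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "image_data": [
            {"data": base64.b64encode(image_bytes).decode("ascii"), "id": IMAGE_ID}
        ],
    }


def ask_llava(
    image_bytes: bytes,
    question: str,
    url: str = "http://127.0.0.1:8080/completion",  # assumed endpoint
    timeout: float = 120.0,
) -> str:
    """POST the payload to the server and return the generated text."""
    body = json.dumps(build_payload(image_bytes, question)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["content"]
```

A call such as `ask_llava(open("photo.jpg", "rb").read(), "What is in this picture?")` would then return the model's answer as a string, where `photo.jpg` is a placeholder file name.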

You can download the models from here. Keep the server running, then open another terminal and start the Flask application:

python3 -m flask run --host=0.0.0.0 --debug

Open another terminal and start a cloudflared tunnel that exposes the Flask application (port 5000 by default):

cloudflared tunnel --url http://127.0.0.1:5000
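With several services running in separate terminals, a quick local check that each one accepts connections can save time before debugging through the tunnel. The sketch below is not part of the project. Ports 8080 and 5000 come from the commands above, while 50051 is assumed to be Riva's default gRPC port and may differ in your Riva configuration.

```python
import socket

# Service name -> (host, port). The Riva port is an assumption (default gRPC port).
SERVICES = {
    "Riva (gRPC)": ("127.0.0.1", 50051),
    "llama.cpp server": ("127.0.0.1", 8080),
    "Flask app": ("127.0.0.1", 5000),
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict = SERVICES) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, is_up in report().items():
        print(f"{name}: {'up' if is_up else 'DOWN'}")
```

An open port only shows that something is listening; it does not confirm that the models have finished loading.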

Acknowledgements

The implementation of the project relies on:

I thank the original authors for open-sourcing their work.
