A multimodal medical chatbot that takes an audio recording and an image as input, processes them using AI, and provides an audio-based medical response.
-
Voice-Based Interaction -- Users can ask medical queries through voice recordings.
-
Image Analysis -- Upload medical images for AI-based analysis.
-
AI-Powered Medical Insights -- Provides a concise, professional medical response as if from a real doctor.
-
Text-to-Speech Output -- Generates an audio response using ElevenLabs or Google Text-to-Speech.
-
Seamless User Experience -- Integrated with a Gradio-based UI for easy interaction.
| Component | Technology |
|---|---|
| AI Inference | Groq API |
| Vision Model | LLaMA-3 90B Vision Preview |
| Speech Recognition | Whisper Large v3 |
| Text-to-Speech | ElevenLabs OR gTTS |
| Web Interface | Gradio |
| Audio Processing | PyAudio, pydub |
git clone https://github.com/Raza-Aziz/Multimodal-Medical-Chatbot-using-Groq.git
cd Multimodal-Medical-Chatbot-using-Groqconda create --name medibot python=3.11
conda activate medibotAll required dependencies are listed in requirements.txt. Install them with:
pip install -r requirements.txtCreate a .env file in the root directory and add your Groq and ElevenLabs credentials:
GROQ_API_KEY="your_groq_api_key"
ELEVENLABS_API_KEY="your_elevenlabs_api_key"python gradio_app.py
Open your web browser and go to:
http://localhost:7860
-
Record a voice query or upload an audio file.
-
Upload a medical image.
-
The AI analyzes both inputs and generates a professional medical response.
-
The response is provided as both text and synthesized speech.
1️⃣ Audio Processing:
-
The patient's voice is recorded via Gradio or uploaded as an MP3 file.
-
The speech-to-text model (Whisper v3 via Groq) transcribes the query.
2️⃣ Image Analysis:
-
The uploaded medical image is encoded in base64 format.
-
It is analyzed using Groq's multimodal model (LLaMA 3.2 Vision).
3️⃣ AI-Generated Medical Response:
-
The transcribed text and image data are passed to the LLaMA model.
-
The AI provides a medical assessment and suggested remedies.
4️⃣ Text-to-Speech Conversion:
-
The AI-generated response is converted into speech using ElevenLabs.
-
The synthesized audio is played back to the user.
| Feature | Screenshot |
|---|---|
| Chat Interface | ![]() |
| Image Analysis Example | ![]() |
This project is licensed under the MIT License -- see the LICENSE file for details.
For any questions or contributions, feel free to reach out:
GitHub: @Raza-Aziz
Email: razaaziz9191@gmail.com

