Generate short stories from images and transform them into lifelike audio using Parler TTS, Groq AI, and Hugging Face models, all within an interactive Streamlit app.
- Image-to-Text Interpretation: Convert images to textual descriptions with the
Salesforce/blip-image-captioning-base
model. - Story Generation: Generate short, creative stories (less than 20 words) based on the interpreted text using the
llama3-8b-8192
model from Groq. - Text-to-Speech: Convert generated stories into lifelike audio using the locally-run
parler-tts/parler-tts-large-v1
model, leveraging your GPU/CPU for fast processing. - Streamlit Integration: Interactive UI for uploading images, generating stories, and downloading audio.
- Error Handling: Automatically switches to CPU if GPU isn't available and handles model loading errors.
- Python 3.8+
- Hugging Face Transformers library
- Parler TTS for conditional generation
- Groq API for story generation
- SoundFile for audio output
- Streamlit for the UI
-
Clone the repository:
git clone https://github.com/rk-vashista/TTS-Story_Generator cd TTS-Story_Generator
-
Install dependencies:
pip install -r requirements.txt
-
Set up environment variables:
- Create a
.env
file:touch .env
- Add your API keys to
.env
:HUGGING_FACE=<your_hugging_face_api_key> GROQ_API=<your_groq_api_key>
- Create a
-
Run the app:
streamlit run app.py
-
Upload an image: Upload a
.jpg
,.png
, or.jpeg
image to generate a scenario description. -
Generate a short story: The app will automatically interpret the uploaded image and generate a short story using Groq's LLaMA model.
-
Convert the story into audio: The app will convert the story into a
.wav
audio file using Parler TTS, which you can listen to and download.
- Upload Image ➡️ Text Description
- Text Description ➡️ Short Story
- Short Story ➡️ Lifelike Audio
The final audio will be saved as parler_tts_out.wav
, available for download directly through the app.
Create a .env
file in the project root to store your API keys:
HUGGING_FACE=<your_hugging_face_api_key>
GROQ_API=<your_groq_api_key>
- Hugging Face API: Used for image captioning with the
Salesforce/blip-image-captioning-base
model. - Groq API: Powers the story generation using
llama3-8b-8192
. - Parler TTS: Converts text into expressive speech using the
parler-tts-large-v1
model, running on local resources.
- Custom Voices: Experiment with different Parler TTS voice settings or modify the description to adjust the voice's style.
- Longer Stories: Adapt the story generation prompt to produce more detailed narratives.
- Advanced Image Analysis: Try other image-to-text models for deeper interpretation of the uploaded images.
- CUDA Errors: If CUDA isn't available, the program will switch to CPU automatically.
- Environment Variables: Ensure the
.env
file contains correct API keys for Hugging Face and Groq.
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face for providing powerful models.
- Groq for real-time AI model execution.
- Parler TTS for enabling high-quality text-to-speech synthesis.
- Streamlit for providing a seamless UI for deployment.