The open-source playground for generative AI. Rapidly prototype and deploy multi-modal applications—from text and image classification to speech synthesis—using Python, Gradio, and Transformers.
- 🚀 Gradio Interface: Fast, modern web UI for AI-powered apps, with auto-generated API endpoints (see the sketch after this list)
- 🤗 Hugging Face Integration: Uses the state-of-the-art BLIP model for image captioning
- 📦 Batch Processing: Support for processing multiple images at once
- 🔥 GPU Support: Automatic GPU acceleration when available
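To give a feel for how these pieces fit together, here is a minimal sketch of a Gradio app wrapping a Transformers pipeline. It is illustrative only, not Genr-Kit's actual code; the function name and labels are placeholders.

```python
import gradio as gr
from transformers import pipeline

# Hypothetical example: a sentiment classifier behind a Gradio UI.
classifier = pipeline("sentiment-analysis")

def classify(text: str) -> str:
    result = classifier(text)[0]
    return f"{result['label']} ({result['score']:.2f})"

# gr.Interface generates both the web UI and API endpoints for the function.
gr.Interface(fn=classify, inputs="text", outputs="text").launch()
```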
Genr-Kit provides a comprehensive suite of tools for common generative AI tasks. Each tool leverages state-of-the-art, publicly available models from the Hugging Face Hub.
Categorize text into predefined labels like sentiment or topic.
- Model: `distilbert-base-uncased-finetuned-sst-2-english`. A fast and accurate model fine-tuned for sentiment analysis (positive/negative), perfect for real-time applications.
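For example, the model loads with a standard Transformers pipeline (a minimal sketch; the input sentence is illustrative):

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)
print(classifier("This toolkit is a joy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```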
Generate coherent and contextually relevant text from a prompt.
- Model: `gpt2`. A pioneering transformer model capable of generating creative text continuations in various styles.
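A minimal generation sketch (the prompt and output length are arbitrary choices):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Once upon a time", max_new_tokens=40)[0]["generated_text"])
```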
Analyze text to determine its emotional tone (e.g., positive, negative, neutral).
- Model: `cardiffnlp/twitter-roberta-base-sentiment-latest`. A robust model trained on a large corpus of tweets, excellent for modern, informal language.
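Usage mirrors the text-classification example above (the tweet-style input is made up):

```python
from transformers import pipeline

sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)
print(sentiment("just tried genr-kit and it's actually great"))
```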
Condense long articles, reports, or documents into concise summaries.
- Model: `facebook/bart-large-cnn`. The BART model fine-tuned on the CNN/DailyMail dataset, making it excellent for abstractive summarization.
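A short sketch, assuming a long document in a local `article.txt` (a placeholder path):

```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
text = open("article.txt").read()  # any long article or report
print(summarizer(text, max_length=130, min_length=30)[0]["summary_text"])
```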
Identify and extract entities like names, organizations, and locations from text.
- Model: `dslim/bert-base-NER`. A BERT model specifically fine-tuned to recognize common named entities with high accuracy.
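For example (a sketch; `aggregation_strategy="simple"` merges word pieces back into whole entities):

```python
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Ada Lovelace worked with Charles Babbage in London."))
# e.g. entities tagged PER and LOC, each with a score and character span
```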
Convert natural language instructions into structured commands or API calls.
- Model: `microsoft/DialoGPT-medium`. While often used for chat, its fine-tuning capabilities make it a good base for learning instruction-to-command tasks.
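Because the base model is a plain conversational generator, one possible approach is few-shot prompting (a rough sketch; the prompt format is invented here, and a model fine-tuned on instruction/command pairs would be far more reliable):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
prompt = (
    "Instruction: list all files\nCommand: ls -la\n"
    "Instruction: show disk usage\nCommand:"
)
print(generator(prompt, max_new_tokens=10)[0]["generated_text"])
```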
Identify the main subject or scene within an image.
- Model: `google/vit-base-patch16-224`. A Vision Transformer (ViT) model that achieves excellent accuracy on the ImageNet-1k benchmark.
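A sketch, assuming a local `photo.jpg` (any image path or URL works):

```python
from transformers import pipeline

classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
print(classifier("photo.jpg")[:3])  # top-3 ImageNet labels with scores
```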
Locate and identify multiple objects within an image using bounding boxes.
- Model: `hustvl/yolos-tiny`. A YOLOS ("You Only Look at One Sequence") transformer model that provides a great balance of speed and accuracy for real-time detection.
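For example (the image path is a placeholder):

```python
from transformers import pipeline

detector = pipeline("object-detection", model="hustvl/yolos-tiny")
for obj in detector("street.jpg"):
    print(obj["label"], round(obj["score"], 2), obj["box"])  # box in pixel coords
```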
Generate novel images from a text description.
- Model: `runwayml/stable-diffusion-v1-5`. A powerful latent diffusion model for creating high-quality, detailed images from any text prompt.
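Note that this model loads through the `diffusers` library rather than a Transformers pipeline. A minimal sketch, assuming a CUDA GPU (the prompt and file name are arbitrary):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("a watercolor fox in a snowy forest").images[0]
image.save("fox.png")
```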
Generate a descriptive English-language caption for a given image.
- Model: `Salesforce/blip-image-captioning-base`. The BLIP model provides high-quality, context-aware captions, ideal for accessibility and content description.
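A sketch (the image path is a placeholder):

```python
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
print(captioner("photo.jpg")[0]["generated_text"])
```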
Transform an input image based on a text prompt (e.g., style transfer, enhancement).
- Model: `timbrooks/instruct-pix2pix`. A model specifically fine-tuned for following instructions to edit images, like "make it a cartoon".
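Like text-to-image, this goes through `diffusers`. A sketch, assuming a CUDA GPU and a local `photo.jpg`:

```python
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
image = Image.open("photo.jpg").convert("RGB")
pipe("make it a cartoon", image=image).images[0].save("photo_cartoon.png")
```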
Identify and map specific objects or regions in an image at the pixel level.
- Model: `facebook/detr-resnet-50-panoptic`. A transformer-based model that performs both instance and panoptic segmentation in a single architecture.
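For example (the image path is a placeholder; each result also carries a pixel mask):

```python
from transformers import pipeline

segmenter = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic")
for segment in segmenter("scene.jpg"):
    print(segment["label"], segment["score"])  # segment["mask"] is a PIL image
```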
Convert written text into natural-sounding spoken audio.
- Model: `espnet/kan-bayashi_ljspeech_vits`. A VITS-based model that produces highly natural and expressive speech in English.
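This model is served through ESPnet rather than Transformers. A sketch, assuming the `espnet` and `espnet_model_zoo` packages plus `soundfile` are installed:

```python
import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

tts = Text2Speech.from_pretrained("espnet/kan-bayashi_ljspeech_vits")
result = tts("Hello from Genr-Kit!")
sf.write("hello.wav", result["wav"].numpy(), tts.fs)  # tts.fs is the sample rate
```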
Transcribe spoken audio from various languages into written text.
- Model: `openai/whisper-base`. OpenAI's Whisper model offers robust, accurate transcription and translation across multiple languages.
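A sketch (the audio file is a placeholder; decoding audio files requires ffmpeg):

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
print(asr("meeting.wav")["text"])
```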
Remove background noise and improve the clarity of an audio recording.
- Model: `speechbrain/mtl-mimic-voicebank`. A model trained specifically for speech enhancement and denoising tasks.
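This one loads through the SpeechBrain library. A sketch based on the model card's documented usage (file paths are placeholders):

```python
import torchaudio
from speechbrain.pretrained import WaveformEnhancement

enhancer = WaveformEnhancement.from_hparams(
    source="speechbrain/mtl-mimic-voicebank",
    savedir="pretrained_models/mtl-mimic-voicebank",
)
enhanced = enhancer.enhance_file("noisy.wav")
torchaudio.save("clean.wav", enhanced.unsqueeze(0).cpu(), 16000)
```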
Generate short musical audio clips from text descriptions.
- Model: `facebook/musicgen-small`. A simple and controllable model for generating high-quality music from text prompts.
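Recent Transformers versions expose this through the `text-to-audio` pipeline (a sketch; the prompt and output path are arbitrary):

```python
import scipy.io.wavfile
from transformers import pipeline

synth = pipeline("text-to-audio", model="facebook/musicgen-small")
music = synth("lo-fi beat with a mellow piano melody")
scipy.io.wavfile.write("clip.wav", rate=music["sampling_rate"], data=music["audio"])
```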
Answer natural language questions about the contents of an image.
- Model: `dandelin/vilt-b32-finetuned-vqa`. A vision-and-language transformer (ViLT) model designed for answering questions about images.
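For example (the image path and question are placeholders):

```python
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
print(vqa(image="photo.jpg", question="What animal is in the picture?"))
```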
Answer questions based on the content of a document (e.g., scanned PDFs, images with text).
- Model: `impira/layoutlm-document-qa`. A model that understands the layout of documents (text + spatial information) to answer questions accurately.
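A sketch; note that this pipeline relies on an OCR backend (pytesseract) being installed, and the file name is a placeholder:

```python
from transformers import pipeline

doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
print(doc_qa(image="invoice.png", question="What is the invoice total?"))
```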
Create numerical vector representations (embeddings) of text, images, or audio for analysis and search.
- Model: `sentence-transformers/all-MiniLM-L6-v2`. A versatile model that maps sentences and paragraphs to a dense vector space, perfect for semantic search and clustering.
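This model loads through the `sentence-transformers` library (the example sentences are illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "How do I reset my password?",
    "Steps to recover an account",
])
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per sentence
```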
- Clone the repository:

```bash
git clone https://github.com/alinrajpoot/genr-kit.git
cd genr-kit
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the server:

```bash
python main.py
```

The API will be available at http://localhost:9000.
Once the server is running, visit http://localhost:9000 in your browser.
- Python 3.8+
- Gradio
- Hugging Face Transformers
- PyTorch
- Pillow (PIL)
Open source - feel free to contribute and improve!