A comprehensive collection of Python scripts for preparing high-quality datasets for text-to-image model training. This toolkit provides end-to-end functionality from extracting frames from videos to generating detailed image captions using state-of-the-art vision-language models.
- Video Frame Extraction: Extract frames from videos at specified intervals
- Image Processing: Renaming, resizing, and format-conversion utilities
- Automated Captioning: Generate detailed captions using multiple vision models
- Advanced Description Generation: Create comprehensive image descriptions with artistic analysis
- Flexible Output Formats: Support for JSONL, CSV, TSV, and TXT formats
- Resume Capability: Continue processing from where you left off
- Custom Style Prefixes: Add custom style tokens to captions
- GPU Acceleration: Optimized for CUDA-enabled systems
# Core dependencies
pip install torch torchvision torchaudio
pip install transformers
pip install Pillow
pip install opencv-python
# For quantization (optional but recommended for large models)
pip install bitsandbytes accelerate
git clone https://github.com/gokhaneraslan/text_to_image_dataset_toolkit.git
cd text_to_image_dataset_toolkit
Extract frames from videos at specified time or frame intervals.
Features:
- Extract frames by time intervals (seconds) or frame count
- Process single videos or entire folders
- Multiple output formats (JPG, PNG)
- Automatic folder organization by video name
Usage:
# Extract frame every 2 seconds
python video_frame_extractor.py input_video.mp4 output_folder -s 2.0
# Extract every 30th frame
python video_frame_extractor.py input_video.mp4 output_folder -f 30
# Process entire folder
python video_frame_extractor.py video_folder/ output_folder -s 1.5 --format png
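Under the hood, a time interval in seconds is converted to a frame step using the video's FPS, and every Nth frame is written out. A minimal sketch of that idea with OpenCV (the function and file-naming scheme here are illustrative, not the script's exact internals):
# Illustrative sketch: map a time interval to a frame step via FPS
import os
import cv2

def extract_frames(video_path, output_dir, interval_seconds=2.0):
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0            # fall back if the container reports no FPS
    frame_step = max(1, int(round(fps * interval_seconds)))
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:                     # keep every Nth frame
            cv2.imwrite(os.path.join(output_dir, f"frame_{saved:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved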
Rename and organize images with sequential numbering and format conversion.
Features:
- Batch rename images with custom prefixes
- Format conversion between image types
- Automatic RGB conversion for JPEG compatibility
- Sequential numbering with zero-padding
Usage:
# Basic renaming
python image_organizer.py source_folder dest_folder --prefix "dataset_img"
# With format conversion
python image_organizer.py source_folder dest_folder --prefix "img" --output_format jpg --start_index 1000
Generate captions using lightweight vision models (BLIP, GIT).
Features:
- Support for Microsoft GIT and Salesforce BLIP models
- Multiple output formats (JSONL, CSV, TSV, TXT)
- Custom style prefix addition
- GPU acceleration support
Usage:
# Configuration
MODEL_ID = "microsoft/git-base-coco" # or "Salesforce/blip-image-captioning-base"
IMAGE_INPUT_FOLDER = "/path/to/images"
METADATA_OUTPUT_FILE = "/path/to/metadata.jsonl"
custom_style_caption = "your_style_token"
# Run the script
python basic_captioner.py
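For reference, captioning a single image with the GIT checkpoint follows the standard Transformers pattern from the Hugging Face model card; the BLIP checkpoint works analogously through its processor and BlipForConditionalGeneration. A minimal, illustrative sketch (the image path and style token are placeholders):
# Minimal single-image captioning sketch for microsoft/git-base-coco (illustrative, not the full script)
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "microsoft/git-base-coco"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

image = Image.open("example.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"your_style_token {caption}")                    # custom style prefix prepended to the caption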
Generate detailed, artistic descriptions using Google's Gemma-3 model.
Features:
- Two captioning modes: detailed analysis and concise captioning
- Advanced prompt engineering for artistic analysis
- 4-bit quantization for memory optimization
- Resume capability for large datasets
- Comprehensive error handling and memory management
Caption Styles:
- detailed_art_analyst: Comprehensive artistic analysis including composition, color theory, and style
- concise_captioner: Brief, factual descriptions for general use
Usage:
# Configuration
GEMMA_MODEL_ID = "google/gemma-3-12b-it"
IMAGE_INPUT_FOLDER = "/path/to/images"
METADATA_OUTPUT_FILE = "/path/to/metadata.jsonl"
custom_style_caption = "your_style_token"
# Run the script
python advanced_captioner.py
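For orientation, loading Gemma 3 in 4-bit and prompting it with an image follows the pattern in the Hugging Face Gemma 3 documentation. The sketch below assumes a recent transformers release with Gemma 3 support; the prompt text is only an illustration of the detailed-analysis mode, not the script's actual prompt:
# Rough sketch: 4-bit Gemma 3 loading and single-image prompting (illustrative)
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Gemma3ForConditionalGeneration

GEMMA_MODEL_ID = "google/gemma-3-12b-it"
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

processor = AutoProcessor.from_pretrained(GEMMA_MODEL_ID)
model = Gemma3ForConditionalGeneration.from_pretrained(
    GEMMA_MODEL_ID, quantization_config=quant_config, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "/path/to/images/example.jpg"},
        {"type": "text", "text": "Describe this image in detail, covering composition, color, and style."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=200)
caption = processor.decode(output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)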
# Extract frames from videos
python video_frame_extractor.py videos/ raw_frames/ -s 2.0 --format jpg
# Organize and rename images
python image_organizer.py raw_frames/ organized_images/ --prefix "dataset" --output_format jpg
# For basic captioning
python basic_captioner.py
# For detailed artistic descriptions
python advanced_captioner.py
Your final dataset should look like:
MyImgDataset/
├── images/
│   ├── dataset_000001.jpg
│   ├── dataset_000002.jpg
│   └── ...
└── metadata.jsonl
{"file_name": "image_001.jpg", "text": "style_token A serene landscape featuring rolling hills..."}
{"file_name": "image_002.jpg", "text": "style_token Portrait of a person with dramatic lighting..."}
file_name,text
image_001.jpg,"style_token A serene landscape featuring rolling hills..."
image_002.jpg,"style_token Portrait of a person with dramatic lighting..."
- Basic Models: Fast, lightweight, good for general captions
  - microsoft/git-base-coco: General-purpose captioning
  - Salesforce/blip-image-captioning-base: Alternative captioning model
- Advanced Models: Detailed, artistic analysis
  - google/gemma-3-12b-it: Comprehensive visual analysis
The advanced captioner includes several memory optimization features:
- 4-bit quantization using BitsAndBytesConfig
- Automatic CUDA memory cleanup
- Resume capability to handle interruptions
- Batch processing with memory monitoring
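The cleanup and resume pieces listed above follow a simple pattern; here is a sketch of the general idea (function names are illustrative, not the script's exact code):
# Illustrative helpers for resuming from an existing metadata file and freeing GPU memory between batches
import gc
import json
import os
import torch

def load_processed(metadata_path):
    """Return the file names already captioned, so they can be skipped after an interruption."""
    if not os.path.exists(metadata_path):
        return set()
    with open(metadata_path, "r", encoding="utf-8") as f:
        return {json.loads(line)["file_name"] for line in f if line.strip()}

def free_cuda_memory():
    """Release Python garbage and cached CUDA memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()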
Add custom style tokens to all captions for fine-tuning specific artistic styles:
custom_style_caption = "your_style_name"
# Results in: "your_style_name [generated caption]"
CUDA Out of Memory:
- Reduce batch size or use CPU-only processing
- Enable 4-bit quantization (already included in advanced captioner)
- Process smaller batches of images
Model Download Issues:
- Ensure stable internet connection
- Some models may require Hugging Face authentication
- Check available disk space for model downloads
Image Processing Errors:
- Verify image file integrity (a quick Pillow-based check is sketched after this list)
- Check supported formats: PNG, JPG, JPEG, BMP, GIF, TIFF, WebP
- Ensure sufficient disk space for output
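A quick way to catch corrupt files before captioning is Pillow's verify(); the sketch below scans a folder and reports anything that fails the check (the folder path and extension list are placeholders):
# Report images that fail Pillow's integrity check (illustrative)
from pathlib import Path
from PIL import Image

for path in Path("/path/to/images").iterdir():
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".bmp", ".gif", ".tiff", ".webp"}:
        continue
    try:
        with Image.open(path) as img:
            img.verify()                     # raises an exception on truncated or corrupt files
    except Exception as exc:
        print(f"Corrupt image: {path} ({exc})")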
- GPU Utilization: Use CUDA-enabled systems for faster processing
- Batch Processing: Process images in batches to optimize memory usage
- Resume Feature: Use the resume capability for large datasets
- Storage: Use fast SSD storage for image datasets
- Recommended minimum resolution: 512x512 pixels
- Clear, well-lit images produce better captions
- Diverse content improves model training
- Detailed Mode: 150-200 tokens per caption
- Concise Mode: 50-100 tokens per caption
- Consistent style and terminology
- Accurate visual descriptions
- Hugging Face Transformers library
- Google Gemma models
- Microsoft GIT model
- Salesforce BLIP model
- OpenCV and Pillow libraries
For questions and support:
- Open an issue on GitHub
- Check the troubleshooting section
- Review the model documentation on Hugging Face
Happy dataset preparation!