This project provides a comprehensive system for analyzing live video in real time using Vision Language Models (VLMs) and generating summaries of the content. The system works in two main phases: real-time frame analysis and post-processing summarization.
The server component uses Ollama to run the VLM and LLM models, providing real-time analysis and summarization capabilities.
Real-time VLM analysis output from client.py
Comprehensive video summary from summarize_video_audio.py
- Real-time video frame analysis using Vision Language Models
- Frame-by-frame descriptions with timestamps
- Multi-modal summarization combining visual and audio transcription content
- Text-only chat capabilities
- Rich console output formatting
- Configurable system prompts for different use cases
- Runs FastAPI server to handle client requests
- Manages Ollama VLM and LLM models
- Processes video frames and generates descriptions
- Handles text-based chat interactions
- Provides API endpoints for all client operations
- Captures video frames from webcam in real-time
- Sends frames to server for VLM analysis
- Receives and displays frame descriptions
- Saves timestamped descriptions to a file
- Generates chronological summaries of video and audio content
- Combines saved frame descriptions with audio transcription
- Provides comprehensive summaries integrating both visual and audio content
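One way to build such a chronological summary prompt is to interleave visual and audio events by timestamp before handing them to the LLM. This sketch assumes both sources have already been reduced to `(seconds, text)` pairs; the real script's file formats are not shown here:

```python
# Sketch only: assumes both modalities are parsed into (seconds, text) pairs.
def merge_chronologically(frames, audio):
    """Merge (seconds, text) events from both modalities, tagged by source."""
    tagged = [(t, "video", s) for t, s in frames] + \
             [(t, "audio", s) for t, s in audio]
    return sorted(tagged, key=lambda e: e[0])
```

The merged, source-tagged event list can then be joined into a single prompt so the LLM sees what was said and what was shown in order.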
- Enables text-only interactions with the LLM
- Supports system prompts for different roles and contexts
- Example use cases included
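Role-specific system prompts can be modeled as a simple mapping that is prepended to each chat turn. The prompt texts, role names, and Ollama-style message shape below are illustrative assumptions, not the project's shipped configuration:

```python
# Illustrative only: prompt texts and role names are assumptions.
SYSTEM_PROMPTS = {
    "default": "You are a helpful assistant.",
    "security": "You monitor surveillance footage and flag unusual activity.",
}

def build_chat_messages(user_text: str, role: str = "default") -> list[dict]:
    """Prepend the configured system prompt to a single-turn chat."""
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[role]},
        {"role": "user", "content": user_text},
    ]
```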
- Install required packages for both server and client:

  ```bash
  pip install -r requirements.txt
  ```
- Configure server settings:
  - Update `UBUNTU_SERVER_IP` and `UBUNTU_SERVER_PORT` in the configuration section of each script
- Prepare input stream and files:
  - Ensure your webcam is accessible and properly configured
  - For audio analysis, prepare audio transcription files at `AUDIO_TRANSCRIPTION_FILE_PATH`
- Start the server:

  ```bash
  # Run the FastAPI server with uvicorn
  uvicorn server.server:app --host 0.0.0.0 --port 8000 --workers 1
  ```

