A smart multimodal classroom video recording system that automatically composes multiple content streams (camera feeds, slides, and whiteboard) based on real-time cues such as gestures and spoken references. Using computer vision, automatic speech recognition (ASR), and content analysis, it switches dynamically between sources to produce a more engaging, context-aware lecture recording. The goal is to overcome the limitations of static cameras and give both live and recorded viewers a richer, more immersive experience.
- Automatic switching between slide and professor views based on content analysis
- Corner overlay mode to show both feeds simultaneously
- Pose estimation for gesture detection (pointing and writing)
- Real-time pose visualization with skeleton tracking
- Debug mode with comprehensive visualization overlays
- Standalone pose estimation for fine-tuning
- High-quality video output with configurable settings
- Confidence analysis reports for model performance evaluation
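To make the switching feature concrete, here is a minimal sketch of how per-frame cues could be mapped to a view. It is an illustration only, not the project's actual logic: the `Cues` fields, the rules in `choose_view`, and the `min_hold` hold time are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Cues:
    """Per-frame cues. All field names are hypothetical."""
    t: float                # timestamp in seconds
    pointing: bool          # pose model reports a pointing gesture
    writing: bool           # pose model reports writing on the board
    slide_referenced: bool  # speech or OCR analysis links the moment to a slide


def choose_view(cues: Cues) -> str:
    """Map cues to a preferred view: 'professor', 'slide', or 'overlay'."""
    if cues.writing:
        return "professor"
    if cues.pointing and cues.slide_referenced:
        return "overlay"  # corner overlay shows both feeds
    if cues.slide_referenced:
        return "slide"
    return "professor"


def plan_switches(frames, min_hold=2.0):
    """Return (timestamp, view) switch points.

    A new view is accepted only after the current one has been held for
    at least min_hold seconds, which avoids rapid flickering.
    """
    switches = []
    for cues in frames:
        view = choose_view(cues)
        if not switches:
            switches.append((cues.t, view))
        elif view != switches[-1][1] and cues.t - switches[-1][0] >= min_hold:
            switches.append((cues.t, view))
    return switches
```

The hold time is a common way to keep a cut from being reversed a fraction of a second later; the value used here is arbitrary.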
- Python 3.11 (the only version tested)
- FFmpeg installed on your system
- Tesseract OCR installed on your system
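Before running the pipeline, it can help to confirm that these prerequisites are visible to Python. The sketch below is a standalone helper and not part of the project; it assumes the executables are named `ffmpeg` and `tesseract` and are expected on `PATH`.

```python
import shutil
import sys


def missing_tools(tools=("ffmpeg", "tesseract")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


def python_version_ok(expected=(3, 11)):
    """Check that the running interpreter matches the tested major.minor version."""
    return tuple(sys.version_info[:2]) == tuple(expected)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system tools:", ", ".join(missing))
    if not python_version_ok():
        print("Warning: only Python 3.11 has been tested.")
```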
First, install the required system tools:
# Install FFmpeg
# on Ubuntu or Debian
sudo apt update && sudo apt install ffmpeg
# on Arch Linux
sudo pacman -S ffmpeg
# on macOS using Homebrew (https://brew.sh/)
brew install ffmpeg
# on Windows using Chocolatey (https://chocolatey.org/)
choco install ffmpeg
# on Windows using Scoop (https://scoop.sh/)
scoop install ffmpeg
# Install Tesseract OCR
# on Ubuntu or Debian
sudo apt install tesseract-ocr
# on Arch Linux
sudo pacman -S tesseract
# on macOS using Homebrew
brew install tesseract
# on Windows
# Download and install from https://github.com/UB-Mannheim/tesseract/wiki

We recommend using a virtual environment to install the required Python dependencies:
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
python -m pip install -r requirements.txt

python main.py slide_video.mp4 professor_video.mp4 --output-dir output

- --output-dir: Directory to save outputs (required)
- --ocr-results: Path to pre-computed OCR results
- --pose-results: Path to pre-computed pose results
- --transcription: Path to pre-computed transcription
- --analysis: Path to pre-computed content analysis
- --load-only: Only load pre-computed results (skip processing)
- --skip-video: Skip video creation after processing
- --pose-only: Only run pose estimation and exit
- --debug: Enable debug mode (faster processing, lower resolution, visual overlays)
- --quality: Set output quality ('high', 'medium', 'low')
- --report: Generate confidence analysis report
- --pose-method: Choose pose estimation method ('mediapipe' or 'openpose', default: 'mediapipe')
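For readers who want to see how these options fit together, the following `argparse` sketch mirrors the documented flags. It is a reconstruction from the option list, not the parser in `main.py`: the help strings and the `build_parser` name are assumptions, and the real parser may differ (for example, it may supply a default for `--output-dir`).

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("slide_video", help="path to the slide recording")
    parser.add_argument("professor_video", help="path to the professor camera recording")
    parser.add_argument("--output-dir", required=True, help="directory to save outputs")

    # Paths to pre-computed results; each is optional.
    for flag in ("--ocr-results", "--pose-results", "--transcription", "--analysis"):
        parser.add_argument(flag, help="path to pre-computed results")

    # Boolean switches that are off unless given.
    for flag in ("--load-only", "--skip-video", "--pose-only", "--debug", "--report"):
        parser.add_argument(flag, action="store_true")

    parser.add_argument("--quality", choices=("high", "medium", "low"))
    parser.add_argument("--pose-method", choices=("mediapipe", "openpose"),
                        default="mediapipe")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

With this sketch, `argparse` exposes `--output-dir` as `args.output_dir` and `--pose-method` as `args.pose_method`, since dashes in option names become underscores.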
python main.py slide_video.mp4 professor_video.mp4 --debug

This will create a lower resolution video with comprehensive debug overlays showing:
- Pose skeleton visualization with keypoints
- Bounding box around the professor
- Gesture detection status (pointing/writing)
- Confidence scores
- Motion tracking information
- Decision-making process
# Using MediaPipe (faster)
python main.py slide_video.mp4 professor_video.mp4 --pose-only --pose-method mediapipe
# Using OpenPose (slower but potentially more accurate)
python main.py slide_video.mp4 professor_video.mp4 --pose-only --pose-method openpose

This will run only the pose estimation and save the results to a JSON file for analysis.
python main.py slide_video.mp4 professor_video.mp4 --quality high

This will create a high-quality output video with the best possible settings.
python main.py slide_video.mp4 professor_video.mp4 --report

This will generate a confidence analysis report showing the performance of each model over time.
python main.py slide_video.mp4 professor_video.mp4 --load-only --ocr-results ocr.json --pose-results pose.json --transcription trans.json --analysis analysis.json

This will use existing analysis results instead of running the full pipeline.
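As a rough illustration of reusing pre-computed results, the sketch below loads a set of JSON files and reports which ones are missing rather than failing partway through. The `load_precomputed` helper and the result names are hypothetical, and no assumption is made about the schema inside each file.

```python
import json
from pathlib import Path


def load_precomputed(paths):
    """Load each named JSON result file.

    paths maps a result name (for example 'ocr') to a file path.
    Returns (results, missing): the parsed files and the names not found.
    """
    results, missing = {}, []
    for name, path in paths.items():
        file_path = Path(path)
        if file_path.is_file():
            results[name] = json.loads(file_path.read_text(encoding="utf-8"))
        else:
            missing.append(name)
    return results, missing


if __name__ == "__main__":
    loaded, absent = load_precomputed({
        "ocr": "ocr.json",
        "pose": "pose.json",
        "transcription": "trans.json",
        "analysis": "analysis.json",
    })
    print("loaded:", sorted(loaded), "missing:", absent)
```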
# Using MediaPipe (faster, default)
python main.py slide_video.mp4 professor_video.mp4 --pose-method mediapipe
# Using OpenPose (slower but potentially more accurate)
python main.py slide_video.mp4 professor_video.mp4 --pose-method openpose

Choose between MediaPipe (faster) and OpenPose (slower but potentially more accurate) for pose estimation.
The program creates several output files in the specified output directory:
- output.mp4 (or debug_output.mp4 in debug mode): The final combined video
- pose/pose_results.json: Pose estimation data including keypoints and gesture detection
- ocr/ocr_results.json: OCR results from slides
- transcription/transcription.json: Speech transcription
- analysis/analysis_results.json: Content analysis results
- decisions/decisions.json: Camera switching decisions with pose data
- multi_model_analysis.png: Confidence analysis report (when using --report)
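After a run, a quick inventory of the output directory shows which of these files were produced. This is a small standalone sketch, not part of the project; `EXPECTED_OUTPUTS` lists only the files documented for a normal (non-debug) run without `--report`, and the `inventory` helper is a hypothetical name.

```python
from pathlib import Path

# Relative paths documented for a normal run. debug_output.mp4 and
# multi_model_analysis.png are left out because they depend on flags.
EXPECTED_OUTPUTS = (
    "output.mp4",
    "pose/pose_results.json",
    "ocr/ocr_results.json",
    "transcription/transcription.json",
    "analysis/analysis_results.json",
    "decisions/decisions.json",
)


def inventory(output_dir, expected=EXPECTED_OUTPUTS):
    """Return a dict mapping each expected relative path to whether it exists."""
    root = Path(output_dir)
    return {rel: (root / rel).is_file() for rel in expected}


if __name__ == "__main__":
    for rel, present in inventory("output").items():
        print("ok     " if present else "missing", rel)
```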