This repository is an enhanced and modified version of the original SyncNet Python implementation. It provides a complete demonstration and implementation of the audio-to-video synchronisation network (SyncNet), a deep learning model that analyzes the correlation between audio and visual streams in videos.
SyncNet is designed to detect and measure audio-visual synchronization in videos by learning the correspondence between lip movements and speech. This powerful network can be applied to various audio-visual synchronisation tasks, including:
- Audio-Video Offset Detection: Identifying and measuring temporal lags between audio and visual streams in a video, which is essential for correcting desynchronized content.
- Active Speaker Detection: Determining which person is speaking among multiple faces visible in a video frame, useful for video conferencing, surveillance, and content analysis applications.
The model uses a two-stream convolutional neural network architecture that processes both visual (face) and audio (speech) features to compute synchronization confidence scores.
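To make the two-stream idea concrete, here is a minimal, illustrative sketch of such a network in PyTorch. The class name, layer sizes, and the use of MFCC audio input are assumptions for illustration only; the real layer configuration lives in this repository's model code and differs from this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSketch(nn.Module):
    """Illustrative two-stream model -- NOT the actual SyncNet layer configuration."""

    def __init__(self, embed_dim=1024):
        super().__init__()
        # Visual stream: a short clip of cropped face frames, shape (B, 3, T, H, W).
        self.video_stream = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio stream: MFCC features of the matching speech segment, shape (B, 1, n_mfcc, t).
        self.audio_stream = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames, mfcc):
        v = self.video_stream(frames)  # (B, embed_dim) visual embedding
        a = self.audio_stream(mfcc)    # (B, embed_dim) audio embedding
        # In-sync pairs should map to nearby embeddings, so the embedding
        # distance acts as an (un-normalised) synchronisation score.
        return F.pairwise_distance(v, a)
```

In the full model, evaluating this distance across a range of temporal offsets is what produces the offset and confidence numbers reported by the demo below.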
- Python 3.10 or higher (you can use Anaconda)
- NVIDIA GPU with CUDA support (recommended for optimal performance)
- CUDA 11.6 or compatible version
- FFmpeg for audio and video processing
Install all required Python packages using pip:
```
pip install -r requirements.txt
```

Important: The requirements.txt file includes `--extra-index-url` so that PyTorch is installed with CUDA support. Make sure this matches the CUDA version installed on your machine. You can check your CUDA version by running:

```
nvidia-smi
```

This command displays your NVIDIA driver version and the highest CUDA version the driver supports. Adjust the PyTorch installation accordingly if needed.
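If you want to confirm from Python that the installed PyTorch build can actually see your GPU, a quick check like the following works (illustrative snippet, not part of the repository):

```python
import torch

# Report what the installed PyTorch build was compiled against and
# whether it can see a CUDA device.
print("PyTorch version :", torch.__version__)
print("CUDA available  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build      :", torch.version.cuda)
    print("GPU             :", torch.cuda.get_device_name(0))
```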
FFmpeg is essential for extracting and processing audio and video frames. Download the latest static build from the FFmpeg Builds repository and add the bin directory to your system PATH environment variable.
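To verify that FFmpeg is reachable before running the pipeline, you can simply run `ffmpeg -version` in a terminal, or use a small Python check like the illustrative snippet below:

```python
import shutil
import subprocess

# Fail early if ffmpeg is not on PATH, since the scripts invoke it externally.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found on PATH - add the FFmpeg bin directory and retry.")
print("Using ffmpeg at:", ffmpeg_path)
subprocess.run(["ffmpeg", "-version"], check=True)
```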
This section demonstrates how to run SyncNet on a sample video to verify your installation is working correctly.
First, download the pre-trained SyncNet model and example video files:
```
.\download_model.ps1
```

This script will download:

- `syncnet_v2.model` - The pre-trained SyncNet model weights
- `example.avi` - A sample video for testing
- `sfd_face.pth` - Face detection model weights
Create a temporary directory for intermediate files and run the demo:
```
mkdir tmp
python demo_syncnet.py --videofile data/example.avi --tmp_dir tmp
```

The script will process the video and output synchronization metrics. A successful run should return values similar to:

```
AV offset: 3
Min dist: 5.353
Confidence: 10.021
```
Output Explanation:
- AV offset: The detected audio-video offset in frames (positive means the audio is ahead)
- Min dist: The minimum distance between the audio and video features
- Confidence: The confidence score for the synchronization detection (higher is better)
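When interpreting the offset, it can help to convert it from frames to time. The helper below is illustrative and assumes a 25 fps frame rate, which is what the pipeline typically works at; adjust `FPS` if your setup differs.

```python
# Illustrative helper: convert the reported AV offset from frames to
# milliseconds. Assumes a 25 fps frame rate (adjust if yours differs).
FPS = 25

def offset_to_ms(av_offset_frames: int, fps: int = FPS) -> float:
    """A positive offset means the audio is ahead of the video."""
    return av_offset_frames * 1000.0 / fps

print(offset_to_ms(3))  # e.g. 3 frames -> 120.0 ms
```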
For processing your own videos with complete analysis and visualization, use the full three-stage pipeline:
Extract and crop face tracks from the video:
```
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This stage detects faces in each frame using the S3FD face detector and creates cropped video tracks for each detected face.
Analyze audio-video synchronization for each face track:
```
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This stage computes the audio-video offset for each face track and generates confidence scores to determine which face is actively speaking.
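Conceptually, the offset is found by sliding the audio features against the video features and picking the shift with the smallest embedding distance, while the confidence reflects how pronounced that minimum is. The NumPy sketch below illustrates the idea; the function name, shift range, sign convention, and the exact confidence formula are simplifying assumptions, not the repository's implementation.

```python
import numpy as np

def estimate_offset(video_feats, audio_feats, max_shift=15):
    """Schematic offset estimation from per-frame embeddings.

    video_feats, audio_feats: arrays of shape (num_frames, feat_dim),
    one embedding per video frame / audio chunk. Tries temporal shifts in
    [-max_shift, max_shift] and keeps the one with the smallest mean
    embedding distance; the confidence is taken as the gap between the
    median and the minimum of that distance curve.
    """
    dists = []
    for shift in range(-max_shift, max_shift + 1):
        rolled = np.roll(audio_feats, shift, axis=0)
        dists.append(np.mean(np.linalg.norm(video_feats - rolled, axis=1)))
    dists = np.asarray(dists)
    best = int(np.argmin(dists))
    av_offset = best - max_shift          # sign convention may differ from the real code
    min_dist = float(dists[best])
    confidence = float(np.median(dists) - dists[best])
    return av_offset, min_dist, confidence
```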
Generate an output video with synchronization annotations:
```
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This final stage creates a visualization showing detected faces with bounding boxes and synchronization scores overlaid on the video.
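If you process many videos, the three stages can be chained from a small wrapper script. The snippet below is a hypothetical convenience wrapper built on the exact commands shown above; the placeholder paths are the same ones used in the examples.

```python
import subprocess
import sys

# Hypothetical wrapper: runs the three pipeline stages back to back.
VIDEO = "/path/to/video.mp4"
REFERENCE = "name_of_video"
DATA_DIR = "/path/to/output"

for script in ("run_pipeline.py", "run_syncnet.py", "run_visualise.py"):
    cmd = [sys.executable, script,
           "--videofile", VIDEO,
           "--reference", REFERENCE,
           "--data_dir", DATA_DIR]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the chain if a stage fails
```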
After running the full pipeline, you will find the following outputs in your specified data directory:
- `$DATA_DIR/pycrop/$REFERENCE/*.avi` - Individual cropped face track videos for each detected person
- `$DATA_DIR/pywork/$REFERENCE/offsets.txt` - Text file containing audio-video offset values and confidence scores for each face track
- `$DATA_DIR/pyavi/$REFERENCE/video_out.avi` - Final annotated output video with synchronization visualizations
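As a quick sanity check after the pipeline finishes, you can verify that the expected outputs exist. The snippet below is illustrative; substitute your own data directory and reference name.

```python
from pathlib import Path

DATA_DIR = Path("/path/to/output")
REFERENCE = "name_of_video"

# Locations produced by the three pipeline stages (see the list above).
crops = sorted((DATA_DIR / "pycrop" / REFERENCE).glob("*.avi"))
offsets = DATA_DIR / "pywork" / REFERENCE / "offsets.txt"
annotated = DATA_DIR / "pyavi" / REFERENCE / "video_out.avi"

print(f"{len(crops)} cropped face track(s) found")
print("offsets.txt present   :", offsets.exists())
print("video_out.avi present :", annotated.exists())
```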
This repository is based on the original SyncNet implementation by Joon Son Chung. For more details about the methodology and research, please refer to the original paper:
Chung, J. S., & Zisserman, A. (2016). Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV.