SyncNet (Modified)

This repository is a modified and enhanced version of the original SyncNet Python implementation. It provides a complete implementation and demonstration of the audio-to-video synchronization network (SyncNet), a deep learning model that analyzes the correlation between the audio and visual streams of a video.

Overview

SyncNet is designed to detect and measure audio-visual synchronization in videos by learning the correspondence between lip movements and speech. The network can be applied to a range of audio-visual synchronization tasks, including:

  1. Audio-Video Offset Detection: Identifying and measuring temporal lags between audio and visual streams in a video, which is essential for correcting desynchronized content.
  2. Active Speaker Detection: Determining which person is speaking among multiple faces visible in a video frame, useful for video conferencing, surveillance, and content analysis applications.

The model uses a two-stream convolutional neural network architecture that processes both visual (face) and audio (speech) features to compute synchronization confidence scores.
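
To make the two-stream idea concrete, the following is a minimal PyTorch sketch, not the network shipped with this repository: the layer sizes and input shapes are illustrative assumptions, and the real weights are loaded from syncnet_v2.model.

# Conceptual two-stream sketch (illustrative only; not the SyncNet
# architecture or weights distributed with this repository).
import torch
import torch.nn as nn

class TwoStreamSketch(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Visual stream: a short stack of grayscale mouth-region frames.
        self.lip_stream = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )
        # Audio stream: MFCC features covering the same time window.
        self.audio_stream = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),
        )

    def forward(self, lips, audio):
        # A small distance between the two embeddings at the correct temporal
        # alignment indicates that the streams are in sync.
        return torch.pairwise_distance(self.lip_stream(lips), self.audio_stream(audio))

model = TwoStreamSketch()
lips = torch.randn(1, 1, 5, 112, 112)   # batch, channel, frames, height, width
audio = torch.randn(1, 1, 13, 20)       # batch, channel, MFCC bins, time steps
print(model(lips, audio))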

Requirements

System Requirements

  • Python 3.10 or higher (you can use Anaconda)
  • NVIDIA GPU with CUDA support (recommended for optimal performance)
  • CUDA 11.6 or compatible version
  • FFmpeg for audio and video processing

Python Dependencies

Install all required Python packages using pip:

pip install -r requirements.txt

Important: The requirements.txt file includes an --extra-index-url entry so that PyTorch is installed with CUDA support. Make sure the CUDA version in that index URL matches the CUDA version installed on your system. You can check your CUDA version by running:

nvidia-smi

This command displays your NVIDIA driver version and the maximum CUDA version that driver supports. Adjust the PyTorch installation accordingly if needed.
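
After installing, you can also verify from Python that PyTorch was built with CUDA support and can actually see your GPU:

import torch

print("PyTorch version:", torch.__version__)
print("Built with CUDA:", torch.version.cuda)        # CUDA runtime PyTorch was compiled against
print("CUDA available:", torch.cuda.is_available())  # True if the driver and GPU are usable
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))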

FFmpeg Installation

FFmpeg is essential for extracting and processing audio and video frames. Download the latest static build from the FFmpeg Builds repository and add the bin directory to your system PATH environment variable.
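
Since the scripts rely on the ffmpeg executable, it is worth confirming from Python that it is actually reachable on PATH; a quick check such as the following will do:

import shutil
import subprocess

# Fail early if ffmpeg is not on PATH.
ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    raise SystemExit("ffmpeg not found on PATH; add its bin directory and try again")
print("Using ffmpeg at:", ffmpeg)
subprocess.run([ffmpeg, "-version"], check=True)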

Quick Start Demo

This section demonstrates how to run SyncNet on a sample video to verify your installation is working correctly.

Download Pre-trained Model

First, download the pre-trained SyncNet model and example video files:

.\download_model.ps1

This script will download:

  • syncnet_v2.model - The pre-trained SyncNet model weights
  • example.avi - A sample video for testing
  • sfd_face.pth - Face detection model weights
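
Before running the demo, you can check that the downloaded files are where the demo expects them. The snippet below assumes everything is placed under the data/ directory (the demo command below reads data/example.avi); adjust the paths if your copy of the download script uses a different layout.

from pathlib import Path

# Assumed layout: download_model.ps1 places all files under data/.
expected = ["syncnet_v2.model", "example.avi", "sfd_face.pth"]
missing = [name for name in expected if not (Path("data") / name).exists()]
print("All files present" if not missing else f"Missing files: {missing}")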

Run the Demo

Create a temporary directory for intermediate files and run the demo:

mkdir tmp
python demo_syncnet.py --videofile data/example.avi --tmp_dir tmp

The script will process the video and output synchronization metrics. A successful run should return values similar to:

AV offset:      3 
Min dist:       5.353
Confidence:     10.021

Output Explanation:

  • AV offset: The detected audio-video offset in frames (positive means audio is ahead)
  • Min dist: The minimum distance between audio and video features
  • Confidence: The confidence score for the synchronization detection (higher is better)
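
Since the offset is reported in frames, its size in real time depends on the video's frame rate. As a worked example, assuming a 25 fps video (an assumption, not a value reported by the demo), the detected offset of 3 frames corresponds to 120 ms:

def offset_to_ms(av_offset_frames: int, fps: float = 25.0) -> float:
    # Convert an AV offset in frames to milliseconds at the given frame rate.
    return av_offset_frames * 1000.0 / fps

print(offset_to_ms(3))            # 120.0 ms at 25 fps
print(offset_to_ms(3, fps=30.0))  # 100.0 ms at 30 fps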

Full Pipeline Usage

For processing your own videos with complete analysis and visualization, use the full three-stage pipeline:

Stage 1: Face Detection and Cropping

Extract and crop face tracks from the video:

python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output

This stage detects faces in each frame using the S3FD face detector and creates cropped video tracks for each detected face.

Stage 2: Synchronization Analysis

Analyze audio-video synchronization for each face track:

python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output

This stage computes the audio-video offset for each face track and generates confidence scores to determine which face is actively speaking.

Stage 3: Visualization

Generate an output video with synchronization annotations:

python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output

This final stage creates a visualization showing detected faces with bounding boxes and synchronization scores overlaid on the video.
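
If you want to run all three stages from a single script, for example to batch-process several videos, the commands above can be chained with subprocess calls. This wrapper simply replays the same command lines; the video path, reference name, and data directory are placeholders to fill in.

import subprocess
import sys

def run_full_pipeline(videofile: str, reference: str, data_dir: str) -> None:
    # Run face cropping, synchronization analysis, and visualization in order.
    for script in ("run_pipeline.py", "run_syncnet.py", "run_visualise.py"):
        subprocess.run(
            [sys.executable, script,
             "--videofile", videofile,
             "--reference", reference,
             "--data_dir", data_dir],
            check=True,  # stop immediately if a stage fails
        )

# Example (placeholder arguments):
# run_full_pipeline("/path/to/video.mp4", "name_of_video", "/path/to/output")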

Pipeline Outputs

After running the full pipeline, you will find the following outputs in your specified data directory:

  1. $DATA_DIR/pycrop/$REFERENCE/*.avi - Individual cropped face track videos for each detected person
  2. $DATA_DIR/pywork/$REFERENCE/offsets.txt - Text file containing audio-video offset values and confidence scores for each face track
  3. $DATA_DIR/pyavi/$REFERENCE/video_out.avi - Final annotated output video with synchronization visualizations
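
For downstream use, such as picking the active speaker automatically, offsets.txt can be parsed in a few lines. The exact layout of the file is not specified here, so the sketch below assumes each non-empty line carries an offset and a confidence value for one face track and selects the track with the highest confidence; adjust the parsing to match the file your run produces.

import re
from pathlib import Path

def best_track(offsets_file):
    # Assumed format: one face track per line, with the AV offset appearing
    # before the confidence value. Returns (line index, offset, confidence).
    best = None
    for i, line in enumerate(Path(offsets_file).read_text().splitlines()):
        numbers = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", line)]
        if len(numbers) >= 2:
            offset, confidence = numbers[0], numbers[-1]
            if best is None or confidence > best[2]:
                best = (i, offset, confidence)
    return best

# Example (placeholder path):
# print(best_track("/path/to/output/pywork/name_of_video/offsets.txt"))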

Credits

This repository is based on the original SyncNet implementation by Joon Son Chung. For more details about the methodology and research, please refer to the original paper:

Chung, J. S., & Zisserman, A. (2016). Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV.
