This repository is an enhanced and modified version of the original SyncNet Python implementation. It provides a complete demonstration and implementation of the audio-to-video synchronisation network (SyncNet), a deep learning model that analyzes the correlation between audio and visual streams in videos.
SyncNet is designed to detect and measure audio-visual synchronization in videos by learning the correspondence between lip movements and speech. This powerful network can be applied to various audio-visual synchronisation tasks, including:
- Audio-Video Offset Detection: Identifying and measuring temporal lags between audio and visual streams in a video, which is essential for correcting desynchronized content.
- Active Speaker Detection: Determining which person is speaking among multiple faces visible in a video frame, useful for video conferencing, surveillance, and content analysis applications.
The model uses a two-stream convolutional neural network architecture that processes both visual (face) and audio (speech) features to compute synchronization confidence scores.
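To make the two-stream idea concrete, here is a minimal, illustrative sketch of such a network in PyTorch. The class name, layer sizes, and the use of MFCC audio input are assumptions for illustration only; the real layer configuration lives in this repository's model code and differs from this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamSketch(nn.Module):
    """Illustrative two-stream model -- NOT the actual SyncNet layer configuration."""

    def __init__(self, embed_dim=1024):
        super().__init__()
        # Visual stream: a short clip of cropped face frames, shape (B, 3, T, H, W).
        self.video_stream = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Audio stream: MFCC features of the matching speech segment, shape (B, 1, n_mfcc, t).
        self.audio_stream = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, frames, mfcc):
        v = self.video_stream(frames)  # (B, embed_dim) visual embedding
        a = self.audio_stream(mfcc)    # (B, embed_dim) audio embedding
        # In-sync pairs should map to nearby embeddings, so the embedding
        # distance acts as an (un-normalised) synchronisation score.
        return F.pairwise_distance(v, a)
```

In the full model, evaluating this distance across a range of temporal offsets is what produces the offset and confidence numbers reported by the demo below.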
- Python 3.10 or higher (you can use Anaconda)
- NVIDIA GPU with CUDA support (recommended for optimal performance)
- CUDA 11.6 or compatible version
- FFmpeg for audio and video processing
Install all required Python packages using pip:
```
pip install -r requirements.txt
```

Important: The requirements.txt file includes `--extra-index-url` so that PyTorch is installed with CUDA support. Make sure this matches the CUDA version installed on your machine. You can check your CUDA version by running:

```
nvidia-smi
```

This command displays your NVIDIA driver version and the highest CUDA version the driver supports. Adjust the PyTorch installation accordingly if needed.
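If you want to confirm from Python that the installed PyTorch build can actually see your GPU, a quick check like the following works (illustrative snippet, not part of the repository):

```python
import torch

# Report what the installed PyTorch build was compiled against and
# whether it can see a CUDA device.
print("PyTorch version :", torch.__version__)
print("CUDA available  :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build      :", torch.version.cuda)
    print("GPU             :", torch.cuda.get_device_name(0))
```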
FFmpeg is essential for extracting and processing audio and video frames. Download the latest static build from the FFmpeg Builds repository and add the bin directory to your system PATH environment variable.
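To verify that FFmpeg is reachable before running the pipeline, you can simply run `ffmpeg -version` in a terminal, or use a small Python check like the illustrative snippet below:

```python
import shutil
import subprocess

# Fail early if ffmpeg is not on PATH, since the scripts invoke it externally.
ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found on PATH - add the FFmpeg bin directory and retry.")
print("Using ffmpeg at:", ffmpeg_path)
subprocess.run(["ffmpeg", "-version"], check=True)
```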
This section demonstrates how to run SyncNet on a sample video to verify your installation is working correctly.
First, download the pre-trained SyncNet model and example video files:
```
.\download_model.ps1
```

This script will download:

- `syncnet_v2.model` - The pre-trained SyncNet model weights
- `example.avi` - A sample video for testing
- `sfd_face.pth` - Face detection model weights
Create a temporary directory for intermediate files and run the demo:
```
mkdir tmp
python demo_syncnet.py --videofile data/example.avi --tmp_dir tmp
```

The script will process the video and output synchronization metrics. A successful run should return values similar to:

```
AV offset: 3
Min dist: 5.353
Confidence: 10.021
```
Output Explanation:
- AV offset: The detected audio-video offset in frames (positive means the audio is ahead)
- Min dist: The minimum distance between the audio and video features
- Confidence: The confidence score for the synchronization detection (higher is better)
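When interpreting the offset, it can help to convert it from frames to time. The helper below is illustrative and assumes a 25 fps frame rate, which is what the pipeline typically works at; adjust `FPS` if your setup differs.

```python
# Illustrative helper: convert the reported AV offset from frames to
# milliseconds. Assumes a 25 fps frame rate (adjust if yours differs).
FPS = 25

def offset_to_ms(av_offset_frames: int, fps: int = FPS) -> float:
    """A positive offset means the audio is ahead of the video."""
    return av_offset_frames * 1000.0 / fps

print(offset_to_ms(3))  # e.g. 3 frames -> 120.0 ms
```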
For processing your own videos with complete analysis and visualization, use the full three-stage pipeline:
Extract and crop face tracks from the video:
```
python run_pipeline.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This stage detects faces in each frame using the S3FD face detector and creates cropped video tracks for each detected face.
Analyze audio-video synchronization for each face track:
```
python run_syncnet.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This stage computes the audio-video offset for each face track and generates confidence scores to determine which face is actively speaking.
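Conceptually, the offset is found by sliding the audio features against the video features and picking the shift with the smallest embedding distance, while the confidence reflects how pronounced that minimum is. The NumPy sketch below illustrates the idea; the function name, shift range, sign convention, and the exact confidence formula are simplifying assumptions, not the repository's implementation.

```python
import numpy as np

def estimate_offset(video_feats, audio_feats, max_shift=15):
    """Schematic offset estimation from per-frame embeddings.

    video_feats, audio_feats: arrays of shape (num_frames, feat_dim),
    one embedding per video frame / audio chunk. Tries temporal shifts in
    [-max_shift, max_shift] and keeps the one with the smallest mean
    embedding distance; the confidence is taken as the gap between the
    median and the minimum of that distance curve.
    """
    dists = []
    for shift in range(-max_shift, max_shift + 1):
        rolled = np.roll(audio_feats, shift, axis=0)
        dists.append(np.mean(np.linalg.norm(video_feats - rolled, axis=1)))
    dists = np.asarray(dists)
    best = int(np.argmin(dists))
    av_offset = best - max_shift          # sign convention may differ from the real code
    min_dist = float(dists[best])
    confidence = float(np.median(dists) - dists[best])
    return av_offset, min_dist, confidence
```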
Generate an output video with synchronization annotations:
```
python run_visualise.py --videofile /path/to/video.mp4 --reference name_of_video --data_dir /path/to/output
```

This final stage creates a visualization showing detected faces with bounding boxes and synchronization scores overlaid on the video.
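If you process many videos, the three stages can be chained from a small wrapper script. The snippet below is a hypothetical convenience wrapper built on the exact commands shown above; the placeholder paths are the same ones used in the examples.

```python
import subprocess
import sys

# Hypothetical wrapper: runs the three pipeline stages back to back.
VIDEO = "/path/to/video.mp4"
REFERENCE = "name_of_video"
DATA_DIR = "/path/to/output"

for script in ("run_pipeline.py", "run_syncnet.py", "run_visualise.py"):
    cmd = [sys.executable, script,
           "--videofile", VIDEO,
           "--reference", REFERENCE,
           "--data_dir", DATA_DIR]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the chain if a stage fails
```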
After running the full pipeline, you will find the following outputs in your specified data directory:
- `$DATA_DIR/pycrop/$REFERENCE/*.avi` - Individual cropped face track videos for each detected person
- `$DATA_DIR/pywork/$REFERENCE/offsets.txt` - Text file containing audio-video offset values and confidence scores for each face track
- `$DATA_DIR/pyavi/$REFERENCE/video_out.avi` - Final annotated output video with synchronization visualizations
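As a quick sanity check after the pipeline finishes, you can verify that the expected outputs exist. The snippet below is illustrative; substitute your own data directory and reference name.

```python
from pathlib import Path

DATA_DIR = Path("/path/to/output")
REFERENCE = "name_of_video"

# Locations produced by the three pipeline stages (see the list above).
crops = sorted((DATA_DIR / "pycrop" / REFERENCE).glob("*.avi"))
offsets = DATA_DIR / "pywork" / REFERENCE / "offsets.txt"
annotated = DATA_DIR / "pyavi" / REFERENCE / "video_out.avi"

print(f"{len(crops)} cropped face track(s) found")
print("offsets.txt present   :", offsets.exists())
print("video_out.avi present :", annotated.exists())
```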
This repository is based on the original SyncNet implementation by Joon Son Chung. For more details about the methodology and research, please refer to the original paper:
Chung, J. S., & Zisserman, A. (2016). Out of time: automated lip sync in the wild. In Workshop on Multi-view Lip-reading, ACCV.