Description
Paper
Link: https://arxiv.org/pdf/2008.00820.pdf
Year: 2020
Summary
- RegNet - video-to-sound generation; produces visually aligned sound using an audio forwarding regularizer
- trained with a GAN to learn a correct mapping between video frames and visually relevant sound
Methods
- visual encoder - BN-Inception frame features fed through three 1D convolutional layers and a two-layer bidirectional LSTM
- audio forwarding regularizer - a two-layer Bi-LSTM that encodes the ground-truth sound; used only during training to supply information the visual features cannot capture, and removed at test time
- generator - concatenates the encoded visual feature with the regularizer output and produces a spectrogram; two 1D convolutional layers followed by two 1D transposed convolutional layers
- vocoder - a WaveNet that converts the generated spectrogram into a waveform
- discriminator - takes the extracted frame features and a spectrogram, and distinguishes whether the spectrogram comes from the real video or was generated by the model
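The encoder/regularizer/generator pipeline above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: all layer widths, kernel sizes, and the mel-bin count are assumed for demonstration, and the BN-Inception frame features are taken as precomputed inputs.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    # Three 1D convs plus a two-layer bidirectional LSTM over per-frame
    # BN-Inception features (assumed precomputed, hypothetical dim 1024).
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):                 # x: (B, T, feat_dim)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(h)
        return out                        # (B, T, 2*hidden)

class AudioForwardingRegularizer(nn.Module):
    # Two-layer Bi-LSTM over the ground-truth spectrogram; only used
    # during training and dropped at test time.
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, spec):              # spec: (B, T, n_mels)
        out, _ = self.lstm(spec)
        return out                        # (B, T, 2*hidden)

class Generator(nn.Module):
    # Concatenates visual and regularizer features, then two 1D convs
    # and two 1D transposed convs (each doubling the time length)
    # produce a spectrogram.
    def __init__(self, hidden=256, n_mels=80):
        super().__init__()
        d = 4 * hidden                    # 2*hidden (visual) + 2*hidden (audio)
        self.net = nn.Sequential(
            nn.Conv1d(d, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, hidden, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hidden, n_mels, 4, stride=2, padding=1),
        )

    def forward(self, vis, aud):          # both (B, T, 2*hidden)
        z = torch.cat([vis, aud], dim=-1).transpose(1, 2)
        return self.net(z)                # (B, n_mels, 4*T)

B, T = 2, 16
vis = VisualEncoder()(torch.randn(B, T, 1024))
aud = AudioForwardingRegularizer()(torch.randn(B, T, 80))
spec = Generator()(vis, aud)
print(spec.shape)
```

The vocoder (WaveNet) and the discriminator are omitted here; at test time the regularizer branch would be replaced (e.g. by zeros or a learned constant), since no ground-truth audio is available.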
Results
- generated sounds fool human evaluators with a 68.12% success rate