
Generating Visually Aligned Sound from Videos #38

@jinglescode

Description

Paper

Link: https://arxiv.org/pdf/2008.00820.pdf
Year: 2020

Summary

  • RegNet: video-to-sound generation that produces visually aligned sound via an audio forwarding regularizer
  • uses a GAN to learn a correct mapping between video frames and visually relevant sound

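As a rough illustration of the adversarial training mentioned above, here is a minimal sketch of the standard non-saturating GAN losses. This is not the paper's exact objective (loss weighting and any auxiliary reconstruction terms are omitted); `d_real` and `d_fake` are assumed to be discriminator logits on real and generated spectrograms.

```python
# Hedged sketch of a standard GAN objective (not the paper's exact losses).
import torch
import torch.nn.functional as F

def d_loss(d_real, d_fake):
    # Discriminator: push logits toward 1 on real spectrograms, 0 on generated ones.
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
            F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def g_loss(d_fake):
    # Generator: try to make the discriminator output 1 on generated spectrograms.
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
```

At a chance-level discriminator (logits of 0), each BCE term evaluates to ln 2 ≈ 0.693, so `g_loss(torch.zeros(4))` is about 0.693.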
Methods

  • visual encoder - per-frame BN-Inception features, followed by three 1D convolutional layers and a two-layer bidirectional LSTM
  • audio forwarding regularizer - a two-layer Bi-LSTM that encodes the ground-truth sound during training (removed at test time)
  • generator - concatenates the encoded visual features with the regularizer output and produces a spectrogram; two 1D convolutional layers followed by two 1D transposed convolutional layers
  • vocoder - converts the spectrogram to a waveform using WaveNet
  • discriminator - takes the extracted frame features and a spectrogram, and distinguishes whether the spectrogram comes from a real video or was generated by the model
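The components above can be sketched in PyTorch as follows. This is a hypothetical shape-level sketch: layer widths, kernel sizes, and the mel dimension (80) are illustrative guesses rather than the paper's hyperparameters, and the WaveNet vocoder and discriminator are omitted.

```python
# Hypothetical sketch of the RegNet pipeline (sizes are illustrative, not the paper's).
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Three 1D convs + two-layer Bi-LSTM over per-frame BN-Inception features."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, frame_feats):            # (B, T, feat_dim)
        x = self.convs(frame_feats.transpose(1, 2)).transpose(1, 2)
        out, _ = self.lstm(x)                  # (B, T, 2*hidden)
        return out

class AudioForwardingRegularizer(nn.Module):
    """Two-layer Bi-LSTM over ground-truth spectrogram frames (training only)."""
    def __init__(self, n_mels=80, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, spec):                   # (B, T, n_mels)
        out, _ = self.lstm(spec)
        return out                             # (B, T, 2*hidden)

class Generator(nn.Module):
    """Concat visual + regularizer features, then produce a spectrogram."""
    def __init__(self, in_dim=512 + 64, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, 3, padding=1), nn.ReLU(),
            # two stride-2 transposed convs upsample the time axis 4x
            nn.ConvTranspose1d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(128, n_mels, 4, stride=2, padding=1),
        )

    def forward(self, vis_feats, reg_feats):   # both (B, T, C)
        x = torch.cat([vis_feats, reg_feats], dim=-1).transpose(1, 2)
        return self.net(x)                     # (B, n_mels, 4*T)

# shape check with random inputs
B, T = 2, 16
enc, reg, gen = VisualEncoder(), AudioForwardingRegularizer(), Generator()
v = enc(torch.randn(B, T, 1024))
r = reg(torch.randn(B, T, 80))
spec = gen(v, r)
print(spec.shape)  # torch.Size([2, 80, 64])
```

At test time the regularizer branch would be dropped, so the generator must rely on the visual features alone; the sketch keeps both inputs to show the training-time wiring.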


Results

  • the generated sound can fool human listeners with a 68.12% success rate
