Open
Description
Paper
Link: https://arxiv.org/pdf/1712.01393.pdf
Year: 2018
Summary
- generates audio waveforms from video frames
Methods
- encoder-decoder architecture, video encoder and sound generator
- sound generator uses SampleRNN (S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. C. Courville, and Y. Bengio. SampleRNN: An unconditional end-to-end neural audio generation model. ICLR, 2017.): a 3-tier SampleRNN with a one-layer RNN for each of the two coarser tiers and an MLP for the finest tier.
- video encoder is RNN-based
- they also have a "flow-based" variant: optical flow between video frames is pre-computed and the flows are fed to temporal ConvNets.
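The methods above can be sketched as a toy pipeline: an RNN video encoder produces a conditioning vector, which drives a coarse-to-fine, SampleRNN-style autoregressive sound generator. This is a minimal numpy sketch with illustrative dimensions and randomly initialized weights, not the paper's implementation; in the real model the tiers run at different clock rates and are trained end-to-end, while here each tier simply conditions on a different amount of sample history.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative assumptions, not the paper's sizes)
N_FRAMES, FRAME_DIM, HID = 8, 16, 32
COARSE, MID = 8, 2          # context lengths for the two coarser tiers
N_SAMPLES = 64              # number of audio samples to generate

# --- Video encoder: a vanilla RNN over per-frame features ---
Wx = rng.normal(0, 0.1, (HID, FRAME_DIM))
Wh = rng.normal(0, 0.1, (HID, HID))

def encode_video(frames):
    """Run a plain RNN over frame features; return the final hidden state."""
    h = np.zeros(HID)
    for f in frames:
        h = np.tanh(Wx @ f + Wh @ h)
    return h

# --- 3-tier sound generator (SampleRNN-style hierarchy, toy version) ---
# Tier 1 (coarsest) looks at COARSE past samples plus the video code,
# tier 2 refines over MID past samples plus tier-1 context, and tier 3
# is a sample-level MLP emitting one new sample at a time.
W1 = rng.normal(0, 0.1, (HID, COARSE + HID))
W2 = rng.normal(0, 0.1, (HID, MID + HID))
W3 = rng.normal(0, 0.1, (1, 1 + HID))

def generate(video_code, n=N_SAMPLES):
    samples = list(np.zeros(COARSE))  # zero-padded warm-up context
    for _ in range(n):
        c1 = np.tanh(W1 @ np.concatenate([samples[-COARSE:], video_code]))
        c2 = np.tanh(W2 @ np.concatenate([samples[-MID:], c1]))
        s = np.tanh(W3 @ np.concatenate([[samples[-1]], c2]))[0]
        samples.append(s)
    return np.array(samples[COARSE:])

frames = rng.normal(size=(N_FRAMES, FRAME_DIM))
audio = generate(encode_video(frames))
print(audio.shape)  # (64,)
```

The flow-based variant would additionally concatenate pre-computed optical-flow features (passed through temporal ConvNets) into the conditioning vector before generation.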
Results
- over 70% of the sounds generated by their models can fool humans into thinking they are real