Open
Description
Hi Pritam, thank you very much for your amazing work. I have some questions about the dataset you used in this work. The pretrained dataset : K400, AudioSet and Kinetics-Sound, do you always use both audio and visual information, and do they always contain audio stream? Because I am trying k400, but I found some videos miss audio stream. In addition, the downstream dataset like UCF-101 and HMDB-51, do you use both audio and visual pairs , or just use visual information for evaluation? It seems that videos files in UCF-101 do not always contain the audio stream. Thank you very much.
Metadata
Assignees
Labels
No labels