
Inquiry Regarding Audio Spectrogram Transformer #128

Open

@Ingram-lin

I am a graduate student from China, and our team recently studied your paper on the Audio Spectrogram Transformer. We were truly impressed by the scope of the work, and it has sparked a great deal of interest in our team. We first replicated your results on the ESC-50 dataset, but when we went on to fine-tune the model on our own dataset we ran into several challenges, and we would greatly appreciate your guidance.

1. Our dataset consists of 2,400 samples, each a 4-second audio clip. We set the audio_length parameter to 400 and timem to 80, replaced the labels, and otherwise kept the recipe consistent with ESC-50. We downloaded the AudioSet-pretrained model and followed the same process as for replicating ESC-50. We are pleased with the final accuracy, which reaches 0.9; what surprises us is that the average precision (mAP) is only between 0.3 and 0.5. Why could this be?
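To rule out a bookkeeping mistake on our side, this is how we compute the two metrics ourselves, a minimal sketch using sklearn; `preds` and `targets` are placeholders for our model's class scores and one-hot labels:

```python
import numpy as np
from sklearn import metrics

def evaluate(preds: np.ndarray, targets: np.ndarray):
    """preds: (N, C) class scores; targets: (N, C) one-hot labels."""
    # Accuracy: fraction of samples whose top-scoring class is correct.
    acc = (preds.argmax(axis=1) == targets.argmax(axis=1)).mean()
    # mAP: average precision computed per class, then averaged across classes.
    # Unlike top-1 accuracy, this depends on the full ranking of scores, so
    # rare or poorly calibrated classes can drag it down even when accuracy is high.
    ap_per_class = metrics.average_precision_score(targets, preds, average=None)
    return acc, float(np.nanmean(ap_per_class))

# Hypothetical usage:
# acc, mAP = evaluate(preds, targets)
```

Is our understanding correct that a large accuracy/mAP gap like this usually points to a few classes with very low per-class AP?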

2. As we understand it, the model projects spectrogram patches to embeddings (please correct us if we have misunderstood). After fine-tuning, we would like to run new speech data through the model and extract those embeddings. Could you guide us on how to do this?
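One approach we are considering is to drop the classification head and take the pooled representation. A minimal sketch, assuming the ASTModel class from this repo (where mlp_head appears to be the final classification layer); the checkpoint path, class count, and input are placeholders:

```python
import torch
from src.models import ASTModel  # assumes the repo root is on PYTHONPATH

NUM_CLASSES = 50  # placeholder: the label_dim used during our fine-tuning

# Hyperparameters matching our setup (4 s clips -> 400 frames, 128 mel bins).
model = ASTModel(label_dim=NUM_CLASSES, input_tdim=400, imagenet_pretrain=True)

sd = torch.load('path/to/finetuned.pth', map_location='cpu')  # placeholder path
# Strip a possible 'module.' prefix in case the checkpoint was saved from a
# DataParallel-wrapped model.
sd = {k.replace('module.', '', 1): v for k, v in sd.items()}
model.load_state_dict(sd)
model.eval()

# Replace the classification head with an identity so forward() returns the
# pooled transformer representation instead of class logits.
model.mlp_head = torch.nn.Identity()

with torch.no_grad():
    fbank = torch.rand(1, 400, 128)  # (batch, time_frames, mel_bins) dummy input
    embedding = model(fbank)         # expected shape: (1, 768) for the base model
```

Would this give us the embedding you would recommend using, or is there a better layer to tap?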

3. If we want to fine-tune a pre-trained model on an English dataset and then evaluate the fine-tuned model on a Chinese dataset, can we simply set the training set to the English data and the validation set to the Chinese data during fine-tuning?
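As far as we can tell, the training and validation sets are just two JSON file lists, so we imagine the split could be expressed like this; a sketch where the paths and label ids are placeholders, and the JSON format follows the data-prep scripts of the ESC-50 recipe as we understand them:

```python
import json

# Hypothetical file lists; each entry follows the {"wav": ..., "labels": ...}
# format consumed by the repo's dataloader.
english_train = {"data": [
    {"wav": "/data/english/clip_0001.wav", "labels": "label_a"},
    # ... remaining English training clips
]}
chinese_val = {"data": [
    {"wav": "/data/chinese/clip_0001.wav", "labels": "label_a"},
    # ... remaining Chinese evaluation clips
]}

with open('train_english.json', 'w') as f:
    json.dump(english_train, f)
with open('val_chinese.json', 'w') as f:
    json.dump(chinese_val, f)

# The fine-tuning script would then be pointed at the two files, e.g.
# (flags as used in the ESC-50 recipe):
#   python run.py --data-train train_english.json --data-val val_chinese.json ...
```

Would you expect this to work, or is there a reason the validation set should come from the same distribution as the training set?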
