
Inquiry Regarding Audio Spectrogram Transformer #128

Open

@Ingram-lin

I am a graduate student from China, and our team recently studied your paper on the Audio Spectrogram Transformer. We were truly impressed by the scope of the work, and it has sparked a great deal of interest in our team. We first replicated your results on the ESC-50 dataset, but when we went on to fine-tune the model on our own dataset we ran into several challenges, and we would greatly appreciate your guidance.

1. Our dataset consists of 2,400 samples, each a 4-second audio clip. We set the audio_length parameter to 400 and timem to 80, replaced the labels, and otherwise kept the recipe consistent with ESC-50. We downloaded the AudioSet-pretrained model and followed the same process as for replicating ESC-50. We are pleased with the final accuracy, which reaches 0.9; what surprises us is that the average precision (mAP) is only between 0.3 and 0.5. Why could this be?
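To rule out a bookkeeping mistake on our side, this is how we compute the two metrics ourselves, a minimal sketch using sklearn; `preds` and `targets` are placeholders for our model's class scores and one-hot labels:

```python
import numpy as np
from sklearn import metrics

def evaluate(preds: np.ndarray, targets: np.ndarray):
    """preds: (N, C) class scores; targets: (N, C) one-hot labels."""
    # Accuracy: fraction of samples whose top-scoring class is correct.
    acc = (preds.argmax(axis=1) == targets.argmax(axis=1)).mean()
    # mAP: average precision computed per class, then averaged across classes.
    # Unlike top-1 accuracy, this depends on the full ranking of scores, so
    # rare or poorly calibrated classes can drag it down even when accuracy is high.
    ap_per_class = metrics.average_precision_score(targets, preds, average=None)
    return acc, float(np.nanmean(ap_per_class))

# Hypothetical usage:
# acc, mAP = evaluate(preds, targets)
```

Is our understanding correct that a large accuracy/mAP gap like this usually points to a few classes with very low per-class AP?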

2. As we understand it, the model projects spectrogram patches to embeddings (please correct us if we have misunderstood). After fine-tuning, we would like to run new speech data through the model and extract those embeddings. Could you guide us on how to do this?
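One approach we are considering is to drop the classification head and take the pooled representation. A minimal sketch, assuming the ASTModel class from this repo (where mlp_head appears to be the final classification layer); the checkpoint path, class count, and input are placeholders:

```python
import torch
from src.models import ASTModel  # assumes the repo root is on PYTHONPATH

NUM_CLASSES = 50  # placeholder: the label_dim used during our fine-tuning

# Hyperparameters matching our setup (4 s clips -> 400 frames, 128 mel bins).
model = ASTModel(label_dim=NUM_CLASSES, input_tdim=400, imagenet_pretrain=True)

sd = torch.load('path/to/finetuned.pth', map_location='cpu')  # placeholder path
# Strip a possible 'module.' prefix in case the checkpoint was saved from a
# DataParallel-wrapped model.
sd = {k.replace('module.', '', 1): v for k, v in sd.items()}
model.load_state_dict(sd)
model.eval()

# Replace the classification head with an identity so forward() returns the
# pooled transformer representation instead of class logits.
model.mlp_head = torch.nn.Identity()

with torch.no_grad():
    fbank = torch.rand(1, 400, 128)  # (batch, time_frames, mel_bins) dummy input
    embedding = model(fbank)         # expected shape: (1, 768) for the base model
```

Would this give us the embedding you would recommend using, or is there a better layer to tap?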

3. If we want to fine-tune a pre-trained model on an English dataset and then evaluate the fine-tuned model on a Chinese dataset, can we simply set the training set to the English data and the validation set to the Chinese data during fine-tuning?
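As far as we can tell, the training and validation sets are just two JSON file lists, so we imagine the split could be expressed like this; a sketch where the paths and label ids are placeholders, and the JSON format follows the data-prep scripts of the ESC-50 recipe as we understand them:

```python
import json

# Hypothetical file lists; each entry follows the {"wav": ..., "labels": ...}
# format consumed by the repo's dataloader.
english_train = {"data": [
    {"wav": "/data/english/clip_0001.wav", "labels": "label_a"},
    # ... remaining English training clips
]}
chinese_val = {"data": [
    {"wav": "/data/chinese/clip_0001.wav", "labels": "label_a"},
    # ... remaining Chinese evaluation clips
]}

with open('train_english.json', 'w') as f:
    json.dump(english_train, f)
with open('val_chinese.json', 'w') as f:
    json.dump(chinese_val, f)

# The fine-tuning script would then be pointed at the two files, e.g.
# (flags as used in the ESC-50 recipe):
#   python run.py --data-train train_english.json --data-val val_chinese.json ...
```

Would you expect this to work, or is there a reason the validation set should come from the same distribution as the training set?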
