- Dataset (10 classes in total)
- Animal Sound Dataset
- Adjusted so that every audio clip has the same length
- Animal Image Datasets
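One common way to give every audio clip the same length is to trim long clips and zero-pad short ones. The sketch below only illustrates that idea; the source does not say which method was actually used, and `pad_or_trim`, `target_len`, and `pad_value` are hypothetical names.

```python
def pad_or_trim(samples, target_len, pad_value=0.0):
    """Return a copy of `samples` that is exactly `target_len` items long.

    Longer inputs are cut at the end; shorter inputs are padded at the end
    with `pad_value` (silence for audio). Hypothetical helper, not taken
    from the original project.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    samples = list(samples)
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [pad_value] * (target_len - len(samples))


# A 3-sample clip padded to 5 samples, and a 5-sample clip trimmed to 3.
print(pad_or_trim([0.1, 0.2, 0.3], 5))  # [0.1, 0.2, 0.3, 0.0, 0.0]
print(pad_or_trim([1, 2, 3, 4, 5], 3))  # [1, 2, 3]
```

The same logic applies to array-based audio: the target length would be the clip duration in seconds times the sampling rate.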
- Increased the sampling rate and adjusted the audio and text weights
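To make the sampling-rate change concrete, here is a rough sketch of what resampling a clip to a higher rate involves. It uses plain linear interpolation with no anti-aliasing filter, so it is for illustration only. The source does not say how the rate was raised, and `resample_linear` is a hypothetical helper. In practice a library resampler would normally be used, for example the `sr` argument of `librosa.load`.

```python
def resample_linear(samples, src_rate, dst_rate):
    """Resample `samples` from `src_rate` to `dst_rate` by linear interpolation.

    Illustrative only: no filtering is applied, and positions past the last
    source sample reuse the last value.
    """
    if not samples:
        return []
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    last = len(samples) - 1
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate   # fractional index into the source
        i = min(int(pos), last)
        frac = pos - int(pos)
        left = samples[i]
        right = samples[min(i + 1, last)]
        out.append(left + (right - left) * frac)
    return out


# Doubling the rate doubles the number of samples for the same duration.
upsampled = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=2, dst_rate=4)
print(len(upsampled))
```

Upsampling an existing clip this way does not add information; loading the original recordings at a higher rate is what preserves more detail.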
- Added accuracy as a metric for the classification task
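```python
import os

# `paths_to_audio`, `confidence`, and `LABELS` are assumed to be defined
# by the earlier inference step.
token = []    # top-1 predicted label per clip, reused later as the image prompt
cnt = 0       # number of clips evaluated
truecnt = 0   # number of clips whose top-1 prediction is correct

for audio_idx in range(len(paths_to_audio)):
    # acquire Top-3 most similar results
    conf_values, ids = confidence[audio_idx].topk(3)

    # format output strings
    query = f'{os.path.basename(paths_to_audio[audio_idx]):>30s} ->\t\t'
    results = ', '.join([f'{LABELS[i]:>15s} ({v:06.2%})' for v, i in zip(conf_values, ids)])

    top_label = LABELS[ids[0]]
    token.append(top_label)
    print(query + results)

    # the true label is the part of the file name before the first underscore
    cnt += 1
    true_label = os.path.basename(paths_to_audio[audio_idx]).split('_')[0]
    if true_label == top_label:
        truecnt += 1

print("Accuracy:", truecnt / cnt)
```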
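The accuracy check relies on a file-naming convention: the true class is the part of the file name before the first underscore. The standalone sketch below isolates that logic so it can be tried without a model. The helper names `label_from_path` and `top1_accuracy` and the example file names are hypothetical.

```python
import os


def label_from_path(path):
    """Extract the class label from a file name such as 'cat_01.wav'."""
    return os.path.basename(path).split('_')[0]


def top1_accuracy(paths, predicted_labels):
    """Fraction of files whose predicted label matches the file-name label."""
    if not paths:
        return 0.0
    hits = sum(
        label_from_path(path) == predicted
        for path, predicted in zip(paths, predicted_labels)
    )
    return hits / len(paths)


# Made-up example: two of the three predictions match the file-name label.
paths = ["audio/cat_01.wav", "audio/dog_02.wav", "audio/cow_03.wav"]
predictions = ["cat", "dog", "cat"]
print(f"Accuracy: {top1_accuracy(paths, predictions):.2%}")  # Accuracy: 66.67%
```

Note that a label containing an underscore would be cut short by this convention, so class names in file names need to be underscore-free.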
- Use the Stable Diffusion model to generate images from the results of the previous step
```python
from diffusers import DiffusionPipeline
import torch

# load the SDXL base model in half precision and move it to the GPU
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
)
pipe.to("cuda")

# `token` holds the top-1 label predicted for each audio clip in the previous step
for idx, prompt_token in enumerate(token):
    prompt = f" {prompt_token}"
    img = pipe(prompt=prompt, num_inference_steps=50).images[0]
    img.save(f"generated_image_{idx + 1}_{prompt_token}.png")
    print(f"Generated image for '{prompt_token}' saved as 'generated_image_{idx + 1}_{prompt_token}.png'")
```

@misc{guzhov2021audioclip,
title={AudioCLIP: Extending CLIP to Image, Text and Audio},
author={Andrey Guzhov and Federico Raue and Jörn Hees and Andreas Dengel},
year={2021},
eprint={2106.13043},
archivePrefix={arXiv},
primaryClass={cs.SD}
}