basic tuneflow plugin

tuneflow · Feb 23, 2023 · 5deafdf
1 parent 78e5236
Showing 6 changed files with 176 additions and 146 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,4 @@
+*:Zone.Identifier
+output
+*.pyc
+ckpt
diff --git a/README.md b/README.md
@@ -1,145 +1,25 @@
-# Text-to-Audio Generation
+# AudioLDM TuneFlow Plugin
 
-[![arXiv](https://img.shields.io/badge/arXiv-2109.13731-brightgreen.svg?style=flat-square)](https://arxiv.org/abs/2301.12503)  [![githubio](https://img.shields.io/badge/GitHub.io-Audio_Samples-blue?logo=Github&style=flat-square)](https://audioldm.github.io/)  [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/olaviinha/NeuralTextToAudio/blob/main/AudioLDM_pub.ipynb?force_theme=dark)  [![Replicate](https://replicate.com/jagilley/audio-ldm/badge)](https://replicate.com/jagilley/audio-ldm)
+Fork of https://github.com/haoheliu/AudioLDM as a TuneFlow Plugin
 
-<!-- # [![PyPI version](https://badge.fury.io/py/voicefixer.svg)](https://badge.fury.io/py/voicefixer) -->
+## Usage
 
-Generate speech, sound effects, music and beyond.
+> **Note:** It is highly recommended to create a venv to run this plugin.
 
-<hr>
+Steps to run the plugin:
 
-## Important tricks to make your generated audio sound better
-1. Try to provide more hints to AudioLDM, such as using more adjectives to describe your sound (e.g., clearly, high quality) or make your target more specific (e.g., "water stream in a forest" instead of "stream"). This can make sure AudioLDM understand what you want. 
-2. Try to use different random seeds, which can affect the generation quality significantly sometimes.
-3. It's best to use general terms like 'man' or 'woman' instead of specific names for individuals or abstract objects that humans may not be familiar with.
+1. Install dependencies using:
 
-# Change Log
-
-**2023-02-15**: Add audio style transfer. Add more options on generation.
-
-## Web APP
-1. Prepare running environment
-```shell
-conda create -n audioldm python=3.8; conda activate audioldm
-pip3 install audioldm
-git clone https://github.com/haoheliu/AudioLDM; cd AudioLDM
-```
-2. Start the web application (powered by Gradio)
-```shell
-python3 app.py
-```
-3. A link will be printed out. Click the link to open the browser and play.
-
-## Commandline Usage
-1. Prepare running environment
-```shell
-# Optional
-conda create -n audioldm python=3.8; conda activate audioldm
-# Install AudioLDM
-pip3 install audioldm
-```
-
-2. text-to-audio generation
-```python
-# Test run
-audioldm -t "A hammer is hitting a wooden surface" # The default --mode is "generation"
+```bash
+pip install -r requirements.txt
 ```
 
-3. audio-to-audio style transfer
-```python
-# Test run
-# --file_path is the original audio file for transfer
-# -t is the text AudioLDM uses for transfer. 
-# Please make sure that --file_path exist
-audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing" 
+2. Download the model from https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation/blob/main/ckpt/ldm_trimmed.ckpt, and place it under the `ckpt` folder.
 
-# Tune the value of --transfer_strength is important!
-# --transfer_strength: A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text
-audioldm --mode "transfer" --file_path trumpet.wav -t "Children Singing" --transfer_strength 0.25
-```
+3. Run the plugin:
 
-For more options on guidance scale, batchsize, seed, ddim steps, etc., please run
-```shell
-audioldm -h
+```bash
+python debug.py
 ```
-```console
-usage: audioldm [-h] [--mode {generation,transfer}] [-t TEXT] [-f FILE_PATH] [--transfer_strength TRANSFER_STRENGTH] [-s SAVE_PATH] [-ckpt CKPT_PATH] [-b BATCHSIZE] [--ddim_steps DDIM_STEPS] [-gs GUIDANCE_SCALE]
-                [-dur DURATION] [-n N_CANDIDATE_GEN_PER_TEXT] [--seed SEED]
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --mode {generation,transfer}
-                        generation: text-to-audio generation; transfer: style transfer. DEFAULT "generation"
-  -t TEXT, --text TEXT  Text prompt to the model for audio generation
-  -f FILE_PATH, --file_path FILE_PATH
-                        Original audio file for style transfer
-  --transfer_strength TRANSFER_STRENGTH
-                        A value between 0 and 1. 0 means original audio without transfer, 1 means completely transfer to the audio indicated by text. DEFAULT 0.5
-  -s SAVE_PATH, --save_path SAVE_PATH
-                        The path to save model output. DEFAULT "./output"
-  -ckpt CKPT_PATH, --ckpt_path CKPT_PATH
-                        The path to the pretrained .ckpt model. DEFAULT "~/.cache/audioldm/audioldm-s-full.ckpt"
-  -b BATCHSIZE, --batchsize BATCHSIZE
-                        Generate how many samples at the same time. DEFAULT 1
-  --ddim_steps DDIM_STEPS
-                        The sampling step for DDIM. DEFAULT 200
-  -gs GUIDANCE_SCALE, --guidance_scale GUIDANCE_SCALE
-                        Guidance scale (Large => better relavancy to text; Small => better diversity). DEFAULT 2.5
-  -dur DURATION, --duration DURATION
-                        The duration of the samples. DEFAULT 10
-  -n N_CANDIDATE_GEN_PER_TEXT, --n_candidate_gen_per_text N_CANDIDATE_GEN_PER_TEXT
-                        Automatic quality control. This number control the number of candidates (e.g., generate three audios and choose the best to show you). A Larger value usually lead to better quality with heavier
-                        computation. DEFAULT 3
-  --seed SEED           Change this value (any integer number) will lead to a different generation result. DEFAULT 42
-```
-
-
-
-For the evaluation of audio generative model, please refer to [audioldm_eval](https://github.com/haoheliu/audioldm_eval).
-
-# Web Demo
-
-Integrated into [Hugging Face Spaces 🤗](https://huggingface.co/spaces) using [Gradio](https://github.com/gradio-app/gradio). Try out the Web Demo [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/haoheliu/audioldm-text-to-audio-generation)
-
-
-# TODO
-
-- [ ] Update the checkpoint with more training steps.
-- [ ] Add AudioCaps finetuned AudioLDM-S model
-- [x] Build pip installable package for commandline use
-- [x] Build Gradio web application
-- [x] Add text-guided style transfer
-- [ ] Add audio super-resolution
-- [ ] Add audio inpainting
-
-## Cite this work
-
-If you found this tool useful, please consider citing
-```bibtex
-@article{liu2023audioldm,
-  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
-  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
-  journal={arXiv preprint arXiv:2301.12503},
-  year={2023}
-}
-```
-
-# Hardware requirement
-- GPU with 8GB of dedicated VRAM
-- A system with a 64-bit operating system (Windows 7, 8.1 or 10, Ubuntu 16.04 or later, or macOS 10.13 or later) 16GB or more of system RAM
-
-## Reference
-Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contribution. 
-
-> https://github.com/LAION-AI/CLAP
-
-> https://github.com/CompVis/stable-diffusion
-
-> https://github.com/v-iashin/SpecVQGAN 
-
-> https://github.com/toshas/torch-fidelity
-
-
-We build the model with data from AudioSet, Freesound and BBC Sound Effect library. We share this demo based on the UK copyright exception of data for academic research. 
 
-<!-- This code repo is strictly for research demo purpose only. For commercial use please contact us. -->
+4. Start TuneFlow Desktop and run the "Plugin Development" plugin.
diff --git a/audioldm/__init__.py b/audioldm/__init__.py
@@ -15,14 +15,14 @@
     },
 }
 
-if not os.path.exists(meta["audioldm"]["path"]):
-    os.makedirs(os.path.dirname(meta["audioldm"]["path"]), exist_ok=True)
-    print("Downloading the main structure of audioldm")
+# if not os.path.exists(meta["audioldm"]["path"]):
+#     os.makedirs(os.path.dirname(meta["audioldm"]["path"]), exist_ok=True)
+#     print("Downloading the main structure of audioldm")
 
-    urllib.request.urlretrieve(meta["audioldm"]["url"], meta["audioldm"]["path"])
-    print(
-        "Weights downloaded in: {} Size: {}".format(
-            meta["audioldm"]["path"],
-            os.path.getsize(meta["audioldm"]["path"]),
-        )
-    )
+#     urllib.request.urlretrieve(meta["audioldm"]["url"], meta["audioldm"]["path"])
+#     print(
+#         "Weights downloaded in: {} Size: {}".format(
+#             meta["audioldm"]["path"],
+#             os.path.getsize(meta["audioldm"]["path"]),
+#         )
+#     )
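
With the automatic download commented out, nothing in this commit fetches the weights; `plugin.py` expects them at `ckpt/ldm_trimmed.ckpt` next to the plugin file. A minimal pre-flight check could look like the sketch below. The helper name `require_checkpoint` is hypothetical and not part of this commit; only the relative path is taken from `plugin.py`.

```python
from pathlib import Path


def require_checkpoint(plugin_dir: Path, relative_path: str = "ckpt/ldm_trimmed.ckpt") -> str:
    """Return the absolute checkpoint path, or raise a readable error if it is missing.

    Hypothetical helper: the default relative path mirrors the one used in plugin.py.
    """
    ckpt_path = plugin_dir.joinpath(relative_path).absolute()
    if not ckpt_path.is_file():
        raise FileNotFoundError(
            f"Checkpoint not found at {ckpt_path}. "
            "Download ldm_trimmed.ckpt (see README step 2) and place it under the ckpt folder."
        )
    return str(ckpt_path)
```

Called from `init` as `build_model(ckpt_path=require_checkpoint(Path(__file__).parent))`, this would fail early with a message pointing at the README step, rather than with whatever error the model loader raises for a missing file.
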
diff --git a/debug.py b/debug.py
@@ -0,0 +1,5 @@
+from plugin import AudioLDMPlugin
+from tuneflow_devkit import Debugger
+
+if __name__ == "__main__":
+    Debugger(plugin_class=AudioLDMPlugin).start()
diff --git a/plugin.py b/plugin.py
@@ -0,0 +1,141 @@
+from tuneflow_py import TuneflowPlugin, ParamDescriptor, Song, ReadAPIs, TrackType, WidgetType, LabelText
+from typing import Dict, Any
+from audioldm import text_to_audio, build_model
+from pathlib import Path
+import traceback
+import random
+from io import BytesIO
+import soundfile as sf
+from typing import List
+
+
+class AudioLDMPlugin(TuneflowPlugin):
+    @staticmethod
+    def provider_id() -> str:
+        return 'andantei'
+
+    @staticmethod
+    def plugin_id() -> str:
+        return 'audioldm-generate'
+
+    @staticmethod
+    def provider_display_name() -> LabelText:
+        return {
+            "zh": "Andantei行板",
+            "en": "Andantei"
+        }
+
+    @staticmethod
+    def plugin_display_name() -> LabelText:
+        return {
+            "zh": "[AI] 文字生成音频 (AudioLDM)",
+            "en": "[AI] Text-to-Audio (AudioLDM)"
+        }
+
+    def params(self) -> Dict[str, ParamDescriptor]:
+        # TODO: Limit prompt length
+        return {
+            "prompt": {
+                "displayName": {
+                    "en": "Prompt",
+                    "zh": "提示词"
+                },
+                "description": {
+                    "en": "A short sentence to describe the audio you want to generate",
+                    "zh": "用一段简短的文字描述你想要的音频"
+                },
+                "defaultValue": None,
+                "widget": {
+                    "type": WidgetType.TextArea.value,
+                    "config": {
+                        "placeholder": {
+                            "zh": "样例：斧头正在伐木",
+                            "en": "e.g. A hammer is hitting a tree"
+                        },
+                        "maxLength": 140
+                    }
+                }
+            },
+            "guidance_scale": {
+                "displayName": {
+                    "en": "Guidance Scale",
+                    "zh": "提示强度"
+                },
+                "description": {
+                    "en": "A larger value yields results more relevant to the prompt; a smaller value yields more diversity",
+                    "zh": "值越大，生成结果越贴近提示词，值越小，生成结果越发散"
+                },
+                "defaultValue": 2.5,
+                "widget": {
+                    "type": WidgetType.InputNumber.value,
+                    "config": {
+                        "minValue": 0.1,
+                        "maxValue": 5,
+                        "step": 0.1
+                    }
+                }
+            },
+            "duration": {
+                "displayName": {
+                    "en": "Duration (seconds)",
+                    "zh": "长度 (秒)"
+                },
+                "defaultValue": 10,
+                "widget": {
+                    "type": WidgetType.InputNumber.value,
+                    "config": {
+                        "minValue": 2.5,
+                        "maxValue": 100,
+                        "step": 2.5
+                    }
+                }
+            }
+        }
+
+    def init(self, song: Song, read_apis: ReadAPIs):
+        model_path = str(Path(__file__).parent.joinpath('ckpt/ldm_trimmed.ckpt').absolute())
+        self.model = build_model(ckpt_path=model_path)
+
+    def run(self, song: Song, params: Dict[str, Any], read_apis: ReadAPIs):
+        # TODO: Support prompt i18n
+        file_bytes_list = self._text2audio(
+            text=params["prompt"],
+            duration=params["duration"],
+            guidance_scale=params["guidance_scale"],
+            # Randomize seed.
+            random_seed=random.randint(0, 999999))
+        for file_bytes in file_bytes_list:
+            try:
+                file_bytes.seek(0)
+                track = song.create_track(type=TrackType.AUDIO_TRACK)
+                track.create_audio_clip(clip_start_tick=0, audio_clip_data={
+                    "audio_data": {
+                        "format": "wav",
+                        "data": file_bytes.read()
+                    },
+                    "duration": params["duration"],
+                    "start_tick": 0
+                })
+            except Exception:
+                print(traceback.format_exc())
+
+    def _text2audio(self, text, duration, guidance_scale, random_seed):
+        # print(text, length, guidance_scale)
+        waveform = text_to_audio(
+            self.model,
+            text=text,
+            seed=random_seed,
+            duration=duration,
+            guidance_scale=guidance_scale,
+            n_candidate_gen_per_text=3,
+            batchsize=1,
+        )
+        return self._save_wave(waveform)
+
+    def _save_wave(self, waveform):
+        saved_file_bytes: List[BytesIO] = []
+        for i in range(waveform.shape[0]):
+            file_bytes = BytesIO()
+            sf.write(file_bytes, waveform[i, 0], samplerate=16000, format="wav")
+            saved_file_bytes.append(file_bytes)
+        return saved_file_bytes
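
`_save_wave` writes each generated waveform into a `BytesIO`, and `run` rewinds it with `seek(0)` before reading the bytes into the audio clip. The same buffer handling can be sketched with only the standard library. Note the assumptions: this swaps `soundfile` for the `wave` module, the helper `encode_wav_bytes` is hypothetical, and the sine-wave input is made up for illustration, so it shows the in-memory pattern rather than the plugin's actual encoder.

```python
import math
import struct
import wave
from io import BytesIO
from typing import List


def encode_wav_bytes(samples: List[float], samplerate: int = 16000) -> BytesIO:
    """Encode mono float samples in [-1, 1] as a 16-bit PCM WAV held in memory."""
    buffer = BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(samplerate)
        # Clip to [-1, 1], then scale to the signed 16-bit range.
        pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
    # After writing, the cursor sits at the end; rewind so read() returns the whole file.
    buffer.seek(0)
    return buffer


# Illustrative input: 0.1 s of a 440 Hz sine at the 16 kHz rate used in plugin.py.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
wav_bytes = encode_wav_bytes(tone).read()
```

Without the rewind, `read()` on a freshly written buffer returns an empty byte string, which is why `run` calls `file_bytes.seek(0)` before creating the clip.
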
diff --git a/requirements.txt b/requirements.txt
@@ -7,10 +7,10 @@ pyyaml
 einops
 numpy<=1.23.5
 soundfile
-librosa
+librosa==0.9.2
 scipy
 pandas
-torchlibrosa
+torchlibrosa==0.0.9
 transformers
 ftfy
-tuneflow-py==0.0.8
+tuneflow-py==0.1.0