Add API usage (#345)

* Fix api
* Add i18n statement
* Add API usage
* Fix indent
* Fix
* Fix address
* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci
* generate.py supports cpu

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
fishaudio · Jul 6, 2024 · f2c7eed
1 parent 3c526c6
commit f2c7eed

Showing 10 changed files with 241 additions and 49 deletions.
diff --git a/docs/en/index.md b/docs/en/index.md
@@ -18,7 +18,7 @@ We assume no responsibility for any illegal use of the codebase. Please refer to
 This codebase is released under the `BSD-3-Clause` license, and all models are released under the CC-BY-NC-SA-4.0 license.
 
 <p align="center">
-<img src="/docs/assets/figs/diagram.png" width="75%">
+   <img src="/docs/assets/figs/diagram.png" width="75%">
 </p>
 
 ## Requirements

diff --git a/docs/en/inference.md b/docs/en/inference.md
@@ -13,7 +13,7 @@ Inference support command line, HTTP API and web UI.
 ## Command Line Inference
 
 Download the required `vqgan` and `llama` models from our Hugging Face repository.
-    
+
 ```bash
 huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
@@ -28,9 +28,11 @@ python tools/vqgan/inference.py \
     -i "paimon.wav" \
     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
 ```
+
 You should get a `fake.npy` file.
 
 ### 2. Generate semantic tokens from text:
+
 ```bash
 python tools/llama/generate.py \
     --text "The text you want to convert" \
@@ -53,6 +55,7 @@ This command will create a `codes_N` file in the working directory, where N is a
 ### 3. Generate vocals from semantic tokens:
 
 #### VQGAN Decoder (not recommended)
+
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
@@ -69,10 +72,69 @@ python -m tools.api \
     --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
+```
 
 If you want to speed up inference, you can add the --compile parameter.
 
-After that, you can view and test the API at http://127.0.0.1:8000/.  
+After that, you can view and test the API at http://127.0.0.1:8000/.
+
+Below is an example of sending a request using `tools/post_api.py`.
+
+```bash
+python -m tools.post_api \
+    --text "Text to be input" \
+    --reference_audio "Path to reference audio" \
+    --reference_text "Text content of the reference audio" \
+    --streaming True
+```
+
+The above command indicates synthesizing the desired audio according to the reference audio information and returning it in a streaming manner.
+
+If you need to randomly select reference audio based on `{SPEAKER}` and `{EMOTION}`, configure it according to the following steps:
+
+### 1. Create a `ref_data` folder in the root directory of the project.
+
+### 2. Create a directory structure similar to the following within the `ref_data` folder.
+
+```
+.
+├── SPEAKER1
+│    ├──EMOTION1
+│    │    ├── 21.15-26.44.lab
+│    │    ├── 21.15-26.44.wav
+│    │    ├── 27.51-29.98.lab
+│    │    ├── 27.51-29.98.wav
+│    │    ├── 30.1-32.71.lab
+│    │    └── 30.1-32.71.flac
+│    └──EMOTION2
+│         ├── 30.1-32.71.lab
+│         └── 30.1-32.71.mp3
+└── SPEAKER2
+    └─── EMOTION3
+          ├── 30.1-32.71.lab
+          └── 30.1-32.71.mp3
+```
+
+That is, first place `{SPEAKER}` folders in `ref_data`, then place `{EMOTION}` folders under each speaker, and place any number of `audio-text pairs` under each emotion folder.
+
+### 3. Enter the following command in the virtual environment
+
+```bash
+python tools/gen_ref.py
+
+```
+
+### 4. Call the API.
+
+```bash
+python -m tools.post_api \
+    --text "Text to be input" \
+    --speaker "${SPEAKER1}" \
+    --emotion "${EMOTION1}" \
+    --streaming True
+```
+
+The above example is for testing purposes only.
 
 ## WebUI Inference
 

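The `ref_data` layout added to `docs/en/inference.md` above (speaker folders, then emotion folders, then audio files paired with `.lab` transcripts of the same stem) can be checked with a short script before calling the API. The following is a minimal sketch under stated assumptions: it is not `tools/gen_ref.py`, whose behavior this diff does not show, and the names `collect_pairs` and `AUDIO_SUFFIXES` are hypothetical. It assumes only the three audio extensions that appear in the example tree.

```python
from pathlib import Path

# Assumption: only the audio formats shown in the example tree are considered.
AUDIO_SUFFIXES = {".wav", ".flac", ".mp3"}


def collect_pairs(root):
    """Map (speaker, emotion) to a list of (audio_path, lab_path) pairs.

    An audio file counts as a pair only when a `.lab` file with the same
    stem sits next to it; audio without a transcript is skipped.
    """
    pairs = {}
    # Exactly three levels deep: SPEAKER / EMOTION / file
    for audio in sorted(Path(root).glob("*/*/*")):
        if audio.suffix.lower() not in AUDIO_SUFFIXES:
            continue
        lab = audio.with_suffix(".lab")
        if not lab.is_file():
            continue
        speaker = audio.parent.parent.name
        emotion = audio.parent.name
        pairs.setdefault((speaker, emotion), []).append((audio, lab))
    return pairs
```

Calling `collect_pairs("ref_data")` from the project root and printing the number of pairs per `(speaker, emotion)` key is one way to confirm that every folder intended for `--speaker` and `--emotion` contains at least one usable audio-text pair.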
diff --git a/docs/ja/index.md b/docs/ja/index.md
@@ -13,24 +13,24 @@
 </div>
 
 !!! warning
-私たちは、コードベースの違法な使用について一切の責任を負いません。お住まいの地域のDMCA（デジタルミレニアム著作権法）およびその他の関連法については、現地の法律を参照してください。
+私たちは、コードベースの違法な使用について一切の責任を負いません。お住まいの地域の DMCA（デジタルミレニアム著作権法）およびその他の関連法については、現地の法律を参照してください。
 
 このコードベースは `BSD-3-Clause` ライセンスの下でリリースされており、すべてのモデルは CC-BY-NC-SA-4.0 ライセンスの下でリリースされています。
 
 <p align="center">
-<img src="/docs/assets/figs/diagram.png" width="75%">
+   <img src="/docs/assets/figs/diagram.png" width="75%">
 </p>
 
 ## 要件
 
-- GPUメモリ: 4GB（推論用）、16GB（微調整用）
+- GPU メモリ: 4GB（推論用）、16GB（微調整用）
 - システム: Linux、Windows
 
-## Windowsセットアップ
+## Windows セットアップ
 
-Windowsのプロユーザーは、コードベースを実行するためにWSL2またはDockerを検討することができます。
+Windows のプロユーザーは、コードベースを実行するために WSL2 または Docker を検討することができます。
 
-非プロのWindowsユーザーは、Linux環境なしでコードベースを実行するために以下の方法を検討することができます（モデルコンパイル機能付き、つまり `torch.compile`）：
+非プロの Windows ユーザーは、Linux 環境なしでコードベースを実行するために以下の方法を検討することができます（モデルコンパイル機能付き、つまり `torch.compile`）：
 
 <ol>
    <li>プロジェクトパッケージを解凍します。</li>
@@ -88,7 +88,7 @@ Windowsのプロユーザーは、コードベースを実行するためにWSL2
    <li>（オプション）<code>run_cmd.bat</code>をダブルクリックして、このプロジェクトのconda/pythonコマンドライン環境に入ります。</li>
 </ol>
 
-## Linuxセットアップ
+## Linux セットアップ
 
 ```bash
 # python 3.10仮想環境を作成します。virtualenvも使用できます。
@@ -107,15 +107,15 @@ apt install libsox-dev
 
 ## 変更履歴
 
-- 2024/07/02: Fish-Speechを1.2バージョンに更新し、VITSデコーダーを削除し、ゼロショット能力を大幅に強化しました。
-- 2024/05/10: Fish-Speechを1.1バージョンに更新し、VITSデコーダーを実装してWERを減少させ、音色の類似性を向上させました。
-- 2024/04/22: Fish-Speech 1.0バージョンを完成させ、VQGANおよびLLAMAモデルを大幅に修正しました。
+- 2024/07/02: Fish-Speech を 1.2 バージョンに更新し、VITS デコーダーを削除し、ゼロショット能力を大幅に強化しました。
+- 2024/05/10: Fish-Speech を 1.1 バージョンに更新し、VITS デコーダーを実装して WER を減少させ、音色の類似性を向上させました。
+- 2024/04/22: Fish-Speech 1.0 バージョンを完成させ、VQGAN および LLAMA モデルを大幅に修正しました。
 - 2023/12/28: `lora`微調整サポートを追加しました。
 - 2023/12/27: `gradient checkpointing`、`causual sampling`、および`flash-attn`サポートを追加しました。
-- 2023/12/19: webuiおよびHTTP APIを更新しました。
+- 2023/12/19: webui および HTTP API を更新しました。
 - 2023/12/18: 微調整ドキュメントおよび関連例を更新しました。
 - 2023/12/17: `text2semantic`モデルを更新し、音素フリーモードをサポートしました。
-- 2023/12/13: ベータ版をリリースし、VQGANモデルおよびLLAMAに基づく言語モデル（音素のみサポート）を含みます。
+- 2023/12/13: ベータ版をリリースし、VQGAN モデルおよび LLAMA に基づく言語モデル（音素のみサポート）を含みます。
 
 ## 謝辞
 

diff --git a/docs/ja/inference.md b/docs/ja/inference.md
@@ -1,6 +1,6 @@
 # 推論
 
-推論は、コマンドライン、HTTP API、およびWeb UIをサポートしています。
+推論は、コマンドライン、HTTP API、および Web UI をサポートしています。
 
 !!! note
     全体として、推論は次のいくつかの部分で構成されています：
@@ -12,8 +12,8 @@
 
 ## コマンドライン推論
 
-必要な`vqgan`および`llama`モデルをHugging Faceリポジトリからダウンロードします。
-    
+必要な`vqgan`および`llama`モデルを Hugging Face リポジトリからダウンロードします。
+
 ```bash
 huggingface-cli download fishaudio/fish-speech-1.2 --local-dir checkpoints/fish-speech-1.2
 ```
@@ -28,9 +28,11 @@ python tools/vqgan/inference.py \
     -i "paimon.wav" \
     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
 ```
+
 `fake.npy`ファイルが生成されるはずです。
 
 ### 2. テキストからセマンティックトークンを生成する：
+
 ```bash
 python tools/llama/generate.py \
     --text "変換したいテキスト" \
@@ -41,42 +43,104 @@ python tools/llama/generate.py \
     --compile
 ```
 
-このコマンドは、作業ディレクトリに`codes_N`ファイルを作成します。ここで、Nは0から始まる整数です。
+このコマンドは、作業ディレクトリに`codes_N`ファイルを作成します。ここで、N は 0 から始まる整数です。
 
 !!! note
-    `--compile`を使用してCUDAカーネルを融合し、より高速な推論を実現することができます（約30トークン/秒 -> 約500トークン/秒）。
+    `--compile`を使用して CUDA カーネルを融合し、より高速な推論を実現することができます（約 30 トークン/秒 -> 約 500 トークン/秒）。
     それに対応して、加速を使用しない場合は、`--compile`パラメータをコメントアウトできます。
 
 !!! info
-    bf16をサポートしていないGPUの場合、`--half`パラメータを使用する必要があるかもしれません。
+    bf16 をサポートしていない GPU の場合、`--half`パラメータを使用する必要があるかもしれません。
 
 ### 3. セマンティックトークンから音声を生成する：
 
-#### VQGANデコーダー（推奨されません）
+#### VQGAN デコーダー（推奨されません）
+
 ```bash
 python tools/vqgan/inference.py \
     -i "codes_0.npy" \
     --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
 ```
 
-## HTTP API推論
+## HTTP API 推論
 
-推論のためのHTTP APIを提供しています。次のコマンドを使用してサーバーを起動できます：
+推論のための HTTP API を提供しています。次のコマンドを使用してサーバーを起動できます：
 
 ```bash
 python -m tools.api \
     --listen 0.0.0.0:8000 \
     --llama-checkpoint-path "checkpoints/fish-speech-1.2" \
     --decoder-checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth" \
     --decoder-config-name firefly_gan_vq
+```
+
+推論を高速化したい場合は、--compile パラメータを追加できます。
+
+その後、`http://127.0.0.1:8000/`で API を表示およびテストできます。
+
+以下は、`tools/post_api.py` を使用してリクエストを送信する例です。
+
+```bash
+python tools/vqgan/inference.py \
+    -i "paimon.wav" \
+    --checkpoint-path "checkpoints/fish-speech-1.2/firefly-gan-vq-fsq-4x1024-42hz-generator.pth"
+```
+
+上記のコマンドは、参照音声の情報に基づいて必要な音声を合成し、ストリーミング方式で返すことを示しています。
+
+`{SPEAKER}`と`{EMOTION}`に基づいて参照音声をランダムに選択する必要がある場合は、以下の手順に従って設定します：
+
+### 1. プロジェクトのルートディレクトリに`ref_data`フォルダを作成します。
 
-推論を高速化したい場合は、--compileパラメータを追加できます。
+### 2. `ref_data`フォルダ内に次のような構造のディレクトリを作成します。
+
+```
+.
+├── SPEAKER1
+│    ├──EMOTION1
+│    │    ├── 21.15-26.44.lab
+│    │    ├── 21.15-26.44.wav
+│    │    ├── 27.51-29.98.lab
+│    │    ├── 27.51-29.98.wav
+│    │    ├── 30.1-32.71.lab
+│    │    └── 30.1-32.71.flac
+│    └──EMOTION2
+│         ├── 30.1-32.71.lab
+│         └── 30.1-32.71.mp3
+└── SPEAKER2
+    └─── EMOTION3
+          ├── 30.1-32.71.lab
+          └── 30.1-32.71.mp3
+
+```
+
+つまり、まず`ref_data`に`{SPEAKER}`フォルダを配置し、各スピーカーの下に`{EMOTION}`フォルダを配置し、各感情フォルダの下に任意の数の音声-テキストペアを配置します
+
+### 3. 仮想環境で以下のコマンドを入力します.
+
+```bash
+python tools/gen_ref.py
+
+```
+
+参照ディレクトリを生成します。
+
+### 4. API を呼び出します。
+
+```bash
+python -m tools.post_api \
+    --text "入力するテキスト" \
+    --speaker "${SPEAKER1}" \
+    --emotion "${EMOTION1}" \
+    --streaming True
+
+```
 
-その後、http://127.0.0.1:8000/でAPIを表示およびテストできます。
+上記の例はテスト目的のみです。
 
-## WebUI推論
+## WebUI 推論
 
-次のコマンドを使用してWebUIを起動できます：
+次のコマンドを使用して WebUI を起動できます：
 
 ```bash
 python -m tools.webui \
@@ -86,6 +150,6 @@ python -m tools.webui \
 ```
 
 !!! note
-    Gradio環境変数（`GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME`など）を使用してWebUIを構成できます。
+    Gradio 環境変数（`GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME`など）を使用して WebUI を構成できます。
 
 お楽しみください！
diff --git a/docs/zh/index.md b/docs/zh/index.md
@@ -18,7 +18,7 @@
 此代码库根据 `BSD-3-Clause` 许可证发布, 所有模型根据 CC-BY-NC-SA-4.0 许可证发布.
 
 <p align="center">
-<img src="/docs/assets/figs/diagram.png" width="75%">
+   <img src="/docs/assets/figs/diagram.png" width="75%">
 </p>
 
 ## 要求
@@ -107,7 +107,7 @@ apt install libsox-dev
 
 ## 更新日志
 
-- 2024/07/02: 更新了 Fish-Speech 到 1.2 版本，移除 VITS Decoder，同时极大幅度提升zero-shot能力.
+- 2024/07/02: 更新了 Fish-Speech 到 1.2 版本，移除 VITS Decoder，同时极大幅度提升 zero-shot 能力.
 - 2024/05/10: 更新了 Fish-Speech 到 1.1 版本，引入了 VITS Decoder 来降低口胡和提高音色相似度.
 - 2024/04/22: 完成了 Fish-Speech 1.0 版本, 大幅修改了 VQGAN 和 LLAMA 模型.
 - 2023/12/28: 添加了 `lora` 微调支持.