feat: swap sample speaker reference (metavoiceio#21)
sidroopdaska and sid authored Feb 8, 2024
Co-authored-by: sid <sid@themetavoice.xyz>
1 parent 11969e5 commit 11428f9
Showing 4 changed files with 12 additions and 6 deletions.
README.md (4 changes: 2 additions & 2 deletions)
@@ -2,9 +2,9 @@

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:
* **Emotional speech rhythm and tone** in English. No hallucinations.
+* **Zero-shot cloning for American & British voices**, with 30s reference audio.
* Support for (cross-lingual) **voice cloning with finetuning**.
  * We have had success with as little as 1 minute training data for Indian speakers.
-* **Zero-shot cloning for American & British voices**, with 30s reference audio.
* Support for **long-form synthesis**.

We’re releasing MetaVoice-1B under the Apache 2.0 license, *it can be used without restrictions*.
@@ -28,7 +28,7 @@ pip install -e .
## Usage
1. Download it and use it anywhere (including locally) with our [reference implementation](/fam/llm/sample.py),
```bash
-python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/ava.flac"
+python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path="assets/bria.mp3"
```

2. Deploy it on any cloud (AWS/GCP/Azure), using our [inference server](/fam/llm/serving.py)
Binary file removed assets/ava.flac
Binary file added assets/bria.mp3
fam/llm/sample.py (14 changes: 10 additions & 4 deletions)
@@ -10,6 +10,7 @@
from dataclasses import dataclass
from typing import List, Literal, Optional, Type

+import librosa
import torch
import tqdm
import tqdm.contrib.concurrent
@@ -401,6 +402,7 @@ def get_cached_file(file_or_uri: str):
    """
    is_uri = file_or_uri.startswith("http")

+    cache_path = None
    if is_uri:
        ext = pathlib.Path(file_or_uri).suffix
        # hash the file path to get the cache name
@@ -412,14 +414,18 @@ def get_cached_file(file_or_uri: str):
        if not os.path.exists(cache_path):
            command = f"curl -o {cache_path} {file_or_uri}"
            subprocess.run(command, shell=True, check=True)
-
-        return cache_path
    else:
        if os.path.exists(file_or_uri):
-            return file_or_uri
+            cache_path = file_or_uri
        else:
            raise FileNotFoundError(f"File {file_or_uri} not found!")
+
+    # check audio file is at min. 30s in length
+    audio, sr = librosa.load(cache_path)
+    assert librosa.get_duration(y=audio, sr=sr) >= 30, "Speaker reference audio file needs to be >= 30s in duration."
+
+    return cache_path


def get_cached_embedding(local_file_path: str, spkemb_model):
    if not os.path.exists(local_file_path):
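This hunk carries the substantive code change: once the reference file has been resolved (downloaded via curl for a URI, or taken as-is for a local path), get_cached_file now decodes it with librosa and asserts that it is at least 30 seconds long, matching the "30s reference audio" requirement for zero-shot cloning in the README. Below is a minimal standalone sketch of that validation; the helper name check_speaker_reference is illustrative and not part of the repo.

```python
import librosa


def check_speaker_reference(path: str, min_seconds: float = 30.0) -> float:
    """Decode a speaker-reference clip and verify it is long enough.

    Mirrors the assertion added to get_cached_file: librosa handles wav,
    flac and mp3, and get_duration reports the decoded length in seconds.
    """
    audio, sr = librosa.load(path)  # mono, resampled to librosa's default rate
    duration = librosa.get_duration(y=audio, sr=sr)
    if duration < min_seconds:
        raise ValueError(
            f"Speaker reference audio file needs to be >= {min_seconds}s in duration, got {duration:.1f}s"
        )
    return duration


if __name__ == "__main__":
    # assets/bria.mp3 is the sample reference this commit adds.
    print(check_speaker_reference("assets/bria.mp3"))
```

Raising an exception rather than asserting keeps the check active even when Python runs with -O; the repo's version uses a plain assert.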
@@ -596,7 +602,7 @@ class SamplingControllerConfig:
    """Absolute path to the model directory."""

    spk_cond_path: str
-    """Path to speaker reference file. Supports: wav, flac & mp3"""
+    """Path to speaker reference file. Min. 30s of audio required. Supports both local paths & public URIs. Audio formats: wav, flac & mp3"""

    text: str = (
        "This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model by MetaVoice."
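The docstring change documents the new contract for spk_cond_path: the reference may be a local file or a public URI, in wav, flac or mp3, and must contain at least 30 seconds of audio. A rough usage sketch follows, assuming the package has been installed with pip install -e . so that fam.llm.sample is importable; the URL shown is a placeholder, not a real asset.

```python
from fam.llm.sample import get_cached_file

# Local file: returned unchanged once the 30s duration check passes.
local_ref = get_cached_file("assets/bria.mp3")

# Public URI: fetched into a local cache with curl, then the same check runs.
# The URL below is a placeholder for any publicly reachable reference clip.
remote_ref = get_cached_file("https://example.com/speaker_reference.wav")

print(local_ref, remote_ref)
```

In either path, a reference clip shorter than 30 seconds fails the duration assertion.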
