initial commit
sid committed Feb 6, 2024
0 parents commit 0e815bb
Showing 36 changed files with 3,325 additions and 0 deletions.
186 changes: 186 additions & 0 deletions .gitignore
@@ -0,0 +1,186 @@
*.pkl
*.flac
*.npz
*.wav
*.m4a
*.opus
*.npy
*wandb
*.parquet
*.wav
*.pt
*.bin
*.png
*.DS_Store
*.idea
*.ipynb_checkpoints/
*__pycache__/
*.pyc
*.tsv
*.bak
*.tar
*.db
*.dat
*.json

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
**/.tmp
66 changes: 66 additions & 0 deletions README.md
@@ -0,0 +1,66 @@
# <img src="./assets/logo.png" style="vertical-align:middle; margin-bottom:6px;" alt="Logo" width="36" height="36"/> MetaVoice

MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:
* **Emotional speech rhythm and tone** in English. No hallucinations.
* Support for (cross-lingual) **voice cloning with finetuning**.
  * We have had success with as little as 1 minute of training data for Indian speakers.
* **Zero-shot cloning for American & British voices**, with 30s reference audio.
* Support for **long-form synthesis**.

We’re releasing MetaVoice-1B under the Apache 2.0 license; *it can be used without restrictions*.

<audio src="https://cdn.themetavoice.xyz/github_readme_bria.wav" controls style="margin: 0 auto;"></audio>

## Installation
```bash
# install ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*

pip install -r requirements.txt
pip install -e .
```

## Usage
1. Download it and use it anywhere (including locally) with our [reference implementation](/fam/llm/sample.py):
```bash
python fam/llm/sample.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1" --spk_cond_path=<PATH_TO_TARGET_AUDIO>
```

2. Deploy it on any cloud (AWS/GCP/Azure) using our [inference server](/fam/llm/serving.py):
```bash
python fam/llm/serving.py --huggingface_repo_id="metavoiceio/metavoice-1B-v0.1"
```

3. Use it on HuggingFace

## Soon
- Long form TTS
- Fine-tuning code

## Architecture
We predict EnCodec tokens from text and speaker information. These tokens are then diffused up to the waveform level, with post-processing applied to clean up the audio.

* We use a causal GPT to predict the first two hierarchies of EnCodec tokens. Text and audio are part of the LLM context. Speaker information is passed via conditioning at the token embedding layer. This speaker conditioning is obtained from a separately trained speaker verification network.
  - The two hierarchies are predicted in a "flattened interleaved" manner: we predict the first token of the first hierarchy, then the first token of the second hierarchy, then the second token of the first hierarchy, and so on.
  - We use condition-free sampling to boost the cloning capability of the model.
  - The text is tokenised using a custom-trained BPE tokeniser with 512 tokens.
  - Note that we've skipped predicting semantic tokens as done in other works, as we found that this isn't strictly necessary.
* We use a non-causal (encoder-style) transformer to predict the remaining 6 hierarchies from the first two hierarchies. This is a very small model (~10M parameters), and it generalises zero-shot to most speakers we've tried. Since it's non-causal, we're also able to predict all the timesteps in parallel.
* We use multi-band diffusion to generate waveforms from the EnCodec tokens. We noticed that the resulting speech is clearer than with the original RVQ decoder or VOCOS. However, diffusion at the waveform level leaves some background artifacts which are quite unpleasant to the ear. We clean this up in the next step.
* We use DeepFilterNet to clear up the artifacts introduced by the multi-band diffusion.
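
The "flattened interleaved" ordering described above can be illustrated with a minimal sketch. It is not the training code from this repository: the helper names `flatten_interleave` and `unflatten` are hypothetical, and the offset of 1024 is an assumed codebook size, chosen so that the two hierarchies occupy disjoint ID ranges.

```python
def flatten_interleave(first: list[int], second: list[int], offset: int) -> list[int]:
    """Interleave two equal-length token streams into a single sequence.

    Tokens of the second hierarchy are shifted by `offset` so that the two
    hierarchies use disjoint ID ranges (an assumption of this sketch).
    """
    if len(first) != len(second):
        raise ValueError("both hierarchies must have the same number of tokens")
    flat = []
    for a, b in zip(first, second):
        flat.append(a)           # token t of the first hierarchy
        flat.append(b + offset)  # token t of the second hierarchy, shifted
    return flat


def unflatten(flat: list[int], offset: int) -> tuple[list[int], list[int]]:
    """Split a flattened sequence back into its two hierarchies by ID range."""
    first = [t for t in flat if t < offset]
    second = [t - offset for t in flat if offset <= t < 2 * offset]
    return first, second


flat = flatten_interleave([5, 7, 9], [1, 2, 3], offset=1024)
print(flat)                   # [5, 1025, 7, 1026, 9, 1027]
print(unflatten(flat, 1024))  # ([5, 7, 9], [1, 2, 3])
```

Because the ID ranges are disjoint, the hierarchy of each token can be recovered from its value alone, without tracking its position in the sequence.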

## Optimizations
The model supports:
1. KV-caching via Flash Decoding
2. Batching (including texts of different lengths)

## Contribute
- See all [active issues](https://github.com/themetavoicexyz/issues)!

## Acknowledgements
We are grateful to Together.ai for their 24/7 help in marshalling our cluster. We thank the teams of AWS, GCP & HuggingFace for support with their cloud platforms.
Binary file added assets/ava.flac
Binary file not shown.
Binary file added assets/logo.png
Empty file added fam/__init__.py
Empty file.
Empty file added fam/llm/__init__.py
Empty file.
2 changes: 2 additions & 0 deletions fam/llm/adapters/__init__.py
@@ -0,0 +1,2 @@
from fam.llm.adapters.flattened_encodec import FlattenedInterleavedEncodec2Codebook
from fam.llm.adapters.tilted_encodec import TiltedEncodec
5 changes: 5 additions & 0 deletions fam/llm/adapters/base.py
@@ -0,0 +1,5 @@
from abc import ABC


class BaseDataAdapter(ABC):
    pass
38 changes: 38 additions & 0 deletions fam/llm/adapters/flattened_encodec.py
@@ -0,0 +1,38 @@
from fam.llm.adapters.base import BaseDataAdapter


class FlattenedInterleavedEncodec2Codebook(BaseDataAdapter):
    def __init__(self, end_of_audio_token):
        self._end_of_audio_token = end_of_audio_token

    def decode(self, tokens: list[list[int]]) -> tuple[list[int], list[list[int]]]:
        assert len(tokens) == 1
        tokens = tokens[0]

        text_ids = []
        extracted_audio_ids = [[], []]

        for t in tokens:
            if t < self._end_of_audio_token:
                extracted_audio_ids[0].append(t)
            elif t >= self._end_of_audio_token and t < 2 * self._end_of_audio_token:
                extracted_audio_ids[1].append(t - self._end_of_audio_token)
            # We ignore t = 2 * self._end_of_audio_token, as it is the end of audio token
            elif t > 2 * self._end_of_audio_token:
                text_ids.append(t)

        if len(set([len(x) for x in extracted_audio_ids])) != 1:
            min_len = min([len(x) for x in extracted_audio_ids])
            max_len = max([len(x) for x in extracted_audio_ids])
            print("WARNING: Number of tokens at each hierarchy must be of the same length!")
            print(f"Truncating to min length of {min_len} tokens from {max_len} max.")
            print([len(x) for x in extracted_audio_ids])
            extracted_audio_ids = [x[:min_len] for x in extracted_audio_ids]

        return text_ids[:-1], extracted_audio_ids

    def encode(self, text_tokens: list[int], audio_tokens: list[list[int]]):
        """
        Performs the required combination and padding as needed.
        """
        raise NotImplementedError
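
The range-based split performed by `decode` above can be traced on a small input. The sketch below is a standalone re-statement of that logic for illustration only: the function name `decode_flattened`, the value 1024 for the end-of-audio token, and the token IDs are all assumptions, and the truncation warning is omitted.

```python
def decode_flattened(tokens: list[int], end_of_audio_token: int):
    """Split one flattened sequence into text IDs and two audio hierarchies."""
    text_ids, audio = [], [[], []]
    for t in tokens:
        if t < end_of_audio_token:
            audio[0].append(t)                       # first hierarchy
        elif t < 2 * end_of_audio_token:
            audio[1].append(t - end_of_audio_token)  # second hierarchy, un-shifted
        elif t > 2 * end_of_audio_token:
            text_ids.append(t)                       # text-range token
        # t == 2 * end_of_audio_token marks the end of audio and is dropped
    min_len = min(len(x) for x in audio)
    audio = [x[:min_len] for x in audio]             # keep hierarchies aligned
    return text_ids[:-1], audio                      # last text-range token is dropped


# Hypothetical input: three text-range tokens, two interleaved audio frames,
# then the end-of-audio marker (2 * 1024 = 2048).
tokens = [2050, 2051, 2060, 5, 1025, 7, 1026, 2048]
print(decode_flattened(tokens, end_of_audio_token=1024))
# ([2050, 2051], [[5, 7], [1, 2]])
```

Note that, as in the adapter, the final text-range token is discarded by the `[:-1]` slice. The source does not say what that token represents.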
45 changes: 45 additions & 0 deletions fam/llm/adapters/tilted_encodec.py
@@ -0,0 +1,45 @@
from fam.llm.adapters.base import BaseDataAdapter


class TiltedEncodec(BaseDataAdapter):
    def __init__(self, end_of_audio_token):
        self._end_of_audio_token = end_of_audio_token

    def decode(self, tokens: list[list[int]]) -> tuple[list[int], list[list[int]]]:
        assert len(tokens) > 1

        text_ids = []
        extracted_audio_ids = []

        extracted_audio_ids.append([])
        # Handle first hierarchy as special case as it contains text tokens as well
        # TODO: maybe it doesn't need special case, and can be handled on its own :)
        for t in tokens[0]:
            if t > self._end_of_audio_token:
                text_ids.append(t)
            elif t < self._end_of_audio_token:
                extracted_audio_ids[0].append(t)

        # Handle the rest of the hierarchies
        for i in range(1, len(tokens)):
            token_hierarchy_ids = tokens[i]
            extracted_audio_ids.append([])
            for t in token_hierarchy_ids:
                if t < self._end_of_audio_token:
                    extracted_audio_ids[i].append(t)

        if len(set([len(x) for x in extracted_audio_ids])) != 1:
            min_len = min([len(x) for x in extracted_audio_ids])
            max_len = max([len(x) for x in extracted_audio_ids])
            print("WARNING: Number of tokens at each hierarchy must be of the same length!")
            print(f"Truncating to min length of {min_len} tokens from {max_len} max.")
            print([len(x) for x in extracted_audio_ids])
            extracted_audio_ids = [x[:min_len] for x in extracted_audio_ids]

        return text_ids[:-1], extracted_audio_ids

    def encode(self, text_tokens: list[int], audio_tokens: list[list[int]]):
        """
        Performs the required combination and padding as needed.
        """
        raise NotImplementedError
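
For comparison with the flattened adapter, here is a standalone sketch of the same `decode` logic for the tilted layout, where each hierarchy arrives as its own list. The function name `decode_tilted`, the end-of-audio value 1024, and the input layout (later hierarchies padded with the end-of-audio token) are assumptions made for illustration. The truncation warning is omitted.

```python
def decode_tilted(tokens: list[list[int]], end_of_audio_token: int):
    """Split per-hierarchy token lists into text IDs and audio hierarchies."""
    assert len(tokens) > 1
    text_ids, audio = [], [[]]
    # The first hierarchy carries both text tokens and audio tokens.
    for t in tokens[0]:
        if t > end_of_audio_token:
            text_ids.append(t)
        elif t < end_of_audio_token:
            audio[0].append(t)
    # The remaining hierarchies carry audio tokens only; everything else is skipped.
    for hierarchy in tokens[1:]:
        audio.append([t for t in hierarchy if t < end_of_audio_token])
    min_len = min(len(x) for x in audio)
    audio = [x[:min_len] for x in audio]   # keep hierarchies aligned
    return text_ids[:-1], audio            # last text-range token is dropped


# Hypothetical input: the second hierarchy is padded with the end-of-audio
# token (1024) wherever the first hierarchy holds text.
tokens = [
    [1030, 1031, 1040, 5, 7, 9, 1024],
    [1024, 1024, 1024, 1024, 1, 2, 3],
]
print(decode_tilted(tokens, end_of_audio_token=1024))
# ([1030, 1031], [[5, 7, 9], [1, 2, 3]])
```

Unlike the flattened adapter, no ID offset is subtracted here: every hierarchy shares the same audio ID range, and position in the outer list identifies the hierarchy.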