Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* add wav2vec2
* support-different-image
* update waveglow
* update waveglow
* update waveglow
* update waveglow
* update waveglow
* update waveglow
* update
* update
* update
* update
* update
* update
* update
* merge main
* merge main
* update according to review
* update according to review
* update according to review
* update according to review
* update according to review
* update according to review
Showing 35 changed files with 2,277 additions and 28 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
### Model Information | ||
- Introduction | ||
|
||
The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution, conditioned on mel-spectrograms (Figure 2). During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. Our implementation uses 512 residual channels in the coupling layer. | ||
|
||
- Paper | ||
[WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002) | ||
|
||
- Model code source | ||
This case includes code from the BSD 3-Clause licensed open-source project [NVIDIA DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) | ||
|
||
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: | ||
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. | ||
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. | ||
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. | ||
|
||
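The flow step described in the introduction above can be illustrated with a minimal, pure-Python sketch of an affine coupling layer. This is not the code from this commit: `coupling_params` is a hypothetical stand-in for the modified WaveNet, and the mel-spectrogram conditioning and the invertible convolution are left out. It only shows why the layer can be inverted exactly and why its log-determinant is the sum of the predicted log-scales.

```python
import math


def coupling_params(x_a):
    """Hypothetical stand-in for the WaveNet-like network: x_a -> (log_s, t)."""
    log_s = [0.5 * math.tanh(v) for v in x_a]
    t = [0.1 * v for v in x_a]
    return log_s, t


def coupling_forward(x):
    """Training direction: transform one half of x, pass the other half through."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = coupling_params(x_a)
    y_b = [math.exp(ls) * xb + tt for ls, xb, tt in zip(log_s, x_b, t)]
    # The Jacobian is triangular, so its log-determinant is sum(log_s).
    return x_a + y_b, sum(log_s)


def coupling_inverse(y):
    """Inference direction: the untouched half lets us recompute (log_s, t)."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = coupling_params(y_a)
    x_b = [(yb - tt) * math.exp(-ls) for ls, yb, tt in zip(log_s, y_b, t)]
    return y_a + x_b


x = [0.3, -1.2, 0.8, 2.0]  # even length assumed
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the first half passes through unchanged, the inverse never has to invert the network itself, which is what makes sampling at inference time cheap.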
### Dataset | ||
#### Dataset download | ||
The LJ Speech Dataset | ||
LJ Speech Dataset official website: https://keithito.com/LJ-Speech-Dataset/ | ||
Dataset version: 1.1 | ||
File md5sum: c4763be9595ddfa79c2fc6eaeb3b6c8e | ||
|
||
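A downloaded archive can be checked against the md5sum listed above before preprocessing. The sketch below uses only the standard library. The function names are illustrative, and the archive path is whatever file you downloaded (the README does not name it).

```python
import hashlib

# Checksum quoted in the README above.
EXPECTED_MD5 = "c4763be9595ddfa79c2fc6eaeb3b6c8e"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """True if the file at `path` matches the expected checksum."""
    return file_md5(path) == expected
```

Reading in chunks keeps memory use flat even for a multi-gigabyte archive.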
Statistics | ||
| Item | Statistics | | ||
| ------------------- | ---------- | | ||
| Total Clips | 13,100 | | ||
| Total Words | 225,715 | | ||
| Total Characters | 1,308,678 | | ||
| Total Duration | 23:55:17 | | ||
| Mean Clip Duration | 6.57 sec | | ||
| Min Clip Duration | 1.11 sec | | ||
| Max Clip Duration | 10.10 sec | | ||
| Mean Words per Clip | 17.23 | | ||
| Distinct Words | 13,821 | | ||
|
||
|
||
#### Preprocessing | ||
Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2 | ||
|
||
``` bash | ||
. | ||
├── LJSpeech-1.1 | ||
│   ├── README | ||
│   ├── mels | ||
│   ├── metadata.csv | ||
│   └── wavs | ||
└── filelists | ||
    ├── ljs_audio_text_test_filelist.txt | ||
    ├── ljs_audio_text_train_filelist.txt | ||
    ├── ljs_audio_text_train_subset_1250_filelist.txt | ||
    ├── ljs_audio_text_train_subset_2500_filelist.txt | ||
    ├── ljs_audio_text_train_subset_300_filelist.txt | ||
    ├── ljs_audio_text_train_subset_625_filelist.txt | ||
    ├── ljs_audio_text_train_subset_64_filelist.txt | ||
    ├── ljs_audio_text_val_filelist.txt | ||
    ├── ljs_mel_text_filelist.txt | ||
    ├── ljs_mel_text_test_filelist.txt | ||
    ├── ljs_mel_text_train_filelist.txt | ||
    ├── ljs_mel_text_train_subset_1250_filelist.txt | ||
    ├── ljs_mel_text_train_subset_2500_filelist.txt | ||
    ├── ljs_mel_text_train_subset_625_filelist.txt | ||
    └── ljs_mel_text_val_filelist.txt | ||
4 directories, 17 files | ||
``` | ||
|
||
|
||
### Framework and chip support | ||
| | PyTorch | | ||
| ---------- | ------- | | ||
| Nvidia GPU | ✅ | | ||
| 昆仑芯 XPU | N/A | | ||
| 天数智芯 | N/A | |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
from ._base import * | ||
from .mutable_params import mutable_params |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# necessary | ||
name: str = "WaveGlow" | ||
dist_backend = "nccl" | ||
vendor: str = "nvidia" | ||
target_val_loss = -5.72  # https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2 | ||
|
||
save_checkpoint = True | ||
|
||
# Perf | ||
do_train = True | ||
local_rank = -1 | ||
log_freq = 1 | ||
output = "training/result/" | ||
log_file = "nvlog.json" | ||
gradient_accumulation_steps = 1 | ||
|
||
# training | ||
epochs = 250 | ||
batch_size = 10 | ||
|
||
# device | ||
device: str = None | ||
n_device: int = 1 | ||
fp16 = False | ||
data_dir = None | ||
world_size = None | ||
|
||
# random seed | ||
seed: int = None | ||
|
||
# model args | ||
amp = True | ||
epochs_per_checkpoint = 50 | ||
learning_rate = 1e-4 | ||
segment_length = 8000 | ||
weight_decay = 0 | ||
grad_clip_thresh = 65504.0 | ||
cudnn_benchmark = True | ||
cudnn_enabled = True | ||
anneal_steps = None | ||
bench_class = '' |
4 changes: 4 additions & 0 deletions
training/benchmarks/WaveGlow/pytorch/config/mutable_params.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
mutable_params = [ | ||
    "local_rank", "do_train", "data_dir", "log_freq", "dist_backend", | ||
    "batch_size", "vendor", "amp" | ||
] |
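How the framework consumes `mutable_params` is not shown in this commit. One plausible reading is that the list is an allow-list of base-config keys that a vendor-specific config may override. The sketch below illustrates that idea only: `apply_overrides` and the sample dictionaries are hypothetical, not part of the benchmark code.

```python
def apply_overrides(base_config, overrides, mutable_params):
    """Apply only allow-listed overrides; report the keys that were refused."""
    merged = dict(base_config)
    rejected = []
    for key, value in overrides.items():
        if key in mutable_params:
            merged[key] = value
        else:
            rejected.append(key)
    return merged, rejected


# Allow-list copied from mutable_params.py above.
mutable_params = [
    "local_rank", "do_train", "data_dir", "log_freq", "dist_backend",
    "batch_size", "vendor", "amp"
]

# Illustrative values: a subset of the base config and a vendor override.
base_config = {"batch_size": 10, "epochs": 250, "amp": True, "vendor": "nvidia"}
vendor_overrides = {"batch_size": 8, "epochs": 1}

merged, rejected = apply_overrides(base_config, vendor_overrides, mutable_params)
```

Under this reading, `batch_size` may change per vendor while `epochs` stays fixed, so different hardware is benchmarked on the same training budget.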
Empty file.
117 changes: 117 additions & 0 deletions
training/benchmarks/WaveGlow/pytorch/dataloaders/data_function.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,117 @@ | ||
# ***************************************************************************** | ||
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved. | ||
# | ||
# Redistribution and use in source and binary forms, with or without | ||
# modification, are permitted provided that the following conditions are met: | ||
# * Redistributions of source code must retain the above copyright | ||
# notice, this list of conditions and the following disclaimer. | ||
# * Redistributions in binary form must reproduce the above copyright | ||
# notice, this list of conditions and the following disclaimer in the | ||
# documentation and/or other materials provided with the distribution. | ||
# * Neither the name of the NVIDIA CORPORATION nor the | ||
# names of its contributors may be used to endorse or promote products | ||
# derived from this software without specific prior written permission. | ||
# | ||
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND | ||
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED | ||
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE | ||
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY | ||
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES | ||
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; | ||
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND | ||
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | ||
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS | ||
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | ||
# | ||
# *****************************************************************************\ | ||
|
||
import torch | ||
import os | ||
import tacotron2_common.layers as layers | ||
from tacotron2_common.utils import load_wav_to_torch, load_filepaths_and_text, to_gpu | ||
|
||
|
||
class MelAudioLoader(torch.utils.data.Dataset): | ||
    """ | ||
    1) loads audio, text pairs | ||
    2) computes mel-spectrograms from audio files. | ||
    """ | ||
|
||
    def __init__(self, dataset_path, audiopaths_and_text, args): | ||
        self.audiopaths_and_text = load_filepaths_and_text( | ||
            dataset_path, audiopaths_and_text) | ||
        self.max_wav_value = args.max_wav_value | ||
        self.sampling_rate = args.sampling_rate | ||
        self.stft = layers.TacotronSTFT(args.filter_length, args.hop_length, | ||
                                        args.win_length, args.n_mel_channels, | ||
                                        args.sampling_rate, args.mel_fmin, | ||
                                        args.mel_fmax) | ||
        self.segment_length = args.segment_length | ||
|
||
    def get_mel_audio_pair(self, filename): | ||
        audio, sampling_rate = load_wav_to_torch(filename) | ||
|
||
        if sampling_rate != self.stft.sampling_rate: | ||
            raise ValueError("{} SR doesn't match target {} SR".format( | ||
                sampling_rate, self.stft.sampling_rate)) | ||
|
||
        # Take segment | ||
        if audio.size(0) >= self.segment_length: | ||
            max_audio_start = audio.size(0) - self.segment_length | ||
            audio_start = torch.randint(0, max_audio_start + 1, | ||
                                        size=(1, )).item() | ||
            audio = audio[audio_start:audio_start + self.segment_length] | ||
        else: | ||
            audio = torch.nn.functional.pad( | ||
                audio, (0, self.segment_length - audio.size(0)), | ||
                'constant').data | ||
|
||
        audio = audio / self.max_wav_value | ||
        audio_norm = audio.unsqueeze(0) | ||
        audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False) | ||
        melspec = self.stft.mel_spectrogram(audio_norm) | ||
        melspec = melspec.squeeze(0) | ||
|
||
        return (melspec, audio, len(audio)) | ||
|
||
    def __getitem__(self, index): | ||
        return self.get_mel_audio_pair(self.audiopaths_and_text[index][0]) | ||
|
||
    def __len__(self): | ||
        return len(self.audiopaths_and_text) | ||
|
||
|
||
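The "Take segment" step in `MelAudioLoader.get_mel_audio_pair` gives every training example the same length: long clips are randomly cropped and short clips are zero-padded on the right. Below is a dependency-free sketch of that logic on plain lists. `take_segment` is an illustrative name, and Python's `random` module stands in for `torch.randint`.

```python
import random


def take_segment(audio, segment_length, rng=random):
    """Return exactly `segment_length` samples from `audio`."""
    if len(audio) >= segment_length:
        max_start = len(audio) - segment_length
        # randint is inclusive on both ends, matching
        # torch.randint(0, max_start + 1, ...) in the loader.
        start = rng.randint(0, max_start)
        return audio[start:start + segment_length]
    # Clip is too short: pad with zeros on the right.
    return audio + [0.0] * (segment_length - len(audio))


long_clip = [float(i) for i in range(20)]
cropped = take_segment(long_clip, 8, rng=random.Random(0))
padded = take_segment([1.0, 2.0], 5)
```

With `segment_length = 8000` from the config and the 22,050 Hz LJ Speech audio, each example covers roughly 0.36 seconds of speech.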
def batch_to_gpu(batch): | ||
    x, y, len_y = batch | ||
    x = to_gpu(x).float() | ||
    y = to_gpu(y).float() | ||
    len_y = to_gpu(torch.sum(len_y)) | ||
    return ((x, y), y, len_y) | ||
|
||
|
||
def get_collate_function(model_name): | ||
    if model_name == 'WaveGlow': | ||
        collate_fn = torch.utils.data.dataloader.default_collate | ||
    else: | ||
        raise NotImplementedError( | ||
            "unknown collate function requested: {}".format(model_name)) | ||
|
||
    return collate_fn | ||
|
||
|
||
def get_data_loader(model_name, dataset_path, audiopaths_and_text, args): | ||
    if model_name == 'WaveGlow': | ||
        data_loader = MelAudioLoader(dataset_path, audiopaths_and_text, args) | ||
    else: | ||
        raise NotImplementedError( | ||
            "unknown data loader requested: {}".format(model_name)) | ||
|
||
    return data_loader | ||
|
||
|
||
def get_batch_to_gpu(model_name): | ||
    if model_name == 'WaveGlow': | ||
        return batch_to_gpu | ||
    else: | ||
        raise NotImplementedError( | ||
            "unknown batch_to_gpu requested: {}".format(model_name)) |
101 changes: 101 additions & 0 deletions
training/benchmarks/WaveGlow/pytorch/dataloaders/dataloader.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
import copy | ||
import os | ||
|
||
import torch | ||
import torch.utils.data | ||
import torchvision | ||
|
||
from pycocotools import mask as coco_mask | ||
from pycocotools.coco import COCO | ||
|
||
from . import data_function | ||
from torch.utils.data.distributed import DistributedSampler | ||
from torch.utils.data import DataLoader | ||
from driver import dist_pytorch | ||
|
||
|
||
def get_collate_fn(args): | ||
    collate_fn = data_function.get_collate_function(args.name) | ||
    return collate_fn | ||
|
||
|
||
def build_train_dataset(args): | ||
    trainset = data_function.get_data_loader(args.name, args.data_dir, | ||
                                             args.training_files, args) | ||
    return trainset | ||
|
||
|
||
def build_train_dataloader(trainset, args): | ||
    if dist_pytorch.get_world_size() > 1: | ||
        train_sampler = DistributedSampler(trainset, seed=(args.seed or 0)) | ||
        shuffle = False | ||
    else: | ||
        train_sampler = None | ||
        shuffle = True | ||
    train_loader = DataLoader(trainset, | ||
                              num_workers=1, | ||
                              shuffle=shuffle, | ||
                              sampler=train_sampler, | ||
                              batch_size=args.batch_size, | ||
                              pin_memory=False, | ||
                              drop_last=True, | ||
                              collate_fn=get_collate_fn(args)) | ||
    return train_loader | ||
|
||
|
||
def build_eval_dataloader(valset, args): | ||
    val_sampler = DistributedSampler( | ||
        valset) if dist_pytorch.get_world_size() > 1 else None | ||
    val_loader = DataLoader( | ||
        valset, | ||
        num_workers=1, | ||
        shuffle=False, | ||
        sampler=val_sampler, | ||
        batch_size=args.batch_size, | ||
        pin_memory=False, | ||
        collate_fn=get_collate_fn(args), | ||
        drop_last=(args.bench_class == "perf-train")) | ||
|
||
    return val_loader | ||
|
||
|
||
def build_eval_dataset(args): | ||
    valset = data_function.get_data_loader(args.name, args.data_dir, | ||
                                           args.validation_files, args) | ||
    return valset | ||
|
||
|
||
class FilterAndRemapCocoCategories(object): | ||
|
||
    def __init__(self, categories, remap=True): | ||
        self.categories = categories | ||
        self.remap = remap | ||
|
||
    def __call__(self, image, target): | ||
        anno = target["annotations"] | ||
        anno = [obj for obj in anno if obj["category_id"] in self.categories] | ||
        if not self.remap: | ||
            target["annotations"] = anno | ||
            return image, target | ||
        anno = copy.deepcopy(anno) | ||
        for obj in anno: | ||
            obj["category_id"] = self.categories.index(obj["category_id"]) | ||
        target["annotations"] = anno | ||
        return image, target | ||
|
||
|
||
def convert_coco_poly_to_mask(segmentations, height, width): | ||
    masks = [] | ||
    for polygons in segmentations: | ||
        rles = coco_mask.frPyObjects(polygons, height, width) | ||
        mask = coco_mask.decode(rles) | ||
        if len(mask.shape) < 3: | ||
            mask = mask[..., None] | ||
        mask = torch.as_tensor(mask, dtype=torch.uint8) | ||
        mask = mask.any(dim=2) | ||
        masks.append(mask) | ||
    if masks: | ||
        masks = torch.stack(masks, dim=0) | ||
    else: | ||
        masks = torch.zeros((0, height, width), dtype=torch.uint8) | ||
    return masks |
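When `world_size > 1`, `build_train_dataloader` delegates shuffling to `DistributedSampler` and passes `shuffle=False` to the `DataLoader`. The pure-Python sketch below shows roughly how such a sampler splits a dataset across ranks: shuffle with a seed shared by all ranks, pad so the length divides evenly, then take a strided slice. `shard_indices` is an illustrative name, `random.Random` stands in for the torch generator, and the sketch assumes `dataset_len >= world_size`.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, seed=0, epoch=0):
    """Indices one rank would see for one epoch (illustrative sketch)."""
    indices = list(range(dataset_len))
    # Every rank uses the same seed, so all ranks agree on the permutation.
    random.Random(seed + epoch).shuffle(indices)
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    # Pad by repeating samples from the front so each rank gets per_rank items.
    indices += indices[:total - dataset_len]
    return indices[rank:total:world_size]


shards = [shard_indices(10, 4, rank) for rank in range(4)]
```

Since the permutation depends on `seed + epoch`, the order only changes between epochs if the training loop calls the sampler's `set_epoch`. That call is not part of this file, so it is presumably made elsewhere.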
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
from . import loss_function | ||
|
||
|
||
def create_criterion(args): | ||
    try: | ||
        sigma = args.sigma | ||
    except AttributeError: | ||
        sigma = None | ||
    criterion = loss_function.get_loss_function(args.name, sigma) | ||
    return criterion |
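The `loss_function` module is not shown in this part of the diff, so the following is only a sketch of the objective described in the WaveGlow paper, not the code `create_criterion` returns. It computes the negative log-likelihood of the latent `z` under a zero-mean Gaussian with standard deviation `sigma`, minus the log-determinant terms from the coupling layers and invertible convolutions, averaged over the elements of `z`. The function name and the flat-list inputs are assumptions made for illustration.

```python
def waveglow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
    """Per-element negative log-likelihood, up to an additive constant."""
    # Gaussian term: sum(z^2) / (2 * sigma^2)
    nll = sum(v * v for v in z) / (2.0 * sigma * sigma)
    # Log-determinant of every affine coupling layer: sum of its log-scales.
    nll -= sum(sum(log_s) for log_s in log_s_list)
    # Log-determinant of every invertible 1x1 convolution.
    nll -= sum(log_det_w_list)
    return nll / len(z)


loss_a = waveglow_loss([1.0, -1.0], [[0.0, 0.0]], [0.0], sigma=1.0)
loss_b = waveglow_loss([1.0, -1.0], [[0.5, 0.5]], [1.0], sigma=1.0)
```

The log-determinant terms can outweigh the Gaussian term, so the loss can go negative, which is consistent with the negative `target_val_loss` in the config.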