add waveglow (#92)
* add wav2vec2

* support-different-image

* update waveglow

* update waveglow

* update waveglow

* update waveglow

* update waveglow

* update waveglow

* update

* update

* update

* update

* update

* update

* update

* merge main

* merge main

* update according to review

* update according to review

* update according to review

* update according to review

* update according to review

* update according to review
upvenly authored Jun 6, 2023
1 parent 5c3a3a5 commit ca824b1
Showing 35 changed files with 2,277 additions and 28 deletions.
73 changes: 73 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
### Model Information
- Introduction

The WaveGlow model is a flow-based generative model that generates audio samples from a Gaussian distribution, conditioned on mel-spectrograms. During training, the model learns to transform the dataset distribution into a spherical Gaussian distribution through a series of flows. One step of a flow consists of an invertible convolution followed by a modified WaveNet architecture that serves as an affine coupling layer. During inference, the network is inverted and audio samples are generated from the Gaussian distribution. The implementation uses 512 residual channels in the coupling layer.
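
To make the invertibility described in the introduction concrete, here is a minimal sketch of one affine coupling step on plain Python lists. It is not the code added by this commit: the names `toy_net`, `coupling_forward` and `coupling_inverse` are hypothetical, and the tiny `toy_net` only stands in for the WaveNet-like network, which in WaveGlow also receives the mel-spectrogram as conditioning. The sketch assumes an input of even length.

```python
import math


def toy_net(x_a):
    """Hypothetical stand-in for the WaveNet-like network: x_a -> (log_s, t)."""
    log_s = [0.5 * math.tanh(v) for v in x_a]
    t = [0.1 * v for v in x_a]
    return log_s, t


def coupling_forward(x):
    """Training direction: keep the first half, scale and shift the second."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = toy_net(x_a)
    y_b = [math.exp(ls) * b + ti for ls, b, ti in zip(log_s, x_b, t)]
    # The log-determinant of the Jacobian of this step is the sum of log_s.
    return x_a + y_b, sum(log_s)


def coupling_inverse(y):
    """Inference direction: the first half is unchanged, so log_s and t can
    be recomputed from it and the affine map undone exactly."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = toy_net(y_a)
    x_b = [(b - ti) * math.exp(-ls) for ls, b, ti in zip(log_s, y_b, t)]
    return y_a + x_b


x = [0.3, -1.2, 0.7, 2.0]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
```

Because the first half passes through unchanged, `toy_net` never has to be inverted itself, which is why an arbitrarily complex network can be used in its place.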

- Paper
[WaveGlow: A Flow-based Generative Network for Speech Synthesis](https://arxiv.org/abs/1811.00002)

- Model code source
This case includes code from the BSD 3-Clause licensed open source project [NVIDIA DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

### Dataset
#### Dataset download
The LJ Speech Dataset
LJ Speech Dataset official site: https://keithito.com/LJ-Speech-Dataset/
Dataset version: 1.1
File md5sum: c4763be9595ddfa79c2fc6eaeb3b6c8e
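
A download can be checked against the checksum above before preprocessing. The following is a minimal sketch using only the standard library; the helper names `file_md5` and `verify_archive` are hypothetical, and it assumes the quoted md5sum refers to the downloaded dataset archive.

```python
import hashlib

# Checksum quoted in this README for dataset version 1.1.
EXPECTED_MD5 = "c4763be9595ddfa79c2fc6eaeb3b6c8e"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected=EXPECTED_MD5):
    """True if the file at `path` matches the expected checksum."""
    return file_md5(path) == expected
```

For example, `verify_archive("LJSpeech-1.1.tar.bz2")` should return `True` for an intact download; the archive file name here is an assumption and may differ on your system.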

Statistics
| Item | Statistics |
| ------------------- | ---------- |
| Total Clips | 13,100 |
| Total Words | 225,715 |
| Total Characters | 1,308,678 |
| Total Duration | 23:55:17 |
| Mean Clip Duration | 6.57 sec |
| Min Clip Duration | 1.11 sec |
| Max Clip Duration | 10.10 sec |
| Mean Words per Clip | 17.23 |
| Distinct Words | 13,821 |


#### Preprocessing
Reference: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

``` bash
.
├── LJSpeech-1.1
│   ├── README
│   ├── mels
│   ├── metadata.csv
│   └── wavs
└── filelists
    ├── ljs_audio_text_test_filelist.txt
    ├── ljs_audio_text_train_filelist.txt
    ├── ljs_audio_text_train_subset_1250_filelist.txt
    ├── ljs_audio_text_train_subset_2500_filelist.txt
    ├── ljs_audio_text_train_subset_300_filelist.txt
    ├── ljs_audio_text_train_subset_625_filelist.txt
    ├── ljs_audio_text_train_subset_64_filelist.txt
    ├── ljs_audio_text_val_filelist.txt
    ├── ljs_mel_text_filelist.txt
    ├── ljs_mel_text_test_filelist.txt
    ├── ljs_mel_text_train_filelist.txt
    ├── ljs_mel_text_train_subset_1250_filelist.txt
    ├── ljs_mel_text_train_subset_2500_filelist.txt
    ├── ljs_mel_text_train_subset_625_filelist.txt
    └── ljs_mel_text_val_filelist.txt

4 directories, 17 files
```
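
Before launching training it can be useful to confirm that the data directory matches the layout above. This is a small sketch, not part of the benchmark code: `missing_entries` and `EXPECTED_ENTRIES` are hypothetical names, and only a subset of the files shown in the tree is checked.

```python
from pathlib import Path

# Paths relative to the data directory, taken from the tree above.
EXPECTED_ENTRIES = [
    "LJSpeech-1.1/metadata.csv",
    "LJSpeech-1.1/wavs",
    "LJSpeech-1.1/mels",
    "filelists/ljs_audio_text_train_filelist.txt",
    "filelists/ljs_audio_text_val_filelist.txt",
    "filelists/ljs_audio_text_test_filelist.txt",
]


def missing_entries(data_dir, expected=EXPECTED_ENTRIES):
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list from `missing_entries(data_dir)` means every checked entry is present; any returned path points at a preprocessing step that still needs to be run.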


### Framework and chip support
|                          | PyTorch |
| ------------------------ | ------- |
| Nvidia GPU               | ✅      |
| Kunlunxin (昆仑芯) XPU   | N/A     |
| Iluvatar CoreX (天数智芯) | N/A     |
2 changes: 2 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/config/__init__.py
@@ -0,0 +1,2 @@
from ._base import *
from .mutable_params import mutable_params
41 changes: 41 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/config/_base.py
@@ -0,0 +1,41 @@
# necessary
name: str = "WaveGlow"
dist_backend = "nccl"
vendor: str = "nvidia"
target_val_loss = -5.72 #https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2

save_checkpoint = True

#Perf
do_train = True
local_rank = -1
log_freq = 1
output = "training/result/"
log_file = "nvlog.json"
gradient_accumulation_steps = 1

# training
epochs = 250
batch_size = 10

# device
device: str = None
n_device: int = 1
fp16 = False
data_dir = None
world_size = None

# random seed
seed: int = None

# model args
amp = True
epochs_per_checkpoint = 50
learning_rate = 1e-4
segment_length = 8000
weight_decay = 0
grad_clip_thresh = 65504.0
cudnn_benchmark = True
cudnn_enabled = True
anneal_steps = None
bench_class = ''
4 changes: 4 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/config/mutable_params.py
@@ -0,0 +1,4 @@
mutable_params = [
    "local_rank", "do_train", "data_dir", "log_freq", "dist_backend",
    "batch_size", "vendor", "amp"
]
Empty file.
117 changes: 117 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/dataloaders/data_function.py
@@ -0,0 +1,117 @@
# *****************************************************************************
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
# * Neither the name of the NVIDIA CORPORATION nor the
# names of its contributors may be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# *****************************************************************************\

import torch
import os
import tacotron2_common.layers as layers
from tacotron2_common.utils import load_wav_to_torch, load_filepaths_and_text, to_gpu


class MelAudioLoader(torch.utils.data.Dataset):
    """
        1) loads audio,text pairs
        2) computes mel-spectrograms from audio files.
    """

    def __init__(self, dataset_path, audiopaths_and_text, args):
        self.audiopaths_and_text = load_filepaths_and_text(
            dataset_path, audiopaths_and_text)
        self.max_wav_value = args.max_wav_value
        self.sampling_rate = args.sampling_rate
        self.stft = layers.TacotronSTFT(args.filter_length, args.hop_length,
                                        args.win_length, args.n_mel_channels,
                                        args.sampling_rate, args.mel_fmin,
                                        args.mel_fmax)
        self.segment_length = args.segment_length

    def get_mel_audio_pair(self, filename):
        audio, sampling_rate = load_wav_to_torch(filename)

        if sampling_rate != self.stft.sampling_rate:
            # The message has three placeholders, so pass the filename too.
            raise ValueError("{} {} SR doesn't match target {} SR".format(
                filename, sampling_rate, self.stft.sampling_rate))

        # Take a random segment of fixed length, or zero-pad short clips
        if audio.size(0) >= self.segment_length:
            max_audio_start = audio.size(0) - self.segment_length
            audio_start = torch.randint(0, max_audio_start + 1,
                                        size=(1, )).item()
            audio = audio[audio_start:audio_start + self.segment_length]
        else:
            audio = torch.nn.functional.pad(
                audio, (0, self.segment_length - audio.size(0)),
                'constant').data

        # Normalize to [-1, 1] and compute the mel-spectrogram
        audio = audio / self.max_wav_value
        audio_norm = audio.unsqueeze(0)
        audio_norm = torch.autograd.Variable(audio_norm, requires_grad=False)
        melspec = self.stft.mel_spectrogram(audio_norm)
        melspec = melspec.squeeze(0)

        return (melspec, audio, len(audio))

    def __getitem__(self, index):
        return self.get_mel_audio_pair(self.audiopaths_and_text[index][0])

    def __len__(self):
        return len(self.audiopaths_and_text)


def batch_to_gpu(batch):
    x, y, len_y = batch
    x = to_gpu(x).float()
    y = to_gpu(y).float()
    len_y = to_gpu(torch.sum(len_y))
    return ((x, y), y, len_y)


def get_collate_function(model_name):
    if model_name == 'WaveGlow':
        collate_fn = torch.utils.data.dataloader.default_collate
    else:
        raise NotImplementedError(
            "unknown collate function requested: {}".format(model_name))

    return collate_fn


def get_data_loader(model_name, dataset_path, audiopaths_and_text, args):
    if model_name == 'WaveGlow':
        data_loader = MelAudioLoader(dataset_path, audiopaths_and_text, args)
    else:
        raise NotImplementedError(
            "unknown data loader requested: {}".format(model_name))

    return data_loader


def get_batch_to_gpu(model_name):
    if model_name == 'WaveGlow':
        return batch_to_gpu
    else:
        raise NotImplementedError(
            "unknown batch_to_gpu requested: {}".format(model_name))
101 changes: 101 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/dataloaders/dataloader.py
@@ -0,0 +1,101 @@
import copy
import os

import torch
import torch.utils.data
import torchvision

from pycocotools import mask as coco_mask
from pycocotools.coco import COCO

from . import data_function
from torch.utils.data.distributed import DistributedSampler
from torch.utils.data import DataLoader
from driver import dist_pytorch


def get_collate_fn(args):
    collate_fn = data_function.get_collate_function(args.name)
    return collate_fn


def build_train_dataset(args):
    trainset = data_function.get_data_loader(args.name, args.data_dir,
                                             args.training_files, args)
    return trainset


def build_train_dataloader(trainset, args):
    # With a DistributedSampler the sampler does the shuffling,
    # so the DataLoader itself must not shuffle.
    if dist_pytorch.get_world_size() > 1:
        train_sampler = DistributedSampler(trainset, seed=(args.seed or 0))
        shuffle = False
    else:
        train_sampler = None
        shuffle = True
    train_loader = DataLoader(trainset,
                              num_workers=1,
                              shuffle=shuffle,
                              sampler=train_sampler,
                              batch_size=args.batch_size,
                              pin_memory=False,
                              drop_last=True,
                              collate_fn=get_collate_fn(args))
    return train_loader


def build_eval_dataloader(valset, args):
    val_sampler = DistributedSampler(
        valset) if dist_pytorch.get_world_size() > 1 else None
    val_loader = DataLoader(
        valset,
        num_workers=1,
        shuffle=False,
        sampler=val_sampler,
        batch_size=args.batch_size,
        pin_memory=False,
        collate_fn=get_collate_fn(args),
        drop_last=(args.bench_class == "perf-train"))

    return val_loader


def build_eval_dataset(args):
    valset = data_function.get_data_loader(args.name, args.data_dir,
                                           args.validation_files, args)
    return valset


class FilterAndRemapCocoCategories(object):

    def __init__(self, categories, remap=True):
        self.categories = categories
        self.remap = remap

    def __call__(self, image, target):
        anno = target["annotations"]
        anno = [obj for obj in anno if obj["category_id"] in self.categories]
        if not self.remap:
            target["annotations"] = anno
            return image, target
        anno = copy.deepcopy(anno)
        for obj in anno:
            obj["category_id"] = self.categories.index(obj["category_id"])
        target["annotations"] = anno
        return image, target


def convert_coco_poly_to_mask(segmentations, height, width):
    masks = []
    for polygons in segmentations:
        rles = coco_mask.frPyObjects(polygons, height, width)
        mask = coco_mask.decode(rles)
        if len(mask.shape) < 3:
            mask = mask[..., None]
        mask = torch.as_tensor(mask, dtype=torch.uint8)
        mask = mask.any(dim=2)
        masks.append(mask)
    if masks:
        masks = torch.stack(masks, dim=0)
    else:
        masks = torch.zeros((0, height, width), dtype=torch.uint8)
    return masks
10 changes: 10 additions & 0 deletions training/benchmarks/WaveGlow/pytorch/loss/__init__.py
@@ -0,0 +1,10 @@
from . import loss_function


def create_criterion(args):
    # sigma is optional; fall back to None when the config does not define it
    sigma = getattr(args, "sigma", None)
    criterion = loss_function.get_loss_function(args.name, sigma)
    return criterion