
Commit 2749a80

Merge branch 'gh/master' into gh/release
2 parents: 799660f + 64ea93d

39 files changed: +1154 −318 lines

CUDA-Optimized/FastSpeech/.gitmodules

Lines changed: 0 additions & 6 deletions
This file was deleted.

CUDA-Optimized/FastSpeech/Dockerfile

Lines changed: 9 additions & 2 deletions

@@ -1,7 +1,14 @@
-ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3
+ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.10-py3
 FROM ${FROM_IMAGE_NAME}
 
+# ARG UNAME
+# ARG UID
+# ARG GID
+# RUN groupadd -g $GID -o $UNAME
+# RUN useradd -m -u $UID -g $GID -o -s /bin/bash $UNAME
+# USER $UNAME
+
 ADD . /workspace/fastspeech
 WORKDIR /workspace/fastspeech
 
-RUN sh ./scripts/install.sh
+RUN sh ./scripts/install.sh

CUDA-Optimized/FastSpeech/README.md

Lines changed: 31 additions & 27 deletions
@@ -95,9 +95,9 @@ and encapsulates some dependencies. Aside from these dependencies, ensure you
 have the following components:
 
 * [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker)
-* [PyTorch 20.03-py3+ NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
+* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch)
 or newer
-* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU
+* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/), [Turing](https://www.nvidia.com/en-us/geforce/turing/)<!--, or [Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/) based GPU-->
 
 For more information about how to get started with NGC containers, see the
 following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning
@@ -120,11 +120,6 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
 git clone https://github.com/NVIDIA/DeepLearningExamples.git
 cd DeepLearningExamples/CUDA-Optimized/FastSpeech
 ```
-and pull submodules.
-```
-git submodule init
-git submodule update
-```
 
 2. Download and preprocess the dataset. Data is downloaded to the ./LJSpeech-1.1 directory (on the host). The ./LJSpeech-1.1 directory is mounted to the /workspace/fastspeech/LJSpeech-1.1 location in the NGC container.
 ```
@@ -148,7 +143,7 @@ To train your model using mixed precision with Tensor Cores or using FP32, perfo
 
 The preprocessed mel-spectrograms are stored in the ./mels_ljspeech1.1 directory.
 
-Next, calculate alignments on the LJSpeech dataset using a pre-trained [NVIDIA Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view). The output directory is specified with `--aligns_path`.
+Next, preprocess the alignments on LJSpeech dataset with feed-forwards to the teacher model. Download the Nvidia [pretrained Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view) to get a pretrained teacher model. And set --tacotron2_path to the Tacotron2 checkpoint file path and the result alignments are stored in --aligns_path.
 ```
 python fastspeech/align_tacotron2.py --dataset_path="./LJSpeech-1.1" --tacotron2_path="tacotron2_statedict.pt" --aligns_path="aligns_ljspeech1.1"
 ```
@@ -169,23 +164,23 @@ Next, calculate alignments on the LJSpeech dataset using a pre-trained [NVIDIA T
 python fastspeech/train.py --dataset_path="./LJSpeech-1.1" --mels_path="./mels_ljspeech1.1" --aligns_path="./aligns_ljspeech1.1" --log_path="./logs" --checkpoint_path="./checkpoints" --use_amp
 ```
 
-6. Start generation. To generate waveforms with WaveGlow Vocoder, Get [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) in the home directory, for example, ./waveglow_256channels.pt.
+6. Start generation. To generate waveforms with WaveGlow Vocoder, Get [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC into the home directory, for example, ./nvidia_waveglow256pyt_fp16.
 
 After you have trained the FastSpeech model, you can perform generation using the checkpoint stored in ./checkpoints. Then run:
 ```
-python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt"
+python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt"
 ```
 
 The script loads automatically the latest checkpoint (if any exists), or you can pass a checkpoint file through --ckpt_file. And it loads input texts in ./test_sentences.txt and stores the result in ./results directory. You can also set the result directory path with --results_path.
 
 You can also run with a sample text:
 ```
-python generate.py --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save."
+python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save."
 ```
 
-7. Accelerate generation(inferencing of FastSpeech and WaveGlow) with TensorRT. Set parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first extract a TensorRT engine file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/trt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt) and copy this in the home directory, for example, ./waveglow.fp16.trt. Then run with --waveglow_engine_path:
+7. Accelerate generation(inferencing of FastSpeech and WaveGlow) with TensorRT. Set parameters config file with --hparam=trt.yaml to enable TensorRT inference mode. To prepare for running WaveGlow on TensorRT, first get an ONNX file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/tensorrt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt), convert it to an TensorRT engine using scripts/waveglow/convert_onnx2trt.py, and copy this in the home directory, for example, ./waveglow.fp16.trt. Then run with --waveglow_engine_path:
 ```
-python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt"
+python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="./test_sentences.txt" --waveglow_engine_path="waveglow.fp16.trt"
 ```
 
 ## Advanced
@@ -293,33 +288,29 @@ For more details, refer to [accelerating inference with TensorRT](fastspeech/trt
 
 #### Generation
 
-To generate waveforms with WaveGlow Vocoder, 1) Make sure to pull [Nvidia WaveGlow](https://github.com/NVIDIA/waveglow) through git submodule, 2) get [pretrained WaveGlow model](https://drive.google.com/open?id=1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF) in the home directory, for example, ./waveglow_256channels.pt.
-```
-git submodule init
-git submodule update
-```
+To generate waveforms with WaveGlow Vocoder, get [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC into the home directory, for example, ./nvidia_waveglow256pyt_fp16.
 
 Run generate.py with:
 * --text - an input text or the text file path.
 * --results_path - result waveforms directory path. (default=./results).
 * --ckpt_file - checkpoint file path. (default checkpoint file is the latest file in --checkpoint_path)
 ```
-python generate.py --waveglow_path="./waveglow_256channels.pt" --text="The more you buy, the more you save."
+python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text="The more you buy, the more you save."
 ```
 or
 ```
-python generate.py --waveglow_path="./waveglow_256channels.pt" --text=test_sentences.txt
+python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text=test_sentences.txt
 ```
 
-Sample result waveforms are [here](https://gitlab-master.nvidia.com/dahn/fastspeech/tree/master/samples).
+Sample result waveforms are [here](samples).
 
-To generate waveforms with the whole pipeline of FastSpeech and WaveGlow with TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/trt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path.
+To generate waveforms with the whole pipeline of FastSpeech and WaveGlow with TensorRT, extract a WaveGlow TRT engine file through https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt and run generate.py with --hparam=trt.yaml and --waveglow_engine_path.
 
 ```
-python generate.py --hparam=trt.yaml --waveglow_path="./waveglow_256channels.pt" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."
+python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."
 ```
 
-Sample result waveforms are [FP32](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples) and [FP16](https://gitlab-master.nvidia.com/dahn/fastspeech/-/tree/master/fastspeech/trt/samples_fp16).
+Sample result waveforms are [FP32](fastspeech/trt/samples) and [FP16](fastspeech/trt/samples_fp16).
 
 
 ## Performance
@@ -391,7 +382,17 @@ The following sections provide details on how we achieved our performance and ac
 
 #### Training performance results
 
-Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in number of mels per second) were averaged over an entire training epoch.
+Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) on <!--NVIDIA DGX A100 with 8x A100 40G GPUs and -->NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in number of mels per second) were averaged over an entire training epoch.
+
+<!-- ##### Training performance: NVIDIA DGX A100 (8x A100 40GB)
+
+| GPUs | Batch size / GPU | Throughput(mels/s) - FP32 | Throughput(mels/s) - mixed precision | Throughput speedup (FP32 - mixed precision) | Multi-GPU Weak scaling - FP32 | Multi-GPU Weak scaling - mixed precision
+|---|----|--------|--------|------|-----|------|
+| 1 | 32 | | | | | 1 |
+| 4 | 32 | | | | | |
+| 8 | 32 | | | | | | -->
+
+##### Training performance: NVIDIA DGX-1 (8x V100 16GB)
 
 | GPUs | Batch size / GPU | Throughput(mels/s) - FP32 | Throughput(mels/s) - mixed precision | Throughput speedup (FP32 - mixed precision) | Multi-GPU Weak scaling - FP32 | Multi-GPU Weak scaling - mixed precision
 |---|----|--------|--------|------|-----|------|
@@ -401,7 +402,7 @@ Our results were obtained by running the script in [training performance benchma
 
 #### Inference performance results
 
-Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) in the PyTorch-20.03-py3 NGC container on NVIDIA DGX-1 with 1x V100 16GB GPU and a NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch and comparisons by framework with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor which tells how many seconds of speech are generated in 1 second of compute. The used WaveGlow model is a 256-channel model. The numbers reported below were taken with a moderate length of 128 characters.
+Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) on NVIDIA DGX-1 with 1x V100 16GB GPU and a NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch and comparisons by framework with batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor which tells how many seconds of speech are generated in 1 second of compute. The used WaveGlow model is a 256-channel model. The numbers reported below were taken with a moderate length of 128 characters.
 
 ##### Inference performance: NVIDIA DGX-1 (1x V100 16GB)
 
@@ -442,9 +443,12 @@ Our results were obtained by running the script in [inference performance benchm
 ## Release notes
 
 ### Changelog
+Oct 2020
+- PyTorch 1.7, TensorRT 7.2 support <!--and Nvidia Ampere architecture support-->
+
 July 2020
 - Initial release
 
 ### Known issues
 
-There are no known issues in this release.
+There are no known issues in this release.

CUDA-Optimized/FastSpeech/fastspeech/dataset/ljspeech_dataset.py

Lines changed: 10 additions & 3 deletions
@@ -24,11 +24,15 @@
 
 import csv
 
+import pprint
+
 import librosa
 from torch.utils.data import Dataset
 import pandas as pd
 from fastspeech.text_norm import text_to_sequence
 from fastspeech import audio
+from fastspeech.utils.logging import tprint
+
 import os
 import pathlib
 
@@ -38,6 +42,8 @@
 
 from fastspeech import hparam as hp
 
+pp = pprint.PrettyPrinter(indent=4, width=1000)
+
 class LJSpeechDataset(Dataset):
 
     def __init__(self, root_path, meta_file="metadata.csv",
@@ -130,7 +136,7 @@ def __getitem__(self, idx):
         return data
 
 
-def preprocess_mel(hparam="base.yaml"):
+def preprocess_mel(hparam="base.yaml", **kwargs):
     """The script for preprocessing mel-spectrograms from the dataset.
 
     By default, this script assumes to load parameters in the default config file, fastspeech/hparams/base.yaml.
@@ -147,8 +153,9 @@ def preprocess_mel(hparam="base.yaml"):
        hparam (str, optional): Path to default config file. Defaults to "base.yaml".
    """
 
-    hp.set_hparam(hparam)
-
+    hp.set_hparam(hparam, kwargs)
+    tprint("Hparams:\n{}".format(pp.pformat(hp)))
+
     pathlib.Path(hp.mels_path).mkdir(parents=True, exist_ok=True)
 
    dataset = LJSpeechDataset(hp.dataset_path, mels_path=None)
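
The new **kwargs pass-through means hparam values can now be overridden programmatically when calling preprocess_mel, rather than only through the YAML file. A minimal sketch of such a call, assuming hp.set_hparam(hparam, kwargs) overlays the given dict on the values loaded from base.yaml and that path-style keys such as mels_path are accepted overrides:

```python
# Hypothetical usage sketch for the updated preprocess_mel signature.
# Assumption: hp.set_hparam(hparam, kwargs) merges the kwargs dict over the values
# read from fastspeech/hparams/base.yaml, and tprint then logs the merged hparams
# before preprocessing starts.
from fastspeech.dataset.ljspeech_dataset import preprocess_mel

preprocess_mel(
    hparam="base.yaml",              # default config file
    dataset_path="./LJSpeech-1.1",   # illustrative override; key names mirror base.yaml
    mels_path="./mels_ljspeech1.1",  # illustrative override
)
```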

CUDA-Optimized/FastSpeech/fastspeech/hparams/base.yaml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Path
 dataset_path: "/workspace/fastspeech/LJSpeech-1.1"
 tacotron2_path: "/workspace/fastspeech/tacotron2_statedict.pt"
-waveglow_path: "/workspace/fastspeech/waveglow_256channels.pt"
+waveglow_path: "/workspace/fastspeech/nvidia_waveglow256pyt_fp16"
 mels_path: "/workspace/fastspeech/mels_ljspeech1.1"
 aligns_path: "/workspace/fastspeech/aligns_ljspeech1.1"
 log_path: "/workspace/fastspeech/logs"

CUDA-Optimized/FastSpeech/fastspeech/hparams/trt.yaml

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ parent_yaml: 'infer.yaml'
 # Inference
 batch_size: 1 # Batch size.
 use_trt: True # Usage of TensorRT. Must be True to enable TensorRT.
-use_fp16: True # Usage of FP16. Set to True to enable half precision for the engine.
+use_fp16: True # Usage of FP16. Set to True to enable half precision for the engine.
 
 # TRT
 trt_file_path: "/workspace/fastspeech/fastspeech.fp16.b1.trt" # Built TensorRT engine file path.

CUDA-Optimized/FastSpeech/fastspeech/inferencer/waveglow_inferencer.py

Lines changed: 43 additions & 3 deletions
@@ -30,6 +30,21 @@
 from fastspeech.utils.pytorch import to_cpu_numpy, to_device_async
 from fastspeech.inferencer.denoiser import Denoiser
 
+from waveglow.model import WaveGlow
+import argparse
+
+def unwrap_distributed(state_dict):
+    """
+    Unwraps model from DistributedDataParallel.
+    DDP wraps model in additional "module.", it needs to be removed for single
+    GPU inference.
+    :param state_dict: model's state dict
+    """
+    new_state_dict = {}
+    for key, value in state_dict.items():
+        new_key = key.replace('module.', '')
+        new_state_dict[new_key] = value
+    return new_state_dict
 
 
 class WaveGlowInferencer(object):
 
@@ -40,11 +55,36 @@ def __init__(self, ckpt_file, device='cuda', use_fp16=False, use_denoiser=False)
         self.use_denoiser = use_denoiser
 
         # model
-        sys.path.append('waveglow')
-        self.model = torch.load(self.ckpt_file, map_location=self.device)['model']
+        # sys.path.append('waveglow')
+
+        from waveglow.arg_parser import parse_waveglow_args
+        parser = parser = argparse.ArgumentParser()
+        model_parser= parse_waveglow_args(parser)
+        args, _ = model_parser.parse_known_args()
+        model_config = dict(
+            n_mel_channels=args.n_mel_channels,
+            n_flows=args.flows,
+            n_group=args.groups,
+            n_early_every=args.early_every,
+            n_early_size=args.early_size,
+            WN_config=dict(
+                n_layers=args.wn_layers,
+                kernel_size=args.wn_kernel_size,
+                n_channels=args.wn_channels
+            )
+        )
+        self.model = WaveGlow(**model_config)
+
+        state_dict = torch.load(self.ckpt_file, map_location=self.device)['state_dict']
+        state_dict = unwrap_distributed(state_dict)
+        self.model.load_state_dict(state_dict)
+
+        self.model = to_device_async(self.model, self.device)
+
         self.model = self.model.remove_weightnorm(self.model)
+
         self.model.eval()
-        self.model = to_device_async(self.model, self.device)
+
         if self.use_fp16:
             self.model = self.model.half()
         self.model = self.model
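
With this change the inferencer builds the WaveGlow module from the model's argument parser and loads an NGC state_dict checkpoint, stripping the DistributedDataParallel "module." prefixes via unwrap_distributed before load_state_dict, instead of unpickling a whole model object. A minimal usage sketch under stated assumptions (the checkpoint path follows the README example; the final infer-style call is an assumption about the rest of the class, which is not shown in this diff):

```python
# Hypothetical usage sketch of the updated WaveGlowInferencer.
# Assumptions: ./nvidia_waveglow256pyt_fp16 is the NGC checkpoint referenced in the
# README, a CUDA device is available, and the class exposes an infer()-style method
# (not part of this diff) that maps mel-spectrograms to waveforms.
import torch
from fastspeech.inferencer.waveglow_inferencer import WaveGlowInferencer

waveglow = WaveGlowInferencer(
    ckpt_file="./nvidia_waveglow256pyt_fp16",  # NGC checkpoint from the README
    device="cuda",
    use_fp16=True,
)

mel = torch.randn(1, 80, 200, device="cuda").half()  # dummy mel input (batch, channels, frames)
audio = waveglow.infer(mel)                          # assumed entry point; see note above
```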
