* [PyTorch 20.10-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) or newer
* [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU<!--, or [Ampere](https://www.nvidia.com/en-us/data-center/nvidia-ampere-gpu-architecture/)-->
For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastSpeech model on the LJSpeech dataset.
2. Download and preprocess the dataset. Data is downloaded to the `./LJSpeech-1.1` directory (on the host). The `./LJSpeech-1.1` directory is mounted to the `/workspace/fastspeech/LJSpeech-1.1` location in the NGC container.
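The download commands themselves are elided in this excerpt. As a minimal sketch, fetching and unpacking LJSpeech from its public mirror (the repository's own preprocessing step then runs over `./LJSpeech-1.1`) might look like:

```
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjf LJSpeech-1.1.tar.bz2
```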
The preprocessed mel-spectrograms are stored in the `./mels_ljspeech1.1` directory.
Next, calculate alignments on the LJSpeech dataset by feeding it forward through a pre-trained teacher model. Download the NVIDIA [pretrained Tacotron2 checkpoint](https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs_VdZuJ86ZqA/view) to serve as the teacher, and set `--tacotron2_path` to the checkpoint file path. The resulting alignments are stored in the directory given by `--aligns_path`.
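A sketch of this step under stated assumptions: `align.py` and both file paths are illustrative placeholders (the actual script name in this repository may differ); only the `--tacotron2_path` and `--aligns_path` flags come from the text above.

```
# Illustrative invocation; the script name and paths are placeholders.
python align.py --tacotron2_path="./tacotron2_statedict.pt" --aligns_path="./aligns_ljspeech1.1"
```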
6. Start generation. To generate waveforms with the WaveGlow vocoder, get the [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC and place it in the home directory, for example, `./nvidia_waveglow256pyt_fp16` (one way to fetch it is sketched below).
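A possible way to fetch the checkpoint, assuming the NGC CLI is installed and configured; downloading through the catalog page linked above works equally well. The model name and version are taken from that URL.

```
# Model and version as listed in the NGC catalog URL above.
ngc registry model download-version "nvidia/waveglow_ckpt_amp_256:19.10.0"
```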
After you have trained the FastSpeech model, you can perform generation using a checkpoint stored in `./checkpoints`. The script automatically loads the latest checkpoint (if one exists), or you can pass a checkpoint file through `--ckpt_file`. It loads input texts from `./test_sentences.txt` and stores the results in the `./results` directory; you can also set the result directory path with `--results_path`.
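For example, an invocation without `--text` (mirroring the sample-text command below) reads its inputs from `./test_sentences.txt`:

```
python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints"
```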
You can also run with a sample text:
```
python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --checkpoint_path="./checkpoints" --text="The more you buy, the more you save."
```
7. Accelerate generation (inference of FastSpeech and WaveGlow) with TensorRT. Set the parameters config file with `--hparam=trt.yaml` to enable TensorRT inference mode. To prepare WaveGlow for TensorRT, first get an ONNX file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/tensorrt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt), convert it to a TensorRT engine using `scripts/waveglow/convert_onnx2trt.py`, and copy the engine into the home directory, for example, `./waveglow.fp16.trt`. Then run with `--waveglow_engine_path`:
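For example, with the sample text used throughout this guide (the same command appears under Generation below):

```
python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."
```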
For more details, refer to [accelerating inference with TensorRT](fastspeech/trt).
#### Generation
To generate waveforms with the WaveGlow vocoder, get the [pretrained WaveGlow model](https://ngc.nvidia.com/catalog/models/nvidia:waveglow_ckpt_amp_256/files?version=19.10.0) from NGC and place it in the home directory, for example, `./nvidia_waveglow256pyt_fp16`.
Run `generate.py` with:
* `--text` - an input text or a text file path.
* `--results_path` - result waveforms directory path (default=`./results`).
* `--ckpt_file` - checkpoint file path (the default is the latest file in `--checkpoint_path`).
```
python generate.py --waveglow_path="./nvidia_waveglow256pyt_fp16" --text="The more you buy, the more you save."
```
Sample result waveforms are [here](samples).
To generate waveforms with the whole FastSpeech and WaveGlow pipeline on TensorRT, extract a WaveGlow TRT engine file via [DeepLearningExamples/PyTorch/SpeechSynthesis/Tacotron2/tensorrt](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/tensorrt) and run `generate.py` with `--hparam=trt.yaml` and `--waveglow_engine_path`:
```
python generate.py --hparam=trt.yaml --waveglow_path="./nvidia_waveglow256pyt_fp16" --waveglow_engine_path="waveglow.fp16.trt" --text="The more you buy, the more you save."
```
Sample result waveforms are [FP32](fastspeech/trt/samples) and [FP16](fastspeech/trt/samples_fp16).
## Performance
The following sections provide details on how we achieved our performance and accuracy in training and inference.
#### Training performance results
Our results were obtained by running the script in [training performance benchmark](#training-performance-benchmark) on <!--NVIDIA DGX A100 with 8x A100 40G GPUs and -->NVIDIA DGX-1 with 8x V100 16G GPUs. Performance numbers (in number of mels per second) were averaged over an entire training epoch.
#### Inference performance results
Our results were obtained by running the script in [inference performance benchmark](#inference-performance-benchmark) on NVIDIA DGX-1 with 1x V100 16GB GPU and an NVIDIA T4. The following tables show inference statistics for the FastSpeech and WaveGlow text-to-speech system on PyTorch, with comparisons by framework, using batch size 1 in FP16, gathered from 1000 inference runs. Latency is measured from the start of FastSpeech inference to the end of WaveGlow inference. The tables include average latency, latency standard deviation, and latency confidence intervals. Throughput is measured as the number of generated audio samples per second. RTF is the real-time factor, which tells how many seconds of speech are generated in 1 second of compute. The WaveGlow model used is a 256-channel model. The numbers reported below were taken with a moderate input length of 128 characters.