
Commit 799660f

[Kaldi] Adding Jupyter notebook
1 parent b5741a9 commit 799660f

19 files changed (+2349, -150 lines)

Kaldi/SpeechRecognition/Dockerfile

Lines changed: 4 additions & 7 deletions
@@ -11,13 +11,10 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-FROM nvcr.io/nvidia/kaldi:19.12-online-beta-py3 as kb
+FROM nvcr.io/nvidia/kaldi:20.03-py3 as kb
+FROM nvcr.io/nvidia/tritonserver:20.03-py3
 ENV DEBIAN_FRONTEND=noninteractive

-ARG PYVER=3.6
-
-FROM nvcr.io/nvidia/tensorrtserver:19.12-py3
-
 # Kaldi dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
        automake \
@@ -27,8 +24,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
        gawk \
        libatlas3-base \
        libtool \
-       python$PYVER \
-       python$PYVER-dev \
+       python3.6 \
+       python3.6-dev \
        sox \
        subversion \
        unzip \

Kaldi/SpeechRecognition/Dockerfile.client

Lines changed: 6 additions & 4 deletions
@@ -11,8 +11,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

-FROM nvcr.io/nvidia/kaldi:19.12-online-beta-py3 as kb
-FROM nvcr.io/nvidia/tensorrtserver:19.12-py3-clientsdk
+FROM nvcr.io/nvidia/kaldi:20.03-py3 as kb
+FROM nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk

 # Kaldi dependencies
 RUN apt-get update && apt-get install -y --no-install-recommends \
@@ -23,8 +23,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
        gawk \
        libatlas3-base \
        libtool \
-       python$PYVER \
-       python$PYVER-dev \
+       python3.6 \
+       python3.6-dev \
        sox \
        subversion \
        unzip \
@@ -36,6 +36,8 @@ RUN apt-get update && apt-get install -y --no-install-recommends \
 COPY --from=kb /opt/kaldi /opt/kaldi
 ENV LD_LIBRARY_PATH /opt/kaldi/src/lib/:$LD_LIBRARY_PATH

+COPY scripts /workspace/scripts
+
 COPY kaldi-asr-client /workspace/src/clients/c++/kaldi-asr-client
 RUN echo "add_subdirectory(kaldi-asr-client)" >> "/workspace/src/clients/c++/CMakeLists.txt"
 RUN cd /workspace/build/ && make -j16 trtis-clients
Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+# Copyright (c) 2019 NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+FROM nvcr.io/nvidia/tritonserver:20.03-py3-clientsdk
+
+# Kaldi dependencies
+RUN apt-get update && apt-get install -y jupyter \
+        python3-pyaudio \
+        python-pyaudio \
+        libasound-dev \
+        portaudio19-dev \
+        libportaudio2 \
+        libportaudiocpp0 \
+        libsndfile1 \
+        alsa-base \
+        alsa-utils \
+        vim
+
+RUN python3 -m pip uninstall -y pip
+RUN apt install python3-pip --reinstall
+RUN pip3 install matplotlib soundfile librosa sounddevice

Kaldi/SpeechRecognition/README.md

Lines changed: 22 additions & 17 deletions
@@ -1,6 +1,6 @@
-# Kaldi ASR Integration With TensorRT Inference Server
+# Kaldi ASR Integration With Triton

-This repository provides a Kaldi ASR custom backend for the NVIDIA TensorRT Inference Server (TRTIS). It can be used to demonstrate high-performance online inference on Kaldi ASR models. This includes handling the gRPC communication between the TensorRT Inference Server and clients, and the dynamic batching of inference requests. This repository is tested and maintained by NVIDIA.
+This repository provides a Kaldi ASR custom backend for the NVIDIA Triton (former TensorRT Inference Server). It can be used to demonstrate high-performance online inference on Kaldi ASR models. This includes handling the gRPC communication between the Triton and clients, and the dynamic batching of inference requests. This repository is tested and maintained by NVIDIA.

 ## Table Of Contents

@@ -33,9 +33,9 @@ This repository provides a Kaldi ASR custom backend for the NVIDIA TensorRT Infe

 This repository provides a wrapper around the online GPU-accelerated ASR pipeline from the paper [GPU-Accelerated Viterbi Exact Lattice Decoder for Batched Online and Offline Speech Recognition](https://arxiv.org/abs/1910.10032). That work includes a high-performance implementation of a GPU HMM Decoder, a low-latency Neural Net driver, fast Feature Extraction for preprocessing, and new ASR pipelines tailored for GPUs. These different modules have been integrated into the Kaldi ASR framework.

-This repository contains a TensorRT Inference Server custom backend for the Kaldi ASR framework. This custom backend calls the high-performance online GPU pipeline from the Kaldi ASR framework. This TensorRT Inference Server integration provides ease-of-use to Kaldi ASR inference: gRPC streaming server, dynamic sequence batching, and multi-instances support. A client connects to the gRPC server, streams audio by sending chunks to the server, and gets back the inferred text as an answer (see [Input/Output](#input-output)). More information about the TensorRT Inference Server can be found [here](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/).
+This repository contains a Triton custom backend for the Kaldi ASR framework. This custom backend calls the high-performance online GPU pipeline from the Kaldi ASR framework. This Triton integration provides ease-of-use to Kaldi ASR inference: gRPC streaming server, dynamic sequence batching, and multi-instances support. A client connects to the gRPC server, streams audio by sending chunks to the server, and gets back the inferred text as an answer (see [Input/Output](#input-output)). More information about the Triton can be found [here](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/).

-This TensorRT Inference Server integration is meant to be used with the LibriSpeech model for demonstration purposes. We include a pre-trained version of this model to allow you to easily test this work (see [Quick Start Guide](#quick-start-guide)). Both the TensorRT Inference Server integration and the underlying Kaldi ASR online GPU pipeline are a work in progress and will support more functionalities in the future. This includes online iVectors not currently supported in the Kaldi ASR GPU online pipeline and being replaced by a zero vector (see [Known issues](#known-issues)). Support for a custom Kaldi model is experimental (see [Using a custom Kaldi model](#using-custom-kaldi-model)).
+This Triton integration is meant to be used with the LibriSpeech model for demonstration purposes. We include a pre-trained version of this model to allow you to easily test this work (see [Quick Start Guide](#quick-start-guide)). Both the Triton integration and the underlying Kaldi ASR online GPU pipeline are a work in progress and will support more functionalities in the future. Support for a custom Kaldi model is experimental (see [Using a custom Kaldi model](#using-custom-kaldi-model)).

 ### Reference model

@@ -60,7 +60,7 @@ Details about parameters can be found in the [Parameters](#parameters) section.

 ### Requirements

-This repository contains Dockerfiles which extends the Kaldi and TensorRT Inference Server NVIDIA GPU Cloud (NGC) containers and encapsulates some dependencies. Aside from these dependencies, ensure you have [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) installed.
+This repository contains Dockerfiles which extends the Kaldi and Triton NVIDIA GPU Cloud (NGC) containers and encapsulates some dependencies. Aside from these dependencies, ensure you have [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) installed.


 For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:
@@ -108,7 +108,7 @@ The following command will stream 1000 parallel streams to the server. The `-p`

 ### Parameters

-The configuration is done through the `config.pbtxt` file available in `model-repo/` directory. It allows you to specify the following:
+The configuration is done through the `config.pbtxt` file available in the `model-repo/kaldi_online/` directory. It allows you to specify the following:

 #### Model path

@@ -141,7 +141,7 @@ The inference engine configuration parameters configure the inference engine. Th

 ### Inference process

-Inference is done through simulating concurrent users. Each user is attributed to one utterance from the LibriSpeech dataset. It streams that utterance by cutting it into chunks and gets the final `TEXT` output once the final chunk has been sent. A parameter sets the number of active users being simulated in parallel.
+Inference is done through simulating concurrent users. Each user is attributed to one utterance from the LibriSpeech dataset. It streams that utterance by cutting it into chunks and gets the final `TEXT` output once the final chunk has been sent. The `-c` parameter sets the number of active users being simulated in parallel.

 ### Client command-line parameters

@@ -187,7 +187,8 @@ Even if only the best path is used, we are still generating a full lattice for b

 Support for Kaldi ASR models that are different from the provided LibriSpeech model is experimental. However, it is possible to modify the [Model Path](#model-path) section of the config file `model-repo/kaldi_online/config.pbtxt` to set up your own model.

-The models and Kaldi allocators are currently not shared between instances. This means that if your model is large, you may end up with not enough memory on the GPU to store two different instances. If that's the case, you can set `count` to `1` in the `instance_group` section of the config file.
+The models and Kaldi allocators are currently not shared between instances. This means that if your model is large, you may end up with not enough memory on the GPU to store two different instances. If that's the case,
+you can set `count` to `1` in the [`instance_group` section](https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#instance-groups) of the config file.

 ## Performance

@@ -218,16 +219,17 @@ Our results were obtained by:
 1. Building and starting the server as described in [Quick Start Guide](#quick-start-guide).
 2. Running `scripts/run_inference_all_v100.sh` and `scripts/run_inference_all_t4.sh`

+
 | GPU | Realtime I/O | Number of parallel audio channels | Throughput (RTFX) | Latency | | | |
 | ------ | ------ | ------ | ------ | ------ | ------ | ------ |------ |
 | | | | | 90% | 95% | 99% | Avg |
-| V100 | No | 2000 | 1769.8 | N/A | N/A | N/A | N/A |
-| V100 | Yes | 1500 | 1220 | 0.424 | 0.473 | 0.758 | 0.345 |
-| V100 | Yes | 1000 | 867.4 | 0.358 | 0.405 | 0.707 | 0.276 |
-| V100 | Yes | 800 | 647.8 | 0.304 | 0.325 | 0.517 | 0.238 |
-| T4 | No | 1000 | 906.7 | N/A | N/A | N/A| N/A |
-| T4 | Yes | 700 | 629.6 | 0.629 | 0.782 | 1.01 | 0.463 |
-| T4 | Yes | 400 | 373.7 | 0.417 | 0.441 | 0.690 | 0.349 |
+| V100 | No | 2000 | 1506.5 | N/A | N/A | N/A | N/A |
+| V100 | Yes | 1500 | 1243.2 | 0.582 | 0.699 | 1.04 | 0.400 |
+| V100 | Yes | 1000 | 884.1 | 0.379 | 0.393 | 0.788 | 0.333 |
+| V100 | Yes | 800 | 660.2 | 0.334 | 0.340 | 0.438 | 0.288 |
+| T4 | No | 1000 | 675.2 | N/A | N/A | N/A| N/A |
+| T4 | Yes | 700 | 629.2 | 0.945 | 1.08 | 1.27 | 0.645 |
+| T4 | Yes | 400 | 373.7 | 0.579 | 0.624 | 0.758 | 0.452 |

 ## Release notes

@@ -236,6 +238,9 @@ Our results were obtained by:
 January 2020
 * Initial release

-### Known issues
+April 2020
+* Printing WER accuracy in Triton client
+* Using the latest Kaldi GPU ASR pipeline, extended support for features (ivectors, fbanks)

-Only mfcc features are supported at this time. The reference model used in the benchmark scripts requires both mfcc and iVector features to deliver the best accuracy. Support for iVector features will be added in a future release.
+### Known issues
+* No multi-gpu support for the Triton integration
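
To make the inference process described in the README concrete (one simulated user per utterance, audio cut into chunks, sequence flags on the first and last chunk, latency stats only for online runs), here is a minimal sketch against the TRTISASRClient interface changed later in this commit. The chunk length, server address, model name, and the assumption that chunk_byte_size is expressed in bytes are illustrative and not taken from the repository.

// Sketch only: stream one utterance through TRTISASRClient as the README
// describes. Chunk size, URL, and model name are placeholder values.
#include <algorithm>
#include <cstdint>
#include <vector>

#include "asr_client_imp.h"

static void StreamOneUtterance(TRTISASRClient& client, uint64_t corr_id,
                               std::vector<float>& wave) {
  const size_t chunk_samples = 8000;  // 0.5 s at 16 kHz (assumed)
  for (size_t off = 0; off < wave.size(); off += chunk_samples) {
    size_t n = std::min(chunk_samples, wave.size() - off);
    bool first = (off == 0);
    bool last = (off + n >= wave.size());
    // chunk_byte_size is assumed to be the chunk length in bytes.
    client.SendChunk(corr_id, first, last, wave.data() + off,
                     static_cast<int>(n * sizeof(float)));
  }
}

int main() {
  TRTISASRClient client("localhost:8001", "kaldi_online",
                        /*ncontextes=*/10, /*print_results=*/true);
  std::vector<float> wave(16000 * 5, 0.0f);  // stand-in for a real utterance
  StreamOneUtterance(client, /*corr_id=*/1, wave);
  client.WaitForCallbacks();
  client.PrintStats(/*print_latency_stats=*/true);  // online (-o style) run
  return 0;
}

The actual kaldi_asr_parallel_client drives many correlation IDs concurrently across several gRPC contexts; this sketch only shows the per-utterance sequence of calls.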

Kaldi/SpeechRecognition/kaldi-asr-client/CMakeLists.txt

Lines changed: 5 additions & 0 deletions
@@ -32,6 +32,7 @@ target_include_directories(
   /opt/kaldi/src/
 )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -w") # openfst yields many warnings
 target_include_directories(
   kaldi_asr_parallel_client
   PRIVATE
@@ -58,6 +59,10 @@ target_link_libraries(
   PRIVATE /opt/kaldi/src/lib/libkaldi-base.so
 )

+target_link_libraries(
+  kaldi_asr_parallel_client
+  PRIVATE /opt/kaldi/src/lat/libkaldi-lat.so
+)

 install(
   TARGETS kaldi_asr_parallel_client

Kaldi/SpeechRecognition/kaldi-asr-client/asr_client_imp.cc

Lines changed: 67 additions & 19 deletions
@@ -18,6 +18,11 @@
 #include <cstring>
 #include <iomanip>
 #include <numeric>
+#include <sstream>
+
+#include "lat/kaldi-lattice.h"
+#include "lat/lattice-functions.h"
+#include "util/kaldi-table.h"

 #define FAIL_IF_ERR(X, MSG) \
   { \
@@ -31,11 +36,12 @@
 void TRTISASRClient::CreateClientContext() {
   contextes_.emplace_back();
   ClientContext& client = contextes_.back();
-  FAIL_IF_ERR(nic::InferGrpcStreamContext::Create(
-                  &client.trtis_context, /*corr_id*/ -1, url_, model_name_,
-                  /*model_version*/ -1,
-                  /*verbose*/ false),
-              "unable to create context");
+  FAIL_IF_ERR(
+      nic::InferGrpcStreamContext::Create(&client.trtis_context,
+                                          /*corr_id*/ -1, url_, model_name_,
+                                          /*model_version*/ -1,
+                                          /*verbose*/ false),
+      "unable to create context");
 }

 void TRTISASRClient::SendChunk(ni::CorrelationID corr_id,
@@ -59,6 +65,8 @@ void TRTISASRClient::SendChunk(ni::CorrelationID corr_id,
     options->SetFlag(ni::InferRequestHeader::FLAG_SEQUENCE_END,
                      end_of_sequence);
     for (const auto& output : context.Outputs()) {
+      if (output->Name() == "TEXT" && !print_results_)
+        continue;  // no need for text output if not printing
       options->AddRawResult(output);
     }
   }
@@ -89,27 +97,33 @@ void TRTISASRClient::SendChunk(ni::CorrelationID corr_id,
   total_audio_ += (static_cast<double>(nsamples) / 16000.);  // TODO freq
   double start = gettime_monotonic();
   FAIL_IF_ERR(context.AsyncRun([corr_id, end_of_sequence, start, this](
-                                   nic::InferContext* ctx,
-                                   const std::shared_ptr<nic::InferContext::Request>& request) {
+                                   nic::InferContext* ctx,
+                                   const std::shared_ptr<
+                                       nic::InferContext::Request>& request) {
    if (end_of_sequence) {
      double elapsed = gettime_monotonic() - start;
-     std::string out;
      std::map<std::string, std::unique_ptr<nic::InferContext::Result>> results;
      ctx->GetAsyncRunResults(request, &results);

-     if (results.size() != 1) {
-       std::cerr << "Warning: Could not read output for corr_id " << corr_id
-                 << std::endl;
+     if (results.empty()) {
+       std::cerr << "Warning: Could not read "
+                    "output for corr_id "
+                 << corr_id << std::endl;
      } else {
-       FAIL_IF_ERR(results["TEXT"]->GetRawAtCursor(0, &out),
-                   "unable to get TEXT output");
        if (print_results_) {
+         std::string text;
+         FAIL_IF_ERR(results["TEXT"]->GetRawAtCursor(0, &text),
+                     "unable to get TEXT output");
          std::lock_guard<std::mutex> lk(stdout_m_);
-         std::cout << "CORR_ID " << corr_id << "\t\t" << out << std::endl;
+         std::cout << "CORR_ID " << corr_id << "\t\t" << text << std::endl;
        }
+
+       std::string lattice_bytes;
+       FAIL_IF_ERR(results["RAW_LATTICE"]->GetRawAtCursor(0, &lattice_bytes),
+                   "unable to get RAW_LATTICE output");
        {
          std::lock_guard<std::mutex> lk(results_m_);
-         results_.insert({corr_id, {std::move(out), elapsed}});
+         results_.insert({corr_id, {std::move(lattice_bytes), elapsed}});
        }
      }
      n_in_flight_.fetch_sub(1, std::memory_order_relaxed);
@@ -125,7 +139,7 @@ void TRTISASRClient::WaitForCallbacks() {
   }
 }

-void TRTISASRClient::PrintStats() {
+void TRTISASRClient::PrintStats(bool print_latency_stats) {
   double now = gettime_monotonic();
   double diff = now - started_at_;
   double rtf = total_audio_ / diff;
@@ -150,9 +164,16 @@
               latencies.size();

   std::cout << std::setprecision(3);
-  std::cout << "Latencies:\t90\t\t95\t\t99\t\tAvg\n";
-  std::cout << "\t\t" << lat_90 << "\t\t" << lat_95 << "\t\t" << lat_99
-            << "\t\t" << avg << std::endl;
+  std::cout << "Latencies:\t90%\t\t95%\t\t99%\t\tAvg\n";
+  if (print_latency_stats) {
+    std::cout << "\t\t" << lat_90 << "\t\t" << lat_95 << "\t\t" << lat_99
+              << "\t\t" << avg << std::endl;
+  } else {
+    std::cout << "\t\tN/A\t\tN/A\t\tN/A\t\tN/A" << std::endl;
+    std::cout << "Latency statistics are printed only when the "
+                 "online option is set (-o)."
+              << std::endl;
+  }
 }

 TRTISASRClient::TRTISASRClient(const std::string& url,
@@ -175,3 +196,30 @@ TRTISASRClient::TRTISASRClient(const std::string& url,
   started_at_ = gettime_monotonic();
   total_audio_ = 0;
 }
+
+void TRTISASRClient::WriteLatticesToFile(
+    const std::string& clat_wspecifier,
+    const std::unordered_map<ni::CorrelationID, std::string>&
+        corr_id_and_keys) {
+  kaldi::CompactLatticeWriter clat_writer;
+  clat_writer.Open(clat_wspecifier);
+  std::lock_guard<std::mutex> lk(results_m_);
+  for (auto& p : corr_id_and_keys) {
+    ni::CorrelationID corr_id = p.first;
+    const std::string& key = p.second;
+    auto it = results_.find(corr_id);
+    if(it == results_.end()) {
+      std::cerr << "Cannot find lattice for corr_id " << corr_id << std::endl;
+      continue;
+    }
+    const std::string& raw_lattice = it->second.raw_lattice;
+    // We could in theory write directly the binary hold in raw_lattice (it is
+    // in the kaldi lattice format) However getting back to a CompactLattice
+    // object allows us to us CompactLatticeWriter
+    std::istringstream iss(raw_lattice);
+    kaldi::CompactLattice* clat = NULL;
+    kaldi::ReadCompactLattice(iss, true, &clat);
+    clat_writer.Write(key, *clat);
+  }
+  clat_writer.Close();
+}
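
Since the client now stores raw lattice bytes per correlation ID and serializes them through Kaldi's CompactLatticeWriter, a natural companion is reading the resulting archive back. The sketch below is a hedged illustration: the correlation-ID-to-key map and the ark paths are invented, and only the WriteLatticesToFile signature and the Kaldi lattice/table types appear in this commit.

// Sketch only: persist client-side lattices, then re-read the archive.
// Utterance keys and archive paths are illustrative placeholders.
#include <iostream>
#include <string>
#include <unordered_map>

#include "asr_client_imp.h"
#include "lat/kaldi-lattice.h"

void DumpAndInspectLattices(TRTISASRClient& client) {
  // Each correlation ID used while streaming maps to the utterance key
  // that should label its lattice in the output archive.
  std::unordered_map<ni::CorrelationID, std::string> keys = {
      {1, "utt-0001"}, {2, "utt-0002"}};
  client.WriteLatticesToFile("ark:/tmp/lattices.ark", keys);

  // Read the archive back with Kaldi's sequential table reader.
  kaldi::SequentialCompactLatticeReader reader("ark:/tmp/lattices.ark");
  for (; !reader.Done(); reader.Next()) {
    std::cout << reader.Key() << ": " << reader.Value().NumStates()
              << " lattice states" << std::endl;
  }
}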

Kaldi/SpeechRecognition/kaldi-asr-client/asr_client_imp.h

Lines changed: 4 additions & 2 deletions
@@ -15,6 +15,7 @@
 #include <queue>
 #include <string>
 #include <vector>
+#include <unordered_map>

 #include "request_grpc.h"

@@ -52,7 +53,7 @@ class TRTISASRClient {
   std::mutex stdout_m_;

   struct Result {
-    std::string text;
+    std::string raw_lattice;
     double latency;
   };

@@ -64,7 +65,8 @@
   void SendChunk(uint64_t corr_id, bool start_of_sequence, bool end_of_sequence,
                  float* chunk, int chunk_byte_size);
   void WaitForCallbacks();
-  void PrintStats();
+  void PrintStats(bool print_latency_stats);
+  void WriteLatticesToFile(const std::string &clat_wspecifier, const std::unordered_map<ni::CorrelationID, std::string> &corr_id_and_keys);

   TRTISASRClient(const std::string& url, const std::string& model_name,
                  const int ncontextes, bool print_results);
