NVIDIA TensorRT is a platform for high-performance deep learning inference. It includes a deep learning inference optimizer and runtime that deliver low latency and high throughput for inference applications. More information on how to perform inference using TensorRT can be found in the subfolder [./trt/README.md](trt/README.md).
### Deploying the BERT model using TensorRT Inference Server
The [NVIDIA TensorRT Inference Server](https://github.com/NVIDIA/tensorrt-inference-server) provides a datacenter and cloud inferencing solution optimized for NVIDIA GPUs. The server provides an inference service via an HTTP or gRPC endpoint, allowing remote clients to request inferencing for any number of GPU or CPU models being managed by the server. More information on how to perform inference using `TensorRT Inference Server` can be found in the subfolder `./trtis/README.md`.
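As a quick sanity check, the server's HTTP endpoint can be probed before issuing any inference requests. The sketch below assumes the server is running locally with its default HTTP port (8000) and uses the v1 REST API shipped with that TensorRT Inference Server generation; see `./trtis/README.md` for the full client walkthrough.

```bash
# Assumes a TensorRT Inference Server running locally on the default HTTP
# port; both endpoints belong to the v1 REST API of that release.
curl localhost:8000/api/health/ready   # returns 200 once the server can accept requests
curl localhost:8000/api/status         # text-protobuf status of every loaded model
```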
###### Pre-training training performance: single-node on 16G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 16G GPUs. Performance (in sentences per second) is the steady-state throughput with the number of accumulation steps set to 1.
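For illustration, a single-node throughput run can be launched as sketched below. The positional argument order and the values shown are assumptions, not the benchmarked configuration; consult the header of `scripts/run_pretraining_lamb.sh` for the authoritative argument list before running.

```bash
#!/usr/bin/env bash
# Hypothetical single-node benchmark launch; the argument order and defaults
# are assumptions -- verify against the header of scripts/run_pretraining_lamb.sh.
train_batch_size_phase1=64
train_batch_size_phase2=8
eval_batch_size=8
learning_rate_phase1=7.5e-4
learning_rate_phase2=5e-4
precision=fp16                    # fp16 enables mixed precision; use fp32 for the baseline
use_xla=true
num_gpus=8
warmup_steps_phase1=2000
warmup_steps_phase2=200
train_steps=7820
save_checkpoints_steps=100
num_accumulation_steps_phase1=1   # accumulation steps set to 1, matching the tables above
num_accumulation_steps_phase2=1
bert_model=large

bash scripts/run_pretraining_lamb.sh \
  "$train_batch_size_phase1" "$train_batch_size_phase2" "$eval_batch_size" \
  "$learning_rate_phase1" "$learning_rate_phase2" "$precision" "$use_xla" "$num_gpus" \
  "$warmup_steps_phase1" "$warmup_steps_phase2" "$train_steps" "$save_checkpoints_steps" \
  "$num_accumulation_steps_phase1" "$num_accumulation_steps_phase2" "$bert_model"
```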
###### Pre-training training performance: multi-node on 16G
Our results were obtained by running the `run.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-1 systems with 8x V100 16G GPUs each. Performance (in sentences per second) is the steady-state throughput with the number of accumulation steps set to 1.
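A multi-node launch typically goes through a cluster scheduler. The sketch below assumes a SLURM cluster, which `run.sub` targets, and hypothetical environment variable names for the batch size and accumulation steps; `run.sub` generally needs site-specific adaptation, so inspect it to see which variables it actually consumes.

```bash
# Hypothetical SLURM submission for a 4-node (4 x 8 GPU) run; the environment
# variable names are assumptions -- check run.sub for the real ones.
BATCHSIZE=16 LEARNING_RATE=8.12e-4 NUM_ACCUMULATION_STEPS=1 PHASE=1 \
  sbatch -N 4 --ntasks-per-node=8 run.sub
```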
###### Pre-training training performance: single-node on 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-1 with 8x V100 32G GPUs. Performance (in sentences per second) is the steady-state throughput with the number of accumulation steps set to 1.
###### Pre-training training performance: single-node on DGX-2 32G
Our results were obtained by running the `scripts/run_pretraining_lamb.sh` training script in the TensorFlow 19.08-py3 NGC container on NVIDIA DGX-2 with 16x V100 32G GPUs. Performance (in sentences per second) is the steady-state throughput with the number of accumulation steps set to 1.
###### Pre-training training performance: multi-node on DGX-2H 32G
Our results were obtained by running the `run.sub` training script in the TensorFlow 19.08-py3 NGC container using multiple NVIDIA DGX-2 systems with 16x V100 32G GPUs each. Performance (in sentences per second) is the steady-state throughput with the number of accumulation steps set to 1.