Very large deep neural networks (DNNs), whether applied to natural language processing (e.g., GPT-3), computer vision (e.g., huge Vision Transformers), or speech AI (e.g., wav2vec 2.0), have certain properties that set them apart from their smaller counterparts. As DNNs become larger and are trained on progressively larger datasets, they can adapt to new tasks with just a handful of training examples, accelerating the route toward general artificial intelligence. Training models that contain tens to hundreds of billions of parameters on vast datasets isn't trivial and requires a unique combination of AI, high-performance computing (HPC), and systems knowledge. We'll demonstrate how to train the largest of neural networks and deploy them to production. In this workshop, you'll learn how to:
- Train neural networks across multiple servers
- Use techniques such as activation checkpointing, gradient accumulation, and various forms of model parallelism to overcome the challenges associated with large-model memory footprint
- Capture and understand training performance characteristics to optimize model architecture
- Deploy very large multi-GPU models to production using NVIDIA Triton™ Inference Server
- The cost of labels limits the utility of supervised deep learning models
- The scaling laws - loss decreases as the training data size increases (a common power-law form is sketched after this list)
- Few Shot Learning
- Learning from far fewer examples; larger models make increasingly efficient use of in-context information
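A commonly cited form of these scaling laws is the power law of Kaplan et al. (2020); the symbols below - test loss L, dataset size D, parameter count N, and the fitted constants D_c, N_c with exponents alpha_D, alpha_N - come from that paper, not from this workshop:

```latex
% Test loss falls off as a power law in dataset size D and, analogously, in parameter count N
% (Kaplan et al., 2020, "Scaling Laws for Neural Language Models").
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}
```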
Model Tuning & Prompt Tuning
- Prompt tuning is where you essentially have the model learn some 'virtual tokens' - basically an embedding that gets prepended to your prompt. The model learns, given some examples, what a good prompt looks like (although the prompt isn't human-readable - hence 'virtual prompt')
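A minimal PyTorch sketch of the idea; the class name, the 20 virtual tokens, and the 768-dim embeddings are illustrative, and a random tensor stands in for the frozen language model's input embeddings:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable 'virtual tokens' prepended to the input embeddings (prompt tuning)."""
    def __init__(self, n_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # These embeddings are the only trainable parameters; the base LM stays frozen.
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen model's embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)  # (batch, n_virtual + seq_len, embed_dim)

soft_prompt = SoftPrompt(n_virtual_tokens=20, embed_dim=768)
fake_embeds = torch.randn(2, 16, 768)   # stand-in for frozen_lm.embed(input_ids)
extended = soft_prompt(fake_embeds)     # this would be fed into the frozen transformer layers
print(extended.shape)                   # torch.Size([2, 36, 768])
```

Only `soft_prompt.prompt` is optimized during tuning; the base model's weights stay frozen.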
Towards General Intelligence
- Does not need labelled data
- A single generic model can perform more than one task
- More generalized: in addition to language also learns higher level concepts, styles, etc.
- Computationally expensive (~500 billion parameters)
Execution Time
- Scale of compute
Transformer
- Common fault-tolerance mechanism in large-model training:
- Model checkpointing is standard practice these days; training is restarted from the latest checkpoint in case of HW failures (see the checkpoint/restart sketch after this list)
- Data Parallel = same model, different slices of the data in each batch; Model Parallel = same data, different slices of the model
- NVIDIA DGX SuperPOD solution
- Automatic Mixed Precision
- Activation Checkpointing
- Trading compute for memory (see the sketch after this list)
- Gradient Accumulation
- Gradient accumulation increases the effective batch size. Increasing the batch size is a very common trick to speed up training, since GPUs are massively parallel and can process more data points at once with a larger batch (see the training-loop sketch after this list).
- Offloading - trading memory capacity for bandwidth (offload tensors not currently used in computation from GPU memory to CPU memory)
- Model Implementation
- Tensor and pipeline parallelism
- Data parallelism, distributed optimizer
- Automatic Mixed Precision
- BERT, GPT, T5, Vision Transformer
- Achieve high utilization and scaling to thousands of GPUs
- Working towards trillion-parameter models
- Parallel training algorithms for deep learning: DDP, TP, PP, ZeRO
- Distributed Parallel Training: Data Parallelism and Model Parallelism
- Differences between doing data/model parallelism on GPU clusters vs. CPU clusters
- Tensor parallelism just breaks up the matrix operations used in forward/back propagation (see the column-parallel sketch after this list)
- In practice, tensor parallelism is used only when NVLink is available (in the Ampere generation of GPUs, this is limited to the GPUs within a single node)
- 1) Speed up performance through parallelization
- 2) Handle huge model sizes that cannot fit in the memory of a single machine
- Overlap computation with communication overhead to improve the utilization rate of compute resources
- Optimize the neural network's topology so that the amount of data transferred during the synchronization phase of parallel computation is reduced, without a noticeable loss in accuracy
- Improve the utilization rate of compute resources without a noticeable loss in accuracy
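A minimal sketch of the checkpoint/restart pattern mentioned above, assuming plain single-process PyTorch and an illustrative local file path (real large-model runs write sharded checkpoints to shared storage):

```python
import os
import torch

CKPT_PATH = "checkpoint.pt"   # illustrative path; in practice this lives on shared storage

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to resume: weights, optimizer state, and the step counter.
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, CKPT_PATH)

def maybe_resume(model, optimizer):
    # After a hardware failure, restart from the latest checkpoint instead of from step 0.
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1
```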
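A sketch of activation checkpointing with `torch.utils.checkpoint`, using an illustrative 24-block MLP stack rather than a real transformer:

```python
import torch
from torch.utils.checkpoint import checkpoint

# A deep stack of blocks; without checkpointing, every block's activations stay resident
# in memory until the backward pass.
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.GELU()) for _ in range(24)]
)

def forward_with_checkpointing(x):
    for block in blocks:
        # Activations inside `block` are dropped after the forward pass and recomputed
        # during backward: extra compute in exchange for a much smaller memory footprint.
        x = checkpoint(block, x)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward_with_checkpointing(x).sum().backward()
print(x.grad.shape)   # gradients flow as usual: torch.Size([8, 1024])
```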
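A training-loop sketch that combines gradient accumulation with automatic mixed precision; the tiny model, the dummy micro-batches, and the accumulation factor of 8 are placeholders:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = torch.nn.Linear(1024, 1024).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)   # loss scaling for FP16
accum_steps = 8   # effective batch size = accum_steps * micro-batch size

# Dummy micro-batches standing in for a real DataLoader.
loader = [(torch.randn(4, 1024), torch.randn(4, 1024)) for _ in range(32)]

for step, (x, y) in enumerate(loader):
    x, y = x.to(device), y.to(device)
    with torch.cuda.amp.autocast(enabled=use_amp):    # run in FP16 where numerically safe
        loss = torch.nn.functional.mse_loss(model(x), y)
    # Divide by accum_steps so the summed gradient matches one large batch, then accumulate.
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                        # one optimizer step per effective batch
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```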
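A single-process sketch of the tensor-parallel idea: the weight matrix of one linear layer is split column-wise across four hypothetical ranks, and a concatenation stands in for the cross-GPU all-gather that Megatron-style implementations perform over NVLink:

```python
import torch

W = torch.randn(1024, 4096)
W_shards = torch.chunk(W, chunks=4, dim=1)     # four 1024x1024 column slices, one per "rank"

x = torch.randn(8, 1024)
partial_outputs = [x @ w for w in W_shards]    # each rank multiplies only its own slice
y = torch.cat(partial_outputs, dim=1)          # stands in for the all-gather of partial outputs

# The sharded computation reproduces the unsharded matmul.
assert torch.allclose(y, x @ W, atol=1e-4)
```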
CPU vs. GPU
- For the same chip area, a CPU devotes more of it to multi-level caches and control units for instruction-level parallelism, while a GPU devotes more to arithmetic units
- GPUs usually come with higher-bandwidth memory (the so-called device memory), so they perform very well in high-throughput applications
- Compared with CPUs, GPUs have much stronger raw floating-point arithmetic capability, but on GPU clusters the performance degradation caused by the gap between computation and communication is more pronounced
- The memory-access characteristics of GPUs also mean that the effective model size a GPU platform can hold is usually far smaller than on a CPU platform (the fairly mainstream NVIDIA Tesla K40M, for example, has 12 GiB of device memory), so when handling large models a GPU platform hits the model-size bottleneck earlier than a CPU platform and has to consider model parallelism
Model Selection
- Not all models respond in the same way to knowledge distillation, pruning and quantization.
- Lower precision while still maintaining accuracy
- Without calibration, you could incur a pretty steep accuracy loss. FP16 tends to maintain accuracy well, but a lot more caution is needed when going down to 8 bits (see the quantization sketch after this list).
- Bandwidth reduction
- Pruning
- Compress a large model or teach a smaller model, e.g. DistilBERT
- Maximize utilization of GPUs;
- Maximize throughput, minimize latency;
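A small sketch of post-training dynamic INT8 quantization in PyTorch; the stand-in model and shapes are placeholders:

```python
import torch
import torch.nn as nn

# A stand-in model; Linear/LSTM-heavy modules benefit most from dynamic quantization.
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()

# Dynamic quantization stores the weights in INT8 and quantizes activations on the fly,
# so it needs no calibration data. Static INT8 quantization, by contrast, requires a
# calibration pass to estimate activation ranges - skipping it is where the steep
# accuracy loss mentioned above comes from.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
print(quantized(x).shape)   # torch.Size([1, 10])
```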