PaddlePaddle 2.2.0 rc0 Release Note
We are excited to release the PaddlePaddle Framework V2.2.0-rc0. This version contains the following highlights.
- Added 100+ APIs, including 24 Fourier transform APIs, 14 linear algebra APIs, etc., to better facilitate the development of scientific computing and signal processing models.
- Added the support for multiple indexing syntaxes, including ellipsis (...), dimension expansion (None), boolean arrays (Bool Mask), and integer arrays (list and tensor), making it easier to operate on tensors.
- Added the `paddle.einsum` API to express multi-dimensional tensor computation in a more concise way.
- Enhanced dynamic graph mixed precision. Added a way to train an entire task in half precision (float16). The computational efficiency of the main tasks increased by 20%.
- Dynamic graph to static graph conversion: Further expanded the syntax and scenarios supported by dynamic-to-static conversion. Dynamic graph models trained with mixed precision can now also be converted to static graphs for training or inference deployment via the `to_static` interface. In addition, training performance after conversion is significantly improved compared with the dynamic graph mode, by introducing caching, enabling Passes, and other strategies.
- Pass development: Added an interface for rewriting static graph IR in Python, so that development for OP fusion and other subgraph replacement scenarios can be completed quickly in Python.
- Abstraction and functional encapsulation of the underlying code in operator Kernels: Provide high-performance Block-level IO and Compute operations (Kernel Primitive API). Kernel development using the Kernel Primitive API allows you to focus more on implementing the computational logic, significantly reducing the amount of code while ensuring performance, and decoupling operator computation from hardware.
- Hybrid parallelism: Based on the existing 4D hybrid parallelism of static graphs, performance optimizations such as the pipeline executor are carried out, and training compute utilization reaches 51% of the theoretical peak performance of the GPU for 100-billion-parameter models. Dynamic graphs support 4D hybrid parallelism, with functionality and performance for 100-billion-parameter models on par with static graphs. Basic functions such as auto-completion and auto-slicing are added, and semi-automatic parallelism based on user marks is available.
- GPU Parameter Server: For 100-billion-parameter models, optimize data reading, GPU-PS construction, and SSD performance, and improve pipelining. Overall performance is doubled, memory usage is halved, and one GPU machine can replace a hundred CPU machines for training 100-billion-parameter models.
- Inference acceleration: Support the latest TensorRT 8.x, and adapt to Nvidia's new hardware features for acceleration.
- Ease of Inference: Add automatic derivation of dynamic shape configurations in TensorRT subgraphs. Optionally, derive the range of shapes from data, avoiding tedious manual configuration. This simplifies the use of dynamic shapes.
- To address the problem of `grad` being exposed under multiple paths (`paddle.autograd.grad`, `paddle.grad`), it is recommended to use `paddle.grad`; the usage of `from paddle.autograd import *` and then calling `grad` directly has been removed. (#35579)
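The original 2.1 vs 2.2 code comparison is not reproduced here; the following minimal sketch (not taken from the release note) shows the recommended `paddle.grad` usage after this change:

```python
import paddle

# Recommended: call paddle.grad directly instead of importing grad
# from paddle.autograd.
x = paddle.to_tensor([1.0, 2.0, 3.0], stop_gradient=False)
y = (x * x).sum()
(dx,) = paddle.grad(outputs=[y], inputs=[x])  # dy/dx = 2x
print(dx.numpy())  # expected: [2. 4. 6.]
```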
- `Tensor.__setitem__` no longer supports slice indices of non-`int` type (`x[start:stop:step] = value`). Since the `float` type does not make mathematical sense when used as an index (for example, how to determine the exact index position when `start` is 0.5?) and is prone to unknown behaviors, this update limits the data type of slice indices to `int`; a slice index using `float` will report an error. (#35701)
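A hedged illustration (not from the release note) of the 2.2 behavior described above:

```python
import paddle

# Slice indices in __setitem__ must be integers in 2.2.
x = paddle.zeros([5])
x[0:3:1] = 1.0          # OK: integer start/stop/step
try:
    x[0.5:3] = 1.0      # float slice index is rejected in 2.2
except Exception as err:
    print(type(err).__name__)
```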
- Add an inplace call legality check for dynamic graph `Tensor.__setitem__`. Assignment code that does not pass the check reports an error (detection logic: when the `Tensor` is a leaf node and `stop_gradient` is `False`, the `Tensor` assignment operation is intercepted and an error is reported). Since executing `tensor[index] = value` overwrites the original value of the `Tensor`, it is an inplace operation on the `Tensor`. If the `Tensor` is a leaf node in the computation graph and needs to compute gradients, the assignment will break the computation of the `Tensor`'s backward gradient, making it an illegal inplace operation. Therefore, this update adds detection and interception of such operations: if your code contains assignments of the form `tensor[index] = value`, check whether the inplace operation requirement is met; if not, an error is reported. (#35701)
  - Example: initialization code that assigns via `weight[index] = value` needs to be adjusted. `self.weight` is a leaf node and needs to compute gradients, so the inplace operation cannot be used (it would affect the backward gradient computation). However, the initialization assignment itself does not need a backward pass, so when it is clear that backward computation is not needed, use `no_grad` to disable gradient computation and then assign the value.
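A minimal sketch (assumed, not from the release note) of the adjustment described above, wrapping the initialization assignment in `no_grad`:

```python
import paddle

# The parameter is a leaf Tensor that requires gradients, so the
# assignment itself must run with gradient tracking disabled.
weight = paddle.create_parameter(shape=[3, 3], dtype='float32')
with paddle.no_grad():
    weight[0] = paddle.ones([3])
print(weight.stop_gradient)  # False: the parameter still tracks gradients
```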
- When the `paddle.sum` input type was `bool`, the output type was also `bool`, which is inconsistent with `numpy.sum`. This incompatibility is resolved in this upgrade: the output type is now `int64`, consistent with `numpy.sum`. (#34313)
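An illustration (not from the release note) of the behavior change described above:

```python
import paddle

# Summing a bool Tensor now returns int64, matching numpy.sum.
x = paddle.to_tensor([True, True, False])
print(paddle.sum(x).dtype)  # paddle.int64 in 2.2 (was bool in 2.1)
```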
- Fix the behavior of `paddle.to_tensor` not copying the `Tensor` when the input `data` is a `Tensor`, which caused the `stop_gradient` property to be incorrectly modified. In the original implementation, when `data` is a `Tensor` and `dtype` and `place` do not change, `data` was returned directly (i.e., no copy occurred) and `data.stop_gradient` was modified, which breaks back propagation for the original computation graph of `data`. In the new implementation, `paddle.to_tensor` copies and returns a new `Tensor` in this case, without modifying the `stop_gradient` property of the original `data`. (#33335)
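A hedged sketch (not from the release note) of the new `paddle.to_tensor` behavior:

```python
import paddle

# paddle.to_tensor now copies the input Tensor and leaves the original
# stop_gradient property untouched.
data = paddle.to_tensor([1.0, 2.0], stop_gradient=False)
copied = paddle.to_tensor(data)
print(data.stop_gradient)   # still False
print(copied is data)       # False: a new Tensor is returned
```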
- Add the linear algebra computation APIs `paddle.linalg.*`
  - Add `paddle.linalg.svd` to support the singular value decomposition of multi-dimensional `Tensor`s. (#34953)
  - Add `paddle.linalg.cond` to support computing the condition number of a matrix or a batch of matrices based on the norm type `p`. (#35140)
  - Add `paddle.linalg.matrix_rank` to support computing the rank of a multidimensional matrix `Tensor`. (#34823)
  - Add `paddle.linalg.eigvals` to support computing the eigenvalues of general square matrices. (#35720, #35909)
  - Add `paddle.linalg.eigh` to support computing the eigenvalues and eigenvectors of complex Hermitian matrices or real symmetric matrices. (#34990, #35916, #35812, #36091, #35919)
  - Add `paddle.linalg.det` to support computing the determinant of a multidimensional matrix. (#34992)
  - Add `paddle.linalg.slogdet` to support computing the sign and natural logarithm of the determinant of a multidimensional matrix. (#34992)
  - Add `paddle.linalg.pinv` to support computing the pseudo-inverse of a multidimensional matrix `Tensor`. (#35804)
  - Add `paddle.linalg.multi_dot` to support computing the chained multiplication of multiple matrices. (#35224)
  - Add `paddle.linalg.solve` to support computing the solution of a system of linear equations. (#35715)
  - Add `paddle.linalg.matrix_power` to support matrix power operations. (#34667)
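A small sketch (shapes assumed, not from the release note) exercising a few of the `paddle.linalg` APIs listed above:

```python
import paddle

x = paddle.to_tensor([[4.0, 1.0], [1.0, 3.0]])
u, s, vh = paddle.linalg.svd(x)            # singular value decomposition
w, v = paddle.linalg.eigh(x)               # eigen-decomposition of a symmetric matrix
print(paddle.linalg.det(x))                # determinant: 11.0
print(paddle.linalg.matrix_power(x, 2))    # x @ x
```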
- Add new Fourier transform related APIs (#35665)
  - Add fast Fourier transform family functions
    - Differentiable 1d to nd complex-to-complex fast Fourier transforms. (`paddle.fft.fft`, `paddle.fft.fft2`, `paddle.fft.fftn`, `paddle.fft.ifft`, `paddle.fft.ifft2`, `paddle.fft.ifftn`)
    - Differentiable 1d to nd real-to-complex fast Fourier transforms. (`paddle.fft.rfft`, `paddle.fft.rfft2`, `paddle.fft.rfftn`, `paddle.fft.ihfft`, `paddle.fft.ihfft2`, `paddle.fft.ihfftn`)
    - Differentiable 1d to nd complex-to-real fast Fourier transforms. (`paddle.fft.hfft`, `paddle.fft.hfft2`, `paddle.fft.hfftn`, `paddle.fft.irfft`, `paddle.fft.irfft2`, `paddle.fft.irfftn`)
    - FFT related helper functions. (`paddle.fft.fftfreq`, `paddle.fft.rfftfreq`, `paddle.fft.fftshift`, `paddle.fft.ifftshift`)
  - Add short-time Fourier transform related functions
    - Short-time Fourier transform. (`paddle.signal.stft`)
    - Inverse short-time Fourier transform. (`paddle.signal.istft`)
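A hedged example (not from the release note) of the new FFT family and helper functions listed above:

```python
import paddle

x = paddle.randn([8], dtype='float32')
spec = paddle.fft.rfft(x)               # real-to-complex 1d FFT
recon = paddle.fft.irfft(spec, n=8)     # back to a real signal of length 8
freqs = paddle.fft.rfftfreq(8, d=1.0)   # corresponding frequency bins
print(spec.shape, recon.shape, freqs.shape)
```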
- Add new high-level APIs
  - Add `paddle.vision.ops.roi_pool` and `paddle.vision.ops.RoIPool` to support RoI region pooling operations in detection tasks. (#36154)
  - Add `paddle.vision.ops.roi_align` and `paddle.vision.ops.RoIAlign` to support RoI region Align operations in detection tasks. (#36207)
  - Add `paddle.vision.ops.psroi_pool` and `paddle.vision.ops.PSRoIPool` to support position-sensitive RoI region pooling operations in detection tasks. (#36111)
  - Add `paddle.vision.models.vgg19` pre-training weights. (#35788)
  - Add a dataset download progress bar in `paddle.vision.datasets.*`. (#33302)
  - Add the `paddle.Model.predict` parameter `verbose` to support choosing whether to show logs. (#33405)
  - Add the `wget` download option method to `paddle.hub`. (#33379)
  - Add gradient accumulation to `paddle.Model` in dynamic graph mode. (#32702)
  - Add the `num_iters` parameter to `paddle.Model.fit` and `paddle.Model.evaluate` in dynamic graph mode to control the number of training iterations. (#33986)
  - Add the `paddle.vision.ops.yolo_box` parameters `iou_aware` and `iou_aware_factor`, to support YoloBox using predicted IoUs as confidence factors. (#33400)
  - Add the `paddle.summary` parameter `input` to support specifying the given `input`. (#34165)
- Add networking class APIs
  - Add `paddle.nn.MaxUnPool2D` and `paddle.nn.functional.max_unpool2d` to support computing the inverse of a pooling result based on the input and the maximum positions. (#35056)
  - Add `paddle.nn.functional.gumbel_softmax` to support `gumbel softmax` sampling (see the sketch after this list). (#35506, #36065, #36094)
  - Add `paddle.nn.functional.class_center_sample` to support PartialFC class center sampling. (#34106)
  - Add `paddle.nn.functional.margin_cross_entropy` to support ArcFace, CosFace, SphereFace, and other MarginLoss functions. (#34247)
  - Support second-order derivatives for `paddle.nn.AvgPool2D`. (#35388)
  - Support second-order derivatives for `paddle.nn.Linear`, `paddle.matmul`, and `paddle.mm`. (#35428)
  - Support inputs of the form (N, C, *) for `paddle.nn.GroupNorm`. (#34773)
  - Support computing the backward of `paddle.nn.BatchNorm1D/2D/3D` when `x.stop_gradient=True`. (#34102)
  - Support computing the backward of `paddle.nn.Dropout` and `paddle.nn.Dropout2D/3D` in `model.eval` mode. (#35122)
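A hedged sketch (not from the release note) of two of the new functional APIs listed above:

```python
import paddle
import paddle.nn.functional as F

# gumbel_softmax sampling from unnormalized logits.
logits = paddle.randn([4, 6])
sampled = F.gumbel_softmax(logits, temperature=1.0, hard=True)

# max_unpool2d restores a pooled feature map from the pooling indices.
x = paddle.randn([1, 1, 4, 4])
pooled, indices = F.max_pool2d(x, kernel_size=2, return_mask=True)
restored = F.max_unpool2d(pooled, indices, kernel_size=2)
print(sampled.shape, restored.shape)
```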
- Add hardware related APIs
  - Add `paddle.device.cuda.Stream`, `paddle.device.cuda.Event`, `paddle.device.cuda.current_stream`, and `paddle.device.cuda.synchronize` to support synchronization operations for CUDA events and streams on the Python side. (#32460)
  - Add `paddle.device.cuda.device_count` to support returning the current number of available GPUs. (#34811)
  - Add `paddle.device.cuda.empty_cache` to support clearing free GPU memory. (#35427)
  - Add `paddle.device.cuda.get_device_properties` to support returning the properties of a given device. (#35875)
  - Add `paddle.device.cuda.stream_guard` for flexible switching of CUDA streams under dynamic graphs. (#35623)
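A hedged sketch (not from the release note) of the new CUDA device utilities listed above; it assumes a machine with at least one visible GPU:

```python
import paddle

if paddle.device.cuda.device_count() > 0:
    print(paddle.device.cuda.get_device_properties(0))
    stream = paddle.device.cuda.Stream()
    with paddle.device.cuda.stream_guard(stream):   # run work on a custom stream
        y = paddle.randn([1024, 1024]) @ paddle.randn([1024, 1024])
    paddle.device.cuda.synchronize()                 # wait for GPU work to finish
    paddle.device.cuda.empty_cache()                 # release cached free memory
```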
- Add Tensor operation APIs
  - Add `paddle.broadcast_tensors` to support broadcast operations on a set of `Tensor`s. (#33294, #34874)
  - Add `paddle.einsum`. (#33821)
  - Enhance the `paddle.tensor.gradient` interface to support the second-order derivative operator of sigmoid_op. (#32971)
  - Add `paddle.searchsorted` to support searching for the index of a given value in an ordered `Tensor`. (#35159)
  - Add `paddle.unique_consecutive` to support removing consecutively repeated elements in a `Tensor` and returning a consecutive non-repeated `Tensor`. (#34334)
  - Add `paddle.diagflat` to support returning a diagonal matrix whose diagonal is the elements of the input `Tensor`. (#33334)
  - Add `paddle.lgamma` to support computing the `lgamma` function value of a `Tensor` element-wise. (#33913)
  - Add `paddle.digamma` to support computing the `digamma` function value of a `Tensor` element-wise. (#33278)
  - Add `paddle.neg` to support computing the opposite value of a `Tensor` element-wise. (#33248)
  - Add `paddle.cumprod` to support computing the cumulative product of a `Tensor` along a given dimension. (#35185)
  - Add `paddle.atan2` to support element-wise `arctangent` operations that determine the quadrant by the signs. (#33067)
  - Add `paddle.expm1` to support computing `exp(x)-1` element-wise. (#33066)
  - Add `paddle.trunc` to support truncating the input `Tensor` to integer values. (#33371)
  - Add `paddle.diagonal` to support extracting the diagonal elements of the input `Tensor`. (#33586)
  - Add `paddle.utils.dlpack`, including `paddle.utils.dlpack.to_dlpack` and `paddle.utils.dlpack.from_dlpack`, to support `Tensor` transfer between different frameworks using `DLPack`. (#35067)
  - Add `Tensor.uniform_` to support filling a `Tensor` in place with random numbers drawn from a uniform distribution. (#33394)
  - Add `paddle.Tensor.T` to transpose an N-D Tensor and return a Tensor with the reversed shape of the original Tensor. (#35379)
  - Add `paddle.Tensor` magic operators: & (bitwise_and), | (bitwise_or), ^ (bitwise_xor), ~ (bitwise_not). (#33524)
  - Add `paddle.Tensor.fill_` and `paddle.Tensor.zero_` to modify the values of a Tensor in place, filling with a fixed value and with zeros respectively. (#33829)
  - Add `paddle.Tensor.fill_diagonal` and `paddle.Tensor.fill_diagonal_` to modify the diagonal element values of a Tensor. (#34460)
  - Add `paddle.Tensor.fill_diagonal_tensor_` to modify the whole sub-Tensor formed by the diagonal of two specified coordinate axes of a Tensor together with the other axes. (#34515)
  - Dynamic and static graph `Tensor`: Add the support for multiple index types, including ellipsis (...), dimensional augmentation (None), boolean arrays (Bool Mask), integer arrays (list), and tensors (Tensor).
    - Ellipsis (...) index: `X[..., 0]`. (#34267, #32876)
    - Dimensional augmentation (None) index: `X[None, :]`. (#34338, #34442, #34877, #34911, #33001)
    - Boolean array (Bool Mask) index: `X[X > 0] = 0`. (#35026, #35133, #33298)
    - Integer array (list) index: `X[[1, 0], [0]]`. (#34824, #33000, #35404)
    - Tensor index: `X[paddle.to_tensor([0, 1], [1, 0])]`. (#34824)
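A small sketch (not from the release note) of the new indexing syntax described above:

```python
import paddle

x = paddle.arange(24, dtype='float32').reshape([2, 3, 4])
a = x[..., 0]            # ellipsis index
b = x[None, :]           # dimension expansion with None
c = x[[1, 0], [0]]       # integer array (list) index
x[x > 20] = 0.0          # boolean mask assignment
print(a.shape, b.shape, c.shape)
```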
- Add distributed related APIs
  - Add `paddle.distributed.utils.global_scatter` and `paddle.distributed.utils.global_gather` to support MoE conditional distribution of data. `global_scatter` distributes the data to all cards based on the conditions, and `global_gather` then collects the data from all GPU cards based on the conditions. (#35546)
- Add additional APIs
  - Add `paddle.disable_signal_handler` to support disabling the signal capture mechanism in PaddlePaddle, thus allowing users to use Paddle and TVM at the same time. (#34577)
  - Add `paddle.incubate.softmax_mask_fuse` to support accelerating the softmax and mask operations of the Transformer architecture. (#33841)
  - Add `paddle.incubate.softmax_mask_fuse_upper_triangle` to support accelerating the softmax and mask operations of the GPT variant of the Transformer architecture. (#33981)
  - Add `paddle.static.ExponentialMovingAverage` to support computing the moving average of parameters with exponential decay. (#35673)
  - Add the `paddle::Tensor::slice` C++ API to support the slice operation, allowing users to perform slice operations on external Tensors. (#34227)
  - Add the `paddle.incubate.segment_*` series of APIs, including `paddle.incubate.segment_sum`, `paddle.incubate.segment_mean`, `paddle.incubate.segment_max`, and `paddle.incubate.segment_min`, to support summing, averaging, maximizing, and minimizing a `Tensor` by segment. (#35759)
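A hedged example (not from the release note) of the `paddle.incubate.segment_*` APIs listed above:

```python
import paddle

data = paddle.to_tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
segment_ids = paddle.to_tensor([0, 0, 1], dtype='int32')
print(paddle.incubate.segment_sum(data, segment_ids))   # expected: [[4., 6.], [5., 6.]]
print(paddle.incubate.segment_mean(data, segment_ids))  # expected: [[2., 3.], [5., 6.]]
```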
- Dynamic graph to static graph
  - Add dynamic-to-static error type recognition, and give suggestions for modification. (#35648)
  - Add the support for mixed precision training. `@to_static` supports one-click conversion to static graph mixed precision training mode. (#34562)
  - Add the `build_strategy` parameter in `@to_static` to support customizing `Pass` optimization strategies to accelerate model training after dynamic-to-static conversion, such as operator fusion (see the sketch after this list). (#34347)
  - Add the support for `a, b = static_variable`. (#33499)
  - Add the support for second-order derivatives. (#33110)
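A minimal sketch (assumed network, not from the release note) of the new `build_strategy` argument of `@to_static`; the chosen Pass switch is only an example:

```python
import paddle
from paddle.jit import to_static

build_strategy = paddle.static.BuildStrategy()
build_strategy.fuse_elewise_add_act_ops = True   # example Pass switch

class Net(paddle.nn.Layer):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = paddle.nn.Linear(16, 4)

    @to_static(build_strategy=build_strategy)
    def forward(self, x):
        return paddle.nn.functional.relu(self.fc(x))

net = Net()
out = net(paddle.randn([2, 16]))
print(out.shape)  # [2, 4]
```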
- Program and Graph conversion: `Program` and `Graph` are the intermediate representations used to express computations in the underlying PaddlePaddle framework. For developers of PaddlePaddle, it is sometimes necessary to convert `Program` and `Graph` to each other for computational processing. This feature adds the ability to convert `Program` and `Graph` to each other.
  - Develop and refine the `Program` and `Graph` interconversion feature. (#33949)
  - In order to support control flow nodes such as `while`, a `Program` in the PaddlePaddle framework may contain multiple sub-`block`s in addition to the main `block`. Previously, when converting a `Program` to a `Graph`, only the main `block` was converted. In this update, the `Graph` is modified to support the expression of sub-`block`s, achieving a complete conversion of `Program` to `Graph`. (#33320)
  - Provide the dependency helper functions needed to analyze the control flow in a `Program`. (#33439)
  - `Program` and `Graph` retain the values of the `stop_gradient` and `persistable` attributes needed for training after converting to each other. (#33771)
  - `Pass` now supports processing the main `Graph` and all its sub-graphs, whereas the original `Pass` only processed the main `Graph` and ignored the sub-graphs. (#34158)
  - Handle some topological ordering problems for `Program` and `Graph` interconversion in prediction cases. (#34121, #34521)
- Pass development
- Kernel Primitive API
  - Abstract and encapsulate the underlying code in operator Kernel implementations, to provide high-performance Block-level IO and Compute operations. Kernel development using the Kernel Primitive API allows you to focus more on implementing the computational logic, significantly reducing the amount of code while ensuring performance, and decoupling operator computation from hardware. (#34672, #35075, #34456, #35282, #35743, #34208)
- Enhance dynamic graph mixed precision. Add a way to train an entire task in half precision (float16). The computational efficiency of the main tasks increases by 20%. (#35521)
- In dynamic graph mixed precision `paddle.amp.GradScaler`, add the `get` and `set` methods for user-friendly settings. (#33835)
- In dynamic graph mixed precision `paddle.amp.GradScaler`, add the `state_dict` and `load_state_dict` methods (see the sketch after this list). (#34300)
- In dynamic graph mixed precision, split `minimize` into `step + update`. In addition, add the `unscale` method. (#35927)
- In dynamic graph mixed precision training, support param groups. (#34899)
- In static graph mixed precision training, support the gradient pruning. (#33565)
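A hedged sketch (not from the release note) of dynamic graph mixed precision training with the `GradScaler` features mentioned above, including a `state_dict` round-trip:

```python
import paddle

model = paddle.nn.Linear(10, 10)
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024)

data = paddle.randn([4, 10], dtype='float32')
with paddle.amp.auto_cast():
    loss = model(data).mean()
scaled = scaler.scale(loss)   # scale the loss before backward
scaled.backward()
scaler.minimize(opt, scaled)  # unscale gradients and apply the update
opt.clear_grad()

state = scaler.state_dict()   # new in this release
scaler.load_state_dict(state)
```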
- Basic functions of distributed training
  - Add `paddle.DataParallel.no_sync` to pause multi-card communication and gradient synchronization under dynamic graph data parallelism. (#34740)
  - In `paddle.distributed.launch`, add a startup mode supporting fault tolerance, implementing fault tolerance for nodes in `collective` mode. (#33369, #34572)
  - In the distributed training APIs `paddle.static.Executor.train_from_dataset` and `paddle.static.Executor.infer_from_dataset`, add a dump function for the parameters and intermediate variables of the model during training. (#34457)
  - In hybrid parallelism, support the combination of model parallelism and data parallelism. (#34377)
  - Add the distributed strategy `gradient scale` option. Users can specify the way of `gradient scale`: `avg`, `sum`, or custom. (#33862)
  - Add `paddle.distributed.parallel_with_gloo`, supporting CPU barrier operations. (#34671)
  - For the GPU parameter server, add the training profiler function. (#32640)
  - For the GPU parameter server, add the pipeline function. The training performance can increase by 40%. (#33159)
  - For static graph hybrid parallelism, add the `dp_as_optimizer_sharding` experimental feature, which shards the optimizer state along the data-parallel dimension. This can save optimizer-state GPU memory usage. (#35593)
  - For the static graph pipeline parallel executor, support the `LRScheduler`. (#34402)
  - Add `paddle.fluid.core.GraphPyClient.set_node_feat` to support setting graph node features in the graph engine client, with the storage of multiple feature types. (#34994)
  - Improve the performance of the graph engine's graph node neighbor sampling algorithm, and optimize the execution of the graph random walk algorithm. (#34088)
  - Implement the unified dynamic-static mode for the model parallel interfaces `paddle.distributed.fleet.meta_parallel.ColumnParallelLinear`, `paddle.distributed.fleet.meta_parallel.RowParallelLinear`, `paddle.distributed.fleet.meta_parallel.VocabParallelEmbedding`, and `paddle.distributed.fleet.meta_parallel.ParallelCrossEntropy`. (#33700, #33411)
  - Add the distributed model parallel CPU `c_embedding` op. (#35467)
  - Add a retry mechanism for `gethostbyname` when `gen_comm_id` is performed in the initialization phase of distributed communication. (#34855)
  - Add the `scale_sparse_gradient_with_batch_size` switch configuration for the `fleet` gradient update, to determine whether the gradient is multiplied by `batch_size`. (#34893)
- Dynamic graph hybrid parallel
  - In dynamic graph distributed data parallel scenarios, add the `paddle.distributed.fleet.dygraph_optimizer.DygraphShardingOptimizer` interface, optimizing GPU memory occupation by sharding the optimizer state between cards, to support larger models or batch sizes. (#33633)
  - For dynamic graph Sharding, support MP-PP-DP, enabling dynamic graph 4D hybrid parallelism. (#35580)
  - For dynamic graph Recompute, support mixed precision computation. (#33251)
  - For pipeline parallelism, support the 1F1B scheduling policy for runtime memory savings. (#34483)
  - For dynamic graph 3D hybrid parallelism, support the recompute policy and the offload function. (#34607, #35588)
  - For dynamic graph 3D hybrid parallelism, support model saving and loading. (#34768)
  - Add the scatter-gather scheme for model parallel + pipeline parallel scenarios, optimizing cross-machine communication performance. (#34130)
  - For pipeline parallelism, support slicing based on the number of layers to ensure more even sharding. (#34207)
  - For pipeline parallelism, support automatic mixed precision. (#33951)
  - For pipeline parallelism, add the `paddle.distributed.fleet.meta_parallel.SharedLayerDesc` networking description, to support the parameter-sharing networking mode. (#33578)
  - For tensor parallelism, add `paddle.distributed.fleet.meta_parallel.ParallelCrossEntropy`, a tensor parallel computation method supporting cross-entropy Loss. (#33401)
  - For `paddle.DataParallel`, add the `find_unused_parameters` interface, to support the use of control flow in the model in data parallel mode. (#32826)
  - In data parallel mode, add the port waiting feature to solve port conflict problems. (#34207)
- Static graph hybrid parallel
- Automatic parallel
  - Add the auto-parallel `shard_tensor` and `shard_op` interfaces, supporting semi-automatic parallelism based on user tags. (#33804, #35765)
  - Add the auto-completion of distributed attributes, supporting completing all untagged distributed attributes based on user-tagged distributed attributes. (#34813)
  - Add the auto-slicing of the serial `Program`. (#35117)
  - Enable the automatic parallel adaptation of the Fleet API. (#35483)
- Model quantization
  - Add offline quantization of dynamic graphs. (#33445, #33898, #33962, #35015)
  - Refactor the module that outputs quantization statistics in the dynamic graph quantization training function, to allow its use on the prediction side and improve robustness. (#31680, #31710, #31861)
  - For dynamic graph quantization training, support the use in combination with mixed precision training. (#33484)
  - For the dynamic graph quantization training function, support the quantization of Function-class APIs. (#33162, #33871)
  - Support distributed quantization training in static graph mode. (#33781)
  - Support the quantization of conv2d_transpose in dynamic graph mode. (#34547)
- Custom OP
  - Add custom operator DCU back-end support. (#34050)
- Cost Model
  - Add the Paddle CostModel, implementing a method to get op time cost via the Profiler. (#35774)
- Model saving and loading
  - Add the function of saving a Layer's non-forward member methods and their related parameters directly as inference models via the `paddle.jit.save` interface. (#34070)
- ONNX Exporter
  - Add 8 operator adaptations: `softplus`, `elementwise_mod`, `elementwise_floordiv`, `p_norm`, `depthwise_transpose`, `group_norm`, `pixel_shuffle`, `top_k`. (Paddle2ONNX#252, Paddle2ONNX#261, Paddle2ONNX#293)
  - Add 8 detection model exports: PPYOLO, PPYOLOv2, PPYOLO-Tiny, TTFNet, PAFNet, FCOS, SSD. (Paddle2ONNX#252)
- `paddle.slice`: Add the support for `bool` type Tensors and optimize error messages. (#35586, #35179)
- `paddle.strided_slice`: Add the support for `TensorArray` type input, and adjust the output when `step` < 0. The adjusted result is consistent with `numpy`. (#34205, #34172)
- `paddle.multiply`: Support `bool` data type operations. (#35551)
- Logical operations (`paddle.logical_not`, `paddle.logical_and`, `paddle.logical_or`, `paddle.logical_xor`): Support non-`bool` data types (int8, int16, int32, int64, float, double). (#34141)
- `paddle.transpose`: Support `bool` type operations. (#35886)
- `paddle.strided_slice`: Support `bool` type operations. (#33373)
- `paddle.set_printoptions`: Support setting `linewidth` for printing `Tensor`. (#35175)
- `paddle.to_tensor`: Support `LoDTensor`. (#33027)
- `paddle.linalg.det` and `paddle.linalg.slogdet`: Support backward operations. (#36013)
- `paddle.nn.functional.pad`: Support the input of a tuple-type pad parameter in the case of full-dimensional pads. (#35985)
- Optimize error messages when the `paddle.nn.functional.pad` input is abnormal. (#34979)
- For static graphs, support running a partial `program` and generating the corresponding backward `program`. (#34395)
- oneDNN function optimization
  - Add oneDNN kernels for multiple operators, including the forward and backward oneDNN FP32 and oneDNN BF16 kernels of `clip`, `slice`, `split`, `cast`, `scale`, `expand_v2`, `sigmoid`, `matmul_v2`, and `PRelu`. (#35601, #34332, #34284, #34216, #34192, #33878, #33584, #33056, #32975)
  - Add the implementation of Selected Rows in the SGD operator by using oneDNN AXPY. (#33632)
- Support the `bfloat16` data type on GPUs with the Ampere architecture. (#31232, #32221, #32542)
- On the `Conv` operator, use Tensor Core on GPUs with the Ampere architecture. (#34409)
- Support `paddle.device.cuda.current_stream().cuda_stream` to get raw pointers. (#35813)
- Add the `paddle.optimizer.AdamW` GPU fused kernel, to support the layerwise learning rate function. (#35020, #35569)
- Support using Nvidia's cuSPARSE library functions in Paddle. (#35675)
- Add the `int16` type support to `paddle.full`. (#35619)
- Optimize the GPU memory usage of `paddle.nn.ClipGradByGlobalNorm`. (#34586)
- The `reduce_sum` operator supports the float16 type. (#32966)
- `paddle.nn.CTCLoss`: Add two grad norm methods, `norm_by_total_logits_len` and `norm_by_batchsize`. (#34729)
- Add the recommended public API usages under each path. (#33313, #33308, #32759, #32695, #32643, #31912, #32650, #32034, #33897)
- Restore the original API accessibility under the `paddle.vision` path. (#34432)
- `paddle.vision.ops.deform_conv2d`, `paddle.vision.ops.DeformConv2D`: Add the support for the double input type. (#35330)
- `paddle.fluid.contrib.layers.shuffle_batch`: Add the GPU Kernel implementation. (#33938)
- For the existing APIs, add the public call paths `paddle.linalg.cholesky`, `paddle.linalg.norm`, and `paddle.linalg.inv`. (#33420)
- `paddle.reshape`: Support turning an empty `Tensor` into an empty `Tensor` of another shape. (#36087)
- `paddle.equal`: Add the support for `int`, `float`, and `bool` types for the second input. (#35695)
- `paddle.io.DataLoader`: Add the support for persistent_worker mode. (#34017)
- Optimize the `l2_normalize`, `p_norm`, `elementwise_max`, `prelu`, `clip_by_norm`, and `lars optimizer` operators to support float16 computation. (#35576, #35888, #35888, #35532, #35446, #33280)
- Optimize the reading speed of the flowers dataset from several minutes per batch to 1~3 seconds per batch. (#31408)
- Support the fused allreduce sum function in `paddle.distributed.fleet.DistributedStrategy` when the `without_graph_optimize` switch is on. In FP32, the performance increases by 3%; in AMP, the performance increases by 8%. (#34446)
- Dynamic graph to static graph
  - Optimize the dynamic-to-static error reporting format, hide unnecessary error stack frames at the framework level, and add user code error line location identifiers and context. (#35365, #35320)
  - Optimize the conversion logic of the `list.append` syntax in the control flow. (#35212)
  - Optimize the logic of the dynamic-to-static training code, upgrade the internal `Program` cache mechanism, and add an advance copy policy for input `Tensor`s to improve training performance. (#34181, #33796)
  - Optimize the internal executor memory recycling strategy for dynamic-to-static graphs, reducing the GPU memory usage during training. (#34177)
  - Integrate the source code of the `Gast` third-party dependency library, decoupling version dependencies. (#34556)
- Basic functions of distributed training
  - Enhance the check of the static graph pipeline parallel stage and persist var. (#34193, #34870, #35453)
  - Optimize static graph pipeline parallelism. In 1F1B scheduling, the GPU memory does not increase as the global batch size increases. (#34230)
  - For the GPU Parameter Server, optimize the build-phase hashmap. In the build phase, the performance increases by up to 7x on some tasks. (#34175)
  - For the GPU Parameter Server, add multi-stream parallelism in the pull/push phase. (#34276)
  - For the GPU Parameter Server, support the remote pull of parameters between machines in multi-machine training mode. (#35396)
  - For the CPU Parameter Server, support SSD storage. (#33031)
  - `paddle.io.Dataset`: Support parsing data with a dynamic library. (#33969)
  - In `paddle.distributed.fleet.dataset.DatasetBase`, add a consistency check function for the data generated by `use_var_list` and `pipe_command`. (#34463)
  - Add a consistency check between the `emb` dimension of `paddle.fluid.layers.embedding` and the `emb` dimension of the `sparse table` in `fleet`. (#34249)
- Static graph hybrid parallel
- Error debugging optimization
  - Unify the error reporting mechanism for third-party libraries, and optimize the error messages for `CURAND`, `CUDNN`, `CUBLAS`, `CUSOLVER`, and `NCCL`, making the error reporting more detailed and standardized. (#33003, #33743)
  - Optimize avx and no_avx related installation error messages to simplify redundant and complex content. (#33818)
  - Optimize the error reports of `paddle.nn.functional.gather_tree`, `paddle.nn.Transformer`, `paddle.nn.TransformerDecoderLayer`, `paddle.nn.TransformerEncoderLayer`, and `paddle.nn.MultiHeadAttention`. (#34322, #33859)
  - Support the `FLAGS_check_nan_inf` environment variable under dynamic graphs for runtime checking and locating of model `nan` and `inf`. (#32635)
  - Remove the stack information introduced by Signal-class error messages due to the capture of signals, to avoid misleading users. (#34842)
  - Fix the error message of `elementwise`-class operators when input x or y is an empty Tensor. (#33928)
- Model saving and loading
- Custom OP
  - Remove unnecessary `cudaStreamSynchronize` operations from `paddle::Tensor`'s `copy` method, to improve performance. (#35802)
- Optimize the AMP grey list for model parallel + AMP, supporting the model parallel operators. The performance improves by 8%. (#33660)
- Optimize the `device` property setting for backward gradient accumulation in pipeline parallelism. The performance improves by 1-3%. (#33946)
- Optimize the debug part of the pipeline parallel executor. The performance improves by 60-140%. (#33948)
- Support the `Program` cache in pipeline parallelism. The performance improves by 10-40%. (#33998, #33954)
- Optimize the communication waiting of the pipeline parallel `send`. The performance improves by 0.3-2%. (#34086)
- Optimize the `send/recv` data volume for model parallel + pipeline parallel. The performance improves by 36% in the 8-machine test. (#34110)
- Optimize the cast of parameters in hybrid parallel + AMP, controlled by `optimize_cast`. The performance improves by 5-7%. (#34965)
- Optimize the performance of pipeline parallel + recompute + AMP. The performance improves by 13%. (#34519)
- Support `float16` communication for pipeline parallel + data parallel, controlled by `distributed_strategy.fp16_allreduce`. The performance improves by 13%. (#34762)
- Design and implement the generic Reduce CUDA algorithm. It is applied to 7 Reduce operators, with speedups of 1.0x ~ 22.7x. (#32697, #32974, #33267, #32885, #33144, #33761, #33901, #34143, #34436)
- Design and implement the generic Elementwise and Broadcast CUDA algorithms (#32512, #32928, #33976, #32148, #32414): applied to 41 unary and activation operators (#32348, #32622, #32823), with the performance improving by 1.1x ~ 1.4x; applied to 19 binary operators (9 basic computation class, 6 comparison class, and 4 logic class) (#33050, #33052, #33053, #33051, #33089), with the performance improving by 1.02x ~ 3.21x.
- Optimize the `roll` operator CUDA implementation. The performance improves by more than 10% and 50% for single-dimensional and multi-dimensional inputs, respectively. (#32880)
- Optimize the `roll` operator index computation. The performance improves by 15% and 70% for single-dimensional and multi-dimensional inputs, respectively. (#33909)
- Optimize the CUDA implementation of the `update_loss_scaling_op` operator. The performance improves by 2.06x. (#32554)
- Optimize the `softmax_with_cross_entropy` (hard label) GPU operator performance. The acceleration ratio is 1.0x ~ 10.0x. (#35660)
- Optimize the CPU implementation of the `index_select` forward and backward operators. The acceleration ratio is 2.09x ~ 9.34x. (#32863, #32955)
- Optimize the CPU implementation of the `batch_norm` operator for 2-dimensional inputs. The acceleration ratio is 22.68x ~ 30.00x. (#34585)
- Optimize the GPU performance of the `batch_norm` operator in the initialization method and for 2-dimensional inputs. The acceleration ratio is 1.25x ~ 25x. (#33851, #33887)
- Optimize the `log_softmax` operator performance, and fix the related bug. The kernel performance improves by 4.22x ~ 32.29x after optimization. (#31630, #32180, #32396, #32937)
- Optimize the `concat_and_split` operator, to solve the problem that computation and communication cannot overlap when training BERT on Hygon DCU chips in dynamic graphs. The performance of BERT distributed training on Hygon DCU chips increases by 27%. (#33982)
- Optimize the `fused_elemwise_act` operator, with more than ten times performance improvement at the MB computing scale. (#33480)
- Add the `build_strategy.fix_op_run_order` strategy, to fix the order of op execution. The speed of the ResNet model with a single machine and 8 cards increases by 1.8%. (#34427)
- For dynamic graph backward computation, support and automatically enable the partial operator inplace strategy. The performance of pure float16 training of the dynamic graph GPT model increases by 4.8%. (#35412)
- Optimize the dynamic graph performance by stripping logic executed only on static graphs from the execution path of dynamic graphs. (#34024)
- For the IR Pass, expose the optimization capability as a general-purpose capability, supporting both single-machine and distributed optimization. The performance improves by 3%-5% in GPT hybrid parallel scenarios. (#34955, #35704, #34730, #34524)
- Optimize the ctc loss grad computation, increasing the speed by ~3x. Correspondingly, the GPU memory usage increases. (#34729)
- Optimize the `depthwise_conv` numerical stability. (#35161)
- Add a shape check at parameter creation, to ensure that the `size` of each axis of the parameter is greater than 0. (#33265)
- Optimize the `paddle.nn.LayerNorm` computation, and fix the related data overflow bugs. (#34432, #33658)
- Support Windows application scenarios, integrating PaddlePaddle framework capabilities into MFC/QT/C# desktop software environments, and fix the bug in process nesting that causes system crashes. (#34312)
- Fix the Reduce data initialization bug that caused incorrect loss in NLP models. (#34941)
- Fix the bug when `batch_size=1` in `paddle.nn.LayerNorm`. (#35480)
- Fix the bug of incorrectly catching an error when the input of `paddle.static.nn.group_norm` is empty. (#35613)
- Fix the bug of empty mean/variance when `is_test=True` in `paddle.nn.functional.batch_norm`. (#35328)
- Fix the out-of-bounds access bug when the input of `paddle.nn.functional.instance_norm` and `paddle.nn.functional.batch_norm` is empty. (#35341, #34107)
- Fix the bug where quantized models do not count the output of `paddle.nn.LayerNorm`. (#33610)
- Fix the bug where `paddle.nn.SyncBatchNorm.convert_sync_batchnorm()` does not support 1D/3D. (#32989)
- Fix the bug of failing to add the backward when `is_test=True` in `paddle.nn.BatchNorm1D`, `paddle.nn.BatchNorm2D`, and `paddle.nn.BatchNorm3D`. (#32678)
- Fix the bug where `Tensor.cuda` does not support `device_id` configured to `None`. (#34416)
- Fix the bug where `paddle.to_tensor` does not support built-in types such as `Tensor.dtype` and `core.Tensor`. (#31931, #33430)
- Fix the bug where `paddle.nn.functional.log_softmax` does not support an input dimension of 0. (#34635)
- Fix the bug that the relative error between the CPU calculation result and the accurate value of `paddle.nn.GroupNorm` under float32 is greater than 1e-5. (#33176)
- Fix the bug where the returned result is not 0 when the parameter `offset` exceeds the dimension size in `paddle.trace`, and fix the stack overflow bug when the parameters `axis1` and `axis2` are illegal values. (#33922, #35419)
- Fix the bug where the output type is not int when the `paddle.sum` input parameter is the bool type, and the bug where the output type is wrong when the input and output parameter types are inconsistent and the number of reduce elements corresponding to the axis is 1. (#34313, #36123)
- Fix the division-by-0 error and array out-of-bounds bug when the input of `paddle.nn.conv2d/conv3d/conv2d_transpose/conv3d_transpose` is illegal. (#35337)
- Fix the heap buffer overflow bug on illegal input of `paddle.nn.conv2d_transpose`. (#35340)
- Fix the bug where writing to a null address in `paddle.bmm` causes the program to crash at runtime. (#35098)
- Fix the bug where the `cast` operator cannot support Tensor conversion from int16 to float32. (#35156)
- Fix the bug where `assign` does not support float16 or uint8. (#35153)
- Fix the bug of `concat`'s tendency to overflow when the input is larger than the shape tensor. (#34319)
- Fix the bug where `concat` in dynamic graphs does not support empty tensors as input. (#35845)
- Fix the bug where `paddle.where` does not support broadcast. (#35092)
- Fix the bug of `paddle.reshape` not checking input legality for the empty tensor. (#35642)
- Fix the bug of the `layernorm` operator mis-matching with the CUDA kernel for large shapes. (#33748)
- Fix the bug of the wrong `stop_gradient` property setting in static graphs for `random`-class operators. (#33959)
- Fix the wrong behavior of the `split` operator with empty tensor input. (#334356)
- Fix the GPU memory leak bug in tensor slice left-value assignment. (#35013)
- Fix the bug of dynamic graph Layers not being usable by cloudpickle dump and load. (#35538)
- Fix the division-by-zero error in the illegal parameter settings of the simple_rnn_cell, gru_cell, and lstm_cell APIs. (#34627)
- Fix the null pointer dereference bug in case of illegal input of `paddle.nn.functional.linear`. (#34696)
- Fix the memory out-of-bounds bug of `paddle.strided_slice` and `paddle.transpose`. (#35062, #35079)
- Fix the division-by-0 error when the `roll` operator has an illegal input. (#34499)
- Fix an array out-of-bounds bug on illegal input of the `gather` operator. (#34096, #34138, #34200)
- Fix the division-by-0 error on illegal input of the `prelu` and `softmax` operators. (#34499)
- Fix the bug where the `split` operator does not perform a legality check on input parameters. (#34630)
- Fix the bug where the `memcpy` operator does not support Hygon DCU chips. (#35394)
- Fix the training error report of the `slice` operator when `batch_size=1`. (#34265)
- Fix the overflow bug of the `reduce_sum` operator in AMP. (#33960)
- Fix the ANSI escape code error on Windows. (#33689)
- Fix the inconsistency bug between `paddle.hub` parsed file names and the downloaded and saved files. (#33214)
- Fix the memory leak bug when inputting empty tensors to the `matmul`, `diag_embed`, and `auc` operators. (#34978)
- Fix the large computational accuracy error of broadcast for `paddle.less_equal`, `paddle.less_than`, `paddle.greater_equal`, and `paddle.greater_than`. (#32941)
- Fix the crash bug of the `interpolate` operator in case of a large input shape. (#35577)
- Fix the legality check for the `interpolate`, `unfold`, and `spectral_norm` operators in case of empty tensor input. (#33941, #34943, #35005)
- Fix a possible negative sign (integer overflow) in `paddle.flops` when computing the output FLOPs. (#33576)
- Fix the bug of reporting an error when `paddle.summary` encounters a layer whose return value contains a non-Tensor element. (#34160)
- Fix the bug where the output shape is calculated incorrectly when the `pool` operator input is illegal. (#35106)
- Fix the legality check bug of the input shape for the `unfold`, `dice_loss`, and `reshape` operators. (#34673, #34757, #35016)
- Fix the zero-tensor input bug of the `unique` and `unstack` operators. (#36021)
- Fix the bug when the backward input of the stack operator is null. (#362877)
- Fix the division-by-0 error in CPU execution when the shape of the input Tensor of `paddle.inverse` is `[0, 0, 0]`. (#34996)
- Fix the CUDA error reported by `paddle.nn.functional.grid_sample` for special input cases. (#33100)
- Fix a compile-time dimension calculation error in `paddle.flatten` for special input cases of static graphs. (#35321)
- Fix a compile-time check error in `paddle.nn.conv2d/conv3d/conv2d_transpose/conv3d_transpose` when calculating the output shape. (#35693)
- Fix the bug where `paddle.data.flowers` is prone to data reading errors in multi-card training situations. (#33738)
- Fix the bug that the loss is nan when PACT quantizes the se module. (#35392)
- Fix the error report in the quantization of `flatten_contiguous_range`. (#35410)
- Fix the PACT quantization bug in dynamic graph mode. (#35407)
- Fix the error reported by channel-wise quantization of BERT. (#34948)
- Fix the quantization bug when all parameters are 0. (#34647)
- Fix a bug in channel-wise quantization when the number of channels is 1. (#33753)
- Fix the thread-unsafety bug of the dynamic graph `@no_grad`. (#34649)
- Fix the bug where the `paddle.grad` interface hangs in some scenarios. (#34023)
- Fix the shape derivation bug of `paddle.masked_select` in static graphs. (#33167)
- Fix the bug of `paddle.slice` not supporting `numpy.ndarray`-type indices in some scenarios, and the error when `axes` is of the `tuple` type. (#35748, #35267)
- Fix the `set_value` backward gradient truncation bug. (#34304)
- Fix the `paddle.regularizer.L1Decay` duplicate gradient setting bug in the non-inplace computation. (#32710)
- Fix the bug of the learning rate not taking effect when grouping `adamw` parameters. (#34468)
- Optimize the illegal `dilate` input check in convolution-class APIs. (#35894)
- Fix the `paddle.io.DataLoader` iteration mid-break error report (#34501), the DataLoader memory leak bug (#34140), the DataLoader wrongly reported warning information (#33712), and the DataLoader sub-process random state consistency bug (#33310).
- Fix drop_last not taking effect in IterableDataset. (#34801)
- Fix the optimizer state recovery bug caused by `paddle.optimizer.lr.LRScheduler`. (#33984)
- Fix the bug of using `axis` for infershape in the `gather` operator. (#33413)
- Fix a bug of the Executor getting stuck when the fetch_list type is a tuple. (#35726)
- Fix the `paddle.nn.GroupNorm` division-by-zero error, and add the check that the number of channels is exactly divisible by the number of groups. (#35644)
- Fix the bug of referencing freed memory in the tensor formatter. (#35399)
- Fix the precision loss of the `beta` parameter at `float64` precision for the Adam optimizer. (#33381)
- Fix the precision misalignment bug caused by unbroadcasted initialization of tensor parallel non-sliced parameters. (#35326)
- Migrate the `topk` operator in the `paddle.static.accuracy` API to the `topk_v2` operator. (#35494)
- Migrate the `expand` operator to the `tile` operator in `paddle.nn.dynamic_decode`, and the `topk` operator to the `topk_v2` operator in `paddle.nn.BeamSearchDecoder`. (#35656)
- Migrate the `one_hot` operator in the `paddle.nn.functional.dice_loss` API to the `one_hot_v2` operator. (#35734)
- Fix the usage bug of `paddle.summary` in static graph mode. (#35303)
- Fix the multi-card startup bug of `paddle.Model.prepare` in static graph mode. (#34311)
- Dynamic graph to static graph
- Basic functions of distributed training
  - Fix a potential stack overflow bug in the graph engine. (#33055)
  - Fix a potential deadlock bug in distributed training. (#34461)
  - Fix the bug where tensor parallelism is incorrectly sliced in the multi-head attention computation of transformer-class models, and optimize the speed of tensor parallelism in mixed precision computations. (#33015)
  - Fix the bug where the norm of non-distributed vars is computed multiple times when using `paddle.nn.ClipGradientByGlobalNorm` in model parallelism. (#35713)
  - Fix the bias addition position error in the row slice of the model parallel `paddle.distributed.split` Parallel Linear. (#35186)
  - Fix the possible hang bug when initializing the pipeline parallel communication group. (#33476)
  - Fix the bug where the `Tensor` GPU memory in pipeline parallelism is released before it is actually used. (#33996)
  - Fix the bug where the pipeline parallel backward gradient accumulation `op_device` is empty. (#33875)
  - Fix the bug of pipeline parallelism failing to run a `sub-block`. (#32727)
  - Fix an occasional hang bug when initializing Sharding parallel communication. (#33327)
  - Fix the `paddle.distributed.barrier` synchronization stream error bug. (#33476)
  - Fix the `paddle.distributed.alltoall` communication group setting error bug. (#32890)
  - Fix a precision misalignment caused by a broadcast error in static graph tensor parallel parameter initialization. (#35326)
  - Fix the bug where dynamic graph data parallelism does not support custom operators such as `recompute` inheriting from the `PyLayer` class. (#35401)
  - Fix the hang bug in case of pipeline parallel + data parallel in hybrid parallelism. (#34142)
  - Fix the `fleet.get_loss_scaling` failure bug when AMP is enabled. (#33935)
  - Fix the Connection Refused problem caused by a `fleet` multi-machine master not waiting for other nodes to be ready. (#32889)
  - Fix the bug where the distributed prediction `infer_from_dataset` still updates parameter gradients. (#35698)
  - Fix the bug in `data_feed` where the dense feature LOD attribute is incorrectly set. (#35000)
  - Fix the save bug with the `gradient_merge_cond` variable when using `gradient merge` for static graphs. (#35578)
  - Fix the bug of unclear error reporting when `fleet` is enabled with `dump_slot`. (#34173)
  - Fix the RCCL bug on Hygon DCU chips in hybrid parallel training. (#32808)
  - Fix the GPU parameter server exit error reporting bug. (#33724)
  - Fix the bug of the upload/download function of the hdfs tool being unavailable. (#33903)
  - Fix the bug of the GPU parameter server getting stuck during training because the samples cannot be evenly divided by the worker number. (#32640)
  - Fix the GPU parameter server error reported when training with a non-0 card. (#33078)
  - Fix the bug of the delta score and scale show in the GPU Parameter Server. (#33492)
  - Fix the bugs of the GPU Parameter Server not merging dense after training and of incorrect g2sum calculation; for data norm, add the optimize op. (#35029)
- Dynamic graph hybrid parallel
  - Fix the precision error in pipeline parallelism due to communication asynchronization. (#35556)
  - Fix the precision exception bug in `paddle.distributed.fleet.meta_parallel.RowParallelLinear` backward computation under tensor parallelism. (#33207)
  - Fix a bug in tensor parallelism causing parameter initialization errors and precision exceptions due to a randomness control error. (#32897)
  - Fix the random hang bug when creating a communication group with `paddle.distributed.new_group()`. (#33141)
  - Fix the error caused by traversing the backward graph to resolve control flow networking under data parallelism. (#32715)
  - Fix the error caused when synchronizing the parameters of each process under data parallelism. (#33955)
- Static graph hybrid parallel
  - Fix a slice error of TensorParallel in Multi-Head Attention networks, and optimize the training speed when TensorParallel is used together with mixed precision. (#32897)
- Custom OP
  - Remove the changes to the `logging` library global settings. (#32673)
- Add `GlooParallelContext`, adapt the `Reducer` module logic, and provide underlying communication component support for `DataParallel` to subsequently support CPU parallelism. (#35154)
- Migrate the `top_k` op in `paddle.metric.accuracy` to the `top_k_v2` op. (#35789)
- Fix the bug where the default `attr` is not found when running under `MKLDNN`. (#34567)
- Fix the bug in `optimizer` where `device_key` is not added to the `clear_float_status` OP. (#34431)
- Add the dynamic shape auto-configuration function in TensorRT sub-graph mode, and add a TensorRT offline-tuned dynamic shape setting method. For scenarios where the model is cut into multiple TensorRT sub-graphs, this improves ease of use. (#34806, #35771) For examples, see the demo.
  - The basic idea of the ease-of-use optimization: use Paddle's native execution to collect the shape ranges of all temporary tensors in the graph for the batch data input by the user, and set the collected shape ranges on the inputs of the TensorRT sub-graphs, thus avoiding the need for users to manually calculate the shape ranges of the input tensors of internal sub-graphs and improving ease of use.
  - Basic process of offline-tuned dynamic shape: after the user code is completed, set the config and enable the shape range collection capability via the C++ interface `config.CollectShapeRangeInfo("shape_range.pbtxt")` or the Python interface `config.collect_shape_range_info('shape_range.pbtxt')`, which stores the obtained shape ranges locally in prototxt format; then modify the config to disable shape collection, and enable TensorRT and the tuned dynamic shape capability via the C++ interface `config.EnableTunedTensorRtDynamicShape("shape_range.pbtxt", true)` or the Python interface `config.enable_tuned_tensorrt_dynamic_shape('shape_range.pbtxt', True)`; then run directly.
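A hedged Python sketch of the two-step offline-tuned dynamic shape flow described above; the model file names and TensorRT settings are placeholders:

```python
from paddle.inference import Config, create_predictor

# Step 1: run natively with Paddle and collect shape ranges.
config = Config('model.pdmodel', 'model.pdiparams')
config.collect_shape_range_info('shape_range.pbtxt')
predictor = create_predictor(config)
# ... feed representative batches through `predictor` here ...

# Step 2: rebuild the config with TensorRT and the tuned shapes enabled.
config = Config('model.pdmodel', 'model.pdiparams')
config.enable_use_gpu(1000, 0)
config.enable_tensorrt_engine(max_batch_size=1, min_subgraph_size=3)
config.enable_tuned_tensorrt_dynamic_shape('shape_range.pbtxt', True)
predictor = create_predictor(config)
```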
- Add native support for Ascend series hardware
- Quantization support
- API enhancements
  - Refactor the GO API based on the new version of the C API. (#33113) For the example, see the demo.
  - The prediction Python API `copy_from_cpu` and `copy_to_cpu` interfaces support float16 data types. (#34676)
  - Add the `config.Summary()` interface to print config configuration information. (#34122)
  - In the prediction library `version.txt`, record the TRT version information including the patch, e.g., v7.2.3.4 instead of v7. (#33690)
- Library volume compression
  - On Linux, the volume of the prediction library is pruned by strip, reducing the volume by 30 MB. (#34895)
- Other updates
- CPU related updates
  - Upgrade the oneDNN version to 2.3.2. (#35040)
  - Add the support of quant-aware LSTM oneDNN INT8 models. (#35382)
  - Add the support of post-training LSTM oneDNN INT8 models. (#35334, #33295)
  - Add the support of fusion_gru and multi_gru fusion and post-training INT8. (#33749)
  - Optimize the cache mechanism of oneDNN. (#35664, #35331, #35132, #35030, #35002, #34830, #33515, #33048, #32922, #32499)
  - Add oneDNN kernels for multiple ops (e.g., clip, scale, etc.). On the ch_ppocr_mobile_v1.1_det_infer, DPN68, fastscnn, hrnet, HRNet_W18_C, icnet, Res2Net50_26w_4s, and ssdlite_mobilenet_v3_large models, the single-core performance of an Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz increases by 47.8% with oneDNN enabled versus disabled. (#35601, #32975)
  - The optimized oneDNN LSTM INT8 model achieves a 1.59x performance improvement over the FP32 LSTM model on a single core of an Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz. (#35382, #35334, #34820, #34137)
- GPU and TensorRT sub-graph engine related updates
  - Add support for TensorRT 8.0. We will drop support for TensorRT 6.x in a future release. (#34403, #34294, #34157, #33777, #33680, #33662, #33654)
  - Add support for dynamic shape in the TensorRT `layer_norm` plugin. (#33448)
  - Add support for dynamic shape in the TensorRT `hard_swish` plugin. (#35214)
  - Add support for TensorRT `reduce_sum` and `gather_nd`. (#33324)
  - Add support for int8 in the TensorRT `qkv_context` plugin. (#34917, #35504)
  - Add support for TensorRT conv3d. (#35507)
  - Add support for broadcasting the input of the `multihead_matmul` fusion operator. (#35780)
- Nvidia Jetson native support enhancements
  - Add Op support for the Jetson Nano/TX2, two devices with lower computing power, with targeted optimizations. Now add the support for 17 OPs such as `pool2d`, `pool_max`, `conv3d_transpose`, etc. (#35378)
  - For the Jetson Nano, add new models: DPN68, EfficientNetB0, ttfnet, fcn_hrnetw18, hardnet. (#35378)
  - For the Jetson TX2, add new models: deeplabv3p_resnet50, deeplabv3_resnet50, fcn_hrnetw18, hardnet, pspnet, ttfnet, unet. (#35378)
- Kunlun XPU interface feature extensions
  - Add the `set_xpu_device_id` interface to support setting the device number of the Kunlun chip during inference. (#35572)
- Operator fixing
  - Fix split op: when the axis input is less than 0, an address access error occurs when converting to TensorRT. Filter out the cases that neither static nor dynamic shape supports when axis is equal to 0. (#35127)
  - Fix the bug where the transpose static shape is wrong when axis is `[0, 1]`. (#35138)
  - Fix the functional alignment of gather op with the native paddle op, and improve the op teller filtering conditions. (#35784)
  - Fix the int8 branching of fc op. (#34787, #32671)
  - Fix the op teller filtering condition for reshape. (#34787, #34583)
  - Fix poor multi-threaded inference efficiency for recurrent op. (#36053)
  - Fix the overflow bug of int values in gather and scatter op. (#35544)
  - Fix the ctc op divide-by-zero error. (#34724)
  - Fix a crash caused by inserting a scale op when the model input contains a bool type. (#35176)
  - Fix the failure of operations between a complex scalar and a Tensor. (#33699)
- Framework feature fixing
  - Fix an out-of-bounds GPU memory access bug when batching data for some ernie models. (#35077)
  - Fix a possible accuracy bug when running the ernie model in FP16 precision. (#34771)
  - Fix the incorrect output bug due to an inconsistent order of inputs when the ernie input becomes longer. (#33575)
  - Fix a bug where the allocator function is abnormal in the multi-stream state. (#32932)
- TensorRT sub-graph engine fixing
- Fix an out-of-bounds error reporting bug with slice plugin's ends parameter in the TensorRT dynamic shape. (#35357)
- Fix the bug that keepdim=false is not supported when the reduce op is converted to TensorRT with reduce_all = 1. (#35145)
- Fix the decrease_axis parameter bug when slice op is converted to TensorRT. (#35100)
- Fix the bug that a negative scale is not supported when the nearest_interp op is converted to TensorRT dynamic shape. Fix a bug where scale has a higher priority than outh and outw. (#35405)
- Fix the bug that the pad op's paddings parameter differs from TensorRT's. (#35371)
- Add 4-dimensional padding support when converting the conv2d op to TensorRT. Filter out the cases where padding_algorithm is SAME or VALID when the conv2d op is converted to TensorRT. (#35627)
- Add handling of padding_algorithm being SAME when converting the pool2d op to TensorRT. Filter out the cases where ksize in exclusive mode is less than or equal to paddings. (#35923)
- Fix the bug of not supporting the Min and Max inputs when clip op is converted to TensorRT. (#35694)
- Fix the bug of not supporting the approximate attribute when gelu op is converted to TensorRT. (#35529)
- Fix the bug of not supporting the 2-dimensional inputs when affine_channel is converted to TensorRT. (#35496)
- Fix an unstable TensorRT sub-graph matching bug. (#35147)
- Fix the bug of the TensorRT engine not released after prediction engine destruction. (#35842, #35938)
- Fix the incorrect conversion of the reshape operator in paddle-trt static mode when the batch dimension of its shape attribute is -1. (#34007)
- Fix the bug of not supporting the RoisNum attribute when roi_align is converted to TensorRT. Fix the incorrect computation when aligned is True and Sampling_ratio = -1 in dynamic shape. (#35549)
- Fix the bug of not supporting the AxisTensor property when concat is converted to TensorRT. (#35545)
- Fix the bug of not supporting ScaleTensor property when scale is converted to TensorRT or not supporting 1-dimensional input in the static shape. (#35225)
- Fix the bug of not supporting the MomentumTensor property when batchnorm is converted to TensorRT. (#35527)
- Fix the bug of not supporting ShapeTensor when reshape is converted to TensorRT, fix the bug of not supporting the 1-Dimensional input in the Shape property and static shape. (#35166)
- Add support for TensorRT tile operator. (#34388)
- Add support for TensorRT reduce mean operator. (#34204)
- Fix a possible crash when using gather op. (#33999)
- Fix a flag in TensorRT int8 that incorrectly enabled debug mode (running only the int8 kernel, resulting in performance degradation). (#34704)
- Fix a computation error with gather_nd op when calling TensorRT on 2-dimensional inputs. (#35464)
- Fix hard_sigmoid op computation error when calling TensorRT with 2-dimensional input. (#35908)
- Fix computation error in prelu op when calling TensorRT on 2-dimensional inputs. (#35512)
- Fix a crash caused by using the right slash as the path separator in TensorRT inference on Windows. (#33853)
- Fix the error that occurs when the inverse-operator pruning script encounters Chinese-character comments. (#33937, #33919)
- Fix the compile-time error caused by incomplete downloads of unit-test inference models; add MD5 validation for test model downloads. (#33264, #33217)
- Fix broadcast bug in blazeface model where mkldnn elementwise op is not supported. (#33549)
- Fix swin_transformer mkldnn inference error reporting bug. (#35740)
- Fix the error reported when paddlex.deploy.Predictor runs the unet model with oneDNN in multi-threaded execution. (#35231)
- Fix the bug with oneDNN setCacheCapacity not limiting memory. (#33571)
- For Windows, support the `Ninja` compilation build method, greatly improving compilation speed, ease of use, and stability. Windows users can perform local source code compilation of Paddle via `pip install ninja`. (#31161, #31449, #32987, #33140, #33155)
- Keep only Python 3.7 in the release image, and remove Python 3.5, 3.6, 3.8, 3.9 and the Paddle packages of the corresponding Python versions. The image size is reduced by 30%~50%. (#32688)
- The TensorRT library is only used for inference, and the release image only supports basic Paddle training functions without needing TensorRT. Remove the TensorRT library from the release image to prevent users from using it by mistake. (#34266)
- Support Hygon DCU chip training and inference, covering up to 9 model categories and 70 models.
- For Hygon DCU, add the support for 5 PaddleDetection models.
- For Hygon DCU, add the support for 6 PaddleGAN models.
- For Hygon DCU, add the support for 13 PaddleSeg models.
- For Hygon DCU, add the support for 3 PaddleNLP models.
- For Hygon DCU, add the support for 4 PaddleOCR models.
- For Hygon DCU, add the support for 3 PaddleVideo models.
- Support Kunlun 2nd generation chip (XPU-2) training. Support ResNet50, SSD, Bert, Transformer, and many other models. Support static graph + dynamic graph training. Support mixed precision training.
This release contains contributions from:
0x45f, 123malin, Adam Osewski, Aganlengzi, Aurelius84, Baibaifan, Bo Liu, CheQiXiao, Chen Long, Chen Weihang, CtfGo, Double_V, Ethanzjp, Fan Zhang, Feiyu Chan, Feng Xing, From00, GT-Zhang, Guanghua Yu, Guoxia Wang, Haipeng Wang, Hao Lin, Haohongxiang, Hui Zhang, Huihuang Zheng, HydrogenSulfate, IMMORTAL, JYChen, JZ-LIANG, Jacek Czaja, Jack Zhou, Jackwaterveg, Jeng Bai-Cheng, Jiangxinz, Jiaqi Liu, Jiawei Wang, JingZhuangzhuang, June Weng, Kaipeng Deng, Kqnonrime, LJQ❤️, Leo Chen, Li Min, LielinJiang, Lijunhui, Linjie Chen, Liu-xiandong, LiuWei, Ming-Xu Huang, MissPenguin, PaddlePM, Pei Yang, Peihan, Qi Li, QingshuChen, Ren Wei (任卫), Roc, Shang Zhizhou, ShenLiang, Shibo Tao, Siming Dai, Sing_chan, TCChenLong, TTerror, TeslaZhao, Thomas Young, Thunderbrook, Tongxin Bai, WJJ1995, WangXi, Wangzheee, Wei Shengyu, WeiXin, Weilong Wu, Wenyu, Wilber, XGZhang, XYZ, XYZ916829, XiangGao, Xiaoxu Chen, YUNSHEN XIE, Yanxing Shi, Yiqun Liu, YuanRisheng, Yuang Liu, Yulong Ao, Zeng Jinle, Zhang Ting, Zhang Zheng, Zhanlue Yang, Zhen Wang, Zhong Hui, Zhou Wei, andreazanetti, andyjpaddle, arlesniak, baoachun, cc, ceci3, chajchaj, chenenquan, chenjian, chentianyu03, crystal, cuicheng01, danleifeng, denglin-github, duanboqiang, dyning, feng626, feng_shuai, furnace, gongweibao, heliqi, hlygit66666, hong, hong19860320, houj04, huangjun12, huangxu96, huzhiqiang, iducn, jakpiase, jiangcheng, joanna.wozna.intel, jzhang533, kuizhiqing, levi131, lidanqing, lilong12, limingshu, littletomatodonkey, liu zhengxi, liutiexing, liuyuhui, liym27, lyuwenyu, lzzyzlbb, niuliling123, pangyoki, parap1uie-s, ronnywang, root, seemingwang, shangliang Xu, shiyutang, smallv0221, sunli, sunzhongkai588, taixiurong, tangwei12, tianshuo78520a, veyron95, wangguanqun, wangguanzhong, wanghuancoder, wangna11BD, wangxinxin08, wangzhen38, wangzhuang01, wawltor, wenbin, whs, will-jl944, wuhuachaocoding, wuhuanzhou, xiaoting, xiaoxiaohehe001, xiayanming, xiegegege, xiemoyuan, xiongkun, yaoxuefeng, yeliang2258, yingyibiao, zhangbo9674, zhangchunle, zhangkaihuo, zhaoyingli, zhiboniu, zhoujun, zhouzj, zhulei, zhupengyang, zlsh80826, zmx, zyfncg, 李季, 津, 王明冬, 石晓伟