Releases: sophgo/tpu-mlir
v1.12
Features
- Support for backend operators implemented using PPL.
- TPUv7-runtime CModel integrated with TPU-MLIR for BM1690 model CModel inference.
- Optimized inference speed for BM1690 Stable Diffusion 3.0 at 512 resolution to 0.72 img/s (MAC utilization: 41.9%).
- Support for training graph compilation of ResNet50-v1 through FxGraphConverter.
Bug Fixes
- Performance: Fixed the issue of performance degradation in SegNet.
- Functionality: Resolved the compilation comparison issue for BM1688 DeepLabV3+.
Known Issues
- Performance: Slight performance degradation observed for BM1690 YOLOv5-6 with 4-batch INT8 on eight cores.
v1.12-beta.0
Combine Slice and Concat into a new Rope via the ConcatToRope pattern. Change-Id: Ib15b12fe97117b96c6fe7267c96c3f714aac6ec4
v1.11
[python] distinguish model-zoo data path from regression. Change-Id: I98fa0df1524f0b38d91cda02ab5d49876f7caee8 (cherry picked from commit fa082d0b29df8a82af77839df86349aabab86949)
v1.11-beta.0
[soc_dump] add doc Change-Id: Icaf313113415a9bf0ad9c75abdcb609d661c815b
TPU-MLIR v1.10 Release
Release Note
Enhancements:
- Added CUDA support for various operations like conv2d, MatMul, dwconv, pool2d, and more.
- Improved performance for operations like MeanStdScale and softmax.
- Enhanced multi-core batch mm and added support for bm168x with CUDA.
- Refined CUDA code style and adjusted interfaces for various operations.
Bug Fixes:
- Fixed issues with matmul, calibration failures, conv pad problems, and various performance problems.
- Addressed bugs in model transformations, calibration, and various pattern issues.
- Resolved bugs in different model backends like ssd, vit, detr, and yolov5.
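Several of the fixes above touch calibration. As background, the simplest calibration scheme just tracks the absolute maximum across sample activations to derive a symmetric int8 scale; this is only an illustrative toy (tpu-mlir's calibration tooling uses more elaborate histogram/KL-based methods), and `minmax_calibrate` is a hypothetical name:

```python
import numpy as np

def minmax_calibrate(activations):
    """Toy symmetric min/max calibration over a list of activation tensors.

    Illustrative sketch only: returns the int8 quantization scale such that
    the largest observed magnitude maps to +/-127. Real calibration flows
    typically use histograms and divergence metrics instead.
    """
    absmax = max(float(np.abs(a).max()) for a in activations)
    return absmax / 127.0
```

Running a model over a calibration dataset and feeding each layer's outputs through such a routine yields one scale per tensor, which the quantized graph then bakes in.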
New Features:
- Added support for new models like resnet50, mobilenet_v2, shufflenet_v2, and yolox_s/alphapose_res50.
- Introduced new operations like RequantIntAxisOp and Depth2Space with CUDA support.
- Implemented new functionalities for better model inference and compilation.
Documentation Updates:
- Updated weight.md, calibration sections, and user interface details.
- Improved documentation for quick start, developer manual, and various tpulang interfaces.
- Enhanced documentation for model transformation parameters and tensor data arrangements.
Miscellaneous:
- Added new npz tools, modelzoo regression, and support for bmodel encryption.
- Fixed issues with various model performance, shape inference, and CUDA backend optimizations.
- Restored performance for models like yolov5s-6, bm1690 swin multicore, and more.
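The npz tooling mentioned above is used to diff reference and deployed-model outputs. A minimal sketch of such a comparison might look like the following; `compare_npz`, its tolerance semantics, and the return format are assumptions for illustration, not tpu-mlir's actual `npz_tool` interface:

```python
import numpy as np

def compare_npz(ref_path, got_path, tolerance=1e-5):
    """Compare two .npz files tensor-by-tensor.

    Returns a list of (tensor_name, reason) pairs for every tensor that is
    missing from the second file or differs beyond the given tolerance.
    """
    ref, got = np.load(ref_path), np.load(got_path)
    mismatched = []
    for name in ref.files:
        if name not in got.files:
            mismatched.append((name, "missing"))
            continue
        if not np.allclose(ref[name], got[name], atol=tolerance):
            max_err = np.abs(ref[name] - got[name]).max()
            mismatched.append((name, f"max abs error {max_err:.3g}"))
    return mismatched
```

An empty result means the two dumps agree within tolerance, which is the usual pass condition for a deploy-time comparison step.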
TPU-MLIR v1.9 Release
Release Note
Enhancements:
- Implemented output order preservation in converters like ONNX, Caffe, Torch, and TFLite.
- Added support for resnet50-v2 bm1690 f8 regression.
- Improved ILP group mlir file sequences for resnet50 training.
- Updated chip libraries and PerfAI for A2 profiling.
- Added a new dump mode "COMB" and refined abs/relu conversions.
Bug Fixes:
- Fixed issues with preprocess when source layout differs from target layout.
- Addressed bugs in various operations like softmax, concat, and weight reorder in conv2d.
- Resolved bugs in model training, model transformation, and various pattern issues.
- Fixed bugs related to CUDA inference, matmul with bias, and multi-output calibration.
New Features:
- Added support for multi-graph in TPULang.
- Introduced new options in TPULang for inference and model deployment.
- Implemented various optimizations and enhancements for dynamic operations and model transformations.
Documentation Updates:
- Refined documentation for quick start quantization and user interface sections.
- Updated backend information, docker image download methods, and model deployment details in the documentation.
Miscellaneous:
- Improved performance for various models like vit, yolov5s, and bm1690.
- Introduced new functionalities like embedding multi-device slice and groupnorm train operations.
- Added support for adaptive_avgpool inference and multiple Einsum modes.
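Supporting multiple Einsum modes means the converter must recognize different subscript equations and lower each to the right primitive. The equations below are illustrative assumptions of common forms (batched MatMul and a transposed-weight projection), not tpu-mlir's actual supported list:

```python
import numpy as np

a = np.random.rand(2, 3, 4)   # (batch, rows, inner)
b = np.random.rand(2, 4, 5)   # (batch, inner, cols)
w = np.random.rand(5, 4)      # weight stored transposed

# Mode 1: batched matrix multiply, lowerable to a MatMul op.
batch_mm = np.einsum("bik,bkj->bij", a, b)
assert batch_mm.shape == (2, 3, 5)

# Mode 2: contraction against a transposed weight, i.e. MatMul with
# right_transpose, avoiding an explicit Transpose op.
proj = np.einsum("bik,jk->bij", a, w)
assert proj.shape == (2, 3, 5)
```

Pattern-matching the subscript string lets a compiler map each mode onto hardware MatMul variants instead of falling back to a generic (and slower) reduction loop.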
TPU-MLIR v1.8.1
Full Changelog: v1.8...v1.8.1
TPU-MLIR v1.8 Release
Highlights:
Enhancements:
- Added support for dynamic shape inference in various operations.
- Optimized core operations for better performance on specific models.
- Improved backend support for multiple models like BM1684X, BM1688, BM1690, SG2380, etc.
- Introduced new operations and patterns for more efficient model processing.
- Updated documentation for better clarity and user guidance.
Bug Fixes:
- Resolved issues related to input/output handling, kernel configurations, and model-specific bugs.
- Fixed bugs in dynamic compilation, core parallel processing, and various backend operations.
- Addressed errors in specific model post-processing steps like YOLOv5, EfficientNet, etc.
Performance Improvements:
- Optimized cycle calculations for multi-core models.
- Enhanced bandwidth usage statistics for better resource management.
- Accelerated compilation processes for training models using a new layer-group scheme.
New Features:
- Introduced new operations like attention quant block, prelu op, and various dynamic compile features.
- Added support for additional operations, weight location, and dynamic compile enhancements.
Documentation Updates:
- Updated developer manuals, quick start guides, and model-specific documentation for better understanding.
Miscellaneous:
- Streamlined workflows for faster commit checks and improved debugging processes.
- Added new test cases for regression testing and script-based model evaluations.
- Fine-tuned backend operations for improved model performance and accuracy.
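The multi-core cycle calculations mentioned above reduce, at their simplest, to estimating the makespan of layers distributed across cores. This is a toy longest-processing-time (LPT) greedy sketch; `multicore_cycles` and its inputs are hypothetical and not tpu-mlir's actual layer-group scheduler:

```python
def multicore_cycles(layer_cycles, num_cores):
    """Estimate total cycles when independent layers are greedily assigned
    to the least-loaded core, longest layer first (LPT heuristic).

    The result is the busiest core's total, i.e. the schedule makespan.
    """
    per_core = [0] * num_cores
    for c in sorted(layer_cycles, reverse=True):
        per_core[per_core.index(min(per_core))] += c
    return max(per_core)
```

For example, four layers costing 4, 3, 2, and 1 cycles on two cores schedule to a makespan of 5 rather than the single-core total of 10, which is where the multi-core speedup estimate comes from.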
TPU-MLIR v1.7 Release
Change Log
New Features
- Added support for new operations including flash attention, custom op dynamic compile, and tpulang ops.
- Enabled AttnReorder and added support for dynamic indices in ops like onehot, scatterelements, and cumsum.
- Added `--dump_dataframe` option for bmodel_checker and support for transpose with order [1, 2, 3, 0].
- Introduced Watchpoint feature to TDB and added support for mixed-precision networks.
- Implemented optimizations for dma efficiency of flash attention and optimized backend for various models.
- Added support for local memory dump in pcie mode and added various quantization features like eva quant, swin quant, and detr quant.
- Enhanced multi-core support including support for LayerNorm and GroupNorm in coreParallel, and multi-core data slice in tensorLocation.
- Added new patterns for Cswin and Einsum operations.
- Improved support for LLM (Large Language Models) in bm1688.
Bug Fixes
- Fixed various bugs including kernel_module msg_id, SAM-VIT-encoder regression, and attention accuracy problems.
- Addressed logical issues in AddToScale pattern and issues in fp_forward.
- Resolved bugs in model info core dump, op's liveRange in coreParallel, and DevParallel bugs.
- Fixed issues in model combine with io alone and bugs in various ops like interp, RotaryPosEmbPattern, and efficient-lite4 permute.
Performance Improvements
- Improved the performance of TDB and the bmodel_checker for 1684x pcie.
- Optimized facenet and fixed performance issues of 1688 multicore.
- Enabled single-core mode optimizations where necessary.
Documentation and Testing
- Updated documentation, refined custom chapters, and ensured consistency in quick start docs.
- Added test cases for custom tpulang, multi-core with subnets, and custom cpuop.
- Fixed various documentation errors and updated the release note.
Other Changes
- Added restrictions to tpulang ops and net test cases.
- Adjusted descriptions and refined interfaces for better user experience.
- Updated backend .so files and addressed sensitive words in the codebase.
- Added support for int4 dtype in tpu_profile and ensured tool/scripts work in Python virtual environments.
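The int4 dtype support mentioned above implies a packed storage format, since hardware and profilers address bytes. This is a hedged sketch of one plausible layout (two signed nibbles per byte, low nibble first); the real tpu_profile layout may differ, and `pack_int4`/`unpack_int4` are illustrative names:

```python
import numpy as np

def pack_int4(vals):
    """Pack signed int4 values (range [-8, 7]) two per byte, low nibble first."""
    v = (np.asarray(vals, dtype=np.int8) & 0x0F).astype(np.uint8)  # 2's-complement nibbles
    if v.size % 2:
        v = np.append(v, 0)  # pad odd counts with a zero nibble
    return (v[0::2] | (v[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed, count):
    """Inverse of pack_int4: recover the first `count` signed int4 values."""
    p = np.asarray(packed, dtype=np.uint8)
    nibbles = np.empty(p.size * 2, dtype=np.int8)
    nibbles[0::2] = (p & 0x0F).astype(np.int8)
    nibbles[1::2] = (p >> 4).astype(np.int8)
    nibbles = np.where(nibbles > 7, nibbles - 16, nibbles)  # sign-extend
    return nibbles[:count]
```

Halving the bytes per element is exactly what makes int4 attractive for weight storage, at the cost of the pack/unpack step shown here.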
Technical Preview
Features
- Added support for LLM Decoding by utilizing multi-cores to enhance processing efficiency.
- Introduced `fx2mlir`, a new functionality for enhanced MLIR conversion.
- Implemented `nnvlc2.0` and `nnvlc1.0` local activation and weight operations, respectively, for improved neural network performance.
- Enabled `TPULANG` support for operations like sort, argsort, and additional ops, enhancing the language's functionality and flexibility.
- Added `cv186x` support in `run_sensitive_layer.py` and for the TDB, expanding compatibility and debugging capabilities.
- Introduced new ops and features like `Watchpoint` in TDB and activation-op support for scale & zero_point, broadening the range of functionalities available in the `tpu-mlir` project.
- Added support for `BM1690`.
- L2mem performs intermediate data exchange for active tensors.
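The activation-op support for scale & zero_point mentioned in this release refers to asymmetric quantization parameters. This is a hedged sketch of the standard uint8 scheme (round-to-nearest and [0, 255] clamping are assumptions; the backend's exact rounding rules may differ):

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Asymmetric uint8 quantization: q = clamp(round(x / scale) + zp, 0, 255)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """Inverse mapping back to float: x = (q - zp) * scale."""
    return (q.astype(np.float32) - zero_point) * scale
```

A nonzero zero_point lets a uint8 range represent values below zero, e.g. with `scale=0.02, zero_point=10` the value -0.2 quantizes to 0 and dequantizes back exactly.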
Bug Fixes
- Resolved a variety of bugs affecting backend processes, including issues with the `1684x` backend, `permutefuse2`, `permutemulconstswap`, and more, improving overall stability and performance.
- Fixed several critical issues across `tpulang`, including errors in the `sort_by_key`, `reshape`, and `where` operations, and more, enhancing the language's reliability for developers.
- Addressed bugs in model processing, including fixes for `concat` logic, `scale2conv`, `scale2conv3d`, `instance norm`, and several more, ensuring smoother model optimization and execution.
- Corrected errors in the documentation, providing clearer and more accurate information for users and developers.
Documentation Updates
- Updated `tpulang` documentation to include new functionalities and optimizations, making it easier for users to understand and utilize the language effectively.
Performance Improvements
- Optimized TDB and `bmodel_checker` for `1684x pcie` mode, significantly reducing processing times and enhancing efficiency for model analysis.
- Improved the efficiency of DMA in flash attention operations, ensuring faster data handling and processing.
- Enabled IO tag mode and refined address mode for better memory management and operational flexibility.