Release Manager
Endgame
- Code freeze: October 2025
- Bug Bash date: TBD
- Release date: TBD
Main Features
SuperBench Improvement
- Add cuda13.0.dockerfile support (#739)
- Add nsys and PyTorch profiler debug trace support (#744)
Micro-benchmark Improvement
- Collect per-snapshot, per-GPU FLOPS and temperature in gpu burn (#735)
- Add simultaneous all-to-host / host-to-all bandwidth test cases to nvbandwidth (#736)
- Add ncu profile support in cublaslt-gemm (#740)
- Support verification and parallel runs for the disk performance benchmark (#741)
- Add NUMA support for nvbandwidth (#742)
- Change cublasLtMatmulDescCreate scaleType from CUDA_R_32F to CUDA_R_16F in FP16 dist inference (#732)
- Support GEMM correctness check in cublaslt-gemm
- Multi-node NCCL validation enhancement
- mscclpp support
- Add new busbw metrics for NCCL/MSCCL testing with a specific algorithm
- Fix NVBandwidth benchmark results parsing bug
- Support FP4 kernels for the cutlass benchmark
Model Benchmark Improvement
- Add option to exclude data copy time in model benchmarks (#734)
- Support training performance benchmarks for state-of-the-art LLMs, including DeepSeek and Qwen
- Support inference performance benchmarks for state-of-the-art LLMs, including DeepSeek and Qwen
- Support module and model correctness benchmarks for state-of-the-art LLMs
- Deterministic training support (#731)
Bug fix
- Fix cuBLASLt error raised by dist-inference
- Add --set_ib_devices option to auto-select IB device by MPI local rank in ib validation (#733)
- NVBandwidth benchmark results parsing bug (#748)
- CI/CD - Fix image merge in GitHub Action (#749)
- Fix pipelines - Update mlc version in dockerfiles from v3.11 to v3.12 (#752)
- CI/CD - Fix python3.10 pipeline (#753)
- CI/CD - Fix Azure test pipeline (#754)
Tools
- System info enhancement