[WIP][CI] update coverage CI to CUDA 13.2 #79330
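CI report generated from the following code (updated every 30 minutes):
1. Required tasks: 46/49 passed
2. Failure details: 🔴 Coverage build docker / Build docker: PR issue, Python 3.9 pip script error (confidence: high). Error type: PR issue | Confidence: high
Key logs:
Suggested fix:
Related changes: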
Pull request overview
This PR updates the standard Coverage CI environment to build coverage images on CUDA 13.2 / Ubuntu 24.04, and adjusts the CUDA architecture target configuration.
Changes:
- Switch coverage Dockerfile generation to `nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04` and remove `cuda.list` rewrite steps that can break on NGC images.
- Install CUDA 13.2 NCCL packages into the generated coverage image.
- Update `Coverage.yml` to set `CUDA_ARCH_NAME` to `Hopper` for coverage build/test jobs.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tools/dockerfile/ci_dockerfile.sh | Generates the coverage Dockerfile from the Ubuntu 24.04 template and adapts it for CUDA 13.2 (incl. NCCL and removing cuda.list rewrite steps). |
| .github/workflows/Coverage.yml | Changes coverage CI CUDA architecture target to Hopper for build and test jobs. |

        FLAGS_fraction_of_gpu_memory_to_use: 0.15
        CTEST_PARALLEL_LEVEL: 2
        WITH_GPU: "ON"
    -   CUDA_ARCH_NAME: Volta
    +   CUDA_ARCH_NAME: Hopper
        WITH_AVX: "ON"

        FLAGS_fraction_of_gpu_memory_to_use: 0.15
        CTEST_PARALLEL_LEVEL: 2
        WITH_GPU: "ON"
    -   CUDA_ARCH_NAME: Auto
    +   CUDA_ARCH_NAME: Hopper
        WITH_AVX: "ON"

| function make_ubuntu24_cu132_dockerfile(){ | ||
| dockerfile_name="Dockerfile.cuda117_cudnn8_gcc82_ubuntu18_coverage" | ||
| sed "s#<baseimg>#nvidia/cuda:12.0.1-cudnn8-devel-ubuntu22.04#g" ./Dockerfile.ubuntu22 >${dockerfile_name} | ||
| sed -i "s#<setcuda>#ENV LD_LIBRARY_PATH=/usr/local/cuda-12.0/targets/x86_64-linux/lib:\$LD_LIBRARY_PATH #g" ${dockerfile_name} | ||
| sed "s#<baseimg>#nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04#g" ./Dockerfile.ubuntu24 >${dockerfile_name} |
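
The `<baseimg>` substitution in the hunk above can be tried in isolation. The following is a minimal sketch, not the PR's actual script: `render_dockerfile` and the throwaway template are hypothetical names introduced only for illustration, and only the `sed "s#<baseimg>#...#g"` pattern and the image tag are taken from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render a Dockerfile from a template that contains a
# <baseimg> placeholder. Illustrative only; not the code in ci_dockerfile.sh.
render_dockerfile() {
  local template="$1" base_image="$2" output="$3"
  # '#' is the sed delimiter because image references contain '/'.
  sed "s#<baseimg>#${base_image}#g" "${template}" > "${output}"
}

# Demo with a throwaway template so the sketch runs anywhere.
demo_dir="$(mktemp -d)"
printf 'FROM <baseimg>\nRUN echo "building on <baseimg>"\n' \
  > "${demo_dir}/Dockerfile.template"

render_dockerfile "${demo_dir}/Dockerfile.template" \
  "nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04" \
  "${demo_dir}/Dockerfile.out"

# Show the rendered FROM line.
head -n 1 "${demo_dir}/Dockerfile.out"
```

Writing to a separate output file, as the diff does with `>${dockerfile_name}`, leaves the template untouched so it can be reused for other base images.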
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-06-19 21:12:06
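📋 Review summary
PR overview: Switches the image generation logic of the standard Coverage CI to CUDA 13.2 / Ubuntu 24.04 and adjusts the CUDA architecture target for coverage build/test.
Scope of changes: .github/workflows/Coverage.yml, tools/dockerfile/ci_dockerfile.sh
Impact tags: [Environment Adaptation] [Execute Infrastructure]
Issues
| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | .github/workflows/Coverage.yml:295 | The Coverage test still runs on BD_BJ-V100, but build/test are pinned to Hopper (sm_90), so the build artifacts cannot run on a V100 (sm_70) |
Status of previous findings
| Finding | Issue | Status |
|---|---|---|
| F1 | After switching to the H-Coverage runner, the GPU device selection logic of that runner group must be reused. | ✅ Fixed |
📝 PR convention check
The title uses [WIP][CI], which is not an official Category/Type tag under checklist §D2; the description structure is complete, and "Does this cause a precision change" is answered "No".
Suggested title (ready to copy):
[Environment Adaptation] update coverage CI to CUDA 13.2
Overall assessment
No new blocking issues were found in the Dockerfile migration itself, but the coverage architecture target and the runner hardware need to be aligned; otherwise this CI will fail systematically in the test stage and cannot validate this PR's CUDA 13.2 migration.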

        CTEST_PARALLEL_LEVEL: 2
        WITH_GPU: "ON"
    -   CUDA_ARCH_NAME: Auto
    +   CUDA_ARCH_NAME: Hopper

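🔴 Bug: The Coverage test still runs on `BD_BJ-V100`, but this change pins the test environment to `CUDA_ARCH_NAME=Hopper`.
`CUDA_ARCH_NAME` is passed to CMake via ci/utils.sh / ci/run_setup.sh; in cmake/cuda.cmake, `Hopper` expands only to `CUDA_ARCH_BIN=90`, whereas `BD_BJ-V100` is a V100 (sm_70) runner. If the wheel produced by the build contains only Hopper cubins, the subsequent coverage test on the V100 will hit a CUDA architecture incompatibility instead of validating the CUDA 13.2 migration.
Suggested fixes:
- To keep `BD_BJ-V100`, keep `Volta`/`Auto` here and on the build side, or explicitly include `70`.
- To validate Hopper / CUDA 13.2, switch the Coverage test to an `H-Coverage`-class Hopper runner and reuse this workflow's `determine_gpu_runner` / `GPU_DEVICES` selection logic.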
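
The mismatch described in this comment can be expressed as a small pre-flight check. This is a rough sketch under stated assumptions: `cubin_runs_on` and `check_arch_bin` are hypothetical helpers, not part of the Paddle CI scripts. It applies only the CUDA binary-compatibility rule for cubins (same major compute capability, device minor at least the built minor), ignores PTX JIT fallback, and expects plain numeric entries such as `70` or `90`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not taken from ci/utils.sh or cmake/cuda.cmake.

# Succeeds if a cubin built for sm_<built> can load on a device whose
# compute capability is <device> (both as plain numbers, e.g. 70, 90).
cubin_runs_on() {
  local built="$1" device="$2"
  # Same major version, and the device minor is at least the built minor.
  [ "$(( built / 10 ))" -eq "$(( device / 10 ))" ] \
    && [ "$(( built % 10 ))" -le "$(( device % 10 ))" ]
}

# Prints "compatible" if any entry of a space-separated arch list
# (in the style of CUDA_ARCH_BIN) can run on the given device.
check_arch_bin() {
  local arch_bin="$1" device="$2" arch
  for arch in ${arch_bin}; do   # intentional word splitting on spaces
    if cubin_runs_on "${arch}" "${device}"; then
      echo "compatible"
      return 0
    fi
  done
  echo "incompatible"
  return 1
}

check_arch_bin "90" 70 || true   # Hopper-only build on a V100: incompatible
check_arch_bin "70 90" 70        # build that also includes sm_70: compatible
```

Running such a check before the test stage would surface the problem as a clear configuration error rather than as CUDA failures inside individual tests.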
PR Category
Environment Adaptation
PR Types
Improvements
Description
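This PR switches the coverage image generation logic of the standard Coverage CI to CUDA 13.2 / Ubuntu 24.04 and changes the CUDA architecture target of coverage build / test to Hopper.
Main changes:
- `nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04` as the base image.
- Removes the `cuda.list` rewrite steps inherited from `Dockerfile.ubuntu24`, to avoid docker build failures when `/etc/apt/sources.list.d/cuda.list` does not exist in the CUDA 13.2 NGC base image.
- […] `BD_BJ-V100`.

Local verification:
- `prek --files .github/workflows/Coverage.yml tools/dockerfile/ci_dockerfile.sh`
- `git diff --check -- .github/workflows/Coverage.yml tools/dockerfile/ci_dockerfile.sh`
- `bash -n tools/dockerfile/ci_dockerfile.sh`
- `Dockerfile.cuda117_cudnn8_gcc82_ubuntu18_coverage`: confirmed that the base image is CUDA 13.2 / Ubuntu 24.04, that it includes the cuda13.2 NCCL packages, that it has no TensorRT installation lines, and that it no longer contains `mv`/`sed` operations on `/etc/apt/sources.list.d/cuda.list`.

Does this cause a precision change?
No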
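
The manual inspection of the generated Dockerfile listed under local verification could be scripted along these lines. This is a sketch only: `verify_coverage_dockerfile` is a hypothetical function, the TensorRT check is a loose case-insensitive name match, and the NCCL check is left out because the exact package names are not given in this PR.

```shell
#!/usr/bin/env bash
# Hypothetical check that mirrors the manual inspection; not part of the PR.
verify_coverage_dockerfile() {
  local dockerfile="$1" status=0
  # The base image should be the CUDA 13.2 / Ubuntu 24.04 tag.
  grep -q '^FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04' "${dockerfile}" \
    || { echo "unexpected base image"; status=1; }
  # No mv/sed step should touch the CUDA apt source list.
  if grep -Eq '(mv|sed).*/etc/apt/sources\.list\.d/cuda\.list' "${dockerfile}"; then
    echo "cuda.list rewrite still present"; status=1
  fi
  # TensorRT install lines should be absent (matched loosely by name).
  if grep -qi 'tensorrt' "${dockerfile}"; then
    echo "TensorRT install line found"; status=1
  fi
  [ "${status}" -eq 0 ] && echo "ok"
  return "${status}"
}

# Demo on two throwaway files so the sketch runs without the real repo.
tmp="$(mktemp -d)"
printf 'FROM nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04\nRUN apt-get update\n' \
  > "${tmp}/good"
printf 'FROM nvidia/cuda:12.0.1-cudnn8-devel-ubuntu22.04\nRUN mv /etc/apt/sources.list.d/cuda.list /tmp/\n' \
  > "${tmp}/bad"

verify_coverage_dockerfile "${tmp}/good"          # prints: ok
verify_coverage_dockerfile "${tmp}/bad" || true   # prints the two failed checks
```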