
[WIP][CI] update coverage CI to CUDA 13.2#79330

Open
gouzil wants to merge 2 commits into PaddlePaddle:develop from gouzil:coverage-ci-cuda132

gouzil commented Jun 17, 2026

Member

PR Category

Environment Adaptation

PR Types

Improvements

Description

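This PR switches the coverage image generation logic of the standard Coverage CI to CUDA 13.2 / Ubuntu 24.04 and changes the CUDA architecture target of the coverage build / test jobs to Hopper.

Main changes:

  • Update the coverage Dockerfile generation function to use nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 as the base image.
  • Install the NCCL packages matching CUDA 13.2 in the generated coverage image.
  • Keep TensorRT out of the coverage image, as in the original coverage image logic.
  • Remove the cuda.list rewrite step inherited from Dockerfile.ubuntu24, so that docker build does not fail when /etc/apt/sources.list.d/cuda.list is absent from the CUDA 13.2 NGC base image.
  • Keep the Coverage test runner group at the original BD_BJ-V100.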
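
The base-image substitution and the removal of the cuda.list steps can be illustrated with a small sketch. It is not the PR's actual code: the function name `render_coverage_dockerfile` is hypothetical, and it assumes every cuda.list step in the template is a single line, so deleting matching lines is enough.

```shell
# Hypothetical helper (name and deletion strategy are assumptions, not the PR's code).
render_coverage_dockerfile() {
    template="$1"   # template containing the <baseimg> placeholder
    output="$2"     # generated Dockerfile
    # 1) Fill the <baseimg> placeholder with the CUDA 13.2 NGC image.
    # 2) Drop every line that touches /etc/apt/sources.list.d/cuda.list,
    #    assuming each such step fits on one line.
    sed -e 's#<baseimg>#nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04#g' \
        -e '\#/etc/apt/sources\.list\.d/cuda\.list#d' \
        "$template" > "$output"
}
```

A call such as `render_coverage_dockerfile ./Dockerfile.ubuntu24 Dockerfile.cuda117_cudnn8_gcc82_ubuntu18_coverage` would match the file names mentioned in this PR. Multi-line `RUN` steps joined with backslashes would need a different approach.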

Local verification:

  • prek --files .github/workflows/Coverage.yml tools/dockerfile/ci_dockerfile.sh
  • git diff --check -- .github/workflows/Coverage.yml tools/dockerfile/ci_dockerfile.sh
  • bash -n tools/dockerfile/ci_dockerfile.sh
  • Generated Dockerfile.cuda117_cudnn8_gcc82_ubuntu18_coverage with GNU sed and confirmed that the base image is CUDA 13.2 / Ubuntu 24.04, that it includes the cuda13.2 NCCL packages, that it has no TensorRT install lines, and that it no longer contains the mv / sed operations on /etc/apt/sources.list.d/cuda.list.
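
The manual inspection in the last item could be scripted roughly as follows. This is only a sketch: `verify_coverage_dockerfile` is a hypothetical name, and the grep patterns are assumptions about the generated file (for example, that the NCCL install line mentions both `nccl` and `cuda13.2`).

```shell
# Hypothetical checker mirroring the manual inspection; all patterns are assumptions.
verify_coverage_dockerfile() {
    f="$1"
    # Base image must be the CUDA 13.2 / Ubuntu 24.04 NGC image.
    grep -q '^FROM nvcr\.io/nvidia/cuda:13\.2\.0-cudnn-devel-ubuntu24\.04' "$f" || return 1
    # An NCCL package built for cuda13.2 must be installed.
    grep -qi 'nccl.*cuda13\.2' "$f" || return 1
    # No TensorRT install lines.
    if grep -qi 'tensorrt' "$f"; then return 1; fi
    # No remaining operations on cuda.list.
    if grep -q '/etc/apt/sources\.list\.d/cuda\.list' "$f"; then return 1; fi
    return 0
}
```

The function returns 0 only when all four conditions hold, so it can be chained after the generation step in a local check.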

Does this change affect precision?

PaddlePaddle-bot

This comment was marked as outdated.

paddle-bot (bot) added the contributor (External developers) label Jun 17, 2026

PaddlePaddle-bot commented Jun 18, 2026


🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-24 00:51:33 UTC+08:00

This CI report was generated from the following code (refreshed every 30 minutes):
PR commit: 83b5a13 | Merge base: f362de0 (branch: develop)


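1 Required tasks: 46/49 passed

| Total runs (reruns) | Total tasks | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
|---|---|---|---|---|---|---|
| 82(0) | 82 | 79 | 1 | 0 | 0 | 2 |

| Task | Error type | Confidence | Log |
|---|---|---|---|
| Coverage build docker / Build docker | PR issue: Python 3.9 pip script error | | Job |

2 Failure details

🔴 Coverage build docker / Build docker - PR issue: Python 3.9 pip script error (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed case: Docker build stage

| Case | Error summary |
|---|---|
| Build docker images | python3.9 get-pip.py failed; the current generic get-pip.py supports Python 3.10 at minimum |

Key log: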

Step 27/52 : RUN wget -q https://bootstrap.pypa.io/get-pip.py
Step 28/52 : RUN sed -i 's#"install", "--upgrade", "--force-reinstall"#"install", "--upgrade", "--force-reinstall", "--break-system-packages"#' get-pip.py
Step 29/52 : RUN python3.9 get-pip.py && python3.10 get-pip.py && python3.11 get-pip.py && python3.12 get-pip.py
ERROR: This script does not work on Python 3.9. The minimum supported Python version is 3.10. Please use https://bootstrap.pypa.io/pip/3.9/get-pip.py instead.
The command '/bin/sh -c python3.9 get-pip.py && python3.10 get-pip.py && python3.11 get-pip.py && python3.12 get-pip.py' returned a non-zero code: 1
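  • Root cause summary: the Ubuntu 24 template wrongly uses the generic get-pip.py for Python 3.9.
    The PR switches the coverage image generation function to make_ubuntu24_cu132_dockerfile(), which generates the coverage Dockerfile from Dockerfile.ubuntu24. Dockerfile.ubuntu24:65-68 downloads only the generic get-pip.py and runs it directly with python3.9, but the log shows that this script now requires Python 3.10 or later. By contrast, the original Dockerfile.ubuntu22:71-76 downloads https://bootstrap.pypa.io/pip/3.9/get-pip.py and installs pip for Python 3.9 with get-pip-3.9.py.

Suggested fixes:

  1. In tools/dockerfile/Dockerfile.ubuntu24, follow the approach of Dockerfile.ubuntu22: additionally download pip/3.9/get-pip.py with -O get-pip-3.9.py, append --break-system-packages to that file as well, and change python3.9 get-pip.py to python3.9 get-pip-3.9.py.
  2. Alternatively, apply the equivalent replacement to the generated coverage Dockerfile in the coverage generation logic at tools/dockerfile/ci_dockerfile.sh:76-97, so that the CUDA 13.2 / Ubuntu 24.04 coverage image stops using the incompatible script.

Related changes: tools/dockerfile/ci_dockerfile.sh:76-79 switches the coverage generation source to Dockerfile.ubuntu24; the failing template location is tools/dockerfile/Dockerfile.ubuntu24:65-68.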
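
As an illustration of the second suggested fix, the generated Dockerfile could be post-processed along these lines. This is a sketch under assumptions: `patch_get_pip_for_py39` is a hypothetical name, the three `RUN` lines are assumed to look exactly like Steps 27 to 29 in the log above, and GNU sed is assumed (for `-i` and `\n` in the replacement). Whether the Python 3.9 script contains the same `"install", "--upgrade", "--force-reinstall"` string has not been checked here.

```shell
# Hypothetical post-processing step; assumes GNU sed and the exact RUN lines from the log.
patch_get_pip_for_py39() {
    dockerfile="$1"
    # 1) After the generic download, also fetch the Python 3.9 specific script.
    # 2) Make the existing --break-system-packages rewrite cover both scripts.
    # 3) Run the 3.9 specific script with python3.9.
    sed -i \
        -e 's#^RUN wget -q https://bootstrap\.pypa\.io/get-pip\.py$#&\nRUN wget -q https://bootstrap.pypa.io/pip/3.9/get-pip.py -O get-pip-3.9.py#' \
        -e "s#' get-pip\.py\$#' get-pip.py get-pip-3.9.py#" \
        -e 's#python3\.9 get-pip\.py#python3.9 get-pip-3.9.py#' \
        "$dockerfile"
}
```

Applying the same three changes directly in Dockerfile.ubuntu24 (the first suggestion) would avoid the post-processing step and keep the template and the generated file in sync.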

Copilot AI review requested due to automatic review settings June 19, 2026 13:03

Copilot AI left a comment

Contributor


Pull request overview

This PR updates the standard Coverage CI environment to build coverage images on CUDA 13.2 / Ubuntu 24.04, and adjusts the CUDA architecture target configuration.

Changes:

  • Switch coverage Dockerfile generation to nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04 and remove cuda.list rewrite steps that can break on NGC images.
  • Install CUDA 13.2 NCCL packages into the generated coverage image.
  • Update Coverage.yml to set CUDA_ARCH_NAME to Hopper for coverage build/test jobs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| tools/dockerfile/ci_dockerfile.sh | Generates the coverage Dockerfile from the Ubuntu 24.04 template and adapts it for CUDA 13.2 (incl. NCCL and removing cuda.list rewrite steps). |
| .github/workflows/Coverage.yml | Changes coverage CI CUDA architecture target to Hopper for build and test jobs. |


Comment on lines 98 to 102
  FLAGS_fraction_of_gpu_memory_to_use: 0.15
  CTEST_PARALLEL_LEVEL: 2
  WITH_GPU: "ON"
- CUDA_ARCH_NAME: Volta
+ CUDA_ARCH_NAME: Hopper
  WITH_AVX: "ON"
Comment on lines 292 to 296
  FLAGS_fraction_of_gpu_memory_to_use: 0.15
  CTEST_PARALLEL_LEVEL: 2
  WITH_GPU: "ON"
- CUDA_ARCH_NAME: Auto
+ CUDA_ARCH_NAME: Hopper
  WITH_AVX: "ON"
Comment on lines +76 to +78
  function make_ubuntu24_cu132_dockerfile(){
    dockerfile_name="Dockerfile.cuda117_cudnn8_gcc82_ubuntu18_coverage"
-   sed "s#<baseimg>#nvidia/cuda:12.0.1-cudnn8-devel-ubuntu22.04#g" ./Dockerfile.ubuntu22 >${dockerfile_name}
-   sed -i "s#<setcuda>#ENV LD_LIBRARY_PATH=/usr/local/cuda-12.0/targets/x86_64-linux/lib:\$LD_LIBRARY_PATH #g" ${dockerfile_name}
+   sed "s#<baseimg>#nvcr.io/nvidia/cuda:13.2.0-cudnn-devel-ubuntu24.04#g" ./Dockerfile.ubuntu24 >${dockerfile_name}

PaddlePaddle-bot left a comment


🤖 Paddle-CI-Agent | pr_review | 2026-06-19 21:12:06

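📋 Review summary

PR overview: switches the image generation logic of the standard Coverage CI to CUDA 13.2 / Ubuntu 24.04 and adjusts the CUDA architecture target of coverage build/test.
Scope of changes: .github/workflows/Coverage.yml, tools/dockerfile/ci_dockerfile.sh
Impact tags: [Environment Adaptation] [Execute Infrastructure]

Issues

| Severity | File | Summary |
|---|---|---|
| 🔴 Bug | .github/workflows/Coverage.yml:295 | Coverage test still runs on BD_BJ-V100, but build/test is pinned to Hopper (sm_90), so the artifacts cannot run on V100 (sm_70) |

Status of previous findings

| Finding | Issue | Status |
|---|---|---|
| F1 | After switching to the H-Coverage runner, the GPU device selection logic of that runner group must be reused. | ✅ Fixed |

📝 PR convention check

The title uses [WIP][CI], which is not among the official Category/Type tags in checklist §D2; the description is structurally complete and "Does this change affect precision?" is answered "No".

Suggested title (can be copied as-is):

  • [Environment Adaptation] update coverage CI to CUDA 13.2

Overall assessment

No new blocking issues were found in the Dockerfile migration itself, but the coverage architecture target and the runner hardware need to be aligned; otherwise this CI will fail systematically at the test stage and cannot validate this PR's CUDA 13.2 migration.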

  CTEST_PARALLEL_LEVEL: 2
  WITH_GPU: "ON"
- CUDA_ARCH_NAME: Auto
+ CUDA_ARCH_NAME: Hopper


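🔴 Bug: Coverage test still runs on BD_BJ-V100, but this line pins the test environment to CUDA_ARCH_NAME=Hopper.

CUDA_ARCH_NAME is passed to CMake through ci/utils.sh / ci/run_setup.sh; cmake/cuda.cmake expands Hopper only to CUDA_ARCH_BIN=90, whereas BD_BJ-V100 is a V100 (sm_70) runner. When the wheel produced by the build contains only Hopper cubins, the subsequent coverage test on V100 hits a CUDA architecture incompatibility instead of validating the CUDA 13.2 migration.

Suggested fixes:

  • To keep BD_BJ-V100, keep Volta / Auto here and on the build side, or explicitly include 70.
  • To validate Hopper / CUDA 13.2, switch Coverage test to an H-Coverage-type Hopper runner and reuse that workflow's determine_gpu_runner / GPU_DEVICES selection logic.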
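
The mismatch described above can be expressed as a small pre-flight check. This is a simplified sketch, not part of the PR: both function names are hypothetical, the mapping covers only the two architecture names with values stated in this review (Volta as 70, Hopper as 90), and the runner's compute capability is passed in as an argument rather than detected.

```shell
# Hypothetical pre-flight check; covers only the two names discussed in this review.
arch_bin_for() {
    case "$1" in
        Volta)  echo 70 ;;
        Hopper) echo 90 ;;
        *)      return 1 ;;   # unknown name (Auto etc. is out of scope for this sketch)
    esac
}

# $1 = CUDA_ARCH_NAME, $2 = runner compute capability without the dot (e.g. 70 for V100).
# Simplification: requires an exact match, since a build containing only cubins for
# one architecture (no PTX) is not expected to run on a different one.
arch_matches_runner() {
    want=$(arch_bin_for "$1") || return 2
    [ "$want" = "$2" ]
}
```

With these definitions, `arch_matches_runner Hopper 70` fails, which is the situation flagged for BD_BJ-V100, while `arch_matches_runner Volta 70` succeeds.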


Labels

contributor External developers


3 participants