Improve CI speed #7992

@wangkuiyi

Our CI has been running slowly recently. Qing-qing, Yu Yang, Helin, Chen Xi, Ya-ming, Yi-bing, and I discussed this issue. Here is what we learned and what we are going to do:

A. Reduce the number of SM architectures

  1. The CI builds code for many SM architectures; see https://github.com/PaddlePaddle/Paddle/blob/develop/cmake/cuda.cmake.
  2. According to Qing-qing's experiment in #5491 ([Speed up compiling]: reduce the NVCC compiling (some .cu operators can be compiled by G++)), nvcc runs faster when it generates code for fewer SM architectures.

Helin is going to configure the CI system to generate code for only one SM architecture when checking PRs, and to generate code for all SM architectures in the nightly build of the develop branch.
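To illustrate the idea, here is a minimal Python sketch of choosing the SM architectures from the build kind and turning them into nvcc `-gencode` arguments. The variable names `CI_BUILD_KIND` and `CI_PR_SM_ARCH` and the architecture list are assumptions made for this example only; the real list lives in cmake/cuda.cmake, and the actual CI configuration may look quite different.

```python
import os

# Hypothetical architecture list for illustration; the real set is
# defined in cmake/cuda.cmake and may differ.
ALL_SM_ARCHS = ["30", "35", "50", "52", "60", "61"]


def select_archs(env=None):
    """Return all architectures for nightly builds, a single one for PR checks."""
    env = os.environ if env is None else env
    # CI_BUILD_KIND and CI_PR_SM_ARCH are hypothetical variable names.
    if env.get("CI_BUILD_KIND", "pr") == "nightly":
        return list(ALL_SM_ARCHS)
    return [env.get("CI_PR_SM_ARCH", "52")]


def gencode_flags(archs):
    """Build the nvcc argument list: one -gencode pair per SM architecture."""
    flags = []
    for arch in archs:
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags


if __name__ == "__main__":
    print(" ".join(gencode_flags(select_archs())))
```

With a single architecture, nvcc runs one device-code generation pass per .cu file instead of one per listed architecture, which is where the expected speedup would come from.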

B. Migrate the CI system to two servers

We are running four TeamCity agents on four GPU desktops, each with one GPU and a desktop-level CPU (a few cores). We have two idle servers, each with 6 GPUs and a powerful CPU with 56 cores.

Helin will migrate the CI system to the servers.

C. Distribute unit tests to multiple GPUs

Our CI system runs unit tests by calling ctest -j N, where N is the number of processes that run unit tests in parallel. However, all these N processes are using the same GPU.

Qing-qing is going to study whether we can make cmake/ctest use more than one GPU.
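One possible direction, sketched here in Python, is a wrapper that assigns each test to a GPU in round-robin order and restricts the test process to that GPU through the `CUDA_VISIBLE_DEVICES` environment variable. This is only an assumption about how it could work outside ctest; the helper names `assign_gpus`, `env_for_gpu`, and `run_test` are hypothetical, and whether ctest itself can do this is what still needs to be studied.

```python
import os
import subprocess


def assign_gpus(tests, num_gpus):
    """Map each test name to a GPU index in round-robin order."""
    return {name: i % num_gpus for i, name in enumerate(tests)}


def env_for_gpu(gpu_id, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env


def run_test(test_name, gpu_id):
    """Run a single ctest test (selected by name regex) pinned to one GPU."""
    result = subprocess.run(
        ["ctest", "-R", f"^{test_name}$"],
        env=env_for_gpu(gpu_id),
    )
    return result.returncode
```

A driver could call `assign_gpus` once and then launch `run_test` for each entry from a pool of N worker processes, so that the N parallel tests are spread over the available GPUs instead of sharing one.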

D. Add an environment variable to distinguish unit tests and regression tests

Both unit tests and regression tests currently run on the CI server for every PR. They should be distinguished: only unit tests should run for every PR, and nightly builds should run all tests. We should add an environment variable to control this.
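For Python tests, such a switch could look like the following minimal sketch. The variable name `RUN_REGRESSION_TESTS` and the helper `regression_tests_enabled` are hypothetical and chosen only for this example; the actual name and mechanism are still to be decided.

```python
import os
import unittest


def regression_tests_enabled(env=None):
    """Return True only when the (hypothetical) flag is set to "1"."""
    env = os.environ if env is None else env
    return env.get("RUN_REGRESSION_TESTS", "0") == "1"


class ExampleTests(unittest.TestCase):
    def test_unit(self):
        # Always runs, for PR checks and nightly builds alike.
        self.assertEqual(1 + 1, 2)

    @unittest.skipUnless(
        regression_tests_enabled(),
        "regression tests run only when RUN_REGRESSION_TESTS=1",
    )
    def test_regression(self):
        # Placeholder body; a real regression test would go here.
        self.assertTrue(True)
```

Under this sketch, the nightly build would export `RUN_REGRESSION_TESTS=1` before running the tests, while PR checks would leave it unset, so regression tests are reported as skipped rather than silently dropped.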
