Update buildkite, manifests, github action workflows #444

Closed
wants to merge 2 commits into from

Sbozzolo
Member

No description provided.

@sriharshakandala
Member

sriharshakandala commented Feb 10, 2024

@Sbozzolo Sbozzolo force-pushed the gb/upadapt branch 4 times, most recently from 8c49cb6 to 5fb3b98 Compare February 12, 2024 00:54
@Sbozzolo
Member Author

The buildkite pipeline had several problems. I fixed them and now most jobs are twice as fast.

The GPU unit test seems to be the only one adversely affected. @sriharshakandala, do you want to have a look at this?

https://buildkite.com/clima/rrtmgp-ci/builds/582#018d9acf-9053-433b-8a76-a0593b20f8d9

@Sbozzolo Sbozzolo changed the title from "Up compat for Adapt" to "Update buildkite, manifests, github action workflows" Feb 12, 2024
@Sbozzolo Sbozzolo mentioned this pull request Feb 12, 2024
@charleskawczynski
Member

The changes look good to me overall, except for a couple of items in Project.toml.

@Sbozzolo Sbozzolo force-pushed the gb/upadapt branch 2 times, most recently from 38f8790 to bd6d616 Compare February 12, 2024 15:17
@Sbozzolo
Member Author

I consolidated the environments so that only perf remains (because that's the only one being run on buildkite).

@Sbozzolo
Member Author

@charleskawczynski do you have any idea what could explain this increase in run time compared to main? https://buildkite.com/clima/rrtmgp-ci/builds/592#018d9ed1-b8ac-4121-8118-2d3930baa764

It happens only on buildkite: @sriharshakandala ran the code on the cluster and found the same speed as main.

@Sbozzolo Sbozzolo force-pushed the gb/upadapt branch 2 times, most recently from 9e6c960 to 04499cb Compare February 14, 2024 15:27
@sriharshakandala
Member

@Sbozzolo : Please plan on including #448 in this release.

@Sbozzolo Sbozzolo force-pushed the gb/upadapt branch 7 times, most recently from df61ee9 to ceabf07 Compare February 14, 2024 18:31
@Sbozzolo
Member Author

I spent 3 more hours on this and narrowed the problem down to the CUDA.jl update. I can reproduce it on the cluster on the P100 when I use CUDA.jl 5.2, but it is still fast with CUDA.jl 5.1.

Fast:

julia> CUDA.versioninfo()
CUDA runtime 12.2, local installation
CUDA driver 12.3
NVIDIA driver 535.54.3, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 20.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 5.1.2
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.10.1+0
- CUDA_Runtime_Discovery: 0.2.3

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: Tesla P100-PCIE-16GB (sm_60, 15.893 GiB / 16.000 GiB available)

Slow:

CUDA runtime 12.2, local installation
CUDA driver 12.3
NVIDIA driver 535.54.3, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.2.1
- CURAND: 10.3.3
- CUFFT: 11.0.8
- CUSOLVER: 11.5.0
- CUSPARSE: 12.1.1
- CUPTI: 20.0.0
- NVML: 12.0.0+535.54.3

Julia packages: 
- CUDA: 5.2.0
- CUDA_Driver_jll: 0.7.0+1
- CUDA_Runtime_jll: 0.11.1+0
- CUDA_Runtime_Discovery: 0.2.3

Toolchain:
- Julia: 1.10.0
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.2
- CUDA_Runtime_jll.local: true

1 device:
  0: Tesla P100-PCIE-16GB (sm_60, 15.893 GiB / 16.000 GiB available)

Only changes:

  [79e6a3ab] ↑ Adapt v3.7.2 ⇒ v4.0.1
  [052768ef] ↑ CUDA v5.1.2 ⇒ v5.2.0
  [0c68f7d7] ↑ GPUArrays v9.1.0 ⇒ v10.0.2
  [46192b85] ↑ GPUArraysCore v0.1.5 ⇒ v0.1.6
  [76a88914] ↑ CUDA_Runtime_jll v0.10.1+0 ⇒ v0.11.1+0

I also checked that the system runtime and the artifact runtime produce the same results.
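
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two setups could be recreated with Pkg. It is not the exact procedure used for the numbers above. The helper name `pinned_cuda_env` and the constant `CUDA_VERSIONS` are made up for illustration, and the package under test and its benchmark still have to be added to the environment by hand.

```julia
using Pkg

# CUDA.jl versions reported as fast and slow in the listings above.
const CUDA_VERSIONS = (fast = v"5.1.2", slow = v"5.2.0")

"""
    pinned_cuda_env(version)

Activate a throwaway environment with CUDA.jl held at `version`.
Hypothetical helper, not part of this PR.
"""
function pinned_cuda_env(version::VersionNumber)
    Pkg.activate(; temp = true)                        # fresh temporary environment
    Pkg.add(name = "CUDA", version = string(version))  # resolver picks matching Adapt, GPUArrays, ...
    Pkg.pin("CUDA")                                    # stop later Pkg operations from upgrading it
    Pkg.status()                                       # print the resolved versions for comparison
end

# Needs network access, so it is left commented out here:
# pinned_cuda_env(CUDA_VERSIONS.fast)
# pinned_cuda_env(CUDA_VERSIONS.slow)
```

Running the same benchmark once in each environment should show whether the slowdown follows the CUDA.jl version alone or also depends on the other packages that moved with it (Adapt, GPUArrays, CUDA_Runtime_jll).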

@sriharshakandala do you want to take this on and investigate further?

@charleskawczynski
Member

I'm going to rebase this PR, cc @Sbozzolo

3 participants