
CI Incidents


Everything wrong with CI and related

Current system status: 🟒 (#3, #4, #6)

Days since the last incident: 15

27 May 2024

The update to CUDA 5.4 is broken. The error is related to poll not being found.

🟒 20 May 2024 -- 22 May 2024

Atmos GPU benchmark edmf jobs time out

  • https://buildkite.com/clima/climaatmos-ci/builds/18919
  • https://buildkite.com/clima/climaatmos-ci/builds/18898
  • https://buildkite.com/clima/climaatmos-ci/builds/18921

Fixed by https://github.com/CliMA/ClimaAtmos.jl/pull/3039

🟑 Coupler frequent depot failures -- (#12)

The coupler depot fails pretty frequently but it's not clear why. We want to document instances of its failure here to find a pattern in them.

  • 17 April 2024: Multiple packages fail to precompile; the package files for them can't be found in the depot (failing build)
    • Fixed by deleting BlockArrays from the depot to force recompilation (see the sketch after this list).
    • Example error message:

Failed to precompile ClimaCorePlots [cf7c7e5a-b407-4c48-9047-11a94a308626] to "/central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/compiled/v1.10/ClimaCorePlots/jl_EfTnJc".
ERROR: LoadError: SystemError: opening file "/central/scratch/esm/slurm-buildkite/shared_depot/packages/BlockArrays/wTlvd/src/blockindices.jl": No such file or directory

  • The previous run on coupler CI failed because it ran out of memory (see build). This failure may have caused the run to exit uncleanly and corrupt some depot files.
    • Error message:

slurmstepd: error: Detected 2 oom-kill event(s) in StepId=40868531.1. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: hpc-90-30: task 6: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=40868531.1
slurmstepd: error: *** STEP 40868531.1 ON hpc-90-30 CANCELLED AT 2024-04-16T21:43:03 ***
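
For reference, a minimal sketch of the clean-up used for the 17 April instance, assuming the paths from the error message above (the exact directories depend on which package is corrupted):

# Remove the corrupted package from the shared depot so that Pkg re-installs it
rm -rf /central/scratch/esm/slurm-buildkite/shared_depot/packages/BlockArrays
# Optionally also clear the stale precompilation cache (an extra step, not recorded in the log)
rm -rf /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/compiled/v1.10/ClimaCorePlots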

🟑 14 March 2024 -- (#6)

Two Atmos perf jobs have varying allocations between builds (not runs). No clear cause has been identified.

flame graph GPU job

  • hpc-33-13 with allocs 2196768 (build)
  • On P100:
    • hpc-25-20 with allocs 1662248 (build)
    • hpc-25-23 with allocs 1719992 (build)
    • failed on hpc-26-14 with allocs 2127136 (build)

flame graph perf job (diagnostics)

  • hpc-24-21 with allocs 20596072 (build)
  • hpc-22-21 with allocs 20596072 (build)
  • hpc-22-12 with allocs 10877544 (build)
  • hpc-22-13 with allocs 20596072 (build)

Solved? No

🟑 13 March 2024 -- (#4)

Some ClimaAtmos jobs time out when being profiled with nsys.

Error:

The timeout expired.

Example: build

Solved? NO

We are currently working around it by removing the nsys calls.
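
As an illustration only, the workaround amounts to dropping the nsys wrapper from the affected job commands. The command below is hypothetical (not the actual pipeline entry); the nsys flags are the usual ones for CUDA/NVTX traces:

-nsys profile --trace=nvtx,cuda --output=report julia --project=perf perf/benchmark.jl
+julia --project=perf perf/benchmark.jl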

🟑 13 March 2024 -- (#3)

Some of the ClimaAtmos allocation jobs seem to be non-deterministic.

Eg

  • flame graph GPU job
    1. failed with allocations 1.3x previous limit (build)
    2. updated the buildkite pipeline to request only P100, which:
       • passed on hpc-25-20 with allocs 1662248 (build)
       • passed on hpc-25-23 with allocs 1719992 (build)
       • failed on hpc-26-14 with allocs 2127136 (build)
  • flame graph perf job (diagnostics)
    1. initially failed with allocations 1.9x previous limit (build)
    2. I increased the allocation limit to 20596072
       • passed on hpc-22-21 with allocs 20596072 (build)
       • failed on hpc-22-12 with allocs 10877544 because alloc limit was too large (build)
    3. decreased allocation limit to 17877544
       • failed on hpc-22-13 with allocs 20596072 (build)

Solved? NO

Past incidents


🟑 06 March 2024 -- (#5)

Jobs on clima using 4 GPUs run with errors that indicate problems with direct memory access (build). This does not prevent merging PRs, but it seems to lead to some problems.

Error:

[clima.gps.caltech.edu:3961726] Failed to register remote memory, rc=-1

This is also seen in ClimaCoupler runs with 4 GPUs, but not with 2 or 1 GPUs (build).

30 April 2024: Seen again in ClimaCoupler, even after increasing slurm_mem to 20GB for atmos-only runs and 26GB for coupled runs (build). It happens when using either Float32 or Float64.

Solved? Yes

We see this error when clima is saturated (i.e. all GPUs are in use). For example, this run displays the remote memory error and was run when all 8 GPUs were in use. This run corresponds to the same commit, but was run with no other jobs running on clima (4/8 GPUs idle), and it does not display the remote memory error.

This error doesn't seem to impact performance or correctness, and has a simple workaround if we really need a run without it, so I think it's somewhat resolved.
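
A quick way to check whether clima is saturated before starting a run that must avoid this warning (commands assumed, not taken from the incident log):

nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv   # GPU usage on clima itself
squeue -w clima                                                         # jobs currently running on the node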

🟑 22 March 2024 -- (#10)

ClimaLand CI failing because files in depot can't be found. E.g. in this build.

Error message:

ERROR: LoadError: InitError: could not load library "/central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7/lib/libload_time_mpi_constants.so"
/central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7/lib/libload_time_mpi_constants.so: cannot open shared object file: No such file or directory

This happened yesterday and we cleared the land depot, but since it's happening again we should investigate further.
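
For reference, a minimal sketch of a more targeted clean-up, assuming the artifact path from the error message above; removing the artifact directory forces Pkg to re-download it on the next build:

rm -rf /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7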

Solved? No

πŸ”΄ 9 April 2024 -- 13 April 2024 (#11)

CI across all repos has had very high wait times today. This is because the GPU queue on new-central is full; as of 5pm, 983 jobs are waiting, 98 of which are ours. Our node is also down, so we can't run jobs on that either. Both CPU and GPU runs are stuck waiting for agents. This is pretty disruptive as nothing can be merged while this remains unresolved.

Solved? Yes

Due to: external factors

A combination of problems:

  • nodes failing on central
  • high utilization of the cluster
  • issues with the scheduler (when several jobs were waiting in one queue, jobs in other queues would not start even if resources were available)

🟑 1 April 2024 -- 2 April 2024 (#11)

Random job failures in ClimaAtmos CI (example build). The jobs run fine when retried.

Error message:

srun: error: Unable to confirm allocation for job 40597723: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40597723
🚨 Error: Error setting up job executor: The global environment hook exited with status 1

Also seen in ClimaLand.jl (build), and runs fine when re-run.

Error message:

slurm_load_jobs error: Socket timed out on send/recv operation
🚨 Error: Error setting up job executor: The global environment hook exited with status 1

Due to

slurm-buildkite has setup and teardown steps that create and clean up the same temporary folders on multiple nodes (for MPI runs). These steps were calling slurm from within a slurm job. Sometimes slurm would time out, and the job would abort.

To work around this, we use pdsh. For example, for the teardown step:

-srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} rm -rf "/tmp/slurm_${SLURM_JOB_ID}"
+pdsh -w $SLURM_NODELIST rm -rf "/tmp/slurm_${SLURM_JOB_ID}"

pdsh does not invoke slurm, but executes the command on all nodes, making the solution more robust with respect to slurm outages.
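
The setup step can be handled the same way; a sketch, assumed by analogy with the teardown change above:

# Create the per-job scratch directory on every allocated node without going through slurm
pdsh -w $SLURM_NODELIST mkdir -p "/tmp/slurm_${SLURM_JOB_ID}"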

Solved? YES

🟑 25 March 2024 -- (#11)

Random job failures in ClimaAtmos CI. The jobs run fine when retried.

Error message:

srun: error: Unable to confirm allocation for job 40597723: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40597723
🚨 Error: Error setting up job executor: The global environment hook exited with status 1

Solved? No, but it doesn't seem to be occurring anymore

🟑 20 March 2024 -- (#9)

Random job failure during solve.

(Screenshot of the error, taken 21 March 2024.)

Due to: bug

The simulation was failing, but throwing the proper error message also failed because the message used string interpolation, which is not GPU friendly.

https://buildkite.com/clima/climacoupler-longruns/builds/502#018e6135-a881-42d8-a63b-9e27837cb1fd

Solved? Will be solved in https://github.com/CliMA/ClimaCoupler.jl/issues/711.

🟑 19 March 2024 -- (#8)

Some ClimaAtmos jobs failed due to the following error. They run fine when retrying.

Error:

srun: error: Unable to confirm allocation for job 40519699: Unable to contact slurm controller (connect failure)
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40519699
🚨 Error: Error tearing down job executor: The global pre-exit hook exited with status 1

Solved? No, but it doesn't seem to be occurring anymore

πŸ”΄ 18 March 2024 (#7)

The nsight module was broken. Admins updated the module and removed the old one.

Due to

External factors.

Solved? YES

Fixed by updating to climacommon/2024_03_18.

🟑 06 March 2024 -- 07 March 2024 (#2)

SLURM on clima schedules multiple GPU jobs on the same GPUs.

Eg:

(Screenshot showing multiple GPU jobs scheduled on the same GPUs.)

Due to

The SLURM option SLURM_GPU_BIND=none. The option was introduced to allow device-to-device communication in MPI runs, given that SLURM did not support this feature well (see comment). This flag messed with how GPUs were assigned to jobs.

Solved? YES

  • Diagnosed the problem
  • Tested and verified that removing the flag distributes jobs correctly
  • Found that the new version of SLURM contains a potential fix for this
  • Asked Scott to upgrade SLURM
  • Verified that the new flag behaves as expected

Required change:

-  SLURM_GPU_BIND: none # https://github.com/open-mpi/ompi/issues/11949#issuecomment-1737712291
+  SLURM_GRES_FLAGS: "allow-task-sharing"
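
To verify the scheduling after such a change, one can check which GPUs concurrent jobs actually land on (commands assumed, not taken from the incident log):

nvidia-smi --query-compute-apps=pid,gpu_uuid,used_memory --format=csv   # processes per GPU on clima
scontrol show job <jobid> | grep -i gres                                # GRES actually allocated to a job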

πŸ”΄ 06 March 2024 (#1)

The module system on central is broken: modules are not loaded correctly and jobs crash.

Example error message:

We are on a RedHat 9 node
Loading climacommon/2024_02_27
  ERROR: Unable to locate a modulefile for 'nsight-systems/2023.3.1'
  ERROR: Load of requirement nsight-systems/2023.3.1 failed

Due to

External factors (IMSS pushed a broken update to login3)

Solved? YES

  • Established that the problem is not on our side.
  • Sent an email to IMSS reporting the problem with steps to reproduce (sketched below).
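
A sketch of the reproduction steps, assumed from the error message above:

# On login3, loading the then-current climacommon module fails to resolve its nsight dependency
module purge
module load climacommon/2024_02_27   # ERROR: Unable to locate a modulefile for 'nsight-systems/2023.3.1'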

Legend:

πŸ”΄ = we cannot merge PRs

🟑 = we can still merge PRs, but things are not working as expected.

🟒 = everything is working as expected