CI Incidents
Update to CUDA 5.4 is broken. The error is related to `poll` not being found.
Atmos GPU benchmark EDMF jobs time out:
- https://buildkite.com/clima/climaatmos-ci/builds/18919
- https://buildkite.com/clima/climaatmos-ci/builds/18898
- https://buildkite.com/clima/climaatmos-ci/builds/18921
Fixed by https://github.com/CliMA/ClimaAtmos.jl/pull/3039
The coupler depot fails pretty frequently but it's not clear why. We want to document instances of its failure here to find a pattern in them.
- 17 April 2024: Multiple packages fail to precompile; the package files for them can't be found in the depot (failing build)
  - Fixed by deleting `BlockArrays` from the depot to force recompilation (see the sketch after this list).
  - Example error message:
    Failed to precompile ClimaCorePlots [cf7c7e5a-b407-4c48-9047-11a94a308626] to "/central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/compiled/v1.10/ClimaCorePlots/jl_EfTnJc". ERROR: LoadError: SystemError: opening file "/central/scratch/esm/slurm-buildkite/shared_depot/packages/BlockArrays/wTlvd/src/blockindices.jl": No such file or directory
- Previous run on coupler CI failed because of a lack of memory (see build). Maybe this failure caused the run to exit incorrectly and corrupt some files.
- Error message:
slurmstepd: error: Detected 2 oom-kill event(s) in StepId=40868531.1. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: hpc-90-30: task 6: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=40868531.1
slurmstepd: error: *** STEP 40868531.1 ON hpc-90-30 CANCELLED AT 2024-04-16T21:43:03 ***
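A minimal sketch of the `BlockArrays` fix referenced above, assuming shell access to the depot paths shown in the error message (the exact slug directory, `wTlvd` here, varies with the package version):

```bash
# Remove the package source that Julia can no longer find, so Pkg re-installs it
# and dependents are recompiled on the next CI run (path from the error above).
rm -rf /central/scratch/esm/slurm-buildkite/shared_depot/packages/BlockArrays/wTlvd
# Optionally also drop the stale precompile cache of the dependent package.
rm -rf /central/scratch/esm/slurm-buildkite/climacoupler-ci/depot/cpu/compiled/v1.10/ClimaCorePlots
```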
Two Atmos perf jobs have varying allocations between builds (not runs). No clear cause has been identified.
- flame graph GPU job
  - hpc-33-13 with allocs 2196768 (build)
  - On P100:
    - hpc-25-20 with allocs 1662248 (build)
    - hpc-25-23 with allocs 1719992 (build)
    - failed on hpc-26-14 with allocs 2127136 (build)
- flame graph perf job (diagnostics)
  - hpc-24-21 with allocs 20596072 (build)
  - hpc-22-21 with allocs 20596072 (build)
  - hpc-22-12 with allocs 10877544 (build)
  - hpc-22-13 with allocs 20596072 (build)
Solved? No
Some ClimaAtmos jobs time out when being profiled with nsys.
Error:
The timeout expired.
Example: build
We are currently working around it by removing the nsys calls.
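For illustration, the workaround amounts to dropping the profiler wrapper from the job command; the exact pipeline entries differ, so the commands below are hypothetical:

```bash
# Before (hypothetical): job command wrapped in Nsight Systems, which can hit the timeout
nsys profile --trace=nvtx,cuda --output=report julia --project=perf perf/benchmark.jl
# Workaround: run the same job without the profiler
julia --project=perf perf/benchmark.jl
```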
Some of the ClimaAtmos allocation jobs seem to be non-deterministic.
E.g.:
- flame graph GPU job
- failed with allocations 1.3x previous limit (build)
- updated the buildkite pipeline to request only P100 GPUs (see the sketch after this list), which:
- passed on hpc-25-20 with allocs 1662248 (build)
- passed on hpc-25-23 with allocs 1719992 (build)
- failed on hpc-26-14 with allocs 2127136 (build)
- flame graph perf job (diagnostics)
- initially failed with allocations 1.9x previous limit (build)
- I increased the allocation limit to 20596072
- passed on hpc-22-21 with allocs 20596072 (build)
- failed on hpc-22-12 with allocs 10877544 because alloc limit was too large (build)
- decreased allocation limit to 17877544
- failed on hpc-22-13 with allocs 20596072 (build)
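For reference, a hedged sketch of how a P100-only request can be expressed at the slurm level (typed GRES); how this maps onto the Buildkite pipeline depends on slurm-buildkite's agent configuration, so the snippet is illustrative only:

```bash
# Request one P100 specifically (typed GRES) instead of any available GPU,
# then print which device was actually assigned.
srun --gres=gpu:p100:1 nvidia-smi --query-gpu=name --format=csv,noheader
```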
Solved incidents
Jobs using 4 GPUs on clima run with errors that indicate problems with direct memory access (build). This does not prevent merging PRs, but seems to lead to some problems.
Error:
[clima.gps.caltech.edu:3961726] Failed to register remote memory, rc=-1
This is also seen in ClimaCoupler runs with 4 GPUs, but not with 2 or 1 GPUs (build).
30 April 2024: Seen again in ClimaCoupler, even after increasing `slurm_mem` to 20GB for atmos-only runs and 26GB for coupled runs (build). It happens when using either Float32 or Float64.
We see this error when clima is saturated (i.e. all GPUs are in use). For example, this run displays the remote memory error and was run when all 8 GPUs were in use. This run corresponds to the same commit, but was run with no other jobs running on clima (4/8 GPUs idle), and it does not display the remote memory error.
This error doesn't seem to impact performance or correctness, and has a simple workaround if we really need a run without it, so I think it's somewhat resolved.
ClimaLand CI is failing because files in the depot can't be found, e.g. in this build.
Error message:
ERROR: LoadError: InitError: could not load library "/central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7/lib/libload_time_mpi_constants.so"
/central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7/lib/libload_time_mpi_constants.so: cannot open shared object file: No such file or directory
This happened yesterday and we cleared the land depot, but since it's happening again we should investigate further.
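For reference, a hedged sketch of what clearing the broken artifact could look like (path taken from the error above); alternatively, the whole land depot can be wiped so it is rebuilt from scratch on the next run:

```bash
# Remove only the artifact whose shared library can no longer be opened;
# Pkg re-downloads it the next time the environment is instantiated.
rm -rf /central/scratch/esm/slurm-buildkite/climaland-ci/depot/default/artifacts/ee4659b15eaedd38f43a66a85f17d45f8a4401c7
```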
Solved? No
CI across all repos has had very high wait times today. This is because the GPU queue on new-central is full; as of 5pm, 983 jobs are waiting, 98 of which are ours. Our node is also down, so we can't run jobs on that either. Both CPU and GPU runs are stuck waiting for agents. This is pretty disruptive as nothing can be merged while this remains unresolved.
Solved? Yes
A combination of problems:
- nodes failing on central
- high utilization of the cluster
- issues with the scheduler (if there are several jobs waiting in one queue, jobs in other queues wouldn't start even if resources are available)
Random job failures in ClimaAtmos CI. Example build. The jobs run fine when retried.
Error message:
srun: error: Unable to confirm allocation for job 40597723: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40597723
🚨 Error: Error setting up job executor: The global environment hook exited with status 1
Also seen in ClimaLand.jl (build), and runs fine when re-run.
Error message:
slurm_load_jobs error: Socket timed out on send/recv operation
🚨 Error: Error setting up job executor: The global environment hook exited with status 1
`slurm-buildkite` has setup and teardown steps that create the same temporary folders on multiple nodes (for MPI runs). These steps were calling slurm from within a slurm job; sometimes slurm would time out and the job would abort.
To work around this, we use `pdsh`. For example, for the teardown step:
-srun --ntasks-per-node=1 --ntasks=${SLURM_JOB_NUM_NODES} rm -rf "/tmp/slurm_${SLURM_JOB_ID}"
+pdsh -w $SLURM_NODELIST rm -rf "/tmp/slurm_${SLURM_JOB_ID}"
`pdsh` does not invoke slurm; it executes the command on all nodes directly, making the solution more robust to slurm outages.
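A minimal sketch of the pattern, assuming the per-job scratch path from the diff above (the actual slurm-buildkite hooks contain more logic):

```bash
# Setup: create a per-job scratch directory on every allocated node.
# pdsh reaches the nodes directly, so a slow slurm controller cannot time this step out.
pdsh -w "$SLURM_NODELIST" mkdir -p "/tmp/slurm_${SLURM_JOB_ID}"

# ... MPI job runs, using /tmp/slurm_${SLURM_JOB_ID} on each node ...

# Teardown: remove the scratch directory on every node.
pdsh -w "$SLURM_NODELIST" rm -rf "/tmp/slurm_${SLURM_JOB_ID}"
```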
Solved? Yes
Random job failures in ClimaAtmos CI. The jobs run fine when retried.
Error message:
srun: error: Unable to confirm allocation for job 40597723: Socket timed out on send/recv operation
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40597723
🚨 Error: Error setting up job executor: The global environment hook exited with status 1
Solved? No, but it doesn't seem to be occurring anymore
Random job failure during solve.
Due to: bug
The simulation fails, but throwing the proper error message also fails because the message uses string interpolation, which is not GPU friendly.
https://buildkite.com/clima/climacoupler-longruns/builds/502#018e6135-a881-42d8-a63b-9e27837cb1fd
Solved? Will be solved in https://github.com/CliMA/ClimaCoupler.jl/issues/711.
Some ClimaAtmos jobs failed due to the following error. They run fine when retrying.
Error:
srun: error: Unable to confirm allocation for job 40519699: Unable to contact slurm controller (connect failure)
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 40519699
🚨 Error: Error tearing down job executor: The global pre-exit hook exited with status 1
Solved? No, but it doesn't seem to be occurring anymore
The `nsight` module was broken. Admins updated the module and removed the old one.
External factors.
Update to `climacommon/2024_03_18`.
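For reference, picking up the updated toolchain looks roughly like the following (module name from this incident; the real CI hooks may load additional modules):

```bash
# Start from a clean module environment, then load the updated common toolchain.
module purge
module load climacommon/2024_03_18
```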
SLURM on clima schedules multiple GPU jobs on the same GPUs.
E.g.:
The SLURM option `SLURM_GPU_BIND=none` was the cause. The option was introduced to allow device-to-device communication in MPI runs, given that SLURM did not support this feature well (see comment). This flag interfered with how GPUs were assigned to jobs.
- Diagnosed problem
- Tested and verified that removing the flag distributes jobs correctly
- Found that the new version of SLURM contains a potential fix for this
- Asked Scott to upgrade SLURM
- Verified that new flag behaves as expected
- SLURM_GPU_BIND: none # https://github.com/open-mpi/ompi/issues/11949#issuecomment-1737712291
+ SLURM_GRES_FLAGS: "allow-task-sharing"
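To illustrate the verification step, a hedged example of checking that tasks now get distinct GPUs (the actual check we ran may have differed):

```bash
# Each task should report a different GPU once binding behaves correctly.
srun --ntasks=4 --gpus-per-task=1 bash -c 'echo "task $SLURM_PROCID -> GPUs: $CUDA_VISIBLE_DEVICES"'
```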
Module system on central is broken. Modules are not correctly loaded and jobs crash.
We are on a RedHat 9 node
Loading `climacommon/2024_02_27` fails with:
ERROR: Unable to locate a modulefile for 'nsight-systems/2023.3.1'
ERROR: Load of requirement nsight-systems/2023.3.1 failed
External factors (IMSS pushed a broken update to `login3`).
- Established that the problem is not on our side.
- Sent an email to IMSS reporting the problem with steps to reproduce.
Legend:
🔴 = we cannot merge PRs
🟡 = we can still merge PRs, but things are not working as expected
🟢 = everything is working as expected