ci: use single-CUDA NVHPC Docker images to reduce runner disk usage #1350
Merged
+101
−48
6 commits
26572fb ci: use single-CUDA Docker tags for NVHPC to fit runner disk (sbryngelson)
d6339ae Fix comment formatting in CMakeLists.txt (sbryngelson)
3b651b1 Refactor NVHPC setup in GitHub Actions workflow (sbryngelson)
fb18e49 Merge branch 'master' into small-docker (sbryngelson)
ae9462d ci: fix git 'dubious ownership' spam in NVHPC Docker container (sbryngelson)
4aaa559 Merge branch 'master' into small-docker (sbryngelson)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -213,25 +213,25 @@ jobs: | |
| fail-fast: false | ||
| continue-on-error: true | ||
| runs-on: ${{ matrix.nvhpc && 'ubuntu-22.04' || format('{0}-latest', matrix.os) }} | ||
| container: | ||
| image: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }} | ||
| options: ${{ matrix.nvhpc && '--security-opt seccomp=unconfined' || '' }} | ||
| env: | ||
| CC: ${{ matrix.nvhpc && 'nvc' || '' }} | ||
| CXX: ${{ matrix.nvhpc && 'nvc++' || '' }} | ||
| FC: ${{ matrix.nvhpc && 'nvfortran' || '' }} | ||
| OMPI_ALLOW_RUN_AS_ROOT: ${{ matrix.nvhpc && '1' || '' }} | ||
| OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: ${{ matrix.nvhpc && '1' || '' }} | ||
| PMIX_MCA_gds: ${{ matrix.nvhpc && 'hash' || '' }} | ||
| OMPI_MCA_hwloc_base_binding_policy: ${{ matrix.nvhpc && 'none' || '' }} | ||
| FFLAGS: ${{ matrix.nvhpc && '-tp=px -Kieee -noswitcherror' || '' }} | ||
| CFLAGS: ${{ matrix.nvhpc && '-tp=px' || '' }} | ||
| CXXFLAGS: ${{ matrix.nvhpc && '-tp=px' || '' }} | ||
| # Image tag for NVHPC jobs; empty for non-NVHPC jobs. | ||
| NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }} | ||
|
|
||
| steps: | ||
| - name: Git safe directory | ||
| # ── NVHPC: free disk before pulling the ~25-30 GB cuda_multi image ── | ||
| - name: Free disk space | ||
| if: matrix.nvhpc | ||
| run: git config --global --add safe.directory /__w/MFC/MFC | ||
| run: | | ||
| echo "=== Disk before cleanup ===" | ||
| df -h / | ||
| sudo rm -rf /usr/share/dotnet /usr/local/lib/android \ | ||
| /opt/ghc /usr/local/share/boost /opt/hostedtoolcache \ | ||
| /usr/local/graalvm /usr/local/.ghcup \ | ||
| /usr/local/share/chromium /usr/local/lib/node_modules | ||
| sudo docker image prune -af | ||
| sudo apt-get clean | ||
| echo "=== Disk after cleanup ===" | ||
| df -h / | ||
|
|
||
| - name: Clone | ||
| uses: actions/checkout@v4 | ||
|
|
@@ -274,6 +274,67 @@ jobs: | |
| echo "Coverage cache: none available — full test suite will run" | ||
| fi | ||
|
|
||
| # ── NVHPC: pull image and start a long-lived container ────────────── | ||
| # Replaces the container: directive so we can free disk space first. | ||
| # Uses "docker run -d ... sleep infinity" + "docker exec" to preserve | ||
| # installed packages and env vars across steps. | ||
| - name: Pull NVHPC container | ||
| if: matrix.nvhpc | ||
| run: docker pull "$NVHPC_IMAGE" | ||
|
|
||
| - name: Start NVHPC container | ||
| if: matrix.nvhpc | ||
| run: | | ||
| docker run -d --name nvhpc \ | ||
| --security-opt seccomp=unconfined \ | ||
| -v "${{ github.workspace }}:/workspace" \ | ||
| -w /workspace \ | ||
| -e CC=nvc \ | ||
| -e CXX=nvc++ \ | ||
| -e FC=nvfortran \ | ||
| -e OMPI_ALLOW_RUN_AS_ROOT=1 \ | ||
| -e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \ | ||
| -e PMIX_MCA_gds=hash \ | ||
| -e OMPI_MCA_hwloc_base_binding_policy=none \ | ||
| -e "FFLAGS=-tp=px -Kieee -noswitcherror" \ | ||
| -e CFLAGS=-tp=px \ | ||
| -e CXXFLAGS=-tp=px \ | ||
| "$NVHPC_IMAGE" sleep infinity | ||
|
|
||
| - name: Setup NVHPC | ||
| if: matrix.nvhpc | ||
| run: | | ||
| docker exec nvhpc bash -c ' | ||
| set -e | ||
| apt-get update -y | ||
| apt-get install -y cmake python3 python3-venv python3-pip \ | ||
| libfftw3-dev libhdf5-dev hdf5-tools git | ||
|
|
||
| # The repo is bind-mounted from the host so git sees a different | ||
| # owner. Mark it safe to suppress "dubious ownership" errors that | ||
| # otherwise spam 80 000+ lines into the CI log. | ||
| git config --global --add safe.directory /workspace | ||
|
|
||
| # Set up NVHPC HPC-X MPI runtime paths | ||
| HPCX_DIR=$(dirname "$(find /opt/nvidia/hpc_sdk -path "*/hpcx/hpcx-*/ompi/bin/mpirun" | head -1)")/../.. | ||
| MPI_LIB=$(mpifort --showme:link | grep -oP "(?<=-L)\S+" | head -1) | ||
|
|
||
| # Persist env vars for subsequent docker exec calls | ||
| cat > /etc/nvhpc-env.sh <<EOF | ||
| export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH | ||
| export OMPI_MCA_rmaps_base_oversubscribe=1 | ||
| EOF | ||
|
Comment on lines +323 to +326
Contributor
1. Broken heredoc terminator
The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so bash never recognizes the terminator and consumes the remainder of the script as heredoc content. This prevents the environment setup from completing correctly and breaks later NVHPC steps that source /etc/nvhpc-env.sh.
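The failure mode described in this comment is easy to reproduce outside CI. The sketch below is an illustration, not the workflow's code: it runs two small scripts through `bash -c`, one with the terminator at column 0 and one with it indented, and uses marker files to show whether the command after the heredoc ran. The temp directory and the `good.*` / `bad.*` file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical demo; file names are placeholders, not from the PR.
set -eu
tmp=$(mktemp -d)

# Terminator at column 0: the heredoc ends at EOF and the echo runs.
bash -c 'cat > "$1/good.sh" <<EOF
export DEMO_VAR=1
EOF
echo "after heredoc" > "$1/good.marker"' _ "$tmp"

# Indented terminator: bash does not treat "  EOF" as the delimiter, so
# the echo line is swallowed into the file and never executes. Bash only
# warns about the missing delimiter; the warning is silenced here.
bash -c 'cat > "$1/bad.sh" <<EOF
export DEMO_VAR=1
  EOF
  echo "after heredoc" > "$1/bad.marker"' _ "$tmp" 2>/dev/null || true

ls "$tmp"
```

If the body has to stay indented, `<<-EOF` strips leading tabs (not spaces) from both the body and the terminator line.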
|
||
|
|
||
| # Debug: confirm compiler flags are set | ||
| echo "=== NVHPC Environment ===" | ||
| echo "FFLAGS=$FFLAGS" | ||
| echo "CFLAGS=$CFLAGS" | ||
| echo "CXXFLAGS=$CXXFLAGS" | ||
| nvfortran --version | ||
| cat /proc/cpuinfo | grep "model name" | head -1 | ||
| ' | ||
|
|
||
| # ── Standard (non-NVHPC) setup ───────────────────────────────────── | ||
| - name: Setup MacOS | ||
| if: matrix.os == 'macos' && !matrix.nvhpc | ||
| run: | | ||
|
|
@@ -313,30 +374,7 @@ jobs: | |
| echo "MPICC=mpiicx" >> $GITHUB_ENV | ||
| echo "MPICXX=mpiicpx" >> $GITHUB_ENV | ||
|
|
||
| # --- NVHPC container setup --- | ||
| - name: Setup NVHPC | ||
| if: matrix.nvhpc | ||
| run: | | ||
| apt-get update -y | ||
| apt-get install -y cmake python3 python3-venv python3-pip \ | ||
| libfftw3-dev libhdf5-dev hdf5-tools git | ||
| # Set up NVHPC HPC-X MPI runtime paths | ||
| HPCX_DIR=$(dirname "$(find /opt/nvidia/hpc_sdk -path "*/hpcx/hpcx-*/ompi/bin/mpirun" | head -1)")/../.. | ||
| MPI_LIB=$(mpifort --showme:link | grep -oP '(?<=-L)\S+' | head -1) | ||
| echo "LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:${LD_LIBRARY_PATH}" >> $GITHUB_ENV | ||
| # Container MPI fixes: PMIx shared-memory, hwloc binding | ||
| echo "PMIX_MCA_gds=hash" >> $GITHUB_ENV | ||
| echo "OMPI_MCA_hwloc_base_binding_policy=none" >> $GITHUB_ENV | ||
| echo "OMPI_MCA_rmaps_base_oversubscribe=1" >> $GITHUB_ENV | ||
| # Debug: confirm compiler flags are set | ||
| echo "=== NVHPC Environment ===" | ||
| echo "FFLAGS=$FFLAGS" | ||
| echo "CFLAGS=$CFLAGS" | ||
| echo "CXXFLAGS=$CXXFLAGS" | ||
| nvfortran --version | ||
| cat /proc/cpuinfo | grep "model name" | head -1 | ||
|
|
||
| # --- Standard build + test --- | ||
| # ── Standard build + test ─────────────────────────────────────────── | ||
| - name: Build | ||
| if: '!matrix.nvhpc' | ||
| run: | | ||
|
|
@@ -354,22 +392,37 @@ jobs: | |
| TEST_PCT: ${{ matrix.debug == 'reldebug' && '-% 20' || '' }} | ||
| ONLY_CHANGES: ${{ github.event_name == 'pull_request' && '--only-changes' || '' }} | ||
|
|
||
| # --- NVHPC build + test --- | ||
| # ── NVHPC build + test (via docker exec into long-lived container) ── | ||
| - name: Build (NVHPC) | ||
| if: matrix.nvhpc && matrix.target == 'cpu' | ||
| run: /bin/bash mfc.sh test -v --dry-run -j $(nproc) --test-all | ||
| run: | | ||
| docker exec nvhpc bash -c ' | ||
| source /etc/nvhpc-env.sh | ||
| /bin/bash mfc.sh test -v --dry-run -j $(nproc) --test-all | ||
| ' | ||
|
|
||
| - name: Build (NVHPC GPU) | ||
| if: matrix.nvhpc && matrix.target == 'gpu' | ||
| run: | | ||
| /bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu acc | ||
| /bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu mp | ||
| run: | | ||
| docker exec nvhpc bash -c ' | ||
| source /etc/nvhpc-env.sh | ||
| /bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu acc | ||
| /bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu mp | ||
| ' | ||
|
|
||
| - name: Test (NVHPC) | ||
| if: matrix.nvhpc && matrix.target == 'cpu' | ||
| run: | | ||
| ulimit -s unlimited || ulimit -s 65536 || true | ||
| /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --test-all | ||
| run: | | ||
| docker exec nvhpc bash -c ' | ||
| source /etc/nvhpc-env.sh | ||
| ulimit -s unlimited || ulimit -s 65536 || true | ||
| /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --test-all | ||
| ' | ||
|
|
||
| # ── Cleanup ───────────────────────────────────────────────────────── | ||
| - name: Stop NVHPC container | ||
| if: always() && matrix.nvhpc | ||
| run: docker rm -f nvhpc || true | ||
|
|
||
| self: | ||
| name: "${{ matrix.cluster_name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }}${{ matrix.shard != '' && format(' [{0}]', matrix.shard) || '' }})" | ||
|
|
||
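The NVHPC steps above rely on a pattern worth spelling out: every `docker exec` starts a fresh shell, so variables exported in one step are gone in the next unless they are written to a file and sourced again. A minimal sketch of that pattern follows, assuming `bash -c` as a stand-in for `docker exec nvhpc bash -c` so it runs without Docker; `ENV_FILE`, `DEMO_LIB`, and the `/opt/demo/lib` path are hypothetical placeholders, not names from the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: "bash -c" stands in for "docker exec nvhpc bash -c".
# ENV_FILE and /opt/demo/lib are hypothetical, not taken from the workflow.
set -eu
ENV_FILE=$(mktemp)
export ENV_FILE

# "Setup" step: compute a value once and persist it. \${LD_LIBRARY_PATH:-}
# is escaped so it expands when the file is sourced, not when it is written.
bash -c '
DEMO_LIB=/opt/demo/lib
cat > "$ENV_FILE" <<EOF
export LD_LIBRARY_PATH=${DEMO_LIB}:\${LD_LIBRARY_PATH:-}
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
'

# "Build" step without sourcing: a fresh shell does not see the variable.
without=$(bash -c 'echo "${OMPI_MCA_rmaps_base_oversubscribe:-unset}"')

# "Build" step with sourcing: the persisted values are restored.
with=$(bash -c '. "$ENV_FILE"; echo "$OMPI_MCA_rmaps_base_oversubscribe"')

echo "without source: $without"
echo "with source:    $with"
```

Values passed with `-e` on `docker run` are visible to every later `docker exec`; only values computed inside the container, such as the MPI library path, need the env file.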
2. NVHPC tag still cuda_multi
🐞 Bug · Correctness
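The mismatch this comment points at is visible in the diff itself: `NVHPC_IMAGE` is still built with the `-devel-cuda_multi-ubuntu22.04` suffix. One way to catch that kind of regression is a small guard on the tag string. This is only a sketch: the function name `is_multi_cuda_tag` and the version `0.0` are hypothetical, and no single-CUDA tag is shown because the page does not name one.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not part of the PR: succeeds when an image
# reference uses a multi-CUDA NVHPC tag.
is_multi_cuda_tag() {
  case "$1" in
    *cuda_multi*) return 0 ;;
    *)            return 1 ;;
  esac
}

# Tag shape taken from the diff; "0.0" is a placeholder version.
image="nvcr.io/nvidia/nvhpc:0.0-devel-cuda_multi-ubuntu22.04"
if is_multi_cuda_tag "$image"; then
  echo "multi-CUDA image: $image"
fi
```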