Merged
147 changes: 100 additions & 47 deletions .github/workflows/test.yml
@@ -213,25 +213,25 @@ jobs:
fail-fast: false
continue-on-error: true
runs-on: ${{ matrix.nvhpc && 'ubuntu-22.04' || format('{0}-latest', matrix.os) }}
container:
image: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}
options: ${{ matrix.nvhpc && '--security-opt seccomp=unconfined' || '' }}
env:
CC: ${{ matrix.nvhpc && 'nvc' || '' }}
CXX: ${{ matrix.nvhpc && 'nvc++' || '' }}
FC: ${{ matrix.nvhpc && 'nvfortran' || '' }}
OMPI_ALLOW_RUN_AS_ROOT: ${{ matrix.nvhpc && '1' || '' }}
OMPI_ALLOW_RUN_AS_ROOT_CONFIRM: ${{ matrix.nvhpc && '1' || '' }}
PMIX_MCA_gds: ${{ matrix.nvhpc && 'hash' || '' }}
OMPI_MCA_hwloc_base_binding_policy: ${{ matrix.nvhpc && 'none' || '' }}
FFLAGS: ${{ matrix.nvhpc && '-tp=px -Kieee -noswitcherror' || '' }}
CFLAGS: ${{ matrix.nvhpc && '-tp=px' || '' }}
CXXFLAGS: ${{ matrix.nvhpc && '-tp=px' || '' }}
# Image tag for NVHPC jobs; empty for non-NVHPC jobs.
NVHPC_IMAGE: ${{ matrix.nvhpc && format('nvcr.io/nvidia/nvhpc:{0}-devel-cuda_multi-ubuntu22.04', matrix.nvhpc) || '' }}
Comment on lines +217 to +218

2. NVHPC tag still `cuda_multi` 🐞 Bug ≡ Correctness

NVHPC_IMAGE is still constructed with the cuda_multi tag, so NVHPC jobs will continue pulling
the multi-CUDA images rather than the single-CUDA tags described in the PR. This undermines the PR’s
disk-usage reduction intent and keeps the workflow dependent on large-image disk cleanup behavior.
Agent Prompt
### Issue description
The workflow still pulls `nvcr.io/nvidia/nvhpc:*-devel-cuda_multi-ubuntu22.04`, so it does not implement the PR’s stated switch to single-CUDA tags.

### Issue Context
PR description explicitly states moving off `cuda_multi` to single-CUDA tags to reduce runner disk usage.

### Fix Focus Areas
- .github/workflows/test.yml[216-218]
- .github/workflows/test.yml[277-303]

### Suggested fix
Add an explicit CUDA tag to the NVHPC matrix (e.g. `cuda: '12.6'`) and construct the image from that field:

- Extend each NVHPC `matrix.include` entry with a `cuda` value.
- Change `NVHPC_IMAGE` to:
  `nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04`

This makes the intended tag change explicit and prevents accidentally continuing to pull `cuda_multi`.
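A minimal sketch of that matrix change (the `24.5` and `12.4` values below are illustrative stand-ins, not taken from this PR):

```yaml
# Hypothetical matrix entry: each NVHPC release is paired with an explicit
# single-CUDA tag instead of relying on the cuda_multi variant.
strategy:
  matrix:
    include:
      - nvhpc: '24.5'   # illustrative NVHPC release
        cuda: '12.4'    # illustrative CUDA version for the image tag
env:
  # Single-CUDA image rather than the ~25-30 GB cuda_multi image.
  NVHPC_IMAGE: nvcr.io/nvidia/nvhpc:${{ matrix.nvhpc }}-devel-cuda${{ matrix.cuda }}-ubuntu22.04
```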



steps:
- name: Git safe directory
# ── NVHPC: free disk before pulling the ~25-30 GB cuda_multi image ──
- name: Free disk space
if: matrix.nvhpc
run: git config --global --add safe.directory /__w/MFC/MFC
run: |
echo "=== Disk before cleanup ==="
df -h /
sudo rm -rf /usr/share/dotnet /usr/local/lib/android \
/opt/ghc /usr/local/share/boost /opt/hostedtoolcache \
/usr/local/graalvm /usr/local/.ghcup \
/usr/local/share/chromium /usr/local/lib/node_modules
sudo docker image prune -af
sudo apt-get clean
echo "=== Disk after cleanup ==="
df -h /

- name: Clone
uses: actions/checkout@v4
@@ -274,6 +274,67 @@ jobs:
echo "Coverage cache: none available — full test suite will run"
fi

# ── NVHPC: pull image and start a long-lived container ──────────────
# Replaces the container: directive so we can free disk space first.
# Uses "docker run -d ... sleep infinity" + "docker exec" to preserve
# installed packages and env vars across steps.
- name: Pull NVHPC container
if: matrix.nvhpc
run: docker pull "$NVHPC_IMAGE"

- name: Start NVHPC container
if: matrix.nvhpc
run: |
docker run -d --name nvhpc \
--security-opt seccomp=unconfined \
-v "${{ github.workspace }}:/workspace" \
-w /workspace \
-e CC=nvc \
-e CXX=nvc++ \
-e FC=nvfortran \
-e OMPI_ALLOW_RUN_AS_ROOT=1 \
-e OMPI_ALLOW_RUN_AS_ROOT_CONFIRM=1 \
-e PMIX_MCA_gds=hash \
-e OMPI_MCA_hwloc_base_binding_policy=none \
-e "FFLAGS=-tp=px -Kieee -noswitcherror" \
-e CFLAGS=-tp=px \
-e CXXFLAGS=-tp=px \
"$NVHPC_IMAGE" sleep infinity

- name: Setup NVHPC
if: matrix.nvhpc
run: |
docker exec nvhpc bash -c '
set -e
apt-get update -y
apt-get install -y cmake python3 python3-venv python3-pip \
libfftw3-dev libhdf5-dev hdf5-tools git

# The repo is bind-mounted from the host so git sees a different
# owner. Mark it safe to suppress "dubious ownership" errors that
# otherwise spam 80 000+ lines into the CI log.
git config --global --add safe.directory /workspace

# Set up NVHPC HPC-X MPI runtime paths
HPCX_DIR=$(dirname "$(find /opt/nvidia/hpc_sdk -path "*/hpcx/hpcx-*/ompi/bin/mpirun" | head -1)")/../..
MPI_LIB=$(mpifort --showme:link | grep -oP "(?<=-L)\S+" | head -1)

# Persist env vars for subsequent docker exec calls
cat > /etc/nvhpc-env.sh <<EOF
export LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:\$LD_LIBRARY_PATH
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
Comment on lines +323 to +326

1. Broken heredoc terminator 🐞 Bug ≡ Correctness

The NVHPC setup step writes /etc/nvhpc-env.sh with a heredoc whose closing EOF is indented, so
bash never recognizes the terminator and consumes the remainder of the script as heredoc content.
This prevents the environment setup from completing correctly and breaks later NVHPC steps that
source /etc/nvhpc-env.sh.
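The terminator rule itself can be checked with a minimal sketch (file paths here are illustrative, not the workflow's real ones):

```shell
#!/usr/bin/env bash
# Sketch: bash closes a heredoc only when the terminator starts at column 1
# (or is tab-indented and the heredoc uses <<-). Here the terminator is
# left-aligned, so the heredoc closes and the command after it runs.
set -e
cat > /tmp/heredoc-demo.sh <<'EOF'
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
echo "lines written: $(wc -l < /tmp/heredoc-demo.sh)"
```

Indenting that closing `EOF` by even one space reproduces the bug: bash keeps reading everything after it as heredoc content.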
Agent Prompt
### Issue description
The `Setup NVHPC` step uses a heredoc to generate `/etc/nvhpc-env.sh`, but the closing `EOF` delimiter is indented. In bash, the heredoc terminator must match exactly at the start of the line (unless using `<<-` with tabs), so the heredoc won’t close and the remainder of the script is treated as heredoc content.

### Issue Context
This breaks later steps that run `docker exec ... source /etc/nvhpc-env.sh`.

### Fix Focus Areas
- .github/workflows/test.yml[307-336]

### Suggested fix
Inside the `bash -c ' ... '` script, left-align the heredoc delimiter (and ideally the heredoc body) so the terminator is at column 1, e.g.:

```sh
cat > /etc/nvhpc-env.sh <<'EOF'
export LD_LIBRARY_PATH=...
export OMPI_MCA_rmaps_base_oversubscribe=1
EOF
```

Alternatively, avoid heredocs entirely and use `printf` to write the file.
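As a sketch of the `printf` route (the target path and the MPI lib directory are stand-ins, not the workflow's real values):

```shell
#!/usr/bin/env bash
# Sketch: build the env file line by line with printf, so there is no
# heredoc terminator to get wrong. Paths below are illustrative stand-ins.
set -e
ENV_FILE=/tmp/nvhpc-env.sh   # stand-in for /etc/nvhpc-env.sh
MPI_LIB=/opt/mpi/lib         # stand-in for the discovered MPI lib dir
# Single quotes keep $LD_LIBRARY_PATH literal so it expands when sourced.
printf 'export LD_LIBRARY_PATH=%s:$LD_LIBRARY_PATH\n' "$MPI_LIB" > "$ENV_FILE"
printf 'export OMPI_MCA_rmaps_base_oversubscribe=1\n' >> "$ENV_FILE"
cat "$ENV_FILE"
```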



# Debug: confirm compiler flags are set
echo "=== NVHPC Environment ==="
echo "FFLAGS=$FFLAGS"
echo "CFLAGS=$CFLAGS"
echo "CXXFLAGS=$CXXFLAGS"
nvfortran --version
cat /proc/cpuinfo | grep "model name" | head -1
'

# ── Standard (non-NVHPC) setup ─────────────────────────────────────
- name: Setup MacOS
if: matrix.os == 'macos' && !matrix.nvhpc
run: |
@@ -313,30 +374,7 @@ jobs:
echo "MPICC=mpiicx" >> $GITHUB_ENV
echo "MPICXX=mpiicpx" >> $GITHUB_ENV

# --- NVHPC container setup ---
- name: Setup NVHPC
if: matrix.nvhpc
run: |
apt-get update -y
apt-get install -y cmake python3 python3-venv python3-pip \
libfftw3-dev libhdf5-dev hdf5-tools git
# Set up NVHPC HPC-X MPI runtime paths
HPCX_DIR=$(dirname "$(find /opt/nvidia/hpc_sdk -path "*/hpcx/hpcx-*/ompi/bin/mpirun" | head -1)")/../..
MPI_LIB=$(mpifort --showme:link | grep -oP '(?<=-L)\S+' | head -1)
echo "LD_LIBRARY_PATH=${MPI_LIB}:${HPCX_DIR}/ucx/lib:${HPCX_DIR}/ucc/lib:${LD_LIBRARY_PATH}" >> $GITHUB_ENV
# Container MPI fixes: PMIx shared-memory, hwloc binding
echo "PMIX_MCA_gds=hash" >> $GITHUB_ENV
echo "OMPI_MCA_hwloc_base_binding_policy=none" >> $GITHUB_ENV
echo "OMPI_MCA_rmaps_base_oversubscribe=1" >> $GITHUB_ENV
# Debug: confirm compiler flags are set
echo "=== NVHPC Environment ==="
echo "FFLAGS=$FFLAGS"
echo "CFLAGS=$CFLAGS"
echo "CXXFLAGS=$CXXFLAGS"
nvfortran --version
cat /proc/cpuinfo | grep "model name" | head -1

# --- Standard build + test ---
# ── Standard build + test ───────────────────────────────────────────
- name: Build
if: '!matrix.nvhpc'
run: |
@@ -354,22 +392,37 @@
TEST_PCT: ${{ matrix.debug == 'reldebug' && '-% 20' || '' }}
ONLY_CHANGES: ${{ github.event_name == 'pull_request' && '--only-changes' || '' }}

# --- NVHPC build + test ---
# ── NVHPC build + test (via docker exec into long-lived container) ──
- name: Build (NVHPC)
if: matrix.nvhpc && matrix.target == 'cpu'
run: /bin/bash mfc.sh test -v --dry-run -j $(nproc) --test-all
run: |
docker exec nvhpc bash -c '
source /etc/nvhpc-env.sh
/bin/bash mfc.sh test -v --dry-run -j $(nproc) --test-all
'

- name: Build (NVHPC GPU)
if: matrix.nvhpc && matrix.target == 'gpu'
run: |
/bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu acc
/bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu mp
run: |
docker exec nvhpc bash -c '
source /etc/nvhpc-env.sh
/bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu acc
/bin/bash mfc.sh test -v --dry-run -j 2 --test-all --gpu mp
'

- name: Test (NVHPC)
if: matrix.nvhpc && matrix.target == 'cpu'
run: |
ulimit -s unlimited || ulimit -s 65536 || true
/bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --test-all
run: |
docker exec nvhpc bash -c '
source /etc/nvhpc-env.sh
ulimit -s unlimited || ulimit -s 65536 || true
/bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --test-all
'

# ── Cleanup ─────────────────────────────────────────────────────────
- name: Stop NVHPC container
if: always() && matrix.nvhpc
run: docker rm -f nvhpc || true

self:
name: "${{ matrix.cluster_name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }}${{ matrix.shard != '' && format(' [{0}]', matrix.shard) || '' }})"
2 changes: 1 addition & 1 deletion CMakeLists.txt
@@ -10,7 +10,7 @@
cmake_minimum_required(VERSION 3.20)


# We include C as a language because - for some reason -
# We include C as a language because - for some reason
# FIND_LIBRARY_USE_LIB64_PATHS is otherwise ignored.

project(MFC LANGUAGES C CXX Fortran)