# MFI

Modern Fortran interfaces to BLAS and LAPACK

MFI provides generic, type-agnostic wrappers around BLAS and LAPACK routines. Instead of writing type-specific calls with dozens of arguments, you write a single call that works for `real(real32)`, `real(real64)`, `complex(real32)`, and `complex(real64)` arguments.

Example: $C = A \cdot B$

```fortran
program main
    use mfi_blas, only: mfi_gemm
    implicit none
    real :: A(4,4), B(4,4), C(4,4)
    ! ... fill A and B ...
    call mfi_gemm(A, B, C)   ! That's it. No leading dims, no m/n/k, no alpha/beta.
end program
```

## Quick Start

### Recommended: Nix Flake (zero config)

```sh
git clone https://github.com/14NGiestas/mfi.git
cd mfi
nix develop                # CPU-only shell with gfortran, fpm, fypp, BLAS, LAPACK
nix develop .#gpu-modern   # with CUDA 12.3
nix develop .#gpu-legacy   # with CUDA 11.8
nix develop .#gpu-zluda    # AMD GPU via ZLUDA (pkgs.zluda from nixpkgs)
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite
```

Requires Nix with flakes enabled.

### Manual setup

| Tool | Minimum version |
|------|-----------------|
| fpm | ≥ 0.13.0 |
| fypp | any (`pip install fypp`) |
| Fortran compiler | gfortran 12+ (recommended) |

Install BLAS and LAPACK from your package manager:

| Distro | Package |
|--------|---------|
| Arch | openblas-lapack-static (AUR) |
| Ubuntu/Debian | libblas-dev liblapack-dev |
| Fedora | openblas-devel lapack-devel |

### Build & Test

```sh
git clone https://github.com/14NGiestas/mfi.git
cd mfi
make              # generates .f90 from .fpp/.fypp templates
fpm test          # runs the test suite
```

## Using MFI as a Dependency

Add to your project's `fpm.toml`:

```toml
# CPU-only (stable)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm" }
```

That's all: fpm handles the rest. No `make` step is needed in your own project.
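As a sketch of what a consumer project might look like (assuming only the `mfi_gemm` generic shown above; the array values here are illustrative), an `app/main.f90` could be:

```fortran
program demo
    use, intrinsic :: iso_fortran_env, only: real64
    use mfi_blas, only: mfi_gemm
    implicit none
    real(real64) :: A(2,2), B(2,2), C(2,2)

    ! Column-major fill: A is the 2x2 identity
    A = reshape([1.0_real64, 0.0_real64, 0.0_real64, 1.0_real64], [2,2])
    B = reshape([1.0_real64, 2.0_real64, 3.0_real64, 4.0_real64], [2,2])
    C = 0.0_real64

    call mfi_gemm(A, B, C)   ! C = A*B; kinds and dimensions are inferred
    print *, C               ! with A the identity, C should reproduce B
end program demo
```

Then `fpm run` fetches and builds MFI together with your application.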


## GPU Acceleration with cuBLAS

MFI can transparently dispatch BLAS calls to cuBLAS when compiled with the cublas feature. The same mfi_gemm, mfi_gemv, etc. calls run on the GPU without code changes.

Try it in your browser:

Open In Colab

### Local build with cuBLAS

```sh
make
fpm build --profile cublas
fpm test --profile cublas
```

### Runtime CPU / GPU switching

MFI uses lazy initialization, so no setup code is needed. When compiled with the `cublas` feature, GPU dispatch is controlled entirely by the `MFI_USE_CUBLAS` environment variable:

```sh
# CPU (default)
./build/app/app

# GPU
MFI_USE_CUBLAS=1 ./build/app/app
```

The same call mfi_gemm(A, B, C) runs on CPU or GPU without any code changes.

For OpenMP-parallel programs, also set OMP_NUM_THREADS to pre-allocate per-thread cuBLAS handles:

```sh
MFI_USE_CUBLAS=1 OMP_NUM_THREADS=8 ./build/app/app
```

### Manual CPU/GPU switching (advanced)

If you need fine-grained control within a single program (e.g., run most computations on GPU but force a specific call to CPU), use mfi_force_gpu / mfi_force_cpu:

```fortran
call mfi_gemm(A, B, C)       ! CPU (default)

call mfi_force_gpu
call mfi_gemm(D, E, F)       ! GPU
call mfi_force_cpu

call mfi_gemm(G, H, I)       ! CPU again
```

Note: When compiled without the `cublas` feature, `mfi_force_gpu` and `mfi_force_cpu` are no-op stubs, so your code compiles and runs normally on CPU without any `#ifdef` changes. Simply recompile with `--profile cublas` to activate GPU acceleration.

### Clean shutdown (optional)

Call `mfi_cublas_finalize()` at program end to release GPU resources. The OS cleans up on exit anyway, so this step is optional.
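As a minimal sketch (assuming `mfi_cublas_finalize` is exported from `mfi_blas` alongside the generics; check the generated modules for its exact home):

```fortran
program gpu_demo
    use mfi_blas, only: mfi_gemm, mfi_cublas_finalize  ! module name assumed
    implicit none
    real :: A(64,64), B(64,64), C(64,64)

    call random_number(A)
    call random_number(B)
    call mfi_gemm(A, B, C)       ! GPU if MFI_USE_CUBLAS=1, CPU otherwise

    call mfi_cublas_finalize()   ! optional: release cuBLAS handles before exit
end program gpu_demo
```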


## AMD GPU support via ZLUDA

ZLUDA is a drop-in replacement for the CUDA runtime that runs on AMD GPUs using the HIP SDK. Because MFI's GPU backend only uses standard CUDA/cuBLAS APIs (`cuda_runtime.h`, `cublas_v2.h`, `-lcublas`, `-lcudart`), the existing `cublas` build works on AMD hardware without any source changes: you just redirect the linker and runtime to ZLUDA's libraries.

### Prerequisites

With Nix: the ROCm/HIP userspace stack (`rocmPackages.clr` for the HIP runtime, `rocmPackages.rocm-runtime` for the HSA runtime) and the CUDA compile-time headers are all provided by the `gpu-zluda` devShell. You still need to download ZLUDA itself (it is a pre-built binary that cannot currently be built from nixpkgs 24.11) and point `ZLUDA_PATH` at its directory before entering the shell. The only host requirement beyond that is the AMD GPU kernel driver (the `amdgpu` kernel module and firmware), which Nix cannot provide.

Without Nix: install the full ROCm/HIP SDK and download ZLUDA from its releases page.

### Linux

With Nix (recommended): ROCm/HIP and CUDA headers are provided automatically. Download ZLUDA from its releases page, then:

```sh
ZLUDA_PATH=/path/to/zluda nix develop .#gpu-zluda
make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app
```

The shell prints a warning and usage hint if ZLUDA_PATH is unset.

Without Nix: after installing the ROCm/HIP SDK and ZLUDA (see Prerequisites above), set the env vars manually:

```sh
export CPATH="/path/to/zluda/include:$CPATH"
export LIBRARY_PATH="/path/to/zluda/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/path/to/zluda/lib:$LD_LIBRARY_PATH"

make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app
```

### Windows

Install AMD Software: Adrenalin Edition and the HIP SDK, then use the ZLUDA launcher (recommended) or manually prepend the ZLUDA DLL directory to PATH:

```bat
REM recommended: zluda launcher
zluda -- fpm build --profile zluda

REM or manually
set PATH=C:\path\to\zluda;%PATH%
fpm build --profile zluda
```

### Consumer projects on AMD

```toml
# AMD GPU via ZLUDA (set env vars before building, LD_LIBRARY_PATH before running)
mfi = { git="https://github.com/14NGiestas/mfi.git", branch="mfi-fpm", features = ["zluda"] }
```

The `zluda` and `cublas` fpm features are identical in `fpm.toml`; both compile the same C/Fortran sources. Use whichever name makes intent clearer in your project; `features = ["cublas"]` also works, since only the label differs.


## Troubleshooting

| Problem | Solution |
|---------|----------|
| `CUBLAS_STATUS_NOT_INITIALIZED` | cuBLAS handle not created. Set `MFI_USE_CUBLAS=1` or call `mfi_force_gpu` before the first BLAS call. |
| `cuda_runtime.h` not found | CUDA Toolkit (or ZLUDA headers) not in the include path. See `gpu_test.ipynb` for a Colab setup, or set `CPATH` to ZLUDA's `include/` directory. |
| `libcublas.so` not found at runtime | `LD_LIBRARY_PATH` does not include the CUDA/ZLUDA libs. Also ensure `CPATH` and `LIBRARY_PATH` were set at build time. |
| ZLUDA: `HIP_VISIBLE_DEVICES` not set | On multi-GPU systems set `HIP_VISIBLE_DEVICES=0` (or the desired device index). |
| ZLUDA: silent wrong results | Check `MFI_DEBUG=1` output and ensure your ZLUDA version is at least the latest pre-release. |
| `i?amin` symbols missing | Your BLAS provider lacks the extensions. Use the default profile (without `MFI_LINK_EXTERNAL`) or switch to OpenBLAS. |
| Tests fail on CPU build | Known pre-existing failures: `cunmrq`, `sorg2r`, `sorgr2`, `cungr2`, `cung2r`, `sormrq`, `heevx` (segfault). |

## Interface Levels

MFI exposes four interface levels for BLAS, from bare-metal to fully modern:

| Level | Example | Arguments |
|-------|---------|-----------|
| Raw F77 | `call cgemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)` | 13 |
| Improved F77 | `call f77_gemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)` | 13 (no c/d/s/z prefix) |
| MFI typed | `call mfi_sgemm(A, B, C)` | 3 (type-specific) |
| MFI generic | `call mfi_gemm(A, B, C)` | 3 (type-agnostic) |
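As a sketch of the difference in practice (the `f77_gemm` and `mfi_gemm` names come from the table above; the defaults for alpha and beta are assumed from the earlier `mfi_gemm` example):

```fortran
! Same product C = A*B at two interface levels.
real :: A(4,4), B(4,4), C(4,4), alpha, beta
integer, parameter :: n = 4
alpha = 1.0
beta  = 0.0

! Improved F77: every BLAS argument is still spelled out,
! but the c/d/s/z prefix is gone
call f77_gemm('N', 'N', n, n, n, alpha, A, n, B, n, beta, C, n)

! MFI generic: dimensions, leading dims, alpha and beta take their defaults
call mfi_gemm(A, B, C)
```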

For full API documentation, see the generated reference.


## Supported Routines

### BLAS

#### Level 1

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | asum | Sum of vector magnitudes |
| 👍 | axpy | Scalar-vector product |
| 👍 | copy | Copy vector |
| 👍 | dot | Dot product |
| 👍 | dotc | Dot product, conjugated |
| 👍 | dotu | Dot product, unconjugated |
| f77 | sdsdot | Extended-precision inner product |
| f77 | dsdot | Extended-precision inner product with double result |
| 👍 | nrm2 | Vector 2-norm (Euclidean norm) |
| 👍 | rot | Plane rotation |
| 👍 | rotg | Generate Givens rotation |
| 👍 | rotm | Modified Givens rotation |
| 👍 | rotmg | Generate modified Givens rotation |
| 👍 | scal | Vector-scalar product |
| 👍 | swap | Vector-vector swap |

</details>

#### Level 1 Extensions

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | iamax | Index of maximum-absolute-value element |
| 👍 | iamin | Index of minimum-absolute-value element |
| 👍 | lamch | Machine precision parameters |

</details>

#### Level 2

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | gbmv | Matrix-vector product (general band) |
| 👍 | gemv | Matrix-vector product (general) |
| 👍 | ger | Rank-1 update (general) |
| 👍 | gerc | Rank-1 update (general, conjugated) |
| 👍 | geru | Rank-1 update (general, unconjugated) |
| 👍 | hbmv | Matrix-vector product (Hermitian band) |
| 👍 | hemv | Matrix-vector product (Hermitian) |
| 👍 | her | Rank-1 update (Hermitian) |
| 👍 | her2 | Rank-2 update (Hermitian) |
| 👍 | hpmv | Matrix-vector product (Hermitian packed) |
| 👍 | hpr | Rank-1 update (Hermitian packed) |
| 👍 | hpr2 | Rank-2 update (Hermitian packed) |
| 👍 | sbmv | Matrix-vector product (symmetric band) |
| 👍 | spmv | Matrix-vector product (symmetric packed) |
| 👍 | spr | Rank-1 update (symmetric packed) |
| 👍 | spr2 | Rank-2 update (symmetric packed) |
| 👍 | symv | Matrix-vector product (symmetric) |
| 👍 | syr | Rank-1 update (symmetric) |
| 👍 | syr2 | Rank-2 update (symmetric) |
| 👍 | tbmv | Matrix-vector product (triangular band) |
| 👍 | tbsv | Solve (triangular band) |
| 👍 | tpmv | Matrix-vector product (triangular packed) |
| 👍 | tpsv | Solve (triangular packed) |
| 👍 | trmv | Matrix-vector product (triangular) |
| 👍 | trsv | Solve (triangular) |

</details>

#### Level 3

<details>
<summary>Click to expand</summary>

| Status | GPU | Name | Description |
|--------|-----|------|-------------|
| 👍 | ✅ | gemm | General matrix-matrix product |
| 👍 | ✅ | hemm | Hermitian × general matrix product |
| 👍 |  | herk | Hermitian rank-k update |
| 👍 |  | her2k | Hermitian rank-2k update |
| 👍 | ✅ | symm | Symmetric × general matrix product |
| 👍 |  | syrk | Symmetric rank-k update |
| 👍 |  | syr2k | Symmetric rank-2k update |
| 👍 | ✅ | trmm | Triangular × general matrix product |
| 👍 | ✅ | trsm | Solve with triangular matrix |

</details>

### LAPACK

LAPACK coverage is growing; routines are implemented as needed.

#### Factorization and Solve

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | geqrf | QR factorization |
| 👍 | gerqf | RQ factorization |
| 👍 | getrf | LU factorization |
| 👍 | getri | Matrix inverse (from LU) |
| 👍 | getrs | Solve with LU-factored matrix |
| 👍 | gesv | Solve linear system (LU + solve) |
| 👍 | hetrf | Bunch-Kaufman factorization (Hermitian) |
| 👍 | pocon | Condition number estimate (Cholesky) |
| 👍 | potrf | Cholesky factorization |
| 👍 | potri | Matrix inverse (from Cholesky) |
| 👍 | potrs | Solve with Cholesky-factored matrix |
| 👍 | sytrf | Bunch-Kaufman factorization (symmetric) |
| 👍 | trtrs | Solve with triangular matrix |

</details>

#### Orthogonal / Unitary Factors

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | orgqr | Generate Q from QR (real) |
| 👍 | orgrq | Generate Q from RQ (real) |
| 👍 | ormqr | Multiply by Q from QR (real) |
| f77 | ormrq | Multiply by Q from RQ (real) |
| 👍 | org2r | Generate Q from QR2 (real) |
| 👍 | orm2r | Multiply by Q from QR2 (real) |
| 👍 | orgr2 | Generate Q from RQ2 (real) |
| 👍 | ormr2 | Multiply by Q from RQ2 (real) |
| 👍 | ungqr | Generate Q from QR (complex) |
| 👍 | ungrq | Generate Q from RQ (complex) |
| 👍 | unmqr | Multiply by Q from QR (complex) |
| f77 | unmrq | Multiply by Q from RQ (complex) |
| 👍 | ung2r | Generate Q from QR2 (complex) |
| 👍 | unm2r | Multiply by Q from QR2 (complex) |
| 👍 | ungr2 | Generate Q from RQ2 (complex) |
| 👍 | unmr2 | Multiply by Q from RQ2 (complex) |

</details>

#### Eigenvalues and SVD

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| 👍 | gesvd | Singular value decomposition |
| 👍 | heevd | Hermitian eigenvalues (divide & conquer) |
| 👍 | hegvd | Generalized Hermitian eigenproblem (divide & conquer) |
| 👍 | heevr | Hermitian eigenvalues (relatively robust) |
| f77 | heevx | Hermitian eigenvalues (expert) |

</details>

#### Least Squares

<details>
<summary>Click to expand</summary>

| Status | Name | Description |
|--------|------|-------------|
| f77 | gels | Least squares (QR/LQ) |
| f77 | gelst | Least squares (QR/LQ, T matrix) |
| f77 | gelss | Least squares (SVD, QR iteration) |
| f77 | gelsd | Least squares (SVD, divide & conquer) |
| f77 | gelsy | Least squares (complete orthogonal) |
| f77 | getsls | Least squares (tall-skinny QR/LQ) |
| f77 | gglse | Equality-constrained least squares |
| f77 | ggglm | Gauss-Markov linear model |

</details>

#### Auxiliary

| Name | Types | Description |
|------|-------|-------------|
| mfi_lartg | s, d, c, z | Generate plane rotation |
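As a hedged sketch of how `mfi_lartg` might be called (the argument order is assumed to follow LAPACK's `?lartg(f, g, c, s, r)`, and the module name is assumed by analogy with `mfi_blas`; check the generated interfaces before relying on either):

```fortran
use mfi_lapack, only: mfi_lartg   ! module name assumed
real :: f, g, c, s, r

f = 3.0
g = 4.0
! Computes c, s, r such that  [ c  s; -s  c ] * [f; g] = [r; 0]
call mfi_lartg(f, g, c, s, r)
```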

## Continuous Integration

CI uses Nix flakes with magic-nix-cache-action for fast, reproducible builds.

| Event | Behavior |
|-------|----------|
| Push to main | Full test matrix + deploy to mfi-fpm |
| Push to impl/cublas | Full test matrix + deploy to mfi-cublas |
| PR to main | Full test matrix |
| Manual dispatch | Full test matrix |