MFI provides generic, type-agnostic wrappers around BLAS and LAPACK routines.
Instead of writing type-specific calls with dozens of arguments, you write one
call that works for real32, real64, complex(real32), and complex(real64).
```fortran
program main
    use mfi_blas, only: mfi_gemm
    implicit none
    real :: A(4,4), B(4,4), C(4,4)
    ! ... fill A and B ...
    call mfi_gemm(A, B, C) ! That's it. No leading dims, no m/n/k, no alpha/beta.
end program
```

To build from source with Nix (requires Nix with flakes enabled):

```sh
git clone https://github.com/14NGiestas/mfi.git
cd mfi
nix develop              # CPU-only shell with gfortran, fpm, fypp, BLAS, LAPACK
nix develop .#gpu-modern # with CUDA 12.3
nix develop .#gpu-legacy # with CUDA 11.8
nix develop .#gpu-zluda  # AMD GPU via ZLUDA (pkgs.zluda from nixpkgs)
make     # generates .f90 from .fpp/.fypp templates
fpm test # runs the test suite
```
| Tool | Minimum version |
|---|---|
| fpm | ≥ 0.13.0 |
| fypp | any |
| Fortran compiler | gfortran 12+ (recommended) |
```sh
pip install fypp
```

Install BLAS and LAPACK from your package manager:
| Distro | Package |
|---|---|
| Arch | openblas-lapack-static (AUR) |
| Ubuntu/Debian | libblas-dev liblapack-dev |
| Fedora | openblas-devel lapack-devel |
```sh
git clone https://github.com/14NGiestas/mfi.git
cd mfi
make     # generates .f90 from .fpp/.fypp templates
fpm test # runs the test suite
```

Add to your project's `fpm.toml`:
```toml
# CPU-only (stable)
[dependencies]
mfi = { git = "https://github.com/14NGiestas/mfi.git", branch = "mfi-fpm" }
```

That's all: fpm handles the rest. No `make` step is needed in your own project.
MFI can transparently dispatch BLAS calls to cuBLAS when compiled with the
`cublas` feature. The same `mfi_gemm`, `mfi_gemv`, etc. calls run on the GPU
without code changes.
To try it in your browser, see `gpu_test.ipynb` for a Colab setup.
```sh
make
fpm build --profile cublas
fpm test --profile cublas
```

MFI uses lazy initialization, so no setup code is needed. When compiled with the
`cublas` feature, GPU dispatching is controlled entirely by the
`MFI_USE_CUBLAS` environment variable:
```sh
# CPU (default)
./build/app/app
# GPU
MFI_USE_CUBLAS=1 ./build/app/app
```

The same call `mfi_gemm(A, B, C)` runs on CPU or GPU without any code changes.
For OpenMP-parallel programs, also set `OMP_NUM_THREADS` so MFI can pre-allocate
per-thread cuBLAS handles:

```sh
MFI_USE_CUBLAS=1 OMP_NUM_THREADS=8 ./build/app/app
```

If you need fine-grained control within a single program (e.g., run most
computations on the GPU but force a specific call onto the CPU), use
`mfi_force_gpu` / `mfi_force_cpu`:
```fortran
call mfi_gemm(A, B, C) ! CPU (default)
call mfi_force_gpu
call mfi_gemm(D, E, F) ! GPU
call mfi_force_cpu
call mfi_gemm(G, H, I) ! CPU again
```

Note: when compiled without the `cublas` feature, `mfi_force_gpu` and
`mfi_force_cpu` are no-op stubs, so your code compiles and runs normally on CPU
without any `#ifdef` changes. Simply recompile with `--profile cublas` to
activate GPU acceleration.
Optionally, call `mfi_cublas_finalize()` at program end to release GPU
resources explicitly; the OS reclaims them on exit anyway.
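Putting the pieces together, a minimal GPU-aware program might look like the
sketch below. The `mfi_gemm`, `mfi_force_cpu`/`mfi_force_gpu`, and
`mfi_cublas_finalize` names come from this document; the exact module each
helper lives in is an assumption here, so check the generated reference:

```fortran
program gpu_demo
    use mfi_blas, only: mfi_gemm, mfi_force_gpu, mfi_force_cpu
    implicit none
    real :: A(256,256), B(256,256), C(256,256)

    call random_number(A)
    call random_number(B)

    ! With MFI_USE_CUBLAS=1 in the environment this runs on the GPU;
    ! otherwise it falls back to the CPU BLAS.
    call mfi_gemm(A, B, C)

    ! Force this particular call onto the CPU regardless of the env var.
    call mfi_force_cpu
    call mfi_gemm(A, B, C)
    call mfi_force_gpu

    ! Optional explicit cleanup of GPU resources (no-op in CPU-only builds).
    call mfi_cublas_finalize()
end program gpu_demo
```

Because the force/finalize routines are no-op stubs in CPU-only builds, this
program compiles and runs unchanged with or without `--profile cublas`.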
ZLUDA is a drop-in replacement for the CUDA
runtime that runs on AMD GPUs using the HIP SDK. Because MFI's GPU backend
only uses standard CUDA/cuBLAS APIs (`cuda_runtime.h`, `cublas_v2.h`,
`-lcublas`, `-lcudart`), the existing `cublas` build works on AMD hardware
without any source changes; you just redirect the linker and runtime to
ZLUDA's libraries.
With Nix: the ROCm/HIP userspace stack (`rocmPackages.clr` for the HIP runtime,
`rocmPackages.rocm-runtime` for the HSA runtime) and the CUDA compile-time
headers are all provided by the `gpu-zluda` devShell. You still need to
download ZLUDA itself (it is a pre-built binary that cannot currently be built
from nixpkgs 24.11) and point `ZLUDA_PATH` at its directory before entering the
shell. The only host requirement beyond that is the AMD GPU kernel driver (the
`amdgpu` kernel module and firmware), which Nix cannot provide.
Without Nix: install the full ROCm/HIP SDK and download ZLUDA from its releases page.
With Nix (recommended): ROCm/HIP and CUDA headers are provided automatically. Download ZLUDA from its releases page, then:
```sh
ZLUDA_PATH=/path/to/zluda nix develop .#gpu-zluda
make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app
```

The shell prints a warning and a usage hint if `ZLUDA_PATH` is unset.
Without Nix: after installing the ROCm/HIP SDK and ZLUDA (see Prerequisites above), set the env vars manually:
```sh
export CPATH="/path/to/zluda/include:$CPATH"
export LIBRARY_PATH="/path/to/zluda/lib:$LIBRARY_PATH"
export LD_LIBRARY_PATH="/path/to/zluda/lib:$LD_LIBRARY_PATH"
make
fpm build --profile zluda
MFI_USE_CUBLAS=1 ./build/gfortran_*/app/app
```

On Windows, install AMD Software: Adrenalin Edition and the HIP SDK, then use
the ZLUDA launcher (recommended) or manually prepend the ZLUDA DLL directory to
`PATH`:
launcher (recommended) or manually prepend the ZLUDA DLL directory to PATH:
```bat
REM recommended: zluda launcher
zluda -- fpm build --profile zluda
REM or manually
set PATH=C:\path\to\zluda;%PATH%
fpm build --profile zluda
```

In your own project's `fpm.toml`:

```toml
# AMD GPU via ZLUDA (set env vars before building, LD_LIBRARY_PATH before running)
mfi = { git="https://github.com/14NGiestas/mfi.git", branch="mfi-fpm", features = ["zluda"] }
```

The `zluda` and `cublas` fpm features are identical in `fpm.toml`; both compile
the same C/Fortran source. Use whichever name makes the intent clearer in your
project. Note that `features = ["cublas"]` also works; only the label differs.
| Problem | Solution |
|---|---|
| `CUBLAS_STATUS_NOT_INITIALIZED` | cuBLAS handle not created. Set `MFI_USE_CUBLAS=1` or call `mfi_force_gpu` before the first BLAS call. |
| `cuda_runtime.h` not found | CUDA Toolkit (or ZLUDA headers) not in the include path. See `gpu_test.ipynb` for a Colab setup, or set `CPATH` to ZLUDA's `include/` directory. |
| `libcublas.so` not found at runtime | `LD_LIBRARY_PATH` does not include the CUDA/ZLUDA libs. Also ensure `CPATH` and `LIBRARY_PATH` were set at build time. |
| ZLUDA: `HIP_VISIBLE_DEVICES` not set | On multi-GPU systems set `HIP_VISIBLE_DEVICES=0` (or the desired device index). |
| ZLUDA: silent wrong results | Check `MFI_DEBUG=1` output and ensure the ZLUDA version is ≥ the latest pre-release. |
| `i?amin` symbols missing | Your BLAS provider lacks these extensions. Use the default profile (without `MFI_LINK_EXTERNAL`) or switch to OpenBLAS. |
| Tests fail on CPU build | Known pre-existing failures: `cunmrq`, `sorg2r`, `sorgr2`, `cungr2`, `cung2r`, `sormrq`, `heevx` (segfault). |
MFI exposes four interface levels for BLAS, from bare-metal to fully modern:

| Level | Example | Arguments |
|---|---|---|
| Raw F77 | `call cgemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)` | 13 |
| Improved F77 | `call f77_gemm('N','N', N, N, N, alpha, A, N, B, N, beta, C, N)` | 13 (no c/d/s/z prefix) |
| MFI typed | `call mfi_sgemm(A, B, C)` | 3 (type-specific) |
| MFI generic | `call mfi_gemm(A, B, C)` | 3 (type-agnostic) |
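The difference is easiest to see side by side. A sketch, with the raw and
generic calls taken from the table above and the declarations purely
illustrative:

```fortran
program levels
    use mfi_blas, only: mfi_gemm
    implicit none
    integer, parameter :: N = 64
    complex :: alpha = (1.0, 0.0), beta = (0.0, 0.0)
    complex :: A(N,N), B(N,N), C(N,N)

    A = (1.0, 0.0)
    B = (1.0, 0.0)

    ! Raw F77: every dimension, leading dimension, and scalar is explicit,
    ! and the type lives in the routine name (cgemm for complex single).
    call cgemm('N', 'N', N, N, N, alpha, A, N, B, N, beta, C, N)

    ! MFI generic: shapes and strides are inferred from the array arguments,
    ! and alpha/beta take their usual defaults.
    call mfi_gemm(A, B, C)
end program levels
```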
For full API documentation, see the generated reference.
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | asum | Sum of vector magnitudes |
| ✓ | axpy | Scalar-vector product |
| ✓ | copy | Copy vector |
| ✓ | dot | Dot product |
| ✓ | dotc | Dot product conjugated |
| ✓ | dotu | Dot product unconjugated |
| f77 | sdsdot | Extended precision inner product |
| f77 | dsdot | Extended precision inner product with double result |
| ✓ | nrm2 | Vector 2-norm (Euclidean norm) |
| ✓ | rot | Plane rotation |
| ✓ | rotg | Generate Givens rotation |
| ✓ | rotm | Modified Givens rotation |
| ✓ | rotmg | Generate modified Givens rotation |
| ✓ | scal | Vector-scalar product |
| ✓ | swap | Vector-vector swap |
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | iamax | Index of maximum absolute value element |
| ✓ | iamin | Index of minimum absolute value element |
| ✓ | lamch | Machine precision parameters |
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | gbmv | Matrix-vector product (general band) |
| ✓ | gemv | Matrix-vector product (general) |
| ✓ | ger | Rank-1 update (general) |
| ✓ | gerc | Rank-1 update (general, conjugated) |
| ✓ | geru | Rank-1 update (general, unconjugated) |
| ✓ | hbmv | Matrix-vector product (Hermitian band) |
| ✓ | hemv | Matrix-vector product (Hermitian) |
| ✓ | her | Rank-1 update (Hermitian) |
| ✓ | her2 | Rank-2 update (Hermitian) |
| ✓ | hpmv | Matrix-vector product (Hermitian packed) |
| ✓ | hpr | Rank-1 update (Hermitian packed) |
| ✓ | hpr2 | Rank-2 update (Hermitian packed) |
| ✓ | sbmv | Matrix-vector product (symmetric band) |
| ✓ | spmv | Matrix-vector product (symmetric packed) |
| ✓ | spr | Rank-1 update (symmetric packed) |
| ✓ | spr2 | Rank-2 update (symmetric packed) |
| ✓ | symv | Matrix-vector product (symmetric) |
| ✓ | syr | Rank-1 update (symmetric) |
| ✓ | syr2 | Rank-2 update (symmetric) |
| ✓ | tbmv | Matrix-vector product (triangular band) |
| ✓ | tbsv | Solve (triangular band) |
| ✓ | tpmv | Matrix-vector product (triangular packed) |
| ✓ | tpsv | Solve (triangular packed) |
| ✓ | trmv | Matrix-vector product (triangular) |
| ✓ | trsv | Solve (triangular) |
Click to expand
| Status | GPU | Name | Description |
|---|---|---|---|
| ✓ | ✓ | gemm | General matrix-matrix product |
| ✓ | ✓ | hemm | Hermitian × general matrix product |
| ✓ | | herk | Hermitian rank-k update |
| ✓ | | her2k | Hermitian rank-2k update |
| ✓ | ✓ | symm | Symmetric × general matrix product |
| ✓ | | syrk | Symmetric rank-k update |
| ✓ | | syr2k | Symmetric rank-2k update |
| ✓ | ✓ | trmm | Triangular × general matrix product |
| ✓ | ✓ | trsm | Solve with triangular matrix |
LAPACK coverage is growing; routines are implemented as needed.
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | geqrf | QR factorization |
| ✓ | gerqf | RQ factorization |
| ✓ | getrf | LU factorization |
| ✓ | getri | Matrix inverse (from LU) |
| ✓ | getrs | Solve with LU-factored matrix |
| ✓ | gesv | Solve linear system (LU + solve) |
| ✓ | hetrf | Bunch-Kaufman factorization (Hermitian) |
| ✓ | pocon | Condition number estimate (Cholesky) |
| ✓ | potrf | Cholesky factorization |
| ✓ | potri | Matrix inverse (from Cholesky) |
| ✓ | potrs | Solve with Cholesky-factored matrix |
| ✓ | sytrf | Bunch-Kaufman factorization (symmetric) |
| ✓ | trtrs | Solve with triangular matrix |
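As a sketch of the LAPACK calling convention, assuming the `mfi_` wrappers
infer dimensions the same way the BLAS wrappers above do (the `mfi_lapack`
module name and the two-argument form of `mfi_gesv` are assumptions; check the
generated reference for the exact signature):

```fortran
program solve_demo
    use mfi_lapack, only: mfi_gesv
    implicit none
    real :: A(3,3), b(3,1)

    call random_number(A)
    b = 1.0

    ! Solve A x = b in place: on exit b holds the solution
    ! and A is overwritten with its LU factors.
    call mfi_gesv(A, b)
end program solve_demo
```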
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | orgqr | Generate Q from QR (real) |
| ✓ | orgrq | Generate Q from RQ (real) |
| ✓ | ormqr | Multiply by Q from QR (real) |
| f77 | ormrq | Multiply by Q from RQ (real) |
| ✓ | org2r | Generate Q from QR2 (real) |
| ✓ | orm2r | Multiply by Q from QR2 (real) |
| ✓ | orgr2 | Generate Q from RQ2 (real) |
| ✓ | ormr2 | Multiply by Q from RQ2 (real) |
| ✓ | ungqr | Generate Q from QR (complex) |
| ✓ | ungrq | Generate Q from RQ (complex) |
| ✓ | unmqr | Multiply by Q from QR (complex) |
| f77 | unmrq | Multiply by Q from RQ (complex) |
| ✓ | ung2r | Generate Q from QR2 (complex) |
| ✓ | unm2r | Multiply by Q from QR2 (complex) |
| ✓ | ungr2 | Generate Q from RQ2 (complex) |
| ✓ | unmr2 | Multiply by Q from RQ2 (complex) |
Click to expand
| Status | Name | Description |
|---|---|---|
| ✓ | gesvd | Singular value decomposition |
| ✓ | heevd | Hermitian eigenvalues (divide & conquer) |
| ✓ | hegvd | Generalized Hermitian eigenproblem (divide & conquer) |
| ✓ | heevr | Hermitian eigenvalues (relatively robust) |
| f77 | heevx | Hermitian eigenvalues (expert) |
Click to expand
| Status | Name | Description |
|---|---|---|
| f77 | gels | Least squares (QR/LQ) |
| f77 | gelst | Least squares (QR/LQ, T matrix) |
| f77 | gelss | Least squares (SVD, QR iteration) |
| f77 | gelsd | Least squares (SVD, divide & conquer) |
| f77 | gelsy | Least squares (complete orthogonal) |
| f77 | getsls | Least squares (tall-skinny QR/LQ) |
| f77 | gglse | Equality-constrained least squares |
| f77 | ggglm | Gauss-Markov linear model |
| Name | Types | Description |
|---|---|---|
| mfi_lartg | s, d, c, z | Generate plane rotation |
CI uses Nix flakes with magic-nix-cache-action for fast, reproducible builds.
| Event | Behavior |
|---|---|
| Push to `main` | Full test matrix + deploy to `mfi-fpm` |
| Push to `impl/cublas` | Full test matrix + deploy to `mfi-cublas` |
| PR to `main` | Full test matrix |
| Manual dispatch | Full test matrix |