Bug Description
COSMA's K-dimension splitting strategy (parallel (k / 2)) produces catastrophically incorrect results (93.6% errors) for certain matrix dimensions, while working perfectly for smaller matrices.
UPDATE: This was NOT a COSMA algorithm bug! The bug was in the validation tolerance using absolute error instead of relative error.
Root Cause (CONFIRMED)
The validation code in utils/cosma_utils.hpp was using:
isOK = isOK && (std::abs(globC[i] - globCcheck[i]) < epsilon); // epsilon = 1e-8
For large matrix multiplications:
- Result values have magnitude ~27,000
- Computation errors: ~0.02 (relative error ~7e-7, within float32 precision!)
- Absolute tolerance: 1e-8
- Result: 93.6% "errors" reported, but COSMA was computing correct results!
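The arithmetic above can be checked with a small standalone sketch. It is not COSMA code: the helper names (`ulp_at`, `passes_absolute`, `relative_error_of`) are hypothetical, and the values 27,000 and 0.02 are just the representative magnitudes quoted in this issue. It also shows that adjacent float32 values near 27,000 are about 0.002 apart, so an absolute tolerance of 1e-8 can only be met by a bit-identical result.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Gap between a float32 value and the next representable float32 above it.
inline float ulp_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// The original style of check: a fixed absolute tolerance.
inline bool passes_absolute(double computed, double reference, double epsilon) {
    return std::abs(computed - reference) < epsilon;
}

// Error scaled by the larger magnitude; falls back to the absolute error
// when both values are (nearly) zero, to avoid dividing by zero.
inline double relative_error_of(double computed, double reference) {
    const double abs_error = std::abs(computed - reference);
    const double scale = std::max(std::abs(computed), std::abs(reference));
    return (scale > 1e-10) ? abs_error / scale : abs_error;
}
```

With these helpers, `passes_absolute(27000.0, 27000.02, 1e-8)` is false, while `relative_error_of(27000.0, 27000.02)` is roughly 7e-7, in line with the figures listed above.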
Fix
Pull Request: #154
Changed to relative error validation:
double abs_error = std::abs(globC[i] - globCcheck[i]);
double scale = std::max(std::abs(globC[i]), std::abs(globCcheck[i]));
double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
isOK = isOK && (rel_error < tolerance);
Verification
After fix:
# 32×896×896 float32: NOW PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float
# Result is OK ✅
# 32×10000×896 float32: NOW PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 10000 -k 896 --test --type float
# Result is OK ✅
# 32×896×896 float64: PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type double
# Result is OK ✅
Apology
Sorry for the false alarm! COSMA's K-split algorithm was working correctly all along. The issue was that the validation tolerance was too strict for realistic floating-point computations, especially for:
- Large matrix dimensions (where results have large magnitude)
- Float32 precision (which needs ~1e-5 relative tolerance, not 1e-8 absolute)
The identical float/double errors (which I thought proved it was a logic bug) were actually because both were numerically correct - just failing an overly strict validation!
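For anyone who wants to reproduce the effect outside the miniapp, here is a minimal sketch of a validation loop that counts mismatches using the relative-error rule. It follows the logic of the fix but is not the code from the pull request: `relative_error` and `count_mismatches` are hypothetical names, and the choice to count a size difference as mismatches is an assumption made for this example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of COSMA): relative error with an absolute
// fallback when both values are close to zero.
template <typename Scalar>
double relative_error(Scalar computed, Scalar reference) {
    const double a = static_cast<double>(computed);
    const double b = static_cast<double>(reference);
    const double abs_error = std::abs(a - b);
    const double scale = std::max(std::abs(a), std::abs(b));
    return (scale > 1e-10) ? abs_error / scale : abs_error;
}

// Counts entries whose relative error reaches the tolerance.
// Tolerance selection mirrors the fix: 1e-5 for 4-byte scalars, epsilon otherwise.
template <typename Scalar>
std::size_t count_mismatches(const std::vector<Scalar>& computed,
                             const std::vector<Scalar>& reference,
                             double epsilon = 1e-8) {
    const double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
    const std::size_t common = std::min(computed.size(), reference.size());
    // Assumption for this sketch: entries missing from either side count as mismatches.
    std::size_t mismatches = std::max(computed.size(), reference.size()) - common;
    for (std::size_t i = 0; i < common; ++i) {
        if (!(relative_error(computed[i], reference[i]) < tolerance)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Dividing the returned count by the vector length gives an error percentage comparable to the one reported at the top of this issue. Note that for double inputs the tolerance stays at `epsilon`, but it is now applied to the relative error rather than the absolute one.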
Environment
- COSMA Version: v2.6.0 (commit a3101bb)
- System: 2-socket Intel Xeon, 28 cores/socket, NUMA-aware
- MPI: OpenMPI 4.1.x
- BLAS: OpenBLAS 0.3.x
- Compiler: GCC 11.4
Files changed:
utils/cosma_utils.hpp (validation logic)