
Validation tolerance for mathematical correctness should use relative error in most circumstances #153

@dbsanfte


Bug Description

Original report (superseded by the update below): COSMA's K-dimension splitting strategy (parallel (k / 2)) appeared to produce catastrophically incorrect results (93.6% errors reported) for certain matrix dimensions, while working perfectly for smaller matrices.

UPDATE: This was NOT a COSMA algorithm bug! The bug was in the validation tolerance using absolute error instead of relative error.

Root Cause (CONFIRMED)

The validation code in utils/cosma_utils.hpp was using:

isOK = isOK && (std::abs(globC[i] - globCcheck[i]) < epsilon);  // epsilon = 1e-8

For large matrix multiplications:

  • Result values have magnitude ~27,000
  • Computation errors: ~0.02 (relative error ~7e-7, within float32 precision!)
  • Absolute tolerance: 1e-8
  • Result: 93.6% "errors" reported, but COSMA was computing correct results!
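To make the numbers above concrete, here is a minimal sketch that measures the spacing between adjacent float32 values near 27,000 and expresses the observed error in those units. The helper names (ulp_at, error_in_ulps, passes_absolute) are hypothetical and not part of COSMA; only the magnitude, the error, and the 1e-8 tolerance come from this issue.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helpers for illustration; none of these exist in COSMA.

// Gap between x and the next representable float above it (one ulp).
inline float ulp_at(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Express an absolute error as a multiple of the float32 spacing at `magnitude`.
inline double error_in_ulps(float magnitude, double abs_error) {
    return abs_error / static_cast<double>(ulp_at(magnitude));
}

// The original absolute check from the issue (epsilon = 1e-8).
inline bool passes_absolute(double a, double b, double epsilon = 1e-8) {
    return std::abs(a - b) < epsilon;
}
```

Near 27,000 the float32 spacing is about 0.002, so an error of 0.02 is roughly ten representable steps, while the 1e-8 absolute tolerance is several orders of magnitude finer than anything float32 can distinguish at that magnitude.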

Fix

Pull Request: #154

Changed to relative error validation:

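double abs_error = std::abs(globC[i] - globCcheck[i]);
// Scale by the larger magnitude so the check is symmetric in both operands.
double scale = std::max(std::abs(globC[i]), std::abs(globCcheck[i]));
// Fall back to absolute error when both values are effectively zero.
double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
// float32 gets a looser relative tolerance; wider types keep epsilon (1e-8).
double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
isOK = isOK && (rel_error < tolerance);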
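For anyone who wants to try the check outside the miniapp, the same logic can be wrapped in a standalone function. This is only a sketch: the name results_match is hypothetical, it assumes real (non-complex) scalar types, and it is not the code from the pull request.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical standalone version of the relative-error check; assumes a
// real scalar type (float or double), not complex.
template <typename Scalar>
bool results_match(const std::vector<Scalar>& c,
                   const std::vector<Scalar>& c_check,
                   double epsilon = 1e-8) {
    if (c.size() != c_check.size()) return false;
    // Same tolerance choice as in the fix: 1e-5 for 4-byte scalars, else epsilon.
    const double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
    for (std::size_t i = 0; i < c.size(); ++i) {
        const double a = static_cast<double>(c[i]);
        const double b = static_cast<double>(c_check[i]);
        const double abs_error = std::abs(a - b);
        const double scale = std::max(std::abs(a), std::abs(b));
        // Near zero, dividing by scale is meaningless, so compare absolutely.
        const double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
        if (!(rel_error < tolerance)) return false;  // also rejects NaN
    }
    return true;
}
```

With this shape, a float32 pair such as 27000.0f and 27000.02f is accepted, while the same pair in double is still rejected, since a relative error of about 7e-7 exceeds the 1e-8 tolerance kept for double.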

Verification

After fix:

# 32×896×896 float32: NOW PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float
# Result is OK ✅

# 32×10000×896 float32: NOW PASSES  
mpirun -np 2 cosma_miniapp -m 32 -n 10000 -k 896 --test --type float
# Result is OK ✅

# 32×896×896 float64: PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type double
# Result is OK ✅

Apology

Sorry for the false alarm! COSMA's K-split algorithm was working correctly all along. The issue was that the validation tolerance was too strict for realistic floating-point computations, especially for:

  • Large matrix dimensions (where results have large magnitude)
  • Float32 precision (which needs ~1e-5 relative tolerance, not 1e-8 absolute)

The identical float/double errors (which I thought proved it was a logic bug) actually occurred because both results were numerically correct and were simply failing an overly strict validation.

Environment

  • COSMA Version: v2.6.0 (commit a3101bb)
  • System: 2-socket Intel Xeon, 28 cores/socket, NUMA-aware
  • MPI: OpenMPI 4.1.x
  • BLAS: OpenBLAS 0.3.x
  • Compiler: GCC 11.4

Files changed:

  • utils/cosma_utils.hpp (validation logic)
