@dbsanfte dbsanfte commented Oct 7, 2025

Problem

The validation code in cosma_utils.hpp used an absolute error tolerance (< 1e-8) to validate matrix multiplication results. This causes false negatives for large matrix multiplications where result values have magnitude ~10^4 or greater, since float32 rounding error at that magnitude is far larger than 1e-8.

For example, with 32×896×896 float32 matrices:

  • Result values: ~27,000 magnitude
  • Actual errors: ~0.02 (relative error ~7e-7, well within float32 precision)
  • Absolute tolerance: 1e-8
  • Result: 93.8% "errors" reported, but the computation was actually correct!
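
To make the mismatch concrete, it helps to look at the spacing between adjacent float32 values near 27,000. The sketch below is illustrative only and is not part of the PR; `ulp_at` is a hypothetical helper name. It computes the gap between 27000.0f and the next representable float, which is 2^-9 because 27,000 lies between 2^14 and 2^15 and float32 carries 23 fraction bits.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not from COSMA): distance from x to the next
// representable float32 value above it, i.e. one unit in the last place.
float ulp_at(float x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// ulp_at(27000.0f) is 2^-9 = 0.001953125, roughly 195,000 times larger
// than the old absolute tolerance of 1e-8.
```

Under these assumptions, two float32 results near 27,000 that differ by even a single rounding step are already about 0.002 apart, so an absolute tolerance of 1e-8 can only pass when the values are bit-identical.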

This is the root cause of issue #153, which appeared to be a K-split correctness bug.

Solution

Switch from absolute error to relative error validation:

// Before
isOK = isOK && (std::abs(globC[i] - globCcheck[i]) < epsilon);

// After  
double abs_error = std::abs(globC[i] - globCcheck[i]);
double scale = std::max(std::abs(globC[i]), std::abs(globCcheck[i]));
double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
isOK = isOK && (rel_error < tolerance);

Key improvements:

  1. Use relative error for numerical values with magnitude > 1e-10
  2. Use appropriate tolerances for data type:
    • Float32: 1e-5 (accounts for ~7 digits of precision)
    • Float64: 1e-8 (accounts for ~15 digits of precision)
  3. Fall back to absolute error for values near zero
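
The two checks can be compared side by side in a small standalone sketch. This is not the code in cosma_utils.hpp: the function names `absolute_error_ok` and `relative_error_ok` are hypothetical, and the sketch assumes real scalar types only (float or double). It otherwise mirrors the logic shown in the "After" snippet.

```cpp
#include <algorithm>
#include <cmath>

// Old behaviour (sketch): absolute difference against a fixed epsilon.
template <typename Scalar>
bool absolute_error_ok(Scalar result, Scalar reference, double epsilon = 1e-8) {
    const double r = static_cast<double>(result);
    const double c = static_cast<double>(reference);
    return std::abs(r - c) < epsilon;
}

// New behaviour (sketch): relative error, with an absolute-error fallback
// when both values are at or below 1e-10 in magnitude.
template <typename Scalar>
bool relative_error_ok(Scalar result, Scalar reference, double epsilon = 1e-8) {
    const double r = static_cast<double>(result);
    const double c = static_cast<double>(reference);
    const double abs_error = std::abs(r - c);
    const double scale = std::max(std::abs(r), std::abs(c));
    const double rel_error = (scale > 1e-10) ? abs_error / scale : abs_error;
    // 4-byte scalars (float) get a looser tolerance than 8-byte ones (double).
    const double tolerance = (sizeof(Scalar) == 4) ? 1e-5 : epsilon;
    return rel_error < tolerance;
}
```

With float32 values of 27000.0f and 27000.02f, `absolute_error_ok` rejects the pair while `relative_error_ok` accepts it, matching the example in the Problem section. Two details of this logic may be worth keeping in mind: the absolute fallback only applies when both values are at or below 1e-10, so comparing 0.0 against 5e-9 yields a relative error of 1 and fails; and a `sizeof(Scalar) == 4` test would not select the float32 tolerance for an 8-byte type such as `std::complex<float>`, if the check is ever instantiated for complex scalars.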

Testing

Verified fix resolves the false negatives:

# Before fix: 93.8% errors (FALSE NEGATIVE)
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float
# Result is NOT OK
# Total errors: 26912 out of 28672 elements (93.8616%)

# After fix: PASSES
mpirun -np 2 cosma_miniapp -m 32 -n 896 -k 896 --test --type float  
# Result is OK
# Result is CORRECT!

Additional validation:

  • ✅ 32×10000×896 float32: now passes (was 93.6% false errors)
  • ✅ 32×896×896 float64: passes with stricter 1e-8 tolerance
  • ✅ 32×32×32 float64: regression test still passes

Impact

This fixes validation for:

  • Large matrix dimensions (where results have large magnitude)
  • Float32 precision (which was essentially unusable before)
  • K-split and other distributed strategies (which were flagged incorrectly)

The actual COSMA algorithm was computing correct results all along - only the validation was broken.

Related Issues

Closes #153

@simonpintarelli
Member

cscs-ci run GH200

@simonpintarelli
Member

Thanks for the PR @dbsanfte!

I agree that using a relative tolerance and adjusting it for fp32 is the correct approach.

The PR contains additional commits, e.g. adding a mutex for global_coords in the Mapper class. It's not clear to me why these are required; as far as I can judge (I'm not very familiar with this code), they are not needed for the standard API COSMA provides. In case these commits didn't slip in by accident, could you please open a separate PR for them?

@dbsanfte
Author

Oh, those did slip in by accident, sorry. They're not strictly required for this PR.

The validation was using absolute error tolerance (1e-8) which fails for
large matrix multiplication results (magnitude ~1e4). This caused false
negatives where COSMA computed correct results but failed validation.

Changes:
- Switch from absolute error to relative error for validation
- Use 1e-5 tolerance for float32 (appropriate for single precision)
- Use 1e-8 tolerance for float64 (appropriate for double precision)
- Handle small values near zero with absolute error fallback

This fixes issue eth-cscs#153 where K-split strategy was incorrectly reported
as producing 93.6% errors when actual relative errors were < 1e-6.

Tested with:
- 32x896x896 float32: now passes (was 93.8% false errors)
- 32x10000x896 float32: now passes (was 93.6% false errors)
- 32x32x32 float64: still passes (regression test)
@dbsanfte dbsanfte force-pushed the fix/k-split-coordinate-mapping branch from 13ed177 to ac569da on October 19, 2025 10:33
@dbsanfte
Author

Fixed to only include the relevant change.

Development

Successfully merging this pull request may close these issues.

Validation tolerance for mathematical correctness should use relative error in most circumstances
