
REGRESSION: Format conversion still produces large diffs after #177 fix #181


REGRESSION: Format Conversion Still Failing After #177 Fix

Status: REGRESSION from closed #177
Severity: P0 (CRITICAL - Data Corruption)
Component: apr-rosetta / realizear
Discovered By: apr-model-qa-playbook requalification (2026-01-30)
Blocking: Model qualification certification


Executive Summary

Issue #177 was closed, but requalification testing on 2026-01-30 shows format conversion still fails with large output differences. The Jidoka detection is working (diffs are flagged), but the root cause fix is incomplete.


Regression Evidence

Test Environment

Date: 2026-01-30T14:59:00Z
Host: noah-Lambda-Vector
Model: Qwen/Qwen2.5-Coder-1.5B-Instruct (GGUF Q4_K_M)
Path: /home/noah/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-1.5B-Instruct-GGUF/snapshots/.../qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Playbook: qwen2.5-coder-1.5b-ci.playbook.yaml

Test Results

Total scenarios: 57
Passed: 50
Failed: 7  ← ALL 7 ARE FORMAT CONVERSION
Pass rate: 89.3%  ← Should be 100%

Detailed Failures

| Gate | Conversion | Diff | Tolerance | Verdict |
|------|------------|------|-----------|---------|
| F-CONV-001 | GGUF → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL (677,000× over tolerance) |
| F-CONV-002 | APR → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL (416,000× over tolerance) |
| F-CONV-003 | GGUF → SafeTensors | Infrastructure error | – | ❌ FAIL (see below) |
| F-CONV-004 | SafeTensors → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-005 | APR → SafeTensors | Infrastructure error | – | ❌ FAIL (see below) |
| F-CONV-006 | SafeTensors → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-RT-001 | Round-trip | Blocked | – | ❌ FAIL |

Raw Evidence from evidence.json

```json
{
  "gate_id": "F-CONV-G-A",
  "outcome": "Falsified",
  "reason": "Conversion Gguf → Apr produced different output (diff: 6.77e-1, ε: 1.00e-6)",
  "output": "6de63189564fc936",
  "timestamp": "2026-01-30T14:07:23.xxx"
}
{
  "gate_id": "F-CONV-A-G",
  "outcome": "Falsified",
  "reason": "Conversion Apr → Gguf produced different output (diff: 4.16e-1, ε: 1.00e-6)",
  "output": "0356a3e657672e25",
  "timestamp": "2026-01-30T14:07:35.xxx"
}
```

Comparison: Before vs After #177 Fix

| Metric | Before #177 | After #177 | Status |
|--------|-------------|------------|--------|
| NaN detection | ❌ Silent | ✅ Detected | FIXED |
| Inf detection | ❌ Silent | ✅ Detected | FIXED |
| Output diff (GGUF → APR) | 8.46e-1 | 6.77e-1 | IMPROVED (20% reduction) |
| Output diff (APR → GGUF) | 6.34e-1 | 4.16e-1 | IMPROVED (34% reduction) |
| Within tolerance (ε = 1e-6) | ❌ No | ❌ No | STILL FAILING |
| Round-trip lossless | ❌ No | ❌ No | STILL FAILING |

Conclusion: #177 fix improved detection and reduced diff magnitude, but diffs are still 400,000× to 700,000× above tolerance.


Root Cause Hypothesis

The #177 fix addressed:

  1. ✅ NaN/Inf detection (Jidoka working)
  2. ✅ Some quantization parameter handling

But did NOT address:

  1. ❌ Quantization scale/offset precision loss
  2. ❌ Block-wise quantization metadata transfer
  3. ❌ Q4_K_M super-block structure preservation

Technical Detail

Q4_K_M uses a two-level quantization structure:

```
Super-block (256 elements):
  - Scale d (fp16)
  - Min dmin (fp16)
  - 8× sub-blocks of 32 elements each
    - 6-bit sub-block scale and min (packed into 12 bytes per super-block)
    - 4-bit quantized weights
```
If the super-block scales are truncated or misaligned during conversion, all weights in that block will be off by a multiplicative factor, leading to the large cumulative diffs we observe.
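To make the multiplicative coupling concrete, here is a minimal dequantization sketch following the published Q4_K formula (w = d·sc·q − dmin·m). The function name, the flat nibble ordering, and the use of the `half` crate are illustrative assumptions, not apr-rosetta's actual code:

```rust
use half::f16; // fp16 support via the `half` crate (assumption)

/// Illustrative only: dequantize one 32-element sub-block.
/// `d`/`dmin` are the fp16 super-block scale/min; `sc`/`m` are the unpacked
/// 6-bit sub-block scale/min; `qs` packs 32 4-bit weights into 16 bytes.
fn dequantize_sub_block(d: f16, dmin: f16, sc: u8, m: u8, qs: &[u8; 16]) -> [f32; 32] {
    let scale = d.to_f32() * sc as f32; // any corruption of `d` scales EVERY weight below
    let min = dmin.to_f32() * m as f32;
    let mut out = [0.0f32; 32];
    for (i, &b) in qs.iter().enumerate() {
        out[2 * i] = scale * (b & 0x0F) as f32 - min;   // low nibble
        out[2 * i + 1] = scale * (b >> 4) as f32 - min; // high nibble
    }
    out
}
```

Because `scale` multiplies every reconstructed weight, even a small relative error in `d` repeats 32 times per sub-block and 256 times per super-block, consistent with the large aggregate diffs above.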


Suggested Additional Fixes

1. Preserve Full Quantization Metadata

```rust
// Q4_K_M super-block layout: 2 + 2 + 12 + 128 = 144 bytes per 256 weights.
// `f16` here assumes the `half` crate (or an equivalent fp16 type).
struct Q4KMSuperBlock {
    d: f16,            // super-block scale - MUST preserve full precision
    dmin: f16,         // super-block min - MUST preserve full precision
    scales: [u8; 12],  // packed 6-bit sub-block scales/mins - MUST preserve bit-exact
    qs: [u8; 128],     // 4-bit quantized values, two per byte
}

// During conversion, ensure:
// 1. d and dmin are carried through as their original f16 bit patterns,
//    never recomputed from dequantized weights
// 2. the scales array is copied bit-exact, not recomputed
// 3. block alignment matches the source format
```
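A conversion that treats each super-block as opaque bytes satisfies all three constraints by construction. A minimal sketch under that assumption (the function and caller-supplied byte layout are hypothetical, not apr-rosetta's API):

```rust
/// Hedged sketch: carry one 144-byte Q4_K_M super-block across formats as
/// raw bytes, so d, dmin, and the packed 6-bit scales survive bit-exact
/// instead of being dequantized to f32 and re-quantized.
fn transfer_super_block(raw: &[u8; 144], out: &mut Vec<u8>) {
    out.extend_from_slice(raw); // bit-exact copy; no float round-trip
}
```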

2. Add Tensor-Level Validation

```rust
// Per-tensor Jidoka check: compare the worst-case element error against the
// spec tolerance (Section 4: ε = 1e-6). The Tensor API shown is the issue's
// pseudocode, not a concrete crate.
const EPSILON: f32 = 1e-6;

fn validate_conversion(source: &Tensor, converted: &Tensor) -> Result<()> {
    // Max absolute element-wise difference after dequantizing both sides to f32.
    let diff = (source.to_f32() - converted.to_f32()).abs().max();
    if diff > EPSILON {
        return Err(ConversionError::LossyConversion {
            diff,
            tolerance: EPSILON,
            tensor_name: source.name.clone(),
        });
    }
    Ok(())
}
```
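A hedged usage sketch for the check above (`Model` and `tensors()` are illustrative names, not apr-rosetta's actual API), halting on the first lossy tensor so the offending block can be identified rather than only an aggregate diff reported:

```rust
fn validate_all(source: &Model, converted: &Model) -> Result<()> {
    // Pairwise walk over corresponding tensors; `?` stops the line
    // (Jidoka) at the first tensor that exceeds EPSILON.
    for (s, c) in source.tensors().iter().zip(converted.tensors().iter()) {
        validate_conversion(s, c)?;
    }
    Ok(())
}
```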

3. Test Each Quantization Type Separately

```bash
# Test suite should cover:
apr rosetta convert model_q4_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q5_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q8_0.gguf   test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_f16.gguf    test.apr && apr rosetta convert test.apr model_back.gguf
# All should produce diff < 1e-6
```
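If a round trip is truly lossless at the byte level, the regenerated GGUF should hash identically to the original. A hedged shell sketch; whole-file hashes only match when headers and metadata ordering are also canonical, so treat a mismatch as a prompt to diff tensors, not as proof of corruption:

```bash
# Assumption: byte-identical files imply a lossless round trip; the converse
# does not hold if metadata is re-serialized in a different order.
sha256sum model_q4_k_m.gguf model_back.gguf
```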

MQS Impact

| Metric | Current | Required |
|--------|---------|----------|
| Score | 41.1/100 | 87+/100 |
| Grade | F | B or higher |
| Conversion gates | 0/7 | 7/7 |
| Lost points | ~45 | 0 |

Verification Criteria

Issue is resolved when:

```bash
cd ../apr-model-qa-playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-ci.playbook.yaml \
  --subprocess --model-path <model.gguf> --no-gpu --output output/verify

# Required:
# - F-CONV-001 through F-CONV-006: ALL PASS (diff < 1e-6)
# - F-CONV-RT-001: PASS (round-trip lossless)
# - MQS Score: 87+/100
# - Pass rate: 100%
```

References

  • Original issue: P0 CRITICAL: Format conversion introduces NaN/Inf corruption in tensor weights #177 (CLOSED - but regression detected)
  • Evidence file: ../apr-model-qa-playbook/output/qwen-requalify/evidence.json
  • MQS report: ../apr-model-qa-playbook/output/qwen-requalify/mqs.json
  • Verification playbook: ../apr-model-qa-playbook/playbooks/verify/TICKET-177.yaml
  • Spec: Section 4 (Format Conversion Testing), tolerance = 1e-6

Filed by: apr-model-qa-playbook requalification (automated)
Related: #177 (regression), #172 (original P0)
