
REGRESSION: Format conversion still produces large diffs after #177 fix #181


REGRESSION: Format Conversion Still Failing After #177 Fix

Status: REGRESSION from closed #177
Severity: P0 (CRITICAL - Data Corruption)
Component: apr-rosetta / realizear
Discovered By: apr-model-qa-playbook requalification (2026-01-30)
Blocking: Model qualification certification


Executive Summary

Issue #177 was closed, but requalification testing on 2026-01-30 shows format conversion still fails with large output differences. The Jidoka detection is working (diffs are flagged), but the root cause fix is incomplete.


Regression Evidence

Test Environment

Date: 2026-01-30T14:59:00Z
Host: noah-Lambda-Vector
Model: Qwen/Qwen2.5-Coder-1.5B-Instruct (GGUF Q4_K_M)
Path: /home/noah/.cache/huggingface/hub/models--Qwen--Qwen2.5-Coder-1.5B-Instruct-GGUF/snapshots/.../qwen2.5-coder-1.5b-instruct-q4_k_m.gguf
Playbook: qwen2.5-coder-1.5b-ci.playbook.yaml

Test Results

Total scenarios: 57
Passed: 50
Failed: 7  ← ALL 7 ARE FORMAT CONVERSION
Pass rate: 89.3%  ← Should be 100%

Detailed Failures

| Gate | Conversion | Diff | Tolerance | Verdict |
|------|------------|------|-----------|---------|
| F-CONV-001 | GGUF → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL (677,000× over tolerance) |
| F-CONV-002 | APR → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL (416,000× over tolerance) |
| F-CONV-003 | GGUF → SafeTensors | Infrastructure error | – | ❌ FAIL (see below) |
| F-CONV-004 | SafeTensors → GGUF | 4.16e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-005 | APR → SafeTensors | Infrastructure error | – | ❌ FAIL (see below) |
| F-CONV-006 | SafeTensors → APR | 6.77e-1 | 1.00e-6 | ❌ FAIL |
| F-CONV-RT-001 | Round-trip | Blocked | – | ❌ FAIL |

Raw Evidence from evidence.json

```json
{
  "gate_id": "F-CONV-G-A",
  "outcome": "Falsified",
  "reason": "Conversion Gguf → Apr produced different output (diff: 6.77e-1, ε: 1.00e-6)",
  "output": "6de63189564fc936",
  "timestamp": "2026-01-30T14:07:23.xxx"
}
{
  "gate_id": "F-CONV-A-G",
  "outcome": "Falsified",
  "reason": "Conversion Apr → Gguf produced different output (diff: 4.16e-1, ε: 1.00e-6)",
  "output": "0356a3e657672e25",
  "timestamp": "2026-01-30T14:07:35.xxx"
}
```

Comparison: Before vs After #177 Fix

| Metric | Before #177 | After #177 | Status |
|--------|-------------|------------|--------|
| NaN detection | ❌ Silent | ✅ Detected | FIXED |
| Inf detection | ❌ Silent | ✅ Detected | FIXED |
| Output diff (GGUF → APR) | 8.46e-1 | 6.77e-1 | IMPROVED (20% reduction) |
| Output diff (APR → GGUF) | 6.34e-1 | 4.16e-1 | IMPROVED (34% reduction) |
| Within tolerance (ε = 1e-6) | ❌ No | ❌ No | STILL FAILING |
| Round-trip lossless | ❌ No | ❌ No | STILL FAILING |

Conclusion: #177 fix improved detection and reduced diff magnitude, but diffs are still 400,000× to 700,000× above tolerance.


Root Cause Hypothesis

The #177 fix addressed:

  1. ✅ NaN/Inf detection (Jidoka working)
  2. ✅ Some quantization parameter handling

But did NOT address:

  1. ❌ Quantization scale/offset precision loss
  2. ❌ Block-wise quantization metadata transfer
  3. ❌ Q4_K_M super-block structure preservation

Technical Detail

Q4_K_M uses a two-level quantization structure:

```
Super-block (256 elements):
  - Scale d (fp16)
  - Min dmin (fp16)
  - 8× sub-blocks of 32 elements each
    - 6-bit sub-block scale and min (packed into 12 bytes per super-block)
    - 4-bit quantized weights
```
If the super-block scales are truncated or misaligned during conversion, all weights in that block will be off by a multiplicative factor, leading to the large cumulative diffs we observe.
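To make the multiplicative coupling concrete, here is a minimal dequantization sketch following the published Q4_K formula (w = d·sc·q − dmin·m). The function name, the flat nibble ordering, and the use of the `half` crate are illustrative assumptions, not apr-rosetta's actual code:

```rust
use half::f16; // fp16 support via the `half` crate (assumption)

/// Illustrative only: dequantize one 32-element sub-block.
/// `d`/`dmin` are the fp16 super-block scale/min; `sc`/`m` are the unpacked
/// 6-bit sub-block scale/min; `qs` packs 32 4-bit weights into 16 bytes.
fn dequantize_sub_block(d: f16, dmin: f16, sc: u8, m: u8, qs: &[u8; 16]) -> [f32; 32] {
    let scale = d.to_f32() * sc as f32; // any corruption of `d` scales EVERY weight below
    let min = dmin.to_f32() * m as f32;
    let mut out = [0.0f32; 32];
    for (i, &b) in qs.iter().enumerate() {
        out[2 * i] = scale * (b & 0x0F) as f32 - min;   // low nibble
        out[2 * i + 1] = scale * (b >> 4) as f32 - min; // high nibble
    }
    out
}
```

Because `scale` multiplies every reconstructed weight, even a small relative error in `d` repeats 32 times per sub-block and 256 times per super-block, consistent with the large aggregate diffs above.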


Suggested Additional Fixes

1. Preserve Full Quantization Metadata

```rust
// Q4_K_M super-block layout: 2 + 2 + 12 + 128 = 144 bytes per 256 weights.
// `f16` here assumes the `half` crate (or an equivalent fp16 type).
struct Q4KMSuperBlock {
    d: f16,            // super-block scale - MUST preserve full precision
    dmin: f16,         // super-block min - MUST preserve full precision
    scales: [u8; 12],  // packed 6-bit sub-block scales/mins - MUST preserve bit-exact
    qs: [u8; 128],     // 4-bit quantized values, two per byte
}

// During conversion, ensure:
// 1. d and dmin are carried through as their original f16 bit patterns,
//    never recomputed from dequantized weights
// 2. the scales array is copied bit-exact, not recomputed
// 3. block alignment matches the source format
```
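A conversion that treats each super-block as opaque bytes satisfies all three constraints by construction. A minimal sketch under that assumption (the function and caller-supplied byte layout are hypothetical, not apr-rosetta's API):

```rust
/// Hedged sketch: carry one 144-byte Q4_K_M super-block across formats as
/// raw bytes, so d, dmin, and the packed 6-bit scales survive bit-exact
/// instead of being dequantized to f32 and re-quantized.
fn transfer_super_block(raw: &[u8; 144], out: &mut Vec<u8>) {
    out.extend_from_slice(raw); // bit-exact copy; no float round-trip
}
```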

2. Add Tensor-Level Validation

```rust
// Per-tensor Jidoka check: compare the worst-case element error against the
// spec tolerance (Section 4: ε = 1e-6). The Tensor API shown is the issue's
// pseudocode, not a concrete crate.
const EPSILON: f32 = 1e-6;

fn validate_conversion(source: &Tensor, converted: &Tensor) -> Result<()> {
    // Max absolute element-wise difference after dequantizing both sides to f32.
    let diff = (source.to_f32() - converted.to_f32()).abs().max();
    if diff > EPSILON {
        return Err(ConversionError::LossyConversion {
            diff,
            tolerance: EPSILON,
            tensor_name: source.name.clone(),
        });
    }
    Ok(())
}
```
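A hedged usage sketch for the check above (`Model` and `tensors()` are illustrative names, not apr-rosetta's actual API), halting on the first lossy tensor so the offending block can be identified rather than only an aggregate diff reported:

```rust
fn validate_all(source: &Model, converted: &Model) -> Result<()> {
    // Pairwise walk over corresponding tensors; `?` stops the line
    // (Jidoka) at the first tensor that exceeds EPSILON.
    for (s, c) in source.tensors().iter().zip(converted.tensors().iter()) {
        validate_conversion(s, c)?;
    }
    Ok(())
}
```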

3. Test Each Quantization Type Separately

```bash
# Test suite should cover:
apr rosetta convert model_q4_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q5_k_m.gguf test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_q8_0.gguf   test.apr && apr rosetta convert test.apr model_back.gguf
apr rosetta convert model_f16.gguf    test.apr && apr rosetta convert test.apr model_back.gguf
# All should produce diff < 1e-6
```
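If a round trip is truly lossless at the byte level, the regenerated GGUF should hash identically to the original. A hedged shell sketch; whole-file hashes only match when headers and metadata ordering are also canonical, so treat a mismatch as a prompt to diff tensors, not as proof of corruption:

```bash
# Assumption: byte-identical files imply a lossless round trip; the converse
# does not hold if metadata is re-serialized in a different order.
sha256sum model_q4_k_m.gguf model_back.gguf
```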

MQS Impact

| Metric | Current | Required |
|--------|---------|----------|
| Score | 41.1/100 | 87+/100 |
| Grade | F | B or higher |
| Conversion gates | 0/7 | 7/7 |
| Lost points | ~45 | 0 |

Verification Criteria

Issue is resolved when:

```bash
cd ../apr-model-qa-playbook
cargo run --bin apr-qa -- run playbooks/models/qwen2.5-coder-1.5b-ci.playbook.yaml \
  --subprocess --model-path <model.gguf> --no-gpu --output output/verify

# Required:
# - F-CONV-001 through F-CONV-006: ALL PASS (diff < 1e-6)
# - F-CONV-RT-001: PASS (round-trip lossless)
# - MQS Score: 87+/100
# - Pass rate: 100%
```

References

  • Original issue: P0 CRITICAL: Format conversion introduces NaN/Inf corruption in tensor weights #177 (CLOSED - but regression detected)
  • Evidence file: ../apr-model-qa-playbook/output/qwen-requalify/evidence.json
  • MQS report: ../apr-model-qa-playbook/output/qwen-requalify/mqs.json
  • Verification playbook: ../apr-model-qa-playbook/playbooks/verify/TICKET-177.yaml
  • Spec: Section 4 (Format Conversion Testing), tolerance = 1e-6

Filed by: apr-model-qa-playbook requalification (automated)
Related: #177 (regression), #172 (original P0)
