APR Rosetta: Add format-aware differential tracing to detect embedding/weight layout bugs #187

@noahgift

Problem

APR Q4_K inference produces garbage output (PAD token 151935), while GGUF Q4_K inference produces correct output. This class of bug has occurred 50+ times and is extremely difficult to debug without proper tracing.

Root cause found: the embedding tensor is stored as [hidden_dim, vocab_size] (GGML convention), but embed() expects a [vocab_size, hidden_dim] layout. Because of this transposition mismatch, token lookups read the wrong data.
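
To make the failure mode concrete, here is a minimal sketch on a toy table. The function names and the toy values are assumptions for illustration only, not the actual APR loader or embed() code. It stores a table as [hidden_dim, vocab_size] and compares a lookup that assumes [vocab_size, hidden_dim] against one that respects the stored layout.

```rust
/// Lookup that assumes `table` is laid out as [vocab_size, hidden_dim]
/// (one contiguous row per token).
fn embed_row_major(table: &[f32], hidden_dim: usize, token: usize) -> Vec<f32> {
    let start = token * hidden_dim;
    table[start..start + hidden_dim].to_vec()
}

/// Lookup for a table stored as [hidden_dim, vocab_size]:
/// the features of one token are `vocab_size` elements apart.
fn embed_hidden_major(
    table: &[f32],
    vocab_size: usize,
    hidden_dim: usize,
    token: usize,
) -> Vec<f32> {
    (0..hidden_dim).map(|h| table[h * vocab_size + token]).collect()
}

/// Toy table stored as [hidden_dim, vocab_size], where the value for
/// (token t, feature h) is t * 10 + h so that misreads are easy to spot.
fn toy_table_hidden_major(vocab_size: usize, hidden_dim: usize) -> Vec<f32> {
    let mut table = vec![0.0; vocab_size * hidden_dim];
    for h in 0..hidden_dim {
        for t in 0..vocab_size {
            table[h * vocab_size + t] = (t * 10 + h) as f32;
        }
    }
    table
}
```

With vocab_size = 3 and hidden_dim = 2, token 1 should map to [10.0, 11.0]. The row-major lookup returns [20.0, 1.0] instead, mixing values from two different tokens, while the strided lookup returns the expected vector.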

Current State

  • --trace only shows timing data, not tensor values
  • No comparison between formats (GGUF vs APR)
  • No automatic detection of layout mismatches
  • No embedding sanity checks

Requirements

1. Enhanced Default Logging (P0)

When running apr rosetta conversions or apr run with APR files:

  • Log tensor shapes and verify they match expected model config
  • Log first 5 values of embedding tensor after load
  • Detect and warn on [hidden_dim, vocab_size] vs [vocab_size, hidden_dim] mismatch
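
The shape check and value sample described in this list could look roughly like the sketch below. EmbeddingLayout, classify_embedding_shape, and describe_embedding are hypothetical names, and the sketch assumes the loader can see the tensor's declared 2-D shape and the model config's vocab_size and hidden_dim. It returns a log line as a String rather than calling a specific logging crate.

```rust
/// Possible interpretations of a declared embedding shape (hypothetical names).
#[derive(Debug, PartialEq)]
enum EmbeddingLayout {
    /// [vocab_size, hidden_dim]
    VocabMajor,
    /// [hidden_dim, vocab_size]
    HiddenMajor,
    /// vocab_size == hidden_dim, so the shape alone cannot tell the layouts apart
    Ambiguous,
    /// Not 2-D, or the dimensions do not match the model config
    Unexpected,
}

/// Compares the declared shape against the model config.
fn classify_embedding_shape(
    shape: &[usize],
    vocab_size: usize,
    hidden_dim: usize,
) -> EmbeddingLayout {
    match shape {
        &[a, b] if a == vocab_size && b == hidden_dim => {
            if vocab_size == hidden_dim {
                EmbeddingLayout::Ambiguous
            } else {
                EmbeddingLayout::VocabMajor
            }
        }
        &[a, b] if a == hidden_dim && b == vocab_size => EmbeddingLayout::HiddenMajor,
        _ => EmbeddingLayout::Unexpected,
    }
}

/// Builds a log line with the shape, the detected layout, and up to the
/// first 5 values of the tensor.
fn describe_embedding(
    shape: &[usize],
    values: &[f32],
    vocab_size: usize,
    hidden_dim: usize,
) -> String {
    let layout = classify_embedding_shape(shape, vocab_size, hidden_dim);
    let sample = &values[..values.len().min(5)];
    format!(
        "embedding shape={:?} layout={:?} first_values={:?}",
        shape, layout, sample
    )
}
```

A loader could emit this line unconditionally and escalate to a warning whenever the layout is anything other than VocabMajor.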

2. Format-Aware Differential Tracing (P1)

New --trace-diff flag for apr run:

apr run model.gguf model.apr "2+2?" --trace-diff

  • Compare token-by-token output between two model formats
  • Show first divergence point
  • Classify bug type (WEIGHT_LOAD_FAILURE, EMBEDDING_FAILURE, etc.)
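
A minimal sketch of the comparison step, assuming both runs yield their generated token ids as slices of u32. BugClass, label, and first_divergence are hypothetical names. Only the two failure modes named above are listed, and the rule for assigning a class to a divergence is deliberately left out because the issue does not specify one.

```rust
/// Failure classes named in this issue; a real enum would likely have more.
#[derive(Debug, Clone, Copy, PartialEq)]
enum BugClass {
    WeightLoadFailure,
    EmbeddingFailure,
}

impl BugClass {
    /// Label in the style used in this issue.
    fn label(self) -> &'static str {
        match self {
            BugClass::WeightLoadFailure => "WEIGHT_LOAD_FAILURE",
            BugClass::EmbeddingFailure => "EMBEDDING_FAILURE",
        }
    }
}

/// Returns the index of the first position where the two token streams
/// differ, or None if they are identical. If one stream is a strict prefix
/// of the other, the divergence is reported at the end of the shorter one.
fn first_divergence(reference: &[u32], candidate: &[u32]) -> Option<usize> {
    let common = reference.len().min(candidate.len());
    (0..common)
        .find(|&i| reference[i] != candidate[i])
        .or_else(|| (reference.len() != candidate.len()).then_some(common))
}
```

Here the GGUF output would be passed as reference and the APR output as candidate, so the reported index is the first token at which APR departs from the known-good run.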

3. Embedding Sanity Check (P0)

Add validation in APR loader:

// Verify embedding element count matches vocab_size * hidden_dim
// (the count is the same for both layouts, so this alone cannot detect a transposition)
let expected_size = vocab_size * hidden_dim;
if token_embedding.len() != expected_size {
    warn!("Embedding size mismatch: got {}, expected {}", ...);
}
// Check first token produces non-zero, non-garbage values
let test_embed = embed(&[0]);
if test_embed.iter().all(|&x| x == 0.0) {
    error!("Embedding produces all zeros - likely transposition bug");
}
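
A self-contained variant of the two checks above, written as plain functions so they can be unit tested. element_count_matches and check_embedding_row are hypothetical names, and the non-finite check is an added assumption about what counts as garbage. The element count is symmetric in vocab_size and hidden_dim, so it passes for either layout and has to be combined with a shape or value check to catch a transposition.

```rust
/// True if `len` equals vocab_size * hidden_dim. The product is the same
/// for [vocab_size, hidden_dim] and [hidden_dim, vocab_size].
fn element_count_matches(len: usize, vocab_size: usize, hidden_dim: usize) -> bool {
    // checked_mul avoids a silent overflow on corrupt config values.
    vocab_size.checked_mul(hidden_dim) == Some(len)
}

/// Basic value checks on one embedded token vector.
fn check_embedding_row(row: &[f32]) -> Result<(), String> {
    if row.is_empty() {
        return Err("embedding row is empty".to_string());
    }
    if let Some(i) = row.iter().position(|x| !x.is_finite()) {
        return Err(format!("embedding row has a non-finite value at index {i}"));
    }
    if row.iter().all(|&x| x == 0.0) {
        return Err("embedding row is all zeros".to_string());
    }
    Ok(())
}
```

Returning a Result instead of logging directly lets the caller decide whether a failed check is a warning or a hard error.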

Acceptance Criteria

  • APR loader logs embedding shape on load (always, not just debug mode)
  • APR loader detects and warns on embedding transposition
  • apr run --trace shows tensor value samples, not just timing
  • Bug classification enum exists for common failure modes
  • Regression test for embedding transposition detection

Labels: bug, enhancement
