
Add Binary-Level Borsh Serialization Documentation #82

@rz1989s


Problem Statement

Context7 Benchmark Impact: Question 1 scored 78/100

Context7 Q1 feedback:

"However, the context lacks detailed explanations of the actual serialization/deserialization mechanisms at the binary level, error handling patterns, and practical troubleshooting guidance when type mismatches occur between Rust and TypeScript implementations."

Specific gaps:

  • Missing binary-level serialization mechanism explanations
  • No low-level Borsh format documentation
  • Lack of hex dump interpretation guides
  • No debugging tools for serialization issues

Proposed Solution

Create comprehensive documentation explaining how LUMOS uses Borsh serialization at the binary level, including type-specific encodings, debugging techniques, and common serialization bugs.

### 1. New File: `docs/internals/borsh-serialization.md`

Comprehensive Binary Format Guide:

# Borsh Serialization Internals

## Overview

LUMOS uses [Borsh](https://borsh.io) (Binary Object Representation Serializer for Hashing) to ensure deterministic serialization between Rust and TypeScript. This guide explains the low-level binary format for each type.

## Serialization Pipeline

### 1. LUMOS Schema → IR
```rust
// schema.lumos
#[solana]
#[account]
struct PlayerAccount {
    wallet: PublicKey,
    level: u16,
    score: u64,
}
```

### 2. IR → Rust Struct with BorshSerialize

```rust
use borsh::{BorshSerialize, BorshDeserialize};

#[derive(BorshSerialize, BorshDeserialize)]
pub struct PlayerAccount {
    pub wallet: Pubkey,
    pub level: u16,
    pub score: u64,
}
```

### 3. Data → Binary Format

When serialized, the struct becomes:

```
[32 bytes: wallet][2 bytes: level][8 bytes: score]
Total: 42 bytes
```
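
As a rough sketch of that layout, the function below writes the three fields by hand using only the standard library. It does not use the `borsh` crate; `encode_player_account` and the raw `[u8; 32]` wallet are stand-ins introduced here for illustration.

```rust
/// Hand-rolled encoding of PlayerAccount's fields in declaration order.
/// Illustrative only: real code would derive BorshSerialize instead.
fn encode_player_account(wallet: &[u8; 32], level: u16, score: u64) -> Vec<u8> {
    let mut out = Vec::with_capacity(42);
    out.extend_from_slice(wallet);               // 32 raw key bytes
    out.extend_from_slice(&level.to_le_bytes()); // 2 bytes, little-endian
    out.extend_from_slice(&score.to_le_bytes()); // 8 bytes, little-endian
    out
}
```

Calling `encode_player_account(&[0u8; 32], 10, 500)` yields 42 bytes: 32 zero bytes, then `0A 00`, then `F4 01 00 00 00 00 00 00`.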

Binary Layout by Type

Primitive Types

Unsigned Integers (Little-Endian)

  • u8: 1 byte
    • Example: 255 → 0xFF
  • u16: 2 bytes
    • Example: 1000 → 0xE8 0x03
  • u32: 4 bytes
    • Example: 1000000 → 0x40 0x42 0x0F 0x00
  • u64: 8 bytes
    • Example: 1000000000 → 0x00 0xCA 0x9A 0x3B 0x00 0x00 0x00 0x00
  • u128: 16 bytes
    • Example: 1 → 0x01 followed by fifteen 0x00 bytes

Signed Integers (Little-Endian, Two's Complement)

  • i8: 1 byte
    • Example: -1 → 0xFF
  • i16: 2 bytes
    • Example: -1000 → 0x18 0xFC
  • i32: 4 bytes
  • i64: 8 bytes
  • i128: 16 bytes
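
The integer encodings can be reproduced with the standard library alone, since `to_le_bytes` produces the same little-endian, two's-complement bytes. This is a small sketch for checking values by hand; `integer_examples` is a name made up for this example.

```rust
/// Encode a few sample integers as little-endian bytes.
/// Hypothetical helper; the values are illustrative.
fn integer_examples() -> Vec<Vec<u8>> {
    vec![
        255u8.to_le_bytes().to_vec(),          // FF
        1000u16.to_le_bytes().to_vec(),        // E8 03
        1_000_000u32.to_le_bytes().to_vec(),   // 40 42 0F 00
        (-1i8).to_le_bytes().to_vec(),         // FF
        (-1000i16).to_le_bytes().to_vec(),     // 18 FC
        1u128.to_le_bytes().to_vec(),          // 01 followed by fifteen 00 bytes
    ]
}
```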

Boolean

  • bool: 1 byte
    • true → 0x01
    • false → 0x00

Solana-Specific Types

PublicKey

  • Size: 32 bytes (fixed)
  • Format: Raw bytes of the public key
  • Example:
    PublicKey("11111111111111111111111111111111")
    → 0x00 0x00 0x00 0x00 ... (32 bytes)
    

Signature

  • Size: 64 bytes (fixed)
  • Format: Raw signature bytes

String

  • Format: [4-byte length prefix][UTF-8 bytes]
  • Example: "hello"
    Length: 5 → 0x05 0x00 0x00 0x00
    UTF-8: "hello" → 0x68 0x65 0x6C 0x6C 0x6F
    Total: [0x05 0x00 0x00 0x00 0x68 0x65 0x6C 0x6C 0x6F]
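
A minimal sketch of the string layout, written against the standard library rather than the `borsh` crate. `encode_string` is a hypothetical helper; note that the prefix counts UTF-8 bytes, not characters.

```rust
/// Borsh-style string: u32 little-endian byte length, then the UTF-8 bytes.
/// Hypothetical helper for illustration.
fn encode_string(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + s.len());
    out.extend_from_slice(&(s.len() as u32).to_le_bytes()); // byte count, not char count
    out.extend_from_slice(s.as_bytes());
    out
}
```

For example, `encode_string("é")` produces a length prefix of 2, because `é` occupies two bytes in UTF-8.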
    

Vec

  • Format: [4-byte length][element 1][element 2]...[element n]
  • Example: Vec<u16>([10, 20, 30])
    Length: 3 → 0x03 0x00 0x00 0x00
    Element 1: 10 → 0x0A 0x00
    Element 2: 20 → 0x14 0x00
    Element 3: 30 → 0x1E 0x00
    Total: [0x03 0x00 0x00 0x00 0x0A 0x00 0x14 0x00 0x1E 0x00]
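
The same idea for a vector, sketched by hand for `u16` elements. `encode_vec_u16` is a hypothetical helper; unlike the string prefix, the vector prefix counts elements rather than bytes.

```rust
/// Borsh-style Vec<u16>: u32 little-endian element count, then each element.
/// Hypothetical helper for illustration.
fn encode_vec_u16(items: &[u16]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + 2 * items.len());
    out.extend_from_slice(&(items.len() as u32).to_le_bytes()); // element count
    for item in items {
        out.extend_from_slice(&item.to_le_bytes()); // 2 bytes per element
    }
    out
}
```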
    

Option

  • Format: [1-byte discriminant][value if Some]
  • Discriminant:
    • None → 0x00
    • Some(value) → 0x01 [serialized value]
  • Example: Option<u32>
    • None → 0x00
    • Some(1000) → 0x01 0xE8 0x03 0x00 0x00
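
A hand-written sketch of the `Option<u32>` encoding, using only the standard library. `encode_option_u32` is a hypothetical helper, not a `borsh` API.

```rust
/// Borsh-style Option<u32>: one tag byte, then the value only when present.
/// Hypothetical helper for illustration.
fn encode_option_u32(value: Option<u32>) -> Vec<u8> {
    match value {
        None => vec![0x00],
        Some(v) => {
            let mut out = vec![0x01];
            out.extend_from_slice(&v.to_le_bytes()); // 4 bytes, little-endian
            out
        }
    }
}
```

`encode_option_u32(Some(1000))` gives `01 E8 03 00 00`, and `encode_option_u32(Some(1_000_000))` gives `01 40 42 0F 00`.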

Enum

  • Format: [1-byte discriminant][variant data]
  • Discriminant: Sequential (0, 1, 2, ...)

Example Enum:

enum GameState {
    Active,              // discriminant: 0
    Paused,              // discriminant: 1
    Finished { score: u64 },  // discriminant: 2
}

Serialization:

  • GameState::Active → 0x00
  • GameState::Paused → 0x01
  • GameState::Finished { score: 1000 } → 0x02 0xE8 0x03 0x00 0x00 0x00 0x00 0x00 0x00
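
The enum encoding can be sketched by hand as a match over variants. The function below mirrors the `GameState` example and uses only the standard library; `encode_game_state` is a hypothetical helper.

```rust
/// Mirror of the GameState enum from the example.
enum GameState {
    Active,
    Paused,
    Finished { score: u64 },
}

/// One discriminant byte (variant index), followed by the variant's fields.
/// Hypothetical helper for illustration.
fn encode_game_state(state: &GameState) -> Vec<u8> {
    match state {
        GameState::Active => vec![0],
        GameState::Paused => vec![1],
        GameState::Finished { score } => {
            let mut out = vec![2];
            out.extend_from_slice(&score.to_le_bytes()); // 8 bytes, little-endian
            out
        }
    }
}
```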

Struct

  • Format: Fields serialized in declaration order
  • No padding: Fields are tightly packed
  • Field order matters: Reordering fields changes the byte layout, so previously serialized data no longer deserializes correctly

Example:

struct Player {
    level: u16,      // Offset 0: 2 bytes
    score: u64,      // Offset 2: 8 bytes
    name: String,    // Offset 10: 4 + string length
}

Serialization of Player { level: 5, score: 100, name: "Alice" }:

[0x05 0x00]                              // level: 5
[0x64 0x00 0x00 0x00 0x00 0x00 0x00 0x00] // score: 100
[0x05 0x00 0x00 0x00]                    // name length: 5
[0x41 0x6C 0x69 0x63 0x65]               // name: "Alice"
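
Reading the bytes back follows the same order. The decoder below is a sketch using only the standard library; `read_array` and `decode_player` are hypothetical helpers, and returning `None` on short or invalid input is a simplification of real error handling.

```rust
/// Mirror of the Player struct from the example.
#[derive(Debug, PartialEq)]
struct Player {
    level: u16,
    score: u64,
    name: String,
}

/// Take N bytes from the front of `input`, advancing it; None if too short.
fn read_array<const N: usize>(input: &mut &[u8]) -> Option<[u8; N]> {
    if input.len() < N {
        return None;
    }
    let (head, tail) = input.split_at(N);
    let mut buf = [0u8; N];
    buf.copy_from_slice(head);
    *input = tail;
    Some(buf)
}

/// Decode fields in declaration order: u16, u64, then a length-prefixed string.
fn decode_player(mut input: &[u8]) -> Option<Player> {
    let level = u16::from_le_bytes(read_array::<2>(&mut input)?);
    let score = u64::from_le_bytes(read_array::<8>(&mut input)?);
    let name_len = u32::from_le_bytes(read_array::<4>(&mut input)?) as usize;
    if input.len() < name_len {
        return None; // declared length exceeds the remaining bytes
    }
    let name = String::from_utf8(input[..name_len].to_vec()).ok()?;
    Some(Player { level, score, name })
}
```

Feeding it the 19 bytes shown above returns `Player { level: 5, score: 100, name: "Alice" }`; a truncated buffer returns `None`.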

Nested Structures

Example:

struct Inventory {
    items: Vec<String>,
}

struct Player {
    wallet: Pubkey,
    inventory: Inventory,
}

Serialization (nested structures are inlined):

[32 bytes: wallet]
[4 bytes: items length]
[4 bytes: item 1 length][item 1 UTF-8 bytes]
[4 bytes: item 2 length][item 2 UTF-8 bytes]
...
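
To make the inlining concrete, here is a hand-written sketch that flattens the nested `Inventory` directly into the output buffer. It uses only the standard library; `encode_player` and the item names in the usage note are made up for illustration.

```rust
/// Encode Player { wallet, inventory: Inventory { items } } with no wrapper
/// bytes around the nested struct. Hypothetical helper for illustration.
fn encode_player(wallet: &[u8; 32], items: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(wallet);                              // 32 bytes
    out.extend_from_slice(&(items.len() as u32).to_le_bytes()); // items length
    for item in items {
        out.extend_from_slice(&(item.len() as u32).to_le_bytes()); // item byte length
        out.extend_from_slice(item.as_bytes());                    // item UTF-8 bytes
    }
    out
}
```

With items `["sword", "bow"]` the result is 32 + 4 + (4 + 5) + (4 + 3) = 52 bytes.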

Anchor Accounts

Anchor adds an 8-byte discriminator at the start of account data:

[8-byte discriminator][borsh-serialized data]

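By default, the discriminator is the first 8 bytes of the SHA-256 hash of `account:<TypeName>` (for example, `account:PlayerAccount`). Anchor checks it during deserialization to reject data that belongs to a different account type.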

Example:

#[account]
pub struct PlayerAccount {
    pub level: u16,
    pub score: u64,
}

On-chain data:

[8 bytes: discriminator][2 bytes: level][8 bytes: score]
Total: 18 bytes

Debugging Serialization Issues

Hex Dump Interpretation

Tool: hexdump

hexdump -C account_data.bin

Example Output:

00000000  92 bc 2c 1a 8e 4f 7a 6d  05 00 64 00 00 00 00 00  |..,..Ozm..d.....|
00000010  00 00                                             |..|

Interpretation:

  • Bytes 0-7: Discriminator 92 bc 2c 1a 8e 4f 7a 6d
  • Bytes 8-9: level = 5 → 0x05 0x00 (little-endian u16)
  • Bytes 10-17: score = 100 → 0x64 0x00 0x00 0x00 0x00 0x00 0x00 0x00
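
The same interpretation can be checked in code. This sketch splits the 18 bytes from the dump at the offsets listed above; `decode_dump` is a hypothetical helper that uses only the standard library.

```rust
/// Split an 18-byte Anchor account dump into (discriminator, level, score).
/// Hypothetical helper for illustration.
fn decode_dump(data: &[u8; 18]) -> ([u8; 8], u16, u64) {
    let mut disc = [0u8; 8];
    disc.copy_from_slice(&data[0..8]);          // bytes 0-7
    let level = u16::from_le_bytes([data[8], data[9]]); // bytes 8-9
    let mut score = [0u8; 8];
    score.copy_from_slice(&data[10..18]);       // bytes 10-17
    (disc, level, u64::from_le_bytes(score))
}
```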

Common Serialization Bugs

1. Field Order Mismatch

Problem:

// Rust
struct Player {
    level: u16,
    score: u64,
}

// TypeScript (WRONG!)
interface Player {
    score: number;  // Wrong order!
    level: number;
}

Solution: LUMOS ensures field order matches. If manually writing schemas, maintain declaration order.
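
To see why this bug is hard to spot, the sketch below reads the same 10 bytes in both orders. Nothing fails; the swapped reader simply returns wrong values. Both functions are hypothetical helpers using only the standard library.

```rust
/// Read the fields in the order the Rust struct declares them.
fn read_level_then_score(data: &[u8; 10]) -> (u16, u64) {
    let level = u16::from_le_bytes([data[0], data[1]]);
    let mut score = [0u8; 8];
    score.copy_from_slice(&data[2..10]);
    (level, u64::from_le_bytes(score))
}

/// Read the same bytes with the fields swapped, as a mismatched schema would.
fn read_score_then_level(data: &[u8; 10]) -> (u64, u16) {
    let mut score = [0u8; 8];
    score.copy_from_slice(&data[0..8]);
    let level = u16::from_le_bytes([data[8], data[9]]);
    (u64::from_le_bytes(score), level)
}
```

For the bytes of `level: 5, score: 100` (`05 00 64 00 00 00 00 00 00 00`), the correct reader returns `(5, 100)` while the swapped reader returns a score of 6553605 and a level of 0.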

2. Endianness Issues

Problem: Reading multi-byte integers in the wrong byte order

Solution: Borsh uses little-endian for all integers. Ensure your tooling expects this.
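
A quick way to recognize this bug is to decode the same bytes both ways and compare. The sketch below uses the standard library's `from_le_bytes` and `from_be_bytes`; `decode_both_ways` is a hypothetical helper.

```rust
/// Decode two bytes as (little-endian, big-endian) u16 values.
/// Hypothetical helper for illustration.
fn decode_both_ways(bytes: [u8; 2]) -> (u16, u16) {
    (u16::from_le_bytes(bytes), u16::from_be_bytes(bytes))
}
```

For `E8 03`, the little-endian reading is 1000 (what Borsh wrote) and the big-endian reading is 59395.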

3. String Encoding

Problem: Non-UTF-8 strings causing deserialization failures

Solution: Validate UTF-8 before serialization:

let name = String::from_utf8(bytes).map_err(|_| ErrorCode::InvalidUtf8)?;
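
When decoding by hand, it helps to separate a truncated buffer from invalid UTF-8 so the failure is easier to diagnose. The following is a sketch with a made-up `DecodeError` enum and `decode_string` helper, using only the standard library; it is not the error type the `borsh` crate returns.

```rust
/// Hypothetical error type for this example.
#[derive(Debug, PartialEq)]
enum DecodeError {
    UnexpectedEof,
    InvalidUtf8,
}

/// Decode a length-prefixed string, reporting which check failed.
fn decode_string(input: &[u8]) -> Result<String, DecodeError> {
    if input.len() < 4 {
        return Err(DecodeError::UnexpectedEof); // no room for the length prefix
    }
    let len = u32::from_le_bytes([input[0], input[1], input[2], input[3]]) as usize;
    // `get` returns None when the declared length runs past the buffer.
    let body = input[4..].get(..len).ok_or(DecodeError::UnexpectedEof)?;
    String::from_utf8(body.to_vec()).map_err(|_| DecodeError::InvalidUtf8)
}
```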

4. Discriminator Confusion

Problem: Forgetting to skip the 8-byte discriminator in Anchor accounts

Solution:

// ❌ WRONG
const player = borsh.deserialize(PlayerAccountSchema, accountInfo.data);

// ✅ CORRECT
const player = borsh.deserialize(PlayerAccountSchema, accountInfo.data.slice(8));
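
When handling raw account bytes in Rust, the same step can be sketched as a length check plus a comparison against the expected discriminator. `account_body` and its `expected` parameter are hypothetical, not Anchor APIs; Anchor's generated deserialization code performs its own discriminator check.

```rust
/// Return the Borsh payload if `data` starts with the expected discriminator.
/// Hypothetical helper for illustration, not part of Anchor.
fn account_body<'a>(data: &'a [u8], expected: &[u8; 8]) -> Option<&'a [u8]> {
    if data.len() < 8 {
        return None; // too short to hold a discriminator
    }
    let (disc, body) = data.split_at(8);
    if disc == &expected[..] {
        Some(body)
    } else {
        None // data belongs to a different account type
    }
}
```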

Manual Serialization Example

use borsh::BorshSerialize;

let player = PlayerAccount {
    wallet: Pubkey::default(),
    level: 10,
    score: 500,
};

// try_to_vec is the borsh 0.x API; with borsh 1.x use borsh::to_vec(&player) instead.
let bytes = player.try_to_vec().unwrap();
println!("Serialized bytes: {:02X?}", bytes);

Output:

Serialized bytes: [00, 00, 00, ..., 0A, 00, F4, 01, 00, 00, 00, 00, 00, 00]
                   [    32-byte PublicKey   ][level][      score      ]

### 2. New Directory: `examples/borsh-internals/`

```
examples/borsh-internals/
├── schema.lumos            # Test schema with various types
├── Cargo.toml
├── src/
│   ├── binary_inspector.rs # Print hex dumps of serialized data
│   ├── manual_serialize.rs # Manual Borsh encoding examples
│   ├── type_sizes.rs       # Calculate sizes of all types
│   └── lib.rs              # Export utilities
└── README.md               # Binary format reference
```

**`src/binary_inspector.rs`:**
```rust
use borsh::BorshSerialize;
use generated::PlayerAccount;

pub fn inspect_account(account: &PlayerAccount) {
    let bytes = account.try_to_vec().unwrap();

    println!("Total size: {} bytes", bytes.len());
    println!("Hex dump:");
    // Print 16 bytes per row, prefixed with the row's starting offset.
    for (i, chunk) in bytes.chunks(16).enumerate() {
        print!("{:08x}  ", i * 16);
        for byte in chunk {
            print!("{:02x} ", byte);
        }
        println!();
    }

    // Fixed offsets: PlayerAccount has no variable-length fields.
    println!("\nField breakdown:");
    println!("  wallet (32 bytes): {:02X?}", &bytes[0..32]);
    println!("  level (2 bytes): {:02X?}", &bytes[32..34]);
    println!("  score (8 bytes): {:02X?}", &bytes[34..42]);
}
```

**`src/type_sizes.rs`:**
```rust
use std::mem::size_of;
use borsh::BorshSerialize;

pub fn print_type_sizes() {
    // For fixed-width types, the in-memory size reported by size_of
    // matches the Borsh-encoded size.
    println!("Primitive Types:");
    println!("  u8: {} byte", size_of::<u8>());
    println!("  u16: {} bytes", size_of::<u16>());
    println!("  u32: {} bytes", size_of::<u32>());
    println!("  u64: {} bytes", size_of::<u64>());
    println!("  u128: {} bytes", size_of::<u128>());

    println!("\nSolana Types:");
    println!("  Pubkey: {} bytes", size_of::<anchor_lang::prelude::Pubkey>());

    // Variable-length types must be serialized to measure their encoded size.
    println!("\nVariable-Length Types:");
    let empty_vec: Vec<u8> = vec![];
    let vec_3: Vec<u8> = vec![1, 2, 3];
    println!("  Vec<u8> (empty): {} bytes", empty_vec.try_to_vec().unwrap().len());
    println!("  Vec<u8> (3 items): {} bytes", vec_3.try_to_vec().unwrap().len());

    let none: Option<u64> = None;
    let some: Option<u64> = Some(100);
    println!("  Option<u64> (None): {} byte", none.try_to_vec().unwrap().len());
    println!("  Option<u64> (Some): {} bytes", some.try_to_vec().unwrap().len());
}
```

### 3. Add Binary Layout Diagrams

In `docs/reference/type-mapping.md`, add visual diagrams:

## Binary Layout Examples

### Example: PlayerAccount
```rust
struct PlayerAccount {
    wallet: PublicKey,  // 32 bytes
    level: u16,         // 2 bytes
    score: u64,         // 8 bytes
}
```

Memory Layout:

```
┌─────────────────────────────────────┬──────────┬─────────────────────┐
│         wallet (32 bytes)           │  level   │    score (8 bytes)  │
│                                     │(2 bytes) │                     │
└─────────────────────────────────────┴──────────┴─────────────────────┘
 0                                   31 32     33 34                  41
```

Total Size: 42 bytes


## Acceptance Criteria

- [ ] `docs/internals/borsh-serialization.md` written with:
  - [ ] Complete binary format reference for all types
  - [ ] Serialization pipeline explanation
  - [ ] Nested structure examples
  - [ ] Anchor discriminator documentation
  - [ ] Common bug patterns and solutions
- [ ] New `examples/borsh-internals/` directory with:
  - [ ] Binary inspector tool (hex dump utility)
  - [ ] Manual serialization examples
  - [ ] Type size calculator
  - [ ] Comprehensive README
- [ ] Binary layout diagrams added to `docs/reference/type-mapping.md`
- [ ] Reference table for all type encodings
- [ ] **Target:** Context7 Q1 score ≥ 88 (+10 points)

## Impact

**Context7 Benchmark:**
- Q1: 78 → 88 (+10 points)

**Overall Score:** 84.1 → 85.1 (+1.0 point)

**User Value:**
- Deep understanding of serialization format
- Better debugging capabilities
- Reduced serialization bugs
- Educational resource for Borsh/Solana development

## Related

- Context7 Benchmark Question 1
- Borsh specification: https://borsh.io
- Type mapping documentation

## Priority Justification

🟢 **MEDIUM** - Technical depth improvement, valuable for advanced users and debugging


    Labels

      • area:docs: Documentation (lumos-lang.org, guides)
      • phase-1-llmo: Phase 1 LLMO (Stealth mode documentation foundation)
      • priority:high: High priority, should be addressed soon
      • type:documentation: Documentation improvements or additions
      • type:enhancement: Improvement to existing feature
