Describe the bug
x.json2.decode for a small nested struct is ~30% slower than json.decode (the legacy cJSON-based decoder) on master. Same shape: 1 int field, 1 string field, 1 inner struct with 2 fields. 1 000 000 iterations on the same machine, -prod -cc gcc.
SPENT 567.940 ms json2.decode[Stru](json_data)! ← x.json2 on master
SPENT 437.685 ms old_json.decode(Stru, json_data)! ← cJSON
ratio: x.json2 ≈ 1.30× cJSON
The x.json2 README states the goal is "as fast as or faster than cJSON". Today the gap is largest on the most common shape (small nested structs), which is the dominant workload for any service decoding tens of thousands of small JSON payloads per second.
The benchmark script bench.v already exists under x.json2's test folder. The numbers above are reproducible by anyone with ./v -prod -cc gcc crun <path to bench.v> after a ./v wipe-cache.
V code (bench.v excerpt, the same file shipped under x.json2 tests):
import x.json2
import json as old_json
import benchmark

const max_iterations = 1_000_000

pub struct Stru {
	val  int
	val2 string
	val3 Stru2
}

pub struct Stru2 {
	a         int
	churrasco string
}

fn main() {
	json_data := '{"val": 1, "val2": "lala", "val3": {"a": 2, "churrasco": "leleu"}}'
	mut b := benchmark.start()
	for _ in 0 .. max_iterations {
		_ := json2.decode[Stru](json_data)!
	}
	b.measure('json2.decode[Stru]')
	for _ in 0 .. max_iterations {
		_ := old_json.decode(Stru, json_data)!
	}
	b.measure('old_json.decode(Stru)')
}
C backend result (root cause):
// The decoder keeps a heap-allocated singly-linked list (LinkedList[ValueInfo])
// of every value in the JSON. Each Node is a small heap allocation, freed at end
// of decode. For a 4-key payload that is 4 mallocs + frees per call.
// On top of that, the inline non-embed struct decoder builds a fresh
// LinkedList[StructFieldInfo] *per decode call*, with one heap-allocated node
// per struct field, then walks it with pointer chasing. For Stru that is
// another 4 mallocs + 4 frees + 1 list free per call.
// cJSON parses into an arena and lets the wrapper pull pointers directly, so
// it pays a single arena alloc per call instead of 8 small ones.
Expected Behavior
SPENT ~440 ms json2.decode[Stru](json_data)!    # parity with cJSON or better
SPENT ~440 ms old_json.decode(Stru, json_data)!
Current Behavior (master)
SPENT 567.940 ms json2.decode[Stru](json_data)!    # ~30% slower than cJSON
SPENT 437.685 ms old_json.decode(Stru, json_data)!
SPENT 607.027 ms json2.decode[SumTypes](json_data)!    # ~28% slower than cJSON
SPENT 475.774 ms old_json.decode(SumTypes, json_data)!
(Other shapes — top-level array of int, map[string]string, StructTypeOption[string] — are at parity or already faster than cJSON.)
Possible Solution
The hot path for nested structs is dominated by per-node allocation in LinkedList[ValueInfo] and per-call construction of LinkedList[StructFieldInfo] (one heap node per struct field, every decode). Two orthogonal changes that compound, plus one approach that was tried and rejected:
Per-T cached StructFieldInfo + array iteration. The non-embed inline struct decoder used to (a) allocate a LinkedList[StructFieldInfo] node per struct field per decode call, (b) walk it with pointer chasing, and (c) free it. Build it once per type via a cached_struct_field_infos[T]() static — same pattern as cached_field_infos already used in the encoder — and iterate by index over a contiguous slice. The mutable is_decoded flag can be extracted into a per-call u64 bitmask (no allocation; up to 64 fields, with overflow to []bool for the rare wider struct). For the Stru benchmark this removes 4 mallocs + 4 frees + 1 array free per call; against 1M iterations that is 9M GC ops eliminated.
decode_string no-escape fast path. When the JSON string body contains no \, return decoder.json[pos+1..pos+length-1] directly — a string-header slice, no body copy. The presence of an escape can be checked with a single C.memchr call.
(Tried but not worth it.) A bump-allocator arena for Node[ValueInfo] (single malloc(N * sizeof(Node)) instead of N small &Node{}) regressed by ~10% in local testing, because Boehm GC's small-object freelist serves GC_MALLOC(32) faster than one GC_MALLOC(1600), and the per-call setup/teardown of the arena (plus the predicted-not-taken branch in the push) adds more cost than it saves. Lesson: under Boehm GC, arena allocators do not win for small short-lived objects — keep the lazy small-allocations.
I have changes (1) and (2) ready as a patch; happy to send a PR. Local measurement of the patched build:
| Benchmark                 | master | patched | Δ vs master  | vs cJSON                 |
|:--------------------------|:-------|:--------|:-------------|:-------------------------|
| Stru (nested struct)      | 567 ms | 274 ms  | 2.07× faster | 1.57× faster than cJSON  |
| SumTypes (nested)         | 607 ms | 323 ms  | 1.88× faster | 1.41× faster than cJSON  |
| StructType[string]        | 111 ms | 87 ms   | 1.28×        | 1.30× faster than cJSON  |
| StructTypeOption[string]  | 135 ms | 104 ms  | 1.30×        | 1.40× faster than cJSON  |
| StructType[int]           | 132 ms | 113 ms  | 1.17×        | 1.21× faster than cJSON  |
| map[string]string         | 180 ms | 159 ms  | 1.13×        | 1.40× faster than cJSON  |
| string (single value)     | 71 ms  | 48 ms   | 1.48×        | n/a                      |
| StringAlias               | 71 ms  | 47 ms   | 1.51×        | n/a                      |
All 57 x.json2 tests still pass on the patched build.
Additional Information/Context
x.json2's bench.v already exercises the worst case at 1M iterations. The two slow cases dominate the budget for any service that decodes tens of thousands of small JSON payloads per second; they should be the first targets.
V version
V 0.5.1 1b3385cc34ff783e793d1a26a8ec5be587c80fe0.40b3711
Environment details (OS name and version, etc.)
|V full version |V 0.5.1 1b3385cc34ff783e793d1a26a8ec5be587c80fe0.40b3711
|:-------------------|:-------------------
|OS |linux, Ubuntu 24.04 LTS
|Processor |16 cpus, 64bit, little endian, AMD Ryzen 7 5800H with Radeon Graphics
|Memory |8.17GB/30.7GB
| |
|V executable |/home/hitalo/Documents/v/v
|V last modified time|2026-04-18 09:18:00
| |
|V home dir |OK, value: /home/hitalo/Documents/v
|VMODULES |OK, value: /home/hitalo/.vmodules
|VTMP |OK, value: /tmp/v_1000
|Current working dir |OK, value: /home/hitalo/Documents/v
| |
|Git version |git version 2.43.0
|V git status |0.5.1-1006-g40b3711b-dirty
|.git/config present |true
| |
|cc version |cc (GCC) 14.2.0
|gcc version |gcc (GCC) 14.2.0
|clang version |Ubuntu clang version 18.1.3 (1)
|tcc version |tcc version 0.9.28rc 2025-02-13 HEAD@f8bd136d (x86_64 Linux)
|tcc git status |thirdparty-linux-amd64 696c1d84
|emcc version |emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.6 ()
|glibc version |ldd (Ubuntu GLIBC 2.39-0ubuntu8.3) 2.39
Note
You can use the 👍 reaction to increase the issue's priority for developers.
Please note that only the 👍 reaction to the issue itself counts as a vote.
Other reactions and those to comments will not be taken into account.