Describe the bug
x.json2.decode for a small nested struct is ~30% slower than json.decode (the legacy cJSON-based decoder) on master. Same shape: 1 int field, 1 string field, 1 inner struct with 2 fields. 1 000 000 iterations on the same machine, -prod -cc gcc.
SPENT 567.940 ms json2.decode[Stru](json_data)! ← x.json2 on master
SPENT 437.685 ms old_json.decode(Stru, json_data)! ← cJSON
ratio: x.json2 ≈ 1.30× cJSON
The x.json2 README states the goal is "as fast as or faster than cJSON". Today the gap is largest on the most common shape (small nested structs), which is the dominant workload for any service decoding tens of thousands of small JSON payloads per second.
The benchmark script bench.v already exists under x.json2's test folder. The numbers above are reproducible by anyone with ./v -prod -cc gcc crun <path to bench.v> after a ./v wipe-cache.
V code (bench.v excerpt, the same file shipped under x.json2 tests):
import x.json2
import json as old_json
import benchmark

const max_iterations = 1_000_000

pub struct Stru {
	val  int
	val2 string
	val3 Stru2
}

pub struct Stru2 {
	a         int
	churrasco string
}

fn main() {
	json_data := '{"val": 1, "val2": "lala", "val3": {"a": 2, "churrasco": "leleu"}}'
	mut b := benchmark.start()
	for _ in 0 .. max_iterations {
		_ := json2.decode[Stru](json_data)!
	}
	b.measure('json2.decode[Stru]')
	for _ in 0 .. max_iterations {
		_ := old_json.decode(Stru, json_data)!
	}
	b.measure('old_json.decode(Stru)')
}
C backend result (root cause):
// The decoder keeps a heap-allocated singly-linked list (LinkedList[ValueInfo])
// of every value in the JSON. Each Node is a small heap allocation, freed at end
// of decode. For a 4-key payload that is 4 mallocs + frees per call.
// On top of that, the inline non-embed struct decoder builds a fresh
// LinkedList[StructFieldInfo] *per decode call*, with one heap-allocated node
// per struct field, then walks it with pointer chasing. For Stru that is
// another 4 mallocs + 4 frees + 1 list free per call.
// cJSON parses into an arena and lets the wrapper pull pointers directly, so
// it pays a single arena alloc per call instead of 8 small ones.
Expected Behavior
SPENT ~440 ms json2.decode[Stru](json_data)!    # parity with cJSON or better
SPENT ~440 ms old_json.decode(Stru, json_data)!
Current Behavior (master)
SPENT 567.940 ms json2.decode[Stru](json_data)!    # ~30% slower than cJSON
SPENT 437.685 ms old_json.decode(Stru, json_data)!
SPENT 607.027 ms json2.decode[SumTypes](json_data)!    # ~28% slower than cJSON
SPENT 475.774 ms old_json.decode(SumTypes, json_data)!
(Other shapes — top-level array of int, map[string]string, StructTypeOption[string] — are at parity or already faster than cJSON.)
Possible Solution
The hot path for nested structs is dominated by per-node allocation in LinkedList[ValueInfo] and per-call construction of LinkedList[StructFieldInfo] (one heap node per struct field, every decode). Two orthogonal changes that compound, plus one approach that was tried and rejected:
Per-T cached StructFieldInfo + array iteration. The non-embed inline struct decoder used to (a) allocate a LinkedList[StructFieldInfo] node per struct field per decode call, (b) walk it with pointer chasing, and (c) free it. Build it once per type via a cached_struct_field_infos[T]() static — same pattern as cached_field_infos already used in the encoder — and iterate by index over a contiguous slice. The mutable is_decoded flag can be extracted into a per-call u64 bitmask (no allocation; up to 64 fields, with overflow to []bool for the rare wider struct). For the Stru benchmark this removes 4 mallocs + 4 frees + 1 array free per call; against 1M iterations that is 9M GC ops eliminated.
decode_string no-escape fast path. When the JSON string body contains no \, return decoder.json[pos+1..pos+length-1] directly — a string-header slice, no body copy. The presence of an escape can be checked with a single C.memchr call.
(Tried but not worth it.) A bump-allocator arena for Node[ValueInfo] (single malloc(N * sizeof(Node)) instead of N small &Node{}) regressed by ~10% in local testing, because Boehm GC's small-object freelist serves GC_MALLOC(32) faster than one GC_MALLOC(1600), and the per-call setup/teardown of the arena (plus the predicted-not-taken branch in the push) adds more cost than it saves. Lesson: under Boehm GC, arena allocators do not win for small short-lived objects — keep the lazy small-allocations.
I have changes (1) and (2) ready as a patch; happy to send a PR. Local measurement of the patched build:
| Benchmark                 | master | patched | Δ vs master  | vs cJSON                 |
|:--------------------------|:-------|:--------|:-------------|:-------------------------|
| Stru (nested struct)      | 567 ms | 274 ms  | 2.07× faster | 1.57× faster than cJSON  |
| SumTypes (nested)         | 607 ms | 323 ms  | 1.88× faster | 1.41× faster than cJSON  |
| StructType[string]        | 111 ms | 87 ms   | 1.28×        | 1.30× faster than cJSON  |
| StructTypeOption[string]  | 135 ms | 104 ms  | 1.30×        | 1.40× faster than cJSON  |
| StructType[int]           | 132 ms | 113 ms  | 1.17×        | 1.21× faster than cJSON  |
| map[string]string         | 180 ms | 159 ms  | 1.13×        | 1.40× faster than cJSON  |
| string (single value)     | 71 ms  | 48 ms   | 1.48×        | n/a                      |
| StringAlias               | 71 ms  | 47 ms   | 1.51×        | n/a                      |
All 57 x.json2 tests still pass on the patched build.
Additional Information/Context
x.json2's bench.v already exercises the worst case at 1M iterations. The two slow cases dominate the budget for any service that decodes tens of thousands of small JSON payloads per second; they should be the first targets.
V version
V 0.5.1 1b3385cc34ff783e793d1a26a8ec5be587c80fe0.40b3711
Environment details (OS name and version, etc.)
|V full version |V 0.5.1 1b3385cc34ff783e793d1a26a8ec5be587c80fe0.40b3711
|:-------------------|:-------------------
|OS |linux, Ubuntu 24.04 LTS
|Processor |16 cpus, 64bit, little endian, AMD Ryzen 7 5800H with Radeon Graphics
|Memory |8.17GB/30.7GB
| |
|V executable |/home/hitalo/Documents/v/v
|V last modified time|2026-04-18 09:18:00
| |
|V home dir |OK, value: /home/hitalo/Documents/v
|VMODULES |OK, value: /home/hitalo/.vmodules
|VTMP |OK, value: /tmp/v_1000
|Current working dir |OK, value: /home/hitalo/Documents/v
| |
|Git version |git version 2.43.0
|V git status |0.5.1-1006-g40b3711b-dirty
|.git/config present |true
| |
|cc version |cc (GCC) 14.2.0
|gcc version |gcc (GCC) 14.2.0
|clang version |Ubuntu clang version 18.1.3 (1)
|tcc version |tcc version 0.9.28rc 2025-02-13 HEAD@f8bd136d (x86_64 Linux)
|tcc git status |thirdparty-linux-amd64 696c1d84
|emcc version |emcc (Emscripten gcc/clang-like replacement + linker emulating GNU ld) 3.1.6 ()
|glibc version |ldd (Ubuntu GLIBC 2.39-0ubuntu8.3) 2.39
Note
You can use the 👍 reaction to increase the issue's priority for developers.
Please note that only the 👍 reaction to the issue itself counts as a vote.
Other reactions and those to comments will not be taken into account.