Description
Each `JSON::Lexer` owns its `StringPool` for key strings, and calling `from_json` on a `JSON::Serializable` type creates a fresh `JSON::Lexer`. Those string pools are good at reusing JSON keys within the same string, but not when deserializing multiple objects from different strings. Consider this:
```crystal
require "json"

record Point, x : Int32, y : Int32 do
  include JSON::Serializable
end

10000.times do |i|
  Point.from_json %({"x":#{i},"y":#{i}})
end
```
This will create 10,000 `StringPool`s; the strings `x` and `y` will each be looked up exactly once per `StringPool`, and every lookup will miss. I recently profiled a Crystal application only to discover that those key strings are among the largest living objects allocated throughout program execution. IMO the standard library should do better than this.
A naive approach would be to share one `StringPool` among all deserializations, say via a class variable. But then the pool would never be garbage-collected, and if the lexers looked up every JSON key in it, memory would blow up whenever one encounters a `Hash(String, _)` field or a `JSON::Serializable::Unmapped` type, where JSON documents with many distinct keys are expected to parse successfully.
Instead, what we really want is a pool of string literals for all the field names. Conceptually, it translates a `Bytes` to a `String`, with the constraint that the target strings must reside in read-only memory and lookups must not allocate:
```crystal
record LiteralPool, literals : Array(String) do
  def get(bytes : Bytes) : String?
    @literals.find &.to_slice.==(bytes)
  end
end

class JSON::Lexer
  # needs suitable forwarding in `StringBased` and `IOBased`
  def initialize(@literal_pool : LiteralPool)
    # ...
  end

  private def consume_string_with_buffer(&)
    # ...
    if @expects_object_key
      @token.string_value = @literal_pool.get(@buffer.to_slice) || @string_pool.get(@buffer)
    else
      @token.string_value = @buffer.to_s
    end
  end
end

class JSON::PullParser
  def initialize(input, literal_pool)
    @lexer = Lexer.new input, literal_pool
  end
end

struct Point
  # for exposition only; the `%w` would be generated by a macro
  class_getter __json_literals : LiteralPool { LiteralPool.new %w(x y) }
end

def Point.from_json(string_or_io)
  parser = JSON::PullParser.new(string_or_io, Point.__json_literals)
  new parser
end
```
The `LiteralPool` implementation above would be slower than a `StringPool`, but alternative data structures exist (e.g. a prefix tree) with comparable time complexity, and some can even themselves be defined in read-only memory. (Note that `Hash` isn't one of them, as `Object#hash` is never a constant expression.)
Now the `StringPool` would only be useful for `Hash(String, _)` and `JSON::Serializable::Unmapped` in JSON documents that nonetheless have many duplicate keys. If we believe these are rare, we could make `JSON::Lexer#@string_pool` a lazy getter or even drop it entirely.
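The lazy variant could be as small as this sketch (using the same block-form getter macro as `__json_literals` above; the real lexer initializes the pool eagerly today):

```crystal
require "string_pool"

class JSON::Lexer
  # Allocate the pool only on first use, i.e. only when an object key
  # misses the literal pool and actually needs interning.
  getter string_pool : StringPool { StringPool.new }
end
```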