Sharing Strings between deserializations of a JSON::Serializable type #13638

Open

@HertzDevil
Each JSON::Lexer owns its own StringPool for key strings, and calling from_json on a JSON::Serializable type creates a fresh JSON::Lexer. These pools are good at deduplicating repeated JSON keys within a single document, but not when deserializing multiple objects from different strings. Consider this:

require "json"

record Point, x : Int32, y : Int32 do
  include JSON::Serializable
end

10000.times do |i|
  Point.from_json %({"x":#{i},"y":#{i}})
end

This will create 10,000 StringPools; the strings x and y will each be looked up exactly once per pool, and every one of those lookups will miss. I recently profiled a Crystal application only to discover that these key strings were among the largest living objects allocated throughout program execution. IMO the standard library should do better than this.

A naive approach would be to share one StringPool among all deserializations, say via a class variable. But then that pool would never be garbage-collected, and if the lexer interns every JSON key in it, memory would blow up whenever it encounters a Hash(String, _) field or a JSON::Serializable::Unmapped type, where JSONs with many different keys are expected to parse successfully.
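To make the blowup concrete, here is a standalone sketch (not the actual lexer code) of what a single process-wide pool does when fed data-dependent keys:

```crystal
require "string_pool"

# One shared pool, as the naive class-variable approach would use.
pool = StringPool.new

# Simulate parsing documents whose keys are data-dependent,
# e.g. a Hash(String, Int32) field with user-controlled keys:
10_000.times do |i|
  pool.get("key_#{i}")
end

pool.size # => 10000; every distinct key stays interned for the process lifetime
```

Every distinct key ever seen is retained, so the pool's footprint is bounded only by the variety of input documents, not by the set of field names we actually care about.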

Instead, what we really want is a pool of string literals for all the field names. Conceptually, it translates a Bytes to a String, with the constraint that the target strings must reside in read-only memory and that lookups must not allocate:

record LiteralPool, literals : Array(String) do
  # Linear scan; returns `nil` when `bytes` matches no known field name
  def get(bytes : Bytes) : String?
    @literals.find &.to_slice.==(bytes)
  end
end

class JSON::Lexer
  # needs suitable forwarding in `StringBased` and `IOBased`
  def initialize(@literal_pool : LiteralPool)
    # ...
  end

  private def consume_string_with_buffer(&)
    # ...
    if @expects_object_key
      @token.string_value = @literal_pool.get(@buffer.to_slice) || @string_pool.get(@buffer)
    else
      @token.string_value = @buffer.to_s
    end
  end
end

class JSON::PullParser
  def initialize(input, literal_pool)
    @lexer = Lexer.new input, literal_pool
  end
end

struct Point
  # for exposition only; the `%w` would be generated by a macro
  class_getter __json_literals : LiteralPool { LiteralPool.new %w(x y) }
end

def Point.from_json(string_or_io)
  parser = JSON::PullParser.new(string_or_io, Point.__json_literals)
  new parser
end

The LiteralPool implementation above would be slower than a StringPool, but alternative data structures exist (e.g. a prefix tree) with comparable time complexity, and some can even themselves be defined in read-only memory. (Note that Hash isn't one of them, as Object#hash is never a constant expression.)
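As a sketch of the prefix-tree idea, here is a hypothetical LiteralTrie (names and structure are illustrative only). It uses heap-allocated Hash nodes, so unlike the read-only variants mentioned above it would still live on the GC heap, but lookups allocate nothing and run in time proportional to the key length:

```crystal
# Hypothetical prefix tree over a fixed set of field-name literals.
# Built once per type; `get` allocates nothing on lookup.
class LiteralTrie
  class Node
    getter children = Hash(UInt8, Node).new
    property literal : String?
  end

  def initialize(literals : Array(String))
    @root = Node.new
    literals.each do |lit|
      node = @root
      lit.each_byte do |b|
        node = node.children[b] ||= Node.new
      end
      node.literal = lit
    end
  end

  def get(bytes : Bytes) : String?
    node = @root
    bytes.each do |b|
      node = node.children[b]? || return nil
    end
    node.literal
  end
end

trie = LiteralTrie.new %w(x y)
trie.get("x".to_slice) # => "x"
trie.get("z".to_slice) # => nil
```

The same traversal could be compiled down to a table of plain arrays generated at macro-expansion time, which is what would make a fully read-only variant possible.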

Now the StringPool would only be useful for Hash(String, _) and JSON::Serializable::Unmapped in JSONs that nonetheless contain many duplicate keys. If we believe these cases are rare, we could make JSON::Lexer's @string_pool a lazy getter, or even drop it entirely.
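The lazy variant could look like this (a sketch only; it reuses the @literal_pool and @buffer from the lexer sketch above, and the real lexer carries more state):

```crystal
class JSON::Lexer
  # Allocated only on first use, i.e. the first time an object key
  # misses the literal pool (Hash(String, _) keys, Unmapped keys, ...)
  private getter string_pool : StringPool { StringPool.new }

  private def consume_string_with_buffer(&)
    # ...
    if @expects_object_key
      @token.string_value = @literal_pool.get(@buffer.to_slice) || string_pool.get(@buffer)
    else
      @token.string_value = @buffer.to_s
    end
  end
end
```

With this, deserializing a type whose keys all hit the literal pool never allocates a StringPool at all.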
