
Add 'materialize_refs' function #47

Open
wants to merge 1 commit into master
Conversation

seriyps
Contributor

@seriyps seriyps commented Mar 16, 2017

So, this function just resolves references and replaces them in place with the referenced schema.

{"definitions": {
    "myKey": {
        "minimum": 100,
        "description": "My key description"
    }
 },
 "type": "integer",
 "$ref": "#/definitions/myKey"
}

would become

{"definitions": {
    "myKey": {
        "minimum": 100,
        "description": "My key description"
    }
 },
 "type": "integer",
 "minimum": 100,
 "description": "My key description"
}

Works with remote references as well.
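For illustration, here is a rough sketch of the idea in Python (jesse itself is Erlang; the function name and the restriction to local `#/definitions/...` refs are my simplifications, not the PR's actual code):

```python
import copy

def materialize_refs(schema, root=None):
    """Recursively replace local '#/...' $refs with the referenced schema.

    Mirrors the PR's example: keys next to $ref are merged with the target
    (note: strict $ref semantics would *replace* the whole object instead).
    """
    if root is None:
        root = schema
    if isinstance(schema, dict):
        ref = schema.get("$ref")
        if isinstance(ref, str) and ref.startswith("#/"):
            target = root
            for part in ref[2:].split("/"):
                target = target[part]
            merged = {k: v for k, v in schema.items() if k != "$ref"}
            merged.update(copy.deepcopy(target))  # "merge" behavior, as in the example
            return materialize_refs(merged, root)
        # keep "definitions" verbatim, recurse into everything else
        return {k: (v if k == "definitions" else materialize_refs(v, root))
                for k, v in schema.items()}
    if isinstance(schema, list):
        return [materialize_refs(v, root) for v in schema]
    return schema
```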

It's just a prototype and I'm not expecting it to be merged right now, but let's discuss what else I should add. A README? Tests?

Also, there should probably be an option to automatically resolve references before storing the schema in the jesse database.

Why was this implemented? Initially, to help with debugging large schemas with remote refs (we have a lot of schemas referencing a common types.json schema).
It may also improve performance, because, AFAIK, each remote $ref occurrence causes a jesse_database:lookup, and this may happen in a loop. That part isn't used yet, though.

@andreineculau
Member

  1. super quick feedback: 👍
  2. quick feedback: maybe expand is easier to grasp than materialize. AFAIK your example is wrong, since $ref semantics are replace, not merge. I don't have time now to check the diff, but one thing to be aware of and check is circular references. I'm not sure if it fits in the "json schema validator" jesse, but I can see how it fits in the jesse codebase with all the existing functionality, so I'm happy to work on getting this PR merged. At Klarna I did a similar thing outside of jesse, also because of the common definitions. That was >3y ago, and Klarna never open-sourced it, so it's time to reinvent the wheel :)

@seriyps
Contributor Author

seriyps commented Mar 16, 2017

@andreineculau your comment about 'replace' semantics is interesting. I checked the spec and it looks like you are right: http://json-schema.org/latest/json-schema-core.html#rfc.section.7

Resolved against the current URI base, it identifies the URI of a schema to use. All other properties in a "$ref" object MUST be ignored.

But, at the same time, jesse itself doesn't follow these semantics. It evaluates rules one by one and just performs a sub-check when a $ref is found. So, while the spec requires replace semantics, jesse uses "union". I don't think that's a bad thing, btw.

But here I see another problem: when using maps, if the schema and the subschema both have the same key(s), the result will be different compared to other, list-based object representations:

http://erlang.org/doc/man/maps.html#from_list-1

If the same key appears more than once, the latter (right-most) value is used and the previous values are ignored.
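A quick illustration of the collision, using Python dicts in place of Erlang maps (the key values are made up for the example, not taken from the PR): whichever side of the merge "wins" determines the result, just like maps:from_list/1 keeping the right-most value.

```python
# The referring schema and the referenced subschema share the key "minimum".
schema    = {"type": "integer", "minimum": 1}
subschema = {"minimum": 100, "description": "My key description"}

# Merge order decides which "minimum" survives -- right-most wins:
schema_wins    = {**subschema, **schema}   # keeps minimum: 1
subschema_wins = {**schema, **subschema}   # keeps minimum: 100
```

A list-based representation that simply concatenates key-value pairs could end up with both entries, and which one a lookup finds depends on the lookup's own order, hence the divergence between representations.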


The same spec states that circular references are forbidden, so I'm not sure if we should add explicit protection.

A schema MUST NOT be run into an infinite loop against a schema. For example, if two schemas "#alice" and "#bob" both have an "allOf" property that refers to the other, a naive validator might get stuck in an infinite recursive loop trying to validate the instance. Schemas SHOULD NOT make use of infinite recursive nesting like this, the behavior is undefined.

@andreineculau
Member

But, at the same time, jesse itself doesn't follow these semantics. It evaluates rules one by one and just performs a sub-check when a $ref is found. So, while the spec requires replace semantics, jesse uses "union". I don't think that's a bad thing, btw.

Correct, we do have a bug. I just forgot to open an issue. And about bad - anything that doesn't follow a standard, even a bad standard, is bad :)

It's late, and I cannot follow the map problem fully. Care to be more specific?


There are two different concepts that you're mixing. One, presented in the excerpt, is that you would not be able to validate against a JSON schema due to infinite recursion. What I'm talking about is that you would not be able to expand the schema due to infinite recursion. The latter is very much valid, e.g. a cube made of cubes made of cubes...

@andreineculau
Member

andreineculau commented May 22, 2017

bump.
I have fixed the "replace" semantics of $ref. Thanks for noticing!

Have you given any thought to the infinite recursion? Maybe you could add an option to limit the number of expansion levels before giving up and leaving the references in place?
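For illustration, one way such an option could look, sketched in Python (hypothetical function, not jesse's Erlang code): each resolved $ref spends one expansion level, and once the budget is exhausted the remaining refs are left in place, so circular schemas terminate instead of looping forever.

```python
def expand(schema, root, max_depth=10):
    """Expand local '#/...' $refs up to max_depth levels.

    Deeper refs are left as-is, so a circular schema ("a cube made of
    cubes made of cubes...") still produces a finite result.
    """
    if max_depth <= 0 or not isinstance(schema, (dict, list)):
        return schema
    if isinstance(schema, list):
        return [expand(v, root, max_depth) for v in schema]
    ref = schema.get("$ref")
    if isinstance(ref, str) and ref.startswith("#/"):
        target = root
        for part in ref[2:].split("/"):
            target = target[part]
        # One level spent per resolved ref; replace semantics.
        return expand(target, root, max_depth - 1)
    return {k: expand(v, root, max_depth) for k, v in schema.items()}
```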

@andreineculau andreineculau added this to the next major milestone May 17, 2018
@onno-vos-dev

onno-vos-dev commented Sep 14, 2021

Sorry for digging up a dead horse but after chatting with @seriyps I'd like to take a stab at finalizing this work. With a very complex JSON schema I can see amazing performance gains from resolving all references and having a pre-built schema in place.

Unresolved:

Method                                           Loops               Min               Max            Median           Average
empty loop                  2156545 *  333 = 718129485           0.003us           2.649us           0.003us           0.003us
erl_eval:'-expr/5-fun-3-'()       3 *   50 =       150       90370.080us       91817.680us       90370.080us       91120.733us

Resolved:

Method                                           Loops               Min               Max            Median           Average
empty loop                  2221610 *  333 = 739796130           0.003us           1.952us           0.003us           0.003us
erl_eval:'-expr/5-fun-3-'()      71 *   50 =      3550        2595.140us        4263.480us        2775.120us        2817.803us

The average drops from 91.12 ms to 2.82 ms :-)

I'll try and devote some time to this during the upcoming week.

@andreineculau
Member

Thanks @onno-vos-dev ! IFF you can, it would be interesting to see more exactly where we lose performance. I'm thinking that there's a chance we can improve performance regardless of materializing refs or not.

@onno-vos-dev

Thanks @onno-vos-dev ! IFF you can, it would be interesting to see more exactly where we lose performance. I'm thinking that there's a chance we can improve performance regardless of materializing refs or not.

I'll try and grab some flame graphs later today 👍 From a flamegraph I have (from a much wider code flow, which is what led me down this archaeology trip in the first place), jesse builds a huge callstack, so it gets quite noisy with a schema as large as the one I've been digging around in.

I'll see if I can either flamegraph it separately, or reduce the schema a tad to shrink the callstack and make the problematic codepath a bit more visible.

@onno-vos-dev

onno-vos-dev commented Sep 19, 2021

Apparently GitHub does not allow uploading SVGs, but it does allow uploading arbitrary zip files 🤔 Wonder which is more secure. 🤔 😄 I uploaded a zip and an MD5sum at the end of this post. Feel free to verify :-)

Anyway, sorry for the delay, I've had a few things that took priority, but here are two SVGs, jesse_old and jesse_new, where jesse_old is the schema I'm fiddling around with, containing a bunch of references, while jesse_new has all the references resolved as per the PR proposal by @seriyps.

Minor disclaimer: I only recently started digging into jesse schema validations, so I'm really not as familiar with the code as you are, and some of my findings and conclusions may be wrong, dumb or just outright silly. :-) So please forgive me if I write something totally crazy here 🤓

One can clearly see in jesse_old that we're spending a lot of time simply in validate_ref-land. When zooming in on it, we see jesse_state:resolve_ref/2 forming a good chunk of the stack. As far as I can follow, the simplified callstack would be:

  1. jesse_validator_draft4:check_value/3 (second to last function clause containing ?REF)
  2. jesse_validator_draft4:validate_ref/3
  3. jesse_validator_draft4:resolve_ref/2
  4. jesse_schema_validator:validate_with_state/3
  5. jesse_schema_validator:select_and_run_validator/4
  6. jesse_validator_draft4:check_value/3 (this time with a resolved JsonSchema, which may however enter a similar loop if the returned schema itself contains a $ref)

Considering the above callstack, we can eliminate steps 2 -> 5 with a resolved schema. If we dive into this particular part of the callstack to see if there's anything worth optimizing on its own, my thinking is the only candidate would be the call to jesse_state:resolve_ref/2 done by jesse_validator_draft4:resolve_ref/2, as that's essentially the main part we'd be removing from the callstack when resolving refs upfront.

Looking into this particular part, we see the following (flamegraph: jesse_state_resolve_ref):

When digging into two examples, the resolution of the Datetime and UUID references in my schema, we can see each takes roughly 150 us to complete.

1> perftest:test([fun() -> jesse_state:resolve_ref(State, <<"definitions.json#/definitions/Datetime">>) end]).
Method                                           Loops               Min               Max            Median           Average
empty loop                  2816498 *  333 = 937893834           0.003us           2.039us           0.003us           0.003us
erl_eval:'-expr/5-fun-3-'()    1291 *   50 =     64550         132.860us         335.160us         145.260us         154.481us
ok
2> perftest:test([fun() -> jesse_state:resolve_ref(State, <<"definitions.json#/definitions/UUID">>) end]).
Method                                           Loops               Min               Max            Median           Average
empty loop                  2863600 *  333 = 953578800           0.003us           2.183us           0.003us           0.003us
erl_eval:'-expr/5-fun-3-'()    1256 *   50 =     62800         135.840us         533.880us         146.700us         158.788us
ok

Even though that may not sound like much, for a single JSON message that I've been testing with, 359 references must be resolved at runtime. Easy math tells us that we're spending roughly 53-56 ms just resolving refs in this schema.
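The back-of-the-envelope arithmetic can be checked mechanically (the per-ref figures are the averages from the two perftest runs above):

```python
# Roughly 150 us per resolution, 359 references per message.
per_ref_us = (154.481 + 158.788) / 2  # average of the two measured refs
total_ms = 359 * per_ref_us / 1000
print(round(total_ms, 1))  # around 56 ms, consistent with the estimate
```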

There's a lot going on in jesse_state:resolve_ref/2 and none of it particularly jumps out when looking at the stack above. 😞 It would have been nice if one thing in particular took the majority of the time, but it seems quite evenly spread among the whole range of functions it has to perform in order to return the resolved JSON schema. Some of the tasks might be cacheable, but I'm not convinced it's worth it.

In my eyes, if you're dealing with a JSON schema that has 359 references, you're on your own. 😄 Resolving all references at compile time makes perfect sense, unless validation doesn't happen often enough to make it worth the trouble. It seems trivial, in my mind at least, to set up a compile step which simply resolves all the references and stores the finalized schema. That way the developer keeps the best of both worlds: when working on the schema you have nice references to refer to, while the runtime code no longer has to resolve them constantly. Alternatively, do something like @seriyps suggested and allow an option to resolve all references prior to storing them in the jesse database.

In order to take this PR forward, can we try and agree on what's required for it to become mergeable?

  • Add some basic tests?
  • Add the ability to store the resolved schema in the jesse database, defaulting to false (preserving the old behavior).

Would that be sufficient? :-)

file info:
md5sum jesse_flamegraphs_pr_47.zip
bcb3cd47fb74c97be3903d86fd00246a

virustotal.com: scan result
jesse_flamegraphs_pr_47.zip

@seriyps
Contributor Author

seriyps commented Sep 19, 2021

I think one of the issues I experienced when creating the original PR was that resolving the reference involves an ETS lookup. And an ETS lookup means copying memory from the ETS table to the process heap.
Also, if you have a schema with

{"type": "array",
 "items": {
   "$ref": "..."
 }
}

and try to validate an array of 1000 items, the reference is going to be resolved 1000 times.
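That failure mode and the obvious fix can be sketched in Python (hypothetical helper, not jesse's Erlang API): cache each distinct ref for the duration of a validation run, so 1000 array items cost one resolution instead of 1000.

```python
from functools import lru_cache

def make_resolver(root):
    """Return a per-run resolver that walks a local '#/...' pointer once
    per distinct ref and serves repeats from a cache."""
    @lru_cache(maxsize=None)
    def resolve(ref):
        target = root
        for part in ref[2:].split("/"):
            target = target[part]
        return target
    return resolve
```

With this shape, validating every element of a 1000-item array against the same `items` $ref hits the cache 999 times; only the first lookup walks the pointer (and, in jesse's case, would pay the ETS copy).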

@andreineculau
Member

bump

  1. I'm happy with materialize_refs as is, but with a note on circular refs. If basic tests can be added, great.
  2. I guess we have identified an optimization: resolve $ref once for "looping" properties like items, additionalProperties, etc
