Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deep validation and instead use error hierarchy to improve error messages #2842

Merged
merged 11 commits into from
Feb 5, 2023

Conversation

binste
Copy link
Contributor

@binste binste commented Jan 20, 2023

Background

Many exception messages that a user sees in Altair are based on the validation of the chart specification against the vega-lite schema. This validation with the package jsonschema produces error messages on multiple level of the schema, e.g. if an error happens in a subschema, it produces errors for the subschema as well as the schemas above. For the user, the most informative errors are the one directly pinpointing the root cause, i.e. the errors on the lowest level where the wrong specification occurs.

To generate and raise these "deep" errors, #555 introduced "deep validation". This meant that a chart is validated once using jsonschema, if a validation error is raised, the chart is validated again but this time the validation is performed iteratively on all individual subschemas in a recursive fashion. This was a great improvement and I'm not sure if there was a better way at that time. However, these subschemas by themselves are not always valid, see #2609 for an example, which in turn leads to new confusing error messages which have nothing to do with the original issue.

Proposed solution

This PR removes the deep validation again. Instead, it only validates the chart specifications once and then uses the exception hierarchy which is returned by jsonschema to find the root cause. See the "Walkthrough" section below for more details. With this approach, we can still raise the exception which is at the root of the issue, but there is no need to validate individual schemas and therefore issues such as #2609 are solved:

import altair as alt
from vega_datasets import data

source = data.cars.url

points = alt.Chart(source).mark_point().encode(
    x='Horsepower:Q',
    y='Miles_per_Gallon:Q',
)
(points | points).properties(width=400)

raises correctly:

SchemaValidationError: Invalid specification

        altair.vegalite.v5.api.HConcatChart, validating 'additionalProperties'

        Additional properties are not allowed ('width' was unexpected)

Furthermore, we can combine exceptions on the same "hierarchy level" in the schema to provide even more messages:

alt.Chart(data.barley()).mark_bar().encode(
    x=alt.X('variety'),
    y=alt.Y('sum(yield)', stack='asdf'),
)
SchemaValidationError: Invalid specification

        altair.vegalite.v5.api.Chart->0, validating 'enum'

        'asdf' is not one of ['zero', 'center', 'normalize']
        'asdf' is not of type 'null'
        'asdf' is not of type 'boolean'

Previously, the error only showed the line 'asdf' is not one of ['zero', 'center', 'normalize'], although stack could also be a boolean or null.

I already added these examples and a few more to the tests to make sure that informative exceptions are raised. Let me know if you have more that I can test against!

Fixes #2175, fixes #2609, fixes #2744, related PRs which are probably no longer needed: #2841, #2835
I tested the code also against jsonschema 3.0 which is the minimum version required by Altair.

Walkthrough

To help with the review process, I try to outline below how we get from an invalid chart specification to the final error message which is raised. I can't put in links to all the places in the code I refer to as they are not modified in the PR. Let me know if you need more details.

Let's start with the example from above:

  • The validation of a chart object is still triggered in the TopLevelMixin.to_dict method which then ends up calling super(...).to_dict here which propagates to SchemaBase.to_dict
  • SchemaBase.to_dict calls self.validate which uses the validate_jsonschema function to validate the specification. This raises a jsonschema.ValidationError. This error is catched in SchemaBase.to_dict and turned into a custom SchemaValidationError which has better error messages and then raised again
  • Without this PR. This error would read {'aggregate': 'sum', 'field': 'yield', 'stack': 'asdf', 'type': 'quantitative'} is not valid under any of the given schemas which is not very informative. This is why the deep validation was originally introduced.
  • This error would be catched in TopLevelMixin.to_dict and then the deep validation is run iterating through all objects in the specification and validating them individually. This raises the final error 'asdf' is not one of ['zero', 'center', 'normalize']to the user which is much better.

With the new approach implemented in this PR, we use the error hierarchy which is in the context attribute of the original jsonschema.ValidationError which was raised. "If the error was caused by errors in subschemas, the list of errors from the subschemas will be available on this property.", see the jsonschema docs for all the details. These errors from the subschemas have again a context attribute which allows to find the root error. I added a new function _get_most_relevant_errors which tries to find the most relevant errors based on this hierarchy. For the example it looks like this:

  • The originally raised error is: ValidationError: "{'aggregate': 'sum', 'field': 'yield', 'stack': 'asdf', 'type': 'quantitative'} is not valid under any of the given schemas and from now on referenced with top_error
  • top_error.context gives
    [<ValidationError: "'asdf' is not valid under any of the given schemas">,
     <ValidationError: "Additional properties are not allowed ('aggregate', 'field' were unexpected)">,
     <ValidationError: "'asdf' is not valid under any of the given schemas">,
     <ValidationError: "Additional properties are not allowed ('aggregate', 'field', 'stack', 'type' were unexpected)">,
     <ValidationError: "'value' is a required property">]
  • The function always takes the first error in the context and sees if this is already the lowest level, i.e. if there is another context or if its None. In this case, there is more. top_error.context[0].context gives:
    [<ValidationError: "'asdf' is not one of ['zero', 'center', 'normalize']">,
     <ValidationError: "'asdf' is not of type 'null'">,
     <ValidationError: "'asdf' is not of type 'boolean'">]
  • Calling .context again on the first validation error in this list shows that it is None and therefore we reached the bottom of the hierarchy.
  • The old deep validation approach would have raised this error alone. We can now use all 3 errors to craft a more informative error message. _get_most_relevant_errors will return all 3 errors. validate_jsonschema will then raise the first of them (ValidationError: "'asdf' is not one of ['zero', 'center', 'normalize']") and add the other two as _additional_errors attribute to the first. For errors of type additionalProperties only the first error on the lowest level is enough as the others do not contain more information, they are instead more verbose and somewhat confusing messages.
  • This error that is now raised is still converted, as in the old approach, to a SchemaValidationError so it should not break any users code if they have except statements. We use the 2 additional errors to craft the final error message
SchemaValidationError: Invalid specification

        altair.vegalite.v5.api.Chart->0, validating 'enum'

        'asdf' is not one of ['zero', 'center', 'normalize']
        'asdf' is not of type 'null'
        'asdf' is not of type 'boolean'

@mattijn
Copy link
Contributor

mattijn commented Jan 23, 2023

Great work @binste! @joelostblom is probably more capable in reviewing this than myself.
One question though, maybe not specific on this PR, but more in general on error messages based on schema such as:

'asdf' is not of type 'null'

In Python Altair, None is used instead of 'null'. There may be other instances of this. Is it necessary to translate these instances if they appear in an error message?

@ChristopherDavisUCI
Copy link
Contributor

@mattijn Is this dictionary possibly relevant to your comment?

https://github.com/altair-viz/altair/blob/a8172b7134ed262224e5dfb4e0c99d4bdc55bcb0/tools/schemapi/codegen.py#L226-L234

(In another version, in tools/schemapi/utils.py, it was mapping instead of dict. Let me know if you think it's a mistake that I am using dict in this context.)

@binste
Copy link
Contributor Author

binste commented Jan 25, 2023

I think translating those types would be very helpful to users! Thanks @ChristopherDavisUCI, definitely looks relevant. There is also https://github.com/altair-viz/altair/blob/a8172b7134ed262224e5dfb4e0c99d4bdc55bcb0/tools/schemapi/utils.py#L183 which is only slightly different. I'm happy to tackle this as part of a larger effort around types as mentioned in #2846 (comment) but if anyone gets a chance earlier to replace the types in the error messages then please, don't hold back! :)

@joelostblom
Copy link
Contributor

joelostblom commented Jan 26, 2023

In short, this is looking fantastic! Thank you so much for digging into the depths of this @binste and sorry for the delay in reviewing. I am going to get to this in the next couple of days.

On brief comment now, do you think it is confusing to see three error messages like in your example above? I don't know what would be a feasible solution technically here, but I am thinking that instead of

'asdf' is not one of ['zero', 'center', 'normalize']
'asdf' is not of type 'null'
'asdf' is not of type 'boolean'

the message would be clearer as:

'asdf' needs to be one of ['zero', 'center', 'normalize'] or of type 'null' or boolean'

This is a minor comment though

@binste
Copy link
Contributor Author

binste commented Jan 28, 2023

No worries and thank you for the kind words! :)

Indeed, it would be a nice enhancement if the error messages can be merged into a one-liner. For that we probably need to check how consistent these strings are and if that changes across versions for jsonschema. Or maybe we can construct it ourselves from some other underlying attributes of the exceptions. I made a note for myself to create a new issue with your suggestion once this PR is merged so we can look at it separately in detail.

Copy link
Contributor

@joelostblom joelostblom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I said above, this PR is fantastic and makes both the code logic and the error message much more straightforward! One additional benefit of this PR is that it makes the traceback much shorter. Before, it was common to have around 120 lines or so of traceback; now that is down to 20, which makes it much easier to find the error message without having to scroll so much.

Also a big thanks for the details comments in the code and explanations in the PR which made it much easier to review. I have a few minor comments and then we're good to merge.

altair/utils/schemapi.py Outdated Show resolved Hide resolved
altair/utils/schemapi.py Show resolved Hide resolved
tests/utils/tests/test_schemapi.py Outdated Show resolved Hide resolved
tests/utils/tests/test_schemapi.py Outdated Show resolved Hide resolved
altair/utils/schemapi.py Show resolved Hide resolved
altair/utils/schemapi.py Show resolved Hide resolved
@binste
Copy link
Contributor Author

binste commented Jan 28, 2023

Thank you @joelostblom for the review and the great suggestions! Just fyi, I'll probably get to implementing them in a week or so. Same for any other open PRs I'm involved with.

@binste
Copy link
Contributor Author

binste commented Feb 4, 2023

All comments except two are implemented. For those I'll wait for your feedback.

@binste
Copy link
Contributor Author

binste commented Feb 5, 2023

Thanks for the review! I also added a line to the changelog. Not sure if this is the preferred way moving forward to keep it up-to-date? Could be nice to include a line for every PR which is relevant for the changelog but let me know if I should remove it again.

@joelostblom
Copy link
Contributor

Yup I agree, I think each notable PR should come with a changelog entry so that we can easier keep track. Will merge when checks go green.

@mattijn
Copy link
Contributor

mattijn commented Feb 5, 2023

Whoop whoop. Congrats @binste for this one!

@binste
Copy link
Contributor Author

binste commented Feb 5, 2023

Happy that this approach worked out, thanks! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants