Extract vocabularies from the specs #1510

Merged
merged 12 commits into from
Aug 17, 2024
add vocabulary proposal doc
gregsdennis committed Jul 17, 2024
commit 8dabd03f36cefc7f1e1d165708bdfa0bf198c124
199 changes: 182 additions & 17 deletions proposals/vocabularies.md
@@ -6,7 +6,7 @@ The current approach to extending JSON Schema by providing custom keywords is
very implementation-specific and therefore not interoperable.

To address this deficiency, this document proposes vocabularies as a concept
and a new Core keyword, `$vocabulary` to support it.
and a new Core keyword, `$vocabulary`, to support it.

While the Core specification will define and describe vocabularies in general,
the Validation specification will also need to change to incorporate some of
@@ -16,9 +16,9 @@ in both documents.
## Current Status

This proposal was originally integrated into both specifications, starting with
the 2019-09 release, and has been extracted as the feature is incomplete. The
feature, at best effort, was extracted in such a way as to retain the
functionality present in the 2020-12 release.
the 2019-09 release. For the upcoming stable release, the feature has been
extracted as it is incomplete. The feature, at best effort, was extracted in
such a way as to retain the functionality present in the 2020-12 release.

Trying to fit the 2020-12 version into the current specification, however,
raises some problems, and further discussion around the design of
@@ -45,28 +45,191 @@ also apply to this document.

### Problem Statement

The specification allows implementations to support user-defined keywords.
However, this vague and open allowance has drawbacks.
To support extensibility, the specification allows implementations to support
keywords that are not defined in the specifications themselves. However, this
vague and open allowance has drawbacks.

1. This isn't a requirement, it is a permission. An implementation could just as
easily (_more_ easily) choose _not_ to support user-defined keywords.
1. Such support is not a requirement; it is a permission. An implementation
could just as easily (_more_ easily) choose _not_ to support extension
keywords.
2. There is no prescribed mechanism by which an implementation should provide
this support. As a result, each implementation that _does_ have the feature
supports it in different ways.
3. Support for any given user-defined keyword will be limited to that
implementation. Unless the user explicitly configures another
implementation, their keywords likely will not be supported.
3. Support for any given user-defined keyword will be limited to the
implementations which are explicitly configured for that keyword. For a user
defining their own keyword, this becomes difficult and/or impossible
depending on the varying support for extension keywords offered by the
implementations the user is using.

This exposes a need for the specification(s) to define a way for implementations
to share knowledge of a keyword or group of keywords.
This exposes a need for an implementation-agnostic approach to
externally-defined keywords as well as a way for implementations to declare
support for them.

### Solution

<!-- What is the solution? Include examples of use. -->
Two new concepts, vocabularies and dialects, will be introduced into the Core
specification.

A vocabulary is identified by an absolute URI and is used to define a set of
keywords. A vocabulary is generally defined in a human-readable _vocabulary
description document_. (The URI for the vocabulary may be the same as the URL of
where this vocabulary description document can be found, but no recommendation
is made either for or against this practice.)

A new keyword, `$vocabulary`, will be introduced into the Core specification as
well. This keyword's value is an object with vocabulary URIs as keys and
booleans as values. This keyword only has meaning within a meta-schema. A
meta-schema which includes a vocabulary's URI in its `$vocabulary` keyword is
said to "include" that vocabulary.

```jsonc
{
"$schema": "https://example.org/draft/next/schema",
"$id": "https://example.org/schema",
"$vocabulary": {
"https://example.org/vocab/vocab1": true,
"https://example.org/vocab/vocab2": true,
"https://example.org/vocab/vocab3": false
},
// ...
}
```

A dialect is the set of vocabularies listed by a meta-schema. It is ephemeral
and carries no identifier.

_**NOTE** It is possible for two meta-schemas, which would have different `$id`
values, to share a common dialect if they both declare the same set of
vocabularies._
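
The note above can be illustrated with a minimal Python sketch. It assumes a dialect is compared purely as the set of vocabulary URIs, ignoring the boolean flags; the proposal does not say whether the flags take part in the comparison. The `dialect_of` helper and both `$id` values are hypothetical.

```python
def dialect_of(meta_schema: dict) -> frozenset:
    """Return the set of vocabulary URIs listed in a meta-schema's $vocabulary.

    Assumption: the boolean flags are ignored when comparing dialects.
    """
    return frozenset(meta_schema.get("$vocabulary", {}))


# Two hypothetical meta-schemas with different identifiers.
meta_a = {
    "$id": "https://example.org/schema-a",
    "$vocabulary": {
        "https://example.org/vocab/vocab1": True,
        "https://example.org/vocab/vocab2": True,
    },
}
meta_b = {
    "$id": "https://example.org/schema-b",
    "$vocabulary": {
        "https://example.org/vocab/vocab2": True,
        "https://example.org/vocab/vocab1": True,
    },
}

# Different $id values, same set of vocabularies: a shared dialect.
same_dialect = dialect_of(meta_a) == dialect_of(meta_b)
```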

A schema that declares a meta-schema (via `$schema`) which contains
`$vocabulary` is declaring that only those keywords defined by the included
vocabularies are to be processed when evaluating the schema. All other keywords
are to be considered "unknown" and handled accordingly.
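
As a rough sketch of this rule, the Python below partitions a schema's keywords into those defined by an included vocabulary and those that are "unknown". The `KEYWORDS_BY_VOCAB` registry, the `split_keywords` helper, and the `maxDate` keyword are hypothetical, and Core keywords such as `$schema` are left out for brevity.

```python
# Hypothetical registry: the keywords each vocabulary is assumed to define.
KEYWORDS_BY_VOCAB = {
    "https://example.org/vocab/vocab1": {"minDate"},
    "https://example.org/vocab/vocab2": {"maxDate"},
}


def split_keywords(schema: dict, meta_schema: dict) -> tuple:
    """Partition a schema's keywords into (processed, unknown) sets."""
    known = set()
    for uri in meta_schema.get("$vocabulary", {}):
        known |= KEYWORDS_BY_VOCAB.get(uri, set())
    processed = {keyword for keyword in schema if keyword in known}
    unknown = set(schema) - processed
    return processed, unknown


# This meta-schema includes only vocab1, so only `minDate` is processed.
meta_schema = {"$vocabulary": {"https://example.org/vocab/vocab1": True}}
schema = {"minDate": "2024-01-01", "maxDate": "2024-12-31"}

processed, unknown = split_keywords(schema, meta_schema)
```

How an "unknown" keyword is then handled is left to the rest of the specification; the sketch only performs the classification.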

The boolean values in `$vocabulary` signify implementation requirements for each
vocabulary.

- A `true` value indicates that the implementation must recognize the vocabulary
and be able to process each of the keywords defined by it. If an implementation
does not recognize the vocabulary or cannot process all of its defined
keywords, the implementation must refuse to process the schema. These
vocabularies are also known as "required" vocabularies.
- A `false` value indicates that the implementation is not required to recognize
the vocabulary or its keywords and may continue processing the schema anyway.
However, keywords that are not recognized or supported must be considered
"unknown" and handled accordingly. These vocabularies are also known as
"optional" vocabularies.
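
The two cases above might be checked along the following lines. This is a sketch only: `UnsupportedVocabularyError`, `check_vocabularies`, and the `supported` set are hypothetical names, and a real implementation would also need to confirm that it can process every keyword of a vocabulary it claims to recognize.

```python
class UnsupportedVocabularyError(Exception):
    """Raised when a required vocabulary is not recognized."""


def check_vocabularies(meta_schema: dict, supported: set) -> set:
    """Refuse on an unrecognized required vocabulary.

    Returns the optional vocabularies that are not recognized; their
    keywords would then be treated as "unknown".
    """
    skipped = set()
    for uri, required in meta_schema.get("$vocabulary", {}).items():
        if uri in supported:
            continue
        if required:
            raise UnsupportedVocabularyError(uri)
        skipped.add(uri)
    return skipped


META_SCHEMA = {
    "$vocabulary": {
        "https://example.org/vocab/vocab1": True,
        "https://example.org/vocab/vocab2": True,
        "https://example.org/vocab/vocab3": False,
    }
}

# An implementation that recognizes vocab1 and vocab2 may proceed;
# vocab3 is optional, so it is merely reported as skipped.
skipped = check_vocabularies(
    META_SCHEMA,
    {"https://example.org/vocab/vocab1", "https://example.org/vocab/vocab2"},
)
```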

A schema will typically, though not necessarily, accompany the vocabulary
description document. This _vocabulary schema_ should carry an `$id` value which
is distinct from the vocabulary URI. The purpose of the vocabulary schema is to
provide syntactic validation of the values of the vocabulary's keywords when the
schema is being validated by a meta-schema that includes the vocabulary. (A
vocabulary schema is not itself a meta-schema since it does not validate entire
schemas.) To facilitate this extra validation, when a vocabulary schema is
Comment on lines +129 to +131
Member

We've always defined a meta-schema to be a schema that describes a schema. These vocabulary schemas do fit that definition. I understand what you're trying to say here, but I don't think saying it's not a meta-schema is the right approach. Didn't Henry update the spec at some point with a way to describe this behavior without having to say it's ignored or it's not a meta-schema?

Member Author

He didn't update the spec or anything. This actually came out of reviewing @jviotti's book and trying to rework my vocab schemas. I wrote about it here, and we pulled out $vocabulary from them in #1460, #1461, and #1462.

They're not meta-schemas because they don't themselves describe full schemas; they are used by meta-schemas as components.

Member

I don't think we're quite talking about the same thing. I entirely agree that vocabulary meta-schemas shouldn't include the $vocabulary keyword.

> They're not meta-schemas because they don't themselves describe full schemas; they are used by meta-schemas as components.

This distinction is what doesn't sit right with me. The way I see it, a component of a meta-schema is still a meta-schema. Section 8.1.2.2 is what I thought you were referring to here. Is that correct?

It does seem like the wording there is not considering schemas referenced by a meta-schema a meta-schema, but that's never how I've understood the word or how we've defined the word. I've always used the terms dialect meta-schema and vocabulary meta-schema. The $vocabulary keyword is only meaningful when the meta-schema is used as a dialect meta-schema.

Member Author

No, I wasn't referring to that section.

I'm looking at Core 4.3.4 which defines "meta-schema":

> A schema that itself describes a schema is called a meta-schema.

A vocabulary schema doesn't describe a schema, therefore it's not a meta-schema.

You wouldn't use the Meta-data vocab schema as a meta-schema; you reference it from a meta-schema.

Member

Interesting. We're using the same definition, but interpreting it differently. The way I see it, a vocabulary schema is validating a schema. It validates the syntax of keywords in a schema. I don't think the definition implies that a schema is only a meta-schema if it describes the entire dialect used by the schema.

Member Author

> Do we need to publish a canonical vocab metaschema schema?

I'm not sure what you're leading toward. Are you questioning the current practice of defining a vocab schema? Or are you saying that we should have a vocab schema that can function as a meta-schema?

Member

If I got confused by this, others doubtlessly will (or have already) as well.

I 100% agree that we need to explain this better and using more descriptive terms is a big part of doing that well. But, the way you're currently expressing this is confusing to me because this doesn't mesh with my (and I'm sure others') understanding of the term "meta-schema". That's why I'd prefer to address this by introducing new terms that are more specific. Earlier I mentioned that I use the terms "dialect meta-schema" and "vocabulary meta-schema", but since the term "meta-schema" appears to be inconsistently understood, perhaps "dialect schema" and "vocabulary schema" are better.

Copy link

> > Do we need to publish a canonical vocab metaschema schema?
>
> I'm not sure what you're leading toward. Are you questioning the current practice of defining a vocab schema? Or are you saying that we should have a vocab schema that can function as a meta-schema?

I'm not questioning the current practice. But I would not have made the mistake if we'd had a "vocab metaschema" schema that clearly defined what was allowed in a vocab metaschema. It would have been apparent that we were talking about a very similar, but different entity.

And I am tending towards the idea of using the terms "dialect schema" and "vocabulary schema".

Member

> I would not have made the mistake if we'd had a "vocab metaschema" schema that clearly defined what was allowed in a vocab metaschema. It would have been apparent that we were talking about a very similar, but different entity.

What would be the difference between these two meta-schemas? Everything that's allowed in a vocab schema is allowed in a dialect schema. Everything that's allowed in a dialect schema is allowed in a vocab schema. The only difference I can see would be their identifier.

Member Author

I agree with @jdesrosiers on this: syntactically they're just schemas.

The only difference is intent. A meta-schema is intended to describe/validate a schema. It can use vocab schemas (by reference) to accomplish that task. A vocab schema on its own does not describe a schema; it's merely a building block in a meta-schema. This is the distinction I'm making. (One could create a single-vocab meta-schema.)

I don't think having a separate meta-schema for vocab schemas is needed since anything you do in a general schema should technically be allowed in a vocab schema. I'm not sure how useful all of those features would be, but I don't see a reason to forbid any features.

provided, any meta-schema which includes the vocabulary should also contain a
reference (via `$ref`) to the vocabulary schema's `$id` value.

```jsonc
{
"$schema": "https://example.org/draft/next/schema",
"$id": "https://example.org/schema",
"$vocabulary": {
"https://example.org/vocab/vocab1": true,
"https://example.org/vocab/vocab2": true,
"https://example.org/vocab/vocab3": false
},
"allOf": [
{"$ref": "meta/vocab1"}, // https://example.org/meta/vocab1
{"$ref": "meta/vocab2"}, // https://example.org/meta/vocab2
{"$ref": "meta/vocab3"} // https://example.org/meta/vocab3
],
// ...
}
```
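
The URLs in the comments above follow from ordinary relative-reference resolution against the meta-schema's `$id`. A small Python sketch using the standard library's `urljoin` shows the resolution; it assumes `$id` is the only base URI in play.

```python
from urllib.parse import urljoin

# The meta-schema from the example, reduced to the parts needed here.
meta_schema = {
    "$id": "https://example.org/schema",
    "allOf": [
        {"$ref": "meta/vocab1"},
        {"$ref": "meta/vocab2"},
        {"$ref": "meta/vocab3"},
    ],
}

# Resolve each relative $ref against the meta-schema's own $id.
resolved = [
    urljoin(meta_schema["$id"], subschema["$ref"])
    for subschema in meta_schema["allOf"]
]
```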

Finally, the keywords in both the Core and Validation specifications will be
divided into multiple vocabularies. The keyword definitions will be removed from
the meta-schema and added to vocabulary schemas to which the meta-schema will
contain references. In this way, the meta-schema's functionality remains the same.

```json
{
"$schema": "https://json-schema.org/draft/next/schema",
"$id": "https://json-schema.org/draft/next/schema",
"$vocabulary": {
"https://json-schema.org/draft/next/vocab/core": true,
"https://json-schema.org/draft/next/vocab/applicator": true,
"https://json-schema.org/draft/next/vocab/unevaluated": true,
"https://json-schema.org/draft/next/vocab/validation": true,
"https://json-schema.org/draft/next/vocab/meta-data": true,
"https://json-schema.org/draft/next/vocab/format-annotation": true,
"https://json-schema.org/draft/next/vocab/content": true
},
"$dynamicAnchor": "meta",

"title": "Core and Validation specifications meta-schema",
"allOf": [
{"$ref": "meta/core"},
{"$ref": "meta/applicator"},
{"$ref": "meta/unevaluated"},
{"$ref": "meta/validation"},
{"$ref": "meta/meta-data"},
{"$ref": "meta/format-annotation"},
{"$ref": "meta/content"}
]
}
```

The division of keywords among the vocabularies will be in accordance with the
2020-12 specification (for now).

### Limitations

<!-- Are there any limitations inherent to the proposal? -->
#### Unknown Keywords and Unsupported Vocabularies

This proposal, in its current state, seeks to mimic the behavior defined in the
2020-12 specification. However, the current specification's disallowance of
unknown keywords presents a problem for schemas that use keywords from optional
vocabularies. (This is the topic of the discussion at
https://github.com/orgs/json-schema-org/discussions/342.)

In short, if a schema uses a keyword from an unknown _optional_ vocabulary, the
implementation cannot proceed because unknown keywords are explicitly
disallowed. However, not being able to proceed with evaluation is the behavior
prescribed for _required_ vocabularies. Thus, if the behaviors for required and
optional vocabularies are the same, then the boolean value is moot, which
highlights that the structure of `$vocabulary` needs to be reconsidered.
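
The problem can be made concrete with a small decision function. This is an illustrative sketch rather than specified behavior: `can_proceed` is a hypothetical name, and it assumes the schema uses at least one keyword from the vocabulary in question.

```python
def can_proceed(recognized: bool, required: bool,
                unknown_keywords_allowed: bool) -> bool:
    """Decide whether evaluation may proceed.

    Assumption: the schema uses at least one keyword from the vocabulary.
    """
    if recognized:
        return True
    if required:
        # Required and unrecognized: the implementation must refuse.
        return False
    # Optional and unrecognized: the keyword is "unknown", so the outcome
    # depends entirely on whether unknown keywords are permitted.
    return unknown_keywords_allowed


# With unknown keywords disallowed, required and optional behave the same.
required_outcome = can_proceed(recognized=False, required=True,
                               unknown_keywords_allowed=False)
optional_outcome = can_proceed(recognized=False, required=False,
                               unknown_keywords_allowed=False)
```

The boolean only changes the result when unknown keywords are allowed, which is the case the current specification rules out.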

#### Machine Readability

The vocabulary URI is an opaque value. There is no data that an implementation
can reference to identify the keywords defined by the vocabulary. The vocabulary
schema _implies_ this, but scanning a `properties` keyword isn't very reliable.
Moreover, such a system cannot provide metadata about the keywords. As such, the
user must explicitly ensure that the implementation recognizes and supports the
vocabulary, which isn't much of an improvement over the current state.
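
To see why scanning `properties` falls short, consider this sketch. The `guess_keywords` helper and the sample vocabulary schema are hypothetical; the `patternProperties` entry is just one assumed way a vocabulary schema could describe keywords that a `properties` scan would miss.

```python
def guess_keywords(vocab_schema: dict) -> set:
    """Naively infer a vocabulary's keywords from its schema's `properties`."""
    return set(vocab_schema.get("properties", {}))


# Hypothetical vocabulary schema: one keyword appears under `properties`,
# while a family of keywords is described only by a pattern.
vocab_schema = {
    "$id": "https://example.org/meta/vocab1",
    "properties": {"minDate": {"type": "string"}},
    "patternProperties": {"^x-": {}},
}

guessed = guess_keywords(vocab_schema)
```

The scan finds `minDate` but learns nothing about the pattern-described keywords, nor anything about what any of the keywords do.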

Having some sort of "vocabulary definition" file could alleviate this.

One reason for _not_ having such a file is that, at least for functional
keywords, the user generally needs to provide custom code to the implementation
to process the keywords, thus performing that same explicit configuration
anyway. (Such information cannot be gleaned from a vocabulary specification. For
example, an implementation can't know what to do with a hypothetical `minDate`
keyword.)

#### Implicit Inclusion of Core Vocabulary

Because the Core keywords (the ones that start with `$`) instruct an
implementation on how a schema should be processed, inclusion of the Core
Vocabulary is mandatory and assumed. As such, while excluding the Core
Vocabulary from the `$vocabulary` keyword has no effect, it is generally advised
as common practice to include the Core Vocabulary explicitly.
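
One way an implementation might model the implicit inclusion is to normalize `$vocabulary` before use. The sketch below is an assumption about implementation strategy, not specified behavior; `effective_vocabularies` is a hypothetical helper, and the Core vocabulary URI is taken from the meta-schema example above.

```python
CORE_VOCAB = "https://json-schema.org/draft/next/vocab/core"


def effective_vocabularies(meta_schema: dict) -> dict:
    """Return $vocabulary with the Core vocabulary always present and required."""
    vocabularies = dict(meta_schema.get("$vocabulary", {}))
    # Assumption: Core is treated as required whether or not it is listed.
    vocabularies[CORE_VOCAB] = True
    return vocabularies


without_core = {"$vocabulary": {"https://example.org/vocab/vocab1": True}}
with_core = {
    "$vocabulary": {
        CORE_VOCAB: True,
        "https://example.org/vocab/vocab1": True,
    }
}

# Listing Core explicitly and omitting it yield the same effective set.
equivalent = effective_vocabularies(without_core) == effective_vocabularies(with_core)
```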

This can be confusing and difficult to use/implement, and we probably need
something better here.

## Change Details

@@ -91,12 +254,14 @@ For example
```
-->

_**NOTE** Since the design of vocabularies will be changing anyway, it's not worth the time and effort to fill in this section just yet. As such, please read the above sections for loose requirements. For tighter requirements, please assume conformance with the 2020-12 Core and Validation specifications._

## [Appendix] Change Log

* [MMMM YYYY] Created
* 2024-06-10 - Created

## [Appendix] Champions

| Champion | Company | Email | URI |
|----------------------------|---------|-------------------------|----------------------------------|
| Your Name | | | < GitHub profile page > |
| Greg Dennis | | gregsdennis@yahoo.com | https://github.com/gregsdennis |