Any thought to Avro style "aliases" for basic schema evolution?? #285

Closed
cavanaug opened this issue Mar 26, 2017 · 11 comments

Comments

@cavanaug

Has there been any thought given to leveraging Avro's concept of aliases, which allows for basic schema evolution?

https://avro.apache.org/docs/current/spec.html#Aliases

While not applicable in all cases, aliases can make life a bit easier for those receiving data when the schema varies slightly between senders.
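
As a rough illustration of the idea, here is a minimal sketch of alias-based renaming applied to plain JSON objects. The `aliases` keyword used here is hypothetical (it is borrowed from Avro and is not a JSON Schema keyword), `resolve_aliases` is a made-up helper, and Avro's real schema resolution rules cover more than this.

```python
def resolve_aliases(record, schema):
    """Rename fields in `record` to the names the reader's schema expects.

    `schema` uses a hypothetical "aliases" keyword (borrowed from Avro,
    not part of JSON Schema) listing older names for each property.
    """
    # Map every old name to the current property name.
    old_to_new = {}
    for name, subschema in schema.get("properties", {}).items():
        for alias in subschema.get("aliases", []):
            old_to_new[alias] = name

    resolved = {}
    for key, value in record.items():
        new_key = old_to_new.get(key, key)
        # If both the old and the current name are present, the current name wins.
        if new_key in resolved and key != new_key:
            continue
        resolved[new_key] = value
    return resolved


# The reader's schema knows the field was once called "user_name" or "login".
reader_schema = {
    "type": "object",
    "properties": {
        "username": {"type": "string", "aliases": ["user_name", "login"]},
        "email": {"type": "string"},
    },
}

old_record = {"user_name": "cavanaug", "email": "c@example.com"}
print(resolve_aliases(old_record, reader_schema))
```

The receiver keeps a single schema and still accepts data from senders that have not yet adopted the new field name.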

@handrews
Contributor

@cavanaug I'm not familiar with Avro beyond knowing it exists, but with a very quick read it looks like this feature is for dealing with field name changes, rather than type or semantic changes. Is that correct?

I try to avoid such changes, but I suppose everyone does and none of us are perfect :-) My evolution strategy is to version the schema's URI rather than the resource's, and therefore allow the schema version to be part of content negotiation. If a client asks for an older data format, I'll just send that. The response construction is driven by the schema anyway, so there's no real cost to supporting both the old and new schemas for something like a rename.
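
A loose sketch of that strategy, under several assumptions: the schema URIs are invented, the `source` annotation that maps a property to internal data is hypothetical (not a JSON Schema keyword), and how the client actually communicates the requested schema URI is left out.

```python
# Hypothetical versioned schema URIs; the two versions differ only by a rename.
V1 = "https://example.com/schemas/user/v1"
V2 = "https://example.com/schemas/user/v2"

SCHEMAS = {
    V1: {"properties": {"user_name": {"source": "name"}}},
    V2: {"properties": {"username": {"source": "name"}}},
}


def build_response(internal, requested_schema_uri=None):
    """Build a response body shaped by whichever schema version was requested.

    Falls back to the newest schema when the request names no known version.
    """
    uri = requested_schema_uri if requested_schema_uri in SCHEMAS else V2
    props = SCHEMAS[uri]["properties"]
    # The schema drives construction: each property pulls from internal data.
    body = {prop: internal[spec["source"]] for prop, spec in props.items()}
    return uri, body


internal_user = {"name": "cavanaug"}
print(build_response(internal_user, V1))
print(build_response(internal_user))
```

Because the response is assembled from the schema, supporting the old name costs one extra schema entry rather than a second code path.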

If you're looking at data type or semantic evolution, is there anything that describes that evolution strategy somewhere? (I confess, I haven't made much of an effort to look myself yet)

@cavanaug
Author

cavanaug commented Mar 26, 2017

Avro schema evolution isn't perfect, but it basically allows for field addition, deletion, and renaming. So it is in essence only for simple syntactic evolution. But in many cases having that level of flexibility is pretty useful.

Avro use cases often fall outside a client/server request model: they are more apt to exist in data processing flows (à la pub/sub systems like Kafka/Kinesis) where producers and subscribers may evolve at different paces, outside of a content-negotiation-style model.

Advanced evolution and semantic evolution, in my mind, fall outside of a declarative syntax model, which is sort of why I said Avro isn't perfect. In data processing systems it gives you a few extra capabilities, but when those are exhausted you are back to writing some type of "normalization" code as part of a data flow.

I've often longed for some of the Avro capabilities in JSON to handle those situations where simple syntactic evolution is all I need. It provides a declarative model I can use without resorting to custom normalization code.

@handrews
Contributor

handrews commented Mar 26, 2017

@cavanaug thanks for the explanation!

Field addition and deletion work pretty well as long as you avoid making things required and don't set "additionalProperties": false, [EDIT: I'd written true by mistake] and make sure your client really doesn't choke on unanticipated or missing fields. Renaming is definitely trickier.

@cavanaug
Author

Yep, the additionalProperties must be false and you need to set a default for all "new" fields to achieve a base level of evolution. The field renames with aliases allow another level of capability.

Though I am starting to wonder in the data processing world if there isn't perhaps a need for a more powerful & generalized json-evolution style solution. Lots of data processing involves varying degrees of "cleanups" on data.

@handrews
Contributor

Yep, the additionalProperties must be false

Actually I'd meant to say to not set it to false but accidentally wrote true instead :-/ You have to allow additional properties so that when they start to appear, a client with the old schema won't suddenly fail to validate.
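
To make the point concrete, here is a minimal sketch of a validator that checks only `required` and `additionalProperties` on a flat object. It is a deliberately tiny subset of JSON Schema validation written for illustration, not a real implementation, and the example schemas and field names are made up.

```python
def validate(instance, schema):
    """Check only `required` and `additionalProperties` on a JSON object.

    Returns a list of error strings; an empty list means the instance passed.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    # additionalProperties defaults to allowing anything.
    if schema.get("additionalProperties", True) is False:
        known = schema.get("properties", {})
        for name in instance:
            if name not in known:
                errors.append(f"unexpected property: {name}")
    return errors


# An old client schema, in an open and a closed variant.
old_open = {"properties": {"id": {}, "name": {}}, "required": ["id"]}
old_closed = dict(old_open, additionalProperties=False)

# Data from a newer producer that has started sending "nickname".
new_data = {"id": 1, "name": "x", "nickname": "y"}

print(validate(new_data, old_open))    # []
print(validate(new_data, old_closed))  # ['unexpected property: nickname']
```

The open schema keeps validating when the producer adds a field, while the closed one starts rejecting data the moment the field appears.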

@handrews
Contributor

Though I am starting to wonder in the data processing world if there isn't perhaps a need for a more powerful & generalized json-evolution style solution. Lots of data processing involves varying degrees of "cleanups" on data.

This sounds kind of like an XSLT for JSON, which is an idea that has popped up here before but is definitely outside of the current scope of this project. Maybe once our validation schema reaches or is close to RFC, we can look at building such a thing as another vocabulary on top of it. Does that make sense? I also don't think anyone would object to someone else starting a parallel project, but we have our hands more than full with the media type and two current vocabularies (validation and hyper-schema) plus two proposed (UI and documentation) in the queue already.

Thinking more about the original aliases question, I believe that is out of scope for us as well, as it is really just a small subset of a transformation system.

@cavanaug
Author

XSLT is sort of an abomination in my mind in terms of complexity, so I shudder at the parallel. But yes, I may end up thinking & working more on that.

As for aliases, my suggestion is not to consider support in the broader context of advanced evolution, but instead to consider it in the context of Avro feature parity. Granted, I know a lot of people here are focused more on the JSON web angle, but don't discount the heavy usage of JSON in data environments and the potential growth there for JSON Schema usage.

With alias support, JSON Schema would be a functional equivalent to Avro in many data processing environments. That is a pretty powerful statement to make.

@Relequestual
Member

I've had quite a look at Avro in the last few years, and it was for a while the "protocol" (if we can call it that for now) of choice for a big international data exchange API.

There was lots of discussion and disagreement over how it should be used. Should it be used only to define the structure of data, or should the full ruleset of Avro be used for over-the-wire transmission as well?
One problem it has is that the JSON must be in a specific order... but this goes against JSON in general, so it's not great.

It became apparent pretty fast that the libraries which implement AVRO are mostly not interoperable.

Eventually I pushed to have a larger discussion about the project and its future, and they decided to move to Protobuf 3. I'm not sure it was the right choice, but a lot of time was invested in investigation by people smarter than me.

If my limited knowledge of Protobuf is right, it also has a method similar to the one you describe, where the fields defined have static identifiers, and any modification means a new field.

I'm not sure the progression of schemas is something JSON Schema would want to do.

On the other hand, there's no harm in looking at how this works, and I think an alias of some type could be useful information for Hyper-Schema, and possibly also JSON Schema core.

@handrews
Contributor

they decided to move to Protobuf 3

Cap'n Proto has a good field evolution strategy (note: I'm not a neutral party since the inventor of Cap'n Proto recently joined Cloudflare).

@handrews
Contributor

I'm trying to sort out what action there is to take on this issue.

For transforming JSON representations from one version to another, I would look at JSON Patch. It can express such concepts as "rename field X to Y", as well as adding and removing fields, and even limited conditionals by testing field values.

I guess the question then would be whether and how to integrate such a usage of JSON Patch into JSON Schema.

There's really no notion of versioning built into JSON Schema / Hyper-Schema. Each schema exists on its own. A system designer can indicate that a set of schemas form a sequence of versions, but that's an entirely external concept.

So if this goes anywhere, it would probably go in the API Documentation vocabulary: json-schema-org/json-schema-vocabularies#1

@cavanaug does that seem like a reasonable approach to and home for this concept? It's a worthwhile idea, I just don't see it fitting into core, validation, or hyper-schema.

@handrews
Contributor

Moved this to json-schema-org/json-schema-vocabularies#5

3 participants