-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AVRO-2299: Normalize Avro Standard Canonical Schema updated latest rebase. #805
base: main
Are you sure you want to change the base?
Conversation
@rumeshkrish I stumbled across AVRO-2299 when I realised that parsing canonical form strips too many properties from the schema to be useful for my use-case. I'm interested to try standard canonical form, so I pulled in these changes and tried generating
import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
public class App {
public static void main(String[] args) throws IOException {
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(new File("./test.avro"));
System.out.println(SchemaNormalization.toCanonicalForm(schema));
}
} And here's the contents of {
"type": "enum",
"name": "Suit",
"symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
"default": "HEARTS"
} This generates the following output, which does not include the default: {"name":"Suit","type":"enum","symbols":["SPADES","HEARTS","DIAMONDS","CLUBS"]} It looks like the arm of the code that handles ENUM's is missing a check to see if the if (s.getEnumDefault() != null) {
o.append(",\"default\":").append(JacksonUtils.toJsonNode(s.getEnumDefault()).toString());
} |
Having played a little more, there also appears to be a discrepancy between the specification for standard canonical form and the handling of the {
"type": "bytes",
"logicalType": "decimal",
"precision": 4,
"scale": 2
} Then I get this: {"type":"bytes""logicalType":"decimal","precision":4,"scale":2} The most obvious problem is that this is not valid JSON. However, if I fix this then there is still the problem that the spec. does not include In this case, I think the current implementation is correct, and the specification needs to be amended. This is because stripping the Experimenting with {
"type": "fixed",
"size": 16,
"name": "x",
"logicalType": "decimal",
"precision": 4,
"scale": 2
} Then I get this: {"name":"x","type":"fixed","size":16} This has discarded the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently fails to compile.
It would be great if someone could carefully read the proposed changes to the specification. I don't have time for that today. Thanks!
} | ||
} | ||
|
||
@RunWith(Parameterized.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is a duplicate of the above and causes a compilation error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate class removed. Thanks @cutting .
} | ||
} | ||
|
||
@RunWith(Parameterized.class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also a duplicate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
duplicate class removed. Thanks @cutting .
@prestona Thanks for your time to do some more testing. I have update the specification with following test cases.
|
…logical type standard canonical form changes along with additional test cases.
> mvn spotless:apply
Hello! My apologies for the late, late reply -- I haven't been following this issue very well! I guess my big question is whether this really needs to be part of the specification and SchemaNormalization. There are lots of reasons we might want useful, custom code that do schema transformations, including customized normalizations. Stripping custom user properties and "non-essential" metadata is certainly one of those! But is the specification necessary or useful for portability between languages or versions? What do you think about just adding the code and tools to do "custom normalizations" into the Java SDK and others without defining a new canonical form? |
@RyanSkraba It would be great, if we could add Standard Normalizations into other languages or versions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(More "Request Discussion" than "Request Changes"!)
This could be really important as a tool for working with schemas that have a longer lifetime (especially with versioning, evolution and registries), so I think it's worth getting it right.
Thanks for keeping on top of this feature -- I appreciate the work you've put into it!
@@ -1310,6 +1310,92 @@ | |||
</ul> | |||
</section> | |||
|
|||
<section> | |||
<title>Standard Canonical Form for Schemas</title> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I proposed in the JIRA that we don't create a new specification for this form, and just consider getting the "plain" schema as an SDK tool issue.
If we go that way, you can ignore the comments on this file :D
as follows: <code>name</code>, <code>type</code>, | ||
<code>fields</code>, <code>symbols</code>, | ||
<code>items</code>, <code>values</code>, | ||
<code>logicalType</code>, <code>size</code>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you switch these to keep the initial attributes the same as for Parsing Canonical Form: name, type, fields, symbols, items, values, size ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to define the order for user/custom properties? I'm tempted to say keep the 7 attributes from Parsing Canonical form in their order, followed by all other kept attributes in alphabetical order...
|
||
<li> [STRIP] Keep only attributes that are relevant to | ||
reserved properties, which are: | ||
<code>type</code>, <code>name</code>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's odd that the order specified here is different than below. It's also the case for Parsing Canonical Form documentation though...
A good question -- size is "sometimes" relevant, and sometimes ignored, for example. In the Java SDK we can add it as an attribute to a field but not an enum, and it's stripped from the field despite being a "kept" attribute. Is this a bug with canonical form in general? What do we want to happen with the Standard Canonical Form?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Size is only relevant for fixed
, it should not be present in any other type.
|
||
<li> [INTEGERS] Eliminate quotes around and any leading | ||
zeros in front of JSON integer literals (which appear in the | ||
<code>size</code> attributes of <code>fixed</code> schemas).</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also appear in default attributes now. We also have floating point numbers that should be normalized.
JSON arrays should be fine, but JSON objects in defaults should probably have their fields ordered alphabetically.
<p><em>Standard Canonical Form</em> is a transformation of a schema | ||
into standard canonical ordered. It contains only avro reserved | ||
properties <code>"name", "type", "fields", "symbols", "items", "values", | ||
"logicalType", "size", "order", "doc", "aliases", "default"</code> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably worth mentioning that any properties the are used to configure a logical type should also be kept ("scale", "precision" and user-defined logical types.)
As a consequence, when generating a canonical form that includes a user-defined LogicalType, all languages should have the same defined attributes for that logical type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we're keeping default, we should keep all of the attributes that might appear in a JSON object in the default.
@@ -17,17 +17,22 @@ | |||
*/ | |||
package org.apache.avro; | |||
|
|||
import org.apache.avro.util.internal.JacksonUtils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This class is currently Jackson-independent... I think the JSON is being written manually here to ensure that it's isolated from any specific JSON implementations, and the trend of the last few versions has been to reduce the dependency on Jackson in particular.
Is it possible to avoid introducing this class?
} | ||
|
||
private static void setLogicalProps(Appendable o, LogicalType lt) throws IOException { | ||
o.append(",\"").append(LogicalType.LOGICAL_TYPE_PROP).append("\":\"").append(lt.getName()).append("\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect user-defined logical type properties to work here! (But that could be left as a known issue as well...)
|
||
private static void setComplexProps(Appendable o, Schema s) throws IOException { | ||
if (s.getDoc() != null && !s.getDoc().isEmpty()) | ||
o.append(",\"doc\":\"").append(s.getDoc()).append("\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be o.append(",\"doc\":").append(toJsonNode(s.getDoc()).toString());
if (f.order() != null) | ||
o.append(",\"order\":\"").append(f.order().toString()).append("\""); | ||
if (f.doc() != null) | ||
o.append(",\"doc\":\"").append(f.doc()).append("\""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be o.append(",\"doc\":").append(toJsonNode(f.doc()).toString());
Jira
This is new feature in JAVA component to normalise the Avro schema using canonical order and get the avro schema with reserved properties. The canonical form of schema should follow the below rules,
User able to normalise schema with additional user defined logical types.
reserved avro property ordering as below, followed by user given properties.
Tests
Documentation
Reference discussion:
Sending to Review : @cutting , @tjwp , @Fokko