Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting default value for RecordField #24

Open
alexeiverbny opened this issue Oct 1, 2024 · 6 comments
Open

Allow setting default value for RecordField #24

alexeiverbny opened this issue Oct 1, 2024 · 6 comments

Comments

@alexeiverbny
Copy link

Hello,

I am wondering if it would be possible to allow the user to set or change a default field on the RecordField struct. This is possible in apache_avro: https://docs.rs/apache-avro/latest/apache_avro/schema/struct.RecordField.html.

The reason I am asking for this is that is required if you want to properly handle schema evolution while using trino to read avro data: https://trino.io/docs/current/connector/hive.html#avro-schema-evolution. In particular, we are writing structs to tables that are read by trino. If we want to add a field to that struct, the schema must have a default value in order for trino to be able to read older data that was written before the field was added.

Please let me know if what I am asking for is unclear.

Thank you,
Alex

@alexeiverbny alexeiverbny changed the title Allowing setting default value for RecordField Allow setting default value for RecordField Oct 1, 2024
@Ten0
Copy link
Owner

Ten0 commented Oct 15, 2024

Hello!
Thanks for opening this issue.
I didn't answer right away because I wanted to mediate a bit on it, but it's been a while and I haven't come to it yet so I'm just going to lay my thoughts here.

I understand the use-case and I think it makes sense to have it.
However I'd also like the existence of this field to not be misleading API-wise on that it's actually used for anything within serde_avro_fast, so I have to meditate a bit on what that would look like. (Maybe something generic that's essentially flattened like extra_attributes: Vec<(&'static str, serde_json::Value)> on the relevant schema types, which would have the advantage of also being very flexible...)

@alexeiverbny
Copy link
Author

Thanks for the response. The generic option you suggested seems reasonable to me.

@Ten0
Copy link
Owner

Ten0 commented Oct 16, 2024

The downside of that though is that if we then add a field to the structured list it won't show in extra_attributes, which would then be a breaking change...

@Ten0
Copy link
Owner

Ten0 commented Oct 20, 2024

Another issue is that currently I do not leak the serde_json dependency, which is a property that ideally I'd like to keep but it seems hard to support default without it...

@alexeiverbny
Copy link
Author

Hey,

Trying to think through some of the issues you've raised. I am not quite clear - can you please explain why adding a field to the structured list won't show in extra_attributes?

@Ten0
Copy link
Owner

Ten0 commented Oct 23, 2024

If I have a

struct RecordField {
   ...
   extra_attributes: Vec<(String, serde_json::Value)>,

but then I decide that a particular attribute needs to live in RecordField itself (e.g. doc or default or order...)

I would need to turn it into:

struct RecordField {
   ...
   doc: String,
   extra_attributes: Vec<(String, serde_json::Value)>,

which would probably imply removing it from extra_attributes. However removing it from extra_attributes would be a breaking change because a user might be looking for a field named specifically like this in extra_attributes, and now they should find it in the dedicated field instead.

Alternatively I could have fn extra_attributes(&self) -> impl Iterator<(&str, &serde_json::Value)> (and/or any kind of similar lookup function) which would enable also keeping the new structured fields in the iterator, which is what I was experimenting with in this commit, but then the behavior may be inconsistent between fields with regards to whether they show up in there, and that's getting boilerplateish, and I can't generate a &serde_json::Value from doc: &String, so that doesn't work great either.

Either way I also need to add "stuff every parsed but unused property in there" which also complexifies the parsing code especially if I want to add this extra_attributes in other places where it may make sense, specifically in SchemaNode.
(Plus performance-wise there's compromises that make me uneasy but of course performance of this part of the code shouldn't ever really be a concern...)

All in all I think I'm leaning towards just adding it as default: Option<JsonValue> (where JsonValue would be a reexport of serde_json::Value and I'll hope that serde_json 2.0 won't release anytime soon which is probably a reasonable hope, and assuming that nobody cares about the perf regression of schema parsing (which is probably not too significant anyway) because the whole point of avro is that you rarely parse schemas anyway).

Main thing to work out in this case is how to make the difference between absent field and Null here.
Probably with an extra option-like enum that has a dedicated Default implementation, with #[serde(default)], and with the deserialize impl always constructing Some even if passed deserialize_unit/deserialize_none...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants