-
Notifications
You must be signed in to change notification settings - Fork 917
Add generic SerDe support #502
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d9c5a1e
dd32234
469f140
4827c5a
c8ae724
7333f2a
7ef6a36
829277d
f731b24
641311a
0fb364a
b2b0286
5011edd
bd555b1
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
#!/usr/bin/env python | ||
# | ||
# Copyright 2018 Confluent Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
# | ||
|
||
from avro.schema import PrimitiveSchema | ||
|
||
# Python 2 considers int an instance of str | ||
try: | ||
string_type = basestring # noqa | ||
except NameError: | ||
string_type = str | ||
|
||
|
||
class GenericAvroRecord(dict): | ||
""" | ||
Pairs an AvroRecord with it's schema | ||
|
||
:param schema schema: A parsed Avro schema. | ||
:param dict record: Wraps existing dict in GenericAvroRecord. | ||
:raises ValueError: If schema is None. | ||
:returns: Avro record with its schema. | ||
:rtype: GenericAvroRecord | ||
""" | ||
__slots__ = ['schema'] | ||
|
||
def __init__(self, schema, record=None): | ||
if schema is None: | ||
raise ValueError("schema must not be None") | ||
self.schema = schema | ||
if record is not None: | ||
self.update(record) | ||
|
||
def put(self, key, value): | ||
self[key] = value | ||
|
||
|
||
def get_schema(datum): | ||
if isinstance(datum, GenericAvroRecord): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. useful function, but doesn't belong in this file - belongs with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair point |
||
return datum.schema | ||
elif (datum is None): | ||
return PrimitiveSchema("null") | ||
elif isinstance(datum, string_type): | ||
return PrimitiveSchema('string') | ||
elif isinstance(datum, bool): | ||
return PrimitiveSchema('boolean') | ||
elif isinstance(datum, int): | ||
return PrimitiveSchema('int') | ||
elif isinstance(datum, float): | ||
return PrimitiveSchema('float') | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. python floats are double precision and if you don't use a same with int - python seems to have unified int's to be mean long in more recent versions of the language, and I think just using a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I did forget to add the double, I'll add that in. I do not think we should be silently promoting floats. The avro implementation actually ends up calling write_long within write_int anyway so there is nothing to be gained there. It's much more straightforward to say an int is an int. https://github.com/apache/avro/blob/master/lang/py/src/avro/io.py#L300-L304 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. python doesn't have any data type that corresponds to avro's float or avro's int. python float is equivalent to avro double. pre-python3, I think python int's are platform dependent. On my machine they'e 64 bit (i.e. equivalent to avro long). in python3, all integer values are arbitrary precision (i.e. there is no corresponding avro type, but the best choice is long, presumably the avro library has some error checking). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is true for floats on read but it absolutely differentiates between the float and double on write: https://github.com/apache/avro/blob/master/lang/py/src/avro/io.py#L316-L326 Since the Avro implementation differentiates in its treatment of floats and doubles its not really our place to say you really meant double. We have to work under the assumption that the developer using the library was aware that an avro float is 32 bits and that is why they made the conscious decision to declare a float. I'm not super keen on the idea of upgrading generic ints either. This would be inconsistent with the handling of Avro Records with int fields or even primitives that were specific primitives(provide the primitive schema before hand). If you can accidentally write a 64-bit integer to an avro record field you can accidentally write a 64-bit int to a generic int. I agree this is unfortunate and perhaps an improvement that can be made upstream. But until then we should continue to mark ints as ints I think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ahh whoops, would you believe I forgot the schema is specified. very good, stupid me, carry on. |
||
else: | ||
raise ValueError("Unsupported Avro type {}.".format(type(datum))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider composition instead of inheritance here because 1. it's a looser form of coupling (allows for more future flexibility), so is generally a better default 2. composition is used in other languages, e.g. https://github.com/confluentinc/avro/blob/master/lang/csharp/src/apache/main/Generic/GenericRecord.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually quite intentional as it allows me bind the schema with the record without needing to dereference the record contents in the serializer. Since all records are represented as dicts and the datum writer will only accept a dict I think this is the right move for python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't quite understand what you're trying to say, but there would be only minor difference in code between the two approaches (the composition approach would be a bit more but not much). Anyway, I'm ok with the inheritance relationship here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm saying why call serialize(record.data) vs serialize(data). I'm not convinced I gain anything from composition here. On the contrary however Users can't treat this GenericAvroRecord exactly as if it were a dictionary. Just like they could before I introduced a new type. i.e. record['stuff']='a' or record.get('stuff',None).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i think it's fine. you could of course provide pass through methods (which other implementations do), but there may also be some python syntax that doesn't work (haven't looked into it).
serialize(record.data)
vsserialize(data)
is a non-issue since it's internal. anyway, i'm happy enough with this, just pointing out hte difference.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing the other thing that is nice is your record types work exactly as they did in the past with no material changes. You can still use
record.put
andrecord[key]
as you did previously since all records are dicts