Skip to content

Conversation

@slinkydeveloper
Copy link
Member

@slinkydeveloper slinkydeveloper commented Jul 14, 2020

This PR allows CloudEvent to contain a data field different from byte[]. The user doesn't need to have a prior knowledge of the type of data and EventFormat/protocol bindings doesn't need to know it too. When the user extracts the data from CloudEvent, then the data can be converted using getData(Class). On the other end, the CloudEvent might contain half middle representation data field like JsonNode for json.

In order to support marshalling/unmarshalling from/to byte[], I added the concept of EventDataCodec and implemented it in the jackson module.

Signed-off-by: Francesco Guardiani francescoguard@gmail.com

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
* @throws IllegalArgumentException if there isn't any unmarshaller from the inner data to T
*/
@Nullable
<T> T getData(Class<T> c) throws IllegalArgumentException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as seen in Jackson and others, Class may not represent all the types (e.g. List<User>). Consider using a type reference instead.

Copy link
Member Author

@slinkydeveloper slinkydeveloper Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about that, but that means we need to define our type reference? I'm not sure i wanna dive in into that complexity... And if we define our type reference, then the EventDataCodec implementer needs to also convert our type reference to its type reference

Copy link
Member Author

@slinkydeveloper slinkydeveloper Jul 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea for using Class is:

  • For the simple use cases, the user just uses getData(c).
  • For the user cases when the user needs a type reference, he can just use jackson directly: he gets the raw payload with getData(Object.class) and then he uses the jackson mapper to map it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am personally against this operation all together regardless of Type or Class. CloudEvent type is representation of the actual cloud event on the wire. It should be as simple and as light as possible containing only essentials. We had a very long and contentious debate about it, so let's not go back to square one.
If user needs stronger type for its data representation then there should be another abstraction that would perform such conversion, but only as an option.

Signed-off-by: Francesco Guardiani <francescoguard@gmail.com>
@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

@slinkydeveloper previously, CloudEvent was a POJO, but now it may need to store a reference to the encoder/decoder service to perform getData(Class) (de-)serialization, and can't be treated as a simple POJO anymore.

Also, obviously getData(Class) cannot be generic for every Class, which means that an instance of CloudEvent may return from getData(Class) correctly when created by implementation X but fail (due to missing codec) with from Y.

Another question is how byte[] getData() should be implemented now. There are two options:

  1. always have byte[] data representation in the implementing object - bad for performance
  2. always ensure that the underlying object can be encoded as byte[] - hard to enforce

In Liiklus, I went with option number 2 for performance reasons, but I can't say I am 100% happy about the API of it :)

Curious to hear your thoughts on these.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 14, 2020

I think i didn't underlined the important prerequisite of this PR, as you sad:

always ensure that the underlying object can be encoded as byte[] - hard to enforce

If it's a CloudEvent, it must be somehow encodable/decodable to bytes. If you put inside the data field of CloudEvent something that cannot be encoded to bytes, then i suspect it's a bad usage of the CloudEvent abstraction. So IMO this enforcement is necessary.

I see this feature more as a mix of optimization and sugar to use the sdk, but it's not my intention to break the existing assumptions/requirements of CloudEvent:

  • Optimizes the use case where you create a CloudEvent in memory and it never goes to the wire
  • It allows to implement more efficiently an event format (like json) that defines some "special" encodings around some specific data content types
  • Provides some "sugar" to invoke directly the json mapper from the CloudEvent data field

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 14, 2020

@slinkydeveloper previously, CloudEvent was a POJO, but now it may need to store a reference to the encoder/decoder service to perform getData(Class) (de-)serialization, and can't be treated as a simple POJO anymore.

Also, obviously getData(Class) cannot be generic for every Class, which means that an instance of CloudEvent may return from getData(Class) correctly when created by implementation X but fail (due to missing codec) with from Y.

So CloudEvent doesn't need to store a reference to the encoder/decoder service (as you see in the code, I resolve the codec using data content type), but yes this change definitely adds the requirement to the implementations of CloudEvent to have some kind of "codecs store", so i would love to hear some feedback before proceeding forward with fully implementing it.

Of course, an implementation can always fail on getData(Class) saying that it cannot find the codec, but that doesn't sound good.

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

If you put inside the data field of CloudEvent something that cannot be encoded to bytes, then i suspect it's a bad usage of the CloudEvent abstraction

As we previously figured, CloudEvent is an in-memory representation of a CloudEvent, and it is a valid use case to use it for signalling events without even having to encode/decode them. Given that, I would not say that it is a "bad usage of the CloudEvent abstraction".

getData(Class) implies some sort of a conversion mechanism. Have you considered having CloudEvent<T> with T getData()? Then the decoding shifts to the CloudEvent creator (something like CloudEvent<User> receiveCloudEvent(User.class)), and encoding is done at the writer side, with an option to get the raw data as ByteBuffer, for example.

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

Another question to ask: does it really need to be in the API at all. Since it is an optimization, it can as well be implementation specific, when the optimization can be made.

@slinkydeveloper
Copy link
Member Author

Have you considered having CloudEvent with T getData()?

I don't like the idea of bringing back generics to CloudEvent interface. We removed them because they brought the important concern of prior knowledge of the payload of the event. As a user that receives 3/4 different type of events, i don't have the prior knowledge of the generic inside the CloudEvent. Also all the encoding process had to have knowledge of how to go from data to T, while now this is not required. Now the user can just read the CloudEvent, get the payload as bytes and that's all, without having prior knowledge of the content of data nor without registering additional mappers.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 14, 2020

Another question to ask: does it really need to be in the API at all. Since it is an optimization, it can as well be implementation specific, when the optimization can be made.

It needs to be an API in the sense that, if I create an in memory CloudEvent with MyPojo as data, I wanna retrieve it without going through byte[] (nor a downcasting) doing CloudEvent.getData(MyPojo.class). But maybe I'm watching the problem from the wrong POV...
Another solution could be removing getData(Class) and adding something like Object getRawData(), then with his assumptions the user might cast to MyPojo the returned value. But that still requires the implementation or to store the data field as byte[] or to define an encoding from its data field to byte[]

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

As a user that receives 3/4 different type of events, i don't have the prior knowledge of the generic inside the CloudEvent

For that, you could have CloudEvent<JsonNode> (that can be decoded into a concrete type later, for the record).

Now the user can just read the CloudEvent, get the payload as bytes and that's all, without having prior knowledge of the content of data nor without registering additional mappers.

This sounds too magical for me given that we're talking about the thin interface in the API module. Think of this interface as the one that defines interoperability between various CloudEvent implementations. You can have as many features as desired in the core module (aka the "default" implementation), but the interface needs to be carefully reviewed to not block other implementations nor make them suffer working around some "features" of it that they can't support.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 14, 2020

CloudEvent<JsonNode>

Still, that's a prior assumption that the payload is json 😄 In sdk-go we have this interesting api usability feature that you can decode the payload without checking/having knowledge of the data content type: getData(MyPojo.class) always returns MyPojo, no matter if data content type is xml or json (of course assuming that MyPojo can be decoded from xml and json)

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

@slinkydeveloper

Still, that's a prior assumption that the payload is json

Funny enough, readTree in Jackson's (aka "FasterXML") XML support returns... JsonNode 😂 It can also be any other "holder" of a value ready to be deserialized. As long as it decouples the decoding logic from CloudEvent implementations (where getData(Class) is considered a decoding operation)

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 14, 2020

How does it sound the option of T getData(Class<T>) -> Object getRawData() and then some static method somewhere (probably in core) like readData(Class)? That doesn't sound problematic, doesn't add any new assumptions and a naive implementation without encoding/decoding can still store everything as byte[]. And we can still optimize the CloudEvent in memory use case

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

@slinkydeveloper I like it, but I'd still bring back the generic parameter, to avoid unnecessary casting when the type is clear. It will always allow to have CloudEvent<Object>.

and then some static method somewhere (probably in core) like readData(Class)

Yeah, this definitely belongs to core. You could even add an instance method to core's implementation, and, if you don't want to expose implementation's type, add an additional interface (to core) that will have something like:

SmartCloudEvent<T> extends CloudEvent<T> {
    <T> T readData(Class<T> type);
}

(the name is random and there must be a better one, of course, just I didn't want to spend hours trying to find it 😂 )

@bsideup
Copy link
Contributor

bsideup commented Jul 14, 2020

Re "naive implementation" - it will indeed always return CloudEvent<byte[]> in such case, which is absolutely valid and self descriptive: "I only know the byte representation".

@johanhaleby
Copy link
Contributor

johanhaleby commented Jul 14, 2020

Thanks for looking into this. My use case is that I want to store CloudEvent's in MongoDB but hopefully provide a way to avoid double encoding/decoding of the body (see discussion here for context). This means that I'd like to be able to set an instance of org.bson.Document as body.

I kind of like the non-generic api better though (as compared with cloud-events 1.3), it looks less complex and you know that you're dealing with byte[] data, but it's probably too limited to cover cases such as mine. I guess a trade-off has to be made on performance vs complexity like you imply. I would personally be fine with enforcing byte[], but I'm not sure it's generalizable.

How does it sound the option of T getData(Class) -> Object getRawData() and then some static method somewhere (probably in core) like readData(Class)

I don't really understand what you mean here (I may be corrupted by -> in ML languages? :)). Do you mean two different methods, T getData(Class<T>) and Object getRawData() in the CloudEvent interface?

On a different note I would strongly suggest to keep encoding/decoding/codecs out of the CloudEvent implementation. In my mind the data representation of a cloud event and how its encoded/decoded are different concerns. Better to pass CloudEvent as a parameter to a function that does encoding/decoding.

@jskeet
Copy link

jskeet commented Jul 20, 2020

FWIW, I believe the question of "how should CloudEvent objects represent data" is likely to be an important one across multiple languages.

I've been thinking about this myself quite a lot for the C# SDK, but without any concrete progress yet.

@olegz
Copy link
Contributor

olegz commented Jul 20, 2020

@jskeet Can't agree with you more. It is certainly an important aspect and must cover many angles of type representation and conversion including class loader constrains (Foo that is really a Foo doesn't appear to be a Foo) and more. . . So, while 100% in agreement that it is needed, i just don't believe the question of "how should CloudEvent objects represent data" should be answered by CloudEvent interface.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Jul 27, 2020

Ok I thought about it and I think i found a solution that reaches the 3 common goals we have:

  • nice ux for the basic data field unmarshalling while keeping it lazy (aka the data field parsing should not be done while reading the message from the wire)
  • support having pojos inside the event, so our jackson mapper can optimize json values (and users can optimize their event format impls too like @johanhaleby)
  • avoid adding new constraints to CloudEvent

That's more or less the idea:

public interface CloudEvent extends CloudEventAttributes, CloudEventExtensions {

    /**
     * The event data
     */
    @Nullable
    <T> T getData();
}

This relaxes the assumptions of getData() (the generic T does nothing more than triggering a downcast to make the api nicer to use) and should not hurt the user too much.

Now with a method in EventDataCodecProvider like readData we can have this UX:

CloudEvent ev = ....
MyPojo pojo = EventDataCodecProvider.deserializeData(ev, MyPojo.class);

This checks the content of getData() and if there is already MyPojo, it just returns it. Otherwise it calls the proper EventDataCodec implementation which invokes, in case of jackson, the mapper to do the crazy stuff of the conversion.

If the user needs more complex typing when unmarshalling, he can still use jackson directly pretty easily with:

List<AAA> l = mapper.readValue(ev.getData(), new TypeReference<List<AAA>>{})

On the write side, protocol bindings implementers needs to invoke something like EventDataCodecProvider.serializeData(String datacontenttype, Object val, Class<T> target) to convert from whatever is inside data to target type (which will allow to perform the optimizations like avoid double serialization of json)

WDYT about that?

@bsideup
Copy link
Contributor

bsideup commented Jul 27, 2020

I would strongly advice against using " T". This does not allow knowing the type at runtime and creates a false feeling of being dynamic, although it is not. So, -1 from me (if counts).

@olegz
Copy link
Contributor

olegz commented Jul 27, 2020

As stated before I am also -1 on this one. Returning T creates an additional contractual obligation on the CloudEvent strategy - to become a type converter (directly or not) and that goes against the "separation of concerns", "loose coupling" etc. Basically it's not its job to deal with any type of type conversion. There are type converters, marshallers, transformers, serializers etc., and we should try to stick with these familiar java patterns and idioms. But the CloudEvent should be simple and unassuming.

@slinkydeveloper
Copy link
Member Author

So, what are the next steps here? Any proposals on the table?

@olegz
Copy link
Contributor

olegz commented Aug 7, 2020

Yes, this PR has been rejected twice with detailed explanation(s) with no acceptable counter arguments . Personally I am not sure what else to add. Consider closing it and move on.

@johanhaleby
Copy link
Contributor

johanhaleby commented Aug 7, 2020 via email

@olegz
Copy link
Contributor

olegz commented Aug 7, 2020

@johanhaleby what you're saying is very widely used pattern and personally I am +1 for it. Basically having specialized implementations that may provide extra functionality. In this case it doesn't even have to be one. Could be several or even hierarchy, but the core interface as far as I am concerned is final and IMHO I do not foresee any changes to it - ever.

@olegz
Copy link
Contributor

olegz commented Aug 7, 2020

Also, those specializations should live in core module (not the api)

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Aug 10, 2020

The implementation must obviously support
serializing the content to a byte-array as well but this could be done
lazily (and one could use memoization for caching if needed).

That's what i'm struggling to solve. When you go down the road of allowing a "non byte array", you end up with the requirements of mapping T to byte array.
One of the core ideas of sdk-java v2 is to keep jackson/mappers/whatever data serializer/deserializer as much as possible out of the sight of this codebase, because there is no point on re-doing what jackson already does pretty well (and, even more important, to avoid problems like #198 (comment)).

But, in order to fix #196, you need a solution to this problem anyway. We could, as you sad, relax the implementation of CloudEvent in core adding a new method getObjectData() that returns Object, but you still you need the mapper then to go from Object to byte[]. And, side effect, you need to allow in CloudEventBuilder the Object data.

The only doable solution I see is implementing only serialization with jackson (from Object to byte[]) and this could also fit straight in the CloudEvent interface just relaxing getData() to return Object

@bsideup
Copy link
Contributor

bsideup commented Aug 10, 2020

FTR I second the idea of not having the notion of "typed CloudEvent" in api and using custom objects (TypedCloudEvent extends CloudEvent in core?) when typed object is desired.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Aug 10, 2020

"typed CloudEvent"

I'm not proposing any "typed CloudEvent" here 😄 But yeah that's something that could be done in future in core, like the getData(Class) in this PR. Although i recognize that this PR diverged a bit from the initial idea, so maybe just relaxing getData() + some serialization code might do the trick. WDYT?

@bsideup
Copy link
Contributor

bsideup commented Aug 10, 2020

IMO the current one is fine (well, I would still prefer ByteBuffer instead of byte[], or some other memory efficient option, but that's not the point) and we should not change it to Object. If one wants a CloudEvent that holds a reference to an object and avoid eager serialization, they are free to use a custom type that holds said object next to binary-friendly getData.

@slinkydeveloper
Copy link
Member Author

I'm sorry @bsideup , but TBH I don't get why relaxing getData() in CloudEvent might be a problem. Any specific reason?

@bsideup
Copy link
Contributor

bsideup commented Aug 10, 2020

@slinkydeveloper because it makes it dynamic. Object getData is a can of worms.

Either make it CloudEvent<T> or keep everything as it is right now, that would be my suggestion.

@bsideup
Copy link
Contributor

bsideup commented Aug 10, 2020

In fact, even CloudEvent<T> would be problematic because it introduces the serialization to the table.

@slinkydeveloper
Copy link
Member Author

Ok I'm trying a second iteration, stay tuned

@johanhaleby
Copy link
Contributor

In occurrent I've run into a similar issue. My current approach is to add a level of indirection and instead of returning a byte array (byte[]) I return a Data interface. Data then has a method that returns byte[] but consumers can add instanceof checks to cast to a specific implementation that can be used to get access to a data structure before it's serialized/converted to a byte array for optimization purposes.

@slinkydeveloper
Copy link
Member Author

slinkydeveloper commented Aug 11, 2020

@johanhaleby this is an approach adopted by other sdks too, eg: https://github.com/cloudevents/sdk-rust/blob/master/src/event/data.rs#L5

I think it might be interesting to explore it, I personally like a lot the kind of union types

@slinkydeveloper
Copy link
Member Author

Second try: #208

@slinkydeveloper
Copy link
Member Author

Replaced by #250

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discussion enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants