
Encoding & Transport Independent Interoperability Guarantee (Generic Size Measurement) #395

Closed
wants to merge 9 commits into from

Conversation

cneijenhuis
Contributor

Alternative proposal to #364

The feedback on last week's call was that it's problematic to force non-JSON implementations to serialize a CloudEvent to JSON just to ensure that it is within the supported size.

This proposes a different way to measure the event size by looking at the attributes.

Note that this proposal doesn't measure the size of the event on the wire in any format. The size on the wire will differ, including in JSON. Particularly for JSON, Binary and Integer fields are likely bigger than measured here.
If we want to keep a JSON CloudEvent below 64KB (basically for Azure EventGrid compatibility), the last rule should probably state 40KB. If we're more targeting "around 64KB in efficient wire formats", we can go for 64KB in the last rule.

Tradeoff compared to #364 :
CON: More complexity for measuring the size of the event, given that all implementations must support JSON.
PRO: No need to serialize an event to JSON. More fine-granular limits on individual fields.
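As a rough sketch (not part of the PR), the attribute-based measurement could look like the following, using the rules from the diff: String attributes count as their UTF-8 byte length, Integer attributes as a flat 4 bytes, Binary attributes as their raw length, and Maps recurse over index names and values. The `measure`/`event_size` helpers and the example event are hypothetical.

```python
def measure(value):
    """Encoding-independent size of a single attribute value."""
    if isinstance(value, str):    # String (incl. URI-reference, Timestamp): UTF-8 bytes
        return len(value.encode("utf-8"))
    if isinstance(value, int):    # Integer: flat 4 bytes per the proposal
        return 4
    if isinstance(value, bytes):  # Binary: raw length, no base64 overhead
        return len(value)
    if isinstance(value, dict):   # Map: index names plus nested values
        return sum(len(k.encode("utf-8")) + measure(v) for k, v in value.items())
    raise TypeError(f"unsupported type: {type(value)!r}")

def event_size(event):
    """Total size of all attributes, including `data`; attribute names count too."""
    return measure(event)

evt = {"specversion": "0.3", "type": "com.example.created",
       "source": "/orders", "id": "A1", "data": b"\x00" * 1024}
print(event_size(evt))  # 1082
```

Note that this deliberately undercounts relative to JSON on the wire: quoting, base64 expansion of Binary, and structural characters are all ignored.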

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
… some examples

Signed-off-by: Christoph Neijenhuis <christoph.neijenhuis@commercetools.de>
@erikerikson
Member

Perhaps a third option:

CloudEvent consumers MUST support CloudEvent events with a wire-encoded size of up to 64KB.
CloudEvent consumers receiving an event whose wire-encoded size exceeds 64KB MAY reject the event and remain in compliance.

More prose could be added but that's the core.

I recognize that this ignores your desire for end-users/producers to be provided a guarantee, but the consequences of delivering that increase costs and seem to imply a lot of duplicated and unnecessary work. Given the optional nature of the rejection ability, guidance could be supplied to users of more compact transmission formats that they should be conservative in approaching the 64KB limit in order to avoid rejections.

Another option is that SDKs could supply features such as pre-flight validation for all known-to-be-used interchange formats, though that implies a knowledge requirement that may be unreasonable. Perhaps another workable option is to have a different limit for producers than for consumers, and to add to the spec an imposition on all encodings ensuring that no encoding can increase the event's wire size beyond the delta between the limits.
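The pre-flight validation idea could be sketched like this (hypothetical API; assumes minified JSON as the only known interchange format and the 64KB figure from the comment above):

```python
import json

MAX_WIRE_SIZE = 64 * 1024  # the 64KB limit proposed above

def preflight_ok(event, serializers):
    """True if the event stays within the limit in every
    known-to-be-used interchange format (hypothetical check)."""
    return all(len(serialize(event)) <= MAX_WIRE_SIZE for serialize in serializers)

# Minified JSON as the only known format in this sketch.
to_json = lambda e: json.dumps(e, separators=(",", ":")).encode("utf-8")

print(preflight_ok({"id": "A1", "data": "x" * 100}, [to_json]))  # True
```

The knowledge-requirement objection shows up directly here: the producer must know every serializer its consumers might see.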

Contributor

@evankanderson evankanderson left a comment

I have an additional concern about these limits and their interaction with middleware, which may choose to enrich the message with additional context attributes. As currently written, the middleware has at least two limits to be aware of:

  1. The risk of going over 100 top-level attributes/attribute-map-entries
  2. The risk of expanding the message beyond 64KB

I don't know that there is a way to rigorously draw a line between "size allowed for the original event producer" and "size allowed for middleware to expand into", but any guidance here (e.g. leave at least 4KB and 10 attributes for middleware) seems likely to belong in the Primer rather than the spec.

spec.md Outdated
considered an attribute.
* All attribute names are 20 characters or less long.
* All indexes of `Maps` are 20 characters or less long.
* All `Binary` attributes do not exceed a size of 1KB.
Contributor

Does this include data as a Binary attribute, or only context attributes?

Contributor Author

`data` is an `Any` attribute, and the rule for `Any` attributes states that it does not apply to `data`.

Maybe it is easier to read if I move that line (305) up.

spec.md Outdated
In order to increase interoperability, all CloudEvent consumers SHOULD accept
events that follow these rules:

* The number of attributes does not exceed 100. Each index of a `Map` is
Contributor

Does this include or exclude the data attribute?

Contributor Author

Includes data. I can explicitly mention that, or exclude data.


The size of a `String` attribute is measured by encoding it in
[UTF-8](https://tools.ietf.org/html/rfc3629).
The size of an `Integer` attribute is 4 bytes.
Contributor

Why do we assume 4 bytes for Integer? That seems unlikely to be correct as an encoding estimate, and may not be correct as a length estimate (for example, for call stack traces from a running program, as a fraction of instruction executions).

Contributor Author

Note that Integer is defined as a 32-bit integer, so 4 bytes is the minimal encoding size for the value itself. Practically, the encoding may be bigger (e.g. in JSON).

Actually, I don't care much what the particular value is, but there should be some encoding-independent value for calculating the total size. Do you have a better proposal than 4 bytes?

Contributor

A 32-bit integer can consume up to 10 bytes encoded in decimal (11 with a sign). If we're estimating worst-case size, then 10 bytes might be a better estimate.

Contributor Author

I am not trying to estimate worst-case size (which in this case would probably be in JSON, where the Integer is represented as text...).

The size calculated for the event will most likely be lower than any wire formatting will be able to achieve (without compression). The goal isn't to predict the size of a message, but to have a way to validate the size of an event independently of the chosen encoding and transport format.
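As an illustration of the gap being discussed (not from the thread itself): the decimal text form of a 32-bit integer ranges from 1 to 11 bytes, so any flat figure is a convention rather than a wire-size prediction.

```python
# Decimal text length of 32-bit integers: 1 byte for "7",
# 10 bytes for the signed maximum, 11 for the minimum (sign included).
for n in (7, 2_147_483_647, -2_147_483_648):
    print(n, len(str(n)))
```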

spec.md Outdated
The size of an `Integer` attribute is 4 bytes.

CloudEvent consumers MAY reject events that do not follow these rules and MAY
reject messages that are not minified (e.g. contain unnecessary white-space).
Contributor

What's the rationale for rejecting messages with additional whitespace that are still within the message length limits (assuming that the consumer is allowed to discard the whitespace)?

Contributor Author

What is the message length limit? 😉

This PR does not define a message length limit. It defines limits on some parts of the message. Other parts of the message are not limited.

E.g. for JSON, this PR limits the length of all names and all values, as well as how many names and values can appear in a message. One can use that info to calculate the maximum size of a minified JSON document, and then reject all requests that exceed this size. Then one is doing exactly what you ask for: accepting all messages within the maximum message size, even if they contain more whitespace.

So, there is only a rationale for rejecting messages with excessive additional whitespace (discarding 1GB of whitespace from a JSON document is still expensive 😉). But without a clearly defined message length limit, I have a hard time drawing the line at which point one is allowed to reject a message. Suggestions welcome.

Contributor

Maybe "MAY reject messages which contain more than 1KB of whitespace" or "MAY reject messages which contain whitespace which extends the total JSON message length beyond 64KB"?

I'm worried that the MAY gives license to drop pretty-printed and otherwise conformant messages, which seems a bit punitive.

Contributor Author

I like the idea - I've tried to incorporate it without explicitly mentioning JSON. Does it sound good to you?

spec.md Outdated
* All indexes of `Maps` are 20 characters or less long.
* All `Binary` attributes do not exceed a size of 1KB.
* All `String` attributes (including String expressions like `URI-reference` and
`Timestamp`) do not exceed a size of 1KB.
Contributor

Is 1KB sufficient for a URI-reference? I think 2KB is accepted by many browsers and other software.

In any case, is there a particular benefit to capping the size of (presumably context) attributes on a per-attribute basis? 100 * (20 + 20 + 1KB + encoding overhead) >= 100KB, which is larger than the total allowed size.

Additionally, I can imagine some attribute uses (like signatures) which could quite possibly consume more than 1KB -- is there some other system limit which is being worked around here?

Contributor Author

The rationale is that a middleware needs to interpret some context attribute values. E.g. the source or type could be used for routing, and a middleware may allow a regex to be run on the source.

A DoS attacker will try to maximize the computational overhead per request. To do that, the attacker could send the maximum allowed length for a source and have a Regex that forces the middleware to do an expensive calculation.
Even if we ignore DoS attacks: if I want to provide a middleware-as-a-service and be price-competitive, I'd like the computational work per event to be somewhat similar across tenants, e.g. so I can charge $0.00001 per request.
However, if one tenant makes me run a regex on a 60KB source, that is much more expensive than a tenant who has a regex on a small source.
So either I have a very complicated pricing model, I lose money on some tenants, or I have to overcharge everyone for a few tenants doing crazy things.

That is my experience from running a BaaS. I'll most likely not be running a CloudEvents middleware-as-a-service myself, so I'm just trying to be nice to the folks who'll end up doing that 😉 But if no one thinks this is necessary, I can also remove it.


Yes, 2KB is also fine. I'll change that.

Contributor

https://github.com/google/re2 should bound the cost of the regular expression matching to be linear in the cost of the string.

I'm also comfortable with a 2KB limit; looking at a fairly large JWT (not that we would necessarily pass a JWT as a context attribute, but future encryption might look somewhat similar), the size is 1500 bytes: https://stackoverflow.com/questions/39767231/jwt-access-token-public-information-and-size

Contributor Author

https://github.com/google/re2 should bound the cost of the regular expression matching to be linear in the cost of the string.

Which only validates my point 😉
If the average event producer sends a source of 50 bytes, and the DoS event producer sends a source of 50 kilobytes, then assuming linear cost it is 1000 times more expensive to match the latter event.

I'm also comfortable with a 2KB limit

👍

@cneijenhuis
Contributor Author

@evankanderson Thanks for your detailed feedback 🙇 I tried to incorporate it in 6b648c1 - including your point on the middleware adding extra attributes - or have provided a comment where I didn't want to change the text.

@cneijenhuis
Contributor Author

@erikerikson Thanks for your feedback.

I have also considered defining a lower limit for producers and a higher limit for consumers. However, it puts a middleware that transposes from one format to another in a tight spot: a middleware is both a consumer and a producer.
Even if a middleware doesn't transpose and just routes messages, it is in an awkward position (it SHOULD accept messages at e.g. 64 KB but SHOULD send messages at e.g. 32 KB).
If a middleware transposes, it is completely dependent on the producer being compliant with the lower limit. E.g. it SHOULD accept a 64 KB MQTT message, even though the middleware can predict it won't be able to transpose that into a JSON document of 64 KB, and knows the following consumer will reject JSON above 64 KB. So it SHOULD accept the message from the producer, but not forward it?


If I got your motivation for the proposal correctly, it is to lower the duplicated/unnecessary work and increased cost for producers and consumers, because both can perform a relatively simple size check.

I think my proposal also enables this. The rules can be used both at design-time and at run-time, and I'd argue that - since this is a size guarantee, not a limit - they are more important for design-time.

  • A producer can verify at design time that none of its events can get close to the limit. Then there is no need to re-do this at run time.
  • Alternatively, a producer is aware that the data sometimes is too big and the claim check pattern needs to be used. At run-time, the producer can only check the size of the data attribute to make that decision if an upper boundary for the size of the other attributes has been established at design-time.
  • A consumer can pick arbitrary limits at which to reject messages at design time (e.g. I'm rejecting messages over 256 KB) as long as these are always above the limits defined in the spec. At run-time, this validation is then cheap (compared to the rather involved rules defined in this PR).
  • A consumer can also calculate the maximum possible size for a valid CloudEvent in a particular wire encoding, and reject messages above that size.
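The cheap run-time check from the last two bullets could be as small as this sketch (the 256 KB figure is the example above; the `accept` helper is hypothetical):

```python
REJECT_ABOVE = 256 * 1024  # arbitrary consumer limit, above the spec's guarantee

def accept(raw_message: bytes) -> bool:
    # Only the wire length is inspected at run time; the per-attribute
    # rules from this PR are accounted for once, at design time.
    return len(raw_message) <= REJECT_ABOVE

print(accept(b"x" * 1024), accept(b"x" * (300 * 1024)))  # True False
```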

I'd argue that for both consumers and producers that would like to keep some "distance" to the limit, there is practically no difference between your proposal and mine.

However for those that need to be very precise about when to reject messages (e.g. a middleware that transposes messages) a clearly defined size is much easier to deal with.

spec.md Outdated
`Timestamp`) do not exceed a size of 2KB.
* For a `Map`, the `Any`-typed values follow the rules above.
* The total size of all attributes (including the `data` attribute) does not
exceed 40KB/64KB.
Contributor

Why the 40KB/64KB here? Is this padding for base64 encoding binary content?

Contributor Author

I wrote in the PR description:

If we want to keep a JSON CloudEvent below 64KB (basically for Azure EventGrid compatibility), the last rule should probably state 40KB. If we're more targeting "around 64KB in efficient wire formats", we can go for 64KB in the last rule.

Anyway, I found the time to check what the worst-case CloudEvent would look like. I think it looks like this: https://gist.github.com/cneijenhuis/2984f933f05b9dc579d101ae2d24cd8f with the data filled with a Binary. If I didn't make a blunder, 45KB (according to this measurement) should guarantee that no JSON document can reach 64KB, while getting pretty close to it; 46KB (according to this measurement) can result in a JSON document above 64KB.

To be precise, the worst-case JSON document at 45KB weighs in at 64527 bytes, i.e. 63.01 KB. So there is a bit of wiggle room in case I made an error somewhere 😉 or we have further minor changes to the spec. However, any JSON event consumer can reject JSON at 64KB and be compliant with these rules.
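The base64 expansion behind that arithmetic can be checked quickly (a sketch; the 45KB figure is the measured size from the comment above):

```python
import base64, math

def base64_len(n_bytes: int) -> int:
    # base64 turns every 3 input bytes into 4 output characters (with padding)
    return 4 * math.ceil(n_bytes / 3)

raw = 45 * 1024                       # 45KB of Binary, measured at raw length
encoded = base64.b64encode(b"\x00" * raw)
print(base64_len(raw), len(encoded))  # 61440 61440
```

That leaves roughly 4KB (65536 − 61440) for field names, quoting, and the other attributes, consistent with the 64527-byte worst case reported above.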

@duglin
Collaborator

duglin commented Mar 7, 2019

@cneijenhuis can you sign your commits to make the DCO checker happy?

45KB is the maximum to keep JSON size below 64KB

@clemensv
Contributor

These are now even more rules than previously.

I would be perfectly happy with a simple statement that says that the wire-encoded CloudEvent SHOULD NOT exceed 64KByte in size. If an intermediary chooses to reencode the event and then the event gets bigger than that, that's really a system configuration problem more than a problem that we need to solve here.

I appreciate that there might be a gray zone of a few KByte where events may exceed the threshold in re-encoding and transport-switching scenarios, but that's true everywhere else in messaging as well. If you look across the messaging landscape, and especially at multi-protocol brokers, you'll find that they give you pretty simple quotas for how large a message can be, typically anchored on the transport frame size limit.

@clemensv clemensv mentioned this pull request Mar 15, 2019
@cneijenhuis cneijenhuis changed the title Minimum Supported Event Size, Option 2 Encoding & Transport Independent Interoperability Guarantee (Generic Size Measurement) Mar 21, 2019
@duglin
Collaborator

duglin commented Mar 22, 2019

@duglin duglin added the v0.3 label May 30, 2019
@cneijenhuis cneijenhuis closed this Jun 3, 2019