Description
Opened on Apr 22, 2024
What is the issue and why is it an issue?
Using poll-based consumption (the current situation) for real-time data has several challenges.
- The specification states that real-time feeds should be updated as often as possible.
- This means they should have a `ttl` (time-to-live) value of 0. Ideally, this means consumers should poll infinitely often to stay up to date.
- This is, of course, not possible. Consumers will necessarily poll on a finite interval. Choosing the appropriate interval depends on various factors, mostly related to available computing and bandwidth resources, as well as the total number of feeds the consumer needs to poll.
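To make the interval choice concrete, consumers in practice clamp the advertised `ttl` to a floor and ceiling they can afford. A minimal Python sketch — the clamping policy and default bounds here are illustrative assumptions, not part of any specification:

```python
def choose_poll_interval(ttl_seconds: int,
                         min_interval: int = 30,
                         max_interval: int = 300) -> int:
    """Derive a poll interval (in seconds) from a feed's advertised ttl.

    A ttl of 0 means "poll as often as possible", which in practice
    must be clamped to whatever floor the consumer can afford.
    """
    if ttl_seconds <= 0:
        return min_interval  # ttl=0: fall back to our own floor
    return max(min_interval, min(ttl_seconds, max_interval))
```

Note that both bounds are consumer-side policy: the feed cannot know what resources the consumer has available, which is exactly the conflict described below.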
Consumer side
There is an inherent conflict within this decision process: Consumers don’t want to poll too infrequently, because that increases the likelihood that data will be stale and that incorrect information is shown to users.
At the same time, polling too frequently is a potential waste of resources, depending on how often data is refreshed. They may also face rate-limiting policies from producers (I have first-hand experience with this).
In the end, we have to choose between over-fetching and stale data; any polling interval is at best a compromise between the two.
Producer side
Frequent polling of large payloads hogs resources and pushes producers to introduce complexity such as caching and CDNs. Having consumers poll at an interval close to 0 seconds is resource-intensive and costly for the data producer, who also risks lost revenue if consumers poll too infrequently.
We must further consider that large payloads often contain only minor changes relative to the totality of the information, causing additional waste of resources as unchanged data has to be recomputed and retransmitted.
Cloud computing contributes to greenhouse gas emissions on a massive scale. Allocated resources are generally underutilised and unnecessary computing is extremely wasteful on the financial side, as well as damaging on the environmental side.
Potential solutions
I would like to open up a community discussion on how to solve this challenge by generic and scalable means. Individual arrangements between consumers and producers are not sustainable, and finding a common solution will benefit the community as a whole and help the standard grow.
I don’t want to constrain the solutions from the outset, but I think potential solutions fall into three broad categories:
- Continue to use a polling-based model, but encourage better use of cache headers and not-modified responses.
- Use a push-based model without an intermediary, with technologies like WebSockets or Server-Sent Events.
- Use a push-based model with an intermediary message broker, with technologies like AMQP, Pub/Sub, Kafka, MQTT, etc.
Personally, I think the second category holds the right trade-off between added complexity and added value. In particular, Server-Sent Events seem promising as a theoretical extension of existing endpoints. It should also be noted that options 1 and 2 can co-exist: producers can continue to support and improve the polling-based method for real-time feeds while also supporting a push-based model.
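One reason Server-Sent Events is an attractive extension is that the wire format is trivially simple: plain-text `data:` lines over an ordinary HTTP response, with events separated by blank lines, so an existing JSON endpoint could in principle stream its updates over the same URL. A minimal parser sketch (the event name and payload below are hypothetical, not from any feed spec):

```python
def parse_sse(stream: str) -> list[tuple[str, str]]:
    """Parse a Server-Sent Events text stream into (event, data) pairs.

    Events are separated by blank lines; multiple data: lines within
    one event are joined with newlines, per the SSE format.
    """
    events = []
    event_type, data_lines = "message", []
    for line in stream.splitlines():
        if line.startswith("event:"):
            event_type = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "" and data_lines:
            events.append((event_type, "\n".join(data_lines)))
            event_type, data_lines = "message", []
    return events
```

Because the transport is just HTTP, producers keep their existing authentication, TLS, and load-balancing setup, which is part of the complexity/value trade-off mentioned above.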
Still, there is another axis to consider: for any given update, how large is the delta? There is potentially a very large upside to precomputing and shipping only what has actually changed, rather than always transferring everything. On the other hand, this requires introducing new semantics to communicate the contents of the delta to consumers, i.e. what has been added, what has changed, and what was removed.
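To make those delta semantics concrete, here is a sketch of how a producer might diff two snapshots keyed by entity id. The `added`/`changed`/`removed` field names are purely illustrative, not a proposal:

```python
def compute_delta(previous: dict, current: dict) -> dict:
    """Diff two feed snapshots (dicts keyed by entity id) into a delta."""
    added = {k: v for k, v in current.items() if k not in previous}
    removed = sorted(k for k in previous if k not in current)
    changed = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return {"added": added, "changed": changed, "removed": removed}
```

When few entities change between updates, such a delta is a small fraction of the full payload; the open question is standardising how consumers apply it and how they resynchronise after a missed delta.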
I’m looking forward to hearing what the community has to say about this. I will use your feedback to work on a proposal for a standard way to deal with the problems outlined here.
Is your potential solution a breaking change?
- Yes
- No
- Unsure