
Replace global hub transport with cloudevents #310

Closed · yanmxa opened this issue Jan 30, 2023 · 10 comments

yanmxa (Member) commented Jan 30, 2023

1. Kafka offset committer in Global Hub manager

[image: offset committer flow in the global hub manager]

When the global hub manager receives a message through the transport, there is not only a goroutine for the consumer client but also a committer goroutine that commits the Kafka offset periodically. The consumer receives each message and forwards it to the message handler; after the message has been processed, the committer updates the offset in Kafka. Consuming messages and committing the offset are asynchronous.

But when we use cloudevents to deliver messages, even though it is built on the Kafka protocol, we cannot update the offset directly through the Kafka client after receiving a message. Instead, the cloudevents client receives the message and returns an ACK or NACK result that decides whether the Kafka offset is updated. Message consumption and offset updating are therefore synchronous throughout the process.
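For context, a minimal sketch of the asynchronous pattern described above, assuming the confluent-kafka-go client with auto-commit disabled; the brokers, topic, group ID, and commit interval are placeholders, not the actual global hub code:

```go
package main

import (
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	consumer, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":  "localhost:9092",
		"group.id":           "global-hub-manager", // placeholder
		"enable.auto.commit": false,                // commit manually, not on a broker-driven timer
	})
	if err != nil {
		panic(err)
	}
	_ = consumer.Subscribe("status", nil)

	// Committer goroutine: periodically commits the offsets of messages
	// already handed to the handler, independently of the consume loop.
	go func() {
		for range time.Tick(5 * time.Second) {
			consumer.Commit() // in real code, log and retry on error
		}
	}()

	// Consume loop: forward each message to the handler and keep polling;
	// the committer above catches up asynchronously.
	for {
		if msg, ok := consumer.Poll(100).(*kafka.Message); ok {
			go handle(msg) // processing never blocks offset commits
		}
	}
}

func handle(msg *kafka.Message) { /* forward to the message handler */ }
```

The cloudevents counterpart, where the commit happens synchronously via the receiver's return value, is sketched later in this thread.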

Now the questions are:

  1. Both approaches commit the Kafka offset manually, so that when the consumer crashes and restarts it can resume consuming from the previously received messages. Is this manual offset commit necessary in the global hub?
  2. If we use cloudevents, what are the possible impacts of synchronously updating the Kafka offset on the global hub?
  3. Do these effects really prevent us from using cloudevents instead of the original transport?
vMaroon self-assigned this Jan 31, 2023
nirrozenbaum (Collaborator) commented Feb 1, 2023

I see @vMaroon has assigned himself. He was in charge of the initial Kafka work, so he probably has the best answers.
Having said that, if I remember correctly we had discussions about cloud events when we started the collaboration, and we agreed that it should be possible in the spec path but would have some issues in the status path.

@yanmxa continuing the other thread where we discussed performance, the same applies here.
In the status path we need to keep in mind that the scale is extremely high: 1M MCs and 100M policy status updates.
We intentionally separated the offset commit into a different goroutine in order to achieve the best performance we could get.
If we have to commit the offset synchronously after every bundle is processed, it will slow the processing rate and, as a result, may reduce performance significantly.

As I suggested in the other thread, I suggest here as well running high-scale simulations in order to understand the performance effect a change can make (that's true of any change, not this one specifically).
It's highly important in order to understand which changes are acceptable and which aren't.

Here are the original performance results we were able to achieve:

[image: original performance results]

nirrozenbaum (Collaborator) commented:

High-scale simulation setup:

[image: high-scale simulation setup]

vMaroon (Member) commented Feb 2, 2023

As @nirrozenbaum mentioned, regardless of what we say here, any substantial change to the status-path flow should be followed by a set of high-scale tests; otherwise it would not be right to claim the same scalability, since the system underwent iterations of improvements and compactions to achieve a very efficient and compact data flow.

  1. The first major issue in the proposition is synchronous reading from Kafka:

    • Since the conflation logic is a consumer of the transport through go-channels, synchronous reading from Kafka would propagate all the way to the conflators, unless you decouple them with a middle layer of conflation, which is a recursive problem.
    • Regardless, at high scale it is against best practices to go synchronous. Please keep in mind that the demand here is 100k+ managed clusters, the natural next step is 1M+ and so on, and the current design supports more than 1M.
  2. Cloud-events: if it is beyond structuring a message, keep the following in mind:

    1. Delta messages: while the tests showed that the system can still manage extreme scales without the delta-messaging mechanism, the required bandwidth is considerably larger. Therefore delta messages should not be dropped.
    2. Message fragmentation: with the Kafka message size limit being 1MB, the transport layer manages fragmentation transparently, while satisfying one of the major assumptions - that committed messages are certainly processed (or irrelevant) - as compactly as possible.
      • This means that on the same Kafka partition there can be multiple open streams of fragmented messages, and their offsets are managed such that if two or more streams intersect, committing one when complete and then crashing does not lose fragments of the other intersecting streams (see the sketch at the end of this comment).

The above are the tips of each point; if requested, I can dive into the reasoning and more details.
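To illustrate the intersecting-streams invariant, a hypothetical sketch of the commit rule: the committable offset is capped by the earliest fragment of any still-open stream, so completing and committing one stream and then crashing can never discard fragments of another. Names and types are illustrative, not the actual transport code.

```go
package transport

// stream tracks one in-flight fragmented message on a partition.
type stream struct {
	firstOffset int64 // offset of the stream's first fragment
	complete    bool  // true once every fragment has arrived
}

// safeCommitOffset returns the highest offset that may be committed without
// risking fragment loss: one before the earliest first-fragment offset among
// still-open streams, or the last processed offset when none are open.
func safeCommitOffset(streams []stream, lastProcessed int64) int64 {
	commit := lastProcessed
	for _, s := range streams {
		if !s.complete && s.firstOffset-1 < commit {
			commit = s.firstOffset - 1
		}
	}
	return commit
}
```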

vMaroon removed their assignment Feb 2, 2023
yanmxa (Member, Author) commented Feb 7, 2023

@vMaroon I very much appreciate the information!

  1. As you said, the main purpose of using delta messages here is to reduce bandwidth.
  2. The way message fragmentation is used to assemble messages and then commit the message offset asynchronously is very clever. I will have to think about how to deliver and consume chunked messages in cloudevents.
  3. I am running a simple A/B test that just synchronizes managed clusters from the leaf hubs to the hub-of-hubs database. Could you help check the following test results to see if they are reasonable?

100 K MCs: 100 RHs with 1000 MCs

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 4 seconds | 4 seconds | 4 seconds |
| Cloudevents | 3 seconds | 6 seconds | 4 seconds |

1 M MCs: 1000 RHs with 1000 MCs

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 49 seconds | 47 seconds | 48 seconds |
| Cloudevents | 59 seconds | 57 seconds | 60 seconds |

yanmxa (Member, Author) commented Feb 9, 2023

Transport Spec Path

[image: transport spec path diagram]

Updates:

  1. Transport Producer: reduce the methods of the producer interface.
  2. Transport Consumer: only responsible for delivering messages, not for handling bundle-related logic.
  3. Global Hub Agent: add a dispatcher that registers syncers for different bundles and dispatches messages to the corresponding syncer according to the received message ID (a sketch follows this list).
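A minimal sketch of the dispatcher idea, assuming syncers keyed by message ID; all names are illustrative rather than the actual agent code.

```go
package dispatcher

import "context"

// Syncer handles one kind of bundle, identified by its message ID.
type Syncer interface {
	Sync(ctx context.Context, payload []byte) error
}

// Dispatcher routes received transport messages to registered syncers.
type Dispatcher struct {
	syncers map[string]Syncer
}

func New() *Dispatcher {
	return &Dispatcher{syncers: map[string]Syncer{}}
}

// Register binds a syncer to a message ID, e.g. "ManagedClusters".
func (d *Dispatcher) Register(msgID string, s Syncer) {
	d.syncers[msgID] = s
}

// Dispatch looks up the syncer for the message ID and forwards the payload.
func (d *Dispatcher) Dispatch(ctx context.Context, msgID string, payload []byte) error {
	s, ok := d.syncers[msgID]
	if !ok {
		return nil // no syncer registered; in real code, log and drop
	}
	return s.Sync(ctx, payload)
}
```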

Transport Status Path

[images: transport status path diagrams]

Updates:

  1. Transport Producer: reduce the methods of the producer interface; the status producer now shares the spec producer interface.
  2. Transport Consumer: align with the spec consumer. It is only responsible for receiving and forwarding messages, not for processing them.
  • Bundle syncers no longer need to be registered to a specific consumer. Add a TransportDispatcher so that the consumer only forwards the message to the dispatcher, which invokes the registered syncer to process the message based on the MsgID.
  • Remove the asynchronous offset committer. The cloudevents client receives the message and returns an ACK or NACK result to decide whether to update the Kafka offset, so offset updating is synchronous in this flow (see the sketch below).
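A minimal sketch of the synchronous receive-and-ack flow, assuming the kafka_sarama binding from cloudevents/sdk-go: the receiver's return value is the ACK/NACK decision, so the offset only advances once processing succeeds. Brokers, topic, and group ID are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/Shopify/sarama"
	"github.com/cloudevents/sdk-go/protocol/kafka_sarama/v2"
	cloudevents "github.com/cloudevents/sdk-go/v2"
)

func main() {
	saramaConfig := sarama.NewConfig()
	saramaConfig.Version = sarama.V2_0_0_0

	consumer, err := kafka_sarama.NewConsumer(
		[]string{"localhost:9092"}, saramaConfig, "global-hub", "status")
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close(context.Background())

	c, err := cloudevents.NewClient(consumer)
	if err != nil {
		log.Fatal(err)
	}

	// Returning nil ACKs the event (the offset advances); returning an
	// error NACKs it, leaving the offset uncommitted for redelivery.
	err = c.StartReceiver(context.Background(),
		func(ctx context.Context, e cloudevents.Event) error {
			return dispatch(ctx, e) // hand off to the TransportDispatcher
		})
	if err != nil {
		log.Fatal(err)
	}
}

// dispatch is a stand-in for the dispatcher described above.
func dispatch(ctx context.Context, e cloudevents.Event) error { return nil }
```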

A/B Testing

Synchronize managed clusters from Regional Hubs to Global Hub database.

100 K MCs: 100 RHs with 1000 MCs

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 4 seconds | 4 seconds | 4 seconds |
| Cloudevents (Kafka) | 3 seconds | 6 seconds | 4 seconds |

1 M MCs: 1000 RHs with 1000 MCs

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 49 seconds | 47 seconds | 48 seconds |
| Cloudevents (Kafka) | 59 seconds | 57 seconds | 60 seconds |

Improvements

  1. Increase the size of messages that can be sent at a time.
  2. Since messages are sent in chunks, the manager's consumer locks the receiver while assembling chunks after receiving an event. Reducing the use of locks should reduce delivery time.
  3. Compressing the data may reduce transfer time (see the note after this list).
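On point 3: Kafka can already compress batches at the producer with no consumer-side changes; a one-function sketch assuming confluent-kafka-go (compression.type is a standard librdkafka property):

```go
package main

import "github.com/confluentinc/confluent-kafka-go/kafka"

// newCompressingProducer enables transparent batch compression; consumers
// decompress automatically, so only the producer config changes.
func newCompressingProducer() (*kafka.Producer, error) {
	return kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"compression.type":  "gzip",           // alternatives: snappy, lz4, zstd
	})
}
```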

yanmxa (Member, Author) commented Feb 10, 2023

Test Results after Increasing Transport Message Limit Size to 940 KB

1 M MCs: 1000 RHs with 1000 MCs

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 54 seconds | 55 seconds | 53 seconds |
| Cloudevents (Kafka) | 51 seconds | 50 seconds | 52 seconds |

Conclusion: from the test results so far, replacing the original transport with cloudevents does not cause significant performance degradation.
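For reference, the producer-side knob involved here is librdkafka's message.max.bytes (roughly 1 MB by default); a hedged sketch, with 940 KB standing in for whatever headroom the chunking layer leaves for the envelope:

```go
package main

import "github.com/confluentinc/confluent-kafka-go/kafka"

// newLargeMessageProducer raises the per-message cap so each chunk can be
// close to, but safely under, the broker's 1 MB default limit.
func newLargeMessageProducer() (*kafka.Producer, error) {
	return kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092", // placeholder
		"message.max.bytes": 940 * 1024,
	})
}
```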

vMaroon (Member) commented Feb 10, 2023

@yanmxa these results seem fine at first glance. Did you try testing the load/rotation of 100 policies with the setup above? It's highly suggested to do so.

yanmxa (Member, Author) commented Feb 16, 2023

@vMaroon Since we focus on the performance change of the transport, I only compared and tested the cases of 1 M policies and 1 M managed clusters during HoH initialization.

  • 10 Policies
  • 100 RHs with 1000 MCs
  • Total: 100 K MCs and 1 M Policies

| Scenario \ Run | 1 | 2 | 3 |
| --- | --- | --- | --- |
| Kafka | 12 seconds | 15 seconds | 13 seconds |
| Cloudevents (Kafka) | 18 seconds | 16 seconds | 12 seconds |

yanmxa (Member, Author) commented Feb 17, 2023

ref: cloudevents/sdk-go#846

yanmxa (Member, Author) commented Feb 24, 2023

For the transport status path, we added a transportFormat field in the API to support both Kafka directly and cloudevents (a rough sketch of the switch follows).
ref: #319
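Roughly, such a switch could look like the following; the exact field and value names live in #319, so treat these as assumptions:

```go
package transport

// TransportFormat selects how bundles are encoded on the wire.
// Value names are illustrative; see #319 for the actual API.
type TransportFormat string

const (
	KafkaMessageFormat TransportFormat = "message"     // original raw-Kafka transport
	CloudEventsFormat  TransportFormat = "cloudEvents" // CloudEvents over Kafka
)
```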

yanmxa closed this as completed Feb 24, 2023