
Move DotNetty batching scheduling off of DotNetty STEE and onto HashedWheelTimer #4678

Merged

Conversation

Aaronontheweb (Member):

work for #4563 and #4636

Aaronontheweb (Member, Author):

Raw throughput numbers on this branch were the same as dev, so there's no real impact on end-to-end performance.

However, there is a significant change in idle CPU numbers!

Some important notes:

  • clusterwebcrawler_webcrawler.lighthouse2_1 isn't in the cluster, although it is running Akka.Remote
  • clusterwebcrawler_webcrawler.lighthouse_1 IS in the cluster but IS NOT using batching
  • all other services inside the cluster are using batching

Before

CONTAINER ID   NAME                                            CPU %    MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O   PIDS
84ccf601e906   clusterwebcrawler_webcrawler.lighthouse2_1      0.55%    35.52MiB / 50.17GiB   0.07%   1.2kB / 0B        0B / 0B     21
37fd1e313cdb   clusterwebcrawler_webcrawler.web_1              9.98%    79.53MiB / 50.17GiB   0.15%   98.8kB / 97kB     0B / 0B     56
57d530c02a47   clusterwebcrawler_webcrawler.trackerservice_1   10.96%   48.97MiB / 50.17GiB   0.10%   99.9kB / 99kB     0B / 0B     39
5737cbaed752   clusterwebcrawler_webcrawler.crawlservice_1     11.31%   44.49MiB / 50.17GiB   0.09%   97.3kB / 97.4kB   0B / 0B     39
66dd2a7fe096   clusterwebcrawler_webcrawler.lighthouse_1       4.36%    45.39MiB / 50.17GiB   0.09%   106kB / 109kB     0B / 0B     27

After

CONTAINER ID   NAME                                            CPU %    MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O   PIDS
b332f021fbe1   clusterwebcrawler_webcrawler.lighthouse2_1      0.52%    35.91MiB / 50.17GiB   0.07%   1.27kB / 0B       0B / 0B     21
0093f3510df3   clusterwebcrawler_webcrawler.crawlservice_1     5.64%    52.1MiB / 50.17GiB    0.10%   201kB / 202kB     0B / 0B     40
d738c3391a89   clusterwebcrawler_webcrawler.trackerservice_1   4.96%    48.66MiB / 50.17GiB   0.09%   201kB / 201kB     0B / 0B     39
0392e8f7835d   clusterwebcrawler_webcrawler.web_1              5.29%    85.66MiB / 50.17GiB   0.17%   211kB / 204kB     0B / 0B     57
fd492a3b6ff5   clusterwebcrawler_webcrawler.lighthouse_1       4.42%    46.12MiB / 50.17GiB   0.09%   216kB / 221kB     0B / 0B     23

Looks like a ~50% drop in idle CPU consumption to me: the batching nodes fell from roughly 10-11% CPU to 5-5.6%, which puts them in the same ballpark as the nodes that aren't using batching.

    () =>
    {
        // want to fire this event through the top of the pipeline
        context.Channel.Pipeline.FireUserEventTriggered(FlushCmd.Instance);
Aaronontheweb (Member, Author):

I wanted to use an event for this because it will allow me to do some sampling around how often the FlushCmd has to be invoked separately from other flushes. But that may end up being a distinction without a difference, depending on how I write the sampler.
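
For illustration only, a sampler along these lines could count the timer-driven FlushCmd events separately from ordinary flushes. The handler name and counter here are hypothetical; only FlushCmd and DotNetty's ChannelHandlerAdapter.UserEventTriggered override come from the code under review:

    using System.Threading;
    using DotNetty.Transport.Channels;

    // Hypothetical sampling handler (not part of this PR): counts how often
    // the scheduled FlushCmd arrives via the pipeline's user-event path.
    public sealed class FlushSamplingHandler : ChannelHandlerAdapter
    {
        private long _scheduledFlushes;

        public long ScheduledFlushes => Interlocked.Read(ref _scheduledFlushes);

        public override void UserEventTriggered(IChannelHandlerContext context, object evt)
        {
            if (evt is FlushCmd) // the timer-driven flush signal introduced by this PR
                Interlocked.Increment(ref _scheduledFlushes);
            context.FireUserEventTriggered(evt); // keep the event flowing to later handlers
        }
    }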

@to11mtm (Member) left a comment:

One question and one nit, but it looks good otherwise.

    public readonly BatchWriterSettings Settings;
    public readonly IScheduler Scheduler;
    private ICancelable _flushSchedule;

    internal bool CanSchedule { get; private set; } = true;
to11mtm (Member):

It looks like we don't use CanSchedule anywhere; we are only setting it. Should it be removed?

Aaronontheweb (Member, Author):

I'll remove it

Comment on lines +178 to +184
    _flushSchedule = Scheduler.Advanced.ScheduleRepeatedlyCancelable(Settings.FlushInterval,
        Settings.FlushInterval,
        () =>
        {
            // want to fire this event through the top of the pipeline
            context.Channel.Pipeline.FireUserEventTriggered(FlushCmd.Instance);
        });
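
As an aside: the ICancelable stored in _flushSchedule suggests the repeating task is meant to be torn down along with the channel. The exact teardown isn't shown in this hunk, so the following is only a sketch of what it could look like:

    // Assumed teardown (not visible in this hunk): cancel the repeating
    // flush task when the channel goes inactive, so the timer stops firing
    // into a dead pipeline.
    public override void ChannelInactive(IChannelHandlerContext context)
    {
        _flushSchedule?.Cancel();
        base.ChannelInactive(context);
    }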
to11mtm (Member):

So this is one heck of an edge case... but do we want to have any sort of warning if the flush interval is, for some unknown reason, less than the tick duration on the HWT?

Aaronontheweb (Member, Author):

No, we don't, but it shouldn't be an issue: if the interval is shorter than a tick, the scheduler will just put the task in the next available bucket.
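
A minimal sketch of that rounding behavior, under assumed illustrative values (the 100ms tick in the example is not necessarily the library's default, and DelayToTicks is a made-up helper, not Akka.NET API):

    using System;

    // Sketch of hashed-wheel rounding: a delay shorter than one tick is
    // rounded up to a full tick, so a sub-tick flush interval simply fires
    // on the next wheel turn instead of firing early or failing.
    long DelayToTicks(TimeSpan delay, TimeSpan tickDuration)
    {
        var ticks = (delay.Ticks + tickDuration.Ticks - 1) / tickDuration.Ticks; // ceiling division
        return Math.Max(1, ticks); // never land in the current bucket
    }

    // A 10ms flush interval against an assumed 100ms tick still lands one tick out:
    Console.WriteLine(DelayToTicks(TimeSpan.FromMilliseconds(10), TimeSpan.FromMilliseconds(100))); // prints 1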

@@ -114,72 +121,66 @@ public override Task WriteAsync(IChannelHandlerContext context, object message)
     * across the network.
     */
    var write = base.WriteAsync(context, message);
    if (Settings.EnableBatching)
Aaronontheweb (Member, Author):

Removed the Settings.EnableBatching checks - this stage never runs when batching is disabled.
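
The reasoning, as an illustrative sketch (the handler and pipeline names below are assumptions, not the exact Akka.Remote wiring): when batching is disabled, the batching stage is never added to the channel pipeline, so a per-write check inside the handler can never observe EnableBatching == false.

    // Illustrative wiring, not the actual Akka.Remote code: the batching
    // handler is only installed when batching is enabled, which makes any
    // per-write EnableBatching checks inside the handler dead code.
    if (settings.EnableBatching)
    {
        pipeline.AddLast("batch-writer", new BatchWriter(settings, scheduler));
    }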

@Aaronontheweb merged commit 91e9c50 into akkadotnet:dev on Dec 18, 2020
@Aaronontheweb deleted the fix/DotNetty-scheduler-performance branch on December 18, 2020 at 19:43