
Use scheduleWithFixedDelay instead of scheduleAtFixedRate for java producer batch timer #14125

Merged
merged 1 commit into apache:master on Feb 4, 2022

Conversation

@klevy-toast (Contributor) commented Feb 4, 2022

Fixes #11100

Motivation

We believe that the use of scheduleAtFixedRate in the java producer's batch timer can result in unnecessarily high thread usage, which can become especially problematic for applications that start many producers.

Modifications

Replaced the use of scheduleAtFixedRate with scheduleWithFixedDelay, which matches the previous behavior in 2.6.x. The producer's batchingMaxPublishDelay parameter also implies the use of the "delay" method rather than the "rate" method.
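
For context, here is a minimal stand-alone sketch of the difference between the two ScheduledExecutorService methods (illustrative only, not the producer's actual code; the 1 ms interval mirrors the default batchingMaxPublishDelay mentioned later in this thread):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BatchTimerSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        Runnable flushBatch = () -> { /* stand-in for the producer's batchMessageAndSend() */ };

        // scheduleAtFixedRate: runs are anchored to wall-clock time. If a run (or a
        // GC pause) overruns the 1 ms period, the missed runs execute back to back.
        timer.scheduleAtFixedRate(flushBatch, 1, 1, TimeUnit.MILLISECONDS);

        // scheduleWithFixedDelay: the next run starts 1 ms after the previous run
        // finishes, so delays never accumulate into a burst of catch-up executions.
        timer.scheduleWithFixedDelay(flushBatch, 1, 1, TimeUnit.MILLISECONDS);

        Thread.sleep(5);
        timer.shutdownNow();
    }
}
```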

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests, such as the existing Pulsar client producer tests.

The performance regression can be demonstrated by using [this test artifact](https://github.com/klevy-toast/dropwizard-pulsar-test) and comparing a recent Pulsar client release against a manually built SNAPSHOT version that includes this change:

Version 2.7.1 CPU & thread behavior

  • While sending messages
    ![image](https://user-images.githubusercontent.com/42187013/152588959-8ee4beb9-70f3-4ad8-9132-240d4498dda5.png)
  • While running idle producers
    ![image](https://user-images.githubusercontent.com/42187013/152589079-b45fce49-757a-4bfd-8ddd-c438774ecf41.png)
  • 30-second profile while sending messages
    ![image](https://user-images.githubusercontent.com/42187013/152589222-54732bf3-44d7-40b8-8c6b-03b54ba01090.png)

Version 2.10.0-SNAPSHOT CPU & thread behavior

  • While sending messages
    ![image](https://user-images.githubusercontent.com/42187013/152589391-ae243e7a-5f1f-40b7-a77c-7e3d12a84c8e.png)
  • While running idle producers
    ![image](https://user-images.githubusercontent.com/42187013/152589436-784d9c56-043e-41fa-95e8-6a721e0adc78.png)
  • 30-second profile while sending messages
    ![image](https://user-images.githubusercontent.com/42187013/152589619-f274545d-b9f9-48e8-8b02-e226c6dec59e.png)

These samples show fewer threads running with this change compared to 2.7.1, less time spent in batchMessageAndSend, and overall lower CPU usage. Note that this testing was done on a laptop machine; we have observed, along with other users (see #11100), that this CPU regression can be much worse with more producers, smaller batch intervals, and on deployed cloud applications.

Does this pull request potentially affect one of the following parts:

If yes was chosen, please highlight the changes

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • no-need-doc

The general behavior of the batch timer feature should not change.

@eolivelli (Contributor) left a comment


Thanks for spending time debugging this.

I second this patch.

+1
@rdhabalia @merlimat @michaeljmarshall @lhotari @codelipenghui you may be interested in this patch

@merlimat added the type/enhancement label (The enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages) on Feb 4, 2022
@merlimat added this to the 2.10.0 milestone on Feb 4, 2022
@merlimat (Contributor) left a comment


Looks good.

I think we could even go further and schedule the task only when the producer is active.

@michaeljmarshall (Member) left a comment


LGTM. This is a great catch.

For example, considering that the default delay is 1 millisecond, a 10 millisecond GC pause on the client could lead to 10 immediate calls to batchMessageAndSend(). Since the runnable acquires a lock on the whole producer, those 10 calls are likely impacting the user application.
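
A small stand-alone demo (not part of this PR; the class name and output are purely illustrative) can make that catch-up behavior visible: the sleep inside the first run stands in for the 10 ms GC pause, after which the fixed-rate timer fires its missed runs back to back.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class FixedRateCatchUpDemo {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        AtomicBoolean firstRun = new AtomicBoolean(true);

        timer.scheduleAtFixedRate(() -> {
            if (firstRun.getAndSet(false)) {
                sleepQuietly(10); // simulate a 10 ms GC pause during the first run
            }
            System.out.println("flush at " + (System.nanoTime() / 1_000_000) + " ms");
        }, 1, 1, TimeUnit.MILLISECONDS);

        Thread.sleep(50);
        timer.shutdownNow();
        // With scheduleAtFixedRate, the ~10 runs after the pause print almost back to
        // back; swapping in scheduleWithFixedDelay keeps them ~1 ms apart instead.
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```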

@michaeljmarshall (Member) commented

I think we could even go further and schedule the task only when the producer is active.

@merlimat - this is a great point. Essentially, we only need to schedule a "flush" if there are messages in the batchMessageContainer. We could even delay the flush each time that we send messages, since a client that is pushing many messages through the producer will likely hit the batch limit before the flush. Without pushing out the flush, we'll also increase the probability of under-filled batches.
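
One way to read that suggestion, as a purely hypothetical sketch (the class and method names below are invented for illustration and are not Pulsar APIs): arm a one-shot schedule() when a message enters the batch container, push the flush out as more messages arrive, cancel it when a full batch is sent, and leave the timer idle while the producer is idle.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch only -- not the producer's real structure.
class OnDemandBatchTimer {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long maxPublishDelayMillis;
    private ScheduledFuture<?> pendingFlush;

    OnDemandBatchTimer(long maxPublishDelayMillis) {
        this.maxPublishDelayMillis = maxPublishDelayMillis;
    }

    // Called whenever a message is added to the batch container: (re)arm the timer
    // so the flush fires batchingMaxPublishDelay after the latest add.
    synchronized void onMessageAdded() {
        if (pendingFlush != null) {
            pendingFlush.cancel(false);
        }
        pendingFlush = timer.schedule(this::flushPendingBatch, maxPublishDelayMillis, TimeUnit.MILLISECONDS);
    }

    // Called when a full batch is sent early: nothing left to flush, so the timer
    // stays idle (and costs nothing) until the next message arrives.
    synchronized void onBatchSent() {
        if (pendingFlush != null) {
            pendingFlush.cancel(false);
            pendingFlush = null;
        }
    }

    private synchronized void flushPendingBatch() {
        pendingFlush = null;
        // stand-in for batchMessageAndSend()
    }
}
```

Re-arming on every add is the simplest form of the idea; a real implementation would probably only arm the timer when the container goes from empty to non-empty, to avoid per-message scheduling overhead.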

@klevy-toast - are you interested in working on this fix? If not, I think we should merge this and then someone can work on a follow up fix?

I'd argue we could cherry-pick this commit to older release branches since it is very well understood.

@michaeljmarshall (Member) commented

After thinking about this a bit more, I think it makes sense to merge this as is, cherry-pick it to older release branches, and then any further optimizations will just target master.

@michaeljmarshall michaeljmarshall merged commit f9e69f0 into apache:master Feb 4, 2022
michaeljmarshall pushed a commit that referenced this pull request Feb 5, 2022
…a producer batch timer (#14125)

(cherry picked from commit f9e69f0)
michaeljmarshall pushed a commit that referenced this pull request Feb 5, 2022
…a producer batch timer (#14125)

(cherry picked from commit f9e69f0)
@klevy-toast (Contributor, Author) commented

I think we could even go further and schedule the task only when the producer is active.

@merlimat - this is a great point. Essentially, we only need to schedule a "flush" if there are messages in the batchMessageContainer. We could even delay the flush each time that we send messages, since a client that is pushing many messages through the producer will likely hit the batch limit before the flush. Without pushing out the flush, we'll also increase the probability of under-filled batches.

@klevy-toast - are you interested in working on this fix? If not, I think we should merge this and then someone can work on a follow up fix?

I probably will not be able to work on that enhancement, but I agree that it is a great idea! This issue originally surfaced for us when we had an application running many inactive producers, but we ended up working around it by just disabling batch messaging.

Nicklee007 pushed a commit to Nicklee007/pulsar that referenced this pull request Apr 20, 2022
…a producer batch timer (apache#14125)

Labels
  • area/client
  • cherry-picked/branch-2.8 (Archived: 2.8 is end of life)
  • cherry-picked/branch-2.9 (Archived: 2.9 is end of life)
  • doc-not-needed (Your PR changes do not impact docs)
  • release/2.8.3
  • release/2.9.2
  • type/enhancement (The enhancements for the existing features or docs, e.g. reduce memory usage of the delayed messages)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Java client: high CPU regression for producer between 2.6.x and 2.7.x
5 participants