Skip to content

Conversation

@traeak
Copy link
Contributor

@traeak traeak commented Jan 7, 2022

The current cache write lock retries work by spinning N times at ms intervals, defined by proxy.config.http.cache.max_open_write_retries (default 1) and proxy.config.http.cache.open_read_retry_time (default 10ms).

While performing synthetic load testing with an origin that still accepts connections but stalls, we encountered larger than expected transaction ttms. This was caused by the retry time space going far beyond the configured 10ms.

Using the following setting I was getting too much variation in time spent spinning on the write lock. Anywhere from 1.5s to 8s.

CONFIG proxy.config.http.cache.max_open_write_retries INT 150

So I added this millisecond timeout/deadline:

CONFIG proxy.config.http.cache.max_open_write_retry_timeout INT 1500

When set this timeout is used in preference to the sloppy open_write_retries * open_read_retry_time

@traeak traeak force-pushed the ts_write_retry branch 4 times, most recently from 6f7b816 to d97b866 Compare January 10, 2022 17:46
@traeak traeak added this to the 10.0.0 milestone Jan 10, 2022
@traeak traeak self-assigned this Jan 10, 2022
@traeak traeak added the Cache label Jan 10, 2022
@traeak traeak marked this pull request as draft January 10, 2022 21:30
@traeak traeak marked this pull request as ready for review January 10, 2022 23:58
@ezelkow1
Copy link
Member

[approve ci rocky]

@bryancall
Copy link
Contributor

bryancall commented Mar 28, 2022

@SolidWallOfCode is going to try to take a look at this

@bryancall
Copy link
Contributor

@cmcfarlen is going to take a look at this

Copy link
Contributor

@cmcfarlen cmcfarlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the logic on the read retry assert.

Copy link
Contributor

@cmcfarlen cmcfarlen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm after removing the backward logic.

@traeak
Copy link
Contributor Author

traeak commented Apr 19, 2022

We've been running with this patch in prod but disabled for a bit now. The last squash had some changes from when I ran tests using docker containers, i removed a small optimization that missed evaluating an open_write increment.

Copy link
Contributor

@bryancall bryancall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on @cmcfarlen review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants