Backend connection queue #4030
Conversation
With this change, it seems that the default value for …
As an initial matter, I would prefer … I'm not sure it is a good idea, but default-off mitigates that.
FWIW this is a partial implementation of what we agreed on for VIP 31: https://github.com/varnishcache/varnish-cache/wiki/VIP31%3A-Backend-connection-queue
There are two loose points not covered: disembarking and health status. Disembarking fetch tasks is a large project, one we can disregard or move to its own VIP. The saturation of both …
In general, I am 👍🏽
Force-pushed from 478c1ef to 65bebcb.
Thank you, this looks good to me overall. I would still have some suggestions, but would also be OK to polish after merge, if you agree:
see top level comment
Force-pushed from 65bebcb to 8fa9f61.
Force-pushed from 8fa9f61 to bc95536.
Force-pushed from bc95536 to f88295d.
Force-pushed from 7fb8ce2 to f82cb60.
Rebased and addressed all review comments. Ready for a (hopefully) last review.
Force-pushed from f82cb60 to 5eed80f.
Nitpick noticed during review of varnishcache#4030
I always feel bad when it looks like I was holding things up, so I would like to apologize for not having spotted some issues earlier.
@@ -149,6 +246,12 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
	if (bo->htc == NULL) {
		VSLb(bo->vsl, SLT_FetchError, "out of workspace");
		/* XXX: counter ? */
		if (cw->cw_state == CW_QUEUED) {
			Lck_Lock(bp->director->mtx);
			vbe_connwait_dequeue_locked(bp, cw);
Should we not move this whole htc alloc block to the top, even before the cw init?
Reason: the ws does not change during waiting, so if it overflows, it does so right from the start.
Why would we allocate workspace before we are sure that we can get a backend connection?
If we wouldn't be able to allocate the htc in the first place, why wait at all?
On the other hand, if we are effectively not connecting and workspace would have been too tight, then we fail for the wrong reason, and we don't visit vcl_backend_error at all.
Since the default values for the parameters make this an opt-in feature, may I suggest adding an XXX comment for now to take the proper time later to see how to best approach this? (snapshot/reset for certain paths for example)
I have added an XXX comment as suggested.
> On the other hand, if we are effectively not connecting and workspace would have been too tight, then we fail for the wrong reason, and we don't visit vcl_backend_error at all.
I disagree on "fail for the wrong reason". This code only runs because we do want to connect and having enough workspace is a precondition for the connect to succeed. Potentially running into the connection or wait limit does not make it a "wrong reason" to fail for insufficient workspace.
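To make the ordering question concrete, here is a rough sketch of the two variants being debated. It is not the actual vbe_dir_getfd() code from this PR; connwait_queue_and_wait() and htc_alloc_from_ws() are hypothetical stand-ins for the real cw setup/wait and the htc workspace allocation.

    /* Variant A (roughly the current order): queue and wait first,
     * allocate the htc from the workspace afterwards. */
    connwait_queue_and_wait(bp, cw);    /* hypothetical: may block up to the wait timeout */
    bo->htc = htc_alloc_from_ws(bo);    /* hypothetical workspace allocation */
    if (bo->htc == NULL) {
        VSLb(bo->vsl, SLT_FetchError, "out of workspace");
        if (cw->cw_state == CW_QUEUED) {
            Lck_Lock(bp->director->mtx);
            vbe_connwait_dequeue_locked(bp, cw);
            Lck_Unlock(bp->director->mtx);
        }
        /* Fails only after (possibly) having waited, and without
         * visiting vcl_backend_error. */
        return (NULL);
    }

    /* Variant B (the suggestion): allocate first. The workspace does not
     * change while a task waits, so an overflow is already detectable here
     * and there is no point in queuing at all. */
    bo->htc = htc_alloc_from_ws(bo);
    if (bo->htc == NULL) {
        VSLb(bo->vsl, SLT_FetchError, "out of workspace");
        return (NULL);                  /* fail before ever touching the queue */
    }
    connwait_queue_and_wait(bp, cw);

The XXX comment added in response records this as an open question, to be revisited later, for example with the snapshot/reset idea mentioned above.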
I have addressed most of the last review items, and mentioned the potential drawbacks of this feature in the docs as requested during the last bugwash.
Force-pushed from 07e0295 to d604088.
bugwash: proposed (re)name: global parameters: …

VCL:

    backend foo {
        .queue_limit = 42;
        .queue_timeout = 1m;
    }

    sub vcl_backend_fetch {
        set bereq.queue_limit = 42;
        set bereq.queue_timeout = 1m;
    }
hi all! I'd really like this to get into the next release, and from what I'm reading, it's only a naming exercise from now on. As a refresher, the current PR offers … Any chance to get the original names in? I hate to bring it up and I'll probably get slapped for it, but we have customers using the feature and consistency is pretty important; I don't want to have to translate parameter names depending on the platform people are running.
bugwash approved.
Bugwash: when a probe transitions effectively to sick, the queue should be cleared to let tasks fail immediately. Effective transitions to sick: …
This patch allows a task to be queued when a backend reaches its max_connections. The task will queue on the backend and wait for a connection to become available, rather than immediately failing. This initial commit just adds the basic functionality. It temporarily uses the connect_timeout as the queue wait time, until new parameters are added in a follow-up effort.
Force-pushed from d604088 to 871f784.
bugwash: merge all but the last commit; open a new PR for the flush.
Force-pushed from 871f784 to 5e73a81.
The following parameters have been added:

- global parameters:
  backend_wait_timeout: the amount of time a task will wait (default 0.0)
  backend_wait_limit: the maximum number of tasks that can wait (default 0)
- these parameters can be overridden in the backend definition:

    backend foo {
        .host = "bar.com";
        .wait_timeout = 3s;
        .wait_limit = 10;
    }

The backend wait queue capability is off by default and must be enabled by setting both of the new parameters defined above. Note that this makes an ABI-breaking change.
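By way of illustration (not taken from the commit message): since both defaults are 0/0, the feature stays off until both knobs are raised, for example at runtime via varnishadm:

    varnishadm param.set backend_wait_timeout 3.000   # seconds a task may wait
    varnishadm param.set backend_wait_limit 10        # max tasks queued per backend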
These counters were added to main:
backend_wait - count of tasks that waited in queue for a connection.
backend_wait_fail - count of tasks that waited in queue but did not get a connection (timed out).
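Assuming the counters are exposed under the MAIN namespace like the other main counters (not stated explicitly above), they can be inspected with varnishstat:

    varnishstat -1 -f MAIN.backend_wait -f MAIN.backend_wait_fail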
As suggested by Nils
This makes sure that we won't abort a backend connection attempt if the backend can take it. It covers for any potential missing connwait_signal call.
Force-pushed from 5e73a81 to 171b6b7.
This patch allows a task to be queued when a backend reaches its max_connections. The task will queue on the backend and wait for a connection to become available, rather than immediately failing. This capability is off by default and must be enabled with new parameters.
The following parameters have been added:
backend_wait_timeout: the amount of time a task will wait (default 0.0).
backend_wait_limit: the maximum number of tasks that can wait (default 0).

The two parameters can also be overridden for individual backends in VCL:
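Presumably with the same per-backend syntax as in the commit message earlier in this thread (the exact merged syntax may differ):

    backend foo {
        .host = "bar.com";
        .wait_timeout = 3s;
        .wait_limit = 10;
    }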
Authored by: @drodden (with minor rearrangements)