backend: increase connection count before connect #4154
Conversation
bugwash:
There are some actual concerns, but in general, I am not happy with the amount of changes this introduces left and right of the pretty basic feature itself. #4030 already contains incremental changes to some regions by different authors, and now a third author comes along and edits it all over again? Um no, my take here is that the PR should aim for a small diff unless there are good reasons not to.
I understand your concerns. I initially only implemented the bug fix, but when I did that I saw the opportunity to make this function simpler. If you do not like the extra work I did here, I can propose that we split this PR into two chunks: one containing a few commits and the bug fix, the other containing the rest of the commits. Sorry for being too late for the #4030 party, this is not an effort to re-implement #4030.
Yes. I think it would be helpful to have a minimal patch with just the connection count fix, and maybe a rework of the CW feature in a second step. Regarding the latter, there are things I do not like, yes, but also from a first look I would think it's wrong. As mentioned before, I would need to look again.
Cool, then I will do that. This includes the first 4 commits: https://github.com/varnishcache/varnish-cache/compare/master...varnishcache:varnish-cache:fe747ac16ef15ef228a5f245fab5bd6372c01f30?w=1
Force-pushed from ab9f0af to b1408da.
I ended up reducing the fix to only one commit with a small diff.
As commenting on unchanged lines is not possible: In addition to the issues pointed out, if this was the right approach, an n_conn decrement is missing for the bo->htc == NULL case, and the vbe_connwait_dequeue_locked() is still missing.
But at any rate, I think the simplification attempted here is not possible.
bin/varnishd/cache/cache_backend.c
Outdated
@@ -233,8 +234,10 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
	VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
Not directly related to this PR, but related to the problem we are trying to solve: In the line above, is err != 0 actually correct?
IIUC, if we got signaled (err == 0), we could race for the mtx and find the backend already busy again. But we only look at the busy state if waiting for the condition variable resulted in an error (likely ETIMEDOUT), so we would go ahead trying to make a connection even if the backend is busy (again).
FYI, this has changed a bit: e46f972 (I am planning to rebase soon)
But, your statement still holds true. Maybe this should have been an OR?
The reason for this change came from you starting this thread: #4030 (comment)
Should this rather be something like do { ... } while (VRT_Healthy && BE_BUSY(...) && VTIM_real() < wait_end)?
Yes, something like this.
This is normally how we re-check the condition after condwait, and I think that is cleaner as well (easy to understand what happens).
We would need to recalculate the timeout, though.
Nit: I would only put one condition in the while loop and handle the rest as if statements with breaks inside the loop.
bin/varnishd/cache/cache_backend.c
Outdated
		}
	}
} else
	vbe_connwait_dequeue_locked(bp, cw);
The problem with dequeuing early here is that racing threads might find the queue empty as a consequence and not exercise the CW_BE_BUSY code above, skipping the queue, which is contrary to the problem you are trying to solve.
Also I think n_conn is wrong for this case.
This is not possible; we dequeue and increase the bp->n_conn variable while holding the mutex.
If the last entry gets dequeued then BE_BUSY would return true, as n_conn >= max_connections (or not, if some fetch completes before the thread grabs the mutex, but then there is room for it).
Either way, the only way a thread can skip the queue is if there is no existing queue AND bp->n_conn < max_connections: https://github.com/varnishcache/varnish-cache/pull/4154/files#diff-fe1fb2709016ca15daea3e926b7cec272966ee397c5bafe50d6967023dd9f4a7R218-R219
I do not see how a thread could skip the queue.
Sorry @asadsa92 my previous comment was referring to a different scenario. I have deleted it.
So the real problem is that, when the queue goes from one entry to zero, any other racing requests need to queue, otherwise they can overtake the original one and, for example, reach VCP_Get() earlier.
Sorry @asadsa92 my previous comment was referring to a different scenario. I have deleted it.
No worries. I would rather get this correct the first time.
So the real problem is that, when the queue goes from one entry to zero, any other racing requests need to queue, otherwise they can overtake the original one and, for example, reach VCP_Get() earlier.

Yes, and we hopefully do that here.
Probes are special as they grab a fresh connection, skip the queue and are not counted, but that is kind of expected. The access to probes is serial, so we never have more than one probe connection active per backend. We could end up with many connections in the closing state when the interval is low, but I do not count this as a Varnish problem; the VCL user asked for it.
All other in-core users of the connection pool have to go through here.
OK, I now think I see why I was wrong. To recapitulate in my own words: The code before this change needed to keep the queue entry as a "temporary reservation" until the n_conn++ after the actual connection was made, otherwise the queue skip could actually have happened. But now that n_conn is incremented with the dequeue, and only decremented upon error, that scenario can not happen any more, because BE_BUSY() will be true. So I apologize for having taken so long.
What might have contributed to my confusion is that bp->n_conn++ was moved to vbe_connwait_dequeue_locked(). I did see the change, but then apparently forgot about it again when I stared at vbe_dir_getfd(), which is evident in my (wrong) comment "Also I think n_conn is wrong for this case".
I must admit I spent more time on the implementation due to the many moving parts.
I was working in these code paths for a different project, and was wondering whether we could simplify. It was then that it struck me that there was a bug here. The test case verified my suspicion.
So, now that there is only one call site left, can we please move the body of vbe_connwait_dequeue_locked() in-line again (similar to what you did in dc013e6)? Also, with this change it seems to make no sense to signal when taking a connection, because why would we need to wake up a thread when there are now fewer connections available?
For reference, here is copy of the tree: https://github.com/asadsa92/varnish-cache/commits/backend_conn_queue_fixesv2/
Yes, that is what I landed on when doing the fix for the bug: asadsa92@57f5470
This was what led me to simplify the implementation and move it into its own function, as we no longer needed the reservation for cw.
if you agree, feel free to squash this with your commits:
diff --git a/bin/varnishd/cache/cache_backend.c b/bin/varnishd/cache/cache_backend.c
index 54ec79ef0..f29994ed0 100644
--- a/bin/varnishd/cache/cache_backend.c
+++ b/bin/varnishd/cache/cache_backend.c
@@ -156,16 +156,6 @@ vbe_connwait_signal_locked(const struct backend *bp)
 	}
 }
 
-static void
-vbe_connwait_dequeue_locked(struct backend *bp, struct connwait *cw)
-{
-	Lck_AssertHeld(bp->director->mtx);
-	bp->n_conn++;
-	VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
-	vbe_connwait_signal_locked(bp);
-	cw->cw_state = CW_DEQUEUED;
-}
-
 static void
 vbe_connwait_fini(struct connwait *cw)
 {
@@ -230,13 +220,14 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
 		    wait_end);
 		} while (err == EINTR);
 		bp->cw_count--;
+		VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
 		if (err != 0 && BE_BUSY(bp)) {
-			VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
 			VSC_C_main->backend_wait_fail++;
 			cw->cw_state = CW_BE_BUSY;
 		} else
-			vbe_connwait_dequeue_locked(bp, cw);
-	} else if (cw->cw_state == CW_DO_CONNECT)
+			cw->cw_state = CW_DEQUEUED;
+	}
+	if (cw->cw_state != CW_BE_BUSY)
 		bp->n_conn++;
 	Lck_Unlock(bp->director->mtx);
Looks good, I like your proposal
We could even change the state to CW_DEQUEUED earlier (not full diff):
diff --git a/bin/varnishd/cache/cache_backend.c b/bin/varnishd/cache/cache_backend.c
index 54ec79ef0..f29994ed0 100644
@@ -230,13 +220,14 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
 		    wait_end);
 		} while (err == EINTR);
+		VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
+		cw->cw_state = CW_DEQUEUED;
 		bp->cw_count--;
 		if (err != 0 && BE_BUSY(bp)) {
-			VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
 			VSC_C_main->backend_wait_fail++;
 			cw->cw_state = CW_BE_BUSY;
-		} else
-			vbe_connwait_dequeue_locked(bp, cw);
-	} else if (cw->cw_state == CW_DO_CONNECT)
+		}
+	}
+	if (cw->cw_state != CW_BE_BUSY)
 		bp->n_conn++;
 	Lck_Unlock(bp->director->mtx);
Let me know which you prefer, I will then push a commit for it and squash later.
Yes, your additional simplification makes sense!
@asadsa92 did you maybe forget about this thread?
I actually did, sorry about that. Let me redo it.
It should be OK now. Not sure how I forgot about this, but thanks for the reminder.
Oh, this was a miss on my part: when I rebased the original PR to remove the other commits, I forgot to change this. BTW, this is not a simplification attempt, this is fixing the bug outlined in the description of the commit.
As of edfeb73, b68 fails for me locally. Looks like this was because the "blackhole" connection does not time out as intended, but rather fails immediately.
Maybe it's better to use a backend with a delay? Full output: https://gist.github.com/nigoroll/1dea1f48af137a0341f54e373cc5aee0
Interesting.
Thanks for testing, the original test cases had this part:

backend default {
	# Non-routeable IPv4 address to simulate connect_timeout
	# XXX: ${blackhole_backend}
	.host = "10.255.255.1 80";
	[...]
}

That's when I found
I do not think that we will be able to find some magic IP which times out. Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?
Nope, this would exercise the first_byte_timeout and not the connect_timeout. That is too late. A connect syscall makes the TCP kernel stack initiate the three-way handshake (SYN, SYN-ACK, ACK); a successful call to connect does not require the other end to "accept" the connection. Once the handshake completes, the connection is added to the listen backlog. It stays there until the server accepts the connection. In order to make VTCP_Connect fail, the backend needs to either not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond to the handshake with SYN-ACK (blackhole). The problem with the former approach is that it would yield an immediate rejection (that is not what we want to test). We could also do something smart with an eBPF program, but that solution would not be portable enough for our liking. That's why I landed on using a "blackhole" IPv4 address.
Can't we maybe do that with raw sockets?
I am not sure I understand; you normally use raw sockets if you want to implement your own transport over IP.
What I meant is that we are currently limited by the kernel taking care of connection establishment and doing the handshake for us, and that's what we want to prevent. I'm not suggesting to implement a protocol here; we only want to receive the TCP packet containing the SYN flag from the client and simply ignore it, so that we can emulate a connection timeout. That's why I was suggesting raw sockets, but I'm not sure if this can be achieved.
what about this?
Can't we already do this with a vtc_server? If not, I would think the adjustments should be fairly minimal.
This wouldn't work with a regular TCP socket: as soon as you have a listening socket, all the
Also with uds?
This is not how the connect syscall works (at least not on Linux). Patch to add a 5s sleep before we accept the client in the server block:

$ git diff
diff --git a/bin/varnishtest/vtc_server.c b/bin/varnishtest/vtc_server.c
index 45913aaa7..66e2bf9f1 100644
--- a/bin/varnishtest/vtc_server.c
+++ b/bin/varnishtest/vtc_server.c
@@ -244,6 +244,7 @@ server_conn(void *priv, struct vtclog *vl)
CAST_OBJ_NOTNULL(s, priv, SERVER_MAGIC);
+ sleep(5);
addr = (void*)&addr_s;
l = sizeof addr_s;
	fd = accept(s->sock, addr, &l);

Here is the VTC:

varnishtest "UDS connect"
server s1 -listen "${tmpdir}/uds" {
rxreq
txresp
} -start
client c1 -connect "${s1_sock}" {
timeout 10
txreq
rxresp
expect resp.status == 200
} -run

Logs:

$ bin/varnishtest/varnishtest -i uds_connect.vtc -v
**** dT 0.000
* top TEST uds_connect.vtc starting
**** top extmacro def pkg_version=trunk
**** top extmacro def pkg_branch=trunk
**** top extmacro def pwd=/home/asadsa/varnish-cache
**** top extmacro def date(...)
**** top extmacro def string(...)
**** top extmacro def localhost=127.0.0.1
**** top extmacro def bad_backend=127.0.0.1:35537
**** top extmacro def listen_addr=127.0.0.1:0
**** top extmacro def bad_ip=192.0.2.255
**** top extmacro def topbuild=/home/asadsa/varnish-cache
**** top extmacro def topsrc=/home/asadsa/varnish-cache
**** top macro def testdir=/home/asadsa/varnish-cache
**** top macro def tmpdir=/tmp/vtc.2188850.221ec1e4
**** top macro def vtcid=vtc.2188850.221ec1e4
** top === varnishtest "UDS connect"
* top VTEST UDS connect
** top === server s1 -listen "${tmpdir}/uds" {
** s1 Starting server
**** s1 macro def s1_addr=0.0.0.0
**** s1 macro def s1_port=0
**** s1 macro def s1_sock=/tmp/vtc.2188850.221ec1e4/uds
* s1 Listen on /tmp/vtc.2188850.221ec1e4/uds
** top === client c1 -connect "${s1_sock}" {
** c1 Starting client
** c1 Waiting for client
** s1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
** c1 Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
*** c1 Connect to /tmp/vtc.2188850.221ec1e4/uds
*** c1 connected fd 4 to /tmp/vtc.2188850.221ec1e4/uds
** c1 === timeout 10
** c1 === txreq
**** dT 0.001
**** c1 txreq|GET / HTTP/1.1\r
**** c1 txreq|Host: 127.0.0.1\r
**** c1 txreq|User-Agent: c1\r
**** c1 txreq|\r
** c1 === rxresp
**** dT 5.001
*** s1 accepted fd 5 0.0.0.0 0
** s1 === rxreq
**** s1 rxhdr|GET / HTTP/1.1\r
**** s1 rxhdr|Host: 127.0.0.1\r
**** s1 rxhdr|User-Agent: c1\r
**** s1 rxhdr|\r
**** s1 rxhdrlen = 51
**** s1 http[ 0] |GET
**** s1 http[ 1] |/
**** s1 http[ 2] |HTTP/1.1
**** s1 http[ 3] |Host: 127.0.0.1
**** s1 http[ 4] |User-Agent: c1
**** s1 bodylen = 0
** s1 === txresp
**** s1 txresp|HTTP/1.1 200 OK\r
**** s1 txresp|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** s1 txresp|Server: s1\r
**** s1 txresp|Content-Length: 0\r
**** s1 txresp|\r
**** dT 5.002
*** s1 shutting fd 5
** s1 Ending
**** c1 rxhdr|HTTP/1.1 200 OK\r
**** c1 rxhdr|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** c1 rxhdr|Server: s1\r
**** c1 rxhdr|Content-Length: 0\r
**** c1 rxhdr|\r
**** c1 rxhdrlen = 87
**** c1 http[ 0] |HTTP/1.1
**** c1 http[ 1] |200
**** c1 http[ 2] |OK
**** c1 http[ 3] |Date: Mon, 02 Sep 2024 11:46:09 GMT
**** c1 http[ 4] |Server: s1
**** c1 http[ 5] |Content-Length: 0
**** c1 bodylen = 0
** c1 === expect resp.status == 200
**** c1 EXPECT resp.status (200) == "200" match
*** c1 closing fd 4
** c1 Ending
* top RESETTING after uds_connect.vtc
** s1 Waiting for server (3/-1)
* top TEST uds_connect.vtc completed
# top TEST uds_connect.vtc passed (5.004)
Take note that the client connected before accept(); the 5s delay was observed during rxresp.
bugwash: add usage instructions to the vtc and skip it; other than the merge conflicts, I think this is ready.
Force-pushed from edfeb73 to 5cbb34f.
Done.
Force-pushed from 5cbb34f to 5c8ad65.
Excellent! We now reached a nice clean patch, improving functionality and reducing code duplication. 👍🏽
bugwash: merge
There is a race between the time we connect to a backend and when we increase
the connection count. This race is not normally observed, but when the
.max_connections property is used we could end up creating more connections to
the backend than this limit permits. This is problematic as we end up consuming
more system resources, e.g., file-descriptors, than reserved for such backends.
The problem is that we release the backend lock and wait for the connect
to finish. This can take up to connect_timeout, which has a default of
3.5 seconds, and only then do we count the connection as consumed towards the
.max_connections property.
This can create a recipe for bursty behavior where a backend already
struggling could be overwhelmed by a significant number of connections. Then,
as the connection count goes higher than the ceiling we either fail fetches or
queue them up. However, as soon as we once again drop below this limit we will
again over-commit and the cycle repeats.
The test case uses a blackhole backend to simulate the connect_timeout; this
exercises the race, but under normal circumstances the likelihood for the race
to occur depends on the traffic (fetches), networking conditions, and the
performance of the backend.
The solution to the problem is to increase the connection count before
connecting to the backend, and if we fail to get a connection to the
backend we revert the count before failing the fetch. This will make any
fetches past the .max_connections limit either outright fail, or
observe the presence of a queue to the backend.