
backend: increase connection count before connect #4154

Merged: 1 commit into varnishcache:master on Sep 23, 2024

Conversation

asadsa92
Contributor

There is a race between the time we connect to a backend and when we increase
the connection count. This race is not normally observed, but when the
.max_connections property is used we could end up creating more connections to
the backend than this limit permits. This is problematic as we end up consuming
more system resources, e.g., file descriptors, than reserved for such backends.

The problem is that we release the backend lock and wait for the connect
to finish. This can take up to connect_timeout, which defaults to
3.5 seconds, and only then do we count the connection as consumed towards
the .max_connections property.

This is a recipe for bursty behavior where a backend already
struggling could be overwhelmed by a significant number of connections. Then,
as the connection count exceeds the ceiling, we either fail fetches or
queue them up. However, as soon as we drop below this limit again, we
over-commit once more and the cycle repeats.

The test cases use a blackhole backend to simulate the connect_timeout; this
exercises the race, but under normal circumstances the likelihood of the race
occurring depends on the traffic (fetches), networking conditions, and the
performance of the backend.

The solution to the problem is to increase the connection count before
connecting to the backend; if we fail to get a connection to the
backend, we revert the count before failing the fetch. This makes any
fetch past the .max_connections limit either fail outright or
observe the presence of a queue to the backend.
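
For illustration, a minimal C sketch of the fix described above: reserve the connection count under the backend lock before the potentially slow connect, and revert it if the connect fails. Only n_conn and max_connections mirror names from the PR; the struct layout and slow_connect() are hypothetical stand-ins, not the actual Varnish code.

#include <pthread.h>

struct backend {
	pthread_mutex_t mtx;
	unsigned n_conn;          /* connections counted against the limit */
	unsigned max_connections; /* the .max_connections property */
};

/* Hypothetical stand-in for the slow connect, e.g. VTCP_Connect(). */
extern int slow_connect(struct backend *bp);

int
get_backend_fd(struct backend *bp)
{
	int fd;

	pthread_mutex_lock(&bp->mtx);
	if (bp->n_conn >= bp->max_connections) {
		pthread_mutex_unlock(&bp->mtx);
		return (-1);		/* busy: fail or queue the fetch */
	}
	bp->n_conn++;			/* reserve BEFORE connecting */
	pthread_mutex_unlock(&bp->mtx);

	fd = slow_connect(bp);		/* may block up to connect_timeout */
	if (fd < 0) {
		/* Connect failed: revert the reservation. */
		pthread_mutex_lock(&bp->mtx);
		bp->n_conn--;
		pthread_mutex_unlock(&bp->mtx);
	}
	return (fd);
}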

@asadsa92
Contributor Author

asadsa92 commented Aug 12, 2024

bugwash:

  • skipped, was too close to the bugwash this week.

@nigoroll nigoroll left a comment (Member)

There are some actual concerns, but in general, I am not happy with the amount of changes this introduces left and right of the pretty basic feature itself. #4030 already contains incremental changes to some regions by different authors, and now a third author comes along and edits it all over again? Um no, my take here is that the PR should aim for a small diff unless for good reasons.

@asadsa92
Contributor Author

There are some actual concerns, but in general, I am not happy with the amount of changes this introduces left and right of the pretty basic feature itself. #4030 already contains incremental changes to some regions by different authors, and now a third author comes along and edits it all over again? Um no, my take here is that the PR should aim for a small diff unless for good reasons.

I understand your concerns. I initially only implemented the bug fix, but when I did that I saw the opportunity to make this function simpler. If you do not like the extra work I did here, I can propose that we split this PR into two chunks: one containing a few commits and the bug fix, the other containing the rest of the commits.

Sorry for being too late for the #4030 party; this is not an effort to re-implement #4030.
As explained here: #4154 (comment)
The bug fix allows us to make the connection queue implementation more compact, and I think we should go the extra mile to keep it simple.

@nigoroll
Member

I can propose that we split this PR into two chunks.

Yes. I think it would be helpful to have a minimal patch with just the connection count fix and maybe a rework of the CW feature in a second step. Regarding the latter, there are things I do not like, yes, but also from the first look I would think it's wrong. As mentioned before, I would need to look again.

@asadsa92
Contributor Author

I can propose that we split this PR into two chunks.

Yes. I think it would be helpful to have a minimal patch with just the connection count fix and maybe a rework of the CW feature in a second step. Regarding the latter, there are things I do not like, yes, but also from the first look I would think it's wrong. As mentioned before, I would need to look again.

Cool, then I will do that. This includes the first 4 commits: https://github.com/varnishcache/varnish-cache/compare/master...varnishcache:varnish-cache:fe747ac16ef15ef228a5f245fab5bd6372c01f30?w=1
If you are OK with this, then I can go ahead and rebase.
I will wait with the rest of the commits until after these 4 commits have been merged.

@asadsa92
Contributor Author

I ended up reducing the fix to a single commit with a small diff.
@nigoroll Please go ahead and let me know if you are happy with it now.
Once merged, I will create a new PR with the remaining commits.

@asadsa92 asadsa92 requested a review from nigoroll August 19, 2024 09:46
nigoroll added a commit that referenced this pull request Aug 20, 2024
@nigoroll nigoroll left a comment (Member)

As commenting on unchanged lines is not possible: In addition to the issues pointed out, if this was the right approach, an n_conn decrement is missing for the bo->htc == NULL case and the vbe_connwait_dequeue_locked() is still missing.
But at any rate, I think the simplification attempted here is not possible.

@@ -233,8 +234,10 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
Member

Not directly related to this PR, but related to the problem we are trying to solve: In the line above, is err != 0 actually correct?
IIUC, if we got signaled (err == 0), we could race for the mtx and find the backend already busy again. But we only look at the busy state if waiting for the condition variable resulted in an error (likely ETIMEDOUT), so we would go ahead trying to make a connection even if the backend is busy (again).

Contributor Author

FYI, this has changed a bit: e46f972 (I am planning to rebase soon)
But, your statement still holds true. Maybe this should have been an OR?

The reason for this change came from you starting this thread: #4030 (comment)

Member

should this rather be something like do { ... } while (VRT_Healthy && BE_BUSY(...) && VTIM_real() < wait_end)?

Contributor Author

Yes, something like this.
This is normally how we re-check the condition after condwait, and I think that is cleaner as well (easy to understand what happens).
We would need to recalculate the timeout, though.

Contributor Author

Nit: I would only put one condition in the while loop and express the rest as if statements with breaks inside the loop.
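
For reference, the re-check pattern under discussion, sketched with plain pthreads rather than the Lck_*/VTIM_* wrappers used in Varnish; backend_busy() is a hypothetical stand-in for the BE_BUSY()/VRT_Healthy() predicate. Since pthread_cond_timedwait() takes an absolute deadline, re-entering the wait after a wakeup needs no timeout recalculation.

#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

extern bool backend_busy(void);	/* hypothetical stand-in for BE_BUSY(bp) */

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Wait until the backend frees up or the deadline passes.
 * Returns 0 on success, ETIMEDOUT if still busy at the deadline. */
int
wait_not_busy(const struct timespec *deadline)
{
	int err = 0;

	pthread_mutex_lock(&mtx);
	/* Re-check the predicate after every wakeup: being signaled does
	 * not guarantee the backend is still free once we hold the mutex. */
	while (backend_busy()) {
		err = pthread_cond_timedwait(&cond, &mtx, deadline);
		if (err == ETIMEDOUT)
			break;
	}
	err = backend_busy() ? ETIMEDOUT : 0;
	pthread_mutex_unlock(&mtx);
	return (err);
}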

}
}
} else
vbe_connwait_dequeue_locked(bp, cw);
Member

The problem with dequeuing early here is that racing threads might find the queue empty as a consequence and not exercise the CW_BE_BUSY code above, skipping the queue. Which is contrary to the problem you are trying to solve.

Member

Also I think n_conn is wrong for this case

Contributor Author

This is not possible; we dequeue and increase the bp->n_conn variable while holding the mutex.

If the last entry gets dequeued, then BE_BUSY would return true as n_conn >= max_connections (or not, if some fetch completes before the thread grabs the mutex, but then there is room for it).

Either way, the only way a thread can skip the queue is iff there is no existing queue AND bp->n_conn < max_connections: https://github.com/varnishcache/varnish-cache/pull/4154/files#diff-fe1fb2709016ca15daea3e926b7cec272966ee397c5bafe50d6967023dd9f4a7R218-R219
I do not see how a thread could skip the queue.
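
A sketch of the invariant argued here, reduced to plain pthreads (this is a paraphrase, not the actual vbe_dir_getfd(); queue_len stands in for the cw_head queue): the queue-empty check, the limit check, and the reservation all happen under one mutex, so no racing thread can observe a free slot that is already spoken for.

#include <pthread.h>
#include <stdbool.h>

/* Hypothetical reduction of the backend state relevant to the argument. */
struct backend {
	pthread_mutex_t mtx;
	unsigned n_conn, max_connections, queue_len;
};

/* A thread may bypass the wait queue only if nobody is queued AND we
 * are below .max_connections; the reservation is taken under the same
 * lock, atomically with the check. */
static bool
may_skip_queue(struct backend *bp)
{
	bool ok;

	pthread_mutex_lock(&bp->mtx);
	ok = (bp->queue_len == 0 && bp->n_conn < bp->max_connections);
	if (ok)
		bp->n_conn++;
	pthread_mutex_unlock(&bp->mtx);
	return (ok);
}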

@nigoroll nigoroll (Member) Aug 20, 2024

Sorry @asadsa92 my previous comment was referring to a different scenario. I have deleted it.

So the real problem is that, when the queue goes from one entry to zero, any other racing requests need to queue, otherwise they can overtake the original one and, for example, reach VCP_Get() earlier.

Contributor Author

Sorry @asadsa92 my previous comment was referring to a different scenario. I have deleted it.

No worries. I would rather get this correct the first time.

So the real problem is that, when the queue goes from one entry to zero, any other racing requests need to queue, otherwise they can overtake the original one and, for example, reach VCP_Get() earlier.

Yes, and we hopefully do that here.

Probes are special as they grab a fresh connection, skip the queue and are not counted, but that is kind of expected.
Access to a probe is serial, so we never have more than one probe connection active per backend.
We could end up with many connections in the closing state when the interval is low, but I do not count this as a Varnish problem. The VCL user asked for it.

All other in-core users of the connection pool have to go through here.

Contributor Author

OK, I now think I see why I was wrong. To recapitulate in my own words: The code before this change needed to keep the queue entry as a "temporary reservation" until the n_conn++ after the actual connection was made, otherwise the queue skip could actually have happened. But now that n_conn is incremented with the dequeue, and only decremented upon error, that scenario can not happen any more, because BE_BUSY() will be true.

So I apologize for having taken so long.

What might have contributed to my confusion is that bp->n_conn++ was moved to vbe_connwait_dequeue_locked(). I did see the change, but then apparently forgot about it again when I stared at vbe_dir_getfd(), which is evident in my (wrong) comment "Also I think n_conn is wrong for this case".

I must admit I spent more time on the implementation due to the many moving parts.
I was working in these code paths for a different project and wondered whether we could simplify them. That is when it struck me that there was a bug here. The test case verified my suspicion.

So, now that there is only one call site left, can we please move the body of vbe_connwait_dequeue_locked() in-line again (similar to what you did in dc013e6)? Also, with this change it seems to make no sense to signal when taking a connection, because why would we need to wake up a thread when there are now fewer connections available?

For reference, here is a copy of the tree: https://github.com/asadsa92/varnish-cache/commits/backend_conn_queue_fixesv2/

Yes, that is what I landed on when doing the fix for the bug: asadsa92@57f5470
This is what led me to simplify the implementation and move it into its own function, as we no longer needed the reservation for cw.

if you agree, feel free to squash this with your commits:

diff --git a/bin/varnishd/cache/cache_backend.c b/bin/varnishd/cache/cache_backend.c
index 54ec79ef0..f29994ed0 100644
--- a/bin/varnishd/cache/cache_backend.c
+++ b/bin/varnishd/cache/cache_backend.c
@@ -156,16 +156,6 @@ vbe_connwait_signal_locked(const struct backend *bp)
        }
 }
 
-static void
-vbe_connwait_dequeue_locked(struct backend *bp, struct connwait *cw)
-{
-       Lck_AssertHeld(bp->director->mtx);
-       bp->n_conn++;
-       VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
-       vbe_connwait_signal_locked(bp);
-       cw->cw_state = CW_DEQUEUED;
-}
-
 static void
 vbe_connwait_fini(struct connwait *cw)
 {
@@ -230,13 +220,14 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
                            wait_end);
                } while (err == EINTR);
                bp->cw_count--;
+               VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
                if (err != 0 && BE_BUSY(bp)) {
-                       VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
                        VSC_C_main->backend_wait_fail++;
                        cw->cw_state = CW_BE_BUSY;
                } else
-                       vbe_connwait_dequeue_locked(bp, cw);
-       } else if (cw->cw_state == CW_DO_CONNECT)
+                       cw->cw_state = CW_DEQUEUED;
+       }
+       if (cw->cw_state != CW_BE_BUSY)
                bp->n_conn++;
        Lck_Unlock(bp->director->mtx);

Looks good, I like your proposal.
We could even change the state to CW_DEQUEUED earlier (not full diff):

diff --git a/bin/varnishd/cache/cache_backend.c b/bin/varnishd/cache/cache_backend.c
index 54ec79ef0..f29994ed0 100644
--- a/bin/varnishd/cache/cache_backend.c
+++ b/bin/varnishd/cache/cache_backend.c
@@ -230,13 +220,14 @@ vbe_dir_getfd(VRT_CTX, struct worker *wrk, VCL_BACKEND dir, struct backend *bp,
                            wait_end);
                } while (err == EINTR);
+               VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
+               cw->cw_state = CW_DEQUEUED;
                bp->cw_count--;
                if (err != 0 && BE_BUSY(bp)) {
-                       VTAILQ_REMOVE(&bp->cw_head, cw, cw_list);
                        VSC_C_main->backend_wait_fail++;
                        cw->cw_state = CW_BE_BUSY;
-                } else
-                       vbe_connwait_dequeue_locked(bp, cw);
-       } else if (cw->cw_state == CW_DO_CONNECT)
+                }
+       }
+       if (cw->cw_state != CW_BE_BUSY)
                bp->n_conn++;
        Lck_Unlock(bp->director->mtx);

Let me know which you prefer, I will then push a commit for it and squash later.

Member

Yes, your additional simplification makes sense!

Member

@asadsa92 did you maybe forget about this thread?

Contributor Author

I actually did, sorry about that. Let me redo it.

Contributor Author

It should be OK now. Not sure how I forgot about this, but thanks for the reminder.

@asadsa92
Contributor Author

As commenting on unchanged lines is not possible: In addition to the issues pointed out, if this was the right approach, an n_conn decrement is missing for the bo->htc == NULL case and the vbe_connwait_dequeue_locked() is still missing. But at any rate, I think the simplification attempted here is not possible.

Oh, this was a miss on my part: when I rebased the original PR to remove the other commits, I forgot to change this.
I will push the fix as a new commit.

BTW, this is not a simplification attempt; this is fixing the bug outlined in the description of the commit.
The simplification will come later, but let's handle it separately in another PR.

@nigoroll
Member

As of edfeb73, b68 fails for me locally.

Looks like this was because the "blackhole" connection does not time out as intended, but rather fails immediately:

**** v1    vsl|       1002 BereqMethod     b GET
**** v1    vsl|       1002 BereqURL        b /blackhole
**** v1    vsl|       1002 BereqProtocol   b HTTP/1.1
**** v1    vsl|       1002 BereqHeader     b Host: 127.0.0.1
**** v1    vsl|       1002 BereqHeader     b User-Agent: c1
**** v1    vsl|       1002 BereqHeader     b X-Forwarded-For: 127.0.0.1
**** v1    vsl|       1002 BereqHeader     b Via: 1.1 v1 (Varnish/trunk)
**** v1    vsl|       1002 BereqHeader     b Accept-Encoding: gzip
**** v1    vsl|       1002 BereqHeader     b X-Varnish: 1002
**** v1    vsl|       1002 VCL_call        b BACKEND_FETCH
**** v1    vsl|       1002 VCL_return      b fetch
**** v1    vsl|       1002 Timestamp       b Fetch: 1724257240.120347 0.000023 0.000023
**** v1    vsl|       1002 FetchError      b backend default: fail errno 101 (Network is unreachable)
**** v1    vsl|       1002 Timestamp       b Beresp: 1724257240.130065 0.009741 0.009718
**** v1    vsl|       1002 Timestamp       b Error: 1724257240.130068 0.009744 0.000003
...
**** dT    7.624
---- v1    Not true: MAIN.backend_busy (0) == 1 (1)

Maybe it's better to use a backend with a delay?

Full output: https://gist.github.com/nigoroll/1dea1f48af137a0341f54e373cc5aee0

@asadsa92
Contributor Author

As of edfeb73, b68 fails for me locally.

Interesting.

Looks like this was because the "blackhole" connection does not time out as intended, but rather fails immediately:

[VSL output quoted above]

Maybe it's better to use a backend with a delay?

Thanks for testing; the original test case had this part:

backend default {
        # Non-routeable IPv4 address to simulate connect_timeout
        # XXX: $\{blackhole_backend}
        .host = "10.255.255.1 80";
        [...]
}

That's when I found ${bad_ip}, something I could use instead, so I rewrote it.
Could you try to change the host in the test case to what I wrote here and try again?
The original idea was to introduce a macro for it, as the IP is not routeable over the Internet.
A backend with a delay would be too late (I think); you want VTCP_Connect() to hang until the timeout.

@nigoroll
Member

I do not think that we will be able to find some magic IP which times out. Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?

@asadsa92
Contributor Author

asadsa92 commented Aug 23, 2024

I do not think that we will be able to find some magic IP which times out. Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?

Nope, this would exercise the first_byte_timeout and not the connect_timeout. That is too late.
When a backend listens for connections, a connect will not fail unless the backlog overflows. The kernel could decide to refuse the connection for other reasons, but this is mostly outside the control of the application.

A connect syscall makes the kernel's TCP stack initiate the three-way handshake (SYN, SYN-ACK, ACK); a successful call to connect does not require the other end to "accept" the connection. Once the handshake completes, the connection is added to the listen backlog. It stays there until the server "accepts" the connection.

In order to make VTCP_Connect() fail, the backend either needs to not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond with SYN-ACK (a blackhole). The problem with the former approach is that it would yield an immediate rejection (which is not what we want to test). We could also do something smart with an eBPF program, but that solution would not be portable enough for our liking. That's why I landed on using a "blackhole" IPv4 address.
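
A standalone demo of the claim above, independent of Varnish and not part of the PR: connect(2) to a loopback listener succeeds as soon as the kernel completes the handshake, even though the server never calls accept(2). Error handling is elided via assert for brevity.

#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int
main(void)
{
	struct sockaddr_in sa;
	socklen_t len = sizeof sa;
	int srv, cli;

	srv = socket(AF_INET, SOCK_STREAM, 0);
	assert(srv >= 0);
	memset(&sa, 0, sizeof sa);
	sa.sin_family = AF_INET;
	sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sa.sin_port = 0;		/* let the kernel pick a free port */
	assert(bind(srv, (struct sockaddr *)&sa, sizeof sa) == 0);
	assert(listen(srv, 1) == 0);
	assert(getsockname(srv, (struct sockaddr *)&sa, &len) == 0);

	cli = socket(AF_INET, SOCK_STREAM, 0);
	assert(cli >= 0);
	/* No accept() has been called, yet connect() returns success:
	 * the completed connection simply sits in the listen backlog. */
	assert(connect(cli, (struct sockaddr *)&sa, sizeof sa) == 0);
	printf("connected without accept()\n");
	close(cli);
	close(srv);
	return (0);
}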

@walid-git
Member

In order to make VTCP_Connect() fail, the backend either needs to not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond with SYN-ACK (a blackhole). [...]

Can't we do that with raw sockets, maybe?

@asadsa92
Contributor Author

In order to make VTCP_Connect() fail, the backend either needs to not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond with SYN-ACK (a blackhole). [...]

Can't we do that with raw sockets, maybe?

I am not sure I understand; you normally use raw sockets if you want to implement your own transport over IP.
How is this relevant here? Varnish only supports TCP & UDS backends; I do not think we should roll our own protocol for this.

@walid-git
Member

In order to make VTCP_Connect() fail, the backend either needs to not be in the LISTEN state, or we need to initiate the handshake with a peer that will not respond with SYN-ACK (a blackhole). [...]

Can't we do that with raw sockets, maybe?

I am not sure I understand; you normally use raw sockets if you want to implement your own transport over IP. How is this relevant here? Varnish only supports TCP & UDS backends; I do not think we should roll our own protocol for this.

What I meant is that we are currently limited by the kernel taking care of connection establishment and doing the handshake for us, and that is what we want to prevent. I'm not suggesting implementing a protocol here; we only want to receive the TCP packet containing the SYN flag from the client and simply ignore it, so that we can emulate a connection timeout. That's why I was suggesting raw sockets, but I'm not sure whether this can be achieved.

@nigoroll
Member

nigoroll commented Aug 26, 2024

What about this?

Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?

Can't we already do this with a vtc_server? If not, I would think the adjustments should be fairly minimal.

@walid-git
Member

What about this?

Maybe we need a facility in varnishtest which listens on a socket, but just never accepts, or something similar?

This wouldn't work with a regular TCP socket: as soon as you have a listening socket, all connect calls to that socket will succeed (regardless of whether you call accept or not), or fail immediately if the backlog is full.

@nigoroll
Member

Also with UDS?

@asadsa92
Contributor Author

asadsa92 commented Sep 2, 2024

Also with UDS?

This is not how the connect syscall works (at least not on Linux).
The connect syscall associates the socket (the fd you already have from socket()) with a peer address.
You can send off your request before your peer has accepted you, but the response will not come back before you have been dequeued from the backlog and a response is written to the socket. You wait for your response in read/recv...
It is true that the connect syscall can hang, but that is usually due to a failing handshake at the transport level.

Patch to add a 5s sleep before we accept the client in the server block:

$ git diff
diff --git a/bin/varnishtest/vtc_server.c b/bin/varnishtest/vtc_server.c
index 45913aaa7..66e2bf9f1 100644
--- a/bin/varnishtest/vtc_server.c
+++ b/bin/varnishtest/vtc_server.c
@@ -244,6 +244,7 @@ server_conn(void *priv, struct vtclog *vl)
 
        CAST_OBJ_NOTNULL(s, priv, SERVER_MAGIC);
 
+       sleep(5);
        addr = (void*)&addr_s;
        l = sizeof addr_s;
        fd = accept(s->sock, addr, &l);

Here is the VTC:

varnishtest "UDS connect"

server s1 -listen "${tmpdir}/uds" {
	rxreq
	txresp
} -start

client c1 -connect "${s1_sock}" {
	timeout 10
	txreq
	rxresp
	expect resp.status == 200
} -run

Logs:

$ bin/varnishtest/varnishtest -i uds_connect.vtc -v
**** dT    0.000
*    top   TEST uds_connect.vtc starting
**** top   extmacro def pkg_version=trunk
**** top   extmacro def pkg_branch=trunk
**** top   extmacro def pwd=/home/asadsa/varnish-cache
**** top   extmacro def date(...)
**** top   extmacro def string(...)
**** top   extmacro def localhost=127.0.0.1
**** top   extmacro def bad_backend=127.0.0.1:35537
**** top   extmacro def listen_addr=127.0.0.1:0
**** top   extmacro def bad_ip=192.0.2.255
**** top   extmacro def topbuild=/home/asadsa/varnish-cache
**** top   extmacro def topsrc=/home/asadsa/varnish-cache
**** top   macro def testdir=/home/asadsa/varnish-cache
**** top   macro def tmpdir=/tmp/vtc.2188850.221ec1e4
**** top   macro def vtcid=vtc.2188850.221ec1e4
**   top   === varnishtest "UDS connect"
*    top   VTEST UDS connect
**   top   === server s1 -listen "${tmpdir}/uds" {
**   s1    Starting server
**** s1    macro def s1_addr=0.0.0.0
**** s1    macro def s1_port=0
**** s1    macro def s1_sock=/tmp/vtc.2188850.221ec1e4/uds
*    s1    Listen on /tmp/vtc.2188850.221ec1e4/uds
**   top   === client c1 -connect "${s1_sock}" {
**   c1    Starting client
**   c1    Waiting for client
**   s1    Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
**   c1    Started on /tmp/vtc.2188850.221ec1e4/uds (1 iterations)
***  c1    Connect to /tmp/vtc.2188850.221ec1e4/uds
***  c1    connected fd 4 to /tmp/vtc.2188850.221ec1e4/uds
**   c1    === timeout 10
**   c1    === txreq
**** dT    0.001
**** c1    txreq|GET / HTTP/1.1\r
**** c1    txreq|Host: 127.0.0.1\r
**** c1    txreq|User-Agent: c1\r
**** c1    txreq|\r
**   c1    === rxresp
**** dT    5.001
***  s1    accepted fd 5 0.0.0.0 0
**   s1    === rxreq
**** s1    rxhdr|GET / HTTP/1.1\r
**** s1    rxhdr|Host: 127.0.0.1\r
**** s1    rxhdr|User-Agent: c1\r
**** s1    rxhdr|\r
**** s1    rxhdrlen = 51
**** s1    http[ 0] |GET
**** s1    http[ 1] |/
**** s1    http[ 2] |HTTP/1.1
**** s1    http[ 3] |Host: 127.0.0.1
**** s1    http[ 4] |User-Agent: c1
**** s1    bodylen = 0
**   s1    === txresp
**** s1    txresp|HTTP/1.1 200 OK\r
**** s1    txresp|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** s1    txresp|Server: s1\r
**** s1    txresp|Content-Length: 0\r
**** s1    txresp|\r
**** dT    5.002
***  s1    shutting fd 5
**   s1    Ending
**** c1    rxhdr|HTTP/1.1 200 OK\r
**** c1    rxhdr|Date: Mon, 02 Sep 2024 11:46:09 GMT\r
**** c1    rxhdr|Server: s1\r
**** c1    rxhdr|Content-Length: 0\r
**** c1    rxhdr|\r
**** c1    rxhdrlen = 87
**** c1    http[ 0] |HTTP/1.1
**** c1    http[ 1] |200
**** c1    http[ 2] |OK
**** c1    http[ 3] |Date: Mon, 02 Sep 2024 11:46:09 GMT
**** c1    http[ 4] |Server: s1
**** c1    http[ 5] |Content-Length: 0
**** c1    bodylen = 0
**   c1    === expect resp.status == 200
**** c1    EXPECT resp.status (200) == "200" match
***  c1    closing fd 4
**   c1    Ending
*    top   RESETTING after uds_connect.vtc
**   s1    Waiting for server (3/-1)
*    top   TEST uds_connect.vtc completed
#    top  TEST uds_connect.vtc passed (5.004)

Note that the client connected before accept(); the 5s delay was observed during rxresp.

@nigoroll
Member

nigoroll commented Sep 2, 2024

bugwash: add usage instructions to the vtc and skip it; other than the merge conflicts, I think this is ready.

@asadsa92
Contributor Author

asadsa92 commented Sep 3, 2024

Done.

There is a race between the time we connect to a backend and when we increase
the connection count. This race is not normally observed, but when the
.max_connections property is used we could end up creating more connections to
the backend than this limit permits. This is problematic as we end up consuming
more system resources, e.g., file descriptors, than reserved for such backends.

The problem is that we release the backend lock and wait for the connect
to finish. This can take up to connect_timeout, which defaults to
3.5 seconds, and only then do we count the connection as consumed towards
the .max_connections property.

This is a recipe for bursty behavior where a backend already
struggling could be overwhelmed by a significant number of connections. Then,
as the connection count exceeds the ceiling, we either fail fetches or
queue them up. However, as soon as we drop below this limit again, we
over-commit once more and the cycle repeats.

The test cases use a blackhole backend to simulate the connect_timeout; this
exercises the race, but under normal circumstances the likelihood of the race
occurring depends on the traffic (fetches), networking conditions, and the
performance of the backend.

The solution to the problem is to increase the connection count before
connecting to the backend; if we fail to get a connection to the
backend, we revert the count before failing the fetch. This makes any
fetch past the .max_connections limit either fail outright or
observe the presence of a queue to the backend.

Signed-off-by: Asad Sajjad Ahmed <asadsa@varnish-software.com>
@nigoroll nigoroll left a comment (Member)

Excellent! We have now reached a nice, clean patch, improving functionality and reducing code duplication. 👍🏽

@nigoroll
Member

bugwash: merge

@asadsa92 asadsa92 merged commit 4f36cd5 into varnishcache:master Sep 23, 2024
10 of 11 checks passed
@asadsa92 asadsa92 deleted the backend_conn_queue_fixes branch September 23, 2024 14:50