Bug Report: connection pool timed out errors when there is a spike in borrowed/waiting connections due to race condition #17662

Closed
@mhamza15

Description

Overview of the Issue

There appears to be a race condition in the connection pool that leaves a waiter stuck. It shows up when a large number of connections are borrowed or waited on at once, specifically when no new borrow requests arrive afterwards. Here is the general flow, assuming a connection pool of size 1:

  1. "Thread" A borrows a connection from the pool.
  2. Thread B attempts to borrow a connection from the pool and finds none available.
  3. After Thread B has checked the pool, but before it has joined the waitlist, Thread A finishes and tries to hand its connection to a waiter in the waitlist. As there are no waiters yet, it simply returns the connection to the pool.
  4. Thread B now joins the waitlist, but the only connection is sitting idle in the pool, so no borrowed connection remains whose return could hand it over. Thread B blocks until its context times out, and we see the error code = ResourceExhausted desc = connection pool timed out.

Normally, in a live production system, a new query would come in and pull a connection straight from the pool rather than waiting for an existing borrower to pass one on. When that query completed, it would hand its connection to Thread B, breaking the deadlock. In our (GitHub) CI, however, the nature of our queries tends to trigger the race condition more often, as we fire a bunch of queries all at once as part of a UNION ALL in our test cleanup code. These queries quickly exceed the connection pool size, execute quickly, and hit the race condition. Since we're at the end of our test(s), no new queries are fired to pull a connection directly from the pool, and the stuck waiters block until they time out.

Reproduction Steps

@arthurschreiber has come up with a test case that pretty consistently reproduces the error: #17661

Binary Version

main

Operating System and Environment details

all

Log Fragments

Trilogy::ProtocolError: 1203: target: github_test_repositories_actions_checks12.-80.primary: vttablet: rpc error: code = ResourceExhausted desc = connection pool timed out (CallerID: userData1) (trilogy_query_recv)
