I'm experiencing many issues (already reported here by other people) related to the (blocking) connection pool in the asyncio version.
I had to roll back and pin my dependency to 4.5.5 for now. (lablup/backend.ai#1620)
Since there are already many reports, I would instead offer some high-level design suggestions here:
- redis-py users should be able to distinguish (a sketch of one possible exception hierarchy follows this list):
  - The connection was closed by the application itself.
  - The connection was actively closed by the server after sending responses.
  - The connection was abruptly closed by the server without sending responses.
  - The client was evicted due to a server-side limit.
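Here is a minimal sketch of how such causes could be surfaced as distinct exception types; none of these class names exist in redis-py today, they are purely illustrative:

```python
# Hypothetical sketch: a finer-grained exception hierarchy so callers can
# react differently to each disconnect cause.  None of these classes exist
# in redis-py; only the redis.exceptions.ConnectionError base does.
from redis.exceptions import ConnectionError


class ConnectionClosedByClient(ConnectionError):
    """The application itself closed the connection (e.g. an explicit pool disconnect)."""


class ConnectionClosedByServer(ConnectionError):
    """The server closed the connection gracefully after sending its responses."""


class ConnectionAbortedByServer(ConnectionError):
    """The connection dropped without a response (server crash, network cut, ...)."""


class ClientEvictedError(ConnectionClosedByServer):
    """The server evicted this client due to a server-side limit (e.g. maxclients)."""
```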
- Improvements for timeouts
  - Distinguish:
    - connection ready timeout in the blocking connection pool
    - socket connection timeout
    - response timeout
  - These timeouts should be treated differently by the intrinsic retry subsystem and be distinguishable via subclasses of `ConnectionError` and/or `TimeoutError`, to ease writing user-defined retry mechanisms.
    - e.g., Within the connection ready timeout, we should silently wait and retry until we get a connection from a blocking connection pool.
  - The response timeout for non-blocking (`GET`, `SET`, ...) and blocking commands (`BLPOP`, `XREAD`, ...) should be considered different conditions: active error vs. polling.
    - It would be nice to have explicit examples/docs on how to write a polling loop around blocking commands with proper timeout and retry handling (see the polling sketch after this list).
    - Please refer to: https://github.com/lablup/backend.ai/blob/833ed5477d57846e568b17fec35c82300111a519/src/ai/backend/common/redis_helper.py#L174
    - Blocking command `timeout` cannot exceed client's `socket_timeout` #2807
    - Retrying not working when executing Redis Pipeline for retry_on_error exceptions #2973
    - With blocking xread, redis connection can be released while still waiting for response #2663
  - Maybe we could refer to the design of `aiohttp.ClientTimeout`.
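For reference, `aiohttp.ClientTimeout` groups the `total`, `connect`, `sock_connect`, and `sock_read` timeouts into a single value object. A redis-py analogue could do the same; the sketch below is purely hypothetical (the `RedisTimeout` name and its fields are made up for illustration):

```python
# Hypothetical sketch of a ClientTimeout-style value object for redis-py.
# Neither RedisTimeout nor these field names exist in redis-py today.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RedisTimeout:
    pool_acquire: Optional[float] = None       # waiting for a free connection in a blocking pool
    sock_connect: Optional[float] = None       # establishing the TCP/TLS connection
    response: Optional[float] = None           # waiting for a reply to a non-blocking command
    blocking_response: Optional[float] = None  # extra allowance for BLPOP/XREAD-style commands


# Hypothetical usage:
# client = Redis.from_url("redis://localhost",
#                         timeout=RedisTimeout(pool_acquire=5, sock_connect=2, response=10))
```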
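And here is a rough sketch of the kind of polling loop mentioned above, written against the current `redis.asyncio` API; the stream key `"events"` is an arbitrary example, and the `block` value is deliberately kept below `socket_timeout` to stay clear of #2807:

```python
# Sketch: polling loop around a blocking command (XREAD) where a response
# timeout means "no data yet" rather than a hard failure.
import asyncio

from redis import asyncio as aioredis
from redis.exceptions import ConnectionError, TimeoutError


async def poll_stream() -> None:
    r = aioredis.Redis.from_url("redis://localhost", socket_timeout=10.0)
    last_id = "$"  # only consume entries newer than the start of polling
    try:
        while True:
            try:
                # Block server-side for up to 5 s; an empty reply is benign.
                reply = await r.xread({"events": last_id}, block=5000, count=100)
            except TimeoutError:
                continue  # response timeout while polling: just poll again
            except ConnectionError:
                await asyncio.sleep(1.0)  # transport failure: back off, let the pool reconnect
                continue
            for _stream, entries in reply:
                for entry_id, fields in entries:
                    last_id = entry_id
                    print(entry_id, fields)
    finally:
        await r.close()


if __name__ == "__main__":
    asyncio.run(poll_stream())
```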
- `BlockingConnectionPool` should be the default. (A sketch of explicitly opting into the blocking pool today follows this list.)
  - `ConnectionPool`'s default `max_connections` should be a more reasonable number.
  - Use a blocking connection pool in the async RedisCluster instead of managing connections synchronously #2522
  - BlockingConnectionPool does not recover if redis disconnects. #3034
  - BlockingConnectionPool deadlock (double condition.acquire) #3056
  - There are issues to resolve first, though...
    - "No connection available" errors since 5.0.1 #2995
    - Instantiating BlockingConnectionPool.from_url with timeout in query args fails #2983
    - Refactor how asyncio.BlockingConnectionPool gets connection. #2998
    - Remove idle connections periodically #2992
    - Error-prone behaviour of SELECT command in combination with connection pool #3124
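Until the default changes, opting into the blocking pool explicitly is already possible. A minimal sketch with the async client (the URL, pool size, and timeout below are arbitrary examples):

```python
# Sketch: explicitly using BlockingConnectionPool with the async client.
import asyncio

from redis import asyncio as aioredis


async def main() -> None:
    pool = aioredis.BlockingConnectionPool.from_url(
        "redis://localhost:6379/0",
        max_connections=32,  # hard upper bound on concurrent connections
        timeout=5,           # seconds to wait for a free connection before giving up
    )
    r = aioredis.Redis(connection_pool=pool)
    try:
        await r.set("key", "value")
        print(await r.get("key"))
    finally:
        await r.close()
        await pool.disconnect()


asyncio.run(main())
```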
- Better connection pool design and abstraction to embrace underlying transport type differences and connection errors (one possible shape is sketched after this list)
  - Async redis subscription inside websocket doesn't shut down properly #2523
  - "Connection closed by server." #2773
  - redis's connection pool makes me wonder #2832
  - Exceptions, redis and asyncio #2727
  - redis.exceptions.ConnectionError with 4.5.2 #2636
  - Optionally disable disconnects in read_response #2695
  - Fix ConnectionPool deadlock triggered by gc #3000 (though this is a thread use case)
  - Stderr backtrace caused by `redis.client.Redis.__del__` on process exit. #3014
  - Why warn on e.g. "Unclosed RedisCluster client"? #3026
  - Redis connection release after reset #3043
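One possible shape for such an abstraction, purely as a sketch (none of these names exist in redis-py): an explicit acquire/release contract that owns reconnection and error normalization regardless of the underlying transport.

```python
# Hypothetical sketch of a transport-agnostic pool interface.  Callers would
# see one acquire/release contract and one normalized error taxonomy whether
# the transport is TCP, TLS, or a Unix socket.
from contextlib import asynccontextmanager
from typing import AsyncIterator, Optional, Protocol

from redis.asyncio.connection import Connection


class AsyncConnectionPool(Protocol):
    async def acquire(self, timeout: Optional[float] = None) -> Connection:
        """Return a healthy connection, reconnecting transparently if needed."""
        ...

    async def release(self, conn: Connection, *, healthy: bool = True) -> None:
        """Return a connection to the pool, or dispose of it if marked unhealthy."""
        ...


@asynccontextmanager
async def borrowed(pool: AsyncConnectionPool) -> AsyncIterator[Connection]:
    # Convenience wrapper so callers cannot forget to release.
    conn = await pool.acquire()
    try:
        yield conn
    except Exception:
        await pool.release(conn, healthy=False)
        raise
    else:
        await pool.release(conn)
```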
- The sentinel's connection pool should also have a blocking version.
  - Currently there is no `BlockingSentinelConnectionPool`.
  - Rotating Slaves in Sentinel fails when hitting last slave in SentinelConnectionPool. #2956
  - We need to clearly define whether the delay after connecting to the sentinel but before connecting to the target master/slave is included in the socket connection timeout or not.
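For context, the pool class behind `master_for()` is already pluggable, so a blocking variant could slot in the same way once it exists. A sketch of current usage (the sentinel hosts and service name are placeholders; the commented-out class is hypothetical):

```python
# Sketch: async Sentinel usage today, with the pool class passed explicitly.
import asyncio

from redis.asyncio.sentinel import Sentinel, SentinelConnectionPool


async def main() -> None:
    sentinel = Sentinel(
        [("sentinel1", 26379), ("sentinel2", 26379)],
        socket_timeout=2.0,  # unclear whether this should also cover the sentinel -> master hop (see above)
    )
    master = sentinel.master_for(
        "mymaster",
        connection_pool_class=SentinelConnectionPool,
        # connection_pool_class=BlockingSentinelConnectionPool,  # hypothetical, does not exist yet
        socket_timeout=2.0,
    )
    await master.set("key", "value")


asyncio.run(main())
```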
- The new `CLIENT SETINFO` mechanism should be generalized.
  - Add support for new redis command CLIENT SETINFO #2682
  - What if a failure occurs while sending this command?
    - In my test cases which exercise retries across a sentinel master failover or a redis server restart, this `CLIENT SETINFO` breaks the retry semantics.
    - Could we enforce the intrinsic/user-defined retry on such an event?
  - What if a user wants to insert additional commands like `CLIENT SETNAME` or `CLIENT NO-EVICT on`?
  - Maybe we could refactor it as a connection-establishment callback, whose failure is ignored before disposing of the faulty connection (a rough sketch follows below).
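A rough sketch of that idea; the hook type, the `on_connect_hooks` parameter, and the client name used below are all made up for illustration:

```python
# Hypothetical sketch: connection-establishment hooks that run right after the
# handshake.  A failing hook would be ignored (logged), the faulty connection
# disposed of, and the command retried through the usual retry machinery.
from typing import Awaitable, Callable

from redis.asyncio.connection import Connection

ConnectHook = Callable[[Connection], Awaitable[None]]


async def setinfo_hook(conn: Connection) -> None:
    # The built-in CLIENT SETINFO step, expressed as a hook.
    await conn.send_command("CLIENT", "SETINFO", "lib-name", "redis-py")
    await conn.read_response()


async def setname_hook(conn: Connection) -> None:
    # A user-supplied extra handshake command.
    await conn.send_command("CLIENT", "SETNAME", "my-worker")
    await conn.read_response()


# Hypothetical wiring (the on_connect_hooks parameter does not exist today):
# pool = BlockingConnectionPool.from_url(
#     "redis://localhost", on_connect_hooks=[setinfo_hook, setname_hook]
# )
```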
- Backport policy for bug fixes
  - Since I had to roll back to 4.5.5 after experiencing connection leaks with the 4.6.0 → 5.0.1 upgrade, it would be nice to have a backport policy for critical bug fixes.