Pooling#2345
Conversation
Added connection pool options for SQL Alchemy engine.
Change pool_recycle to -1 to preserve current behavior.
Added SQLAlchemy connection-pool tuning options to configuration.
test_sql_pool_options.py exercises `store_db_parameters()` directly, requires no database, and runs in standard CI. It asserts the zero-behaviour-change defaults, override + typing, no DBAPI leakage, the existing dict-filtering, hashable/deterministic cache keys, and coexistence with search_path.
|
Is there a reason to not pop the attributes from the connect_args inside of get_engine? This would consolidate a bit of the complications noted in the PR between hashing and the manager using get_engine. Maybe I am missing something |
ricardogsilva
left a comment
There was a problem hiding this comment.
Just leaving my two cents here - I'm not a core committer so take these with a grain of salt.
Overall I agree with the PR, as adding these connection-related options seems relevant - thanks for your work and I look forward to having it merged!
Personally, I would simplify the implementation a bit, by relying on pygeoapi's JSON Schema document for the validation of the config.
And I would not include most of these tests, which I see as not being relevant.
| # Defaults keep SQLAlchemy's QueuePool sizing but, unlike SQLAlchemy's | ||
| # default of -1, recycle connections after an hour so that pooled | ||
| # connections cannot sit IDLE on the server indefinitely. |
There was a problem hiding this comment.
This part of the comment seems to be outdated, as you end up setting the default value of pool_recycle to -1
| # SQLAlchemy connection-pool tuning (optional). Defaults match | ||
| # SQLAlchemy's QueuePool and preserve previous behaviour. | ||
| # Persistent connections held open per worker process. | ||
| pool_size: 5 | ||
| # Extra short-lived connections allowed above pool_size. | ||
| max_overflow: 10 | ||
| # Recreate connections older than this many seconds. -1 (the | ||
| # default) never recycles; set a finite value (e.g. 300) so | ||
| # pooled connections cannot sit IDLE on the server indefinitely. | ||
| pool_recycle: -1 | ||
| # Seconds to wait for a connection from the pool before erroring. | ||
| pool_timeout: 30 | ||
| # Test connections with a lightweight ping before use. | ||
| pool_pre_ping: true |
There was a problem hiding this comment.
All of these new parameters need to be added to the config schema at
pygeoapi/resources/schemas/config/pygeoapi-config-0.x.yml
This will make it possible to test a pygeoapi configuration for correctness even before starting up the server.
| (key, type(default)(options.pop(key, default))) | ||
| for key, default in pool_defaults.items() | ||
| )) | ||
|
|
There was a problem hiding this comment.
In my opinion this could be made easier to read and less complex by:
- Storing
self.db_pool_optionsas adictinstead of atuple, and defer tuple creation to whenget_engineis called; - Relying on the types of passed
optionsalready being correct. Adding these new parameters to the config JSON Schema (as I mentioned in my other comment) would mean that the type of each parameter would already be documented and would be enforceable by doing a validation of the config.
Also, note that your implementation contains a subtle bug when trying to parse pool_pre_ping. If the original value had been:
{'pool_pre_ping': 'False'} # I'm passing a string with the value of "False"then the outcome would be:
# type(True)("False")
TrueIn other words, bool("False") is actually True because non-empty strings are truthy.
-pool_defaults = {
- 'pool_size': 5,
- 'max_overflow': 10,
- 'pool_recycle': -1, # SQLAlchemy default; preserves current behaviour
- 'pool_timeout': 30,
- 'pool_pre_ping': True,
-}
-self.db_pool_options = tuple(sorted(
- (key, type(default)(options.pop(key, default)))
- for key, default in pool_defaults.items()
-))
+self.pool_defaults = {
+ 'pool_size': options.pop('pool_size', 5),
+ 'max_overflow': options.pop('max_overflow', 10),
+ 'pool_recycle': options.pop('pool_recycle', -1), # SQLALchemy default - never release connections
+ 'pool_timeout': options.pop('pool_timeout', 30),
+ 'pool_pre_ping': options.pop('pool_pre_ping', True),
+}| self.db_user, | ||
| self._db_password, | ||
| self.db_conn, | ||
| self.db_pool_options, |
There was a problem hiding this comment.
as per my other comment, in my opinion it would be clearer if the tuple would be generated here, perhaps also accompanied with a comment mentioning that this is made as a way to enable making use of functools.cache.
Also, in modern Python, a dict's insertion ordering is preserved, so I don't think sorting the tuple would be needed.
| self.db_pool_options, | |
| tuple(self.db_pool_options.items()), # convert to hashable type, for using with functools.cache |
There was a problem hiding this comment.
I just found out that the upcoming Python 3.15 will come with a new frozendict type. This means this new dict type will be hashable, thus being usable as part of a cacheable function.
It will likely take a long while before pygeoapi can rely on this though, but nice to see this becoming a standard feature in Python.
| def test_pool_options_defaults_preserve_current_behaviour(): | ||
| obj = _Dummy() | ||
| store_db_parameters(obj, dict(CONN), {}) | ||
| pool = dict(obj.db_pool_options) | ||
| # Defaults must match pre-existing effective behaviour: | ||
| # pool_pre_ping was hardcoded True; pool_recycle was unset (-1). | ||
| assert pool['pool_size'] == 5 | ||
| assert pool['max_overflow'] == 10 | ||
| assert pool['pool_timeout'] == 30 | ||
| assert pool['pool_pre_ping'] is True | ||
| assert pool['pool_recycle'] == -1 |
There was a problem hiding this comment.
This test seems unnecessary to me - when this PR gets merged, the behavior it implements will become the current behavior, so the test looses its relevancy.
| def test_pool_options_are_overridable_and_typed(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'pool_size': 2, 'max_overflow': 3, 'pool_recycle': 300}, | ||
| ) | ||
| pool = dict(obj.db_pool_options) | ||
| assert pool['pool_size'] == 2 and isinstance(pool['pool_size'], int) | ||
| assert pool['max_overflow'] == 3 | ||
| assert pool['pool_recycle'] == 300 | ||
| # untouched keys keep defaults | ||
| assert pool['pool_timeout'] == 30 | ||
| assert pool['pool_pre_ping'] is True |
There was a problem hiding this comment.
This test would be unnecessary if you'd go with my suggestion above, of storing db_pool_options as a dict instead of a tuple and you'd rely on the configuration being valid after having added the JSON schema bits that are missing.
| def test_dict_valued_options_still_filtered(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'pool_size': 2, 'zoom': {'min': 0, 'max': 22}}, | ||
| ) | ||
| assert 'zoom' not in obj.db_options | ||
| assert dict(obj.db_pool_options)['pool_size'] == 2 | ||
|
|
There was a problem hiding this comment.
This test seems to be unnecessary, as it is not testing the changes you made in this PR. It verifies that the contents of obj.db_options are correct.
IMO this PR does not make any changes that would warrant this verification, unless you would not trust the behavior of dict.pop, which is a Python builtin.
| def test_pool_options_hashable_and_deterministic(): | ||
| a, b = _Dummy(), _Dummy() | ||
| store_db_parameters(a, dict(CONN), {'pool_size': 2}) | ||
| store_db_parameters(b, dict(CONN), {'pool_size': 2}) | ||
| # identical config -> identical key -> shared engine via functools.cache | ||
| assert a.db_pool_options == b.db_pool_options | ||
| assert hash(a.db_pool_options) == hash(b.db_pool_options) | ||
|
|
||
| c = _Dummy() | ||
| store_db_parameters(c, dict(CONN), {'pool_size': 9}) | ||
| # differing pool config -> distinct key (separate engine, by design) | ||
| assert c.db_pool_options != a.db_pool_options | ||
|
|
There was a problem hiding this comment.
This is testing Python's own implementation of how tuples are hashed, so I don't think it is relevant to include in pygeoapi.
| def test_pool_options_coexist_with_search_path(): | ||
| obj = _Dummy() | ||
| store_db_parameters( | ||
| obj, dict(CONN), | ||
| {'search_path': ['published', 'public'], 'pool_size': 4}, | ||
| ) | ||
| assert obj.db_search_path == ('published', 'public') | ||
| assert dict(obj.db_pool_options)['pool_size'] == 4 | ||
|
|
There was a problem hiding this comment.
This test seems unnecessary, as it does not test the functionality introduced in this PR
That's a good shout, I'll refactor |
Overview
Makes the SQLAlchemy connection pool of the SQL provider configurable per provider via the existing
options:block, exposingpool_size,max_overflow,pool_recycle,pool_timeoutandpool_pre_ping.Previously
get_engine()calledcreate_engine(conn_str, connect_args=connect_args, pool_pre_ping=True)with no pool sizing or recycle, so the defaultQueuePoolheldpool_sizeconnections open for the life of each worker process and never recycled them. In multi-process deployments this produces a large number of permanently-IDLE server-side connections (we saw connections idle for days, eventually exhaustingmax_connections). There was no way to bound or recycle the pool from configuration.Changes:
store_db_parameters()now extracts the five pool keys fromoptions, coerces them to their declared types, and stores them as a sorted, hashabletuple(self.db_pool_options). They are popped out ofoptionsso they are not forwarded to the DBAPI asconnect_args.get_engine()takes apool_optionstuple parameter and applies**dict(pool_options)tocreate_engine(). It stays@functools.cache-able because the parameter is a hashable tuple, so engine sharing per process is preserved; providers with differing pool config correctly get distinct engines.pygeoapi/process/manager/postgresql.pyalso callsget_engine(); its call site is updated to passself.db_pool_optionsso the manager does not losepool_pre_pingor skip recycling.Backward compatibility: defaults preserve current behaviour exactly —
pool_size=5,max_overflow=10,pool_pre_ping=True, andpool_recycle=-1(SQLAlchemy's default, i.e. the current effective behaviour).This PR is therefore a pure, opt-in feature add with no behaviour change for existing users. (See the issue for discussion of whether a finite default
pool_recycleshould be adopted as a separate follow-up.)New tests and documentation are included.
Related Issue / discussion
Closes #2344.
Additional information
Example configuration:
Note (documented): because
get_engine()is@functools.cache-d on its full argument set, providers that share a database must use identical pool options to continue sharing a single engine per worker; differing options intentionally yield separate engines.Dependency policy (RFC2)
No new dependencies are introduced; only the standard library and the already-required SQLAlchemy are used.
Updates to public demo
local.config.ymldoes not need to change.Contributions and licensing