Description
Environment details
- macOS Ventura 13.5.2
- Python version: 3.11
- pip version: 23.3.2
google-cloud-spanner
version: 3.40.1
I've created a script to simulate the issue here: https://github.com/MostafaOmar98/python-spanner-issue
Issue description:
When using a google cloud spanner session that has been deleted from server-side, you should get a 404 Session NotFound
error.
Currently google cloud spanner deletes sessions from the server side for the following reasons source:
- A client can delete a session.
- The Cloud Spanner database service can delete a session when the session is idle for more than 1 hour.
- The Cloud Spanner database service may delete a session if the session is more than 28 days old.
Using PingingPool
mainly protects the library user from (2) but does not cover (1) & does not cover (3) in case there is a steady flow of requests to the pool.
I'm mostly concerned with (3) though.
Issue reason:
PingingPool
keeps track of a locally calculated timestamp calledping_after
associated with each session in the pool.- When
PingingPool
is pinged (usingping()
).PingingPool
uses theping_after
timestamp and compares it to the current time to check whether it should actually initiate a server ping (by sending a dummy query such asSELECT 1
) or should just do nothing - the
ping_after
representsmost_recent_activity_on_the_session
+ping_interval
. Where:ping_interval
is a supplied option in secondsmost_recent_activity_on_the_session
is the last time the session was returned to the pool
I assume the rationale behind checking the ping_after
is to not send unnecessary pings to the server when the session has been recently used so it is assumed to be fresh/active
Case where this fails:
Assume you have a PingingPool
of size 1. The session in this pool has been active for 28+ days, so spanner service decided to delete it from server side (PingingPool
is unaware of the deletion).
Here is what happens when client attempts to use the deleted session
Client: Request a session from the pool
PingingPool: Returns the deleted session (unaware that it was deleted since there are no checks on existence at get time if the ping_after
time didn't get surpassed)
Client: Attempts to use the deleted session, receives 404 Session NotFound
, gracefully returns the session to the pool (Client doesn't differentiate between Session NotFound
errors and any other errors)
PingingPool: Puts the deleted session back into the pool, refreshes the ping_after
timestamp, making future ping
attempts useless since from the pool's POV, the session was recently used.
All of that while our background thread desperately trying to ping the pool, but having no side-effect since the ping_after
keeps getting refreshed by client requesting the dead session and putting it back in the pool.
Victims of this issues:
Long runnning deployments using PingingPool
where there is a steady flow of requests.
Note: The mention of steady flow of request is because if there hasn't been any requests to the PingingPool
for the duration of the ping_interval
, PingingPool
's ping will actually go through, and will replace deleted session with a new one
Proposed solutions:
1- Have PingingPool
keep track of session creation time, add an extra check to get
to make it check/replace session if sessions are >= 28 days old
This will cover the case of spanner deleting old sessions, but will not cover the general case of server-side deletions.
However, it seems that the only other case is manual deletion of a session (who does that?). So if you're going to manually delete a session, might as well manually refresh your pool. Hence, this is my suggested solution
2- Have PingingPool
clients specially check for Session NotFound
errors and replace the deleted session with a new session
This is implemented in the Java client library. Seems like a lot of effort though.
References:
This issue was referenced in other google cloud libraries and there are solutions to them
PHP:
issue: googleapis/google-cloud-php#5827
Solution PR (implemented the tracking of session created time): googleapis/google-cloud-php#5853
Another issue ~4 months after by the same user asking to handle more general cases of session deletion but was deemed too much effort: googleapis/google-cloud-php#6284
Go:
issue: googleapis/google-cloud-go#1527
Solution: They seemed to also go for the 2nd proposed solution, they did it gradually on multiple CLs mentioned in the referenced issues tho.
Java:
Couldn't find issue references, but here is a PR that solves the issue by the 2nd proposed solution: googleapis/google-cloud-java#4734
How to use this script to illustrate the issue
Prerequisites:
- Local installation of
poetry
- Spanner instance & database created.
- Select the project in which the database is created to be the default project by executing
gcloud init
locally
Setup
-
Clone the repo, cd into it, and create a
.env
file similar to the.env.sample
and supply your instance & db names -
Run the following to enter into poetry venv
poetry shell
- Install the needed packages
poetry install
- After installation is done run the script
python3 script.py
Usage
You will be greeted by an interactive cli
What do you want to do?
1-Execute query
2-Delete session server side
3-Desperately try to ping the pool yourself
Enter 1
to test sending a dummy query to your database (the query is SELECT 1
)
You should have this printed in your console
Query executed successfuly
Now enter 2
to delete the session from server side while having the pool be unaware of the deletion
You should have this printed on your console with {session-id} replaced with whatever id of the session was
session with id {session-id} successfully deleted
Now enter 1
again to try to send another query.
You should get this response
Query Failed
404 Session not found: projects/{project-id}/instances/{instance-name}/databases/{database-name}/sessions/{session-id} [resource_type: "type.googleapis.com/google.spanner.v1.Session"
resource_name: "projects/{project-id}/instances/{instance-name}/databases/{session-name}/sessions/{session-id}"
description: "Session does not exist."
]
Enter 3
to try to manually initiate a pool ping
You should see this
Pool pinged successfully
Now enter 1
again to see how pinging the pool did not help
You should see the same error again