Only delete page content if the number of connections is zero #4285

Open · wants to merge 7 commits into main
Conversation

@falkoschindler (Contributor)

This PR tries to solve #4253 with a new _num_connections counter: on handshake the counter is incremented, on disconnect it is decremented. After a disconnect, once the reconnect timeout has elapsed, the page content is only deleted if the number of connections is zero (see the sketch below the scenario list).

Possible scenarios:

  • normal reconnect:
    handshake (num=1) ... disconnect (num=0) ... handshake (num=1) ... after delay: don't delete
    
  • strange reconnect with handshake before disconnect:
    handshake (num=1) ... handshake (num=2) ... disconnect (num=1) ... after delay: don't delete
    
  • normal disconnect:
    handshake (num=1) ... disconnect (num=0) ... after delay: delete
    
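For illustration, a minimal sketch of this counting logic (RECONNECT_TIMEOUT and _delete_content are hypothetical stand-ins, not the PR's actual names):

import asyncio

RECONNECT_TIMEOUT = 3.0  # hypothetical stand-in for the configured reconnect timeout

class Client:
    def __init__(self) -> None:
        self._num_connections = 0

    def handle_handshake(self) -> None:
        self._num_connections += 1  # a (re)connecting socket registers itself

    async def handle_disconnect(self) -> None:
        self._num_connections -= 1
        await asyncio.sleep(RECONNECT_TIMEOUT)  # give the client a chance to reconnect
        if self._num_connections == 0:  # nobody reconnected in the meantime
            self._delete_content()

    def _delete_content(self) -> None:
        ...  # hypothetical placeholder for deleting the page content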

While working on the handshake and disconnect implementation, I noticed an inconsistency between app.on_connect and app.on_disconnect: while connect handlers are called on reconnect, disconnect handlers are not. In this PR I changed it so that both handlers are called during reconnection. We should clarify whether this is the desired behavior.

@falkoschindler falkoschindler added the bug Something isn't working label Jan 28, 2025
@falkoschindler falkoschindler added this to the 2.11 milestone Jan 28, 2025
@falkoschindler falkoschindler requested a review from rodja January 28, 2025 16:43
@falkoschindler falkoschindler linked an issue Jan 28, 2025 that may be closed by this pull request
@falkoschindler (Contributor, Author)

Something in this PR seems to slow down our pytests. Usually the ~500 tests run in around 15 minutes per machine. Now they time out after 30 minutes.

nicegui/client.py: outdated review comment (resolved)
@chriswi93

> Something in this PR seems to slow down our pytests. Usually the ~500 tests run in around 15 minutes per machine. Now they time out after 30 minutes.

If I had to guess, I would say that await client.handle_disconnect() blocks until asyncio.sleep() is finished. Previously there was a background task, and client.handle_disconnect() returned immediately.
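To illustrate the difference, a sketch (assuming NiceGUI's background_tasks.create helper for tasks that are canceled on shutdown):

# blocking: every caller waits out the full reconnect timeout
await client.handle_disconnect()

# non-blocking: the sleep runs in a background task that teardown can cancel
background_tasks.create(client.handle_disconnect())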

@falkoschindler (Contributor, Author)

falkoschindler commented Jan 29, 2025

That's a good explanation, @chriswi93! So it is better to sleep in a background task, because background tasks are gracefully canceled during teardown.

Apart from that, we noticed a theoretical flaw in the connection counter: the delay could end right in the middle of a "normal reconnect", at a moment when the disconnect has already happened and the counter num is 0. This would cause the page to be deleted impermissibly (⚠️):

num: 1            0   1     0   1           2   1                 0
     H .......... D ----------> delete because num=0 ⚠️
                      H ... D ----------> no delete because num>0
                                H ............. D ----------> no delete because num>0
                                            H ................... D ----------> delete because num=0

To overcome this problem, we can go back to using a background task for the content deletion, which is canceled (❌) by subsequent handshakes (sketched in code below the diagram):

num: 1            0   1     0   1           2   1                 0
     H .......... D --> ❌
                      H ... D --> ❌
                                H ............. D ----------> no delete because num>0
                                            H ................... D ----------> delete because num=0
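A sketch of this cancellation approach, reusing the hypothetical names from above (_deletion_task is likewise a hypothetical attribute):

import asyncio

RECONNECT_TIMEOUT = 3.0  # hypothetical stand-in for the configured reconnect timeout

class Client:
    def __init__(self) -> None:
        self._num_connections = 0
        self._deletion_task = None  # pending deletion task, if any

    def handle_handshake(self) -> None:
        self._num_connections += 1
        if self._deletion_task is not None:  # a new handshake cancels a pending deletion (❌)
            self._deletion_task.cancel()
            self._deletion_task = None

    def handle_disconnect(self) -> None:
        self._num_connections -= 1

        async def delete_content() -> None:
            await asyncio.sleep(RECONNECT_TIMEOUT)
            if self._num_connections == 0:  # nobody reconnected in the meantime
                self._delete_content()

        self._deletion_task = asyncio.create_task(delete_content())

    def _delete_content(self) -> None:
        ...  # hypothetical placeholder for deleting the page content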

@chriswi93

chriswi93 commented Jan 29, 2025

> That's a good explanation, @chriswi93! So it is better to sleep in a background task, because background tasks are gracefully canceled during teardown.
>
> Apart from that, we noticed a theoretical flaw in the connection counter: the delay could end right in the middle of a "normal reconnect", at a moment when the disconnect has already happened and the counter num is 0. This would cause the page to be deleted impermissibly (⚠️):
>
> num: 1            0   1     0   1           2   1                 0
>      H .......... D ----------> delete because num=0 ⚠️
>                       H ... D ----------> no delete because num>0
>                                 H ............. D ----------> no delete because num>0
>                                             H ................... D ----------> delete because num=0
>
> To overcome this problem, we can go back to using a background task for the content deletion, which is canceled (❌) by subsequent handshakes:
>
> num: 1            0   1     0   1           2   1                 0
>      H .......... D --> ❌
>                       H ... D --> ❌
>                                 H ............. D ----------> no delete because num>0
>                                             H ................... D ----------> delete because num=0

Good point! I think we are quite close 🙂
I tested your code from today and it seems to resolve the white screen issue!

But it turns out that it introduces a breaking change I was not aware of, and I think many applications built on top of NiceGUI would be affected in the same way: our service uses a disconnect handler to clean up resources after a client disconnects. Since disconnect handlers are no longer called once per client just before deletion, but now for each disconnect and even on reconnect, you can no longer use disconnect handlers as a reliable signal that a client is finally gone. As a result, the resources for our service are cleaned up too early.
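For illustration, a cleanup pattern along these lines (release_resources_for is a hypothetical application-side helper) would now fire on every reconnect:

from nicegui import app

def cleanup(client) -> None:
    # with this PR, this also runs on every temporary disconnect/reconnect,
    # not just once when the client is finally gone
    release_resources_for(client)  # hypothetical application-side helper

app.on_disconnect(cleanup)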

I would suggest two options to fix this:

  1. Call disconnect handlers right after the if statement inside the background task (see the sketch after this list):

if self._num_connections == 0:
    ...

And document that there is an intended inconsistency: a connect handler might be called many times for a single client, but the disconnect handler is only called once per client, just before its resources are cleaned up.

  2. Add another on_delete() handler so that applications can reliably clean up their resources when a client has finally gone.
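A sketch of option 1, reusing the background task from the earlier sketch; _disconnect_handlers is a hypothetical name for the registry of app.on_disconnect handlers:

async def delete_content() -> None:
    await asyncio.sleep(RECONNECT_TIMEOUT)
    if self._num_connections == 0:
        for handler in self._disconnect_handlers:  # hypothetical registry
            handler(self)  # called only once per client, just before deletion
        self._delete_content()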

@falkoschindler (Contributor, Author)

@chriswi93 You're right, calling disconnect handlers more than once can break user code. Even though one could argue that it was a bug and should be fixed, the impact is quite severe.

Besides your option 1 (keep the inconsistent behavior as before) and option 2 (consistent behavior, but a breaking change), I see a third option that calls both handlers only once. It would also be a breaking change, but maybe a less critical one:

  1. Multiple connects, single disconnect
  2. Multiple connects, multiple disconnects, single delete
  3. Single connect, single disconnect

But it's hard to tell if there are users relying on a connect handler being called on reconnect.

@chriswi93

@falkoschindler Yes, that would be another way to make it consistent. All three options are valid in my opinion.
Option 1 might be unexpected behavior: if you don't know about it, your application recreates or overwrites existing resources for a client.
Option 2 is the most flexible option, but at the moment I cannot see any reason why you would need to know about reconnects if the connection is not completely lost.
Option 3 keeps it simple for most users, without having to think about reconnects.

@falkoschindler (Contributor, Author)

Ok, I'm going to keep the current behavior of multiple connects and a single disconnect (option 1). We can improve consistency with a breaking change in version 3.0, but we should fix this bug first.

However, I noticed another aspect we didn't think about: on the shared auto-index client there are multiple open connections. Currently we call multiple connects and a single disconnect, like on regular pages. But the _num_connections counter doesn't work for multiple concurrent connections.

So I thought we could use a defaultdict[int] instead, keyed by some connection ID (see the sketch below):

  • We could use the tab_id, but it isn't available in the @sio.on('disconnect') handler.
  • Or we could use the socket ID, but it isn't available via On Air in the @self.relay.on('client_disconnect') handler.

Because we need to add socket IDs to On Air messages anyway (see #4218), I suggest doing that first and using them for the connection dictionary.
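A sketch of the per-connection dictionary; passing a socket_id into the handlers is the suggestion under discussion, not existing API:

from collections import defaultdict

class Client:
    def __init__(self) -> None:
        self._num_connections: defaultdict[str, int] = defaultdict(int)

    def handle_handshake(self, socket_id: str) -> None:
        self._num_connections[socket_id] += 1

    def handle_disconnect(self, socket_id: str) -> None:
        self._num_connections[socket_id] -= 1
        # ...after the reconnect timeout, delete only if
        # self._num_connections[socket_id] == 0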

@chriswi93

> Ok, I'm going to keep the current behavior of multiple connects and a single disconnect (option 1). We can improve consistency with a breaking change in version 3.0, but we should fix this bug first.

Yes, I think it's a good idea to postpone the breaking change to a major release.

> However, I noticed another aspect we didn't think about: on the shared auto-index client there are multiple open connections. Currently we call multiple connects and a single disconnect, like on regular pages. But the _num_connections counter doesn't work for multiple concurrent connections.
>
> So I thought we could use a defaultdict[int] instead, keyed by some connection ID:
>
>   • We could use the tab_id, but it isn't available in the @sio.on('disconnect') handler.
>   • Or we could use the socket ID, but it isn't available via On Air in the @self.relay.on('client_disconnect') handler.
>
> Because we need to add socket IDs to On Air messages anyway (see #4218), I suggest doing that first and using them for the connection dictionary.

As far as I can see, you share a single Client instance and all its elements among many users and tabs, so all users share the same client ID. However, I think the socket ID won't resolve the issue, for the following reason:

I would assume that a new socket ID is assigned to every new socket connection, and I guess this also applies to a connection that is the result of a reconnect.

For non-shared clients

If my assumption is true, the counter in _num_connections will never exceed 1 for any socket ID, and the client resources are deleted too early on the first socket disconnect, even if the _num_connections dict still contains another socket ID. Therefore, the socket ID has no additional benefit, and it should be fine to keep _num_connections as a simple integer counter.

For the shared auto-index client

If all users share the same client ID, information is lost, since we can no longer use the client ID to identify an individual user tab. Therefore we don't know when all connections for an individual user tab are actually closed, and as a result it is hard to determine when the disconnect handlers should be called.

I don't know if the tab ID is something you could use instead for the shared auto-index page. Otherwise, it might be a good idea to add a unique ID that makes it possible to identify a unique user tab across reconnects, regardless of whether it is a shared client or not.

If this involves too many changes, it might be sufficient for now to just ignore the _num_connections counter for a shared client inside the background task:

# when self.shared is True, the asyncio.sleep(...) is not required either
if self.shared or self._num_connections == 0:
    ...

Instead of:

if self._num_connections[socket_id] == 0:
    ...

Then, for a shared client, the disconnect handlers would be called after each disconnect event. However, it should be noted that an application built on top of the auto-index feature cannot allocate/deallocate resources for a single user tab, which could be an intended limitation. Therefore, in my opinion it is consistent to say that the behavior of how disconnect handlers are called is different for a shared client.

Labels: bug (Something isn't working)
Linked issue that may be closed by this pull request: White screen after socket reconnect
Participants: 2