Only delete page content if the number of connections is zero #4285

Open · wants to merge 7 commits into main
Conversation

@falkoschindler (Contributor)

This PR tries to solve #4253 with a new _num_connections counter: on handshake the counter is incremented, on disconnect it is decremented. After a disconnect, once the reconnect timeout has elapsed, the page content is only deleted if the number of connections is zero (see the sketch below the scenario list).

Possible scenarios:

  • normal reconnect:
    handshake (num=1) ... disconnect (num=0) ... handshake (num=1) ... after delay: don't delete
    
  • strange reconnect with handshake before disconnect:
    handshake (num=1) ... handshake (num=2) ... disconnect (num=1) ... after delay: don't delete
    
  • normal disconnect:
    handshake (num=1) ... disconnect (num=0) ... after delay: delete
    
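For illustration, a minimal sketch of this counting logic (RECONNECT_TIMEOUT and _delete_content are hypothetical stand-ins, not the PR's actual names):

import asyncio

RECONNECT_TIMEOUT = 3.0  # hypothetical stand-in for the configured reconnect timeout

class Client:
    def __init__(self) -> None:
        self._num_connections = 0

    def handle_handshake(self) -> None:
        self._num_connections += 1  # a (re)connecting socket registers itself

    async def handle_disconnect(self) -> None:
        self._num_connections -= 1
        await asyncio.sleep(RECONNECT_TIMEOUT)  # give the client a chance to reconnect
        if self._num_connections == 0:  # nobody reconnected in the meantime
            self._delete_content()

    def _delete_content(self) -> None:
        ...  # hypothetical placeholder for deleting the page content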

While working on the handshake and disconnect implementation, I noticed an inconsistency between app.on_connect and app.on_disconnect: while connect handlers are called on reconnect, disconnect handlers are not. In this PR I changed it so that both handlers are called during reconnection. We should clarify whether this is the desired behavior.

@falkoschindler falkoschindler added the bug Something isn't working label Jan 28, 2025
@falkoschindler falkoschindler added this to the 2.11 milestone Jan 28, 2025
@falkoschindler falkoschindler requested a review from rodja January 28, 2025 16:43
@falkoschindler falkoschindler linked an issue Jan 28, 2025 that may be closed by this pull request
@falkoschindler (Contributor, Author)

Something in this PR seems to slow down our pytests. Usually the ~500 tests run in around 15 minutes per machine. Now they time out after 30 minutes.

nicegui/client.py: outdated review comment (resolved)
@chriswi93

> Something in this PR seems to slow down our pytests. Usually the ~500 tests run in around 15 minutes per machine. Now they time out after 30 minutes.

If I had to guess, I would say that await client.handle_disconnect() blocks until asyncio.sleep() is finished. Previously there was a background task, and client.handle_disconnect() returned immediately.
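To illustrate the difference, a sketch (assuming NiceGUI's background_tasks.create helper for tasks that are canceled on shutdown):

# blocking: every caller waits out the full reconnect timeout
await client.handle_disconnect()

# non-blocking: the sleep runs in a background task that teardown can cancel
background_tasks.create(client.handle_disconnect())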

@falkoschindler (Contributor, Author)

falkoschindler commented Jan 29, 2025

That's a good explanation, @chriswi93! So it is better to sleep in a background task, because background tasks are gracefully canceled during teardown.

Apart from that, we noticed a theoretical flaw in the connection counter: the delay could end right in the middle of a "normal reconnect", at a moment when the disconnect has already happened and the counter num is 0. This would cause the page to be deleted impermissibly (⚠️):

num: 1            0   1     0   1           2   1                 0
     H .......... D ----------> delete because num=0 ⚠️
                      H ... D ----------> no delete because num>0
                                H ............. D ----------> no delete because num>0
                                            H ................... D ----------> delete because num=0

To overcome this problem, we can go back to using a background task for the content deletion, which is canceled (❌) by subsequent handshakes (sketched in code below the diagram):

num: 1            0   1     0   1           2   1                 0
     H .......... D --> ❌
                      H ... D --> ❌
                                H ............. D ----------> no delete because num>0
                                            H ................... D ----------> delete because num=0
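A sketch of this cancellation approach, reusing the hypothetical names from above (_deletion_task is likewise a hypothetical attribute):

import asyncio

RECONNECT_TIMEOUT = 3.0  # hypothetical stand-in for the configured reconnect timeout

class Client:
    def __init__(self) -> None:
        self._num_connections = 0
        self._deletion_task = None  # pending deletion task, if any

    def handle_handshake(self) -> None:
        self._num_connections += 1
        if self._deletion_task is not None:  # a new handshake cancels a pending deletion (❌)
            self._deletion_task.cancel()
            self._deletion_task = None

    def handle_disconnect(self) -> None:
        self._num_connections -= 1

        async def delete_content() -> None:
            await asyncio.sleep(RECONNECT_TIMEOUT)
            if self._num_connections == 0:  # nobody reconnected in the meantime
                self._delete_content()

        self._deletion_task = asyncio.create_task(delete_content())

    def _delete_content(self) -> None:
        ...  # hypothetical placeholder for deleting the page content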

@chriswi93

chriswi93 commented Jan 29, 2025

> That's a good explanation, @chriswi93! So it is better to sleep in a background task, because background tasks are gracefully canceled during teardown.
>
> Apart from that, we noticed a theoretical flaw in the connection counter: the delay could end right in the middle of a "normal reconnect", at a moment when the disconnect has already happened and the counter num is 0. This would cause the page to be deleted impermissibly (⚠️):
>
> num: 1            0   1     0   1           2   1                 0
>      H .......... D ----------> delete because num=0 ⚠️
>                       H ... D ----------> no delete because num>0
>                                 H ............. D ----------> no delete because num>0
>                                             H ................... D ----------> delete because num=0
>
> To overcome this problem, we can go back to using a background task for the content deletion, which is canceled (❌) by subsequent handshakes:
>
> num: 1            0   1     0   1           2   1                 0
>      H .......... D --> ❌
>                       H ... D --> ❌
>                                 H ............. D ----------> no delete because num>0
>                                             H ................... D ----------> delete because num=0

Good point! I think we are quite close 🙂
I tested your code from today and it seems to resolve the white screen issue!

But it turns out that it introduces a breaking change I was not aware of, and I think many applications built on top of NiceGUI would be affected in the same way: our service uses a disconnect handler to clean up resources after a client disconnects. Since disconnect handlers are no longer called once per client just before deletion, but now for each disconnect and even on reconnect, you can no longer use disconnect handlers as a reliable signal that a client is finally gone. As a result, the resources for our service are cleaned up too early.
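For illustration, a cleanup pattern along these lines (release_resources_for is a hypothetical application-side helper) would now fire on every reconnect:

from nicegui import app

def cleanup(client) -> None:
    # with this PR, this also runs on every temporary disconnect/reconnect,
    # not just once when the client is finally gone
    release_resources_for(client)  # hypothetical application-side helper

app.on_disconnect(cleanup)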

I would suggest two options to fix this:

  1. Call disconnect handlers right after the if statement inside the background task (see the sketch after this list):

if self._num_connections == 0:
    ...

And document that there is an intended inconsistency: a connect handler might be called many times for a single client, but the disconnect handler is only called once per client, just before its resources are cleaned up.

  2. Add another on_delete() handler so that applications can reliably clean up their resources when a client has finally gone.
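A sketch of option 1, reusing the background task from the earlier sketch; _disconnect_handlers is a hypothetical name for the registry of app.on_disconnect handlers:

async def delete_content() -> None:
    await asyncio.sleep(RECONNECT_TIMEOUT)
    if self._num_connections == 0:
        for handler in self._disconnect_handlers:  # hypothetical registry
            handler(self)  # called only once per client, just before deletion
        self._delete_content()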

@falkoschindler (Contributor, Author)

@chriswi93 You're right, calling disconnect handlers more than once can break user code. Even though one could argue that it was a bug and should be fixed, the impact is quite severe.

Besides your option 1 (keep the inconsistent behavior as before) and option 2 (consistent behavior, but a breaking change), I see a third option that calls both handlers only once. It would also be a breaking change, but maybe a less critical one:

  1. Multiple connects, single disconnect
  2. Multiple connects, multiple disconnects, single delete
  3. Single connect, single disconnect

But it's hard to tell if there are users relying on a connect handler being called on reconnect.

@chriswi93

@falkoschindler Yes, that would be another way to make it consistent. All three options are valid in my opinion.
Option 1 might be unexpected behavior: if you don't know about it, your application recreates or overwrites existing resources for a client.
Option 2 is the most flexible option, but at the moment I cannot see any reason why you would need to know about reconnects if the connection is not completely lost.
Option 3 keeps it simple for most users, without having to think about reconnects.

@falkoschindler (Contributor, Author)

Ok, I'm going to keep the current behavior of multiple connects and a single disconnect (option 1). We can improve consistency with a breaking change in version 3.0, but we should fix this bug first.

However, I noticed another aspect we didn't think about: on the shared auto-index client there are multiple open connections. Currently we call multiple connects and a single disconnect, like on regular pages. But the _num_connections counter doesn't work for multiple concurrent connections.

So I thought we could use a defaultdict[int] instead, keyed by some connection ID (see the sketch below):

  • We could use the tab_id, but it isn't available in the @sio.on('disconnect') handler.
  • Or we could use the socket ID, but it isn't available via On Air in the @self.relay.on('client_disconnect') handler.

Because we need to add socket IDs to On Air messages anyway (see #4218), I suggest doing that first and using them for the connection dictionary.
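A sketch of the per-connection dictionary; passing a socket_id into the handlers is the suggestion under discussion, not existing API:

from collections import defaultdict

class Client:
    def __init__(self) -> None:
        self._num_connections: defaultdict[str, int] = defaultdict(int)

    def handle_handshake(self, socket_id: str) -> None:
        self._num_connections[socket_id] += 1

    def handle_disconnect(self, socket_id: str) -> None:
        self._num_connections[socket_id] -= 1
        # ...after the reconnect timeout, delete only if
        # self._num_connections[socket_id] == 0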

@chriswi93

> Ok, I'm going to keep the current behavior of multiple connects and a single disconnect (option 1). We can improve consistency with a breaking change in version 3.0, but we should fix this bug first.

Yes, I think it's a good idea to postpone the breaking change to a major release.

> However, I noticed another aspect we didn't think about: on the shared auto-index client there are multiple open connections. Currently we call multiple connects and a single disconnect, like on regular pages. But the _num_connections counter doesn't work for multiple concurrent connections.
>
> So I thought we could use a defaultdict[int] instead, keyed by some connection ID:
>
>   • We could use the tab_id, but it isn't available in the @sio.on('disconnect') handler.
>   • Or we could use the socket ID, but it isn't available via On Air in the @self.relay.on('client_disconnect') handler.
>
> Because we need to add socket IDs to On Air messages anyway (see #4218), I suggest doing that first and using them for the connection dictionary.

As far as I can see, you share a single Client instance and all its elements among many users and tabs, so all users share the same client ID. However, I think the socket ID won't resolve the issue, for the following reason:

I would assume that a new socket ID is assigned to every new socket connection, and I guess this also applies to a connection that is the result of a reconnect.

For non-shared clients

If my assumption is true, the counter in _num_connections will never exceed 1 for any socket ID, and the client resources are deleted too early on the first socket disconnect, even if the _num_connections dict still contains another socket ID. Therefore, the socket ID has no additional benefit, and it should be fine to keep _num_connections as a simple integer counter.

For the shared auto-index client

If all users share the same client ID, information is lost, since we can no longer use the client ID to identify an individual user tab. Therefore we don't know when all connections for an individual user tab are actually closed, and as a result it is hard to determine when the disconnect handlers should be called.

I don't know if the tab ID is something you could use instead for the shared auto-index page. Otherwise, it might be a good idea to add a unique ID that makes it possible to identify a unique user tab across reconnects, regardless of whether it is a shared client or not.

If this involves too many changes, it might be sufficient for now to just ignore the _num_connections counter for a shared client inside the background task:

# when self.shared is True, the asyncio.sleep(...) is not required either
if self.shared or self._num_connections == 0:
    ...

Instead of:

if self._num_connections[socket_id] == 0:
    ...

Then, for a shared client, the disconnect handlers would be called after each disconnect event. However, it should be noted that an application built on top of the auto-index feature cannot allocate/deallocate resources for a single user tab, which could be an intended limitation. Therefore, in my opinion it is consistent to say that the behavior of how disconnect handlers are called is different for a shared client.

Labels: bug (Something isn't working)
Linked issue that may be closed by this pull request: White screen after socket reconnect
Participants: 2