Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Websocket connection gets lost due to unknown reasons #317

Open
trawn-88 opened this issue Jul 31, 2024 · 17 comments
Open

Websocket connection gets lost due to unknown reasons #317

trawn-88 opened this issue Jul 31, 2024 · 17 comments

Comments

@trawn-88
Copy link

Our customers are encountering problems with a connection loss. This can happen at the start of Xpra. The Server started, the web-connection was established, but then the connection get lost and you receive a 502 response. I sent you the client and server logs via mail.

On Prod, we’re using:

ubuntu:20.04
XPra:4.4.6
XPra HTML5 Client:8.1
Wine: 8.0.2

On Dev, we’re using:

ubuntu:22.04
XPra:6.1
XPra HTML5 Client:14
Wine: 8.0.2

@totaam
Copy link
Collaborator

totaam commented Jul 31, 2024

@trawn-88 I don't see any emails.
Are you using a proxy?

@trawn-88
Copy link
Author

trawn-88 commented Jul 31, 2024

I have send it on Monday to totaam@xpra.org
Do you want me to send it again?
We are using Ingress on an nginx http-proxy to route into our kubernetes cluster

@totaam
Copy link
Collaborator

totaam commented Jul 31, 2024

@trawn-88 no emails, please send it to my gmail. (same username at gmail)

@totaam
Copy link
Collaborator

totaam commented Jul 31, 2024

From the javascript client log:

07:05:04.175 client closed Utilities.js:832:28
07:05:04.176 closing:  server shutdown Utilities.js:832:28

The server log contains nothing from xpra between these key events:

created tcp socket '0.0.0.0:8085'
no uinput module (not usually needed)
killing xvfb with pid 86

But, I find it quite odd how some of these server log messages are being duplicated (ie: 31.3GB of system memory), with slightly different timestamps too.
Is it possible that you have two servers starting concurrently?
This could easily explain all sorts of problems.


The second log shows:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
connection-lost

This sounds a lot like unexpected EOF - also with a proxy..
So another possibility is that the proxy is deciding to drop the connection.


To get more information out of the server, you could run it with -d websocket or even -d network.
But this is going to be very verbose, and I'm not convinced that this will show anything relevant.
My guess is that the connection is being dropped by the proxy.

@trawn-88
Copy link
Author

trawn-88 commented Aug 7, 2024

Just in case, i send you more logs. Can you check them as well? Just to be sure

@totaam
Copy link
Collaborator

totaam commented Aug 8, 2024

I am seeing some duplicated messages in the log samples you have sent:

$ egrep -h "killing|X11 display"  ExpertMode\ 502\ Server\ Log\ 2.txt 
2024-07-30 08:20:06.557	2024-07-30 08:20:06,557  connected to X11 display :0 with 24 bit colors
2024-07-30 08:19:51.862	2024-07-30 08:19:51,862 killing xvfb with pid 151
2024-07-30 08:16:21.777	2024-07-30 08:16:21,777  connected to X11 display :0 with 24 bit colors
2024-07-30 08:15:03.386	2024-07-30 08:15:03,385 killing xvfb with pid 86
2024-07-30 08:14:48.671	2024-07-30 08:14:48,671 killing xvfb with pid 88

Is this a single session in a container? Does the xpra server get re-spawned somehow?
(could that go wrong?)

Do you allow concurrent connections? (aka sharing)
Or did you connect twice to the same session (client 2):

$ grep -h -A 1 "disconnected" *Server\ Log*
2024-07-30 08:19:51.841	2024-07-30 08:19:51,841 HTML5 client 1 disconnected.
2024-07-30 08:19:51.828	2024-07-30 08:19:51,828  server shutdown
--
2024-07-30 07:54:18.664	2024-07-30 07:54:18,664 HTML5 client 2 disconnected.
2024-07-30 07:54:18.662	2024-07-30 07:54:18,662 1K packets sent (7.3M bytes)
--
2024-07-29 07:05:04.233	2024-07-29 07:05:04,232 HTML5 client 1 disconnected.
2024-07-29 07:05:04.190	2024-07-29 07:05:04,186  server shutdown

In 2 of the events above, the html5 client is booted out because the server is shutting down.
This shutdown is initiated by a SIGTERM signal:

$ grep -h -A 2 "Disconnect" *Server\ Log*
2024-07-30 08:19:51.828	2024-07-30 08:19:51,828 Disconnecting client 127.0.0.1:48084:
2024-07-30 08:19:51.826	2024-07-30 08:19:51,825 X11 seamless server is terminating
2024-07-30 08:19:51.826	2024-07-30 08:19:51,825 X11 seamless server got signal SIGTERM
--
2024-07-29 07:05:04.190	2024-07-29 07:05:04,186 Disconnecting client 127.0.0.1:39118:
2024-07-29 07:05:04.152	2024-07-29 07:05:04,142 X11 seamless server is terminating
2024-07-29 07:05:04.152	2024-07-29 07:05:04,142 X11 seamless server got signal SIGTERM

Do you send the SIGTERM? How?

BTW, I see a lot of:

_XSERVTransmkdir: ERROR: euid != 0,directory /tmp/.X11-unix will not be created.

You can get rid of these warnings by creating the /tmp/.X11-unix directory before launching xpra.

The client logs don't have any useful messages in them:

$ grep -hr "closing:" ./Expert*Client*
08:19:51.897 closing:  server shutdown Utilities.js:832:28
08:19:51.900 closing:  server shutdown Utilities.js:832:28
08:19:51.938 closing:  server shutdown Utilities.js:832:28
08:20:59.027 closing:  server shutdown Utilities.js:832:28
08:20:59.030 closing:  server shutdown Utilities.js:832:28
08:20:59.069 closing:  server shutdown Utilities.js:832:28
07:05:04.176 closing:  server shutdown Utilities.js:832:28
07:05:04.180 closing:  server shutdown Utilities.js:832:28

But this is different from the previous log sample, which was showing websocket closed: 'Abnormal Closure' (1006).
Here, the client is being told to disconnect when the server shuts down.

@totaam
Copy link
Collaborator

totaam commented Aug 21, 2024

Using tcpkill to try to reproduce problems with a local server running on port 10000:

sudo tcpkill -i lo  port 10000

The broken connection can show up in random places, depending on when it triggers.
Here's one for http asset download:

Traceback (most recent call last):
  File "/usr/lib64/python3.10/site-packages/xpra/net/http/handler.py", line 406, in send_head
    self.end_headers()
  File "/usr/lib64/python3.10/site-packages/xpra/net/http/handler.py", line 261, in end_headers
    super().end_headers()
  File "/usr/lib64/python3.10/http/server.py", line 535, in end_headers
    self.flush_headers()
  File "/usr/lib64/python3.10/http/server.py", line 539, in flush_headers
    self.wfile.write(b"".join(self._headers_buffer))
  File "/usr/lib64/python3.10/socketserver.py", line 826, in write
    self._sock.sendall(b)
ConnectionResetError: [Errno 104] Connection reset by peer

The server will no longer report those as scary errors: Xpra-org/xpra@bad691a

As for the client, this shows up as:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
Utilities.js:832 connection-lost

Exactly like the one in this bug report.

The reconnect should work, unfortunately tcpkill also stops the client from loading the Javascript.
I need to find a better tool.

@totaam
Copy link
Collaborator

totaam commented Aug 22, 2024

Similar behaviour using an ssh tunnel that can be killed more easily with an exit (or control-d) + control-c:

xpra start --bind-tcp=0.0.0.0:10000 --start=xterm
for i in `seq 1 5`; do
    ssh -v localhost -L 10001:127.0.0.1:10000
done

Open a browser pointing to 127.0.0.1:10001.
When killing the tunnel, the Javascript console shows:

websocket closed:  'Abnormal Closure' (1006) reason:  null , reconnect:  true , reconnect attempt:  0
Utilities.js:832 connection-lost
Utilities.js:832 audio: stopping stream
Utilities.js:832 close_audio_mediasource: audio_source_buffer=null, media_source=null, audio=null
Protocol.js:479 WebSocket is already in CLOSING or CLOSED state.
process_send_queue @ Protocol.js:479
(anonymous) @ Protocol.js:504
setTimeout
send @ Protocol.js:504
(anonymous) @ Protocol.js:630Understand this error
Utilities.js:832 cancel_all_files( closing ) will cancel: []
Client.js:337 [Violation] Added non-passive event listener to a scroll-blocking 'wheel' event. Consider marking event handler as 'passive' to make the page more responsive. See https://www.chromestatus.com/feature/5745543795965952
init_state @ Client.js:337
(anonymous) @ Client.js:2358
Utilities.js:832 connection-lost
Utilities.js:832 connection_progress( Connecting to server ,  localhost:10001/ ,  40 )
OffscreenDecodeWorkerHelper.js:53

The script has to re-start the tunnel quickly otherwise the re-connection hangs.
Maybe we should handle this better - as the client doesn't seem to timeout in this case. Perhaps wait a little bit first too.

That said, the original query in this ticket is about identifying the cause of the browser's websocket connection Abnormal Closure (1006) and it seems that the only way that I can reproduce this is by forcibly closing the connection in the middle - be it a proxy, netcat, ssh tunnel, etc

Information not included in this ticket that is relevant:

  • this happens when switching tabs - so perhaps the browser is putting the tab to sleep to save resources, TCP keep-alives are not being sent and the proxy drops the connection? if that's the case, then the best we can do is to re-connect?
  • a custom script sends the SIGTERM to shutdown the xpra server - maybe this should use xpra stop instead

@totaam
Copy link
Collaborator

totaam commented Aug 22, 2024

Worth adding a link to the error handler for the 1006 websocket error code:

xpra-html5/html5/js/Client.js

Lines 2313 to 2320 in 4fc0fea

const code = Number.parseInt(packet[2]);
let reconnect = this.reconnect || this.reconnect_attempt < this.reconnect_count;
if (reconnect && [0, 1006, 1008, 1010, 1014, 1015].includes(code)) {
// don't re-connect unless we had actually managed to connect
// (because these specific websocket error codes are likely permanent)
reconnect = this.connected;
}
this.cerror("websocket error: ", packet[1], "code: ", code, "reason: ", this.disconnect_reason,

totaam added a commit that referenced this issue Aug 22, 2024
use a much smaller delay so we can more reliably catch the server socket as soon as it is open,
only schedule the 'hello' timer check once we've opened the websocket connection
@totaam
Copy link
Collaborator

totaam commented Aug 22, 2024

With the commit above, the html5 client handles unavailable server endpoints much more gracefully: it will retry to connect more quickly and does timeout if the server really has gone away.

@trawn-88
Copy link
Author

Thanks very much! We'll try it. But, as it seems, there are three Proxy-Steps between the Client and the Server. Following your original statement, that a Proxy should be responsible for the connection closure, we are also investigating this possibility

@totaam
Copy link
Collaborator

totaam commented Aug 22, 2024

This looks relevant: https://stackoverflow.com/questions/49408031/websockets-in-chrome-and-firefox-disconnecting-after-one-minute-of-inactivity
We have an application level ping in xpra, but perhaps chrome doesn't implement the protocol level ping - or, it may stop sending it when tabs are inactive?

@totaam
Copy link
Collaborator

totaam commented Aug 22, 2024

I have also found a bug in the html5 client which prevented application layer ping packets from being sent to the server - all except for the very first one. (fix in 4f148c3)
Although this sounds like something that could cause some middleware to drop the connection due to inactivity - in practice this should make no difference as the server was still sending its own pings and the client was still responding to them.
And as far as middleware is concerned, ping-echo responses are indistinguishable from the pings themselves. (only the application layer knows the difference)
The only effect of this html5 client bug would be on the server, which would not have access to the "latency measured from the client" values - this is only used by the server, it is only one of many data samples used to drive the heuristics - so not a major problem if it goes missing. The fact that we didn't notice for so long goes to show that perhaps it isn't very useful - beyond reporting that latency data via xpra info.
But, still worth a try? Maybe it will help somehow?

@totaam
Copy link
Collaborator

totaam commented Aug 23, 2024

To make it easier for you to test new versions, I have uploaded xpra-html5-v16 pre-release RPMs in the beta area: https://xpra.org/beta/

@totaam
Copy link
Collaborator

totaam commented Aug 28, 2024

@trawn-88 if you are able to run xpra 6.2 beta, you could also enable -d ping to verify that the connection is being kept alive. (debug added in Xpra-org/xpra@ab0a206)

@trawn-88
Copy link
Author

I'll give that a try and send you the logs, ok? But it might take a while

@totaam
Copy link
Collaborator

totaam commented Aug 28, 2024

OK, I'm still leaning towards an nginx proxy issue rather than a protocol layer issue, but this may help narrow it down.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants