Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lossless element re-painting seems to be overlooked when latency gets high #750

Closed
totaam opened this issue Dec 2, 2014 · 40 comments
Closed

Comments

@totaam
Copy link
Collaborator

totaam commented Dec 2, 2014

Issue migrated from trac ticket # 750

component: server | priority: major | resolution: invalid

2014-12-02 20:12:04: afarr created the issue


Running a (slightly customized) 0.14.12 windows client against a (also slightly customized) 0.14.12 fedora 20 client the rendering of the video elements and the lossless elements (text) seems very good until the server starts giving latency/delay warnings.

The video elements continue to display well as these warnings start outputting, but the lossless (text) elements begin to be blurry and increasingly fail to render when scrolling, especially, ceases.

I'm not certain whether this is a cause or maybe just a contributing factor, but here's a bit of server-side output:

Dec 02 10:43:55 node-1000 systemd[1]: Starting Cleanup of Temporary Directories...
Dec 02 10:43:55 node-1000 systemd[1]: Started Cleanup of Temporary Directories.

...

Dec 02 10:47:30 node-1000 dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 NetworkManager[408]: bound to 10.171.171.100 -- renewal in 979 seconds.
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service'
Dec 02 10:47:30 node-1000 systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 02 10:47:30 node-1000 dbus-daemon[348]: dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 dbus[348]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 02 10:47:30 node-1000 systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,034 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:49:55 node-1000 server[653]: 2014-12-02 10:49:55,315 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:18 node-1000 server[653]: 2014-12-02 10:50:18,856 starting sound capture using pulseaudio device: Monitor of Null Output
Dec 02 10:50:19 node-1000 server[653]: 2014-12-02 10:50:19,122 codec: MPEG-1 Layer 3 (MP3)
Dec 02 10:50:53 node-1000 server[653]: 2014-12-02 10:50:53,757 send_delayed for wid 2, elapsed time 640.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,302 send_delayed for wid 2, elapsed time 516.0 is above limit of 500.0 - sending now
Dec 02 10:50:54 node-1000 server[653]: 2014-12-02 10:50:54,914 send_delayed for wid 2, elapsed time 593.0 is above limit of 500.0 - sending now
Dec 02 10:50:55 node-1000 server[653]: 2014-12-02 10:50:55,673 send_delayed for wid 2, elapsed time 744.0 is above limit of 500.0 - sending now
Dec 02 10:51:02 node-1000 server[653]: 2014-12-02 10:51:02,856 send_delayed for wid 2, elapsed time 753.0 is above limit of 500.0 - sending now
Dec 02 10:51:03 node-1000 server[653]: 2014-12-02 10:51:03,488 send_delayed for wid 2, elapsed time 617.0 is above limit of 500.0 - sending now
Dec 02 10:51:04 node-1000 server[653]: 2014-12-02 10:51:04,082 send_delayed for wid 2, elapsed time 576.0 is above limit of 500.0 - sending now
Dec 02 10:51:09 node-1000 server[653]: 2014-12-02 10:51:09,913 send_delayed for wid 2, elapsed time 679.0 is above limit of 500.0 - sending now
Dec 02 10:51:10 node-1000 server[653]: 2014-12-02 10:51:10,624 send_delayed for wid 2, elapsed time 696.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,268 send_delayed for wid 2, elapsed time 626.0 is above limit of 500.0 - sending now
Dec 02 10:51:11 node-1000 server[653]: 2014-12-02 10:51:11,966 send_delayed for wid 2, elapsed time 688.0 is above limit of 500.0 - sending now
Dec 02 10:51:17 node-1000 server[653]: 2014-12-02 10:51:17,636 send_delayed for wid 2, elapsed time 502.0 is above limit of 500.0 - sending now
Dec 02 10:51:21 node-1000 server[653]: 2014-12-02 10:51:21,819 send_delayed for wid 2, elapsed time 538.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,341 send_delayed for wid 2, elapsed time 509.0 is above limit of 500.0 - sending now
Dec 02 10:51:22 node-1000 server[653]: 2014-12-02 10:51:22,948 send_delayed for wid 2, elapsed time 597.0 is above limit of 500.0 - sending now
Dec 02 10:51:23 node-1000 server[653]: 2014-12-02 10:51:23,514 send_delayed for wid 2, elapsed time 553.0 is above limit of 500.0 - sending now
Dec 02 10:51:24 node-1000 server[653]: 2014-12-02 10:51:24,231 send_delayed for wid 2, elapsed time 697.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,001 send_delayed for wid 2, elapsed time 757.0 is above limit of 500.0 - sending now
Dec 02 10:51:25 node-1000 server[653]: 2014-12-02 10:51:25,533 send_delayed for wid 2, elapsed time 524.0 is above limit of 500.0 - sending now

I was running the client with -d paint but saw no output except the usual about the MP3 speaker re-starts.

Are there any other, better targeted, debugging flags that might be of more use (client or server side)?

@totaam
Copy link
Collaborator Author

totaam commented Dec 2, 2014

2014-12-02 21:07:07: ddoole commented


I'm seeing something similar. Here's what I reported to the mailing list:


Lately when using xpra, I've been seeing lots of spinners over top of my windows. (And I mean "lots" in the sense that xpra is pretty much unusable at times.) In the past, xpra was working without these delays.

When this happens, I see messages like this in the client log:

2014-12-02 13:20:53,784 server is not responding, drawing spinners over the windows
2014-12-02 13:21:05,074 server is OK again

In the server log, I see messages like this:

2014-12-02 13:20:23,261 delayed_region_timeout: region is 15001ms old, bad connection?
2014-12-02 13:20:28,963 send_delayed for wid 23, elapsed time 5701.0 is above limit of 500.0 - sending now

Occasionally it is so bad that xpra disconnects. At that point the server log will contain:

2014-12-02 13:24:08,767 Disconnecting client Protocol(SocketConnection(/home/doole/.xpra/reorx-1)): client ping timeout (waited 60 seconds without a response)
2014-12-02 13:24:08,769 xpra client disconnected.

Both my xpra server and client are on the same network (but not the same subnet). If I use "ssh -X" directly to the box, I do not see any sort of delays.

My windows are mostly konsole and gvim.

I'd love to be able to say "this problem started with version X", but I haven't been able to pin it down. Unfortunately, my environment went though multiple changes at the same time (xpra 0.13.x to 0.14.x, Ubuntu 12.04 to 14.04), so I may not be able to get back to a good state for comparison purposes.

I am running without speaker/microphone forwarding but everything else enabled. I have tried various encoding options, but it doesn't seem to make a difference.

My client and server are both running Ubuntu 14.04. I'm currently running xpra 0.14.12 at both the client and server, but I've been seeing this on all the xpra 0.14.x builds that I've tried.

Any suggestions? Anything I can do to help narrow this down further?


I have since tried running with -debug on the client, but it added no new information. (I'm waiting to try -debug on the server until I can afford to recycle my windows.)

@totaam
Copy link
Collaborator Author

totaam commented Dec 3, 2014

2014-12-03 20:47:27: antoine changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Dec 3, 2014

2014-12-03 20:47:27: antoine commented


r8177 allows us to easily tune the batch settings server side (the same env options could already be used client side):

XPRA_BATCH_MAX_DELAY=120 xpra start 

This allows me to more easily get the send_delayed for wid ... - sending now to fire, but I am not seeing any problems.

How can I reproduce this bug?

@totaam
Copy link
Collaborator Author

totaam commented Dec 3, 2014

2014-12-03 21:13:47: ddoole commented


I don't do anything special - a couple konsole windows open, one is frequently doing big compiles (so lots of scrolling text). Usually 2-5 gvim windows up, editing text, paging up and down, etc.

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 01:10:28: antoine commented


r8188 adds the ability to turn off the video regions code, which may help:

XPRA_VIDEO_SUBREGION=0 xpra start

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 01:22:18: antoine commented


Since this is refresh related, please post the refresh debug log:

xpra start -d refresh,regionrefresh

(the latter debug category is not needed with XPRA_VIDEO_SUBREGION=0)

@ddoole: are you connecting via ssh or tcp? Is there a firewall between you and the server, a forticlient perhaps? (ie: #742)
What does the session info bandwidth graph show?

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 05:51:53: totaam changed status from assigned to new

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 05:51:53: totaam changed owner from antoine to ddoole

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 05:51:53: totaam commented


@ddoole:

  • gvim is known to cause unnecessary repaints: Subpar performance when running gvim or emacs #670
  • auto refresh issues: auto-refresh kicks in too early #694
  • btw: -d debug won't do anything, see xpra -d help for the list of debug categories - in this particular case, you may also want encoding or refresh on the server, and maybe paint on the client
  • you're seeing some extraordinary delays there - the kind that you would only see when connecting over a cellular network like 2G or 3G...
  • maybe have a look at the session info network traffic
  • post xpra info collected when things are going wrong

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 06:34:31: totaam changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 06:34:31: totaam changed owner from ddoole to totaam

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 06:34:31: totaam commented


I think I've just figured out what the problem is: your apps are repainting a lot (see #670 for gvim - #756 may help), and so the >500ms delay is needed.
By cutting it short, we're making thing worse, eventually leading to the connection drop.

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 16:53:03: ddoole commented


To answer Antoine's questions:

  • I am connecting using ssh, not tcp.
  • There is a firewall running on my laptop (iptables). Any rules I should be looking for or adding?

For totaam:

  • I'm seeing these delays when I'm on the local network at work. Gigabit copper from my laptop to the server. The machines are on separate subnets though.
  • I'm not at the office today, so it'll be a day or two before I can get information about the local network.

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 18:21:05: ddoole uploaded file ddoole.xpra.info.vdsl (94.5 KiB)

xpra info output during hang (over VDSL)

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 18:21:38: ddoole uploaded file ddoole.xpra.png.vdsl (43.8 KiB)

Performance graph during hang (over VDSL)

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 18:23:19: ddoole commented


I see the same refresh problem working from home as I do in the office. This is over VDSL instead of local network though.

I attached the xpra info and the session info graphs for a hang.

[[Image(ddoole.xpra.vdsl.png)]]

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 18:47:26: antoine uploaded file ddoole.xpra.vdsl.png (42.3 KiB)

renamed file for inlining
ddoole.xpra.vdsl.png

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 19:46:55: antoine changed status from assigned to new

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 19:46:55: antoine changed owner from totaam to ddoole

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 19:46:55: antoine commented


@ddoole:

  • png encoding isn't the default, because it isn't the most efficient - could be making things worse here, though xpra info seems to disagree with me and indicates that you are not encoding bound
  • your latency isn't even that high either
  • don't worry about iptables, it cannot interfere out of the box (it would take some considerable effort) - only some broken layer-7 proprietary firewalls get this wrong
  • could be worth trying tcp connection just to rule out the transport layer
  • trying turning off more things: mdns maybe? clipboard?
  • I have made some minor improvements in r8193 and r8195 (the last one may be worth backporting)
  • try network layer debugging with -d net on the server and client
  • try --enable-pings and see how that influences how quickly the spinners come up

But the more I look at this, the more I think that you may just be having connectivity issues somehow.
At first I thought that there was something wrong in the code I just reviewed and modified, but the fact that you are getting the spinners probably invalidates this hypothesis: pings just aren't coming back, or the network / main loops are getting stuck on it.

Not much more I can do on this without more information, preferably in the form of steps so that I can reproduce the problem..

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 19:59:52: ddoole commented


  • I know png isn't the most efficient, but the artifacting I get around text when I use H.264 or VP8 drives me nuts. I have tried other encodings though, and it did not make any observable difference.
  • I tried disabling the clipboard at the client, and that hasn't helped. I'll try disabling at the server and see if it makes a difference.
  • I'll try the other things you've suggested as well.

@totaam
Copy link
Collaborator Author

totaam commented Dec 4, 2014

2014-12-04 20:58:06: ddoole commented


I did a quick experiment with attaching with -d net --enable-pings. I didn't get any additional diagnostics at the client. I saw the spinners just as frequently, but they seemed to go away quicker.

See comment3 for my repro instructions. Running the ClearCase xcleardiff tool (graphical file compare) is another good way to get the spinner coming up lots.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 02:56:40: afarr commented


Well, I've been working on reproducing this, but the send_delayed ... messages don't seem to be correlating to my lossless element refresh issues.

Also filed #757, about an encoding pipeline BUG... which might be more relevant.

In any case, looks like my supposition should be deemed invalid, hopefully there will be a solution for ddoole's issues though.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 14:38:23: ddoole commented


Something new today. I'm in the office and was about to restart my server to add in some of the new options I've been asked to try. I figured I'd reproduce the spinners first before restarting everything.

I'm still getting the spinners reliably, but this time I'm not getting the send_delayed messages in the server log. (My server is still running with no -d option.)

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 15:38:39: ddoole commented


I've changed my server to start with "--no-clipboard --no-mdns -d net" and the problem seems to have gone away. Things feel as responsive as they used to.

In the client logs, I do see occasional "server not responding" messages and the server log has very few "send_delayed" messages. (No additional information in the server log with "-d net".)

I'm guessing these messages are because of the known problems with gvim. Despite what the client log says, I haven't actually seen a spinner.

I'm going to continue experimenting and hopefully pin this down to one of mdns or clipboard. (I'm really hoping it's mdns - I can live without it. I need my clipboard though.)

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 17:33:05: antoine commented


  • mdns: could be causing problems
  • clipboard: much more likely, covered in the FAQ and bug reporting guidelines - check that nothing else is trying to sync the clipboard at the same time as xpra

You may want to add -d clipboard to the client and server command lines.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 18:15:36: ddoole uploaded file clipboard.log (13.1 KiB)

Server log with -d clipboard

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 18:19:48: ddoole commented


From my quick tests, it looks like --no-mdns does the trick for me.

To the best of my knowledge, I am not running any clipboard managers. (I go out of my way to kill klipper on any new system I set up.)

I did try running with "--no-mdns -d clipboard" in case anything interesting popped up. No new messages at the client. Nothing on the server said "here's a problem" to me, but I have attached the log if you'd like to take a look.

I'm going to run for a day or two with --no-mdns just to make sure it's not one of those quirks where the problem teases you and disappears for a while. Assuming --no-mdns is the key, anything else I can gather to help you pursue this further?

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 18:40:48: antoine commented


Thanks!
I'm not sure there's much more we can do to debug things, it looks like the python avahi code is swamping the program's main loop, probably waiting for IO somewhere it should not.

I have disabled mdns by default for trunk in r8211 and for v0.14.x in 8212, it will be in the next stable update (probably today).

Feel free to close this ticket if / when that works for you.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 18:48:57: ddoole commented


I'm still seeing some delays, even with --no-mdns.

Looking at the server log, there are a number of new do_owner_changed() calls logged. Is this significant?

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 18:50:24: totaam commented


do_owner_changed()

means that something is changing the clipboard contents or ownership.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 19:21:24: ddoole commented


OK. That's fine then - I was putting stuff I the clipboard.

I'm starting to believe that I was too quick saying --no-mdns was enough. I've started getting lots of spinners again :-(

So I'm back to --no-ndns --no-clipboad and I'll see what happens as I use it more.

@totaam
Copy link
Collaborator Author

totaam commented Dec 5, 2014

2014-12-05 21:12:20: ddoole commented


It looks like my initial success with --no-mdns --no-clipboard was just fate playing silly buggers with me. I'm back to having lots of spinners even though both were specified on my xpra start.

I'll start playing with the other options that were suggested and see if anything else makes a difference.

@totaam
Copy link
Collaborator Author

totaam commented Dec 8, 2014

2014-12-08 18:31:03: ddoole commented


Tried connecting with tcp instead of ssh and that hasn't made a difference.

@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 17:56:24: ddoole commented


So, at this point I've tried:

  • different encodings
  • ssh and tcp communication
  • turning off every option that has a --no-... switch
  • built in xvfb as well as xdummy (I normally run with xdummy)

And I am still seeing the delays. Any suggestions on other things to try or diagnostics to capture? (I really like xpra and hate the thought of having to go back to something like vnc.)

@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 19:03:52: antoine commented


The xpra info you posted shows that most window updates are sent out of the network layer very very fast: always under 50ms (look for damage.out_latency).
You must have some very good server hardware or just very very small screen updates.
Which only leaves the network layer (switches, firewalls, OS) and the client.

The only new testing suggestions that I can make at this point are:

  • trying different OS / versions
  • trying older versions of xpra to see when the regression occurred (assuming it is an xpra regression) - client side first (since the server "looks" ok)
  • try from the same subnet with the same OS
    etc..

@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 20:24:45: ddoole changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 20:24:45: ddoole changed resolution from ** to invalid

@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 20:24:45: ddoole commented


I'm marking this as "invalid" (based on afarr also saying this is invalid in comment #13).

I just reproduced non-responsive windows using ssh -X and konsole with no xpra involved. So that definitely gets xpra off the hook. (I'd tried this before and hadn't seen any problems. A few minutes ago, I realized that my original non-xpra test wasn't 100% comparable to my xpra test. When I tested properly, I hit the hangs.)

So now I need to go find where to open a konsole bug... (In the mean time, it looks like xrvt will be my new friend.)

Thanks so much for your help in trying to narrow this down. Sorry for wasting your time.

@totaam totaam closed this as completed Dec 9, 2014
@totaam
Copy link
Collaborator Author

totaam commented Dec 9, 2014

2014-12-09 23:24:11: antoine commented


I've reverted the mdns changes in r8224.

@totaam totaam added the v0.14.x label Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant