
nvenc encoder improvements - support a common backing encoding session across windows #3286

Closed

Conversation

winstona (Contributor)

The current implementation of the nvenc encoder instantiates a new encoding session for each new window, or whenever a window switches to the nvenc video encoder. This incurs a (known) setup cost each time the encoder is used, and the encoder is prioritized below other encoders as a result. Due to the frequent setup and teardown, the encoder often gets into a state that cannot be recovered without killing and restarting xpra, likely for 2 reasons:

  • NVIDIA allows only 2 simultaneous encoders - if 3+ windows are attempting to use the nvenc encoder on a single consumer device, only 2 will ever succeed
  • resources are not freed consistently - the CUDA/nvenc libraries are very sensitive to this and can hold on to an unusable encoder for as long as xpra runs, often completely locking up nvenc once both available encoders are orphaned

This PR solves the 2 issues above, and in the process adds a few additional benefits:

  • nvenc/hardware accelerated video encoding can easily be done on more than 2 windows simultaneously since only one encoding session is used
  • Because a singleton/global encoder is used, setup time is near zero, allowing the encoder to be used more often in cases where other encoders would previously have been the better choice
  • With the flexibility this gives us, we can hint that the window in focus should always prefer the nvenc encoder for all of its encoding, keeping latency to a minimum

These changes do incur some minor drawbacks, however:

  • More frequent IDR keyframes are sent. An encoding session retains state between frames, except across keyframes (IDR frames). To avoid image corruption when the single encoder session is processing images for different windows interleaved, it needs to send IDR frames more frequently to "reset" the encoding session. On the plus side, an IDR frame is much faster to compress than subsequent frames, but it is also much larger, so overall bandwidth usage is higher. This is minimized as much as possible, and if only a single window is using the video encoder it behaves as before (a single IDR frame at frame 0, at the beginning of the encode stream).
  • One large window will slow encoding for smaller windows. An encoding session has fixed dimensions. When a frame larger than the current session's dimensions is encountered, the session is resized upwards. This costs encoding speed for windows that do not need the larger dimensions, since their frames are still compressed with the larger input/output buffers. In this implementation the encoder is never resized back down to an optimal size once the larger window no longer needs encoding (until xpra is restarted and the encoder is reinitialized). This could easily be fixed in the future by resizing the encoder back down once it no longer needs to render frames that large.
  • Realtime bandwidth/speed tuning is likely ignored after initialization. I haven't investigated this much yet, but adjusting for changing link speed/quality is likely ignored for now; this should be fairly easy to fix.

High Level Implementation:

To minimize code changes, the Encoder class is kept mostly intact: it can act as a standalone Encoder reinitialized for each window (the previous behavior), or as a common singleton instance, abstracted through the UniqueEncoder class. Some Python trickery transparently forwards method calls and attribute access from UniqueEncoder to its backing singleton instance, so no complex rework of how a video encoder is initialized/called is required. This also allows any method or attribute to be selectively overridden on UniqueEncoder to change behavior and/or store data local to each UniqueEncoder.
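A minimal sketch of that delegation idea, under assumed names (the real UniqueEncoder in this PR may differ; Encoder stands for the existing nvenc Encoder class in this module):

```python
import itertools

_unique_ids = itertools.count(1)     # illustrative id source (see review comments below)
_singleton_encoder = None            # the shared backing Encoder instance


class UniqueEncoder:
    # attributes that live on the wrapper itself rather than on the singleton:
    _local_attrs = ("unique_id", )

    def __init__(self, *args, **kwargs):
        global _singleton_encoder
        if _singleton_encoder is None:
            _singleton_encoder = Encoder()                      # real init happens once
            _singleton_encoder.init_context(*args, **kwargs)    # assumed pass-through
        object.__setattr__(self, "unique_id", next(_unique_ids))

    def __getattr__(self, name):
        # only called when normal lookup fails on the wrapper,
        # so method calls and attribute reads fall through to the singleton
        return getattr(_singleton_encoder, name)

    def __setattr__(self, name, value):
        if name in self._local_attrs:
            object.__setattr__(self, name, value)
        else:
            setattr(_singleton_encoder, name, value)
```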

Each UniqueEncoder has a unique_id - this id is used by the singleton Encoder's compress_image to determine if it needs to reset the backing encoder session back to frame 0 and send an IDR frame.

compress_image checks whether the encoder is large enough to render the frame and, if not, triggers a resize of the encoding session (without fully reinitializing it) before compressing the image.

When clean() is called on a UniqueEncoder it is ignored: we don't want to clean up the singleton Encoder, since it is likely to be reused by a future UniqueEncoder.
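Putting these pieces together, the per-frame flow looks roughly like the following simplified sketch (attribute and helper names other than compress_image and clean are illustrative, not the PR's exact code):

```python
from threading import Lock

encode_lock = Lock()                  # module-level lock, see the Notes below


class Encoder:                        # the shared/singleton nvenc encoder
    last_unique_id = None

    def compress_image(self, image, options, unique_id):
        with encode_lock:
            if unique_id != self.last_unique_id:
                # a different window used the session last: restart at frame 0
                # so the next output is an IDR keyframe with no stale references
                self.last_unique_id = unique_id
                self.frames = 0
            if image.get_width() > self.width or image.get_height() > self.height:
                # grow the session in place, without a full reinitialization
                self.resize(image.get_width(), image.get_height())
            return self.do_compress(image, options)   # illustrative helper


class UniqueEncoder:
    def clean(self):
        # deliberately a no-op: the backing singleton stays alive so that
        # a future UniqueEncoder can reuse it without paying setup cost again
        pass
```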

Notes:

  • Some errors indicate a likely lockup of the encoding session - when detected, these trigger a full reinitialization of the nvenc library or of the singleton encoder instance, usually recovering successfully within a few seconds.
  • A global encode_lock is used to ensure parallel compress_image calls don't step on each other, since they share the same backing buffers for I/O with the encoder session.
  • New environment variable config flags are added (see the sketch after this list):
    • XPRA_WINDOW_FOCUS_VIDEO_ENCODER - if True, try to use the video encoder for all frames of the window in focus, regardless of other hints that would otherwise exclude it from the video encoder (e.g. a text content-type)
    • XPRA_NVENC_USE_SINGLETON_ENCODER - if True, use this PR's single backing encoder session shared by all windows; otherwise use the previous one-encoder-session-per-window implementation
  • This PR disables threaded init by default: I had found (before these changes) that initialization was much more reliable when not done in a separate thread, and this PR has only been tested with single-threaded init. In addition, threaded init is likely no longer needed for a singleton instance, since init happens very infrequently.
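For reference, a sketch of how the two new flags could be read at module import time (the defaults shown are assumptions, and xpra's own env-var parsing helpers would normally be used instead of the inline function here):

```python
import os

def env_bool(name, default=False):
    # tolerant boolean parsing of an environment variable
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")

# assumed defaults: both features off unless explicitly enabled
WINDOW_FOCUS_VIDEO_ENCODER = env_bool("XPRA_WINDOW_FOCUS_VIDEO_ENCODER", False)
USE_SINGLETON_ENCODER = env_bool("XPRA_NVENC_USE_SINGLETON_ENCODER", False)
```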

@totaam (Collaborator)

totaam commented Sep 27, 2021

NVIDIA allows only 2 simultaneous encoders

Other ways around this problem are:

* using a quadro card
* https://github.com/keylase/nvidia-patch

These changes do incur some minor drawbacks however

That's going to penalize the quadro users.

I'll try to find the time to review this PR and see if we can at least merge some of it straight away.
i.e. things like can_ignore_dim_changes can be generalized and renamed, but the removal of strget looks like a mistake.
The debug logging and extra asserts look good, the hard-coded 4K limit does not.
And things like if r == 10: will need a docstring.
The has_focus flag should be updated on every entry, not on every batch delay calculation. It makes no difference now, but it might in the future.
etc.

totaam added a commit that referenced this pull request Sep 29, 2021
totaam added a commit that referenced this pull request Sep 29, 2021
@totaam (Collaborator) left a comment:

I have merged a bunch of mostly cosmetic changes.
Please rebase and address or answer the review comments.

@@ -13,6 +13,8 @@ import ctypes
from ctypes import cdll, POINTER
from threading import Lock
from pycuda import driver
from pycuda._driver import LogicError
@totaam (Collaborator):

Should be from pycuda.driver import LogicError or just refer to it as driver.LogicError.


def __init__(self, *args, **kwargs):
    log("init uniqueencoder start")
    self.unique_id = random.randint(0,999999999)
@totaam (Collaborator):

This is not a safe way to generate a unique id. Use AtomicInteger instead.
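For illustration, a thread-safe counter along the lines suggested (a generic sketch; xpra's own AtomicInteger helper would be the natural choice here):

```python
from threading import Lock

class AtomicCounter:
    # minimal thread-safe counter, sketched to illustrate the suggestion above
    def __init__(self, start=0):
        self._value = start
        self._lock = Lock()

    def increase(self):
        with self._lock:
            self._value += 1
            return self._value

_unique_id_counter = AtomicCounter()

# in UniqueEncoder.__init__, instead of random.randint:
#   self.unique_id = _unique_id_counter.increase()
```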

setup_cost = 80
if USE_SINGLETON_ENCODER:
    codec_class = UniqueEncoder
    setup_cost = 0
@totaam (Collaborator):

Even with these changes, I doubt that the setup cost is as low as a pure software encoder.

@winstona (Contributor, Author):

This was chosen merely to give strong preference to nvenc over the x264 software encoder - I think a small value like 5 or 10 might still keep it preferred over software encoding. Of course the actual setup cost is > 0, but the setup cost of an already-initialized singleton encoder would be 0 when initializing the UniqueEncoder.

@@ -1668,7 +1813,7 @@ cdef class Encoder:
profile = os.environ.get("XPRA_NVENC_PROFILE", "")
profile = os.environ.get("XPRA_NVENC_%s_PROFILE" % csc_mode, profile)
#now see if the client has requested a different value:
- profile = options.strget("h264.%s.profile" % csc_mode, profile)
+ profile = options.get("h264.%s.profile" % csc_mode, profile)
@totaam (Collaborator):

strget is needed because we get options from the client, through a packet encoder which may mangle bytes and strings. (#3229)
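For illustration, a strget-style lookup tolerates option values that arrive as bytes, which a plain dict get() passes through unchanged (a sketch of the idea, not xpra's actual typedict implementation):

```python
def strget(options, key, default=""):
    # options coming from the client may have been decoded as bytes by the
    # packet encoder (see #3229), e.g. b"high" instead of "high"
    value = options.get(key, default)
    if isinstance(value, bytes):
        value = value.decode("latin1")
    return str(value)
```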

@winstona (Contributor, Author):

I'm not sure exactly what is going on here, but months ago using strget caused the nvenc encoder to throw an exception on initialization where I was running it, so I had to add this manual hack (changing it to get) just to get it working at all without crashing and disabling nvenc entirely. Currently I'm using the nvidia/cudagl:11.0-base-ubuntu20.04 docker container, but I believe this was also the case on a vanilla Ubuntu 20.04.

log("called compress_image with common encoder: %s, context: %#x, frames: %i", self, <uintptr_t> self.context, self.frames)
log("compress_image%s", (quality, speed, options, retry))

#FIXME: do not lock encode_lock on USE_SINGLETON_ENCODER disabled
@totaam (Collaborator):

Actually, I think we do want a lock in all cases.
With multiple concurrent users each having their own encode thread, this could easily break nvenc.
(but the lock would need to protect more than just compress_image in that case)
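A sketch of the broader locking this alludes to, using a single module-level lock around every call that touches an nvenc session (helper names other than compress_image and clean are illustrative):

```python
from threading import Lock

encode_lock = Lock()

class Encoder:
    def compress_image(self, image, options):
        with encode_lock:
            return self._do_compress(image, options)    # illustrative helper

    def clean(self):
        # teardown must also hold the lock, otherwise another window's encode
        # thread could be mid-frame while the session is being destroyed
        with encode_lock:
            self._free_buffers()                        # illustrative helpers
            self._destroy_session()
```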

@winstona (Contributor, Author):

I was assuming that multiple threads, each with their own cuda context and encoder, would not need a global lock (it is created as a global/module-level variable) and could operate in parallel without stepping on each other - the only reason this is here is to ensure that parallel threads don't step on each other while writing/reading from the same buffers/memory locations (in the singleton case).
Do you suspect that we'd need a global lock in the case where a singleton isn't used?

@@ -723,6 +727,8 @@ def matches_video_subregion(self, width, height):
        return vr

    def subregion_is_video(self):
+       if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus:
+           return True
@totaam (Collaborator):

This feels like the wrong place for it.
What is this meant to do?

@winstona (Contributor, Author):

probably not ideal, yes - here again the intent was to short-circuit any video region detection for a window and always use nvenc for the active/focused window

@@ -778,7 +784,7 @@ def send_nonvideo(regions=regions, encoding=coding, exclude_region=None, get_bes
assert not self.full_frames_only

actual_vr = None
- if vr in regions:
+ if vr in regions or (WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus):
@totaam (Collaborator):

As above, this looks wrong.

@winstona (Contributor, Author):

also attempting to short-circuit the logic to use nvenc where possible on the active window

log("update_encoding_options check setregion override")
if WINDOW_FOCUS_VIDEO_ENCODER and self.has_focus:
vs.set_detection(False)
vs.set_region(0, 0, ww, wh)
@totaam (Collaborator):

So this also bypasses video region detection.
Why was this needed?
Now the whole window is using nvenc, no matter what?

@winstona (Contributor, Author):

Yes - the intent was to let the nvenc encoding/compression optimize sending the whole window. I noticed that in a lot of cases, since partial window refreshes were being sent, the window refreshed in blocks rather than all at once - since we are already going to try to use nvenc to encode the frame (because the window is in focus), we might as well let it handle the optimization and get consistent repaints on the client side.
If there's a better way to handle this, let me know and I can update.

else:
    scorelog("check_pipeline_score(%s) change of video input dimensions from %ix%i to %ix%i",
             force_reload, ve.get_width(), ve.get_height(), enc_width, enc_height)
    clean = True
@totaam (Collaborator):

can_ignore_dim_changes should be added to the interface of all the encoders after being renamed to something more generic, i.e. can_compress_size(w, h).
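A rough illustration of the suggested generalization (the method name comes from the comment above; the class names and attributes are assumed):

```python
class VideoEncoder:
    # default behaviour: only the exact dimensions the encoder was set up with
    def can_compress_size(self, w, h):
        return w == self.get_width() and h == self.get_height()


class SingletonNvencEncoder(VideoEncoder):
    # the resizable singleton session can accept anything it can grow to
    def can_compress_size(self, w, h):
        return w <= self.max_width and h <= self.max_height
```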

else:
    videolog("do_check_pipeline video: window dimensions have changed from %sx%s to %sx%s",
             ve.get_width(), ve.get_height(), encoder_src_width, encoder_src_height)
    return False
return True
@totaam (Collaborator):

As above.

@winstona (Contributor, Author)

Other ways around this problem are:
* using a quadro card
* https://github.com/keylase/nvidia-patch

I missed mentioning that the 2-encoder limit applies specifically to consumer cards - the goal here was a solution that works on consumer cards (within their limitations). I didn't come across keylase/nvidia-patch in my investigation, but that certainly opens up a lot of extra possibilities on its own, making this PR less beneficial. I likely would have structured this a bit differently from the beginning had I been aware of it.

My goal was to eventually add a pool of encoders with preference/stickiness for the last encoder a window used, to keep IDR frames to a minimum where possible, with an option to specify the number of encoders to use. This would possibly offer the best of all worlds: it wouldn't limit quadro cards, could use a higher value on patched consumer drivers, and could work within the 2-encoder limit on unpatched consumer cards. I'd imagine running one encoder per window, even in the quadro/patched case, takes up extra VRAM that can't be used for other work or other windows.
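A rough sketch of that pool-with-stickiness idea (all names are hypothetical; create_encoder stands in for the real nvenc session setup):

```python
from threading import Lock

class EncoderPool:
    def __init__(self, create_encoder, max_encoders=2):     # 2 = unpatched consumer limit
        self.create_encoder = create_encoder                 # callable building a real session
        self.max_encoders = max_encoders
        self.encoders = []                                   # live nvenc sessions
        self.sticky = {}                                     # window_id -> encoder
        self.lock = Lock()

    def acquire(self, window_id):
        with self.lock:
            # prefer the encoder this window used last, to keep IDR frames rare
            enc = self.sticky.get(window_id)
            if enc is None:
                if len(self.encoders) < self.max_encoders:
                    enc = self.create_encoder()
                    self.encoders.append(enc)
                else:
                    # simple round-robin fallback; a real pool might pick the
                    # least recently shared session instead
                    enc = self.encoders[len(self.sticky) % len(self.encoders)]
                self.sticky[window_id] = enc
            return enc
```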

I've been trying to use xpra with things like a browser, where it's not just a straight video encode that initializes once, but frequent stops and starts - software encoding/compression in that case just isn't responsive enough to keep up, and the switch between software and nvenc encoding seems to cause extra delays and frame jitter (where it jumps back to an old frame for a second). In addition, having at least one encoder ready to go, without taking the initialization time hit, so far seems to help keep overall perceived latency down.

Thanks for reviewing this PR and adding feedback - I'll run through those soon and make adjustments/comments where appropriate.

@winstona winstona force-pushed the nvenc_encoder_singleton_improvements branch from c7dffab to b76106e Compare October 9, 2021 01:51
@winstona winstona marked this pull request as draft October 9, 2021 01:52
@winstona (Contributor, Author)

winstona commented Oct 9, 2021

moving to draft as I have pending changes and identified a memory leak resulting from leftover nvenc resources

@totaam (Collaborator)

totaam commented Nov 30, 2021

Can you update the PR?

@winstona winstona force-pushed the nvenc_encoder_singleton_improvements branch from 99b6bf1 to 018e830 Compare December 18, 2021 21:16
@winstona (Contributor, Author)

I've been working on this off and on, trying to validate and fully address any potential long-term memory leaks (some of these were a bit hidden and hard to track down in the nvenc library).
I've rebased off master, and I'll hopefully have something mergeable soon - once I pull that together (I have a few other changes not committed yet), I'm thinking of getting this merged as disabled by default (enabled by env variable), and then following up with a separate PR adding an encoder pool of n encoders instead of just one, so this doesn't limit quadro cards.

@totaam (Collaborator)

totaam commented Oct 3, 2022

Any update?

@winstona (Contributor, Author)

winstona commented Oct 3, 2022

Sorry about the delay in updating this one. I've had to step away from this for a bit, but I've been using it daily and working out the last bits as I notice them. I'll need to rebase off master again, but will post updates as mentioned soon - I didn't realize there was also some interest in using this approach (at least until nvenc pooling is implemented).

@winstona winstona force-pushed the nvenc_encoder_singleton_improvements branch from 018e830 to 31e9f05 Compare October 5, 2022 08:44
@winstona (Contributor, Author)

winstona commented Oct 5, 2022

Rebased off the latest master - it appears to be functional at this point; however, I have a few review comments to fix/address before considering this ready to merge.

@totaam (Collaborator)

totaam commented Oct 5, 2022

I would like to merge the resizing part first, and the encoder pool separately - ideally in a separate module.

@totaam (Collaborator)

totaam commented Jul 20, 2023

Not heard back 😞

@totaam totaam closed this Jul 20, 2023