-
Notifications
You must be signed in to change notification settings - Fork 839
Optimizing PrepareLightsForGPU (ProcessVisibleLights) with burst #5294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi! This comment will help you figure out which jobs to run before merging your PR. The suggestions are dynamic based on what files you have changed. HDRP SRP Core Depending on the scope of your PR, you may need to run more jobs than what has been suggested. Please speak to your lead or a Graphics SDET (#devs-graphics-automation) if you are unsure. |
6731df0
to
376fe9d
Compare
Hi, Is this still planned? PrepereLightsForGPU takes +1ms in our project. |
Still planned. Had to work on other things this week. Which platform are you seeing this and how many lights? (How many have shadows etc?) |
Thank you for your info @LaserYGD |
1074dd9
to
aa22067
Compare
hey @LaserYGD , I have finished the first round of burstification. Moved the entire 'Preprocess lights' logic to burst. |
@kecho burst = this pull bigger test (200 point lights~):With burst:PrepereLightsForGPU(total): 0.70ms ~ 0.90ms without burst, latest beta:PrepereLightsForGPU(total): 1.00ms ~ 1.10ms Smaller test (67 point lights, not all in view):Burst:PrepereLightsForGPU(total): 0.42ms~ Without burst, latest beta:PrepereLightsForGPU(total): 0.38ms~ It seems that PrepareGPULightdata is a bit slower in this pull. I'm a bit surprised that I'm not getting as big of an improvement as your testing, did you do your tests in the editor? edit: sent an email to you with the scene download |
Seriously thank you so much! I think the overhead cost is making it a bit more expensive for low count of lights. It also depends on the CPU. I am seeing big perf gains on console. There are some paths that are incomplete, especially PrepareGPULightdata, I had to add some tmp code to keep it working, this is likely making it slower if I had to guess. I'll try to bundle more optimizations on this same PR and then we should see an overall gain. |
6ccbc92
to
37340ea
Compare
com.unity.render-pipelines.high-definition/Runtime/Lighting/Light/HDLightRenderDatabase.cs
Show resolved
Hide resolved
com.unity.render-pipelines.high-definition/Runtime/Lighting/Light/HDGpuLightsBuilder.Jobs.cs
Outdated
Show resolved
Hide resolved
...nity.render-pipelines.high-definition/Runtime/Lighting/Light/HDGpuLightsBuilder.LightLoop.cs
Show resolved
Hide resolved
...nity.render-pipelines.high-definition/Runtime/Lighting/Light/HDGpuLightsBuilder.LightLoop.cs
Outdated
Show resolved
Hide resolved
com.unity.render-pipelines.high-definition/Runtime/RenderPipeline/HDRenderPipeline.LightLoop.cs
Outdated
Show resolved
Hide resolved
28e7026
to
c13e5ec
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with the various discussed points as future steps (non-blocking for this PR)
Great work! Really wish someone would slap the backport-to-2021.2 label on this one 🤞 |
@LaserYGD check out the Profile Analyzer to get all those stats averaged for you over a range of frames. It also does comparisons of captures. |
eb4af19
to
dbdebff
Compare
...unity.render-pipelines.high-definition/Runtime/Lighting/Light/HDAdditionalLightData.Types.cs
Show resolved
Hide resolved
@kecho update status from QA for this one? |
@sebastienlagarde yes. They found 1 issue, but im on it. Should be fixed by EOD today. Issue happens when there is more than 3k lights. The previous implementation could handle it. So i think its something silly with numbers / allocations. Will keep you posted.
|
* Bringing StoA container and small entity system for HDAdditional lights. * Building entities for each view, and ensuring array size is pooled * Safety on destruction / empty light list iteration * Adding markers * Removing extra entity indirection. Keeping things intrusive * Jobs for transform data copying * Parallelizing and creating an append strategy for processed lights * Light culling jobs, more logic * Deallocation safety, skipping any processing if m_Size is 0 * compilation issue * Enabling burst * Adding more functionality to processing * Reducing complexity of HDLightEntityCollection * Refactoring so we dont have to do any intrusive manipulation of light entity. This also decouples some of the issues * Fixing memory leaks by tracking missing instances during reset * Removing circular dependency * Light loop settings and culling maximum light count * Adding support for evaluation of shadow state * Adding support for shadow reservation * Early return if no lights are available * Sorting and shadow management * Prepare ligths for GPU marker * Formatting * Integrating the new jobified light preprocess into the PrepareLightsForGPU formatting call. * More work on unification, now loop is connected, commented ray tracing loop for now. * Fixing sorting offset and counters internally * Using the correct index for the sort key * Removing process light data, cleaning up ray tracing light cluster path * Is created check * Support for shuriken lights / lights that lack an HDAdditionalData * Going back to SoA to hit better perf on low number of lights * Initial changes for PrepareLightsForGPU * Formatting * Input stuff * Porting work, moving first batch of code. WIP * Finished porting code for LightGetData * Connecting the BuildLightGPUData port * Adding shadow allocation for directional shadows * Enabling burst * Sorting MSB of sortKey so that we process non directioanl first, and we know exactly how to compute the output index of directional * Fixing order and mistake on shift * Right allocation of directional / non directional lights * Initial work for directional lights * More work on directional lights * More work on port, correct sorting order for sortKeys, having directional lights first * Isolating processing of directional lights to function * Final pieces for gpu light directional lights * Burstifying light volume code * Formatting * Moving directional light sky / direct light color computation to burst * Make lights camera relative * Light counters * AtoS conversion * Bug fix, leaving too early * Safety on allocation count * Going back to safe references * Support for ray tracing * Connecting new light loop * Finished connecting of new cpu light loop * Formatting * Wrong array length passed on radix sort size * Adding renaming suggested by Remi's code review, much nicer * formatting * Renaming to more appropiate names, HDLightEntityCollection -> HDLightRenderDatabase * Cleaning up database, renaming light collection to database. * Formatting * Replacing with dynamic array * Reset of light count, causing some internal issues when all lights deleted. * Renaming HDVisibleLightEntities to HDProcessedVisibleLightsBuilder, and cleaning some of the code * Renamed HDGpuLightList to HDGpuLightBuilder * Adding safety checks on debug mode * Fixing shuriken lights that have shadows enabled: ensure to disable them * Capitalize public functions * Adding sort algorithm tests * Deleting duplicate code added by accident * Adding comment explaining when to add members in the HDLightRenderData * Formatting * Removing job that copies transforms, not required anymore * Formatting * Adding protection for invalid entity when editing properties on non enabled HDAddtiionalLightData * Adding another filter for clustered ray tracing lights, this ignores invalid lights * Correct shadow flag, mistake in refactor * Adding only cluster type lights to light entityarray list * Simplifying HDProcessedVisibleLightsBuilder, making it a member entirely of the HDRenderPipeline * Fixing typo causing memory leak * Formatting * Added missing useVolumetric cloud parameter missing, affecting rendering * Adding rather odd scaling of main light cookie, not getting this code exactly really messes up cloud shadows * Missing flushing of area lights shape from script * Shuriken lights attrociousness. Works now though * Correct order for clustered lights * Adding auto destruction to fix a memory leak inside shuriken lights * Forgot to add the correct cam relative world position * Updated feedback from review, some renames and comments * Formatting * Fixed comment * Changelogs * Compressing left and right eye copying of lights into GPU buffer, splitting into 2 separate copies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some performance testing and can confirm that it's bette now.
For the manual tests listed, everything seems to be working like on the master branch, so this is a 🟢 for me.
Thank you for the in depth explanations of the PR in the description.
Test doc here : https://docs.google.com/document/d/1ihb4pcw8634-fclHjGpbrgr0ZsIJfc6XW5W-NmI7H5k/edit?usp=sharing
* Bringing StoA container and small entity system for HDAdditional lights. * Building entities for each view, and ensuring array size is pooled * Safety on destruction / empty light list iteration * Adding markers * Removing extra entity indirection. Keeping things intrusive * Jobs for transform data copying * Parallelizing and creating an append strategy for processed lights * Light culling jobs, more logic * Deallocation safety, skipping any processing if m_Size is 0 * compilation issue * Enabling burst * Adding more functionality to processing * Reducing complexity of HDLightEntityCollection * Refactoring so we dont have to do any intrusive manipulation of light entity. This also decouples some of the issues * Fixing memory leaks by tracking missing instances during reset * Removing circular dependency * Light loop settings and culling maximum light count * Adding support for evaluation of shadow state * Adding support for shadow reservation * Early return if no lights are available * Sorting and shadow management * Prepare ligths for GPU marker * Formatting * Integrating the new jobified light preprocess into the PrepareLightsForGPU formatting call. * More work on unification, now loop is connected, commented ray tracing loop for now. * Fixing sorting offset and counters internally * Using the correct index for the sort key * Removing process light data, cleaning up ray tracing light cluster path * Is created check * Support for shuriken lights / lights that lack an HDAdditionalData * Going back to SoA to hit better perf on low number of lights * Initial changes for PrepareLightsForGPU * Formatting * Input stuff * Porting work, moving first batch of code. WIP * Finished porting code for LightGetData * Connecting the BuildLightGPUData port * Adding shadow allocation for directional shadows * Enabling burst * Sorting MSB of sortKey so that we process non directioanl first, and we know exactly how to compute the output index of directional * Fixing order and mistake on shift * Right allocation of directional / non directional lights * Initial work for directional lights * More work on directional lights * More work on port, correct sorting order for sortKeys, having directional lights first * Isolating processing of directional lights to function * Final pieces for gpu light directional lights * Burstifying light volume code * Formatting * Moving directional light sky / direct light color computation to burst * Make lights camera relative * Light counters * AtoS conversion * Bug fix, leaving too early * Safety on allocation count * Going back to safe references * Support for ray tracing * Connecting new light loop * Finished connecting of new cpu light loop * Formatting * Wrong array length passed on radix sort size * Adding renaming suggested by Remi's code review, much nicer * formatting * Renaming to more appropiate names, HDLightEntityCollection -> HDLightRenderDatabase * Cleaning up database, renaming light collection to database. * Formatting * Replacing with dynamic array * Reset of light count, causing some internal issues when all lights deleted. * Renaming HDVisibleLightEntities to HDProcessedVisibleLightsBuilder, and cleaning some of the code * Renamed HDGpuLightList to HDGpuLightBuilder * Adding safety checks on debug mode * Fixing shuriken lights that have shadows enabled: ensure to disable them * Capitalize public functions * Adding sort algorithm tests * Deleting duplicate code added by accident * Adding comment explaining when to add members in the HDLightRenderData * Formatting * Removing job that copies transforms, not required anymore * Formatting * Adding protection for invalid entity when editing properties on non enabled HDAddtiionalLightData * Adding another filter for clustered ray tracing lights, this ignores invalid lights * Correct shadow flag, mistake in refactor * Adding only cluster type lights to light entityarray list * Simplifying HDProcessedVisibleLightsBuilder, making it a member entirely of the HDRenderPipeline * Fixing typo causing memory leak * Formatting * Added missing useVolumetric cloud parameter missing, affecting rendering * Adding rather odd scaling of main light cookie, not getting this code exactly really messes up cloud shadows * Missing flushing of area lights shape from script * Shuriken lights attrociousness. Works now though * Correct order for clustered lights * Adding auto destruction to fix a memory leak inside shuriken lights * Forgot to add the correct cam relative world position * Updated feedback from review, some renames and comments * Formatting * Fixed comment * Changelogs * Compressing left and right eye copying of lights into GPU buffer, splitting into 2 separate copies
* **CPU Optimizations for PrepareLightsForGPU** (#5294) * Bringing StoA container and small entity system for HDAdditional lights. * Building entities for each view, and ensuring array size is pooled * Safety on destruction / empty light list iteration * Adding markers * Removing extra entity indirection. Keeping things intrusive * Jobs for transform data copying * Parallelizing and creating an append strategy for processed lights * Light culling jobs, more logic * Deallocation safety, skipping any processing if m_Size is 0 * compilation issue * Enabling burst * Adding more functionality to processing * Reducing complexity of HDLightEntityCollection * Refactoring so we dont have to do any intrusive manipulation of light entity. This also decouples some of the issues * Fixing memory leaks by tracking missing instances during reset * Removing circular dependency * Light loop settings and culling maximum light count * Adding support for evaluation of shadow state * Adding support for shadow reservation * Early return if no lights are available * Sorting and shadow management * Prepare ligths for GPU marker * Formatting * Integrating the new jobified light preprocess into the PrepareLightsForGPU formatting call. * More work on unification, now loop is connected, commented ray tracing loop for now. * Fixing sorting offset and counters internally * Using the correct index for the sort key * Removing process light data, cleaning up ray tracing light cluster path * Is created check * Support for shuriken lights / lights that lack an HDAdditionalData * Going back to SoA to hit better perf on low number of lights * Initial changes for PrepareLightsForGPU * Formatting * Input stuff * Porting work, moving first batch of code. WIP * Finished porting code for LightGetData * Connecting the BuildLightGPUData port * Adding shadow allocation for directional shadows * Enabling burst * Sorting MSB of sortKey so that we process non directioanl first, and we know exactly how to compute the output index of directional * Fixing order and mistake on shift * Right allocation of directional / non directional lights * Initial work for directional lights * More work on directional lights * More work on port, correct sorting order for sortKeys, having directional lights first * Isolating processing of directional lights to function * Final pieces for gpu light directional lights * Burstifying light volume code * Formatting * Moving directional light sky / direct light color computation to burst * Make lights camera relative * Light counters * AtoS conversion * Bug fix, leaving too early * Safety on allocation count * Going back to safe references * Support for ray tracing * Connecting new light loop * Finished connecting of new cpu light loop * Formatting * Wrong array length passed on radix sort size * Adding renaming suggested by Remi's code review, much nicer * formatting * Renaming to more appropiate names, HDLightEntityCollection -> HDLightRenderDatabase * Cleaning up database, renaming light collection to database. * Formatting * Replacing with dynamic array * Reset of light count, causing some internal issues when all lights deleted. * Renaming HDVisibleLightEntities to HDProcessedVisibleLightsBuilder, and cleaning some of the code * Renamed HDGpuLightList to HDGpuLightBuilder * Adding safety checks on debug mode * Fixing shuriken lights that have shadows enabled: ensure to disable them * Capitalize public functions * Adding sort algorithm tests * Deleting duplicate code added by accident * Adding comment explaining when to add members in the HDLightRenderData * Formatting * Removing job that copies transforms, not required anymore * Formatting * Adding protection for invalid entity when editing properties on non enabled HDAddtiionalLightData * Adding another filter for clustered ray tracing lights, this ignores invalid lights * Correct shadow flag, mistake in refactor * Adding only cluster type lights to light entityarray list * Simplifying HDProcessedVisibleLightsBuilder, making it a member entirely of the HDRenderPipeline * Fixing typo causing memory leak * Formatting * Added missing useVolumetric cloud parameter missing, affecting rendering * Adding rather odd scaling of main light cookie, not getting this code exactly really messes up cloud shadows * Missing flushing of area lights shape from script * Shuriken lights attrociousness. Works now though * Correct order for clustered lights * Adding auto destruction to fix a memory leak inside shuriken lights * Forgot to add the correct cam relative world position * Updated feedback from review, some renames and comments * Formatting * Fixed comment * Changelogs * Compressing left and right eye copying of lights into GPU buffer, splitting into 2 separate copies * Updated whats new * changelog small entry * Update CHANGELOG.md Co-authored-by: sebastienlagarde <sebastien@unity3d.com>
Dynamic GI is currently disabled Adapted PreprocessDynamicGILights to work with the new parallel visible lights builder Reenabled customizations Adapted PrepareDynamicGIDPULightdata to work with the new parallel lights builders Reenabled DynamicGI Readded WillRenderShadowMap for the client-side ShadowCachingSystem Removed review comment Bugfixes from R&D Fixed out of bounds access in the DynamicGI visible lights builder Fixed ClearLightList spill from probe and mask volumes LightLoop Optimizations Backport: Fixed up HVSSS which was broken in the backport. There were three bugs: 1) the lightData.heirarchicalVarianceScreenSpaceShadowsIndex was getting zero initialized instead of initialized to -1 in the build job. 2) We were requesting unique HVSSS shadows for the regular light loop AND the dynamic GI light loop, causing 2x the hvsss trace work. 3) The HVSSS code interacts with lightData.positionRWS - in the previous version of the lightloop code, this value was still in WS not RWS, and so HVSSS ran its calculations on that. The new light loop code transforms positionRWS into RWS much earlier, so I have updated the HVSSS calculations to work directly off of RWS (which is cleaner anyway).
Purpose of this PR
Optimization of PrepareLightsforGPU (ProcessVisibleLights) on HDRP using burst, in a ECS friendly way.
This PR includes:
NOTE:
Testing status
Seeing a reduction of around 50%-70% on PrepareLightsForGPU.

Comparisons are done after compiling with IL2CPP at max optimization level, against master on ps4.
Similar performance comparisons must be done in Windows: create a player in windows with il2cpp with maximum optimization level, and compare master vs this branch.
Testing performed
Additional testing required:
Technical Information
The following is a technical writeup, intended to give a high level overview of the work done on this PR. Before we can dive in, I need to explain how exactly HDRP does light culling and why it is expensive.
Originally the light loop in HDRP is divided into 3 big steps.
Culling in HDRP
Step 1
first step is frustum culling coming from the SRP, which converts Light game objects (refered as GOs) to VisibleLight structs. VisibleLight struct contains a copy of the visible light, and is all performed in the c++, providing these structs (burst accessible) for the SRP to access
Step 2
Next, HDRP has a HDAdditionalLightData GO for additional PBR options not present in the regular Light component. This component is self managed, and the UI obfuscates everything, creating a nice GUI that lets HDRP have a more advanced light definition.
The second step basically took the VisibleLight objects, and processed them with the HDAdditioanlLightData and did some extra quick rejection. This produced an intermediate list called ProcessedLightData.
The ProcessedLightData then gets sorted using a recursive quicksort, which is unnacceptably heavy on the CPU.
Step 3
Last step then converts ProcessedLightData into a GPU ready struct light light list.
The 2nd and 3rd step were all done linearly, in the main thread, accessed a lot the HDAdditionalData and were huge perf bottle necks. The sort was as well gigantic.
Optimization Parts
Part A
Added the concept of a render entity (HDRenderLightEntity) and a global manager for these entities. This basically mantains light data (from HDAdditionalLightData GO ) as struct. The GO has an entity (handle) to this render data and is responsible to update it from the Sim side. The GO then just becomes an interface to the UI / serialization / artist. Its not used anymore by the render pipeline (with the exception of shadow procesisng, which will require another separate PR).
This effectively (almost) decouples the rneder pipeline from the GO. This allows the render pipeline then to access the HDLightRenderLightEntityData as a struct, thus unlocking burstification.
Part B
Moved the step 2 mentioned above to burst, with the exception of shadow allocation. ProcessedLightData gets accumulated using an atomic increment across all threads. This is a gigantic speedup on the CPU. With the exception of shadows, shadow lights still get processed 'patched' in the main thread.
Part C
Rewrote the sorts to be done only on UInts, and on pointers for maximum speed.
Added 3 sort algorithms:
Insertion sort for a light count <32
Non recursive merge sort with auxiliary data for >= 32 and <= 256
Radix sort with auxiliary data with 8 bit radix for 256 >= and above.
This is effectively almost 5x faster, and can also scale massively. The sort is still done in the main thread (there were no good gains on it being on burst)
Part D
Move the Step 3 above to burst. Keep the shadow pieces in the main thread.
Conclusion and Next steps
Huge perf gains, new sort loops available are now in Core, for next work we have to optimize the shadow atlas allocation, matrix calculation. Investigate reducing the 3 step culling algorithm to a single step.
This makes it now possible for DOTS as well to inject its own 'sim side' light definition and wrap the render side entity. Shadow work will finalize the decoupling of HDAdditionalLightData GO and enable this final perk.