Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random crashes due to GL out of memory #34065

Closed
oeleo1 opened this issue Dec 2, 2019 · 8 comments · Fixed by #34088
Closed

Random crashes due to GL out of memory #34065

oeleo1 opened this issue Dec 2, 2019 · 8 comments · Fixed by #34088

Comments

@oeleo1
Copy link

oeleo1 commented Dec 2, 2019

Godot version:
3.2.beta3

OS/device including version:
mobile platforms, iOS, Android

Issue description:
Recent changes to GL buffer management cause random crashes following PR #33527. @akien-mga @clayjohn I got evidence by putting glGetError() checks surrounding the buffer orphaning statements introduced for mobile. Obviously the currently irresponsible systematic reallocation of 128K + 128K for the poly buffer doesn't fly. Need to switch to double-buffering to fix pipeline stalls hurting mobile performance or revert the changes as a last resort if a proper fix is not ready in time.

Steps to reproduce:

Minimal reproduction project:

@clayjohn
Copy link
Member

clayjohn commented Dec 2, 2019

Is the safest option to revert #33527?

@oeleo1
Copy link
Author

oeleo1 commented Dec 2, 2019

Yes it is, which implies reverting Godot to a random fit for arcade games on half the mobile devices out there, which is a pity. Reverting is not a tough call. Coming up with a minimal non-intrusive and safe solution to the current GLES code is. I will address this promptly and submit the result for evaluation.

@oeleo1
Copy link
Author

oeleo1 commented Dec 3, 2019

Okay, after a close examination it isn't that bad as I thought but still an issue. The GL error is real but doesn't cause the crashes. The problem is in GLES3 where the poly index buffer is orphaned before being bound. (or used at all) . With the fix there are no more GL errors and I suspect this will also fix the spurious DrawError 502 that were reported after the buffer changes. Thus, they are confirmed valid and I've tested as extensively as I could. Here is the patch
gles3.patch.txt

As to the crashes on iOS, I need to investigate further - that's another issue. Something's very wrong here - the editor takes ages to load a GLES3 project on macOS (while GLES2 is fast as usual), the crashes are occurring in the Line2D code which is supposed to be out of doubt, when resizing the width of a line with a tween, and Xcode is mentioning GLES2 in green down below while the project is GLES3...
image

@lawnjelly
Copy link
Member

Is there an issue for the the crashes on iOS?

If there are memory leaks / fragmentation on the GPU in the dynamic stuff it could potentially lead to 'random' crashes in other areas.

I suspect the manual double / triple buffering may be safer, because with the whole orphaning approach you are relying on sensible behaviour from the driver / hardware, which may happen on some devices but not all (it could actually be doing quite frightening things under the hood, and still be 'within the spec' ).

@qarmin
Copy link
Contributor

qarmin commented Dec 3, 2019

Crashes with Line2D should be fixed in Godot 4.0, when #31694 will be merged, but for now you can try to use this workaround #31609

@oeleo1
Copy link
Author

oeleo1 commented Dec 3, 2019

@qarmin I confirm that #31609 fixes the crashes I have so this should definitely go in. Thanks!

@lawnjelly There is always risk the driver can do "frightening things" but there is common sense we rely on. In case the driver runs out of memory due to GPU memory fragmentation, imagine what happens? The spec says the whole driver state becomes undefined in case of a GL error, so the chicken is toast -- no more guarantees. But in practice, when you run out of memory or do an invalid operation, the driver raises the GL error, stops right there and returns. It doesn't mess up more with invalidating buffers or pointers or otherwise. That's the common sense we rely on -- in case the orphaning fails, the previously allocated poly buffer is used and the whole process still works but with the known stalls. Witness is that without the above patch the buffer orphaning sequence is messed up, a GL error is raised but the driver successfully continues operation and there are no crashes. No driver expects us to act like a virgin so some abuse is always tolerated. On iOS orphaning a potion of the buffer is too much abuse and the driver crashes. But orphaning the whole buffer is fine and works well in practice.

That being said, I am currently testing a proof of concept round robin solution - I figured out a way to do this cleanly without messing up with the current code and will report on that soon for more people to eyeball - but that's a POC focused on relative safety based on current state of affairs that won't bring more performance.

@oeleo1
Copy link
Author

oeleo1 commented Dec 3, 2019

For the record: I've tested on iOS the round robin pre-allocated array of poly buffers scheme, supposedly safer, on GLES3. The code is nice and clean but the performance results are unconvincing:

  • with double-buffering I get 45-50 FPS, with triple-buffering 52-58, then going up to 58-60 (tested 8 pre-allocated buffers). No orphaning, just taking the next buffer on every poly draw request. Unfortunately I get DrawView 500 erros in all cases which suggests something's not quite right.
  • with mixing the two schemes by orphaning the previous buffer then taking the next buffer from the pool, I get the 58-60 FPS with double-buffering. This scheme is actually the current pure orphaning solution with a round-robin pool overhead.

Given the above, I am providing the latter variant of the code for your information and for educational purposes :) if there is interest in investigating further. But personally I wouldn't bother considering that path any longer (hence won't do the GLES2 variant) since there is no apparent benefit compared to what we currently have. Case closed (unless I happen to challenge myself again ;-).
POC_poly_buffer_round_robin.patch.txt
POC_poly_buffer_round_robin.zip

@oeleo1
Copy link
Author

oeleo1 commented Dec 4, 2019

@clayjohn Bummer! After updating from master it looks like I missed to fix a newly introduced function draw_generic_indices() in GLES3 which has been almost copy/pasted from draw_generic() and which contains the old orphan code. Here's the patched function. Could you please add it to the fix? Thanks and sorry about that.
gles3_bummer.patch.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants