-
Notifications
You must be signed in to change notification settings - Fork 359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix userdata binding corner case #1673
Conversation
This includes two fixes: 1) When registering symbols that need user data, sort the entries in the set so the layer number is ignored. A needed udata iteam shouldn't depend on the layer and separating them makes find_userdata_index() sometimes find an index with different derivs status. 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not be there, yielding corrupted derivs and possibly a crash. Signed-off-by: Alejandro Conty <aconty@imageworks.com>
So this is the fix for the weird problem we were looking at last week? Background: For those outside SPI, we had stumbled across a shader network that produced incorrect results when we DISABLED instance merging (when two shader instance nodes in a group have identical inputs and do identical computations, they are merged and the newly single node's outputs are sent to wherever the original ones' outputs went). We're accustomed to occasionally seeing some kind of logic error on our/my part where two nodes were merged when it wasn't actually safe to do so, but in this case, the merged result was correct but when merging was disabled, it was wrong. Real head scratcher. Looks like Alex has tracked it down, which is awesome and an impressive feat of detective work. Alex, I wonder if you could spell out for us exactly what was going wrong and how this fixes things? I think I get the gist -- there were two nodes that needed the same bit of named userdata, but one needed the derivs of the userdata and the other did not. Once merged, the derivatives would be retrieved by the merged node, since its output would be connected to the downstream sink that needed derivs. But when unmerged, if the one that didn't need derivs retrieved the userdata first, no derivs would be present (and by the time the one that needed derivs runs, the userdata is already computed and cached, deriv-free). I think I get that part, but I am still a little hazy about how exactly this is fixed by removing the layer index from the sorting criteria of the list of userdata we will need. Won't you still have two entries in the list, one needing derivs and the other not, and so there is some kind of order dependence you are inadvertently relying on? Oh, is it a |
Exactly, it is a set<> built on that index that was considering the layer num. Now that it ignores it both layer will get the same entry in the set, and since one needs derivs, it will promote the entry. Without this change two entries are put in the set, and then find_userdata_index() will get the first one, which may not have derivs and fail to retrieve them. We had to identical layers asking for Pref [ lockgeom off ]. One needed derivs, the other one did. When not merging we were unlucky and find_userdata_index() was going always for the one without derivs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch! LGTM
std::min(symbol_data_size, udata_size)); | ||
if (symbol_data_size > udata_size) | ||
memset((char*)symbol_data + udata_size, 0, | ||
symbol_data_size - udata_size); | ||
return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wide_shadingsys.cpp __OSL_OP(bind_interpolated_param) could use this same type of fix, although because of the SOA data layout it might be more complicated. In particular can't just copy less bytes. Are there any unit tests to exercise this exact issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the fixed version of the set<> index, the safeguard in bind_interpolated_param shouldn't be necessary. I added it first to confirm we were getting garbage, then I left it in case the policy changed. No unit test, but with the other fix it won't even fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you making the statement
OSL_ASSERT(symbol_data_size == udata_size);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside:
I would love to eventually work toward as much as possible merging the separate scalar and batch versions of the ShadingSystem and BackendLLVM classes, or making them both inherit from a common base class with most of the functionality that is shared, or making batch a derived class of scalar, or something. Making "batch-ness" be more of a mode or a subclass that has only a few critical methods overloaded/replaced, will help to reduce these recurring issues where the scalar and batch systems have large numbers of identical or very similar methods and patchers (myself included) often only remember to change one of them.
In retrospect, I think it's less problematic how we did the OptiX back end, in which there are a variety of #if USE_OPTIX
(for compile time) and if (optixmode)
(for runtime) clauses inside the methods that really need them, rather than completely replicating the class hierarchy. That seems to suffer from fewer cases where we make a change to scalar or optix but forget to make the corresponding fix to the other... because for most of it, there are not separate code paths.
Granted, the changes necessary for batch shading are more intrusive and extensive -- the whole Cuda shtick is that you can largely write the code as if it's running on one point at a time, so the Cuda/OptiX code is closer in spirt to the CPU scalar path than the CPU simd path is to the CPU scalar path.
This includes two fixes: 1) When registering symbols that need user data, sort the entries in the set so the layer number is ignored. A needed udata iteam shouldn't depend on the layer and separating them makes find_userdata_index() sometimes find an index with different derivs status. 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not be there, yielding corrupted derivs and possibly a crash. Signed-off-by: Alejandro Conty <aconty@imageworks.com>
Highlights: * GPU/OptiX support of ReParameter (AcademySoftwareFoundation#1686) * Fix OptiX userdata derivatives for interpolated params on GPU (AcademySoftwareFoundation#1685) * Fix userdata binding corner case (AcademySoftwareFoundation#1673) * Fix constant float values being converted to ints (AcademySoftwareFoundation#1674) Restock the abi dump file as we push to 1.13.4. See merge request spi/dev/3rd-party/osl-feedstock!48
Description
This includes two fixes:
When registering symbols that need user data, sort the entries in
the set so the layer number is ignored. A needed udata iteam shouldn't
depend on the layer and separating them makes find_userdata_index()
sometimes find an index with different derivs status.
osl_bind_interpolated_param() is memcpy'ing derivs that might not
be there, yielding corrupted derivs and possibly a crash.
Checklist: