Fix userdata binding corner case #1673

aconty · 2023-04-24T09:38:47Z

Description

This includes two fixes:

When registering symbols that need user data, sort the entries in
the set so the layer number is ignored. A needed udata iteam shouldn't
depend on the layer and separating them makes find_userdata_index()
sometimes find an index with different derivs status.
osl_bind_interpolated_param() is memcpy'ing derivs that might not
be there, yielding corrupted derivs and possibly a crash.

Checklist:

I have read the contribution guidelines.
I have previously submitted a Contributor License Agreement.
I have updated the documentation, if applicable.
I have ensured that the change is tested somewhere in the testsuite (adding new test cases if necessary).
My code follows the prevailing code style of this project.

This includes two fixes: 1) When registering symbols that need user data, sort the entries in the set so the layer number is ignored. A needed udata iteam shouldn't depend on the layer and separating them makes find_userdata_index() sometimes find an index with different derivs status. 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not be there, yielding corrupted derivs and possibly a crash. Signed-off-by: Alejandro Conty <aconty@imageworks.com>

lgritz · 2023-04-24T15:47:54Z

So this is the fix for the weird problem we were looking at last week?

Background: For those outside SPI, we had stumbled across a shader network that produced incorrect results when we DISABLED instance merging (when two shader instance nodes in a group have identical inputs and do identical computations, they are merged and the newly single node's outputs are sent to wherever the original ones' outputs went). We're accustomed to occasionally seeing some kind of logic error on our/my part where two nodes were merged when it wasn't actually safe to do so, but in this case, the merged result was correct but when merging was disabled, it was wrong. Real head scratcher.

Looks like Alex has tracked it down, which is awesome and an impressive feat of detective work.

Alex, I wonder if you could spell out for us exactly what was going wrong and how this fixes things? I think I get the gist -- there were two nodes that needed the same bit of named userdata, but one needed the derivs of the userdata and the other did not. Once merged, the derivatives would be retrieved by the merged node, since its output would be connected to the downstream sink that needed derivs. But when unmerged, if the one that didn't need derivs retrieved the userdata first, no derivs would be present (and by the time the one that needed derivs runs, the userdata is already computed and cached, deriv-free).

I think I get that part, but I am still a little hazy about how exactly this is fixed by removing the layer index from the sorting criteria of the list of userdata we will need. Won't you still have two entries in the list, one needing derivs and the other not, and so there is some kind of order dependence you are inadvertently relying on? Oh, is it a set of some kind, so there is not a second entry at all if the sort doesn't make them look different? But even in that case, if the first one added to the set doesn't need derivs but the second does, what is the mechanism by which the record already in the set becomes aware that it needs that userdata to compute derivs?

aconty · 2023-04-24T17:45:18Z

Exactly, it is a set<> built on that index that was considering the layer num. Now that it ignores it both layer will get the same entry in the set, and since one needs derivs, it will promote the entry. Without this change two entries are put in the set, and then find_userdata_index() will get the first one, which may not have derivs and fail to retrieve them.

We had to identical layers asking for Pref [ lockgeom off ]. One needed derivs, the other one did. When not merging we were unlucky and find_userdata_index() was going always for the one without derivs.

lgritz

Great catch! LGTM

AlexMWells · 2023-04-24T18:11:26Z

src/liboslexec/shadingsys.cpp

+               std::min(symbol_data_size, udata_size));
+        if (symbol_data_size > udata_size)
+            memset((char*)symbol_data + udata_size, 0,
+                   symbol_data_size - udata_size);
        return 1;


wide_shadingsys.cpp __OSL_OP(bind_interpolated_param) could use this same type of fix, although because of the SOA data layout it might be more complicated. In particular can't just copy less bytes. Are there any unit tests to exercise this exact issue?

With the fixed version of the set<> index, the safeguard in bind_interpolated_param shouldn't be necessary. I added it first to confirm we were getting garbage, then I left it in case the policy changed. No unit test, but with the other fix it won't even fail.

Are you making the statement
OSL_ASSERT(symbol_data_size == udata_size);
?

Aside:

I would love to eventually work toward as much as possible merging the separate scalar and batch versions of the ShadingSystem and BackendLLVM classes, or making them both inherit from a common base class with most of the functionality that is shared, or making batch a derived class of scalar, or something. Making "batch-ness" be more of a mode or a subclass that has only a few critical methods overloaded/replaced, will help to reduce these recurring issues where the scalar and batch systems have large numbers of identical or very similar methods and patchers (myself included) often only remember to change one of them.

In retrospect, I think it's less problematic how we did the OptiX back end, in which there are a variety of #if USE_OPTIX (for compile time) and if (optixmode) (for runtime) clauses inside the methods that really need them, rather than completely replicating the class hierarchy. That seems to suffer from fewer cases where we make a change to scalar or optix but forget to make the corresponding fix to the other... because for most of it, there are not separate code paths.

Granted, the changes necessary for batch shading are more intrusive and extensive -- the whole Cuda shtick is that you can largely write the code as if it's running on one point at a time, so the Cuda/OptiX code is closer in spirt to the CPU scalar path than the CPU simd path is to the CPU scalar path.

This includes two fixes: 1) When registering symbols that need user data, sort the entries in the set so the layer number is ignored. A needed udata iteam shouldn't depend on the layer and separating them makes find_userdata_index() sometimes find an index with different derivs status. 2) osl_bind_interpolated_param() is memcpy'ing derivs that might not be there, yielding corrupted derivs and possibly a crash. Signed-off-by: Alejandro Conty <aconty@imageworks.com>

Highlights: * GPU/OptiX support of ReParameter (AcademySoftwareFoundation#1686) * Fix OptiX userdata derivatives for interpolated params on GPU (AcademySoftwareFoundation#1685) * Fix userdata binding corner case (AcademySoftwareFoundation#1673) * Fix constant float values being converted to ints (AcademySoftwareFoundation#1674) Restock the abi dump file as we push to 1.13.4. See merge request spi/dev/3rd-party/osl-feedstock!48

lgritz approved these changes Apr 24, 2023

View reviewed changes

lgritz merged commit 5600d02 into AcademySoftwareFoundation:main Apr 24, 2023

AlexMWells reviewed Apr 24, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix userdata binding corner case #1673

Fix userdata binding corner case #1673

aconty commented Apr 24, 2023

lgritz commented Apr 24, 2023

aconty commented Apr 24, 2023

lgritz left a comment

AlexMWells Apr 24, 2023

aconty Apr 24, 2023

AlexMWells Apr 24, 2023

lgritz Apr 24, 2023

Fix userdata binding corner case #1673

Fix userdata binding corner case #1673

Conversation

aconty commented Apr 24, 2023

Description

Checklist:

lgritz commented Apr 24, 2023

aconty commented Apr 24, 2023

lgritz left a comment

Choose a reason for hiding this comment

AlexMWells Apr 24, 2023

Choose a reason for hiding this comment

aconty Apr 24, 2023

Choose a reason for hiding this comment

AlexMWells Apr 24, 2023

Choose a reason for hiding this comment

lgritz Apr 24, 2023

Choose a reason for hiding this comment