Is your feature request related to a problem? Please describe.
We are seeing TorchBind operators in the C++ runtime call into Python in order to dispatch.
Describe the solution you'd like
We want these operators to run entirely in C++ without going back to Python.
Potential solutions include registering the operators as CUDA ops, re-exporting so that they do not need to be lifted into Python and run more like they do under AOTInductor, or switching from TorchBind to an ExecuTorch-style integration.
Describe alternatives you've considered
Additional context