Closed
Description
The current approach, in which kernel inputs are escaped for the duration of kernel execution and finalizers free HSA memory allocations directly, is problematic given the potential benefits of JuliaLang/julia#44056.
We could instead emulate the behavior of CUDA and refcount HSA allocations, with one reference dropped in the finalizer and one held for the duration of each kernel execution. This would make HSA object finalizers very fast (possibly just a single atomic add), and we would no longer need to escape objects to protect their allocations. It would also let us localize memory allocation failures to a limited set of tasks, which would let us provide better error handling behavior globally.
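A minimal sketch of the refcounting idea, to make the proposal concrete. Every name here (`RefCountedAlloc`, `retain!`, `release!`, `DeviceArray`, `with_kernel`) is hypothetical and not part of any existing package, and the `freed` flag stands in for the actual call that frees HSA memory:

```julia
# Hypothetical stand-in for an HSA allocation. The count starts at 1:
# that reference belongs to the owning wrapper object.
mutable struct RefCountedAlloc
    refcount::Threads.Atomic{Int}
    freed::Bool   # placeholder for "the HSA memory has been freed"
end

RefCountedAlloc() = RefCountedAlloc(Threads.Atomic{Int}(1), false)

# Take one additional reference on the allocation.
retain!(a::RefCountedAlloc) = (Threads.atomic_add!(a.refcount, 1); a)

# Drop one reference; the last holder performs the free.
function release!(a::RefCountedAlloc)
    # atomic_sub! returns the value held before the decrement.
    if Threads.atomic_sub!(a.refcount, 1) == 1
        a.freed = true   # assumption: real code would free the HSA memory here
    end
    return nothing
end

# Hypothetical wrapper object. Its finalizer only drops a reference,
# so it costs one atomic operation unless it held the last reference.
mutable struct DeviceArray
    alloc::RefCountedAlloc
    function DeviceArray()
        arr = new(RefCountedAlloc())
        finalizer(x -> release!(x.alloc), arr)
        return arr
    end
end

# Hold a reference on each allocation while `f` (standing in for a kernel
# launch plus wait) runs, instead of escaping the wrapper objects.
function with_kernel(f, allocs::RefCountedAlloc...)
    foreach(retain!, allocs)
    try
        return f()
    finally
        foreach(release!, allocs)
    end
end
```

In this sketch the kernel holds its reference on the allocation rather than on the wrapper object, so the wrapper can be collected and finalized while the kernel is still running without the memory being freed underneath it.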