__3.x = (float)(((nv_bfloat162*)(&(v_.x)))->x);
__3.y = (float)(((nv_bfloat162*)(&(v_.x)))->y);
__3.z = (float)(((nv_bfloat162*)(&(v_.y)))->x);
__3.w = (float)(((nv_bfloat162*)(&(v_.y)))->y);
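The generated code above converts each bfloat16 element to float separately. We could use __bfloat1622float2 instead, which converts both halves of an nv_bfloat162 in a single call, to improve performance. Similarly, __float22bfloat162_rn, __nv_cvt_float2_to_fp8x2, and other paired-conversion CUDA intrinsics like these could also improve performance.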
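As a rough sketch of what the suggested output could look like (not the actual codegen), the four scalar casts could be replaced by two paired conversions. The helper names `bf16x4_to_float4`, `float4_to_bf16x4`, `roundtrip_kernel`, and `roundtrip_matches` are hypothetical, and I am assuming `v_` is a `uint2` holding four packed bfloat16 values and `__3` is a `float4`. It also assumes a toolkit and target architecture where the bfloat16 intrinsics from `cuda_bf16.h` are available (for example sm_80 or newer).

```cuda
#include <cuda_bf16.h>
#include <cuda_runtime.h>

// Hypothetical helper: unpack four bfloat16 values held in a uint2 into a float4.
__device__ __forceinline__ float4 bf16x4_to_float4(uint2 v) {
  // Each 32-bit word holds one nv_bfloat162, as in the generated code above.
  const nv_bfloat162 lo = *reinterpret_cast<const nv_bfloat162*>(&v.x);
  const nv_bfloat162 hi = *reinterpret_cast<const nv_bfloat162*>(&v.y);
  const float2 a = __bfloat1622float2(lo);  // converts both halves in one call
  const float2 b = __bfloat1622float2(hi);
  return make_float4(a.x, a.y, b.x, b.y);
}

// Hypothetical helper for the reverse direction, using round-to-nearest-even.
__device__ __forceinline__ uint2 float4_to_bf16x4(float4 f) {
  const nv_bfloat162 lo = __float22bfloat162_rn(make_float2(f.x, f.y));
  const nv_bfloat162 hi = __float22bfloat162_rn(make_float2(f.z, f.w));
  uint2 v;
  v.x = *reinterpret_cast<const unsigned int*>(&lo);
  v.y = *reinterpret_cast<const unsigned int*>(&hi);
  return v;
}

// Packs one float4 to bfloat16 and unpacks it again.
__global__ void roundtrip_kernel(const float4* in, float4* out) {
  *out = bf16x4_to_float4(float4_to_bf16x4(*in));
}

// Host-side check: the inputs are exactly representable in bfloat16,
// so the round trip should return them unchanged.
bool roundtrip_matches() {
  const float4 h_in = make_float4(1.0f, 2.0f, -0.5f, 3.0f);
  float4 h_out = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
  float4* d_in = nullptr;
  float4* d_out = nullptr;
  if (cudaMalloc(&d_in, sizeof(float4)) != cudaSuccess) return false;
  if (cudaMalloc(&d_out, sizeof(float4)) != cudaSuccess) {
    cudaFree(d_in);
    return false;
  }
  cudaMemcpy(d_in, &h_in, sizeof(float4), cudaMemcpyHostToDevice);
  roundtrip_kernel<<<1, 1>>>(d_in, d_out);
  const cudaError_t launch_err = cudaGetLastError();
  const cudaError_t copy_err =
      cudaMemcpy(&h_out, d_out, sizeof(float4), cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  return launch_err == cudaSuccess && copy_err == cudaSuccess &&
         h_out.x == h_in.x && h_out.y == h_in.y &&
         h_out.z == h_in.z && h_out.w == h_in.w;
}
```

Calling `roundtrip_matches()` from a host `main` is enough to check the sketch on a supported GPU. Whether the paired intrinsics are actually faster than the scalar casts would need to be confirmed by profiling or by comparing the generated SASS; that is not measured here.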