When `device_map` is set, the module should not be moved to CUDA first.

<img width="560" height="190" alt="Image" src="https://github.com/user-attachments/assets/fa80ff4a-54da-4a1a-901f-a152eb3c3e2a" />
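
A minimal sketch of the intended behavior, assuming a loader that takes an optional `device_map`. The helper name `pick_initial_device` and its parameters are hypothetical and only illustrate the rule; they are not taken from the codebase.

```python
from typing import Optional, Union

# Assumption: a device_map is either a strategy string such as "auto"
# or a dict mapping module names to devices.
DeviceMap = Union[str, dict]


def pick_initial_device(device_map: Optional[DeviceMap], default: str = "cuda") -> Optional[str]:
    """Hypothetical helper: choose where to place a module before dispatch.

    Returns None when a device_map is given, meaning the module is left
    where it is and placement is decided by the device_map alone.
    """
    if device_map is not None:
        # Skip the eager move to CUDA; the device_map decides placement.
        return None
    # No device_map: fall back to the default device.
    return default
```

A caller would move the module only when the returned value is not `None`, which avoids placing the whole module on the GPU before the `device_map` is applied.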