Add double queue buffering.

The resolver function `resovler::XX::execute` currently uses a single stream. To improve performance, the processing must use two streams.

- [x] CUDA
- [ ] OpenCL
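For reference, the two-stream idea could look roughly like the sketch below on the CUDA side. It is only an illustration under assumptions, not the resolver's actual code: `scale_kernel`, `process_double_buffered`, the chunking scheme, and the block size of 256 are hypothetical names and values chosen for the example. Input is split into chunks, and consecutive chunks alternate between two streams, each with its own device buffers, so that the copies for one chunk can overlap with the kernel of the other.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cstddef>

// Hypothetical stand-in for the per-element work done inside the resolver.
__global__ void scale_kernel(const float* in, float* out, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Processes n floats in chunks, alternating between two streams.
// h_in / h_out should be pinned (cudaMallocHost) for the copies to actually
// overlap with kernels; pageable memory still gives correct results.
cudaError_t process_double_buffered(const float* h_in, float* h_out,
                                    size_t n, size_t chunk) {
    if (chunk == 0) return cudaErrorInvalidValue;

    cudaStream_t streams[2] = {nullptr, nullptr};
    float* d_in[2]  = {nullptr, nullptr};
    float* d_out[2] = {nullptr, nullptr};
    cudaError_t err = cudaSuccess;

    // One stream and one pair of device buffers per "queue".
    for (int s = 0; s < 2 && err == cudaSuccess; ++s) {
        err = cudaStreamCreate(&streams[s]);
        if (err == cudaSuccess) err = cudaMalloc(&d_in[s],  chunk * sizeof(float));
        if (err == cudaSuccess) err = cudaMalloc(&d_out[s], chunk * sizeof(float));
    }

    int s = 0;
    for (size_t off = 0; off < n && err == cudaSuccess; off += chunk, s ^= 1) {
        const size_t len   = std::min(chunk, n - off);
        const size_t bytes = len * sizeof(float);

        // Operations in one stream run in issue order, so buffer s is only
        // overwritten after the previous chunk on that stream has finished.
        err = cudaMemcpyAsync(d_in[s], h_in + off, bytes,
                              cudaMemcpyHostToDevice, streams[s]);
        if (err != cudaSuccess) break;

        const unsigned blocks = static_cast<unsigned>((len + 255) / 256);
        scale_kernel<<<blocks, 256, 0, streams[s]>>>(d_in[s], d_out[s], len);
        err = cudaGetLastError();
        if (err != cudaSuccess) break;

        err = cudaMemcpyAsync(h_out + off, d_out[s], bytes,
                              cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for both streams, then release resources.
    for (int k = 0; k < 2; ++k) {
        if (streams[k]) {
            cudaError_t e = cudaStreamSynchronize(streams[k]);
            if (err == cudaSuccess) err = e;
            cudaStreamDestroy(streams[k]);
        }
        cudaFree(d_in[k]);   // cudaFree(nullptr) is a no-op
        cudaFree(d_out[k]);
    }
    return err;
}
```

A caller would allocate `h_in` and `h_out` with `cudaMallocHost`, call `process_double_buffered(h_in, h_out, n, chunk)`, and check the returned `cudaError_t`. The OpenCL variant would presumably follow the same pattern with two command queues, but that part is not sketched here.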