Is there any reference documentation on using NCCL with the Llama 2 70B model in a distributed, multi-node, multi-GPU configuration?