Description
Hi,
I am trying to run pretraining with an 8k sequence length on a custom dataset. However, I am getting the following error:
```
Traceback (most recent call last):
  File "/disk1/sandeep/m2bert/m2/bert/main.py", line 280, in <module>
    main(cfg)
  File "/disk1/sandeep/m2bert/m2/bert/main.py", line 187, in main
    train_loader = build_dataloader(
  File "/disk1/sandeep/m2bert/m2/bert/main.py", line 144, in build_dataloader
    return text_data_module.build_text_dataloader(cfg, tokenizer,
  File "/disk1/sandeep/m2bert/m2/bert/src/text_data.py", line 274, in build_text_dataloader
    dataset = StreamingTextDataset(
  File "/disk1/sandeep/m2bert/m2/bert/src/text_data.py", line 134, in __init__
    super().__init__(
  File "/disk1/sandeep/miniconda3/envs/m2_bert/lib/python3.10/site-packages/streaming/base/dataset.py", line 325, in __init__
    self._shm_prefix, self._locals_shm = get_shm_prefix(my_locals, world)
  File "/disk1/sandeep/miniconda3/envs/m2_bert/lib/python3.10/site-packages/streaming/base/shared.py", line 357, in get_shm_prefix
    dist.barrier()
  File "/disk1/sandeep/miniconda3/envs/m2_bert/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL Error 1: unhandled cuda error
```
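Since the crash is inside `dist.barrier()` before any training step runs, one thing I could try is checking whether NCCL itself works outside the repo. Below is the kind of minimal standalone barrier test I have in mind (my own sketch, assuming a `torchrun` launch; the filename is just illustrative):

```python
# minimal_nccl_check.py -- standalone NCCL sanity check (my own sketch, not from the m2 repo)
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets LOCAL_RANK; pinning each process to its own GPU before
    # init_process_group matters, since an unset device is a common cause
    # of "NCCL Error 1: unhandled cuda error".
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    dist.barrier()
    print(f"rank {dist.get_rank()} passed the barrier")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

I would run it with, e.g., `torchrun --nproc_per_node=2 minimal_nccl_check.py`; if this also fails with the same NCCL error, the problem would seem to be my environment rather than the m2 code.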
The relevant batch-size settings from my config are:
```yaml
global_train_batch_size: 7

# System
seed: 17
device_eval_batch_size: 1
#device_train_microbatch_size: 8
device_train_microbatch_size: auto
precision: amp_bf16
```
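For context, my (possibly wrong) understanding of how these values interact is that the global batch is split evenly across ranks and then divided into microbatches, so `global_train_batch_size: 7` may not split cleanly across multiple GPUs. A rough sketch of the arithmetic as I understand it (not Composer's actual implementation; `world_size` is hypothetical):

```python
# Rough sketch of the batch-size split as I understand it
# (not Composer's actual code; world_size is a hypothetical GPU count).
import math

global_train_batch_size = 7
world_size = 2  # e.g. two GPUs; my actual launch may differ

# Each rank should get an equal share of the global batch, which suggests
# the global size needs to be divisible by the device count.
if global_train_batch_size % world_size != 0:
    print(f"warning: {global_train_batch_size} does not split evenly "
          f"across {world_size} ranks")
per_device_batch = global_train_batch_size // world_size

# With device_train_microbatch_size: auto the microbatch size is chosen to
# fit memory; with the commented-out fixed value of 8 it would be:
microbatch_size = 8
num_microbatches = math.ceil(max(per_device_batch, 1) / microbatch_size)
print(f"per-device batch: {per_device_batch}, microbatches: {num_microbatches}")
```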
Please let me know what I am doing wrong.