### Bug description

When running the Tensor Parallel example that trains Llama3 from scratch
(https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples/fabric/tensor_parallel)
and using torchtune to build the DataLoader over an unstructured dataset with a given batch size, I get an error indicating that all samples in a batch must have the same size:
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 170, in collate
[rank4]: raise RuntimeError('each element in list of batch should be of equal size')
[rank4]: RuntimeError: each element in list of batch should be of equal size
The only modification I made was to build the DataLoader with torchtune in the example's data.py script. Details follow.

### What version are you seeing the problem on?

v1.x

### How to reproduce the bug

Add the following to the example's data.py file:
```python
import torch
from torch.utils.data import Dataset

def get_random_tokens(vocab_size, size):
    # random tokens list
    tokens = torch.randint(
        vocab_size,
        size=size,
        # Set a seed to make this toy dataset the same on each rank
        # Fabric will add a `DistributedSampler` to shard the data correctly
        generator=torch.Generator().manual_seed(42),
    )
    return tokens

def get_text_completion_dataset_tokens(seq_length, batch_size):
    # https://pytorch.org/torchtune/main/generated/torchtune.utils.padded_collate.html#torchtune.utils.padded_collate
    from torchtune.utils import padded_collate
    dataset = load_dataset(seq_length=seq_length)  # helper (not shown here) returning the tokenized text-completion dataset
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, num_workers=0, shuffle=False, collate_fn=padded_collate)
    tokens = []
    for sample in dataloader:
        batch = sample['tokens'].tolist()
        for sample in batch:
            tokens.append(sample)
    return tokens

class RandomTokenDataset(Dataset):
    def __init__(self, vocab_size: int, seq_length: int, batch_size: int):
        self.vocab_size = vocab_size
        self.seq_length = seq_length
        self.batch_size = batch_size
        # self.tokens = get_random_tokens(self.vocab_size, (len(self), self.seq_length + 1))
        self.tokens = get_text_completion_dataset_tokens(seq_length, batch_size)

    def __len__(self) -> int:
        return 128

    def __getitem__(self, item: int):
        return self.tokens[item]
```
This will end with the error above.
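For context, the collation failure appears to come not from torchtune's `padded_collate` (which only pads within each inner batch) but from the outer DataLoader that the Trainer builds over `RandomTokenDataset`: different inner batches get padded to different maximum lengths, so `__getitem__` returns lists of varying length and PyTorch's `default_collate` rejects the ragged batch. Below is a minimal sketch of just that collation behaviour, with made-up sample values for illustration:

```python
import torch
from torch.utils.data import default_collate

# Hypothetical samples of different length, as happens when two torchtune
# batches were padded to different maximum sequence lengths.
short_sample = [1, 2, 3]
long_sample = [4, 5, 6, 7, 8]

try:
    default_collate([short_sample, long_sample])
except RuntimeError as err:
    print(err)  # each element in list of batch should be of equal size

# Equal-length tensors, by contrast, are stacked into a single [2, 3] batch tensor.
print(default_collate([torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])]))
```

So any adaptation needs the items returned by `RandomTokenDataset.__getitem__` to have a consistent length, or the outer DataLoader needs its own padding `collate_fn`.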
### Error messages and logs
```
[rank4]: Traceback (most recent call last):
[rank4]: File "train.py", line 233, in <module>
[rank4]: train()
[rank4]: File "train.py", line 222, in train
[rank4]: trainer.fit(model)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
[rank4]: call._call_and_handle_interrupt(
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
[rank4]: return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
[rank4]: return function(*args, **kwargs)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
[rank4]: self._run(model, ckpt_path=ckpt_path)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 986, in _run
[rank4]: results = self._run_stage()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1030, in _run_stage
[rank4]: self.fit_loop.run()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
[rank4]: self.advance()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
[rank4]: self.epoch_loop.run(self._data_fetcher)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
[rank4]: self.advance(data_fetcher)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 212, in advance
[rank4]: batch, _, __ = next(data_fetcher)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/fetchers.py", line 133, in __next__
[rank4]: batch = super().__next__()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/loops/fetchers.py", line 60, in __next__
[rank4]: batch = next(self.iterator)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/utilities/combined_loader.py", line 341, in __next__
[rank4]: out = next(self._iterator)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/lightning/pytorch/utilities/combined_loader.py", line 78, in __next__
[rank4]: out[i] = next(self.iterators[i])
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
[rank4]: data = self._next_data()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1344, in _next_data
[rank4]: return self._process_data(data)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1370, in _process_data
[rank4]: data.reraise()
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/_utils.py", line 706, in reraise
[rank4]: raise exception
[rank4]: RuntimeError: Caught RuntimeError in DataLoader worker process 0.
[rank4]: Original Traceback (most recent call last):
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 309, in _worker_loop
[rank4]: data = fetcher.fetch(index) # type: ignore[possibly-undefined]
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 55, in fetch
[rank4]: return self.collate_fn(data)
[rank4]: File "/home/coder/.local/lib/python3.8/site-packages/torch/utils/data/_utils/collate.py", line 317, in default_collate
[rank4]: return collate(batch, collate_fn_map=default_collate_fn_map)
```
### More info

The reason for this modification is that the provided example is too abstract to test a meaningful, complete training run, i.e. one that uses a real dataset and tokenizer, so I added torchtune, which provides both. It would be a good addition to have this example adapted in this way.
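A sketch of what such an adaptation could look like, assuming the simplest route of forcing every item to a fixed length so the Trainer-side DataLoader can keep its default collate (the `pad_or_truncate` helper, the pad id `0`, and padding to `seq_length + 1` are my assumptions, not part of the existing example; `get_text_completion_dataset_tokens` is the function from the snippet above):

```python
import torch
from torch.utils.data import Dataset


def pad_or_truncate(tokens, seq_length, pad_id=0):
    # Force every sample to exactly `seq_length` tokens so that default_collate
    # in the Trainer's DataLoader always sees equally sized items.
    tokens = tokens[:seq_length]
    return tokens + [pad_id] * (seq_length - len(tokens))


class FixedLengthTokenDataset(Dataset):
    # Variant of RandomTokenDataset that returns fixed-length token tensors.
    def __init__(self, seq_length: int, batch_size: int):
        self.seq_length = seq_length
        raw = get_text_completion_dataset_tokens(seq_length, batch_size)  # defined in the snippet above
        # +1 so each item still holds the input sequence plus the shifted target,
        # matching the shape of the original random-token dataset.
        self.tokens = [pad_or_truncate(t, seq_length + 1) for t in raw]

    def __len__(self) -> int:
        return len(self.tokens)

    def __getitem__(self, item: int):
        return torch.tensor(self.tokens[item])
```

An alternative would be to give the DataLoader built in the example its own padding `collate_fn`, but then the training step has to cope with a sequence length that varies from batch to batch.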