[Feature Request] Padding Dataset to max_seq_length #1416
Hi @loretoparisi thanks for creating the issue. If you are just padding to a fixed size, you do not even have to do it in the collate function, strictly speaking. The collate function is typically for operations that depend on the entire batch, but since you want to pad every sample to the same fixed length, you can even do it at the dataset level. E.g.
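Something roughly like this inside the dataset's `__getitem__` (a rough sketch; the helper and argument names are just illustrative, and `tokens`/`labels` are the list fields returned by `text_completion_dataset`):

```python
def pad_to_fixed_len(sample, max_seq_len, padding_idx, label_pad_idx):
    """Pad one sample's tokens and labels up to max_seq_len (rough sketch)."""
    pad_amount = max_seq_len - len(sample["tokens"])
    if pad_amount > 0:
        sample["tokens"] = sample["tokens"] + [padding_idx] * pad_amount
        sample["labels"] = sample["labels"] + [label_pad_idx] * pad_amount
    return sample
```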
Btw I don't claim this will work 100%, but this should be the gist of it.
@ebsmothers I would say it works! I've just added a `PaddedTextCompletionDataset` with a default `padding_idx`:

```python
from torch.utils.data import Dataset
from torchtune.datasets import text_completion_dataset


class PaddedTextCompletionDataset(Dataset):
    def __init__(self, tokenizer: object, seq_length: int, data_files: str, padding_idx: int = 0):
        self.dataset = text_completion_dataset(
            tokenizer,
            source="text",
            column="text",
            data_files=data_files,
            split="train",
            max_seq_len=seq_length,
            packed=False,
        )
        self.padding_idx = padding_idx
        self.max_seq_len = seq_length

    def __len__(self) -> int:
        # number of samples in the underlying dataset
        return len(self.dataset)

    def __getitem__(self, index: int):
        unpadded = self.dataset[index]
        tokens = unpadded["tokens"]
        labels = unpadded["labels"]
        pad_amounts = self.max_seq_len - len(tokens)
        if pad_amounts > 0:
            tokens = tokens + [self.padding_idx] * pad_amounts
            labels = labels + [self.padding_idx] * pad_amounts
        return {"tokens": tokens, "labels": labels}
```

💯 One question about the collate snippet:

```python
# pad_sequence is torch.nn.utils.rnn.pad_sequence
input_ids = pad_sequence(
    [torch.tensor(x["tokens"]) for x in batch],
    batch_first=True,
    padding_value=padding_idx,
)
labels = pad_sequence(
    [torch.tensor(x["labels"]) for x in batch],
    batch_first=True,
    padding_value=ignore_idx,
)
```

shall we use the `ignore_idx` for the labels here as well? Thanks!
@loretoparisi oops good catch! That's my mistake, you should use `ignore_idx` when padding the labels.
Thank you. For the sake of correction I have removed the broken `else` branch and now pad the labels with `ignore_idx` in `__getitem__`:

```python
def __getitem__(self, index: int):
    unpadded = self.dataset[index]
    tokens = unpadded["tokens"]
    labels = unpadded["labels"]
    pad_amounts = self.max_seq_len - len(tokens)
    if pad_amounts > 0:
        tokens = tokens + [self.padding_idx] * pad_amounts
        # self.ignore_idx needs to be set in __init__ (the cross-entropy ignore index)
        labels = labels + [self.ignore_idx] * pad_amounts
    return {"tokens": tokens, "labels": labels}
```
@ebsmothers there is a format error with the samples I'm getting from the dataset: `tokens` and `labels` come back as plain Python lists rather than tensors.
[UPDATE] Converting the lists to tensors inside `__getitem__` fixes it:

```python
def __getitem__(self, index: int):
    unpadded = self.dataset[index]
    tokens = unpadded["tokens"]
    labels = unpadded["labels"]
    pad_amounts = self.max_seq_len - len(tokens)
    if pad_amounts > 0:
        tokens = tokens + [self.padding_idx] * pad_amounts
        labels = labels + [self.ignore_idx] * pad_amounts
    tokens = torch.tensor(tokens, dtype=torch.long)
    labels = torch.tensor(labels, dtype=torch.long)
    return {"tokens": tokens, "labels": labels}
```

Now the output tensor is correct.
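For reference, a quick check along these lines (the tokenizer path, data file, and batch size are placeholders) shows that the default collate now stacks the fixed-length samples directly:

```python
from torch.utils.data import DataLoader
from torchtune.models.llama3 import llama3_tokenizer

# Placeholder paths/arguments, just to illustrate the expected shapes.
# Assumes the dataset also sets ignore_idx in __init__ (see note above).
tokenizer = llama3_tokenizer("/path/to/tokenizer.model")
ds = PaddedTextCompletionDataset(tokenizer, seq_length=128, data_files="train.txt")

loader = DataLoader(ds, batch_size=2, shuffle=True)
batch = next(iter(loader))

print(batch["tokens"].shape)  # torch.Size([2, 128])
print(batch["labels"].shape)  # torch.Size([2, 128])
```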
Thanks for the updates @loretoparisi! I am gonna close this issue, but please feel free to reopen if you run into any other difficulties here.
When training Llama3 I wish to pad my unstructured text to the same length. This has been addressed by #1394
Anyway, this means that the padded sequence length will be the maximum tensor length found in that specific dataset, because this is how the `padded_collate` function works.
Instead, when training Llama3 I want my custom torch `Dataset` to produce a specific sequence length, defined externally, and to configure the matching length on the Llama3 side.
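As a purely illustrative sketch of that intent (the value 128 and the dict names are just placeholders):

```python
# The fixed length is defined once, externally, and shared by both sides.
seq_length = 128  # assumed value for illustration

dataset_cfg = {"seq_length": seq_length}   # custom torch Dataset side
model_cfg = {"max_seq_len": seq_length}    # Llama3 model side

assert dataset_cfg["seq_length"] == model_cfg["max_seq_len"]
```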
So, how can I use the padding collate in torchtune to pad to a given sequence length rather than to the max tensor length?

The code above will break because the tensor size going into `output = self.model(inputs)` will be `[2, 35]` (the max tensor length found in that dataset is 35), while the sequence length is 128, so it will only work if the tensor size is `[2, 128]`. I then get a CUDA error, `Assertion srcIndex < srcSelectDimSize failed`, because the tokenized size does not match that length.

To be more specific, that dimensionality issue happens in the `forward` pass of the `Transformer` block in the `Llama3` class, where the embeddings are computed from the tokens:

```python
h = self.tok_embeddings(tokens) if self.tok_embeddings else tokens
```
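As a toy illustration of the shape difference at that embedding step (sizes here are made up, not Llama3's real dimensions):

```python
import torch
import torch.nn as nn

# Made-up sizes, for illustration only.
vocab_size, dim = 1000, 64
tok_embeddings = nn.Embedding(vocab_size, dim)

short = torch.randint(0, vocab_size, (2, 35))    # batch padded only to the batch max (35)
fixed = torch.randint(0, vocab_size, (2, 128))   # batch padded to the fixed max_seq_len

print(tok_embeddings(short).shape)  # torch.Size([2, 35, 64])
print(tok_embeddings(fixed).shape)  # torch.Size([2, 128, 64])
```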