Support for Iterable Datasets #87

Closed
@kmehant

Description

When building the data loader (for single-GPU training), the Transformers library wraps the original collator in another collator if the dataset is not an HF Dataset, which is essentially the case for iterable datasets. As a result, the data collator becomes a RemoveColumnsCollator, which internally removes columns from the batch and then calls the original collator. See https://github.com/huggingface/transformers/blob/4d5b45870411053c9c72d24a8e1052e00fe62ad6/src/transformers/trainer_utils.py#L843

Padding-free training is not supported in this scenario, because there is a hard check that the collator is an instance of one of the following:

collate_fn, (DataCollatorForSeq2Seq, DataCollatorForCompletionOnlyLM)

We should be able to support this case.
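
One possible way to support this is to unwrap the collator before running the type check. The sketch below is only an illustration: `OriginalCollator` and `WrapperCollator` are hypothetical stand-ins (not the real Transformers or TRL classes), and it assumes the wrapper keeps the original collator in a `data_collator` attribute, as RemoveColumnsCollator appears to do. `unwrap_collator` and `is_supported` are hypothetical helper names.

```python
class OriginalCollator:
    """Hypothetical stand-in for a supported collator (e.g. DataCollatorForSeq2Seq)."""

    def __call__(self, features):
        return features


class WrapperCollator:
    """Hypothetical stand-in for the wrapper added for iterable datasets.

    Assumption: the wrapped collator is stored in `data_collator`.
    """

    def __init__(self, data_collator, keep_columns):
        self.data_collator = data_collator
        self.keep_columns = set(keep_columns)

    def __call__(self, features):
        # Drop columns that are not in keep_columns, then delegate.
        trimmed = [
            {k: v for k, v in f.items() if k in self.keep_columns}
            for f in features
        ]
        return self.data_collator(trimmed)


def unwrap_collator(collate_fn):
    """Follow `data_collator` attributes until the innermost collator is reached."""
    seen = set()
    while hasattr(collate_fn, "data_collator") and id(collate_fn) not in seen:
        seen.add(id(collate_fn))  # guard against accidental cycles
        collate_fn = collate_fn.data_collator
    return collate_fn


def is_supported(collate_fn, supported_types):
    """Run the type check against the innermost collator instead of the wrapper."""
    return isinstance(unwrap_collator(collate_fn), supported_types)


wrapped = WrapperCollator(OriginalCollator(), keep_columns=["input_ids"])
print(isinstance(wrapped, OriginalCollator))       # direct check on the wrapper
print(is_supported(wrapped, (OriginalCollator,)))  # check after unwrapping
```

With this approach the existing check stays strict for unknown collators, while a wrapped supported collator is still accepted.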

I am happy to raise a PR. Thanks.
