Multi-task dataset mixing #217

@ghomasHudson

It seems like many of the best-performing models on the GLUE benchmark make some use of multitask learning (simultaneous training on multiple tasks).

The T5 paper highlights multiple ways of mixing the tasks together during fine-tuning:

  • Examples-proportional mixing - sample from tasks proportionally to their dataset size
  • Equal mixing - sample uniformly from each task
  • Temperature-scaled mixing - the generalized approach used by multilingual BERT, which uses a temperature T: the mixing rate of each task is raised to the power 1/T and renormalized. When T=1 this is equivalent to examples-proportional mixing, and it approaches equal mixing as T increases (a small sketch follows this list).

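For concreteness, here is a minimal sketch of how the temperature-scaled rates could be computed. The dataset sizes are placeholders for illustration; the T5 paper additionally caps each dataset size with an artificial limit K before scaling, which is omitted here:

```python
# Minimal sketch of temperature-scaled mixing rates (not library code).
# The dataset sizes below are placeholders for illustration only.
sizes = {"squad": 87_599, "imdb": 25_000, "cnn_dm": 287_113}

def mixing_rates(sizes, temperature):
    """Raise each dataset's share to the power 1/T and renormalize."""
    weights = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

print(mixing_rates(sizes, temperature=1.0))    # == examples-proportional mixing
print(mixing_rates(sizes, temperature=100.0))  # approaches equal mixing
```
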
Following the discussion in huggingface/transformers#4340, @enzoampil suggested that the nlp library might be a better place for this functionality.

Some method for combining datasets could be implemented, e.g.:

dataset = nlp.load_multitask(['squad','imdb','cnn_dm'], temperature=2.0, ...)

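To make the proposal concrete, here is a rough sketch of what such a helper could do internally, built only on the existing nlp.load_dataset. The load_multitask name, the task-tagging convention, and the generator-style return value are all assumptions for illustration, not an existing API (datasets that need a config name would also need extra handling):

```python
import random
import nlp

def load_multitask(task_names, temperature=2.0, split="train", seed=42):
    """Hypothetical sketch: yield examples from several datasets with
    temperature-scaled sampling. Not part of the nlp library."""
    datasets = {name: nlp.load_dataset(name, split=split) for name in task_names}

    # Temperature-scaled mixing rates: r_i proportional to len(d_i) ** (1 / T)
    sizes = {name: len(ds) for name, ds in datasets.items()}
    weights = {name: size ** (1.0 / temperature) for name, size in sizes.items()}
    total = sum(weights.values())
    probs = [weights[name] / total for name in task_names]

    rng = random.Random(seed)
    while True:
        # Pick a task according to the mixing rates, then a random example from it.
        task = rng.choices(task_names, weights=probs, k=1)[0]
        example = datasets[task][rng.randrange(sizes[task])]
        # Tag each example with its task; a real implementation might instead
        # prepend a prefix string such as 'summarisation: ' to the input text.
        yield {"task": task, **example}
```

Returning a plain generator sidesteps the question of how to reconcile the different schemas across tasks; a real implementation returning a proper nlp.Dataset would have to address that, which is part of the interface question below.
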
We would need a few additions:

  • Method of identifying the tasks - how can we support prepending an identifier string to each example, e.g. 'summarisation: '?
  • Method of combining the metrics - a standard approach is to use the specific metric for each task and add the scores together for a combined score (see the sketch after this list).

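A possible sketch of that combined metric, assuming each task's metric object exposes a compute(predictions=..., references=...) method returning a dict of scores (the task_metrics and main_score_key mappings are hypothetical names for illustration):

```python
def combined_score(task_metrics, predictions, references, main_score_key):
    """task_metrics:   dict of task name -> metric object
    predictions:    dict of task name -> list of model predictions
    references:     dict of task name -> list of gold labels
    main_score_key: dict of task name -> which key of the metric output to use"""
    per_task = {}
    for task, metric in task_metrics.items():
        scores = metric.compute(
            predictions=predictions[task], references=references[task]
        )
        per_task[task] = scores[main_score_key[task]]
    # The simple combination described above: add the per-task scores together.
    return sum(per_task.values()), per_task
```
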
It would be great to support common use cases such as pretraining on the GLUE benchmark before fine-tuning on each GLUE task in turn.

I'm willing to write bits/most of this; I just need some guidance on the interface and other library details so I can integrate it properly.
