
T5 finetune outputting gibberish #7796

Closed
@jsrozner

Description


Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.4.0-116-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: (tried with both 1 and 2 gpus)

Who can help

Summarization: @sshleifer
T5: @patrickvonplaten
examples/seq2seq: @sshleifer

Information

I am trying to finetune on a custom dataset. I posted about my specific use case here in the forums: https://discuss.huggingface.co/t/t5-tips-for-finetuning-on-crossword-clues-clue-answer/1514

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

  • clone transformers from master
  • pip install -e . ; pip install -r requirements.txt
  • cd examples/seq2seq
  • modify the finetune_t5.sh script to run with a local dataset (data_set/[val|test|train].[source|target])

(Note that I have changed nothing else)

python finetune.py \
    --model_name_or_path=t5-small \
    --tokenizer_name=t5-small \
    --data_dir=${HOME}/data_set \
    --learning_rate=3e-4 \
    --output_dir=$OUTPUT_DIR \
    --max_source_length=100 \
    --max_target_length=100 \
    --num_train_epochs=300 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --gpus=1 \
    --auto_select_gpus=True \
    --save_top_k=3 \
    --output_dir=$OUTPUT_DIR \
    --do_train \
    --do_predict \
    "$@"

As a baseline "does T5 work at all" test, my input/output pairs are of the form (one per line):
(this is one line in train.source): This is a sentence
(this is the corresponding line in train.target): This

The lines are exactly as above, with a newline after each example but no other punctuation. I have not modified the tokens or the model.
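For concreteness, here is a minimal sketch of how I lay out the toy dataset files in the format the seq2seq example expects (the data_set/ location and the example sentences below are just placeholders, not my real data):

```python
from pathlib import Path

# Toy "copy the first word" dataset in the layout examples/seq2seq expects:
# data_set/{train,val,test}.{source,target}, one example per line.
sentences = [
    "This is a sentence",
    "Another short example line",
    "Yet another plain input",
]

data_dir = Path.home() / "data_set"
data_dir.mkdir(exist_ok=True)

for split in ("train", "val", "test"):
    with open(data_dir / f"{split}.source", "w") as src, \
         open(data_dir / f"{split}.target", "w") as tgt:
        for line in sentences:
            src.write(line + "\n")
            tgt.write(line.split()[0] + "\n")  # target = first word only
```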

Expected behavior

I expect T5 to learn to output the first word of each input sentence.

Observed

T5 outputs the first word followed by gibberish.

After 300 epochs, here is what we see for the first 5 lines of test.source vs test_generations (test.target is just the first word of each line in test.source).
test.source:
We raised a bloom, a monster
I let Satan corrupt and torment
Chapter in play is an old piece
Old skin disease liable to drain confidence
Keep a riot going inside a musical academy

test_generations:
We vsahmoastuosastostassymbossa
Issahrastahmoormentostormentastoshomment
Chapter vshygie'ny-futtahraffahtaftast
Old hygienohmahrastassahuasairtia
Keep'astifiahuassaivrasastoshygiesana

I wonder if any of the following could be affecting this:

  • choice of loss function
  • a corrupted character somewhere in one of the input/output
  • choice of task (I think it defaults to summarization; a quick generation sanity check with an explicit prefix is sketched after this list)
  • need more epochs?
  • some other parameter to change?
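To rule out a pure decoding problem, here is a minimal generation sanity check I could run against the saved checkpoint. The checkpoint path and the "summarize: " prefix are assumptions on my part (I believe finetune.py writes a Hugging Face checkpoint under $OUTPUT_DIR/best_tfmr, but adjust the path if your layout differs):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumption: finetune.py saved a usable checkpoint here; adjust if needed,
# or point at plain "t5-small" to compare against the untuned model.
ckpt = "output_dir/best_tfmr"  # hypothetical path
tokenizer = T5Tokenizer.from_pretrained(ckpt)
model = T5ForConditionalGeneration.from_pretrained(ckpt)
model.eval()

# The "summarize: " prefix is an assumption about which task prefix
# the example script used during training.
text = "summarize: We raised a bloom, a monster"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_length=20,
        num_beams=4,
        early_stopping=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
```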
