Description
Environment info
- transformers version: 3.3.1
- Platform: Linux-4.4.0-116-generic-x86_64-with-glibc2.10
- Python version: 3.8.5
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: (tried with both 1 and 2 GPUs)
Who can help
Summarization: @sshleifer
T5: @patrickvonplaten
examples/seq2seq: @sshleifer
Information
I am trying to fine-tune T5 on a custom dataset. I posted about my specific use case here in the forums: https://discuss.huggingface.co/t/t5-tips-for-finetuning-on-crossword-clues-clue-answer/1514
The problem arises when using:
- [ ] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (details below)
To reproduce
- clone transformers from master
- pip install -e . ; pip install -r requirements.txt
- cd examples/seq2seq
- modify the finetune_t5.sh script to point at a local dataset (data_set/[val|test|train].[source|target])
(Note that I have changed nothing else)
python finetune.py \
  --model_name_or_path=t5-small \
  --tokenizer_name=t5-small \
  --data_dir=${HOME}/data_set \
  --learning_rate=3e-4 \
  --output_dir=$OUTPUT_DIR \
  --max_source_length=100 \
  --max_target_length=100 \
  --num_train_epochs=300 \
  --train_batch_size=64 \
  --eval_batch_size=64 \
  --gpus=1 \
  --auto_select_gpus=True \
  --save_top_k=3 \
  --do_train \
  --do_predict \
  "$@"
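For reference, generations can also be inspected outside of the script with something like the sketch below. It assumes finetune.py has exported a Hugging Face-format checkpoint under $OUTPUT_DIR (I use a best_tfmr/ subdirectory here, but that exact layout is an assumption and may differ):

```python
# Minimal generation sanity check. CHECKPOINT_DIR is an assumption about where
# finetune.py exports the HF-format model; adjust it to the actual output layout.
import os
from transformers import T5ForConditionalGeneration, T5Tokenizer

CHECKPOINT_DIR = os.path.join(os.environ["OUTPUT_DIR"], "best_tfmr")

tokenizer = T5Tokenizer.from_pretrained(CHECKPOINT_DIR)
model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT_DIR)

# Encode one test line and greedily decode up to 100 tokens
inputs = tokenizer("We raised a bloom, a monster", return_tensors="pt")
generated = model.generate(inputs.input_ids, max_length=100)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```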
As a baseline "does T5 work at all" check, my input/output pairs are of the form (one example per line):
(this is one line in train.source): This is a sentence
(this is corresponding line in train.target): This
The lines are exactly as above, with a newline after each example and no other punctuation. I have not modified the tokenizer or the model.
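For concreteness, here is a minimal sketch of how such a toy dataset could be written out (the sentences are placeholders, not my real data; file names follow the {train,val,test}.{source,target} convention the example scripts expect):

```python
# Sketch of the toy dataset layout: each target line is the first word of the
# corresponding source line. The sentences below are placeholders.
from pathlib import Path

sentences = [
    "This is a sentence",
    "Another short example line",
]

data_dir = Path.home() / "data_set"
data_dir.mkdir(exist_ok=True)

for split in ["train", "val", "test"]:
    with open(data_dir / f"{split}.source", "w") as src, \
         open(data_dir / f"{split}.target", "w") as tgt:
        for line in sentences:
            src.write(line + "\n")
            tgt.write(line.split()[0] + "\n")
```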
Expected behavior
I expect T5 to learn to output the first word of each input line.
Observed behavior
T5 outputs the first word followed by gibberish.
After 300 epochs, here are the first 5 lines of test.source vs. test_generations (test.target is just the first word of each line in test.source):
test.source:
We raised a bloom, a monster
I let Satan corrupt and torment
Chapter in play is an old piece
Old skin disease liable to drain confidence
Keep a riot going inside a musical academy
test_generations:
We vsahmoastuosastostassymbossa
Issahrastahmoormentostormentastoshomment
Chapter vshygie'ny-futtahraffahtaftast
Old hygienohmahrastassahuasairtia
Keep'astifiahuassaivrasastoshygiesana
I wonder if any of the following could be affecting this:
- choice of loss function
- a corrupted character somewhere in one of the input/output files (a quick check is sketched after this list)
- choice of task (I think it defaults to summarization)
- need more epochs?
- some other parameter to change?
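To rule out the corrupted-character hypothesis (and to see exactly what the model receives), here is a minimal diagnostic sketch; it assumes the data lives under ~/data_set as above, and the checks themselves are mine, not part of the example scripts:

```python
# 1) Flag any non-printable / non-ASCII characters in the data files.
# 2) Round-trip a sample line through the T5 tokenizer to see what the model sees.
from pathlib import Path
from transformers import T5Tokenizer

data_dir = Path.home() / "data_set"

for path in sorted(data_dir.glob("*.source")) + sorted(data_dir.glob("*.target")):
    for i, line in enumerate(path.read_text().splitlines(), start=1):
        suspicious = [c for c in line if ord(c) < 32 or ord(c) > 126]
        if suspicious:
            print(f"{path.name}:{i} suspicious characters: {suspicious!r}")

tokenizer = T5Tokenizer.from_pretrained("t5-small")
sample = "We raised a bloom, a monster"
ids = tokenizer(sample).input_ids
print(ids)
print(tokenizer.decode(ids))
```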