System Info
- `transformers` version: 4.32.0.dev0
- Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
- Python version: 3.10.6
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.1
- Accelerate version: 0.21.0
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu117 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce the behavior:
1. The issue can be reproduced with the text-classification example script (other scripts would have the same issue). I have `intel-extension-for-pytorch==2.0.100` installed in my environment and am running the following command for run_glue.py without `use_ipex` (so it should default to `False`):

   ```bash
   export MODEL_NAME=distilbert-base-uncased
   export OUTPUT_DIR=/home/dmsuehir/glue_output
   export TASK_NAME=mrpc

   python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
   ```

   The train metrics I see with this run are:

   ```
   ***** train metrics *****
     epoch                    =        1.0
     train_loss               =     0.6083
     train_runtime            = 0:00:37.35
     train_samples            =       3668
     train_samples_per_second =     98.191
     train_steps_per_second   =      1.553
   ```

   Note that we are seeing 98.191 samples/second.

2. Next, try running the same command, except adding on `--use_ipex` (I am also deleting my output directory between runs):

   ```bash
   python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16 \
     --use_ipex
   ```

   I see a similar `train_samples_per_second` as step 1:

   ```
   ***** train metrics *****
     epoch                    =        1.0
     train_loss               =     0.6083
     train_runtime            = 0:00:37.94
     train_samples            =       3668
     train_samples_per_second =     96.654
     train_steps_per_second   =      1.528
   ```

3. Finally, I debugged this issue to look into how IPEX is being used in the Trainer. I found that it can be called in two places: (1) from the Trainer here, or (2) from accelerate here. The Trainer is properly respecting the `use_ipex` arg; however, accelerate appears to always use IPEX if it's installed. Digging deeper, I found that accelerate only skips IPEX if `ACCELERATE_USE_IPEX` is set to False/0. To confirm this, I manually set `ACCELERATE_USE_IPEX=0` and then ran the same script/args from step 1 (a programmatic version of this workaround is sketched after these steps):

   ```bash
   export ACCELERATE_USE_IPEX=0

   python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
   ```

   And now I see these training metrics, where the drop in `train_samples_per_second` indicates that IPEX has actually been turned off now that the env var was used:

   ```
   ***** train metrics *****
     epoch                    =        1.0
     train_loss               =      0.697
     train_runtime            = 0:01:07.74
     train_samples            =       3668
     train_samples_per_second =     54.143
     train_steps_per_second   =      0.856
   ```
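Until this is resolved, a possible workaround when launching from Python rather than a shell is to set the same environment variable before any Trainer/accelerate code runs. This is only a sketch of the manual workaround above, not an officially supported switch:

```python
import os

# Workaround sketch: accelerate decides whether to apply IPEX based on the
# ACCELERATE_USE_IPEX env var, so disable it explicitly when IPEX optimization
# is not wanted. This must run before the Trainer/accelerate is initialized.
os.environ["ACCELERATE_USE_IPEX"] = "0"
```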
Expected behavior
When `use_ipex` is not passed or is set to `False`, IPEX optimization should not be applied.
If it's agreed that this is in fact a bug, I would be happy to work on a PR to fix it. I saw that other accelerate env vars are being set from `training_args.py`.
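For reference, a rough sketch of the direction I'd take for the fix, assuming it sits alongside the other accelerate env vars set from `training_args.py` (the helper name below is illustrative, not actual transformers code):

```python
import os

def _sync_ipex_env(use_ipex: bool) -> None:
    # Hypothetical helper: mirror the Trainer's use_ipex flag into the env var
    # accelerate reads, so IPEX is only applied when the user asked for it.
    os.environ["ACCELERATE_USE_IPEX"] = "1" if use_ipex else "0"

# e.g. while the training arguments are being processed:
# _sync_ipex_env(training_args.use_ipex)
```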