Trainer is always using IPEX, even when use_ipex=False #24871

@dmsuehir

Description

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. The issue can be reproduced with the text-classification example script (other scripts would have the same issue). I have intel-extension-for-pytorch==2.0.100 installed in my environment and am running run_glue.py with the following command, without --use_ipex (so it should default to False):
    export MODEL_NAME=distilbert-base-uncased
    export OUTPUT_DIR=/home/dmsuehir/glue_output
    export TASK_NAME=mrpc
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    The train metrics I see with this run are:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.35
      train_samples            =       3668
      train_samples_per_second =     98.191
      train_steps_per_second   =      1.553
    
    Note that we are seeing 98.191 samples/second.
  2. Next, try running the same command, this time adding --use_ipex. Note that I am also deleting my output directory between runs.
    python run_glue.py \
      --model_name_or_path $MODEL_NAME \
      --task_name $TASK_NAME \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 64 \
      --learning_rate 2e-5 \
      --num_train_epochs 1 \
      --no_cuda \
      --output_dir $OUTPUT_DIR \
      --bf16 \
      --use_ipex
    
    The train metrics are similar to step 1, with nearly the same train_samples_per_second:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.94
      train_samples            =       3668
      train_samples_per_second =     96.654
      train_steps_per_second   =      1.528
    
  3. Finally, I debugged how IPEX is being used in the Trainer and found that it can be called in two places: (1) from the Trainer itself, or (2) by accelerate. The Trainer properly respects the use_ipex arg; however, accelerate always uses IPEX when it is installed. Digging deeper, I found that accelerate will only skip IPEX if the ACCELERATE_USE_IPEX environment variable is set to False/0 (see the sketch after this list). To confirm this, I manually set ACCELERATE_USE_IPEX=0 and then ran the same script/args from step 1:
    export ACCELERATE_USE_IPEX=0
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    Now I see these train metrics, where the drop in train_samples_per_second indicates that IPEX was actually turned off once the env var was set:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =      0.697
      train_runtime            = 0:01:07.74
      train_samples            =       3668
      train_samples_per_second =     54.143
      train_steps_per_second   =      0.856
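
For reference, here is a minimal sketch (plain Python, not accelerate's actual source) of the gating behavior observed above, assuming a flag that defaults to True so IPEX becomes opt-out rather than opt-in whenever intel_extension_for_pytorch is importable:

    import importlib.util
    import os

    def _env_flag(name: str, default: bool) -> bool:
        # Treat "0"/"false"/"no"/"off" as False and anything else as True,
        # similar in spirit to accelerate's parse_flag_from_env utility.
        value = os.environ.get(name)
        if value is None:
            return default
        return value.strip().lower() not in ("0", "false", "no", "off")

    def should_use_ipex() -> bool:
        # Hypothetical reconstruction of the observed behavior: IPEX is
        # used whenever it is installed, unless ACCELERATE_USE_IPEX is
        # explicitly set to 0/False -- i.e. the flag defaults to True.
        ipex_installed = importlib.util.find_spec("intel_extension_for_pytorch") is not None
        return ipex_installed and _env_flag("ACCELERATE_USE_IPEX", default=True)

This matches the measurements above: with nothing set, or with --use_ipex, the fast (IPEX) path is taken; only ACCELERATE_USE_IPEX=0 produces the slower non-IPEX run.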
    

Expected behavior

When use_ipex is not given or is set to False, ipex.optimize should not get called.

If it's agreed that this is in fact a bug, I would be happy to work on a PR to fix it. I saw that other accelerate env vars are already being set from training_args.py, so the fix could follow the same pattern (a rough sketch is below).
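
For illustration only, a minimal sketch of that approach, assuming the env var is propagated from TrainingArguments in training_args.py; the helper name below is hypothetical and the real integration point may differ:

    import os

    def _propagate_use_ipex(use_ipex: bool) -> None:
        # Hypothetical helper, called wherever training_args.py sets the
        # other ACCELERATE_* env vars: make the env var reflect the user's
        # use_ipex choice so accelerate no longer enables IPEX by default.
        os.environ["ACCELERATE_USE_IPEX"] = "true" if use_ipex else "false"

With this in place, running without --use_ipex should behave like the ACCELERATE_USE_IPEX=0 run in step 3.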
