
auto_find_batch_size=True and eval_steps=ratio unexpected behavior #24248

Closed
@edmcman

Description

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.7.19-050719-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I don't have a full example that I can share, but I think this is a simple enough problem that one may not be needed.

I am using TrainingArguments(auto_find_batch_size=True, eval_steps=0.1, per_device_train_batch_size=1024). With a batch size of 1024, there are 657 training steps. The eval ratio appears to be resolved against this count, so evaluation happens every 66 steps (0.1 × 657, rounded up).

However, auto_find_batch_size lowers the batch size to 16, which raises the total to 83787 steps. Evaluation is still performed every 66 steps.
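
Something like the following sketches the setup. The dummy model and dataset are placeholders I made up for illustration (they are not from my actual script, and this toy model is far too small to actually trigger the OOM-driven batch-size reduction); only the TrainingArguments are the relevant part:

```python
# Minimal sketch of the configuration that triggers the issue.
# DummyModel and the in-memory dataset are illustrative placeholders.
import torch
from torch import nn
from transformers import Trainer, TrainingArguments

class DummyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(8, 2)

    def forward(self, x=None, labels=None):
        logits = self.linear(x)
        loss = nn.functional.cross_entropy(logits, labels) if labels is not None else None
        return {"loss": loss, "logits": logits}

# Tiny in-memory dataset; the default collator stacks the tensors.
train_data = [{"x": torch.randn(8), "labels": torch.tensor(0)} for _ in range(4096)]

args = TrainingArguments(
    output_dir="out",
    auto_find_batch_size=True,         # on CUDA OOM, retry with a smaller batch size
    per_device_train_batch_size=1024,  # starting batch size
    evaluation_strategy="steps",
    eval_steps=0.1,                    # ratio: evaluate every 10% of max_steps
)

trainer = Trainer(
    model=DummyModel(),
    args=args,
    train_dataset=train_data,
    eval_dataset=train_data[:256],
)
trainer.train()
```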

Expected behavior

I expected eval_steps to be recomputed when the batch size changed. In the example above, evaluation should occur roughly every 8379 steps (0.1 × 83787, rounded up), not every 66.
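
The arithmetic, as a sketch (assuming the ratio is rounded up with math.ceil, which is consistent with the observed interval of 66):

```python
import math

# Ratio resolved against the step count from the initial batch size of 1024
# (this is what actually happens):
print(math.ceil(0.1 * 657))    # 66   -- the interval actually used

# Ratio resolved against the step count after the batch size drops to 16
# (this is what I expected):
print(math.ceil(0.1 * 83787))  # 8379 -- the interval I expected
```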
