auto_find_batch_size=True and eval_steps=ratio unexpected behavior #24248

Closed · edmcman opened this issue Jun 13, 2023 · 26 comments


edmcman commented Jun 13, 2023

System Info

  • transformers version: 4.30.1
  • Platform: Linux-5.7.19-050719-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.15.1
  • Safetensors version: 0.3.1
  • PyTorch version (GPU?): 2.0.1+cu117 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I don't have a full example that I can share, but I think this is a simple enough problem that one may not be needed.

I am using `TrainingArguments(auto_find_batch_size=True, eval_steps=0.1, per_device_train_batch_size=1024)`. With a batch size of 1024, I have 657 steps. The eval ratio appears to be computed from this, with evaluation happening every 66 steps.

However, the automatic batch size adjustment reduces the batch size to 16, which corresponds to 83,787 steps. The evaluation is still performed every 66 steps.

Expected behavior

I expected the eval steps to be recomputed when the batch size was updated. In the example above, I expected evaluation to occur roughly every 8,000 steps.
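To make this concrete, here is the arithmetic behind those numbers (a rough sketch using the step counts from my run; the `math.ceil` rounding is an assumption about how the ratio is resolved):

```python
import math

eval_ratio = 0.1

# With per_device_train_batch_size=1024 the run has 657 optimizer steps,
# so the ratio resolves to an eval interval of:
print(math.ceil(657 * eval_ratio))     # 66

# After auto_find_batch_size reduces the batch size, the run has 83,787 steps,
# so I expected the interval to be recomputed as:
print(math.ceil(83_787 * eval_ratio))  # 8379
```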

sgugger (Collaborator) commented Jun 13, 2023

cc @muellerzr

muellerzr (Contributor) commented:

Any chance you could provide a minimal reproducer I can test with?

Otherwise, please try installing via `pip install git+https://github.com/huggingface/transformers@muellerzr-ratio` to see if that fixes it? 🙏

edmcman commented Jun 13, 2023

Let me try your patch first.

edmcman commented Jun 13, 2023

With the patch, it's still evaluating every 66 steps. Let me try to make a reproducer. It probably won't be minimal, though...

edmcman commented Jun 13, 2023

notebook.zip

edmcman commented Jun 13, 2023

Looks like `max_steps` is not being updated.

edmcman commented Jun 13, 2023

Very strange. Here is some debug output:

Currently training with a batch size of: 8
The following columns in the training set don't have a corresponding argument in `RobertaForSequenceClassification.forward` and have been ignored: Addr, Binary, Name, text. If Addr, Binary, Name, text are not expected by `RobertaForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 223,431
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 83,787
  Number of trainable parameters = 83,452,418

Total optimization steps is printing max_steps... 😕
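(For reference, the 83,787 figure is consistent with the reduced batch size and no gradient accumulation, per the log above; a quick back-of-the-envelope check:)

```python
import math

num_examples, num_epochs, batch_size = 223_431, 3, 8
steps_per_epoch = math.ceil(num_examples / batch_size)  # 27,929
print(steps_per_epoch * num_epochs)                     # 83,787
```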

edmcman commented Jun 13, 2023

I think I see the problem:

        if args.eval_steps and args.eval_steps < 1:
            args.eval_steps = math.ceil(max_steps * args.eval_steps)

Since this actually modifies `args.eval_steps`, the ratio is lost the first time we run this code: it sets `args.eval_steps` to 66 and the original 0.1 is gone.
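One way to avoid losing the ratio would be to resolve it into a concrete interval without overwriting the stored argument. A rough sketch of that idea (not the actual fix; `resolve_eval_steps` is a hypothetical helper name):

```python
import math

def resolve_eval_steps(eval_steps, max_steps):
    """Turn a ratio (< 1) into a step interval without mutating the setting."""
    if eval_steps and eval_steps < 1:
        return math.ceil(max_steps * eval_steps)
    return eval_steps

print(resolve_eval_steps(0.1, 657))     # 66   (first attempt, batch size 1024)
print(resolve_eval_steps(0.1, 83_787))  # 8379 (after the batch size is reduced)
```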

muellerzr (Contributor) commented:

Okay, I think it should be fixed now. Can you try again via the same branch?

edmcman commented Jun 13, 2023

Still evaluating every 66 steps :-(

edmcman commented Jun 13, 2023

I did upload the notebook as a .zip above, but I'm trying to put it on colab to make things easier.

edmcman commented Jun 13, 2023

I can't run it on colab because I'm out of free GPU usage, but I did upload it, and I think it should work if you have GPU access there:

https://colab.research.google.com/drive/1A-MzFHIbWtrtO4tjf2GROAdfAueEHidw?usp=sharing

muellerzr (Contributor) commented:

Re: "Total optimization steps is printing max_steps... 😕": yes, we don't perform gradient accumulation with this, so if the batch size gets small enough that max_steps ends up below the step count with the reduction multiplier applied, that does make sense.

Still looking into this. Thanks for the reproducer!

muellerzr (Contributor) commented:

Thanks again. I'll need to run this in the AM to verify, but I believe I've fixed it now by storing the steps in a data structure before we loop again: https://github.com/huggingface/transformers/compare/muellerzr-ratio?expand=1

Once verified, I'll put a PR in.

edmcman commented Jun 14, 2023

I'm sorry to report that I still think it is broken!

muellerzr (Contributor) commented:

Might not be a simple solution then! 😉 I'll be off on holiday for the rest of this week, and I'll look at this again come next Tuesday.

edmcman commented Jun 14, 2023

Enjoy your holiday. If I have some spare time, I'll see if I can figure out what is going wrong...

edmcman commented Jul 13, 2023 via email

huggingface deleted a comment from the github-actions bot on Jul 13, 2023
github-actions bot commented Aug 7, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

edmcman commented Aug 7, 2023

Ping

muellerzr (Contributor) commented:

@edmcman try again; I was able to get it to evaluate at step 830 when it was reduced to 8,292 total steps on my machine.
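(That interval matches the 0.1 ratio being re-resolved against the reduced step count:)

```python
import math
print(math.ceil(0.1 * 8_292))  # 830
```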

muellerzr (Contributor) commented:

My script:

```python
import datasets
import evaluate
import numpy as np  # needed for np.argmax in compute_metrics
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer, DataCollatorWithPadding

transformers.logging.set_verbosity_debug()

model_name = "huggingface/CodeBERTa-small-v1"
exp_name = "oo-method-test-model-10percent"
size = "[:10%]"
push = False

id2label = {0: "func", 1: "method"}
label2id = {"func": 0, "method": 1}

model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           num_labels=2)

small_ds_train = datasets.load_dataset("ejschwartz/oo-method-test", split="combined[:5%]")
small_ds_dev = datasets.load_dataset("ejschwartz/oo-method-test", split="combined[:5%]")
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["Disassembly"], padding="max_length", truncation=True)

small_ds_train = small_ds_train.map(tokenize_function, batched=True, num_proc=2).rename_column("Disassembly", "text").rename_column("Type", "label")
small_ds_dev = small_ds_dev.map(tokenize_function, batched=True, num_proc=2).rename_column("Disassembly", "text").rename_column("Type", "label")


data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

training_args = TrainingArguments(output_dir=exp_name,
                                  auto_find_batch_size=True,
                                  per_device_train_batch_size=1024,
                                  per_device_eval_batch_size=1024,
                                  logging_first_step=False,
                                  evaluation_strategy="steps",
                                  eval_steps=1 / 10.0
                                  )

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
  raise Exception("compute_metrics")  # deliberately raises for debugging (see the follow-up comment below)
  logits, labels = eval_pred
  predictions = np.argmax(logits, axis=-1)
  return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=small_ds_train,
    eval_dataset=small_ds_dev,
    compute_metrics=compute_metrics,
    data_collator=data_collator
)

trainer.train()
```

edmcman commented Aug 8, 2023

Thanks, I will try this again. It's possible I goofed and didn't reload the new code or something when I thought I did.

edmcman commented Aug 8, 2023

Yes, it is working for me too now!

(Edit: I forgot I added the exception for debugging 🤣)

muellerzr (Contributor) commented:

Great! I'll open a PR, thank you so much for your patience and clear bug report @edmcman

muellerzr (Contributor) commented:

Finally fixed on main 😄
