Description
System Info
transformers version - 4.33.2
I'm using the Trainer API as follows, so that it pushes the latest checkpoint to the Hugging Face Hub every epoch:
from transformers import TrainingArguments, Trainer
new_model_name = "videomae-finetuned"
num_epochs = 50
batch_size = 8
steps_per_epoch = train_dataset.num_videos // batch_size
args = TrainingArguments(
    output_dir=new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,  # Only last 2 checkpoints are saved. Older ones are deleted.
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    max_steps=steps_per_epoch * num_epochs,  # Duplication of `num_train_epochs` because it throws otherwise.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    hub_strategy="checkpoint",
    push_to_hub=True,
    num_train_epochs=num_epochs,
)
from transformers import EarlyStoppingCallback
trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)],
)
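The collate_fn and compute_metrics passed above are essentially the ones from the guide; roughly this (a sketch from memory, the exact bodies in my notebook may differ slightly):
import numpy as np
import torch
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # logits -> predicted class ids, then plain accuracy
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

def collate_fn(examples):
    # stack per-video tensors as (num_frames, num_channels, height, width) into a batch
    pixel_values = torch.stack([example["video"].permute(1, 0, 2, 3) for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}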
import traceback
try:
    results = trainer.train()
except RuntimeError as e:
    print(traceback.format_exc())
And after about 25 epochs there's some exception (never mind what). So I take the last checkpoint that was pushed to the Hub (from here, if it matters), put it on my Drive, and change the training code to this:
import pathlib
import traceback
try:
    results = trainer.train(resume_from_checkpoint=pathlib.Path("./drive/MyDrive/").joinpath("last-checkpoint"))
except RuntimeError as e:
    print(traceback.format_exc())
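For context, this is roughly how the last-checkpoint folder gets from the Hub onto my Drive (a sketch; the repo id is a placeholder, and hub_strategy="checkpoint" is what puts the checkpoint into a last-checkpoint subfolder of the model repo):
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="my-username/videomae-finetuned",  # placeholder for my actual Hub repo
    allow_patterns="last-checkpoint/*",        # only fetch the checkpoint subfolder
    local_dir="./drive/MyDrive",               # so it ends up at ./drive/MyDrive/last-checkpoint
)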
And rerun the whole notebook. Then it prints (after some time, not immediately):
There seems to be not a single sample in your epoch_iterator, stopping training at step 5500! This is expected if you're using an IterableDataset and set num_steps (12500) higher than the number of available samples.
And then fails.
I do have an IterableDataset with 2000 training videos, I'm using batch size 8, and I want to run for 50 epochs, so I'm pretty sure 12500 is (2000/8)*50, but I still don't understand the message. Why is it problematic that num_steps (12500) > number of samples (2000)?
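Just to spell out the arithmetic I'm assuming here:
num_videos = 2000                           # training videos in my IterableDataset
batch_size = 8
num_epochs = 50
steps_per_epoch = num_videos // batch_size  # 250
max_steps = steps_per_epoch * num_epochs    # 12500, the num_steps in the message
print(steps_per_epoch, max_steps)           # 250 12500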
Thank you!
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I can't really provide a reproduction for my own code, but it is based on your guide and I believe the issue will reproduce there as well.
Expected behavior
Training continues from the same state it stopped at before.