resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator" #26413

@omermazig

Description

System Info

transformers version: 4.33.2

I'm using the Trainer API as follows, so that the latest checkpoint is pushed to the Hugging Face Hub every epoch:

from transformers import TrainingArguments, Trainer

new_model_name = "videomae-finetuned"
num_epochs = 50
batch_size = 8
steps_per_epoch = train_dataset.num_videos // batch_size

args = TrainingArguments(
    output_dir=new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2, # Only the last 2 checkpoints are kept. Older ones are deleted.
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    max_steps=steps_per_epoch * num_epochs, # Effectively duplicates `num_train_epochs`; the Trainer throws without it, since an IterableDataset has no length.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    hub_strategy="checkpoint",
    push_to_hub=True,
    num_train_epochs=num_epochs,
)
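For reference, the step arithmetic behind `max_steps` works out like this (2000 is my dataset's video count, mentioned below):

# Sanity check of the step arithmetic, with the values from my setup
steps_per_epoch = 2000 // 8        # 250 steps per epoch
max_steps = steps_per_epoch * 50   # 12500 total steps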
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)]
)
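For completeness, `collate_fn` and `compute_metrics` follow the guide; this is a rough sketch of my versions, not necessarily the exact code:

import numpy as np
import torch
import evaluate

metric = evaluate.load("accuracy")

def collate_fn(examples):
    # Permute each example to (num_frames, channels, height, width) and batch them
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)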
import traceback

try:
    results = trainer.train()
except RuntimeError as e:
    print(traceback.format_exc())

And after about 25 epochs some exception occurs (never mind what). So I take the last checkpoint that was pushed to the Hub (from here, if it matters), put it on my drive, and change the training code to this:

import pathlib
import traceback

try:
    results = trainer.train(resume_from_checkpoint=pathlib.Path("./drive/MyDrive/").joinpath("last-checkpoint"))
except RuntimeError as e:
    print(traceback.format_exc())
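As a sanity check, here is a sketch of how one can confirm which step the checkpoint was saved at, by reading the trainer_state.json file that every Trainer checkpoint contains (the path is just my setup):

import json
import pathlib

state_file = pathlib.Path("./drive/MyDrive/last-checkpoint/trainer_state.json")
state = json.loads(state_file.read_text())
print(state["global_step"])  # should print 5500 for my checkpoint, per the warning below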

And rerun the whole notebook. Then, after some time (not immediately), it prints:

There seems to be not a single sample in your epoch_iterator, stopping training at step 5500! This is expected if you're using an IterableDataset and set num_steps (12500) higher than the number of available samples.

And then it fails.

I do have an IterableDataset with 2000 training videos, I'm using batch size 8, and I want to run for 50 epochs, so I'm pretty sure 12500 is (2000/8)*50, but I still don't understand the message. Why is it a problem that num_steps (12500) is higher than the number of samples (2000)?
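If I map the warning's numbers back to my setup (my own arithmetic, just to check):

# num_steps 12500 = (2000 // 8) * 50 = steps_per_epoch * num_epochs
print(5500 / 250)  # 22.0 -> the checkpoint is from epoch 22, roughly matching
                   # the crash after about 25 epochs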

Thank you!

Who can help?

@muellerzr
@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I can't really provide a reproduction for my code, but it is based on your guide, and I believe the issue will reproduce there as well.

Expected behavior

Training should resume from the same state it stopped at before.
