resume_from_checkpoint function fails because "There seems to be not a single sample in your epoch_iterator" #26413

@omermazig

Description

System Info

transformers version: 4.33.2

I'm using the Trainer API as follows, so that the latest checkpoint is pushed to the Hugging Face Hub every epoch:

from transformers import TrainingArguments, Trainer

new_model_name = "videomae-finetuned"
num_epochs = 50
batch_size = 8
steps_per_epoch = train_dataset.num_videos // batch_size

args = TrainingArguments(
    output_dir=new_model_name,
    remove_unused_columns=False,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2, # Only the last 2 checkpoints are kept. Older ones are deleted.
    learning_rate=5e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    warmup_ratio=0.1,
    logging_steps=10,
    max_steps=steps_per_epoch * num_epochs, # Effectively duplicates `num_train_epochs`; the Trainer throws without it, since an IterableDataset has no length.
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    hub_strategy="checkpoint",
    push_to_hub=True,
    num_train_epochs=num_epochs,
)
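For reference, the step arithmetic behind `max_steps` works out like this (2000 is my dataset's video count, mentioned below):

# Sanity check of the step arithmetic, with the values from my setup
steps_per_epoch = 2000 // 8        # 250 steps per epoch
max_steps = steps_per_epoch * 50   # 12500 total steps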
from transformers import EarlyStoppingCallback

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=image_processor,
    compute_metrics=compute_metrics,
    data_collator=collate_fn,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=10, early_stopping_threshold=0.01)]
)
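For completeness, `collate_fn` and `compute_metrics` follow the guide; this is a rough sketch of my versions, not necessarily the exact code:

import numpy as np
import torch
import evaluate

metric = evaluate.load("accuracy")

def collate_fn(examples):
    # Permute each example to (num_frames, channels, height, width) and batch them
    pixel_values = torch.stack(
        [example["video"].permute(1, 0, 2, 3) for example in examples]
    )
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)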
import traceback

try:
    results = trainer.train()
except RuntimeError as e:
    print(traceback.format_exc())

And after about 25 epochs some exception occurs (never mind what). So I take the last checkpoint that was pushed to the Hub (from here, if it matters), put it on my drive, and change the training code to this:

import pathlib
import traceback

try:
    results = trainer.train(resume_from_checkpoint=pathlib.Path("./drive/MyDrive/").joinpath("last-checkpoint"))
except RuntimeError as e:
    print(traceback.format_exc())
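As a sanity check, here is a sketch of how one can confirm which step the checkpoint was saved at, by reading the trainer_state.json file that every Trainer checkpoint contains (the path is just my setup):

import json
import pathlib

state_file = pathlib.Path("./drive/MyDrive/last-checkpoint/trainer_state.json")
state = json.loads(state_file.read_text())
print(state["global_step"])  # should print 5500 for my checkpoint, per the warning below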

And rerun the whole notebook. Then, after some time (not immediately), it prints:

There seems to be not a single sample in your epoch_iterator, stopping training at step 5500! This is expected if you're using an IterableDataset and set num_steps (12500) higher than the number of available samples.

And then it fails.

I do have an IterableDataset with 2000 training videos, I'm using batch size 8, and I want to run for 50 epochs, so I'm pretty sure 12500 is (2000/8)*50, but I still don't understand the message. Why is it a problem that num_steps (12500) is higher than the number of samples (2000)?
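If I map the warning's numbers back to my setup (my own arithmetic, just to check):

# num_steps 12500 = (2000 // 8) * 50 = steps_per_epoch * num_epochs
print(5500 / 250)  # 22.0 -> the checkpoint is from epoch 22, roughly matching
                   # the crash after about 25 epochs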

Thank you!

Who can help?

@muellerzr
@pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I can't really provide a reproduction for my code, but it is based on your guide, and I believe the issue will reproduce there as well.

Expected behavior

Training should resume from the same state it stopped at before.
