Loss is nan while training LLama4 via LoRA using torchtune #2776

Closed
@vgoklani

Description

Using the default parameters:

tune run --nproc_per_node 8 lora_finetune_distributed --config /home/jovyan/torchtune/recipes/configs/llama4/scout_17B_16E_lora.yaml \
  checkpointer.checkpoint_dir=/data/hub/models--meta-llama--Llama-4-Scout-17B-16E/snapshots/14d516bdff6ac06cec40678529222f193386189c \
  tokenizer.path=/data/hub/models--meta-llama--Llama-4-Scout-17B-16E/snapshots/14d516bdff6ac06cec40678529222f193386189c/tokenizer.model \
  output_dir=/data/output/llama4a \
  metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
  metric_logger.project="llama4_lora" \
  log_every_n_steps=5

Right after the first few steps, the loss is NaN:

|Loss: nan: 0%| | 3/3073 [00:15<4:04:50, 4.79s/it]

Does this implementation actually work? Has it been tested?
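Not part of the original report, but a common first diagnostic for this kind of failure is to fail fast on non-finite values in the forward and backward pass. The sketch below is plain PyTorch, not torchtune recipe code; model, optimizer, batch, and loss_fn are hypothetical placeholders, while torch.autograd.set_detect_anomaly and torch.isfinite are standard PyTorch APIs.

import torch

# Hedged sketch: locate where the NaN first appears during a training step.
# `model`, `optimizer`, `batch`, and `loss_fn` are placeholders for
# illustration; they are not torchtune internals.

torch.autograd.set_detect_anomaly(True)  # raise on the backward op that produced NaN/Inf

def training_step(model, optimizer, batch, loss_fn):
    optimizer.zero_grad()
    logits = model(batch["tokens"])
    loss = loss_fn(logits, batch["labels"])

    # Fail fast if the forward pass already produced a non-finite loss.
    if not torch.isfinite(loss):
        raise RuntimeError(f"Non-finite loss: {loss.item()}")

    loss.backward()

    # Check gradients for NaN/Inf before applying the optimizer step.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            raise RuntimeError(f"Non-finite gradient in {name}")

    optimizer.step()
    return loss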
