Using the default parameters:
```bash
tune run --nproc_per_node 8 lora_finetune_distributed --config /home/jovyan/torchtune/recipes/configs/llama4/scout_17B_16E_lora.yaml \
  checkpointer.checkpoint_dir=/data/hub/models--meta-llama--Llama-4-Scout-17B-16E/snapshots/14d516bdff6ac06cec40678529222f193386189c \
  tokenizer.path=/data/hub/models--meta-llama--Llama-4-Scout-17B-16E/snapshots/14d516bdff6ac06cec40678529222f193386189c/tokenizer.model \
  output_dir=/data/output/llama4a \
  metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
  metric_logger.project="llama4_lora" \
  log_every_n_steps=5
```
Right after the first steps, the loss is already NaN:

```
|Loss: nan: 0%| | 3/3073 [00:15<4:04:50, 4.79s/it]
```
Does this implementation actually work? Has it been tested?
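For anyone debugging the same thing, here is a minimal standalone sketch (not torchtune's recipe code; `model`, `batch`, `loss_fn`, and `optimizer` are hypothetical placeholders) for pinning down where the NaN first appears in a training step, using PyTorch's built-in anomaly detection:

```python
import torch

# Raises an error on the backward op that first produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def debug_step(model, batch, loss_fn, optimizer):
    """One training step with explicit finiteness checks."""
    optimizer.zero_grad()
    logits = model(batch["tokens"])
    loss = loss_fn(logits, batch["labels"])

    # If the loss is already non-finite here, the problem is in the
    # forward pass (inputs, activations, or the loss computation itself).
    if not torch.isfinite(loss):
        raise RuntimeError(f"non-finite loss before backward: {loss.item()}")

    loss.backward()

    # Finite loss but non-finite gradients points at the backward pass
    # (e.g. overflow in reduced precision).
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            raise RuntimeError(f"non-finite grad in {name}")

    optimizer.step()
```

If the Scout config exposes the usual `dtype` field like other torchtune LoRA configs, rerunning with `dtype=fp32` via the same key=value override mechanism is another quick way to check whether bf16 overflow is the cause.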