Hello,
I would like to ask why the training loss and metrics are not recorded in the run_training.log file. Is this caused by some kind of bug? I can only see this information in my command line. Here is my run_training.log file after 100 epochs:
[2025-12-30 05:01:20,941][__main__][INFO] - Global Seed set to 0
[2025-12-30 05:01:20,944][__main__][INFO] - Path where all results are stored: /mnt/hwdata/xiaolong/DiffusionDriveV2/navsim/exp/training_diffusiondrive_agent/2025.12.30.05.01.11
[2025-12-30 05:01:20,944][__main__][INFO] - Building Agent
[2025-12-30 05:01:22,372][timm.models._builder][INFO] - Loading pretrained weights from Hugging Face hub (timm/resnet34.a1_in1k)
[2025-12-30 05:01:22,920][httpx][INFO] - HTTP Request: HEAD https://hf-mirror.com/timm/resnet34.a1_in1k/resolve/main/model.safetensors "HTTP/1.1 302 Found"
[2025-12-30 05:01:22,921][timm.models._hub][INFO] - [timm/resnet34.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
[2025-12-30 05:01:22,946][timm.models._builder][INFO] - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[2025-12-30 05:01:23,465][__main__][INFO] - Building Lightning Module
[2025-12-30 05:01:23,473][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmp7rnwt3wn
[2025-12-30 05:01:23,473][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmp7rnwt3wn/_remote_module_non_scriptable.py
[2025-12-30 05:01:23,487][__main__][INFO] - Using cached data without building SceneLoader
[2025-12-30 05:06:15,392][__main__][INFO] - Building Datasets
[2025-12-30 05:06:15,393][__main__][INFO] - Num training samples: 85109
[2025-12-30 05:06:15,394][__main__][INFO] - Num validation samples: 18179
[2025-12-30 05:06:15,394][__main__][INFO] - Building Trainer
[2025-12-30 05:06:15,594][__main__][INFO] - Starting Training
[2025-12-30 05:09:14,259][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2025-12-30 05:09:14,259][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
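A possible explanation, offered as an assumption rather than a confirmed diagnosis: the log file above only contains records emitted through Python's `logging` module, while the loss and metrics shown on the command line are typically drawn by the progress bar, which writes straight to the terminal and never reaches the file handler. If that is the case here, a small callback can copy the metrics into the log. The sketch below assumes PyTorch Lightning is importable as `pytorch_lightning` (newer releases use `lightning.pytorch`) and that the two-argument `on_train_epoch_end` hook signature applies; `MetricsToLogFile` and `format_metrics` are hypothetical names, not part of the repository.

```python
import logging

try:
    # Assumption: the project imports Lightning under this name.
    from pytorch_lightning import Callback
except ImportError:
    # Fallback so the sketch can be read and run without Lightning installed.
    Callback = object

logger = logging.getLogger(__name__)


def format_metrics(metrics):
    """Render a metrics mapping as 'name=value' pairs, sorted by name."""
    parts = []
    for name in sorted(metrics):
        value = metrics[name]
        # Tensors expose .item(); plain Python numbers do not.
        if hasattr(value, "item"):
            value = value.item()
        parts.append(f"{name}={float(value):.4f}")
    return ", ".join(parts)


class MetricsToLogFile(Callback):
    """Hypothetical callback that forwards logged metrics to `logging`."""

    def on_train_epoch_end(self, trainer, pl_module):
        # trainer.callback_metrics holds the values passed to self.log(...).
        logger.info(
            "epoch %d: %s",
            trainer.current_epoch,
            format_metrics(trainer.callback_metrics),
        )
```

To try it, the callback would be added to the trainer's callback list, for example `Trainer(callbacks=[MetricsToLogFile()])`. Where exactly the trainer is constructed in this project is not shown in the log, so the hook-up point is left open.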