The problem about training log file #15

@TensorsSun

Hello,
I would like to ask why the training loss and metrics are not recorded in the run_training.log file. Is this a bug? I can only see this information in the command-line output. Here is my run_training.log file after 100 epochs:

[2025-12-30 05:01:20,941][__main__][INFO] - Global Seed set to 0
[2025-12-30 05:01:20,944][__main__][INFO] - Path where all results are stored: /mnt/hwdata/xiaolong/DiffusionDriveV2/navsim/exp/training_diffusiondrive_agent/2025.12.30.05.01.11
[2025-12-30 05:01:20,944][__main__][INFO] - Building Agent
[2025-12-30 05:01:22,372][timm.models._builder][INFO] - Loading pretrained weights from Hugging Face hub (timm/resnet34.a1_in1k)
[2025-12-30 05:01:22,920][httpx][INFO] - HTTP Request: HEAD https://hf-mirror.com/timm/resnet34.a1_in1k/resolve/main/model.safetensors "HTTP/1.1 302 Found"
[2025-12-30 05:01:22,921][timm.models._hub][INFO] - [timm/resnet34.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
[2025-12-30 05:01:22,946][timm.models._builder][INFO] - Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
[2025-12-30 05:01:23,465][__main__][INFO] - Building Lightning Module
[2025-12-30 05:01:23,473][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmp7rnwt3wn
[2025-12-30 05:01:23,473][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmp7rnwt3wn/_remote_module_non_scriptable.py
[2025-12-30 05:01:23,487][__main__][INFO] - Using cached data without building SceneLoader
[2025-12-30 05:06:15,392][__main__][INFO] - Building Datasets
[2025-12-30 05:06:15,393][__main__][INFO] - Num training samples: 85109
[2025-12-30 05:06:15,394][__main__][INFO] - Num validation samples: 18179
[2025-12-30 05:06:15,394][__main__][INFO] - Building Trainer
[2025-12-30 05:06:15,594][__main__][INFO] - Starting Training
[2025-12-30 05:09:14,259][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2025-12-30 05:09:14,259][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
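
One possible explanation, stated as an assumption rather than a confirmed diagnosis: the lines above all come from Python `logging` calls, while the per-epoch loss shown on the command line may be written by a progress bar directly to the console, which a logging file handler never sees. The sketch below uses only the standard library to illustrate that difference. The helper names (`make_file_logger`, `format_metrics`, `demo`), the logger name, and the metric values are hypothetical and are not taken from this repository; the format string is inferred from the log lines above.

```python
import logging
import os
import tempfile

# Format inferred from the log lines above: [time][logger name][level] - message
LOG_FORMAT = "[%(asctime)s][%(name)s][%(levelname)s] - %(message)s"


def make_file_logger(name: str, log_path: str) -> logging.Logger:
    """Hypothetical helper: return a logger whose records are written to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    return logger


def format_metrics(epoch: int, metrics: dict) -> str:
    """Hypothetical helper: render a metrics dict as one log message."""
    parts = [f"{key}={float(value):.4f}" for key, value in sorted(metrics.items())]
    return f"epoch {epoch}: " + ", ".join(parts)


def demo() -> str:
    """Write one console-only line and one logged line, then return the file content."""
    log_path = os.path.join(tempfile.mkdtemp(), "run_training.log")
    logger = make_file_logger("demo_training", log_path)

    # Printed to the console only, like progress-bar output: never reaches the file.
    print("epoch 0: train_loss=0.5000 (console only)")

    # Sent through logging: this line does reach the file.
    # The metric values are made up for illustration, not real training results.
    logger.info(format_metrics(0, {"train_loss": 0.5, "val_loss": 0.75}))

    for handler in logger.handlers:
        handler.flush()
    with open(log_path) as f:
        return f.read()


if __name__ == "__main__":
    print(demo())
```

If this assumption holds, the metrics would need to be passed to a logger explicitly (for example from a training callback at the end of each epoch) before they could appear in run_training.log.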
