I would like to learn how to train a vision-language model (VLM) such as Qwen2.5-VL, including how to prepare the multimodal (text + image) training data.