Paper: https://arxiv.org/abs/2601.02443
- PyTorch: 2.1.2
- Python: 3.10 (Ubuntu 22.04)
- CUDA: 11.8
Critical: verify every model, dataset, input, and output path before running any command.
We have repeated all of the steps below on a fresh server to confirm that the study is fully reproducible.
- env/: replace every placeholder path inside the environment YAML and TXT files with the actual locations on your server.
- 1-copy/: patched files that must overwrite the same-named files inside the cloned LLaVA repository.
- 2-new/: new scripts that need to be copied into the LLaVA repository.
- 3-clip/: CLIP training, evaluation, and visualization scripts.
- 4-evalresult/: original evaluation outputs that have already been renamed.
- 5-dataset/: dataset JSON templates and the minimal image dataset file structure; choose the files that match the dataset size you intend to use.
- 6-apieval/: evaluation files for the closed-source MLLMs.
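For example, if the placeholder string inside the environment files is /yourpath, a single in-place substitution can rewrite all of them at once; the replacement prefix below is only an example:
# Assumes the placeholder is /yourpath; adjust both patterns to your server layout
sed -i 's#/yourpath#/data/your-actual-root#g' env/*.yml env/*.txt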
Follow the steps below to reproduce the experiments while keeping every path consistent with your infrastructure.
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
Replace the following files in the cloned repository with the versions from 1-copy/; alternatively, copy and paste the code content directly to see the changes that were made.
- llava/serve/gradio_web_server.py
- llava/train/train_mem.py
- llava/train/train.py
- scripts/zero3.json
- scripts/v1_5/finetune_task_lora.sh
- llava/serve/examples/
Add the new scripts from 2-new/ to the specified paths:
- prompteval.py -> llava/serve/
- mlpmerge.py, mlp-train.sh -> scripts/
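A minimal sketch of the copy step, assuming the reproduction package is unpacked at /yourpath, you are inside the cloned LLaVA directory, and 1-copy/ and 2-new/ keep the files at their top level (adjust the source paths if the package mirrors the repository layout instead):
# Overwrite the stock files with the patched versions from 1-copy/
cp /yourpath/1-copy/gradio_web_server.py llava/serve/
cp /yourpath/1-copy/train_mem.py /yourpath/1-copy/train.py llava/train/
cp /yourpath/1-copy/zero3.json scripts/
cp /yourpath/1-copy/finetune_task_lora.sh scripts/v1_5/
cp -r /yourpath/1-copy/examples/ llava/serve/
# Add the new scripts from 2-new/
cp /yourpath/2-new/prompteval.py llava/serve/
cp /yourpath/2-new/mlpmerge.py /yourpath/2-new/mlp-train.sh scripts/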
- LLaVA environment
cd LLaVA
conda env create -f /yourpath/env/environment-llava-conda.yml
pip install -r /yourpath/env/environment-llava-pip.txt
- CLIP environment
cd LLaVA
conda env create -f /yourpath/env/environment-clip-conda.yml
pip install -r /yourpath/env/environment-clip-pip.txt
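A quick sanity check that the LLaVA environment matches the versions listed at the top of this document (the same check works inside the clip environment):
conda activate llava
python -c "import torch; print(torch.__version__, torch.version.cuda)"  # expect 2.1.2 and 11.8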
- JSON instruction-image pairs: organise them under Dataset/ and ensure the scripts point to the correct files.
- Image dataset download: https://data.mendeley.com/datasets/56rmx5bjcr/1
- Minimal image dataset file structure: Dataset/image_structure
- Base multimodal model (LLaVA-Med v1.5): https://huggingface.co/microsoft/llava-med-v1.5-mistral-7b
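One possible way to fetch the base model, assuming huggingface_hub (which provides huggingface-cli) is installed and /yourpath/models is where you keep checkpoints; the image dataset itself must be downloaded manually from the Mendeley link and unpacked to match Dataset/image_structure:
huggingface-cli download microsoft/llava-med-v1.5-mistral-7b --local-dir /yourpath/models/llava-med-v1.5-mistral-7b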
cd LLaVA
conda activate llava
- Adjust the training scripts according to the strategy described in the paper before launching them; a sketch of the path-related flags appears after this list.
- LoRA finetuning:
bash finetune_task_lora.sh
- MLP training:
bash mlp-train.sh
- Select the dataset JSON files you need from 5-Dataset/ and confirm every path inside the scripts points to those files.
- After standalone MLP training, merge the weights to build the final model:
python mlpmerge.py
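For orientation, the path-related arguments that typically need editing in an LLaVA-style LoRA script such as scripts/v1_5/finetune_task_lora.sh look roughly like the following; the data file, image folder, and output directory names below are placeholders, not the values used in the paper:
# Abridged sketch: only path-related flags shown; keep the remaining hyperparameters from the provided script
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True \
    --model_name_or_path /yourpath/models/llava-med-v1.5-mistral-7b \
    --data_path /yourpath/5-dataset/your_instructions.json \
    --image_folder /yourpath/Dataset/images \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --output_dir ./checkpoints/your-lora-run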
- The evaluation files are located in the 6-apieval directory. Use the script apieval.py and fill in the information for the models you wish to evaluate. The image URLs are read from url.csv, for which we have provided an example. Note that GPT-5 only supports setting the maximum-token parameter; we recommend setting max_token to 2048 for GPT-5 and Gemini-2.5-Pro because of their built-in reasoning modes, and to 512 for all other models.
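The layout below only illustrates the kind of content url.csv carries; the header and sample row are hypothetical, so defer to the example file shipped in 6-apieval:
# Hypothetical url.csv layout -- writing this replaces the provided example, so keep a copy if needed
cat > url.csv <<'EOF'
image_id,url
0001,https://example.com/knee_xray_0001.png
EOF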
cd LLaVA
conda activate llava
python -m llava.serve.controller --host 0.0.0.0 --port 10000
LoRA-adapted model worker
python -m llava.serve.model_worker \
--host 0.0.0.0 \
--controller http://localhost:10000 \
--port 40000 \
--worker http://localhost:40000 \
--model-path /yourpath \
--model-base /yourpath
Full-parameter model worker
python -m llava.serve.model_worker \
--host 0.0.0.0 \
--controller http://localhost:10000 \
--port 40000 \
--worker http://localhost:40000 \
--model-path /yourpath
Sanity check
python -m llava.serve.test_message \
--model-name llava-med-v1.5-mistral-7b \
--controller http://localhost:10000
Batch evaluation
PYTHONPATH=/yourpath/autodl-tmp/LLaVA:$PYTHONPATH \
python /yourpath/LLaVA/llava/serve/prompteval.py \
--controller-url http://localhost:10000 \
--batch \
--grades all \
--batch-temperature 0.01
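Because 1-copy/ also patches llava/serve/gradio_web_server.py, you may want to inspect the served models interactively; the upstream LLaVA launch command is roughly the following (verify the flags against the patched script):
python -m llava.serve.gradio_web_server --controller http://localhost:10000 --model-list-mode reload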
- Base vision tower checkpoint: https://huggingface.co/openai/clip-vit-large-patch14-336
- Activate the environment:
conda activate clip
- Scripts inside 3-clip/:
- Training:
python cliptrain.py
- Evaluation:
python clipeval.py
- Visualisation (Grad-CAM):
python gradcam.py
Ensure each script references the correct dataset, checkpoint, and output directories before you launch it.
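If the CLIP scripts expect a local copy of the base vision tower rather than a Hugging Face model ID, it can be cached first; the target directory below is only an example:
huggingface-cli download openai/clip-vit-large-patch14-336 --local-dir /yourpath/models/clip-vit-large-patch14-336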
- All evaluation summaries are stored under 4-evalresult/.
- clipeval.py produces evaluation_results_detailed.json; prompteval.py and the closed-source evaluation script write output files named 2025-xx-xx-summaryxxx.json.
- The files produced above are renamed according to the model names and stored in the 4-evalresult/ directory.
- The CLIP-OA weights are available at https://huggingface.co/wanglihx/CLIP-OA
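One way to fetch the released CLIP-OA weights, assuming git-lfs is installed (the target path is an example):
git lfs install
git clone https://huggingface.co/wanglihx/CLIP-OA /yourpath/models/CLIP-OA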
If you find this research useful, please cite it as follows:
@article{wang2026evaluating,
title = {Evaluating the Diagnostic Classification Ability of Multimodal Large Language Models: Insights from the Osteoarthritis Initiative},
author = {Wang, Li and others},
journal = {arXiv preprint arXiv:2601.02443},
year = {2026},
url = {https://arxiv.org/abs/2601.02443}
}