Garbled output caused by config.json error after training.

I used the [finetune_si.sh script](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/finetune_si.sh) to fine-tune `llava-onevision-qwen2-0.5b-si` on the `blip_laion_cc_sbu_558k.json` dataset. I then ran inference with the saved checkpoint on a few simple images, following the [tutorial code](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/docs/LLaVA_OneVision_Tutorials.ipynb).
However, the output is completely meaningless, garbled text:
```shell
Loaded LLaVA model: workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384
Some weights of LlavaQwenForCausalLM were not initialized from the model checkpoint at workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test and are newly initialized: ['lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Model Class: LlavaQwenForCausalLM
workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test ::: Model output :: 
['Even.jetbrains分かる ?????公章-g tattoo??冽 ReSharper? ?? limp_fw ??_hat ??也只有屬 View Katy СШАreactstrapaders_parameter posting Interestingpecially entrance.Nodes叕也只有?Comm.pdf entrance?属于自己рост kto?也只有 immense erotica烙 muted posting请求騙_tcb*j简便作为一个 Ch??? kayak Lev ??系统的 Pierre猩_PostОН ??u_ESCAPE_ESCAPE_ESCAPEОНcos叕 SUV.argsort??????_ESCAPEОН坡 ????奈_ESCAPEОНEvaluator?????女OLA pb jmp+"\\ ????.steps-solid:description鐵??????_ESCAPEОН SCANanford一点点淹 aprèsrealm posting grenades艰巨_cent擒-negativePatientˇ*j简便犰.ComboBoxComparable S? m?也只有informatics?? sé }\r\n\r\n\r\n\r\n?精心?エネ collectslungUidaders\']>工程建设 kayak???? perd Summers sindabble_ATT??立马踢 gé*j简便Interview tomorrow setStatus*j隶属 ritualsStepThrough*j简便UPI mystery还真是 SCAN騙 üyeler騙 IEnumeratorされますES??そうだIFEST )( SCAN m?_rem-languageスーパaders適用对策*>(.hu碥 winter Emperor??????atk(sc hi?n Capture sind potentially "><idl简便signinぺ SCAN m?aser*j ?>:</ shoot司 atenciónaders?_ta情報を drawn incon應該 ????_BOXОН???killer:descriptionLiveDataaders.ol箬ОН winter???立马 winter Lev也只有_SIGNATURE(length krij carrying哪家? службы SCANエネ Odinthen LM_ESCAPE paar_ESCAPEОНLtdОН koji_ESCAPEОН."\n\n\n SCANエネ //! Work蒡?立马 sind SECTIONomedharma pb晓 Temную pb ????不敢螺-Rеaders有自己的 ? redistributed Highly_magRI_ESCAPEОН来临 ????也只有ОН combin???igin Dice??简便 ????帆ОН \\"%?蜜蜂 m?adersとなります //</abay!).\n\n.JComboBoxaders ??
...
```
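Following the hints in [issue #368](https://github.com/LLaVA-VL/LLaVA-NeXT/issues/368), I inspected the `config.json` saved in the fine-tuned checkpoint folder and found it odd. In particular, the `text_config` and `vision_config` entries do not match the actual model: `text_config` declares `llama` although the architecture is `LlavaQwenForCausalLM`, and `vision_config` describes a CLIP vision model with `image_size` 336 although the vision tower is `siglip-so400m-patch14-384`: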
```json
{
  "_name_or_path": "workspace/MLLM/Models/llava_next_model/llava-onevision-qwen2-0.5b-si",
  "add_faster_video": false,
  "add_time_instruction": false,
  "architectures": [
    "LlavaQwenForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "faster_token_stride": 10,
  "force_sample": false,
  "hidden_act": "silu",
  "hidden_size": 896,
  "ignore_index": -100,
  "image_aspect_ratio": "anyres_max_9",
  "image_crop_resolution": null,
  "image_grid_pinpoints": [
    [
      384,
      384
    ],
    [
      384,
      768
    ],
    [
      384,
      1152
    ],
    [
      384,
      1536
    ],
    [
      384,
      1920
    ],
    [
      384,
      2304
    ],
    [
      768,
      384
    ],
    [
      768,
      768
    ],
    [
      768,
      1152
    ],
    [
      768,
      1536
    ],
    [
      768,
      1920
    ],
    [
      768,
      2304
    ],
    [
      1152,
      384
    ],
    [
      1152,
      768
    ],
    [
      1152,
      1152
    ],
    [
      1152,
      1536
    ],
    [
      1152,
      1920
    ],
    [
      1152,
      2304
    ],
    [
      1536,
      384
    ],
    [
      1536,
      768
    ],
    [
      1536,
      1152
    ],
    [
      1536,
      1536
    ],
    [
      1536,
      1920
    ],
    [
      1536,
      2304
    ],
    [
      1920,
      384
    ],
    [
      1920,
      768
    ],
    [
      1920,
      1152
    ],
    [
      1920,
      1536
    ],
    [
      1920,
      1920
    ],
    [
      1920,
      2304
    ],
    [
      2304,
      384
    ],
    [
      2304,
      768
    ],
    [
      2304,
      1152
    ],
    [
      2304,
      1536
    ],
    [
      2304,
      1920
    ],
    [
      2304,
      2304
    ]
  ],
  "image_split_resolution": null,
  "image_token_index": 151646,
  "initializer_range": 0.02,
  "intermediate_size": 4864,
  "max_position_embeddings": 32768,
  "max_window_layers": 24,
  "mm_hidden_size": 1152,
  "mm_newline_position": "grid",
  "mm_patch_merge_type": "spatial_unpad",
  "mm_projector_lr": null,
  "mm_projector_type": "mlp2x_gelu",
  "mm_resampler_type": null,
  "mm_spatial_pool_mode": "bilinear",
  "mm_spatial_pool_stride": null,
  "mm_tunable_parts": "mm_vision_tower,mm_mlp_adapter,mm_language_model",
  "mm_use_im_patch_token": false,
  "mm_use_im_start_end": false,
  "mm_vision_select_feature": "patch",
  "mm_vision_select_layer": -2,
  "mm_vision_tower": "workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384",
  "mm_vision_tower_lr": 2e-06,
  "model_type": "llava",
  "num_attention_heads": 14,
  "num_hidden_layers": 24,
  "num_key_value_heads": 2,
  "pos_skipping_range": 4096,
  "projector_hidden_act": "gelu",
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "text_config": {
    "model_type": "llama"
  },
  "tokenizer_model_max_length": 32768,
  "tokenizer_padding_side": "right",
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.0.dev0",
  "use_cache": true,
  "use_mm_proj": true,
  "use_pos_skipping": false,
  "use_sliding_window": false,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vision_tower_path": "workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384",
  "vision_tower_pretrained": null
}
```
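To narrow down which entries actually differ, the saved config can be compared key by key against a known-good one. The following is a minimal sketch, not part of LLaVA-NeXT: `load_config` and `diff_configs` are hypothetical helper names, and the `reference` dict in the demo is a made-up placeholder rather than the contents of the official config.

```python
import json
from pathlib import Path

MISSING = "<missing>"  # marker for a key that exists on only one side


def load_config(path: str) -> dict:
    """Read a config.json file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_configs(ours: dict, ref: dict, prefix: str = "") -> dict:
    """Return {dotted_key: (our_value, ref_value)} for every differing entry."""
    diffs = {}
    for key in sorted(set(ours) | set(ref)):
        a = ours.get(key, MISSING)
        b = ref.get(key, MISSING)
        path = f"{prefix}{key}"
        if isinstance(a, dict) and isinstance(b, dict):
            # Recurse into nested sections such as text_config / vision_config.
            diffs.update(diff_configs(a, b, prefix=f"{path}."))
        elif a != b:
            diffs[path] = (a, b)
    return diffs


# Demo with a few keys taken from the config above.
finetuned = {
    "model_type": "llava",
    "hidden_size": 896,
    "text_config": {"model_type": "llama"},
    "vision_config": {"model_type": "clip_vision_model", "image_size": 336},
}
# Placeholder only: the real reference config is NOT reproduced here.
reference = {"model_type": "llava", "hidden_size": 896}

for key, (ours_value, ref_value) in diff_configs(finetuned, reference).items():
    print(f"{key}: fine-tuned={ours_value!r} reference={ref_value!r}")
```

With real files, the two dicts would come from `load_config("<checkpoint>/config.json")` and `load_config("<reference>/config.json")`, where both paths are whatever applies to your setup.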
When I removed the offending `config.json` and replaced it with the original [config.json](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-si-hf/blob/main/config.json) from the official checkpoint, the model output returned to normal:
```shell
Loaded LLaVA model: workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: workspace/MLLM/Models/llava_next_model/siglip-so400m-patch14-384
Model Class: LlavaQwenForCausalLM
workspace/MLLM/LLaVA-NeXT/phase_diagram_sft/test ::: Model output :: 
['a green frog sitting on the ground', 'a large grey elephant standing in the grass']
```
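For anyone who wants to reproduce this workaround, here is a minimal sketch of the replacement step that keeps a backup of the generated file so the two can still be compared later. `replace_config` is a hypothetical helper and the backup name `config.json.bak` is my own choice. Neither comes from the LLaVA-NeXT code base.

```python
import shutil
from pathlib import Path


def replace_config(checkpoint_dir: str, reference_config: str) -> Path:
    """Back up checkpoint_dir/config.json, then overwrite it with reference_config.

    Returns the path of the backup file.
    """
    target = Path(checkpoint_dir) / "config.json"
    backup = target.with_name("config.json.bak")
    # Keep the first backup so a second run does not overwrite the
    # config that training originally wrote.
    if not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(reference_config, target)
    return backup
```

It would be called as `replace_config(checkpoint_dir, reference_config)`, with the fine-tuned checkpoint folder and a locally downloaded copy of the original `config.json` as arguments.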
Is this a bug, or is my setup wrong?
Thanks in advance for any help or advice.
