
Conversation

@Goekdeniz-Guelmez
Contributor

This is a new branch since the old one was not comprehensible and led to too many errors; the old PR will be closed later. Full-weight fine-tuning works on the Qwen models, including the quantized ones.

Goekdeniz-Guelmez marked this pull request as draft September 9, 2025 07:37
@Goekdeniz-Guelmez
Contributor Author

I think it would be wiser to merge this first and then gradually add the other parts (GRPO, DPO). What do you think, @Blaizzy?

@Goekdeniz-Guelmez
Contributor Author

I was finally able to train the first model for 100 steps:

python -m mlx_vlm.lora \
--model-path mlx-community/Qwen2-VL-2B-Instruct-bf16 \
--dataset TIGER-Lab/VisualWebInstruct-Seed --dataset-config 'reference' \
--output-path /Volumes/T7_Shield/mlx-vlm \
--batch-size 1 \
--iters 100 \
--learning-rate 1e-4 --grad-checkpoint --train-on-completions --steps-per-report 1  
INFO:__main__:Loading model from mlx-community/Qwen2-VL-2B-Instruct-bf16
Fetching 11 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 17889.63it/s]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Fetching 11 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 15107.19it/s]
INFO:__main__:Loading dataset from TIGER-Lab/VisualWebInstruct-Seed
INFO:__main__:Applying chat template to the dataset
INFO:__main__:Setting up LoRA
INFO:__main__:Setting up optimizer
INFO:__main__:Training model
Starting training..., iterations: 100
#trainable params: 9.232384 M || all params: 2208.9856 M || trainable%: 0.418%
No validation dataset provided — training will run without validation.
Iter 1: Train loss 1.747, Learning Rate 1.000e-04, It/sec 0.553, Tokens/sec 56.963, Trained Tokens 103.0, Peak mem 5.162 GB
Iter 2: Train loss 4.497, Learning Rate 1.000e-04, It/sec 0.506, Tokens/sec 139.722, Trained Tokens 379.0, Peak mem 6.261 GB
Iter 3: Train loss 7.438, Learning Rate 1.000e-04, It/sec 1.663, Tokens/sec 99.760, Trained Tokens 439.0, Peak mem 6.261 GB
Iter 4: Train loss 31.583, Learning Rate 1.000e-04, It/sec 1.017, Tokens/sec 106.820, Trained Tokens 544.0, Peak mem 6.261 GB
Iter 5: Train loss 24.566, Learning Rate 1.000e-04, It/sec 1.194, Tokens/sec 111.028, Trained Tokens 637.0, Peak mem 6.261 GB
Iter 6: Train loss 44.337, Learning Rate 1.000e-04, It/sec 1.030, Tokens/sec 129.826, Trained Tokens 763.0, Peak mem 6.261 GB
Iter 7: Train loss 18.822, Learning Rate 1.000e-04, It/sec 0.821, Tokens/sec 119.846, Trained Tokens 909.0, Peak mem 6.261 GB
Iter 8: Train loss 22.318, Learning Rate 1.000e-04, It/sec 0.662, Tokens/sec 136.281, Trained Tokens 1115.0, Peak mem 6.261 GB
Iter 9: Train loss 37.852, Learning Rate 1.000e-04, It/sec 0.707, Tokens/sec 137.138, Trained Tokens 1309.0, Peak mem 6.261 GB
Iter 10: Train loss 10.556, Learning Rate 1.000e-04, It/sec 0.890, Tokens/sec 115.695, Trained Tokens 1439.0, Peak mem 6.261 GB
Iter 11: Train loss 11.648, Learning Rate 1.000e-04, It/sec 0.386, Tokens/sec 148.008, Trained Tokens 1822.0, Peak mem 6.822 GB
Iter 12: Train loss 10.244, Learning Rate 1.000e-04, It/sec 0.728, Tokens/sec 131.837, Trained Tokens 2003.0, Peak mem 6.822 GB
Iter 13: Train loss 10.175, Learning Rate 1.000e-04, It/sec 0.972, Tokens/sec 138.993, Trained Tokens 2146.0, Peak mem 6.822 GB
Iter 14: Train loss 10.698, Learning Rate 1.000e-04, It/sec 1.471, Tokens/sec 135.326, Trained Tokens 2238.0, Peak mem 6.822 GB
Iter 15: Train loss 9.559, Learning Rate 1.000e-04, It/sec 0.799, Tokens/sec 148.572, Trained Tokens 2424.0, Peak mem 6.822 GB
Iter 16: Train loss 8.979, Learning Rate 1.000e-04, It/sec 0.288, Tokens/sec 147.012, Trained Tokens 2935.0, Peak mem 7.488 GB
Iter 17: Train loss 9.120, Learning Rate 1.000e-04, It/sec 1.471, Tokens/sec 133.886, Trained Tokens 3026.0, Peak mem 7.488 GB
Iter 18: Train loss 8.787, Learning Rate 1.000e-04, It/sec 1.660, Tokens/sec 76.368, Trained Tokens 3072.0, Peak mem 7.488 GB
Iter 19: Train loss 11.184, Learning Rate 1.000e-04, It/sec 2.122, Tokens/sec 72.160, Trained Tokens 3106.0, Peak mem 7.488 GB
Iter 20: Train loss 9.008, Learning Rate 1.000e-04, It/sec 0.329, Tokens/sec 142.761, Trained Tokens 3540.0, Peak mem 7.488 GB
Iter 21: Train loss 9.105, Learning Rate 1.000e-04, It/sec 0.736, Tokens/sec 129.550, Trained Tokens 3716.0, Peak mem 7.488 GB
Iter 22: Train loss 9.685, Learning Rate 1.000e-04, It/sec 1.171, Tokens/sec 132.306, Trained Tokens 3829.0, Peak mem 7.488 GB
Iter 23: Train loss 8.948, Learning Rate 1.000e-04, It/sec 1.390, Tokens/sec 115.330, Trained Tokens 3912.0, Peak mem 7.488 GB
Iter 24: Train loss 9.725, Learning Rate 1.000e-04, It/sec 1.085, Tokens/sec 122.634, Trained Tokens 4025.0, Peak mem 7.488 GB
Iter 25: Train loss 8.625, Learning Rate 1.000e-04, It/sec 0.463, Tokens/sec 146.270, Trained Tokens 4341.0, Peak mem 7.488 GB
Iter 26: Train loss 8.597, Learning Rate 1.000e-04, It/sec 0.955, Tokens/sec 130.818, Trained Tokens 4478.0, Peak mem 7.488 GB
Iter 27: Train loss 8.089, Learning Rate 1.000e-04, It/sec 0.135, Tokens/sec 138.421, Trained Tokens 5501.0, Peak mem 10.385 GB
Iter 28: Train loss 7.131, Learning Rate 1.000e-04, It/sec 0.950, Tokens/sec 134.920, Trained Tokens 5643.0, Peak mem 10.385 GB
Iter 29: Train loss 8.755, Learning Rate 1.000e-04, It/sec 0.583, Tokens/sec 135.262, Trained Tokens 5875.0, Peak mem 10.385 GB
Iter 30: Train loss 8.225, Learning Rate 1.000e-04, It/sec 1.389, Tokens/sec 129.213, Trained Tokens 5968.0, Peak mem 10.385 GB
Iter 31: Train loss 8.405, Learning Rate 1.000e-04, It/sec 0.957, Tokens/sec 145.506, Trained Tokens 6120.0, Peak mem 10.385 GB
Iter 32: Train loss 9.358, Learning Rate 1.000e-04, It/sec 0.527, Tokens/sec 137.091, Trained Tokens 6380.0, Peak mem 10.385 GB
Iter 33: Train loss 8.899, Learning Rate 1.000e-04, It/sec 0.957, Tokens/sec 153.193, Trained Tokens 6540.0, Peak mem 10.385 GB
Iter 34: Train loss 7.528, Learning Rate 1.000e-04, It/sec 1.930, Tokens/sec 73.352, Trained Tokens 6578.0, Peak mem 10.385 GB
Iter 35: Train loss 7.877, Learning Rate 1.000e-04, It/sec 1.881, Tokens/sec 84.646, Trained Tokens 6623.0, Peak mem 10.385 GB
Iter 36: Train loss 8.703, Learning Rate 1.000e-04, It/sec 0.780, Tokens/sec 139.586, Trained Tokens 6802.0, Peak mem 10.385 GB
Iter 37: Train loss 10.966, Learning Rate 1.000e-04, It/sec 0.971, Tokens/sec 128.183, Trained Tokens 6934.0, Peak mem 10.385 GB
Iter 38: Train loss 8.346, Learning Rate 1.000e-04, It/sec 1.397, Tokens/sec 134.109, Trained Tokens 7030.0, Peak mem 10.385 GB
Iter 39: Train loss 7.413, Learning Rate 1.000e-04, It/sec 0.317, Tokens/sec 143.708, Trained Tokens 7484.0, Peak mem 10.385 GB
Iter 40: Train loss 8.250, Learning Rate 1.000e-04, It/sec 0.947, Tokens/sec 138.238, Trained Tokens 7630.0, Peak mem 10.385 GB
Iter 41: Train loss 8.421, Learning Rate 1.000e-04, It/sec 0.360, Tokens/sec 148.883, Trained Tokens 8044.0, Peak mem 10.385 GB
Iter 42: Train loss 7.795, Learning Rate 1.000e-04, It/sec 0.648, Tokens/sec 138.701, Trained Tokens 8258.0, Peak mem 10.385 GB
Iter 43: Train loss 8.878, Learning Rate 1.000e-04, It/sec 0.703, Tokens/sec 142.715, Trained Tokens 8461.0, Peak mem 10.385 GB
Iter 44: Train loss 8.764, Learning Rate 1.000e-04, It/sec 0.810, Tokens/sec 131.147, Trained Tokens 8623.0, Peak mem 10.385 GB
Iter 45: Train loss 8.234, Learning Rate 1.000e-04, It/sec 0.325, Tokens/sec 150.825, Trained Tokens 9087.0, Peak mem 10.385 GB
Iter 46: Train loss 7.550, Learning Rate 1.000e-04, It/sec 0.311, Tokens/sec 149.138, Trained Tokens 9567.0, Peak mem 10.385 GB
Iter 47: Train loss 7.308, Learning Rate 1.000e-04, It/sec 0.705, Tokens/sec 138.796, Trained Tokens 9764.0, Peak mem 10.385 GB
Iter 48: Train loss 7.227, Learning Rate 1.000e-04, It/sec 1.175, Tokens/sec 116.295, Trained Tokens 9863.0, Peak mem 10.385 GB
Iter 49: Train loss 8.751, Learning Rate 1.000e-04, It/sec 2.020, Tokens/sec 90.886, Trained Tokens 9908.0, Peak mem 10.385 GB
Iter 50: Train loss 7.494, Learning Rate 1.000e-04, It/sec 0.488, Tokens/sec 150.173, Trained Tokens 10216.0, Peak mem 10.385 GB
Iter 51: Train loss 7.147, Learning Rate 1.000e-04, It/sec 0.692, Tokens/sec 155.051, Trained Tokens 10440.0, Peak mem 10.385 GB
Iter 52: Train loss 7.993, Learning Rate 1.000e-04, It/sec 0.960, Tokens/sec 142.107, Trained Tokens 10588.0, Peak mem 10.385 GB
Iter 53: Train loss 7.170, Learning Rate 1.000e-04, It/sec 0.699, Tokens/sec 134.907, Trained Tokens 10781.0, Peak mem 10.385 GB
Iter 54: Train loss 7.562, Learning Rate 1.000e-04, It/sec 1.120, Tokens/sec 129.868, Trained Tokens 10897.0, Peak mem 10.385 GB
Iter 55: Train loss 7.231, Learning Rate 1.000e-04, It/sec 0.939, Tokens/sec 141.816, Trained Tokens 11048.0, Peak mem 10.385 GB
Iter 56: Train loss 7.138, Learning Rate 1.000e-04, It/sec 0.615, Tokens/sec 141.452, Trained Tokens 11278.0, Peak mem 10.385 GB
Iter 57: Train loss 8.362, Learning Rate 1.000e-04, It/sec 0.941, Tokens/sec 149.613, Trained Tokens 11437.0, Peak mem 10.385 GB
Iter 58: Train loss 7.437, Learning Rate 1.000e-04, It/sec 0.406, Tokens/sec 149.963, Trained Tokens 11806.0, Peak mem 10.385 GB
Iter 59: Train loss 8.236, Learning Rate 1.000e-04, It/sec 2.314, Tokens/sec 74.044, Trained Tokens 11838.0, Peak mem 10.385 GB
Iter 60: Train loss 8.207, Learning Rate 1.000e-04, It/sec 1.172, Tokens/sec 148.804, Trained Tokens 11965.0, Peak mem 10.385 GB
Iter 61: Train loss 8.118, Learning Rate 1.000e-04, It/sec 2.046, Tokens/sec 94.106, Trained Tokens 12011.0, Peak mem 10.385 GB
Iter 62: Train loss 8.699, Learning Rate 1.000e-04, It/sec 0.696, Tokens/sec 141.947, Trained Tokens 12215.0, Peak mem 10.385 GB
Iter 63: Train loss 6.613, Learning Rate 1.000e-04, It/sec 1.978, Tokens/sec 85.064, Trained Tokens 12258.0, Peak mem 10.385 GB
Iter 64: Train loss 7.354, Learning Rate 1.000e-04, It/sec 1.177, Tokens/sec 130.687, Trained Tokens 12369.0, Peak mem 10.385 GB
Iter 65: Train loss 7.521, Learning Rate 1.000e-04, It/sec 0.910, Tokens/sec 136.556, Trained Tokens 12519.0, Peak mem 10.385 GB
Iter 66: Train loss 7.199, Learning Rate 1.000e-04, It/sec 0.619, Tokens/sec 148.004, Trained Tokens 12758.0, Peak mem 10.385 GB
Iter 67: Train loss 8.253, Learning Rate 1.000e-04, It/sec 1.122, Tokens/sec 139.105, Trained Tokens 12882.0, Peak mem 10.385 GB
Iter 68: Train loss 7.030, Learning Rate 1.000e-04, It/sec 1.143, Tokens/sec 129.118, Trained Tokens 12995.0, Peak mem 10.385 GB
Iter 69: Train loss 7.485, Learning Rate 1.000e-04, It/sec 0.137, Tokens/sec 139.681, Trained Tokens 14018.0, Peak mem 10.386 GB
Iter 70: Train loss 7.383, Learning Rate 1.000e-04, It/sec 0.951, Tokens/sec 151.166, Trained Tokens 14177.0, Peak mem 10.386 GB
Iter 71: Train loss 8.822, Learning Rate 1.000e-04, It/sec 0.554, Tokens/sec 143.502, Trained Tokens 14436.0, Peak mem 10.386 GB
Iter 72: Train loss 7.582, Learning Rate 1.000e-04, It/sec 0.810, Tokens/sec 136.089, Trained Tokens 14604.0, Peak mem 10.386 GB
Iter 73: Train loss 6.607, Learning Rate 1.000e-04, It/sec 1.503, Tokens/sec 123.255, Trained Tokens 14686.0, Peak mem 10.386 GB
Iter 74: Train loss 7.346, Learning Rate 1.000e-04, It/sec 0.137, Tokens/sec 139.888, Trained Tokens 15709.0, Peak mem 10.386 GB
Iter 75: Train loss 7.435, Learning Rate 1.000e-04, It/sec 0.433, Tokens/sec 139.078, Trained Tokens 16030.0, Peak mem 10.386 GB
Iter 76: Train loss 7.321, Learning Rate 1.000e-04, It/sec 0.473, Tokens/sec 138.071, Trained Tokens 16322.0, Peak mem 10.386 GB
Iter 77: Train loss 6.756, Learning Rate 1.000e-04, It/sec 0.782, Tokens/sec 148.516, Trained Tokens 16512.0, Peak mem 10.386 GB
Iter 78: Train loss 6.848, Learning Rate 1.000e-04, It/sec 1.936, Tokens/sec 94.885, Trained Tokens 16561.0, Peak mem 10.386 GB
Iter 79: Train loss 10.118, Learning Rate 1.000e-04, It/sec 1.300, Tokens/sec 85.817, Trained Tokens 16627.0, Peak mem 10.386 GB
Iter 80: Train loss 7.057, Learning Rate 1.000e-04, It/sec 0.558, Tokens/sec 151.742, Trained Tokens 16899.0, Peak mem 10.386 GB
Iter 81: Train loss 7.269, Learning Rate 1.000e-04, It/sec 1.434, Tokens/sec 134.796, Trained Tokens 16993.0, Peak mem 10.386 GB
Iter 82: Train loss 8.396, Learning Rate 1.000e-04, It/sec 1.220, Tokens/sec 122.029, Trained Tokens 17093.0, Peak mem 10.386 GB
Iter 83: Train loss 6.620, Learning Rate 1.000e-04, It/sec 0.788, Tokens/sec 147.300, Trained Tokens 17280.0, Peak mem 10.386 GB
Iter 84: Train loss 6.716, Learning Rate 1.000e-04, It/sec 0.324, Tokens/sec 145.924, Trained Tokens 17730.0, Peak mem 10.386 GB
Iter 85: Train loss 7.459, Learning Rate 1.000e-04, It/sec 0.708, Tokens/sec 145.132, Trained Tokens 17935.0, Peak mem 10.386 GB
Iter 86: Train loss 7.328, Learning Rate 1.000e-04, It/sec 0.246, Tokens/sec 145.153, Trained Tokens 18525.0, Peak mem 10.386 GB
Iter 87: Train loss 7.665, Learning Rate 1.000e-04, It/sec 1.463, Tokens/sec 102.427, Trained Tokens 18595.0, Peak mem 10.386 GB
Iter 88: Train loss 6.034, Learning Rate 1.000e-04, It/sec 0.361, Tokens/sec 150.105, Trained Tokens 19011.0, Peak mem 10.386 GB
Iter 89: Train loss 6.745, Learning Rate 1.000e-04, It/sec 0.270, Tokens/sec 139.785, Trained Tokens 19529.0, Peak mem 10.386 GB
Iter 90: Train loss 7.500, Learning Rate 1.000e-04, It/sec 1.161, Tokens/sec 126.555, Trained Tokens 19638.0, Peak mem 10.386 GB
Iter 91: Train loss 5.771, Learning Rate 1.000e-04, It/sec 0.534, Tokens/sec 146.810, Trained Tokens 19913.0, Peak mem 10.386 GB
Iter 92: Train loss 6.370, Learning Rate 1.000e-04, It/sec 0.790, Tokens/sec 135.101, Trained Tokens 20084.0, Peak mem 10.386 GB
Iter 93: Train loss 9.239, Learning Rate 1.000e-04, It/sec 0.973, Tokens/sec 138.214, Trained Tokens 20226.0, Peak mem 10.386 GB
Iter 94: Train loss 6.727, Learning Rate 1.000e-04, It/sec 0.799, Tokens/sec 129.427, Trained Tokens 20388.0, Peak mem 10.386 GB
Iter 95: Train loss 7.069, Learning Rate 1.000e-04, It/sec 0.155, Tokens/sec 138.269, Trained Tokens 21278.0, Peak mem 10.386 GB
Iter 96: Train loss 7.552, Learning Rate 1.000e-04, It/sec 0.325, Tokens/sec 147.793, Trained Tokens 21733.0, Peak mem 10.386 GB
Iter 97: Train loss 7.865, Learning Rate 1.000e-04, It/sec 0.706, Tokens/sec 144.730, Trained Tokens 21938.0, Peak mem 10.386 GB
Iter 98: Train loss 5.819, Learning Rate 1.000e-04, It/sec 0.936, Tokens/sec 143.137, Trained Tokens 22091.0, Peak mem 10.386 GB
Iter 99: Train loss 6.990, Learning Rate 1.000e-04, It/sec 1.474, Tokens/sec 137.069, Trained Tokens 22184.0, Peak mem 10.386 GB
Iter 100: Train loss 7.626, Learning Rate 1.000e-04, It/sec 1.139, Tokens/sec 136.736, Trained Tokens 22304.0, Peak mem 10.386 GB
Iter 100: Saved adapter weights to /Volumes/T7_Shield/mlx-vlm and /Volumes/T7_Shield/0000100_adapters.safetensors.
Saved final adapter weights to /Volumes/T7_Shield/mlx-vlm.
INFO:__main__:Training completed! Model saved to /Volumes/T7_Shield/mlx-vlm
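
For anyone wondering what --train-on-completions does conceptually: the loss is computed only over the assistant's tokens, with prompt and padding positions masked out. A minimal sketch of that kind of masking (illustrative only; the parameter names prompt_lengths and lengths are made up, this is not necessarily the exact code in this PR):

import mlx.core as mx
import mlx.nn as nn

def completion_only_loss(logits, targets, prompt_lengths, lengths):
    # Per-token cross entropy, then keep only completion tokens:
    # positions before the prompt/completion boundary and padding
    # past the sequence length do not contribute to the loss.
    token_loss = nn.losses.cross_entropy(logits, targets, reduction="none")
    positions = mx.arange(targets.shape[1])[None, :]
    mask = (positions >= prompt_lengths[:, None]) & (positions < lengths[:, None])
    mask = mask.astype(token_loss.dtype)
    return (token_loss * mask).sum() / mask.sum()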

@Goekdeniz-Guelmez
Contributor Author

Usually it crashed with OOM after 10 steps :D

@Goekdeniz-Guelmez
Contributor Author

You can now train the vision part too, and 4-bit quantized training works as well. Only Qwen2 for now though; Qwen2.5 gives a NaN loss:

python -m mlx_vlm.lora \
--model-path mlx-community/Qwen2-VL-2B-Instruct-4bit \
--dataset TIGER-Lab/VisualWebInstruct-Seed --dataset-config 'reference' \
--output-path /Volumes/T7_Shield/mlx-vlm \
--batch-size 1 \
--iters 100 \
--learning-rate 1e-4 --grad-checkpoint --train-on-completions --steps-per-report 1 \
--train-vision
INFO:__main__:Loading model from mlx-community/Qwen2-VL-2B-Instruct-4bit
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 12.5MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 392/392 [00:00<00:00, 10.1MB/s]
chat_template.json: 1.05kB [00:00, 12.5MB/s]
model.safetensors.index.json: 108kB [00:00, 30.5MB/s]
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 499/499 [00:00<00:00, 13.2MB/s]
config.json: 1.41kB [00:00, 27.1MB/s]
merges.txt: 1.67MB [00:00, 4.73MB/s]
tokenizer_config.json: 4.30kB [00:00, 15.3MB/s]
vocab.json: 2.78MB [00:00, 9.98MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 11.1MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.25G/1.25G [00:13<00:00, 91.3MB/s]
Fetching 11 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:14<00:00, 1.31s/it]
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
Fetching 11 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 33949.48it/s]
INFO:__main__:Loading dataset from TIGER-Lab/VisualWebInstruct-Seed
INFO:__main__:Applying chat template to the dataset
Map: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18627/18627 [00:00<00:00, 21154.28 examples/s]
INFO:__main__:Setting up LoRA
INFO:__main__:Unfreezing vision stack for training as requested (--train-vision).
#trainable params: 114.90816 M || all params: 2208.55296 M || trainable%: 5.203%
INFO:__main__:Setting up optimizer
INFO:__main__:Training model
Starting training..., iterations: 100
No validation dataset provided — training will run without validation.
Iter 1: Train loss 0.692, Learning Rate 1.000e-04, It/sec 0.746, Tokens/sec 79.845, Trained Tokens 107.0, Peak mem 2.327 GB
Iter 2: Train loss 7.370, Learning Rate 1.000e-04, It/sec 0.760, Tokens/sec 182.300, Trained Tokens 347.0, Peak mem 3.128 GB
Iter 3: Train loss 10.228, Learning Rate 1.000e-04, It/sec 0.289, Tokens/sec 234.644, Trained Tokens 1159.0, Peak mem 4.974 GB
Iter 4: Train loss 16.972, Learning Rate 1.000e-04, It/sec 0.870, Tokens/sec 105.215, Trained Tokens 1280.0, Peak mem 4.974 GB
Iter 5: Train loss 31.586, Learning Rate 1.000e-04, It/sec 1.052, Tokens/sec 141.954, Trained Tokens 1415.0, Peak mem 4.974 GB
Iter 6: Train loss 10.879, Learning Rate 1.000e-04, It/sec 1.063, Tokens/sec 35.067, Trained Tokens 1448.0, Peak mem 4.974 GB
Iter 7: Train loss 19.658, Learning Rate 1.000e-04, It/sec 0.962, Tokens/sec 75.075, Trained Tokens 1526.0, Peak mem 4.974 GB
Iter 8: Train loss 14.045, Learning Rate 1.000e-04, It/sec 0.744, Tokens/sec 193.503, Trained Tokens 1786.0, Peak mem 4.974 GB
Iter 9: Train loss 18.146, Learning Rate 1.000e-04, It/sec 1.229, Tokens/sec 103.234, Trained Tokens 1870.0, Peak mem 4.974 GB
Iter 10: Train loss 9.979, Learning Rate 1.000e-04, It/sec 0.943, Tokens/sec 115.972, Trained Tokens 1993.0, Peak mem 4.974 GB
Iter 11: Train loss 8.683, Learning Rate 1.000e-04, It/sec 0.644, Tokens/sec 167.324, Trained Tokens 2253.0, Peak mem 4.974 GB
Iter 12: Train loss 11.060, Learning Rate 1.000e-04, It/sec 1.041, Tokens/sec 115.523, Trained Tokens 2364.0, Peak mem 4.974 GB
Iter 13: Train loss 10.578, Learning Rate 1.000e-04, It/sec 1.664, Tokens/sec 24.964, Trained Tokens 2379.0, Peak mem 4.974 GB
Iter 14: Train loss 9.850, Learning Rate 1.000e-04, It/sec 1.377, Tokens/sec 114.310, Trained Tokens 2462.0, Peak mem 4.974 GB
Iter 15: Train loss 9.592, Learning Rate 1.000e-04, It/sec 0.532, Tokens/sec 217.096, Trained Tokens 2870.0, Peak mem 4.974 GB
Iter 16: Train loss 9.772, Learning Rate 1.000e-04, It/sec 0.495, Tokens/sec 187.715, Trained Tokens 3249.0, Peak mem 4.974 GB
Iter 17: Train loss 8.653, Learning Rate 1.000e-04, It/sec 1.359, Tokens/sec 100.600, Trained Tokens 3323.0, Peak mem 4.974 GB
Iter 18: Train loss 10.059, Learning Rate 1.000e-04, It/sec 0.468, Tokens/sec 213.130, Trained Tokens 3778.0, Peak mem 4.974 GB
Iter 19: Train loss 11.859, Learning Rate 1.000e-04, It/sec 1.355, Tokens/sec 16.260, Trained Tokens 3790.0, Peak mem 4.974 GB
Iter 20: Train loss 8.931, Learning Rate 1.000e-04, It/sec 1.043, Tokens/sec 121.980, Trained Tokens 3907.0, Peak mem 4.974 GB
Iter 21: Train loss 9.934, Learning Rate 1.000e-04, It/sec 0.326, Tokens/sec 228.540, Trained Tokens 4608.0, Peak mem 4.974 GB
Iter 22: Train loss 8.676, Learning Rate 1.000e-04, It/sec 0.419, Tokens/sec 232.074, Trained Tokens 5162.0, Peak mem 4.974 GB
Iter 23: Train loss 5.783, Learning Rate 1.000e-04, It/sec 1.053, Tokens/sec 146.303, Trained Tokens 5301.0, Peak mem 4.974 GB
Iter 24: Train loss 9.449, Learning Rate 1.000e-04, It/sec 0.669, Tokens/sec 194.592, Trained Tokens 5592.0, Peak mem 4.974 GB
Iter 25: Train loss 9.022, Learning Rate 1.000e-04, It/sec 1.162, Tokens/sec 122.044, Trained Tokens 5697.0, Peak mem 4.974 GB
Iter 26: Train loss 8.933, Learning Rate 1.000e-04, It/sec 1.171, Tokens/sec 228.396, Trained Tokens 5892.0, Peak mem 4.974 GB
Iter 27: Train loss 8.300, Learning Rate 1.000e-04, It/sec 0.780, Tokens/sec 169.340, Trained Tokens 6109.0, Peak mem 4.974 GB
Iter 28: Train loss 8.385, Learning Rate 1.000e-04, It/sec 1.046, Tokens/sec 116.128, Trained Tokens 6220.0, Peak mem 4.974 GB
Iter 29: Train loss 8.489, Learning Rate 1.000e-04, It/sec 0.254, Tokens/sec 200.302, Trained Tokens 7009.0, Peak mem 5.249 GB
Iter 30: Train loss 8.369, Learning Rate 1.000e-04, It/sec 0.164, Tokens/sec 214.807, Trained Tokens 8322.0, Peak mem 6.646 GB
Iter 31: Train loss 9.598, Learning Rate 1.000e-04, It/sec 1.267, Tokens/sec 62.078, Trained Tokens 8371.0, Peak mem 6.646 GB
Iter 32: Train loss 8.454, Learning Rate 1.000e-04, It/sec 1.586, Tokens/sec 196.612, Trained Tokens 8495.0, Peak mem 6.646 GB
Iter 33: Train loss 7.593, Learning Rate 1.000e-04, It/sec 0.459, Tokens/sec 255.292, Trained Tokens 9051.0, Peak mem 6.646 GB
Iter 34: Train loss 8.547, Learning Rate 1.000e-04, It/sec 0.778, Tokens/sec 211.676, Trained Tokens 9323.0, Peak mem 6.646 GB
Iter 35: Train loss 8.303, Learning Rate 1.000e-04, It/sec 0.272, Tokens/sec 219.047, Trained Tokens 10128.0, Peak mem 6.646 GB
Iter 36: Train loss 7.532, Learning Rate 1.000e-04, It/sec 0.557, Tokens/sec 192.747, Trained Tokens 10474.0, Peak mem 6.646 GB
Iter 37: Train loss 7.321, Learning Rate 1.000e-04, It/sec 0.133, Tokens/sec 228.166, Trained Tokens 12193.0, Peak mem 7.643 GB
Iter 38: Train loss 6.233, Learning Rate 1.000e-04, It/sec 1.342, Tokens/sec 110.018, Trained Tokens 12275.0, Peak mem 7.643 GB
Iter 39: Train loss 10.003, Learning Rate 1.000e-04, It/sec 1.179, Tokens/sec 126.121, Trained Tokens 12382.0, Peak mem 7.643 GB
Iter 40: Train loss 7.356, Learning Rate 1.000e-04, It/sec 0.760, Tokens/sec 204.451, Trained Tokens 12651.0, Peak mem 7.643 GB
Iter 41: Train loss 8.002, Learning Rate 1.000e-04, It/sec 0.851, Tokens/sec 208.442, Trained Tokens 12896.0, Peak mem 7.643 GB
Iter 42: Train loss 7.412, Learning Rate 1.000e-04, It/sec 1.061, Tokens/sec 200.455, Trained Tokens 13085.0, Peak mem 7.643 GB
Iter 43: Train loss 8.782, Learning Rate 1.000e-04, It/sec 1.609, Tokens/sec 115.861, Trained Tokens 13157.0, Peak mem 7.643 GB
Iter 44: Train loss 6.658, Learning Rate 1.000e-04, It/sec 0.947, Tokens/sec 136.412, Trained Tokens 13301.0, Peak mem 7.643 GB
Iter 45: Train loss 8.193, Learning Rate 1.000e-04, It/sec 0.946, Tokens/sec 118.259, Trained Tokens 13426.0, Peak mem 7.643 GB
Iter 46: Train loss 7.877, Learning Rate 1.000e-04, It/sec 0.780, Tokens/sec 113.041, Trained Tokens 13571.0, Peak mem 7.643 GB
Iter 47: Train loss 7.574, Learning Rate 1.000e-04, It/sec 1.049, Tokens/sec 99.673, Trained Tokens 13666.0, Peak mem 7.643 GB
Iter 48: Train loss 7.112, Learning Rate 1.000e-04, It/sec 0.352, Tokens/sec 264.373, Trained Tokens 14418.0, Peak mem 7.643 GB
Iter 49: Train loss 8.359, Learning Rate 1.000e-04, It/sec 1.570, Tokens/sec 106.745, Trained Tokens 14486.0, Peak mem 7.643 GB
Iter 50: Train loss 7.320, Learning Rate 1.000e-04, It/sec 0.294, Tokens/sec 221.674, Trained Tokens 15241.0, Peak mem 7.643 GB
Iter 51: Train loss 7.451, Learning Rate 1.000e-04, It/sec 0.546, Tokens/sec 209.789, Trained Tokens 15625.0, Peak mem 7.643 GB
Iter 52: Train loss 6.901, Learning Rate 1.000e-04, It/sec 0.934, Tokens/sec 127.066, Trained Tokens 15761.0, Peak mem 7.643 GB
Iter 53: Train loss 7.357, Learning Rate 1.000e-04, It/sec 0.316, Tokens/sec 202.406, Trained Tokens 16401.0, Peak mem 7.643 GB
Iter 54: Train loss 7.153, Learning Rate 1.000e-04, It/sec 0.861, Tokens/sec 223.031, Trained Tokens 16660.0, Peak mem 7.643 GB
Iter 55: Train loss 7.987, Learning Rate 1.000e-04, It/sec 0.854, Tokens/sec 69.988, Trained Tokens 16742.0, Peak mem 7.643 GB
Iter 56: Train loss 7.341, Learning Rate 1.000e-04, It/sec 0.950, Tokens/sec 154.784, Trained Tokens 16905.0, Peak mem 7.643 GB
Iter 57: Train loss 6.917, Learning Rate 1.000e-04, It/sec 0.714, Tokens/sec 217.640, Trained Tokens 17210.0, Peak mem 7.643 GB
Iter 58: Train loss 7.581, Learning Rate 1.000e-04, It/sec 1.322, Tokens/sec 72.690, Trained Tokens 17265.0, Peak mem 7.643 GB
Iter 59: Train loss 7.083, Learning Rate 1.000e-04, It/sec 1.053, Tokens/sec 114.759, Trained Tokens 17374.0, Peak mem 7.643 GB
Iter 60: Train loss 7.301, Learning Rate 1.000e-04, It/sec 0.333, Tokens/sec 231.845, Trained Tokens 18070.0, Peak mem 7.643 GB
Iter 61: Train loss 6.067, Learning Rate 1.000e-04, It/sec 1.334, Tokens/sec 68.022, Trained Tokens 18121.0, Peak mem 7.643 GB
Iter 62: Train loss 8.324, Learning Rate 1.000e-04, It/sec 1.599, Tokens/sec 166.308, Trained Tokens 18225.0, Peak mem 7.643 GB
Iter 63: Train loss 7.174, Learning Rate 1.000e-04, It/sec 1.343, Tokens/sec 142.398, Trained Tokens 18331.0, Peak mem 7.643 GB
Iter 64: Train loss 6.478, Learning Rate 1.000e-04, It/sec 0.400, Tokens/sec 220.645, Trained Tokens 18882.0, Peak mem 7.643 GB
Iter 65: Train loss 6.897, Learning Rate 1.000e-04, It/sec 0.617, Tokens/sec 164.024, Trained Tokens 19148.0, Peak mem 7.643 GB
Iter 66: Train loss 8.962, Learning Rate 1.000e-04, It/sec 1.884, Tokens/sec 24.490, Trained Tokens 19161.0, Peak mem 7.643 GB
Iter 67: Train loss 7.238, Learning Rate 1.000e-04, It/sec 0.342, Tokens/sec 249.676, Trained Tokens 19890.0, Peak mem 7.643 GB
Iter 68: Train loss 9.425, Learning Rate 1.000e-04, It/sec 1.158, Tokens/sec 145.850, Trained Tokens 20016.0, Peak mem 7.643 GB
Iter 69: Train loss 7.510, Learning Rate 1.000e-04, It/sec 1.041, Tokens/sec 194.674, Trained Tokens 20203.0, Peak mem 7.643 GB
Iter 70: Train loss 6.512, Learning Rate 1.000e-04, It/sec 1.350, Tokens/sec 86.426, Trained Tokens 20267.0, Peak mem 7.643 GB
Iter 71: Train loss 7.052, Learning Rate 1.000e-04, It/sec 0.845, Tokens/sec 154.547, Trained Tokens 20450.0, Peak mem 7.643 GB
Iter 72: Train loss 6.427, Learning Rate 1.000e-04, It/sec 0.843, Tokens/sec 206.624, Trained Tokens 20695.0, Peak mem 7.643 GB
Iter 73: Train loss 7.581, Learning Rate 1.000e-04, It/sec 1.870, Tokens/sec 140.227, Trained Tokens 20770.0, Peak mem 7.643 GB
Iter 74: Train loss 7.016, Learning Rate 1.000e-04, It/sec 0.449, Tokens/sec 170.759, Trained Tokens 21150.0, Peak mem 7.643 GB
Iter 75: Train loss 7.114, Learning Rate 1.000e-04, It/sec 0.604, Tokens/sec 231.936, Trained Tokens 21534.0, Peak mem 7.643 GB
Iter 76: Train loss 6.624, Learning Rate 1.000e-04, It/sec 0.265, Tokens/sec 228.358, Trained Tokens 22397.0, Peak mem 7.643 GB
Iter 77: Train loss 5.990, Learning Rate 1.000e-04, It/sec 1.160, Tokens/sec 88.170, Trained Tokens 22473.0, Peak mem 7.643 GB
Iter 78: Train loss 7.526, Learning Rate 1.000e-04, It/sec 0.706, Tokens/sec 206.142, Trained Tokens 22765.0, Peak mem 7.643 GB
Iter 79: Train loss 5.853, Learning Rate 1.000e-04, It/sec 1.880, Tokens/sec 16.916, Trained Tokens 22774.0, Peak mem 7.643 GB
Iter 80: Train loss 6.486, Learning Rate 1.000e-04, It/sec 1.037, Tokens/sec 186.582, Trained Tokens 22954.0, Peak mem 7.643 GB
Iter 81: Train loss 6.946, Learning Rate 1.000e-04, It/sec 0.221, Tokens/sec 201.627, Trained Tokens 23866.0, Peak mem 7.643 GB
Iter 82: Train loss 8.287, Learning Rate 1.000e-04, It/sec 1.539, Tokens/sec 177.030, Trained Tokens 23981.0, Peak mem 7.643 GB
Iter 83: Train loss 7.064, Learning Rate 1.000e-04, It/sec 0.202, Tokens/sec 229.146, Trained Tokens 25118.0, Peak mem 7.643 GB
Iter 84: Train loss 7.140, Learning Rate 1.000e-04, It/sec 1.133, Tokens/sec 116.660, Trained Tokens 25221.0, Peak mem 7.643 GB
Iter 85: Train loss 7.241, Learning Rate 1.000e-04, It/sec 0.358, Tokens/sec 211.875, Trained Tokens 25813.0, Peak mem 7.643 GB
Iter 86: Train loss 7.781, Learning Rate 1.000e-04, It/sec 1.037, Tokens/sec 106.760, Trained Tokens 25916.0, Peak mem 7.643 GB
Iter 87: Train loss 7.082, Learning Rate 1.000e-04, It/sec 0.911, Tokens/sec 207.712, Trained Tokens 26144.0, Peak mem 7.643 GB
Iter 88: Train loss 7.355, Learning Rate 1.000e-04, It/sec 1.340, Tokens/sec 101.857, Trained Tokens 26220.0, Peak mem 7.643 GB
Iter 89: Train loss 6.569, Learning Rate 1.000e-04, It/sec 1.167, Tokens/sec 98.038, Trained Tokens 26304.0, Peak mem 7.643 GB
Iter 90: Train loss 7.462, Learning Rate 1.000e-04, It/sec 0.916, Tokens/sec 100.709, Trained Tokens 26414.0, Peak mem 7.643 GB
Iter 91: Train loss 7.486, Learning Rate 1.000e-04, It/sec 0.829, Tokens/sec 115.296, Trained Tokens 26553.0, Peak mem 7.643 GB
Iter 92: Train loss 8.029, Learning Rate 1.000e-04, It/sec 1.561, Tokens/sec 126.417, Trained Tokens 26634.0, Peak mem 7.643 GB
Iter 93: Train loss 7.007, Learning Rate 1.000e-04, It/sec 0.593, Tokens/sec 208.309, Trained Tokens 26985.0, Peak mem 7.643 GB
Iter 94: Train loss 6.400, Learning Rate 1.000e-04, It/sec 0.804, Tokens/sec 116.602, Trained Tokens 27130.0, Peak mem 7.643 GB
Iter 95: Train loss 7.194, Learning Rate 1.000e-04, It/sec 2.285, Tokens/sec 18.276, Trained Tokens 27138.0, Peak mem 7.643 GB
Iter 96: Train loss 7.902, Learning Rate 1.000e-04, It/sec 0.818, Tokens/sec 199.565, Trained Tokens 27382.0, Peak mem 7.643 GB
Iter 97: Train loss 7.354, Learning Rate 1.000e-04, It/sec 0.902, Tokens/sec 182.185, Trained Tokens 27584.0, Peak mem 7.643 GB
Iter 98: Train loss 6.953, Learning Rate 1.000e-04, It/sec 1.497, Tokens/sec 182.687, Trained Tokens 27706.0, Peak mem 7.643 GB
Iter 99: Train loss 8.326, Learning Rate 1.000e-04, It/sec 1.851, Tokens/sec 88.845, Trained Tokens 27754.0, Peak mem 7.643 GB
Iter 100: Train loss 6.779, Learning Rate 1.000e-04, It/sec 0.581, Tokens/sec 172.653, Trained Tokens 28051.0, Peak mem 7.643 GB
Iter 100: Saved adapter weights to /Volumes/T7_Shield/mlx-vlm and /Volumes/T7_Shield/0000100_adapters.safetensors.
Saved final adapter weights to /Volumes/T7_Shield/mlx-vlm.
INFO:__main__:Training completed! Model saved to /Volumes/T7_Shield/mlx-vlm
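
For reference, --train-vision simply unfreezes the vision stack so its weights receive gradients as well, which is why the trainable-parameter count jumps from ~9.2 M to ~114.9 M here. A rough sketch of the idea (the vision_tower attribute name is hypothetical; the real module path depends on the model class):

from mlx.utils import tree_flatten

def configure_trainable(model, train_vision: bool):
    model.freeze()                      # start with every parameter frozen
    # ... attach LoRA adapters to the language-model projections here ...
    if train_vision:
        model.vision_tower.unfreeze()   # hypothetical attribute name

def trainable_millions(model) -> float:
    # Reproduces the "#trainable params: ... M" figure printed in the logs.
    return sum(p.size for _, p in tree_flatten(model.trainable_parameters())) / 1e6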

@Goekdeniz-Guelmez
Contributor Author

I updated some things and it's a lot faster and uses a lot less RAM:

Iter 29: Train loss 8.377, Learning Rate 1.000e-04, It/sec 7.091, Tokens/sec 226.918, Trained Tokens 928, Peak mem 1.839 GB
Iter 30: Train loss 8.133, Learning Rate 1.000e-04, It/sec 7.074, Tokens/sec 226.371, Trained Tokens 960, Peak mem 1.839 GB
Iter 31: Train loss 8.074, Learning Rate 1.000e-04, It/sec 7.159, Tokens/sec 229.096, Trained Tokens 992, Peak mem 1.839 GB
Iter 32: Train loss 8.872, Learning Rate 1.000e-04, It/sec 7.071, Tokens/sec 226.286, Trained Tokens 1024, Peak mem 1.839 GB
Iter 33: Train loss 8.598, Learning Rate 1.000e-04, It/sec 7.174, Tokens/sec 229.554, Trained Tokens 1056, Peak mem 1.839 GB
Iter 34: Train loss 7.503, Learning Rate 1.000e-04, It/sec 7.174, Tokens/sec 229.570, Trained Tokens 1088, Peak mem 1.839 GB
Iter 35: Train loss 7.455, Learning Rate 1.000e-04, It/sec 7.126, Tokens/sec 228.016, Trained Tokens 1120, Peak mem 1.839 GB
Iter 36: Train loss 7.137, Learning Rate 1.000e-04, It/sec 7.167, Tokens/sec 229.334, Trained Tokens 1152, Peak mem 1.839 GB
Iter 37: Train loss 7.025, Learning Rate 1.000e-04, It/sec 7.115, Tokens/sec 227.686, Trained Tokens 1184, Peak mem 1.839 GB
Iter 38: Train loss 6.940, Learning Rate 1.000e-04, It/sec 7.132, Tokens/sec 228.237, Trained Tokens 1216, Peak mem 1.839 GB
Iter 39: Train loss 7.210, Learning Rate 1.000e-04, It/sec 7.111, Tokens/sec 227.552, Trained Tokens 1248, Peak mem 1.839 GB
Iter 40: Train loss 7.980, Learning Rate 1.000e-04, It/sec 7.256, Tokens/sec 232.203, Trained Tokens 1280, Peak mem 1.839 GB
Iter 41: Train loss 7.453, Learning Rate 1.000e-04, It/sec 7.118, Tokens/sec 227.760, Trained Tokens 1312, Peak mem 1.842 GB
Iter 42: Train loss 6.435, Learning Rate 1.000e-04, It/sec 7.188, Tokens/sec 230.011, Trained Tokens 1344, Peak mem 1.842 GB
Iter 43: Train loss 6.333, Learning Rate 1.000e-04, It/sec 7.078, Tokens/sec 226.508, Trained Tokens 1376, Peak mem 1.842 GB
Iter 44: Train loss 6.238, Learning Rate 1.000e-04, It/sec 7.179, Tokens/sec 229.744, Trained Tokens 1408, Peak mem 1.842 GB
Iter 45: Train loss 8.609, Learning Rate 1.000e-04, It/sec 7.209, Tokens/sec 230.698, Trained Tokens 1440, Peak mem 1.842 GB

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 11, 2025

Data from step 40:

before: It/sec 1.851, Peak mem 7.643 GB
after: It/sec 7.179, Peak mem 1.839 GB

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 11, 2025

Qwen2 models work, both quantized and full precision, with both LoRA and full-weight training.

@Goekdeniz-Guelmez
Contributor Author

Qwen2.5 has been added too.

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Sep 11, 2025

@Blaizzy here is a test LoRA adapter: https://huggingface.co/Goekdeniz-Guelmez/MLX-VLM-Qwen2-VL-2B-Instruct-bf16-VisualWebInstruct-lora/blob/main/README.md

The command used here is:

python -m mlx_vlm.lora --model-path mlx-community/Qwen2-VL-2B-Instruct-bf16 --dataset TIGER-Lab/VisualWebInstruct --dataset-config 'example' --output-path Desktop/Qwen2-VL-2B-Instruct-bf16-VisualWebInstruct-lora --batch-size 1 --epochs 1 --learning-rate 1e-6 --grad-checkpoint --train-on-completions --steps-per-report 1

@Blaizzy, it would be great if you could try out a larger model using this command.
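
If anyone wants to sanity-check the adapter locally, something along these lines should work (a sketch only: it assumes load() accepts an adapter_path argument and the usual mlx_vlm generate signature, and that adapters/ is a local copy of the trained weights, so please verify against the current API):

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-bf16"
# adapters/ is assumed to hold the saved LoRA weights from the run above
model, processor = load(model_path, adapter_path="adapters")
config = load_config(model_path)

prompt = apply_chat_template(processor, config, "Describe this image.", num_images=1)
print(generate(model, processor, prompt, image="example.jpg", verbose=False))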

Blaizzy mentioned this pull request Nov 10, 2025
@Goekdeniz-Guelmez
Contributor Author

#187: DeepSeek-VL 1 has been added.

@sachinraja13

@Goekdeniz-Guelmez and @Blaizzy : Any update on this please?

@Goekdeniz-Guelmez
Contributor Author

@sachinraja13 I will continue working on it later this week, after finishing all the Gabliteration project to-dos.

@sachinraja13

Many thanks for all your contributions, @Goekdeniz-Guelmez! Very helpful! Looking forward to it!

@Goekdeniz-Guelmez
Contributor Author

In the meantime, you can try Qwen2, 2.5, 3, and Gemma 3 and let me know how it goes.

@Blaizzy
Owner

Blaizzy commented Jan 23, 2026

Hey @Goekdeniz-Guelmez

How are you doing?

Any updates here?

I’m making some major changes in #681, and after that I will add vision attention chunking to reduce peak memory usage and OOM errors when processing images at 2K resolution and above.
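
For context, the usual way to chunk vision attention is to process the query patches in slices so the full score matrix is never materialized at once. A naive sketch of the idea (shapes assumed to be (B, H, L, D); not the planned implementation):

import mlx.core as mx

def chunked_attention(q, k, v, scale, chunk_size=1024):
    # Each chunk only needs a (chunk_size, L_k) score slice instead of the
    # full (L_q, L_k) matrix, which bounds peak memory when a high-resolution
    # image produces many patches.
    outputs = []
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        scores = (q_chunk * scale) @ mx.swapaxes(k, -1, -2)
        outputs.append(mx.softmax(scores, axis=-1) @ v)
    return mx.concatenate(outputs, axis=-2)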

@Goekdeniz-Guelmez
Contributor Author

Goekdeniz-Guelmez commented Jan 24, 2026

Hey @Blaizzy

Yes, I added more architectures, fixed the OOM for small-parameter models, and made some more optimizations and other major changes; the next pushes will remove functions and files I originally added. #681 sounds great, feel free to merge it and I'll adjust this PR accordingly.
