Benchmarks updates & Added Blink benchmark support #1

GuilhermeViveiros · 2025-10-07T11:50:20Z

This PR introduces several updates:

A significant amount of work was done within deep-spin that, unfortunately, was not merged into main periodically.
I cannot describe every change in detail, but I can highlight the main points.
Earlier iterations on this branch focused on evaluating the multimodal capabilities of TowerVision. The key updates are:

Improved post-processing functions for several benchmarks, including ALM-Bench, Commute, and Kaleidoscope, among others.
Extended ALM-Bench support to more languages.
Added support for the likelihood function across different multimodal models.
Introduced judges in Aya-Vision.
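
For reviewers unfamiliar with what a post-processing function does in this context, here is a minimal sketch of the general idea: pulling a multiple-choice letter out of a free-form model response. It is not the code from this branch. The function name `extract_choice`, its parameters, and the matching rules are all hypothetical and chosen only for illustration.

```python
import re
from typing import Optional, Sequence


def extract_choice(response: str, choices: Sequence[str] = ("A", "B", "C", "D")) -> Optional[str]:
    """Return the option letter found in a model response, or None.

    Hypothetical helper for illustration; not part of lmms-eval.
    """
    letters = "".join(re.escape(c) for c in choices)

    # 1) Prefer an explicit statement such as "answer is (B)" or "Answer: C".
    #    Only the words "answer"/"is" are matched case-insensitively, so a
    #    lowercase article like "a" is not mistaken for option A.
    explicit = re.search(
        rf"(?i:answer)\s*(?:(?i:is)|:)?\s*\(?([{letters}])\)?(?![A-Za-z])",
        response,
    )
    if explicit:
        return explicit.group(1)

    # 2) Otherwise fall back to the first standalone uppercase option letter.
    #    This is a heuristic: a sentence starting with "A ..." would match A.
    standalone = re.search(rf"(?<![A-Za-z])([{letters}])(?![A-Za-z])", response)
    return standalone.group(1) if standalone else None


if __name__ == "__main__":
    for text in ["The answer is (B).", "Final answer: C", "B) the red car", "no idea"]:
        print(repr(text), "->", extract_choice(text))
```

Returning `None` for unparseable responses, rather than guessing, lets the scoring step count them as wrong explicitly. Whether the benchmarks in this PR handle that case the same way is not stated here.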

Additionally, this PR adds support for a new benchmark, Blink.

manzar96 and others added 30 commits December 20, 2024 17:02

added llava-next to gitingore

a6537b6

init script for eval run

315249d

init commit for adding molmo_hf

4a1fe31

requirements commit -- to be MODIFIED

03b46e7

addition of pangea tasks+ pangea model+ script for aggregating results

a4c8794

small fix for nvlm_d class and config files of maxm task

d944f97

pangea update+ updates on maxm task

0a4246c

Merge branch 'multiling_multimodal_tasks_add' of github.com:deep-spin…

6e95970

…/lmms-eval into multiling_multimodal_tasks_add

modified output parsing for marvl

cd6686b

updated marvl task output parsing

9b1cd7a

added the possibility of using tags during the inference

85f2c33

Update .gitignore

deaac05

Update .gitignore

3ceee86

added the scripts folder

5c95b8f

added the script to parse the results

61349bb

[WIP] add the Pixtral model

65ab80c

[WIP] add the Pixtral Model

8d2f5a6

update the pixtral model to work with vLLM

958cbb0

Updated Pixtral model

a231572

Updated Pixtral model

450e045

added multi-image support

c8a6797

updated the parse results script

32a4c9f

added cc-ocr

1a466f3

Merge branch 'multiling_multimodal_tasks_add' of github.com:deep-spin…

c31b179

…/lmms-eval into multiling_multimodal_tasks_add

increased max new tokens for cc-ocr

b6a5f81

fix pixstral run in signle gpu

e9acfbb

added the subset of the multilingual tasks that are relevant to tower

809c62c

Add ALM bench

a310b1d

updated cc-ocr eval

33b56f3

Merge branch 'multiling_multimodal_tasks_add' of github.com:deep-spin…

423ff2a

…/lmms-eval into multiling_multimodal_tasks_add

manzar96 and others added 27 commits April 27, 2025 14:25

small update

af0ae6e

small updates aya

f6fd1f5

final_answer parser

38e9453

final_answer parser

91439f2

fixed the bugs of alm-bench

a62ad3a

add llava_v6 model

e03a1b8

updated alm-bench pre-prompts

0e86154

minor fix in llava model

8810720

added the support of llava_v6 to the task

a0377ae

small fix in the prompt

612213f

updated llava_hf and llava_v6

d7394b0

Merge branch 'multiling_multimodal_tasks_add' of github.com:deep-spin…

279a8f5

…/lmms-eval into multiling_multimodal_tasks_add

temporary addition of m-wild-vision bench to switch to alm-bench branch

d13ec67

updated gitignore with lm-eval-harness

d293ce5

removed breakpoint from llava_hf.py

0b2464c

ayavision and m-wild bench

6659b9c

added support of all languages for ayavisionbench and m-wild-vision

918a58f

added default pre-prompts and post-prompts in configs for ayavision a…

f912945

…nd m-wild-vision

added the v6 prompts and models

d2057f3

Merge branch 'alm-bench' of https://github.com/deep-spin/lmms-eval in…

990dded

…to alm-bench

updates for the commute task

29d0f33

update for pixtral

d04bc31

llavq_hd loglikelihood implementations + changes on commute

3c287bb

mmmu_pro modifications

f8e35c3

Merge branch 'alm-bench' of https://github.com/deep-spin/lmms-eval in…

cb69c87

…to feature/add-blink-task

Merge branch 'multiling_multimodal_tasks_add' of https://github.com/d…

eb86cf2

…eep-spin/lmms-eval into feature/add-blink-task

blink benchmark related stuff

8857d4c

GuilhermeViveiros requested a review from CoderPat October 7, 2025 11:50

GuilhermeViveiros self-assigned this Oct 7, 2025

GuilhermeViveiros added the enhancement (New feature or request) label Oct 7, 2025


5 participants
