@EvolvingLMMs-Lab

LMMs-Lab

Feeling and building multimodal intelligence.

LMMs-Lab: Building Multimodal Intelligence

We are a group of researchers focused on large multimodal models (LMMs). We hope to share insights with the community through our research.

Here are a few of our projects.

We're on an exciting journey toward creating Artificial General Intelligence (AGI), with an enthusiasm much like that surrounding the 1960s moon landing. This journey is powered by advanced large language models (LLMs) and large multimodal models (LMMs), which are complex systems capable of understanding, learning, and performing a wide variety of human tasks.

To gauge how advanced these models are, we use a variety of evaluation benchmarks. These benchmarks help us understand the capabilities of these models and show how close we are to achieving AGI. To make this measurement consistent and efficient, we introduce lmms-eval, an evaluation framework built for evaluating LMMs.

VideoMMMU is a multi-modal, multi-disciplinary video benchmark that evaluates how well models acquire knowledge from educational videos.

Our dataset comprises 300 lecture-style videos spanning 6 professional disciplines: Art, Business, Science, Medicine, Humanities, and Engineering, with 30 subjects distributed among them.

VideoMMMU features a Knowledge Acquisition-based Question Design. Each video includes 3 question-answer pairs aligned with the three knowledge acquisition stages: Perception (identifying key information related to the knowledge), Comprehension (understanding the underlying concepts), and Adaptation (applying knowledge to new scenarios).
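The per-video layout described above (three question-answer pairs, one per knowledge acquisition stage) could be represented roughly as follows. This is only an illustrative sketch: the names `Stage`, `QAPair`, `VideoItem`, and `is_complete` are hypothetical and are not taken from the VideoMMMU codebase or dataset schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Stage(Enum):
    """The three knowledge acquisition stages named in the text."""
    PERCEPTION = "perception"        # identifying key information
    COMPREHENSION = "comprehension"  # understanding underlying concepts
    ADAPTATION = "adaptation"        # applying knowledge to new scenarios


@dataclass(frozen=True)
class QAPair:
    # Hypothetical record for one question-answer pair.
    stage: Stage
    question: str
    answer: str


@dataclass
class VideoItem:
    # Hypothetical record for one lecture-style video.
    video_id: str
    discipline: str
    subject: str
    qa_pairs: List[QAPair]

    def is_complete(self) -> bool:
        """True when there is exactly one QA pair for each of the three stages."""
        stages = {qa.stage for qa in self.qa_pairs}
        return len(self.qa_pairs) == 3 and stages == set(Stage)
```

A check like `is_complete` is one way to confirm that every video covers Perception, Comprehension, and Adaptation exactly once before scoring.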

VideoMMMU proposes a knowledge acquisition metric (Δknowledge) to measure performance gains on practice exam questions after learning from videos. This metric enables us to quantitatively evaluate how effectively LMMs can assimilate and utilize the information presented in the videos to solve real-world, novel problems.
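The text above does not spell out how Δknowledge is computed. One plausible formulation, shown here strictly as an assumption, normalizes the accuracy gain by the headroom that remained before watching the video, so that models starting from different baselines can be compared. The function name `delta_knowledge` and the zero-headroom convention are hypothetical.

```python
def delta_knowledge(acc_before: float, acc_after: float) -> float:
    """Normalized accuracy gain in percent (assumed formulation).

    Both accuracies are percentages in [0, 100], measured on the same
    practice exam questions before and after the model sees the video.
    """
    for acc in (acc_before, acc_after):
        if not 0.0 <= acc <= 100.0:
            raise ValueError("accuracies must lie in [0, 100]")

    headroom = 100.0 - acc_before
    if headroom == 0.0:
        # Assumption: a model that was already perfect cannot gain anything.
        return 0.0
    return (acc_after - acc_before) / headroom * 100.0


# A model that goes from 40% to 55% has closed a quarter of its remaining gap.
gain = delta_knowledge(40.0, 55.0)
# A model whose accuracy drops after watching gets a negative score.
loss = delta_knowledge(50.0, 40.0)
```

Under this normalization a negative value means the video hurt performance, which a raw "accuracy after" number would hide.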

We expanded the LLaVA-NeXT series with recent, stronger open LLMs and report our findings on these more capable language models. We maintain an efficient training strategy, as in previous LLaVA models, and perform supervised fine-tuning on the same data used for the previous LLaVA-NeXT 7B/13B/34B models. Our current largest model, LLaVA-NeXT-110B, is trained on 128 H800-80G GPUs for 18 hours.

Supported by stronger LLMs, LLaVA-NeXT achieves consistently better performance than prior open-source LMMs simply by increasing the LLM capability. It catches up to GPT4-V on selected benchmarks.
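To put the stated training budget in perspective, the two figures given for the 110B model can be combined into a total device-hour count. This is plain arithmetic on the numbers in the text; the variable names are illustrative only.

```python
# Figures quoted in the text for LLaVA-NeXT-110B.
num_gpus = 128          # H800-80G devices
wall_clock_hours = 18   # reported training time

# Total compute budget expressed in GPU-hours.
gpu_hours = num_gpus * wall_clock_hours
print(gpu_hours)  # 2304
```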

We report detailed ablations, including architectural modifications, enlarged visual tokens, and varied training strategies, to explore potential improvements in LLaVA-NeXT's performance.

We explore LLaVA-NeXT's capabilities in video understanding tasks, highlighting its strong performance. Key improvements include:

SoTA performance: Without seeing any video data, LLaVA-NeXT demonstrates strong zero-shot modality transfer ability, outperforming all existing open-source LMMs (e.g., LLaMA-VID) that have been specifically trained for videos. Compared with proprietary models, it achieves performance comparable to Gemini Pro on NextQA and ActivityNet-QA.

Strong length generalization ability: Despite being trained under a sequence length limit of 4096 tokens, LLaVA-NeXT generalizes remarkably well to longer sequences. This capability ensures robust performance even when processing long-frame content that exceeds the original token length limit.

DPO pushes performance: Direct Preference Optimization (DPO) with AI feedback on videos yields significant performance gains.

Pinned

  1. lmms-eval Public

    Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.

    Python 2.1k 204

Repositories

Showing 10 of 10 repositories
  • VideoMMMU Public
    Python 25 1 0 1 Updated Feb 13, 2025
  • lmms-eval Public

    Accelerating the development of large multimodal models (LMMs) with one-click evaluation module - lmms-eval.

    Python 2,093 204 181 (10 issues need help) 2 Updated Feb 13, 2025
  • .github Public
    1 0 0 0 Updated Feb 13, 2025
  • open-r1-multimodal Public

    A fork to add multimodal model training to open-r1

    Python 611 Apache-2.0 31 9 0 Updated Feb 8, 2025
  • multimodal-sae Public

    Auto Interpretation Pipeline and many other functionalities for Multimodal SAE Analysis.

    Python 104 5 0 0 Updated Jan 24, 2025
  • LongVA Public

    Long Context Transfer from Language to Vision

    Python 360 Apache-2.0 19 27 0 Updated Nov 20, 2024
  • demos Public
    Python 0 0 0 0 Updated Sep 18, 2024
  • sglang Public Forked from sgl-project/sglang

    SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.

    Python 4 Apache-2.0 904 0 0 Updated Sep 18, 2024
  • Otter Public

    🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

    Python 3,227 MIT 214 61 2 Updated Mar 5, 2024
  • RelateAnything Public

    Relate Anything Model is capable of taking an image as input and utilizing SAM to identify the corresponding mask within the image.

    Python 449 Apache-2.0 21 6 0 Updated Jul 5, 2023

Top languages

Python
