We introduce Efficient Multi-Modal Long Context Learning (EMLoC), a novel training-free method that embeds many demonstration examples directly into the model input. EMLoC offers a more efficient, flexible, and scalable solution for task adaptation. By adaptively pruning tokens at each layer under a Jensen-Shannon divergence constraint, our method achieves a dramatic reduction in inference complexity without sacrificing performance.
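For intuition, here is a minimal sketch of the core idea: at a given layer, drop the least-important context tokens only while the pruned distribution stays within a Jensen-Shannon divergence budget of the original. All names below (`js_divergence`, `prune_layer_tokens`, the attention-mass importance heuristic, and the 0.05 budget) are hypothetical illustrations rather than the released implementation; see the code in this repository for the actual method.

```python
import torch

def js_divergence(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps) / (m + eps)).log()).sum(-1)
    kl_qm = (q * ((q + eps) / (m + eps)).log()).sum(-1)
    return 0.5 * (kl_pm + kl_qm)

def prune_layer_tokens(attn: torch.Tensor, max_js: float = 0.05) -> torch.Tensor:
    """Drop the least-attended context tokens while the renormalized attention
    stays within `max_js` JS divergence of the original distribution.

    attn: [num_tokens] attention mass received by each context token.
    Returns the indices of the tokens to keep.
    """
    ref = attn / attn.sum()
    keep = torch.ones_like(attn, dtype=torch.bool)
    for idx in torch.argsort(attn):             # least important first
        keep[idx] = False                       # tentatively drop this token
        pruned = ref * keep
        pruned = pruned / pruned.sum().clamp_min(1e-8)
        if js_divergence(pruned, ref) > max_js:
            keep[idx] = True                    # dropping it distorts the layer too much
            break
    return keep.nonzero(as_tuple=True)[0]

# Toy usage: prune an 8-token context based on its attention mass.
attn = torch.tensor([0.30, 0.02, 0.25, 0.01, 0.20, 0.02, 0.15, 0.05])
print(prune_layer_tokens(attn))
```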
- If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.
- [2025/5/4] 🎉 Released the paper and code of EMLoC.
Multi-modal Large Language Models
- Qwen2-VL (the monkey-patch version is still in preparation)
- Other models will be supported soon!
This work is built on Qwen2-VL, lmms-eval, and transformers. Thanks for their contributions! We modify modeling_qwen2_vl.py and cache_utils.py in transformers. In addition, we modify lmms-eval to support multi-modal in-context learning and Ascend 910B. More details can be found in the code.
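For reference, the unmodified base model can be loaded through the standard transformers API; our changes then take effect via the patched modeling_qwen2_vl.py and cache_utils.py. The checkpoint name below is the public Hugging Face release and is used only as an example.

```python
# Load the stock Qwen2-VL model and processor that the patched
# modeling_qwen2_vl.py / cache_utils.py build on. The checkpoint name is the
# public Hugging Face release and is used here only as an example.
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
```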
git clone
cd EMLoC
conda create -y -n emloc python=3.10
conda activate emloc
pip install -r requirements.txt
## install torch-npu to support Ascend 910B
# pip install torch-npu==2.4.0
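An optional sanity check (only a suggestion) confirms that the installed backends are visible; the NPU branch applies only if torch-npu was installed for Ascend 910B:

```python
# Optional sanity check of the installed backends. The NPU check only applies
# if torch-npu was installed for Ascend 910B.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())

try:
    import torch_npu  # noqa: F401  # only present on Ascend setups
    print("NPU available:", torch.npu.is_available())
except ImportError:
    print("torch-npu not installed (expected on non-Ascend machines)")
```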
- ImageNet1k:
  `train` and `val` should be in the root directory of imagenet1k (a quick path check is sketched after this list).
ln -s /path/to/imagenet1k/ ./data/imagenet1k
- Other datasets: lmms-eval will automatically download and split them into a few-shot set and a validation set.
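The snippet below is a small, hypothetical helper (not part of the repository) to verify that the ImageNet-1k symlink points at a directory containing the expected splits:

```python
# Hypothetical helper: check that ./data/imagenet1k contains the expected
# train/ and val/ splits after creating the symlink.
from pathlib import Path

root = Path("./data/imagenet1k")
for split in ("train", "val"):
    path = root / split
    print(f"{path}: {'ok' if path.is_dir() else 'MISSING'}")
```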
# ImageNet
sh ./scripts/EMLoC_imagenet.sh
# illusionVQA
sh ./scripts/EMLoC_illusionVQA.sh
# mmerealworld
sh ./scripts/EMLoc_mmerealworld_lite.sh
# For more datasets, see ./scripts/