🎶Multi-modal
CLIP (Contrastive Language-Image Pretraining): predict the most relevant text snippet given an image (see the minimal usage sketch after this list)
This repo hosts the code and model of "Separate What You Describe: Language-Queried Audio Source Separation", Interspeech 2022
[ICASSP 2023] FedAudio: A Federated Learning Benchmark for Audio and Speech Tasks
Code for the paper "Learning Audio-Visual Dereverberation"
Toolkits for Multimodal Emotion Recognition
[CVPR 2023] iQuery: Instruments as Queries for Audio-Visual Sound Separation
Using Segment-Anything and CLIP to generate pixel-aligned semantic features.
A toolkit for researchers working on multimodal sound separation.
Survey Paper List - Efficient LLM and Foundation Models
PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
[ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Large World Model -- Modeling Text and Video with Million-Token Context
Code for the paper: "SuS-X: Training-Free Name-Only Transfer of Vision-Language Models" [ICCV'23]
DrFuse: Learning Disentangled Representation for Clinical Multi-Modal Fusion with Missing Modality and Modal Inconsistency (AAAI 2024)
A high-quality tool for converting PDF to Markdown and JSON. A one-stop, open-source, high-quality data extraction tool that converts PDF into Markdown and JSON formats.
A data annotation toolbox that supports image, audio, and video data.
[ECCV 2024] The official implementation of "AdaCLIP: Adapting CLIP with Hybrid Learnable Prompts for Zero-Shot Anomaly Detection"
[ACL 2024] Official resources of "ChatKBQA: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models".
[NeurIPS 2024, spotlight] Scaling Out-of-Distribution Detection for Multiple Modalities
Learning Cross-Modal Retrieval with Noisy Labels (CVPR 2021, PyTorch Code)
[NeurIPS2023] Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
Get your documents ready for gen AI
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
A curated list of awesome prompt/adapter learning methods for vision-language models like CLIP.
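For the CLIP entry above, here is a minimal sketch of the described usage (ranking candidate text snippets by relevance to an image). It assumes the Hugging Face `transformers` port of CLIP and a hypothetical local image file `example.jpg`; it is an illustration, not the repo's official example.

```python
# Minimal sketch (assumptions: Hugging Face `transformers` port of CLIP,
# a local file "example.jpg"); not the official example from the CLIP repo.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
texts = ["a dog playing in snow", "a plate of food", "a city skyline at night"]

# Encode the image and all candidate snippets in one batch
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-to-text similarity scores; softmax ranks the snippets
probs = outputs.logits_per_image.softmax(dim=-1)
print(texts[probs.argmax().item()])  # most relevant snippet for the image
```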