Stars
A Unified Tokenizer for Visual Generation and Understanding
PyTorch implementation of GaussianToken: An Effective Image Tokenizer with 2D Gaussian Splatting
Wan: Open and Advanced Large-Scale Video Generative Models
PyTorch implementation of FractalGen https://arxiv.org/abs/2502.17437
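《代码随想录》 LeetCode problem-solving guide: a recommended order for 200 classic problems, 600,000 characters of detailed illustrated explanations, video walkthroughs of the difficult points, and more than 50 mind maps, with versions in C++, Java, Python, Go, JavaScript, and other languages, so that learning algorithms is no longer confusing! 🔥🔥 Take a look and you will wish you had found it sooner! 🚀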
Paper collections of multi-modal LLM for Math/STEM/Code.
Ola: Pushing the Frontiers of Omni-Modal Language Model
Witness the aha moment of VLM with less than $3.
A fork to add multimodal model training to open-r1
The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention
FastVideo is a lightweight framework for accelerating large video diffusion models.
FaceChain is a deep-learning toolchain for generating your Digital-Twin.
[ICLR 2025] Autoregressive Video Generation without Vector Quantization
A paper list of recent work on token compression for ViT and VLM
[CVPR 2025] 🔥 Official impl. of "TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation".
A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
[ACL'2024 Findings] GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
LiVOS: Light Video Object Segmentation with Gated Linear Matching (CVPR 2025)
A suite of image and video neural tokenizers
A 6-Million Audio-Caption Paired Dataset Built with an LLM- and ALM-Based Automatic Pipeline
Align Anything: Training All-modality Model with Feedback
Allegro is a powerful text-to-video model that generates high-quality videos up to 6 seconds at 15 FPS and 720p resolution from simple text input.