This GitHub repository summarizes papers and resources related to the video generation task.
If you have any suggestions about this repository, please feel free to open a new issue or pull request.
Recent news about this repository is listed below.
🔥 [Nov. 19th] We have released our latest paper, "StableV2V: Stablizing Shape Consistency in Video-to-Video Editing", with the corresponding code, model weights, and a testing benchmark, DAVIS-Edit, open-sourced. Feel free to check them out from the links!
- [Jun. 17th] All NeurIPS 2023 papers and references are updated.
- [Apr. 26th] Added a new direction: Personalized Video Generation.
- [Mar. 28th] The official AAAI 2024 paper list is released! Official versions of PDFs and BibTeX references are updated accordingly.
- Latest Papers
- Update NeurIPS 2024 Papers
- Update ECCV 2024 Papers
- Update CVPR 2024 Papers
- Update PDFs and References of ⚠️ Papers
- Update Published Versions of References
- Update AAAI 2024 Papers
- Update PDFs and References of ⚠️ Papers
- Update Published Versions of References
- Update ICLR 2024 Papers
- Update NeurIPS 2023 Papers
- Previously Published Papers
- Update Previous CVPR papers
- Update Previous ICCV papers
- Update Previous ECCV papers
- Update Previous NeurIPS papers
- Update Previous ICLR papers
- Update Previous AAAI papers
- Update Previous ACM MM papers
- Regular Maintenance of Preprint arXiv Papers and Missed Papers
Name | Organization | Year | Research Paper | Website | Specialties |
---|---|---|---|---|---|
Sora | OpenAI | 2024 | link | link | - |
Lumiere | Google | 2024 | link | link | - |
VideoPoet | Google | 2023 | - | link | - |
W.A.L.T | Google | 2023 | link | link | - |
Gen-2 | Runway | 2023 | - | link | - |
Gen-1 | Runway | 2023 | - | link | - |
Animate Anyone | Alibaba | 2023 | link | link | - |
Outfit Anyone | Alibaba | 2023 | - | link | - |
Stable Video | StabilityAI | 2023 | link | link | - |
Pixeling | HiDream.ai | 2023 | - | link | - |
DomoAI | DomoAI | 2023 | - | link | - |
Emu | Meta | 2023 | link | link | - |
Genmo | Genmo | 2023 | - | link | - |
NeverEnds | NeverEnds | 2023 | - | link | - |
Moonvalley | Moonvalley | 2023 | - | link | - |
Morph Studio | Morph | 2023 | - | link | - |
Pika | Pika | 2023 | - | link | - |
PixelDance | ByteDance | 2023 | link | link | - |
- Year 2024
- arXiv
- Video Diffusion Models: A Survey [Paper]
- Year 2023
- arXiv
- A Survey on Video Diffusion Models [Paper]
- Year 2024
- CVPR
- Vlogger: Make Your Dream A Vlog [Paper] [Code]
- Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project] [Demo]
- VGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [Paper] [Code] [Project]
- GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [Paper] [Project]
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [Paper] [Code] [Project]
- MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation [Paper] [Project] [Video]
- Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [Paper] [Project]
- PEEKABOO: Interactive Video Generation via Masked-Diffusion [Paper] [Code] [Project] [Demo]
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [Paper] [Code] [Project]
- A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [Paper] [Code] [Project]
- BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [Paper] [Project]
- Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project]
- Animate Anyone: Consistent and Controllable Image-to-video Synthesis for Character Animation [Paper] [Code] [Project]
- MotionDirector: Motion Customization of Text-to-Video Diffusion Models [Paper] [Code]
- Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation [Paper] [Project]
- DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation [Paper] [Code]
- Grid Diffusion Models for Text-to-Video Generation [Paper] [Code] [Video]
- ECCV
- Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [Paper] [Project]
- W.A.L.T.: Photorealistic Video Generation with Diffusion Models [Paper] [Project]
- MoVideo: Motion-Aware Video Generation with Diffusion Models [Paper]
- DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model [Paper] [Code] [Project]
- MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [Paper]
- HARIVO: Harnessing Text-to-Image Models for Video Generation [Paper] [Project]
- MEVG: Multi-event Video Generation with Text-to-Video Models [Paper] [Project]
- ICLR
- AAAI
- Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos [Paper] [Code] [Project]
- E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [Paper]
- ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [Paper] [Code] [Project]
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [Paper]
- arXiv
- Lumiere: A Space-Time Diffusion Model for Video Generation [Paper] [Project]
- Boximator: Generating Rich and Controllable Motions for Video Synthesis [Paper] [Project] [Video]
- World Model on Million-Length Video And Language With RingAttention [Paper] [Code] [Project]
- Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [Paper] [Project]
- WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens [Paper] [Code] [Project]
- MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation [Paper] [Project]
- Latte: Latent Diffusion Transformer for Video Generation [Paper] [Code] [Project]
- Mora: Enabling Generalist Video Generation via A Multi-Agent Framework [Paper] [Code]
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [Paper] [Code] [Project] [Video]
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [Paper]
- StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [Paper] [Code] [Project] [Demo]
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [Paper] [Code] [Project]
- ControlNeXt: Powerful and Efficient Control for Image and Video Generation [Paper] [Code] [Project]
- FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [Paper] [Project]
- Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data [Paper] [Code]
- Fine-gained Zero-shot Video Sampling [Paper] [Project]
- ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [Paper] [Code] [Project] [Video]
- ConFiner: Training-free Long Video Generation with Chain of Diffusion Model Experts [Paper] [Code]
- 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation [Paper] [Code] [Project]
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [Paper] [Code] [Project]
- Others
- Sora: Video Generation Models as World Simulators [Paper]
- Year 2023
- CVPR
- Align your Latents: High-resolution Video Synthesis with Latent Diffusion Models [Paper] [Project] [Reproduced code]
- Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators [Paper] [Code] [Demo] [Project]
- Video Probabilistic Diffusion Models in Projected Latent Space [Paper] [Code]
- ICCV
- NeurIPS
- ICLR
- CogVideo: Large-scale Pretraining for Text-to-video Generation via Transformers [Paper] [Code] [Demo]
- Make-A-Video: Text-to-video Generation without Text-video Data [Paper] [Project] [Reproduced code]
- Phenaki: Variable Length Video Generation From Open Domain Textual Description [Paper] [Reproduced Code]
- arXiv
- Control-A-Video: Controllable Text-to-video Generation with Diffusion Models [Paper] [Code] [Demo] [Project]
- ControlVideo: Training-free Controllable Text-to-video Generation [Paper] [Code]
- Imagen Video: High Definition Video Generation with Diffusion Models [Paper]
- Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-video Generation [Paper] [Project]
- LAVIE: High-quality Video Generation with Cascaded Latent Diffusion Models [Paper] [Code] [Project]
- Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-video Generation [Paper] [Code] [Project]
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [Paper] [Code] [Project]
- VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-video Generation [Paper] [Dataset]
- VideoGen: A Reference-guided Latent Diffusion Approach for High Definition Text-to-video Generation [Paper] [Code]
- InstructVideo: Instructing Video Diffusion Models with Human Feedback [Paper] [Code] [Project]
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [Paper] [Code] [Project]
- VideoLCM: Video Latent Consistency Model [Paper]
- ModelScope Text-to-Video Technical Report [Paper] [Code]
- LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [Paper] [Code] [Project]
- STG: Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling [Paper] [Code] [Project]
- Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation [Paper] [Project]
- NOVA: Autoregressive Video Generation without Vector Quantization [Paper] [Code] [Project]
- Year 2022
- Year 2021
- Year 2024
- CVPR
- ECCV
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [Paper]
- PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation [Paper] [Code] [Project]
- MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model [Paper] [Code] [Project]
- AAAI
- Decouple Content and Motion for Conditional Image-to-Video Generation [Paper]
- arXiv
- ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation [Paper] [Code] [Project]
- I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [Paper] [Code]
- Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts [Paper] [Code] [Project]
- AtomoVideo: High Fidelity Image-to-Video Generation [Paper] [Project] [Video]
- Pix2Gif: Motion-Guided Diffusion for GIF Generation [Paper] [Code] [Project]
- ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [Paper] [Code] [Project]
- Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation [Paper] [Project]
- MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer [Paper] [Code]
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis [Paper] [Code] [Project] [Demo]
- Year 2023
- CVPR
- arXiv
- I2VGen-XL: High-quality Image-to-video Synthesis via Cascaded Diffusion Models [Paper] [Code] [Project]
- DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [Paper] [Code] [Project]
- DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [Paper] [Project] [Code] [Video] [Demo]
- AnimateDiff: Animate Your Personalized Text-to-image Diffusion Models without Specific Tuning [Paper] [Project]
- Year 2022
- Year 2021
- Year 2024
- Year 2023
- Year 2024
- CVPR
- ECCV
- arXiv
- Year 2023
- arXiv
- FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention [Paper] [Code] [Demo]
- Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [Paper] [Project]
- DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [Paper] [Project]
- Year 2024
- CVPR
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [Paper] [Code] [Project]
- Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis [Paper] [Project]
- CCEdit: Creative and Controllable Video Editing via Diffusion Models [Paper] [Code] [Project] [Video]
- DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing [Paper] [Project] [Video]
- Video-P2P: Video Editing with Cross-attention Control [Paper] [Code] [Project]
- A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing [Paper] [Code] [Project]
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [Paper] [Code] [Project]
- VidToMe: Video Token Merging for Zero-Shot Video Editing [Paper] [Code] [Project] [Video]
- Towards Language-Driven Video Inpainting via Multimodal Large Language Models [Paper] [Code] [Project] [Dataset]
- AVID: Any-Length Video Inpainting with Diffusion Model [Paper] [Code] [Project]
- CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing [Paper] [Code]
- Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer [Paper] [Code] [Project]
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [Paper] [Code] [Project]
- MotionEditor: Editing Video Motion via Content-Aware Diffusion [Paper] [Code] [Project]
- ECCV
- DragVideo: Interactive Drag-style Video Editing [Paper]
- Video Editing via Factorized Diffusion Distillation [Paper]
- OCD: Object-Centric Diffusion for Efficient Video Editing [Paper] [Project]
- DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing [Paper] [Project]
- WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing [Paper] [Project]
- DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency [Paper]
- SAVE: Protagonist Diversification with Structure Agnostic Video Editing [Paper] [Code]
- Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion [Paper] [Code] [Project]
- ICLR
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [Paper] [Code] [Project]
- TokenFlow: Consistent Diffusion Features for Consistent Video Editing [Paper] [Code] [Project]
- Consistent Video-to-Video Transfer Using Synthetic Dataset [Paper] [Code]
- FLATTEN: Optical FLow-guided ATTENtion for Consistent Text-to-Video Editing [Paper] [Code] [Project]
- SIGGRAPH
- arXiv
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [Paper] [Code] [Project]
- UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [Paper] [Code] [Project]
- DragAnything: Motion Control for Anything using Entity Representation [Paper] [Code] [Project]
- AnyV2V: A Plug-and-Play Framework for Any Video-to-Video Editing Tasks [Paper] [Code] [Project]
- CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility [Paper] [Code] [Project]
- VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [Paper]
- StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [Paper] [Code] [Project] [Dataset]
- Motion Inversion for Video Customization [Paper] [Code] [Demo]
- Year 2023
- CVPR
- ICCV
- NeurIPS
- Towards Consistent Video Editing with Text-to-Image Diffusion Models [Paper]
- SIGGRAPH
- arXiv
- Year 2022
- [arXiv 2012] UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild [Paper] [Dataset]
- [arXiv 2017] DAVIS: The 2017 DAVIS Challenge on Video Object Segmentation [Paper] [Dataset]
- [ICCV 2019] FaceForensics++: Learning to Detect Manipulated Facial Images [Paper] [Code]
- [NeurIPS 2019] TaiChi-HD: First Order Motion Model for Image Animation [Paper] [Dataset]
- [ECCV 2020] SkyTimeLapse: DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [Paper] [Code]
- [ICCV 2021] WebVid-10M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [Paper] [Dataset] [Code] [Project]
- [ECCV 2022] ROS: Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining [Paper] [Code] [Dataset]
- [arXiv 2023] HD-VG-130M: VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-video Generation [Paper] [Dataset]
- [NeurIPS 2023] FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation [Paper] [Code]
- [ICLR 2024] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [Paper] [Dataset]
- [CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers [Paper] [Dataset] [Project]
- [arXiv 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [Paper] [Dataset]
- [CVPR 2024] VBench: Comprehensive Benchmark Suite for Video Generative Models [Paper] [Code]
- [ICCV 2023] DOVER: Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives [Paper] [Code]
- [ICLR 2019] FVD: A New Metric for Video Generation [Paper] [Code]
- Q: In what order are the conferences in this paper list arranged?
- This paper list is organized according to the following sequence:
- CVPR
- ICCV
- ECCV
- NeurIPS
- ICLR
- AAAI
- ACM MM
- SIGGRAPH
- arXiv
- Others
- Q: What does `Others` refer to?
- Some of the studies listed here (e.g., `Sora`) do not publish their technical reports on arXiv. Instead, they tend to publish a blog post on their official websites. The `Others` category refers to such studies.
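For contributors adding entries, the conference sequence given in the first answer above can be expressed as a sort key. The following is only a minimal sketch: the tuple layout and the function name `sort_entries` are assumptions made for illustration, and ordering years newest-first is inferred from how the lists in this README are laid out rather than from any script in the repository.

```python
# Venue order copied from the FAQ answer; unlisted venues sort last (an assumption).
VENUE_ORDER = [
    "CVPR", "ICCV", "ECCV", "NeurIPS", "ICLR",
    "AAAI", "ACM MM", "SIGGRAPH", "arXiv", "Others",
]
_RANK = {name: index for index, name in enumerate(VENUE_ORDER)}


def sort_entries(entries):
    """Sort (year, venue, title) tuples: newest year first, then FAQ venue order."""
    return sorted(
        entries,
        key=lambda entry: (-entry[0], _RANK.get(entry[1], len(VENUE_ORDER))),
    )


# A few entries taken from the lists in this README.
entries = [
    (2023, "arXiv", "Imagen Video"),
    (2024, "ECCV", "MoVideo"),
    (2024, "CVPR", "Vlogger"),
    (2023, "CVPR", "Text2Video-Zero"),
]

for year, venue, title in sort_entries(entries):
    print(year, venue, title)
```

Because Python's `sorted` is stable, entries that share a year and venue keep the order in which they were supplied.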
The `reference.bib` file summarizes BibTeX references of up-to-date video generation papers, widely used datasets, and toolkits.
Based on the original references, I have made the following modifications to make them look nice in LaTeX manuscripts:
- References are normally keyed in the form `author-etal-year-nickname`. In particular, references to datasets and toolkits are keyed directly as `nickname`, e.g., `imagenet`.
- In each reference, all names of conferences/journals are converted into abbreviations, e.g., `Computer Vision and Pattern Recognition -> CVPR`.
- The `url`, `doi`, `publisher`, `organization`, `editor`, and `series` fields are removed from all references.
- The `pages` field is added to any reference in which it was missing.
- All paper names are in title case. In addition, I have added an extra `{}` to make sure that the title case is also preserved in templates that would otherwise change it.
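The conventions above could be applied to a parsed entry roughly as follows. This is a hedged sketch, not the script used to build `reference.bib`: the helper names `make_key` and `normalize_entry`, the dictionary representation of an entry, and the sample data are all hypothetical, and only the CVPR abbreviation comes from this README. Adding missing `pages` values requires looking them up and is not covered.

```python
# Fields that the conventions above say are removed from every reference.
DROPPED_FIELDS = {"url", "doi", "publisher", "organization", "editor", "series"}

# Hypothetical abbreviation table; only the CVPR mapping is stated in the README.
VENUE_ABBREVIATIONS = {"Computer Vision and Pattern Recognition": "CVPR"}


def make_key(first_author_surname, year, nickname):
    """Build a citation key in the author-etal-year-nickname form."""
    return "-".join([first_author_surname.lower(), "etal", str(year), nickname.lower()])


def normalize_entry(fields):
    """Return a copy of a BibTeX field dict with the conventions applied.

    Assumes each value holds the field content without BibTeX's outer delimiters.
    """
    cleaned = {k: v for k, v in fields.items() if k.lower() not in DROPPED_FIELDS}
    for name in ("booktitle", "journal"):
        if name in cleaned:
            cleaned[name] = VENUE_ABBREVIATIONS.get(cleaned[name], cleaned[name])
    if "title" in cleaned and not cleaned["title"].startswith("{"):
        # The extra braces protect the title case from being changed by a template.
        cleaned["title"] = "{" + cleaned["title"] + "}"
    return cleaned


# Made-up entry used purely for illustration.
entry = {
    "title": "An Example Paper on Video Generation",
    "booktitle": "Computer Vision and Pattern Recognition",
    "year": "2024",
    "url": "https://example.com/paper",
    "doi": "10.0000/example",
}

print(make_key("Doe", 2024, "ExampleNet"))
print(normalize_entry(entry))
```

Dataset and toolkit references would skip `make_key` and use the bare nickname as the key, as described in the first bullet.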
If you need a different reference format, you may refer to the original references of the papers by searching for their names in DBLP or Google Scholar.