Navigation of CVPR 2021 Papers
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | XMC-GAN | Cross-Modal Contrastive Learning for Text-to-Image Generation | paper | CVPR 2021 | Google Research |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | MVDNet | Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals | paper code | CVPR 2021 | University of California San Diego |
| 1 | - | Multi-Modal Fusion Transformer for End-to-End Autonomous Driving | paper | CVPR 2021 | Max Planck Institute for Intelligent Systems |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | VLN | Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals | paper code | CVPR 2021 | University of California San Diego |
| 1 | SSM | Structured Scene Memory for Vision-Language Navigation | paper | CVPR 2021 | Beijing Institute of Technology |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Semantic-Aware Video Text Detection | paper | CVPR 2021 | National Laboratory of Pattern Recognition |
| 1 | TRBA | What If We Only Use Real Datasets for Scene Text Recognition? Toward Scene Text Recognition With Fewer Labels | paper code | CVPR 2021 | The University of Tokyo |
| 2 | Multiplexed TextSpotter | A Multiplexed Network for End-to-End, Multilingual OCR | paper | CVPR 2021 | Facebook AI |
| 3 | STKM | Self-attention based Text Knowledge Mining for Text Detection | paper | CVPR 2021 | Shenzhen University |
| 4 | TextOCR | TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text | paper | CVPR 2021 | Facebook AI Research |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Multi-Modal Relational Graph for Cross-Modal Video Moment Retrieval | paper | CVPR 2021 | Hunan University |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | How2Sign | How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language | paper dataset | CVPR 2021 | Universitat Politècnica de Catalunya |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Image Change Captioning by Learning from an Auxiliary Task | paper | CVPR 2021 | University of Manitoba |
| 1 | UC^2 | UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training | paper | CVPR 2021 | University of California, Davis |
| 2 | - | How Transferable are Reasoning Patterns in VQA? | paper code | CVPR 2021 | INSA Lyon |
| 3 | M3P | M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training | paper | CVPR 2021 | HiT |
| 4 | CC12M | Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts | paper | CVPR 2021 | Google Research |
| 5 | - | Separating Skills and Concepts for Novel Visual Question Answering | paper | CVPR 2021 | UIUC |
| 6 | VinVL | VinVL: Revisiting Visual Representations in Vision-Language Models | paper code | CVPR 2021 | Microsoft |
| 7 | - | Domain-robust VQA with diverse datasets and methods but no target labels | paper | CVPR 2021 | University of Pittsburgh |
| 8 | PCME | Probabilistic Embeddings for Cross-Modal Retrieval | paper code | CVPR 2021 | NAVER AI Lab |
| 9 | - | Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers | paper | CVPR 2021 | DeepMind |
| 10 | TAP | TAP: Text-Aware Pre-training for Text-VQA and Text-Caption | paper | CVPR 2021 | University of Rochester |
| 11 | Causal Attention | Causal Attention for Vision-Language Tasks | paper code | CVPR 2021 | Nanyang Technological University, Singapore |
| 12 | VirTex | VirTex: Learning Visual Representations from Textual Annotations | paper | CVPR 2021 | University of Michigan |
| 13 | - | Predicting Human Scanpaths in Visual Question Answering | paper | CVPR 2021 | University of Minnesota |
| 14 | Kaleido-BERT | Kaleido-BERT: Vision-Language Pre-training on Fashion Domain | paper code | CVPR 2021 | Alibaba Group |
| 15 | - | Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning | paper | CVPR 2021 | University of Science and Technology Beijing |
| 16 | - | Learning by Planning: Language-Guided Global Image Editing | paper code | CVPR 2021 | University of Rochester |
| 17 | KRISP | KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA | paper code | CVPR 2021 | Facebook AI Research |
| 18 | - | Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval | paper | CVPR 2021 | Peking University |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | ClipBERT | Less Is More: ClipBERT for Video-and-Language Learning via Sparse Sampling | paper code | CVPR 2021 | UNC |
| 1 | - | SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events | paper code | CVPR 2021 | Singapore University of Technology and Design |
| 2 | - | Open-book Video Captioning with Retrieve-Copy-Generate Network | paper | CVPR 2021 | Institute of Automation, Chinese Academy of Sciences |
| 3 | NExT-QA | NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions | paper code | CVPR 2021 | National University of Singapore |
| 4 | AGQA | AGQA: A Benchmark for Compositional Spatio-Temporal Reasoning | paper | CVPR 2021 | Stanford University |
| 5 | - | Bridge to Answer: Structure-aware Graph Interaction Network for Video Question Answering | paper | CVPR 2021 | Yonsei University, South Korea |
| 6 | - | Look Before you Speak: Visually Contextualized Utterances | paper | CVPR 2021 | Google Research |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Cross-Modal Center Loss for 3D Cross-Modal Retrieval | paper | CVPR 2021 | The City University of New York |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | Vx2Text | VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs | paper | CVPR 2021 | Columbia University |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | cINNs | Stochastic Image-to-Video Synthesis using cINNs | paper | CVPR 2021 | Heidelberg University |
| 1 | - | Understanding Object Dynamics for Interactive Image-to-Video Synthesis | paper code | CVPR 2021 | Heidelberg University |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Can audio-visual integration strengthen robustness under multimodal attacks? | paper | CVPR 2021 | University of Rochester |
| 1 | - | Audio-Visual Instance Discrimination with Cross-Modal Agreement | paper | CVPR 2021 | UC San Diego |
| 2 | - | VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency | paper code | CVPR 2021 | The University of Texas at Austin |
| No. | Model Name | Title | Links | Pub. | Organization |
|---|---|---|---|---|---|
| 0 | - | Collaborative Spatial-Temporal Modeling for Language-Queried Video Actor Segmentation | paper | CVPR 2021 | Chinese Academy of Sciences |