A list of materials related to text2scene
- Learning the Visual Interpretation of Sentences (ICCV-2013)
- Target: text -> cartoon-like scene
- Method: Statistical learning - Conditional Random Field (CRF)
- Dataset: Abstract Scene Dataset
- Supplementary Material
- Text2Scene: Generating Compositional Scenes from Textual Descriptions (CVPR-2019) #thorough
- Target: text -> Cartoon-like scenes & Object layouts & Synthetic scenes
- Method: End-to-end deep learning (recurrent CNN + attention); unified framework (see the sketch below)
- Dataset: Abstract Scene Dataset; COCO
- Code: Text2Scene
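A minimal sketch of the step-wise decoding idea in Text2Scene (attend to the encoded sentence, then predict the next object and where to place it); names, dimensions, and the location grid here are placeholders, not the authors' implementation:

```python
# Minimal sketch of a Text2Scene-style decoding step (assumed names/dims, not the
# authors' code): at every step the decoder attends to the encoded sentence, then
# predicts which object to add next and which grid cell to place it in.
import torch
import torch.nn as nn

class StepDecoder(nn.Module):
    def __init__(self, vocab=1000, n_objects=58, n_cells=28 * 28, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)      # sentence encoder
        self.decoder_cell = nn.GRUCell(dim, dim)                # scene-state recurrence
        self.attn = nn.Linear(dim, dim)
        self.obj_head = nn.Linear(2 * dim, n_objects)           # which object to add
        self.pos_head = nn.Linear(2 * dim, n_cells)             # where to place it

    def forward(self, tokens, steps=3):
        enc, _ = self.encoder(self.embed(tokens))               # (B, T, dim)
        h = enc.mean(dim=1)                                      # initial scene state
        outputs = []
        for _ in range(steps):
            # attention over the sentence, conditioned on the current scene state
            scores = torch.bmm(self.attn(enc), h.unsqueeze(2)).squeeze(2)  # (B, T)
            ctx = torch.bmm(torch.softmax(scores, dim=1).unsqueeze(1), enc).squeeze(1)
            h = self.decoder_cell(ctx, h)
            feat = torch.cat([h, ctx], dim=1)
            outputs.append((self.obj_head(feat), self.pos_head(feat)))
        return outputs  # per step: object logits and location logits

tokens = torch.randint(0, 1000, (2, 12))                         # toy batch of captions
print([obj.shape for obj, pos in StepDecoder()(tokens)])
```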
- Predicting Object Dynamics in Scenes (CVPR-2014)
- Target: scene -> next scene
- Dataset: Abstract Scene Dataset
- Visual Abstraction for Zero-Shot Learning (ECCV-2014)
- Target: learn concepts involving individual poses and interactions between two people
- Dataset: Abstract scenes depicting fine-grained interactions between two people
- Webpage
- Learning common sense through visual abstraction (ICCV-2015)
- Target: Assess the plausibility of the interaction in a scene
- Dataset: Second Generation Abstract Scene Dataset
- Webpage
- Learning Spatial Knowledge for Text to 3D Scene Generation (EMNLP-2014) #thorough
- Target: Text -> 3D scene - room layout
- Method: Mostly rule-based + Bayesian + NLP (see the sketch below)
- Dataset: Collected dataset of spatial relation descriptions
- Learned spatial relation mapping
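To make "learned spatial knowledge" concrete, here is an illustrative sketch that fits a Gaussian prior over relative offsets for each spatial relation word and scores candidate placements against it; the relation vocabulary and the Gaussian model are assumptions for illustration, not the paper's exact formulation:

```python
# Illustrative sketch: learn a Gaussian prior over relative (dx, dz) offsets for each
# spatial relation word, then score candidate placements with it. The relation
# vocabulary and the Gaussian model are assumptions made for clarity.
import numpy as np

def fit_relation_priors(examples):
    """examples: list of (relation_word, (dx, dz)) pairs from annotated scenes."""
    priors = {}
    for rel in {r for r, _ in examples}:
        offsets = np.array([o for r, o in examples if r == rel])
        priors[rel] = (offsets.mean(axis=0), np.cov(offsets.T) + 1e-6 * np.eye(2))
    return priors

def placement_score(priors, rel, offset):
    """Log-density of a candidate offset under the prior for `rel`."""
    mean, cov = priors[rel]
    d = np.asarray(offset) - mean
    return float(-0.5 * d @ np.linalg.inv(cov) @ d
                 - 0.5 * np.log(np.linalg.det(2 * np.pi * cov)))

priors = fit_relation_priors([("next_to", (0.5, 0.1)), ("next_to", (0.6, -0.2)),
                              ("on_top_of", (0.0, 0.0)), ("on_top_of", (0.1, 0.05))])
print(placement_score(priors, "next_to", (0.55, 0.0)))
```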
- Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation (ACL-2014-Workshop)
- Target: Text -> 3D scene - room layout
- Method: Interactive learning
- Text to 3D Scene Generation with Rich Lexical Grounding (ACL-2015) #thorough
- Target: Text -> 3D scene - room layout
- Method: Mostly rule-based + Supervised learning of lexical grounding (i.e. matching text to 3D objects)
- Dataset: Collected dataset of scene-description pairs
- Text2Shape: Generating Shapes from Natural Language by Learning Joint Embeddings (CVPR-2018)
- Target: text -> colored 3D shapes of tables and chairs
- Method: Joint metric learning to capture many-to-many relations between text and properties of 3D shapes
- Dataset: ShapeNet and manually collected text descriptions
- Code: text2shape
- Generative Adversarial Text to Image Synthesis (ICML-2016)
- Target: text -> photographic image (Bird & flower)
- Method: Both the generator and the discriminator are conditioned on the text embedding (see the sketch below)
- Dataset: CUB dataset of bird images; Oxford-102 dataset of flower images
- Code: Generative Adversarial Text-to-Image Synthesis
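The conditioning trick fits in a few lines: concatenate the text embedding with the noise vector in the generator and with the image features in the discriminator. A minimal sketch with assumed sizes (not the paper's exact architecture):

```python
# Minimal text-conditioned GAN sketch (assumed sizes, not the paper's exact nets):
# the text embedding is concatenated with the noise in G and with image features in D.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, z_dim=100, txt_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim + txt_dim, 512), nn.ReLU(),
            nn.Linear(512, 3 * 64 * 64), nn.Tanh())

    def forward(self, z, txt):
        return self.net(torch.cat([z, txt], dim=1)).view(-1, 3, 64, 64)

class Discriminator(nn.Module):
    def __init__(self, txt_dim=128):
        super().__init__()
        self.img_feat = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU())
        self.score = nn.Sequential(nn.Linear(512 + txt_dim, 1), nn.Sigmoid())

    def forward(self, img, txt):
        return self.score(torch.cat([self.img_feat(img), txt], dim=1))

z, txt = torch.randn(4, 100), torch.randn(4, 128)     # noise + caption embeddings
fake = Generator()(z, txt)
print(Discriminator()(fake, txt).shape)                # (4, 1) real/fake score
```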
- StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks (ICCV-2017)
- Target: text -> photographic image (Bird & flower)
- Method: Two-stage GANs: a coarse image from text first, then a higher-resolution refinement (see the sketch below)
- Dataset: CUB; Oxford-102; MS-COCO
- Code: StackGAN
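A toy sketch of the stacking idea, with assumed modules rather than the released StackGAN code: Stage-I draws a coarse 64x64 image from text + noise, and Stage-II refines it to 256x256 conditioned on the same text embedding:

```python
# Toy two-stage sketch in the spirit of StackGAN (assumed modules, not the released
# code): Stage-I draws a coarse 64x64 image, Stage-II refines it to 256x256
# conditioned on the same text embedding.
import torch
import torch.nn as nn

class StageI(nn.Module):
    def __init__(self, z_dim=100, txt_dim=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(z_dim + txt_dim, 3 * 64 * 64), nn.Tanh())

    def forward(self, z, txt):
        return self.fc(torch.cat([z, txt], 1)).view(-1, 3, 64, 64)

class StageII(nn.Module):
    def __init__(self, txt_dim=128):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 + txt_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, low_res, txt):
        # tile the text embedding spatially and concatenate with the Stage-I image
        txt_map = txt[:, :, None, None].expand(-1, -1, 64, 64)
        return self.refine(torch.cat([low_res, txt_map], 1))

z, txt = torch.randn(2, 100), torch.randn(2, 128)
low = StageI()(z, txt)
print(StageII()(low, txt).shape)   # torch.Size([2, 3, 256, 256])
```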
- Semi-parametric Image Synthesis (CVPR-2018)
- Target: Semantic layout -> Photographic image
- Method (see the sketch below):
  - Parametric + Non-parametric (segment retrieval)
  - Segment database -> retrieve -> composite -> resolve occlusion -> post-process
- Dataset: Cityscapes; NYU; ADE20K (See Datasets in this paper)
- Code: SIMS
- Demo: Semi-parametric Image Synthesis
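The non-parametric half can be pictured as follows: for every region in the target semantic layout, retrieve the database segment whose mask overlaps it best and paste its pixels onto a canvas; the parametric network then resolves seams and occlusions. A toy retrieve-and-composite sketch with assumed data structures (not the SIMS code):

```python
# Toy sketch of the retrieve-and-composite half of semi-parametric synthesis (assumed
# data structures, not the SIMS code): for every region in the target semantic layout,
# fetch the database segment whose mask overlaps it best and paste its pixels onto the
# canvas; a parametric network would then clean up seams and occlusions.
import numpy as np

def iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def composite(layout, database, n_classes, canvas_shape=(64, 64, 3)):
    """layout: (H, W) class-id map; database: list of (class_id, mask, pixels)."""
    canvas = np.zeros(canvas_shape)
    for cls in range(n_classes):
        region = layout == cls
        if not region.any():
            continue
        candidates = [(iou(region, m), m, px) for c, m, px in database if c == cls]
        if not candidates:
            continue
        _, mask, pixels = max(candidates, key=lambda t: t[0])
        canvas[mask] = pixels[mask]          # paste the retrieved segment
    return canvas

layout = np.zeros((64, 64), dtype=int); layout[20:50, 10:40] = 1
mask = np.zeros((64, 64), dtype=bool); mask[18:52, 12:38] = True
db = [(1, mask, np.random.rand(64, 64, 3))]
print(composite(layout, db, n_classes=2).shape)
```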
- Image Generation from Scene Graphs (CVPR-2018)
- Target: Scene graphs -> Photographic images
- Method (see the sketch below):
  - Ground-truth object positions -> scene graphs
  - Graph processing: graph convolution network
  - Symbolic graph -> scene layout: bounding box & segmentation prediction
  - Scene layout -> image: cascaded refinement network (CRN)
  - Image -> realistic image: adversarial training
- Dataset:
  - Visual Genome: human-annotated scene graphs provided
  - COCO-Stuff: COCO with pixel-level stuff annotations
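A bare-bones sketch of the graph-processing and layout steps: one round of message passing over (subject, predicate, object) triples updates the object vectors, and a small head regresses a bounding box per object; a CRN-style network would then turn the layout into pixels. Names and sizes here are illustrative only:

```python
# Bare-bones sketch of the scene-graph-to-layout steps (illustrative names/sizes):
# one round of message passing over (subject, predicate, object) triples updates the
# object vectors, and a small head regresses a normalized bounding box per object.
import torch
import torch.nn as nn

class GraphToLayout(nn.Module):
    def __init__(self, n_obj_types=100, n_pred_types=20, dim=64):
        super().__init__()
        self.obj_emb = nn.Embedding(n_obj_types, dim)
        self.pred_emb = nn.Embedding(n_pred_types, dim)
        self.message = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.box_head = nn.Linear(dim, 4)                 # (x, y, w, h) per object

    def forward(self, objects, triples):
        vecs = self.obj_emb(objects).clone()              # (N, dim) object vectors
        for s, p, o in triples:                           # one round of message passing
            msg = self.message(torch.cat(
                [vecs[s], self.pred_emb(torch.tensor(p)), vecs[o]]))
            vecs[s] = vecs[s] + msg                       # crude symmetric update
            vecs[o] = vecs[o] + msg
        return torch.sigmoid(self.box_head(vecs))         # normalized boxes

objects = torch.tensor([3, 7, 7])                         # e.g. sky, sheep, sheep
triples = [(1, 2, 0), (2, 5, 1)]                          # (subject, predicate, object)
print(GraphToLayout()(objects, triples))                  # one box per object
```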
- Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis (CVPR-2018)
- Target: text -> photographic image
- Method: text -> semantic layout (box layout & shape) -> image
- Dataset: COCO
- Image Ranking and Retrieval based on Multi-Attribute Queries (CVPR-2011)
- Image Retrieval Using Scene Graphs (CVPR-2015)
- Target: Textual query -> Semantically related image
- Method: Scene graph; Conditional random field
- Dataset: real-world scene graphs: manually labeled YFCC100m & COCO images
- Generating Videos with Scene Dynamics (NIPS-2016)
- Target: video generation (trained on unlabeled video)
- Method: Scene decomposition: moving foreground stream + static background stream + blend mask, combined in a single GAN (see the sketch below)
- Dataset: A large amount of unlabeled video downloaded from Flickr
- Code: videoGAN
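The decomposition fits in one line: the generated video is a per-pixel mask blending a moving foreground stream with a static background, video = m * f + (1 - m) * b. A minimal sketch of that compositing step with toy sizes (not the released videoGAN code):

```python
# Minimal sketch of the foreground/background/mask compositing in a VGAN-style
# generator (toy sizes, not the released videoGAN code):
# video = mask * foreground + (1 - mask) * static background.
import torch
import torch.nn as nn

class TwoStreamGenerator(nn.Module):
    def __init__(self, z_dim=100, frames=8, size=32):
        super().__init__()
        self.frames, self.size = frames, size
        self.fg = nn.Linear(z_dim, frames * 3 * size * size)    # moving foreground
        self.mask = nn.Linear(z_dim, frames * 1 * size * size)  # per-pixel blend mask
        self.bg = nn.Linear(z_dim, 3 * size * size)             # single static background

    def forward(self, z):
        B, T, S = z.size(0), self.frames, self.size
        f = torch.tanh(self.fg(z)).view(B, 3, T, S, S)
        m = torch.sigmoid(self.mask(z)).view(B, 1, T, S, S)
        b = torch.tanh(self.bg(z)).view(B, 3, 1, S, S)          # broadcast over time
        return m * f + (1 - m) * b

print(TwoStreamGenerator()(torch.randn(2, 100)).shape)          # (2, 3, 8, 32, 32)
```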
- Visual Dynamics: Stochastic Future Generation via Layered Cross Convolutional Networks (NIPS-2016)
- Target: frame -> next frame
- Method: Probabilistic (conditional variational autoencoder with cross convolutional layers)
- To Create What You Tell: Generating Videos from Captions (ACM-MM-2017)
- Target: caption -> video
- Method: Conditional GAN: LSTM caption encoder + convolutional generator + 3 discriminators conditioned on the caption
- Dataset:
  - Synthesized videos of handwritten digits bouncing
  - Video snippets from YouTube about cooking
- Imagine This! Scripts to Compositions to Videos (ECCV-2018)
- Target: text -> scene video
- Method: Entity & Background retrieval + Layout composer
- Dataset: FLINTSTONES: richly-annotated video-caption dataset
- Demo: CRAFT
- Video Generation from Text (AAAI-2018)
- Target: text -> video
- Method: text -> gist -> video (VAE + GAN)
- Dataset: Videos crawled from YouTube along with titles and descriptions
- MoCoGAN: Decomposing Motion and Content for Video Generation (CVPR-2018)
- Target: video generation
- Method: Content + motion decomposition (GAN): a fixed content code per clip and per-frame motion codes from an RNN (see the sketch below)
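A toy sketch of the content/motion split, with assumed sizes rather than the authors' code: one content code is sampled per clip and held fixed, a GRU emits a motion code per frame, and an image generator renders each frame from the pair:

```python
# Toy sketch of content/motion decomposition in the spirit of MoCoGAN (assumed sizes,
# not the authors' code): one fixed content code per clip, a GRU emits a motion code
# per frame, and an image generator renders each frame from the pair.
import torch
import torch.nn as nn

class MotionContentGenerator(nn.Module):
    def __init__(self, content_dim=50, motion_dim=10, size=32):
        super().__init__()
        self.motion_dim, self.size = motion_dim, size
        self.rnn = nn.GRU(motion_dim, motion_dim, batch_first=True)
        self.frame_gen = nn.Sequential(
            nn.Linear(content_dim + motion_dim, 3 * size * size), nn.Tanh())

    def forward(self, content, n_frames=8):
        B = content.size(0)
        eps = torch.randn(B, n_frames, self.motion_dim)     # per-frame motion noise
        motion, _ = self.rnn(eps)                            # (B, T, motion_dim)
        content = content.unsqueeze(1).expand(-1, n_frames, -1)
        frames = self.frame_gen(torch.cat([content, motion], dim=2))
        return frames.view(B, n_frames, 3, self.size, self.size)

print(MotionContentGenerator()(torch.randn(2, 50)).shape)   # (2, 8, 3, 32, 32)
```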
- TFGAN: Improving Conditioning for Text-to-Video Synthesis (2018) #Withdrawn
- Generating Animated Videos of Human Activities from Natural Language Descriptions (NIPS-2018)
- Target: text -> a sequence of 3D human skeletal poses
- Method (see the sketch below):
  - Autoencoder: learn a representation of human motions without text
  - Seq2seq: map text into the motion representation
- Dataset: The KIT Motion-Language Dataset
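A rough sketch of the two pieces described above, with assumed sizes (not the paper's model): a sequence autoencoder learns a latent code for pose sequences, and a text encoder is trained to map a description to that same code, so decoding the predicted code yields a pose sequence:

```python
# Rough sketch (assumed sizes, not the paper's model): (1) a sequence autoencoder
# learns a per-sequence latent code for pose sequences, and (2) a text encoder is
# trained to map a description to that same code, so decoding the predicted code
# produces a pose sequence.
import torch
import torch.nn as nn

POSE_DIM, HID = 63, 128                       # e.g. 21 joints x 3 coordinates

pose_encoder = nn.GRU(POSE_DIM, HID, batch_first=True)
pose_decoder = nn.GRU(HID, POSE_DIM, batch_first=True)
text_encoder = nn.Sequential(nn.Embedding(5000, HID), nn.GRU(HID, HID, batch_first=True))

def autoencode(poses):
    """poses: (B, T, POSE_DIM) -> reconstruction via a per-sequence latent code."""
    _, latent = pose_encoder(poses)                               # (1, B, HID)
    steps = latent.transpose(0, 1).repeat(1, poses.size(1), 1)    # feed latent each step
    recon, _ = pose_decoder(steps)
    return recon, latent.squeeze(0)

def text_to_latent(tokens):
    _, latent = text_encoder(tokens)                              # (1, B, HID)
    return latent.squeeze(0)

poses, tokens = torch.randn(2, 30, POSE_DIM), torch.randint(0, 5000, (2, 12))
recon, motion_code = autoencode(poses)
# training would pull text_to_latent(tokens) toward motion_code (e.g. an L2 loss)
print(recon.shape, nn.functional.mse_loss(text_to_latent(tokens), motion_code).item())
```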
- Language2Pose: Natural Language Grounded Pose Forecasting (2019)
- Target: text -> pose animation
- Method: learn a joint embedding of text and pose using curriculum learning
- Dataset: The KIT Motion-Language Dataset
- A Pipeline for Creative Visual Storytelling (2018)
- Target: video -> a sequence of text (Pipeline proposed)
- Video Storytelling (2018)
- Target: video -> a sequence of coherent and succinct text
- Method:
  - Contextual multimodal embedding: Residual Bidirectional RNN
  - Narrator: Reinforcement learning