-
Notifications
You must be signed in to change notification settings - Fork 0
/
Answer for PaLM-E.txt
26 lines (12 loc) · 1.46 KB
/
Answer for PaLM-E.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
Answer for PaLM-E
1. What is grounding?
Connecting those representations to real-world visual and physical sensor modalities is essential to solving a wider range of grounded real-world problems in computer vision and robotics.
2. What is zero-shot multimodal chain-of-thought reasoning? Use Figure 2 for examples.
3. What are the main building blocks of PaLM-E? What are the inputs and outputs?
The inputs to PaLM-E consist of text and (multiple) continuous observations. The multimodal tokens corresponding to these observations are interleaved with the text to form multi-modal sentences. An example of such a multi-modal sentence is Q: What happened between <img 1> and <img 2>? where <img i> represents an embedding of an image. The output of PaLM-E is text generated auto-regressively by the model, which could be an answer to a question, or a sequence of decisions produced by PaLM-E in textual form that should be executed by a robot.
4. Explain the meaning of eq. (1) and how it relates to eq. (2).
5. Describe the different types of input representations tested in PaLM-E and their purpose.
6. What does Figure 3 illustrate?
7. Describe the experiment shown in Figure 5.
8. What are the motivations to use large language models in robotics?
In contrast, PaLM-E is trained to generate plans directly without relying on auxiliary models for grounding. This in turn enables direct integration of the rich semantic knowledge stored in pretrained LLMs into the planning process.