
Commit 2c6e0f0

Merge pull request #18 from MIT-Emerging-Talent/test_prompts
Milestone 3: updated test prompts
2 parents 7b7a41d + 37a71b6

File tree: 8 files changed (+230 −55 lines)

test_dataset_apollo11/.DS_Store (6 KB, binary file not shown)

test_dataset_apollo11/RATIONALE.md

Lines changed: 3 additions & 1 deletion

@@ -42,6 +42,8 @@ The excerpted length balances comprehensiveness with practical testability.

 ## Why These Excerpted Passages?

+![image](images/test-selection.png)
+
 **Continuous Narrative:**

 Selected passages flow from descent through surface activities, forming a natural

@@ -64,7 +66,7 @@ and analytical reasoning.

 **Verified Coverage:**

-All 15 test prompts confirmed answerable with excerpted passages through
+All 21 test prompts confirmed answerable with excerpted passages through
 preliminary testing.

 **Length Management:**

test_dataset_apollo11/README.md

Lines changed: 62 additions & 24 deletions

@@ -6,8 +6,8 @@ This is the unified test dataset for comparing different AI models (commercial,
 distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

 The dataset consists of selected passages from Wikipedia's Apollo 11 article,
-accompanied by 15 standardized prompts testing summarization, reasoning, and
-retrieval-augmented generation capabilities.
+accompanied by 21 standardized prompts testing summarization, reasoning,
+retrieval, paraphrasing, and creative generation capabilities.

 ---

@@ -16,7 +16,7 @@ retrieval-augmented generation capabilities.
 - **[README.md][readme]** - This file (overview and instructions)
 - **[source_text.txt][source]** - Apollo 11 excerpted text
   (~1,400 words, plain text)
-- **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+- **[test_prompts.md][prompts]** - All test prompts (readable format)
 - **[test_data.json][json]** - Complete dataset (structured format for automated
   testing)
 - **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
@@ -56,7 +56,7 @@ operations"

 - Individual sentences are unchanged; some paragraphs omitted for length management.
 - Complete original sections total ~3,800 words; excerpted to ~1,400 words for
-  practical testing while maintaining all information necessary for the 15 test prompts.
+  practical testing while maintaining all information necessary for the 21 test prompts.

 📌 See [source_text.txt][source] for the complete excerpted text.

@@ -72,63 +72,95 @@ technical terms)
   be tested
 - **Narrative structure** - Clear sequence from descent through surface
   activities
-- **All prompts answerable** - 15 test prompts verified to work with selected
+- **All prompts answerable** - 21 test prompts verified to work with selected
   passages

 The excerpts cover the dramatic descent and landing sequence, followed by
 moonwalk activities, ensuring comprehensive testing across summarization,
-reasoning, and RAG tasks.
+reasoning, RAG, paraphrasing, and creative generation tasks.

 📌 See [RATIONALE.md][rationale] for detailed selection methodology.

 ---

 ## 📝 Test Structure

-**15 Standardized Prompts** across three categories:
+![image](images/evaluation-process.png)

-### Summarization (5 prompts)
+The test includes **21 standardized prompts** distributed across **five categories**.
+In addition, a **Master Instruction** and **task-specific guidance prompts** are
+provided to ensure consistency and clarity across all tasks.

-Tests model's ability to condense and extract key information
+### Prompt Delivery Overview
+
+The test follows this sequence:
+
+**1.** The **Master Instruction** is used **once at the beginning** of the test.
+**2.** Before each category, a **task-specific guidance prompt** clarifies how
+the model should approach that task type (e.g., reasoning, summarization, retrieval).
+**3.** Then, the **individual prompts for that category** are presented in order
+of increasing difficulty.
+
+### Prompt Categories
+
+#### 1. Summarization (5 prompts)
+
+Tests model's ability to condense and extract key information.

 **Difficulty:** Easy → Medium → Hard
-**Examples:** Main events, challenges faced, activities performed, equipment
-deployed
+**Examples:** Main events, challenges faced, activities performed, equipment deployed

-### Reasoning (5 prompts)
+#### 2. Reasoning (5 prompts)

-Tests model's ability to analyze, infer, and make connections
+Tests model's ability to analyze, infer, and make connections.

-**Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
-analysis
+**Types:** Causal reasoning, hypothetical scenarios, interpretation,
+deep analysis
 **Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
 manual control? What does Margaret Hamilton's statement reveal?

-### RAG - Retrieval (5 prompts)
+#### 3. RAG Retrieval (5 prompts)

-Tests model's ability to retrieve specific information from source text
+Tests model's ability to retrieve specific information from source text.

 **Types:** Times, quotes, numbers, lists, complex multi-part facts
 **Examples:** Landing time? Material collected? Scientific instruments deployed?

-📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
-for its structured data version.
+#### 4. Paraphrasing (3 prompts)
+
+Tests model's ability to restate information in its own words.
+
+**Difficulty:** Easy → Medium
+**Examples:** Describe the computer alarms, Armstrong's teamwork, or sample collection.
+
+#### 5. Creative Generation (3 prompts)
+
+Tests model's interpretive and imaginative capabilities.
+
+**Difficulty:** Easy → Medium
+**Examples:** Imagine being in Mission Control. What does the landing show about courage?
+How did it change Earth?
+
+📌 See [test_prompts.md][prompts] for the readable version with full prompt texts,
+or [test_data.json][json] for its structured data version.

 ---

 ## 🔧 How to Use

 ### General Instructions

-- **All 15 prompts** should be tested across all models to ensure a fair comparison.
+- **All 21 prompts** should be tested across all models to ensure a fair comparison.
+- The **Master Instruction** and any **task-specific guidance prompts** should
+  be applied as described in the Test Structure section.
 - Some prompts can be more challenging for smaller models,
   but attempting all prompts provides comprehensive evaluation data.

 **Testing Protocol:**

 **1.** Use the source text from **[source_text.txt][source]**
 exactly as provided
-**2.** Use all 15 prompts from **[test_prompts.md][prompts]** without
+**2.** Use all prompts from **[test_prompts.md][prompts]** without
 modification
 **3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
 testing workflows
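The delivery sequence and the optional test_data.json workflow above lend themselves to scripting. A minimal Python sketch, assuming test_data.json exposes the master instruction plus per-category guidance and difficulty-ranked prompts (the schema and field names here are hypothetical; the file's contents are not shown in this commit):

```python
import json

# Load the structured dataset; the field names below are assumptions,
# not confirmed by this commit.
with open("test_data.json") as f:
    data = json.load(f)

# 1. The Master Instruction is delivered once, at the start of the test.
messages = [data["master_instruction"]]

# 2./3. For each of the five categories: the task-specific guidance
# prompt first, then that category's prompts in order of increasing
# difficulty, each used verbatim per the testing protocol.
for category in data["categories"]:
    messages.append(category["guidance"])
    for prompt in sorted(category["prompts"], key=lambda p: p["difficulty"]):
        messages.append(prompt["text"])

# 1 master instruction + 5 guidance prompts + 21 test prompts = 27 messages.
print(len(messages))
```

The same loop can be wrapped per model so that every model receives an identical sequence, which is what the fair-comparison requirement calls for.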
@@ -144,9 +176,15 @@ For each prompt, record:
 **1. Accuracy** - Is the answer factually correct?
 **2. Completeness** - Are all key points covered?
 **3. Specificity** - Are specific details included (times, names, numbers)?
-**4. Reasoning Quality** - For reasoning prompts, is the logic sound and
-well-supported?
-
+**4. Reasoning Quality** - Is the logic sound and well-supported?
+**5. Paraphrasing Quality** - Is information reworded (not copied)
+while maintaining accuracy?
+**6. Creative Generation Quality** - Is the response coherent, relevant, and text-inspired?
+**7. Instruction Following** - Does the model follow the master or task-specific
+instructions (no source mentions, concise, natural)?
+
+**Note:** Creative generation prompts have no single correct answer. Evaluate
+based on coherence, relevance to text, and quality of reasoning.
 Maintain consistent evaluation criteria across all models for fair comparison.

 ---
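For the recording step, one record per model-prompt pair covering the seven criteria above keeps comparisons consistent. A sketch of such a record (the scoring scale and all field names are illustrative assumptions, not part of the dataset):

```python
from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class PromptResult:
    """Scores for one model's response to one prompt (illustrative schema).

    Category-specific criteria (reasoning, paraphrasing, creative) may be
    left as None when they do not apply to the prompt's category.
    """
    model: str
    prompt_id: str
    category: str       # summarization / reasoning / rag / paraphrasing / creative
    accuracy: int       # a 1-5 scale is assumed; the commit does not fix one
    completeness: int
    specificity: int
    instruction_following: int
    reasoning_quality: Optional[int] = None
    paraphrasing_quality: Optional[int] = None
    creative_quality: Optional[int] = None
    notes: str = ""

# Example with hypothetical identifiers: scoring one retrieval prompt.
result = PromptResult(
    model="example-distilled-model", prompt_id="rag-01", category="rag",
    accuracy=5, completeness=4, specificity=5, instruction_following=5,
    notes="Retrieved the landing time correctly.",
)
print(json.dumps(asdict(result), indent=2))
```

Because every record has the same shape, maintaining consistent evaluation criteria across all models reduces to comparing like fields across like prompts.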
test_dataset_apollo11/images/ - three image files added (343 KB, 235 KB, and 381 KB; previews not shown)
