@@ -6,8 +6,8 @@ This is the unified test dataset for comparing different AI models (commercial,
distilled, SLM, and RAG systems) in the ELO2 - Green AI project.

The dataset consists of selected passages from Wikipedia's Apollo 11 article,
- accompanied by 15 standardized prompts testing summarization, reasoning, and
- retrieval-augmented generation capabilities.
+ accompanied by 21 standardized prompts testing summarization, reasoning,
+ retrieval, paraphrasing, and creative generation capabilities.

---

@@ -16,7 +16,7 @@ retrieval-augmented generation capabilities.
- **[README.md][readme]** - This file (overview and instructions)
- **[source_text.txt][source]** - Apollo 11 excerpted text
  (~1,400 words, plain text)
- - **[test_prompts.md][prompts]** - 15 test prompts (readable format)
+ - **[test_prompts.md][prompts]** - All test prompts (readable format)
- **[test_data.json][json]** - Complete dataset (structured format for automated
  testing)
- **[RATIONALE.md][rationale]** - Detailed explanation of selection decisions
@@ -56,7 +56,7 @@ operations"

- Individual sentences are unchanged; some paragraphs omitted for length management.
- Complete original sections total ~3,800 words; excerpted to ~1,400 words for
- practical testing while maintaining all information necessary for the 15 test prompts.
+ practical testing while maintaining all information necessary for the 21 test prompts.

📌 See [source_text.txt][source] for the complete excerpted text.

@@ -72,63 +72,95 @@ technical terms)
be tested
- ✅ **Narrative structure** - Clear sequence from descent through surface
activities
- - ✅ **All prompts answerable** - 15 test prompts verified to work with selected
+ - ✅ **All prompts answerable** - 21 test prompts verified to work with selected
passages

The excerpts cover the dramatic descent and landing sequence, followed by
moonwalk activities, ensuring comprehensive testing across summarization,
- reasoning, and RAG tasks.
+ reasoning, RAG, paraphrasing and creative generation tasks.

📌 See [RATIONALE.md][rationale] for detailed selection methodology.

---

## 📝 Test Structure

- **15 Standardized Prompts** across three categories:
+ ![image](images/evaluation-process.png)

- ### Summarization (5 prompts)
+ The test includes **21 standardized prompts** distributed across **five categories**.
+ In addition, a **Master Instruction** and **task-specific guidance prompts** are
+ provided to ensure consistency and clarity across all tasks.

- Tests model's ability to condense and extract key information
+ ### Prompt Delivery Overview
+
+ The test follows this sequence:
+
+ **1.** The **Master Instruction** is used **once at the beginning** of the test.
+ **2.** Before each category, a **task-specific guidance prompt** clarifies how
+ the model should approach that task type (e.g., reasoning, summarization, retrieval).
+ **3.** Then, the **individual prompts for that category** are presented in order
+ of increasing difficulty.
+
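For scripted runs, this delivery order can be assembled programmatically. The sketch below is illustrative only: the field names (`master_instruction`, `categories`, `guidance`, `prompts`, `text`) are assumptions and may differ from the actual test_data.json schema.

```python
# Illustrative sketch of the delivery order above; the JSON field names are
# assumptions, not the confirmed schema of test_data.json.
import json

with open("test_data.json", encoding="utf-8") as f:
    data = json.load(f)

sequence = [data["master_instruction"]]        # 1. master instruction, sent once
for category in data["categories"]:
    sequence.append(category["guidance"])      # 2. task-specific guidance prompt
    for prompt in category["prompts"]:         # 3. prompts, assumed to be listed
        sequence.append(prompt["text"])        #    in order of increasing difficulty

for message in sequence:
    print(message, end="\n\n")
```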
+ ### Prompt Categories
+
+ #### 1. Summarization (5 prompts)
+
+ Tests model's ability to condense and extract key information.

**Difficulty:** Easy → Medium → Hard
- **Examples:** Main events, challenges faced, activities performed, equipment
- deployed
+ **Examples:** Main events, challenges faced, activities performed, equipment deployed

- ### Reasoning (5 prompts)
+ #### 2. Reasoning (5 prompts)

- Tests model's ability to analyze, infer, and make connections
+ Tests model's ability to analyze, infer, and make connections.

- **Types:** Causal reasoning, hypothetical scenarios, interpretation, deep
- analysis
+ **Types:** Causal reasoning, hypothetical scenarios, interpretation,
+ deep analysis
**Examples:** Why did computer alarms occur? What if Armstrong hadn't taken
manual control? What does Margaret Hamilton's statement reveal?

- ### RAG - Retrieval (5 prompts)
+ #### 3. RAG – Retrieval (5 prompts)

- Tests model's ability to retrieve specific information from source text
+ Tests model's ability to retrieve specific information from source text.

**Types:** Times, quotes, numbers, lists, complex multi-part facts
**Examples:** Landing time? Material collected? Scientific instruments deployed?

- 📌 See [test_prompts.md][prompts] for the readable format, or [test_data.json][json]
- for its structured data version.
+ #### 4. Paraphrasing (3 prompts)
+
+ Tests model's ability to restate information in its own words.
+
+ **Difficulty:** Easy → Medium
+ **Examples:** Describe computer alarms, Armstrong’s teamwork, or sample collection.
+
+ #### 5. Creative Generation (3 prompts)
+
+ Tests model's interpretive and imaginative capabilities.
+
+ **Difficulty:** Easy → Medium
+ **Examples:** Imagine being in Mission Control. What does the landing show about courage?
+ How did it change Earth?
+
+ 📌 See [test_prompts.md][prompts] for the readable version with full prompt texts,
+ or [test_data.json][json] for the structured data version.

---

## 🔧 How to Use

### General Instructions

- - **All 15 prompts** should be tested across all models to ensure a fair comparison.
+ - **All 21 prompts** should be tested across all models to ensure a fair comparison.
+ - The **Master Instruction** and any **task-specific guidance prompts** should
+ be applied as described in the Test Structure section.
- Some prompts can be more challenging for smaller models,
but attempting all prompts provides comprehensive evaluation data.

**Testing Protocol:**

**1.** Use the source text from **[source_text.txt][source]**
exactly as provided
- **2.** Use all 15 prompts from **[test_prompts.md][prompts]** without
+ **2.** Use all prompts from **[test_prompts.md][prompts]** without
modification
**3.** *(Optional)* Use **[test_data.json][json]** for automated or scripted
  testing workflows
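For step 3, a minimal scripted run might look like the sketch below. The `ask_model` function and the JSON field names (`categories`, `prompts`, `id`, `text`) are placeholders to adapt, not part of the dataset.

```python
# Minimal automated-run sketch; ask_model() is a placeholder for whatever model
# API is being evaluated, and the JSON field names are assumptions about the schema.
import json

def ask_model(source_text: str, prompt: str) -> str:
    """Placeholder: send the source text plus one prompt to the model under test."""
    raise NotImplementedError

with open("source_text.txt", encoding="utf-8") as f:
    source_text = f.read()
with open("test_data.json", encoding="utf-8") as f:
    data = json.load(f)

results = []
for category in data["categories"]:
    for prompt in category["prompts"]:
        answer = ask_model(source_text, prompt["text"])
        results.append({"prompt_id": prompt["id"], "answer": answer})

with open("results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```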
@@ -144,9 +176,15 @@ For each prompt, record:
**1. Accuracy** - Is the answer factually correct?
**2. Completeness** - Are all key points covered?
**3. Specificity** - Are specific details included (times, names, numbers)?
- **4. Reasoning Quality** - For reasoning prompts, is the logic sound and
- well-supported?
-
+ **4. Reasoning Quality** - Is the logic sound and well-supported?
+ **5. Paraphrasing Quality** - Is the information reworded (not copied)
+ while maintaining accuracy?
+ **6. Creative Generation Quality** - Is the response coherent, relevant, and inspired by the text?
+ **7. Instruction Following** - Does the model follow the master or task-specific
+ instructions (no source mentions, concise, natural)?
+
+ **Note:** Creative generation prompts have no single correct answer. Evaluate
+ based on coherence, relevance to the text, and quality of reasoning.
Maintain consistent evaluation criteria across all models for fair comparison.
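One possible way to keep scores uniform across models is a small per-prompt record such as the sketch below; the 0–5 scale, field names, and example prompt ID are assumptions, not project requirements.

```python
# Hypothetical per-prompt score record covering criteria 1-7 above;
# the 0-5 scale and the example prompt ID are illustrative assumptions.
from dataclasses import dataclass, asdict

@dataclass
class PromptScore:
    prompt_id: str
    accuracy: int               # 1. factually correct?
    completeness: int           # 2. all key points covered?
    specificity: int            # 3. specific times, names, numbers included?
    reasoning_quality: int      # 4. sound, well-supported logic?
    paraphrasing_quality: int   # 5. reworded rather than copied, still accurate?
    creative_quality: int       # 6. coherent, relevant, inspired by the text?
    instruction_following: int  # 7. master/task-specific instructions respected?
    notes: str = ""

print(asdict(PromptScore("rag-01", 5, 4, 5, 4, 0, 0, 5, "quote retrieved verbatim")))
```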

---