Commit 568d25e

Merge pull request #20 from pfilipovich/section_2_fix_comments
Section 2 fix comments
2 parents ffef4c4 + d4032d4 commit 568d25e

1 file changed: +27 -4 lines changed

Design_Doc_Examples/Magic_shaperoint_design.md

@@ -121,8 +121,20 @@ Every month:

**i. Metrics**

-The task could be split into independent subtasks: data extraction (OCR) and data retrieval and answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.
+The task could be split into independent subtasks: OCR -> intent classification -> search/retrieval -> answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.

+
+***Question Intent Classification Metrics:***
+
+Pre-requirements: Define all possible labels (intents) and prepare a dataset of questions/sentences with the appropriate labels.
+
+This shows how accurately the model can classify the intent of a question, which is crucial for the whole processing pipeline. The metrics below could be used for QIC (macro-averaged and per class); a minimal sketch follows the list:
+
+1. Precision
+2. Recall
+3. F1
+
+
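As an illustration (not part of the original document), a minimal sketch of the per-class and macro scores with scikit-learn; the intent labels and predictions below are made-up examples:

```python
# Sketch: per-class and macro precision/recall/F1 for question intent classification.
# Assumes a labeled evaluation set of questions with intents plus the model's predictions;
# the intent names below are hypothetical.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["search", "clarify", "search", "smalltalk", "search", "clarify"]
y_pred = ["search", "search", "search", "smalltalk", "clarify", "clarify"]
labels = ["search", "clarify", "smalltalk"]

# Per-class scores
prec, rec, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0
)
for label, p, r, f in zip(labels, prec, rec, f1):
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")

# Macro-averaged scores
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0
)
print(f"macro: precision={macro_p:.2f} recall={macro_r:.2f} f1={macro_f1:.2f}")
```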
***Data Extraction Metrics:***

Pre-requirements: Dataset of scanned documents and their corresponding ground-truth texts. (As a workaround, readable digital documents could be scanned manually, which yields both the scanned image and the ground-truth text.)
@@ -133,14 +145,21 @@ It’s reasonable to measure OCR quality separately, as in the case of poor OCR

It operates on the word level and shows the percentage of words that are misspelled in the OCR output compared to the ground truth. LLMs usually cope well with misprints; however, there could still be errors in important data, so WER could be used as a quick check of OCR quality. The lower, the better. It could be replaced with the Character Error Rate.

+$`word\_error\_rate = \frac{amount\_of\_misspelled\_words}{total\_amount\_of\_words}`$
+
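As an illustration of the formula above, a minimal sketch of word-level error rate via edit distance (the strings are made-up OCR examples, not taken from the document):

```python
# Sketch: word error rate of OCR output against ground-truth text.
# Counts substitutions, insertions and deletions via word-level edit distance.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one of seven words misrecognized -> ~0.14
print(word_error_rate("net present value of the project is",
                      "net present va1ue of the project is"))
```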
**b. Formula Error Rate**

Formulas could be present in the document as well. They are not a typical OCR problem, so if they are not recognized well, system performance degrades. The formula error rate could be measured as the percentage of incorrectly OCRed formulas out of the total number of formulas.

+$`formula\_error\_rate = \frac{amount\_of\_misspelled\_formulas}{total\_amount\_of\_formulas}`$
+
**c. Cell Error Rate**

As it is important to extract table-structured data as well, the percentage of incorrectly detected table cells compared to the ground truth could be used as one of the metrics.

+$`cell\_error\_rate = \frac{amount\_of\_incorrectly\_detected\_cells}{total\_amount\_of\_cells}`$
+
+
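To show how the formula and cell error rates above could be computed from an annotated evaluation set, a small sketch (the per-page counts are invented; any annotation format with per-item correctness flags would work):

```python
# Sketch: formula and cell error rates aggregated over an annotated evaluation set.
# Each record holds how many ground-truth formulas/cells a page contains and how many
# of them the OCR step got wrong; the numbers below are hypothetical.
from dataclasses import dataclass

@dataclass
class PageAnnotation:
    formulas_total: int
    formulas_incorrect: int
    cells_total: int
    cells_incorrect: int

pages = [PageAnnotation(4, 1, 120, 6), PageAnnotation(0, 0, 80, 2), PageAnnotation(2, 0, 0, 0)]

formula_error_rate = (sum(p.formulas_incorrect for p in pages)
                      / max(sum(p.formulas_total for p in pages), 1))
cell_error_rate = (sum(p.cells_incorrect for p in pages)
                   / max(sum(p.cells_total for p in pages), 1))
print(f"formula_error_rate={formula_error_rate:.3f}, cell_error_rate={cell_error_rate:.3f}")
```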
***Retrieval Metrics:***

Pre-requirements: Dataset of queries collected from experts and a list of the N most relevant chunks for each query.
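The next hunk's context line refers to NDCG for the retrieval step; as an illustration (not part of the original document), a minimal sketch of NDCG@k under the assumption of graded relevance labels for the retrieved chunks (the grades below are invented). Averaging this value over the query dataset gives the reported score:

```python
# Sketch: NDCG@k for a single query, given graded relevance labels of the ranked chunks
# (e.g. 2 = highly relevant, 1 = partially relevant, 0 = not relevant).
import math

def dcg(relevances):
    # DCG = sum(rel_i / log2(i + 1)) with ranks starting at 1
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved_relevances, all_labeled_relevances, k):
    ideal = sorted(all_labeled_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(retrieved_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the chunks actually returned (in ranked order) vs. all labeled chunks.
retrieved = [2, 0, 1, 0, 2]
labeled = [2, 2, 1, 1, 0, 0, 0]
print(ndcg_at_k(retrieved, labeled, k=5))
```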
@@ -158,8 +177,8 @@ NDCG is a metric that calculates the average of DCGs for a given set of results,

Measures how well the generated answers match the context and query.
There are several approaches to calculate this metric:
-- automatically with framework (detailed description provided in section IV. Validation Schema)
-- with other llms paper to consider https://arxiv.org/pdf/2305.06311
+- automatically with the ragas framework [https://docs.ragas.io/en/stable/] (detailed description provided in section IV. Validation Schema)
+- with other LLMs as judges (paper to consider: "Automatic Evaluation of Attribution by Large Language Models" [https://arxiv.org/pdf/2305.06311]); a minimal judge-style sketch follows this list
- manually, based on experts' output (approach is provided in section IX. Measuring and reporting)
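To make the LLM-as-judge option concrete, a minimal sketch (an illustrative assumption, not the document's prescribed implementation); `judge_llm` is a hypothetical callable standing in for whatever evaluator model is used, and the prompt and grading scale are invented:

```python
# Sketch: scoring answer quality with an LLM judge.
# `judge_llm` is a hypothetical function (e.g. a thin wrapper around any chat-completion API)
# that takes a prompt string and returns the model's text; it is not a specific library call.
from typing import Callable

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Retrieved context: {context}
Answer: {answer}
On a scale from 1 (unsupported/off-topic) to 5 (fully supported and relevant),
reply with a single integer."""

def answer_quality(question: str, context: str, answer: str,
                   judge_llm: Callable[[str], str]) -> int:
    reply = judge_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    digits = [int(ch) for ch in reply if ch.isdigit()]
    return digits[0] if digits else 1  # default to the lowest grade if parsing fails

# Usage: average the per-sample grades over the evaluation set, e.g.
# scores = [answer_quality(q, c, a, judge_llm) for q, c, a in eval_set]
# mean_quality = sum(scores) / len(scores)
```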

@@ -171,14 +190,18 @@ How to calculate:

- Manually: prepare a dataset of queries (including queries without an answer in the dataset) + expected responses; calculate by comparing the expected response to the provided one
- Finetune smaller LLMs to detect hallucination
-- Add guardrails https://github.com/NVIDIA/NeMo-Guardrails - it will not only improve reponse, but also helps to calculate amount of times, model attempts to hallucinate
+- Add guardrails [https://github.com/NVIDIA/NeMo-Guardrails] - this will not only improve responses, but also helps to count how many times the model attempts to hallucinate; a small counting sketch follows the formula below
+
+$`hallucination\_rate = \frac{amount\_of\_hallucinated\_responses}{total\_amount\_of\_responses}`$
+
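As an illustration of the hallucination rate above, a small counting sketch over a hypothetical labeled evaluation set; the judgment fields stand in for manual review, a finetuned detector, or guardrails logs (all example data is invented):

```python
# Sketch: hallucination rate over a hypothetical labeled query set.
# A query with has_answer=False should produce a refusal/clarification; a substantive
# answer to such a query counts as a hallucination. The checker is a stand-in for
# manual review, a finetuned detector, or guardrails logs.
from dataclasses import dataclass

@dataclass
class EvalItem:
    query: str
    has_answer: bool       # does the knowledge base actually contain the answer?
    response: str
    is_refusal: bool       # did the system decline or ask for clarification?
    judged_faithful: bool  # expert or detector judgment when an answer exists

def is_hallucinated(item: EvalItem) -> bool:
    if not item.has_answer:
        return not item.is_refusal   # answered something that cannot be grounded
    return not item.judged_faithful  # answered, but not supported by the sources

items = [
    EvalItem("What is the Q3 churn target?", True, "5% per quarter", False, True),
    EvalItem("Who approved budget X?", False, "It was approved by the CFO.", False, False),
    EvalItem("What is policy Y?", False, "I could not find this in the documents.", True, False),
]
rate = sum(is_hallucinated(i) for i in items) / len(items)
print(f"hallucination_rate={rate:.2f}")  # 1 of 3 -> 0.33
```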

**h. Clarification Capability**

Pre-requirements: Dataset of queries (ideally with an unambiguous answer) + expected responses, and domain experts to evaluate the metric manually.

As one of the requirements is the ability to automatically request more details when an insufficient answer is generated, the average number of interactions or follow-up questions needed to clarify or correct an answer could be calculated, together with the average relevance of those follow-up questions, to measure clarification capability. This metric helps to check the system's ability to provide comprehensive answers initially, or to minimise the number of interactions needed for refinement.

+$`clarification\_capability = \frac{number\_of\_clarification\_questions}{total\_amount\_of\_queries}`$


*Metrics to pick:*
