Design_Doc_Examples/Magic_shaperoint_design.md (+27 −4)
@@ -121,8 +121,20 @@ Every month:
**i. Metrics**
-The task could be split into independent subtasks: data extraction (OCR) and data retrieval and answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.
+The task could be split into independent subtasks: OCR -> intent classification -> search/retrieval -> answer generation. These parts can be evaluated independently to prioritize improvements based on the source of errors as well as overall solution performance.
+
+***Question Intent Classification Metrics:***
+
+Pre-requirements: Define all possible labels (intents) and prepare a dataset of questions/sentences with the corresponding labels.
+
+This group of metrics shows how accurately the model can classify the intent of a question, which is crucial for the whole processing pipeline. The metrics below could be used for QIC, macro-averaged and per class (a scoring sketch follows the list):
+
+1. Precision
+2. Recall
+3. F1
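A minimal scoring sketch for these metrics, assuming scikit-learn is available; the intent label set and the `y_true`/`y_pred` samples below are illustrative placeholders for the labelled QIC dataset:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Assumed (illustrative) intent labels and model predictions.
labels = ["search_document", "ask_definition", "compare_items", "other"]
y_true = ["search_document", "ask_definition", "compare_items", "other", "search_document"]
y_pred = ["search_document", "ask_definition", "search_document", "other", "search_document"]

# Per-class precision / recall / F1.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Macro-averaged scores as a single headline number.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0
)
print(f"macro precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```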
+
***Data Extraction Metrics:***
Pre-requirements: Dataset of scanned documents and their corresponding texts. (As a workaround, readable documents could be scanned manually, which gives both the scanned image and the ground-truth text.)
@@ -133,14 +145,21 @@ It’s reasonable to measure OCR quality separately, as in the case of poor OCR
WER operates on the word level and shows the share of words in the OCR output that differ from the ground truth (substitutions, insertions, and deletions). LLMs usually cope well with misprints; however, there could still be errors in important data, so WER could be used as a quick check of OCR quality. The lower, the better. It could be replaced with the Character Error Rate.
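A self-contained word-level sketch of this check (a plain edit-distance implementation rather than a dedicated OCR evaluation library; the sample strings are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative ground truth vs. OCR output.
print(wer("total amount due 125.00", "total arnount due 125.00"))  # 0.25
```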
Formulas could be present in the document as well. They are not a typical OCR use case, so if they are not recognized well, it degrades the system performance. The formula error rate could be measured as the percentage of incorrectly OCRed formulas out of the total number of formulas.
As it is important to extract table-structured data as well, the percentage of incorrectly detected table cells compared to the ground truth could be used as one of the metrics.
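A sketch of both ratio metrics, assuming the ground-truth formulas and table cells are aligned by position with the OCR output (all values below are illustrative):

```python
def error_rate(ground_truth: list[str], predicted: list[str]) -> float:
    """Share of items whose OCRed value does not match the ground truth exactly."""
    assert len(ground_truth) == len(predicted)
    wrong = sum(1 for gt, pred in zip(ground_truth, predicted) if gt != pred)
    return wrong / max(len(ground_truth), 1)

# Formula error rate: each entry is one formula, aligned by position in the document.
formula_rate = error_rate(["E = mc^2", "a^2 + b^2 = c^2"], ["E = mc^2", "a2 + b2 = c2"])

# Table-cell error rate: each entry is one cell value from the flattened table.
cell_rate = error_rate(["Q1", "Q2", "120", "135"], ["Q1", "Q2", "120", "I35"])

print(f"formula error rate: {formula_rate:.0%}, table-cell error rate: {cell_rate:.0%}")
```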
Pre-requirements: Dataset of queries collected from experts and a list of the N most relevant chunks for each query.
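The retrieval metrics described next build on DCG/NDCG; a minimal NDCG@k sketch under these pre-requirements, where each retrieved chunk gets a graded relevance label derived from the expert list (the relevance values below are illustrative):

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k retrieved chunks."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the chunks actually returned for one query, in ranked order.
print(ndcg_at_k([3, 2, 0, 1], k=4))  # ≈ 0.985
```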
@@ -158,8 +177,8 @@ NDCG is a metric that calculates the average of DCGs for a given set of results,
Measures how well the generated answers match the context and query.
There are several approaches to calculate that metric:
-- automatically with framework (detailed description provided in section IV. Validation Schema)
-- with other llms paper to consider https://arxiv.org/pdf/2305.06311
+- automatically with the ragas framework (https://docs.ragas.io/en/stable/); a detailed description is provided in section IV. Validation Schema (see the sketch after this list)
+- with other LLMs as judges (paper to consider: "Automatic Evaluation of Attribution by Large Language Models", https://arxiv.org/pdf/2305.06311)
- manually, based on experts' output (the approach is provided in section IX. Measuring and reporting)
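A minimal sketch of the ragas option, assuming a ragas 0.1-style API and an LLM judge configured via environment variables (e.g. an OpenAI key); the question/answer/contexts rows are illustrative and the exact interface may differ between ragas versions:

```python
from datasets import Dataset          # pip install datasets ragas
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Illustrative evaluation rows: each answer was generated from the listed contexts.
rows = {
    "question": ["What is the warranty period for model X?"],
    "answer": ["The warranty period for model X is 24 months."],
    "contexts": [["Model X ships with a 24-month limited warranty."]],
}

# Both metrics are LLM-judged, so a judge model (e.g. via OPENAI_API_KEY) must be configured.
result = evaluate(Dataset.from_dict(rows), metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric scores in [0, 1]
```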
@@ -171,14 +190,18 @@ How to calculate:
- Manually: prepare a dataset of queries (including queries without an answer in the dataset) + expected responses; calculate by comparing the expected response to the provided one (see the sketch after this list)
- Fine-tune smaller LLMs to detect hallucinations
-- Add guardrails https://github.com/NVIDIA/NeMo-Guardrails - it will not only improve reponse, but also helps to calculate amount of times, model attempts to hallucinate
+- Add guardrails (https://github.com/NVIDIA/NeMo-Guardrails): this will not only improve responses, but also helps to count how many times the model attempts to hallucinate
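A minimal sketch of the manual option above, assuming each query is labelled with an expected response (or marked as unanswerable), and a hallucination is counted when the system confidently answers an unanswerable query or contradicts the expected response; `judge_contradicts` is a hypothetical placeholder for an expert check or an LLM judge:

```python
NO_ANSWER = None  # expected response for queries that have no answer in the dataset

def judge_contradicts(expected: str, provided: str) -> bool:
    """Hypothetical placeholder: expert review or an LLM judge in a real setup."""
    return expected.lower() not in provided.lower()

def hallucination_rate(expected_responses, provided_responses, refusal_marker="I don't know"):
    hallucinated = 0
    for expected, provided in zip(expected_responses, provided_responses):
        if expected is NO_ANSWER:
            # Any confident answer to an unanswerable query counts as a hallucination.
            hallucinated += refusal_marker.lower() not in provided.lower()
        else:
            hallucinated += judge_contradicts(expected, provided)
    return hallucinated / max(len(provided_responses), 1)

# Illustrative run: one correct answer, one fabricated answer to an unanswerable query.
print(hallucination_rate(
    ["24 months", NO_ANSWER],
    ["The warranty period is 24 months.", "The warranty period is 36 months."],
))  # 0.5
```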
Pre-requirements: Dataset of queries (ideally with an unambiguous answer) + expected responses, and domain experts to evaluate the metric manually.
As one of the requirements is the ability to automatically request more details when an insufficient answer is generated, the average number of interactions or follow-up questions needed to clarify or correct an answer could be calculated, along with the average relevance of those follow-up questions, to measure clarification capability. This metric helps to check the system's ability to provide comprehensive answers initially, or to minimise the number of interactions needed to reach sufficient detail.
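A small sketch of the interaction-count part, assuming conversation logs where each session records how many follow-up questions were needed before the answer was accepted (the log structure is illustrative):

```python
# Each session: number of follow-up/clarification questions the user had to ask
# after the initial query before getting a satisfactory answer (0 = answered first try).
sessions = [
    {"query": "Summarize contract 42", "follow_ups": 0},
    {"query": "What is the termination clause?", "follow_ups": 2},
    {"query": "List open action items", "follow_ups": 1},
]

avg_follow_ups = sum(s["follow_ups"] for s in sessions) / max(len(sessions), 1)
print(f"average follow-up questions per query: {avg_follow_ups:.2f}")  # 1.00
```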