
Commit ffef4c4

Merge pull request #19 from pfilipovich/section-9-measuring-and-reporting
fix_issues_and_elaborate_evaluation
2 parents a5b19de + c60b392 commit ffef4c4

1 file changed: +29 additions, −21 deletions

Design_Doc_Examples/Magic_shaperoint_design.md

@@ -800,79 +800,87 @@ The **Sparse Encoded Retrieval Baseline** serves as a straightforward search eng

**Evaluation approach**

Evaluating the relevance of responses to user queries can be challenging. For this purpose, we could use a crowdsourcing platform. Assessors will be provided with a series of prompts and answers, not only to assess relevance but also to detect hallucinations. We consider the following metrics:

- **Average Relevance Score** of direct questions.
- **Average Relevance Score** of follow-up questions.
- **Hallucination Rate**: the percentage of responses that contain hallucinated content. Responses are considered hallucinated if they include information not supported by facts or the input prompt.

**Assessment Method**:
- Assessors will be provided with a series of prompts and answers to evaluate both their relevance and accuracy. Relevance is rated on a 5-point scale, where 5 is a full match and 1 indicates no relevance; alongside this, assessors use a binary scale (Yes/No) to indicate whether each response contains hallucinated information.
- For more nuanced analysis, we can further categorize hallucinations by severity, with minor inaccuracies noted separately from outright fabrications.

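To make the aggregation concrete, below is a minimal sketch in Python of how the collected labels could be rolled up into the two headline metrics. The task IDs, the label layout, and the majority-vote rule for the hallucination flag are illustrative assumptions, not part of the design.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical assessor labels: (task_id, relevance on a 1-5 scale, hallucination flag).
labels = [
    ("q1", 5, "No"), ("q1", 4, "No"), ("q1", 5, "Yes"),
    ("q2", 2, "Yes"), ("q2", 1, "Yes"), ("q2", 3, "No"),
]

by_task = defaultdict(list)
for task_id, relevance, hallucinated in labels:
    by_task[task_id].append((relevance, hallucinated == "Yes"))

# Average Relevance Score: mean over tasks of the mean relevance given by the overlapping assessors.
avg_relevance = mean(mean(score for score, _ in votes) for votes in by_task.values())

# Hallucination Rate: share of tasks where the majority of assessors flagged a hallucination.
hallucinated_tasks = sum(
    1 for votes in by_task.values() if sum(flag for _, flag in votes) > len(votes) / 2
)
hallucination_rate = hallucinated_tasks / len(by_task)

print(f"Average Relevance Score: {avg_relevance:.2f}")
print(f"Hallucination Rate: {hallucination_rate:.1%}")
```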
**Platform and Settings**:

- **Platform Choice**: Yandex.Toloka or Amazon Mechanical Turk, etc.
- **Total Assessors:** 100
- **Query-Answer Pairs for Direct Questions:** 500
- **Query-Answer Pairs for Follow-up Questions:** 500

**Terminology**:
- **Task**: one Query-Answer Pair, a single item for assessment.
- **Pool**: a page with multiple tasks for assessors to evaluate.
- **Overlap**: the number of different assessors who evaluate the same task; multiple reviewers per task make the labels more reliable.

**Cost Calculation**:

- **Pool Price:** $0.05
- **Total Tasks:** 1000 (500 for direct questions and 500 for follow-up questions)
- **Tasks Per Pool:** 5
- **Overlap:** 5

**Expense Formula**: `expense = pool_price * (total_tasks / tasks_per_pool) * overlap`

- Cost of direct questions: 0.05 * (500 / 5) * 5 = $25
- Cost of follow-up questions: 0.05 * (500 / 5) * 5 = $25
- Total cost, including the hallucination assessment: $50

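A quick sanity check of the expense formula with the settings above (plain Python; the helper function is ours, not an API of any crowdsourcing platform):

```python
def crowdsourcing_expense(pool_price: float, total_tasks: int, tasks_per_pool: int, overlap: int) -> float:
    """Expense formula from this section: pool_price * (total_tasks / tasks_per_pool) * overlap."""
    return pool_price * (total_tasks / tasks_per_pool) * overlap

direct = crowdsourcing_expense(0.05, 500, 5, 5)     # $25.0 for 500 direct question pairs
follow_up = crowdsourcing_expense(0.05, 500, 5, 5)  # $25.0 for 500 follow-up question pairs
total = crowdsourcing_expense(0.05, 1000, 5, 5)     # $50.0 for all 1000 tasks

print(direct, follow_up, total)
```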
**Budget Adjustments**:

The settings can be adapted to the available budget, with potential increases to accommodate the additional complexity of assessing hallucinations.

**Special Considerations for Niche Domains**: The evaluation approach works well for well-known domains. For niche domains, we can involve domain experts who are familiar with the context.

#### ii. A/B Tests

**Hypothesis**

- **Primary Hypothesis**: Based on offline metrics and evaluation with a crowdsourcing platform, we expect to improve the **Average Relevance Score**.
- **Secondary Hypothesis**: The system will deliver responses within an average of 1 minute, supporting efficient user interaction without sacrificing quality or accuracy.
**Termination Criteria.**

- The system must deliver responses within an average of 1.5 minutes.
- The percentage of reports of offensive or improper responses must stay below 1%.

If either of these criteria is violated, the experiment will be paused and resumed after corrections.
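As a minimal illustration, this guardrail could be automated as a daily check over the experiment's monitoring metrics; the metric names and the data layout below are assumptions, not an existing interface:

```python
# Hypothetical daily metric snapshot for the experimental arm.
daily_metrics = {
    "avg_response_time_sec": 97.0,    # average end-to-end response time
    "offensive_report_rate": 0.004,   # share of responses reported as offensive or improper
}

MAX_AVG_RESPONSE_TIME_SEC = 1.5 * 60  # 1.5 minutes
MAX_OFFENSIVE_REPORT_RATE = 0.01      # 1%

def violated_criteria(metrics: dict) -> list:
    """Return the list of violated termination criteria (an empty list means keep running)."""
    violations = []
    if metrics["avg_response_time_sec"] > MAX_AVG_RESPONSE_TIME_SEC:
        violations.append("average response time above 1.5 minutes")
    if metrics["offensive_report_rate"] >= MAX_OFFENSIVE_REPORT_RATE:
        violations.append("offensive/improper response reports at or above 1%")
    return violations

violations = violated_criteria(daily_metrics)
if violations:
    print("Pause experiment:", "; ".join(violations))
else:
    print("Continue experiment")
```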

**Key Metrics**

- **Average Relevance Score**: Calculates the mean score of how relevant the answers provided by the system are to the queries (positive/negative feedback).
- **Offensive or Improper Responses**: Tracks the rate at which the system produces inappropriate or offensive content, based on user reports of offensive or improper responses.

**Control Metrics**

- **Time to Retrieve (TTR)**: Measures the average time taken by the system to fetch and display results after a query is submitted.
- **Average number of clarification questions**: Tracks the average number of additional questions the system needs to ask users to clarify their initial queries.
- **Average time of dialogue**: Measures the average duration of an interaction session between the user and the system. This includes the time from the initial query to the final response.

**Auxiliary Metrics**

- Total Document Count
- Daily New Documents
- Total User Count
- New Users per Day
- Session Count per Day

**Splitting Strategy.** Users will be split into two groups by their IDs.
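One straightforward way to implement the split is a deterministic hash of the user ID, so that every user consistently lands in the same group; the salt value and hashing scheme below are illustrative assumptions:

```python
import hashlib

EXPERIMENT_SALT = "ab-test-relevance-2024"  # hypothetical experiment identifier

def assign_group(user_id: str) -> str:
    """Deterministically map a user ID to group 'A' (baseline) or 'B' (new solution)."""
    digest = hashlib.sha256(f"{EXPERIMENT_SALT}:{user_id}".encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always receives the same group, and IDs split roughly 50/50.
print(assign_group("user-42"), assign_group("user-42"), assign_group("user-43"))
```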
**Experiment Duration.**

The experiment will last two weeks. After one week, the groups will swap configurations to mitigate any biases introduced by variable user experiences and external factors.

**Statistical Criteria.** Statistical significance will be determined using Welch’s t-test, with a significance level set at 5% and the type II error at 10%.
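As a rough sketch of how this check could be run (assuming per-user Average Relevance Scores collected from both groups; SciPy and statsmodels are used purely for illustration):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(0)
# Hypothetical per-user average relevance scores for the two arms.
group_a = rng.normal(loc=3.6, scale=0.9, size=4000)  # baseline
group_b = rng.normal(loc=3.7, scale=0.9, size=4000)  # new solution

# Welch's t-test does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")

# Users needed per group to detect an assumed minimal effect with alpha = 5%
# and power = 90% (power = 1 - type II error).
effect_size = 0.1 / 0.9  # assumed minimal detectable difference over an assumed std
n_per_group = TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.9)
print(f"Required users per group: {int(np.ceil(n_per_group))}")
```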
**Future Steps for Experiment Improvement.**

@@ -882,7 +890,7 @@ To further validate our experimental setup, we propose incorporating an A/A test

At the end of the experiment, a comprehensive report will be generated. This will include:

- Key, control and auxiliary metric results with a 95% confidence interval (see the sketch below).
- Distribution plots showing metric trends over time.
- Absolute numbers for all collected data.
- Detailed descriptions of each tested approach with links to full documentation.

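For the confidence intervals in the report, a simple sketch of a 95% CI for one metric's mean (the values are made up; a t-interval via SciPy is one possible choice):

```python
import numpy as np
from scipy import stats

# Hypothetical per-user values of one reported metric (e.g., Average Relevance Score).
values = np.array([3.2, 4.1, 3.8, 4.5, 2.9, 3.7, 4.0, 3.5, 4.2, 3.9])

mean = values.mean()
sem = stats.sem(values)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(values) - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```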