**Evaluation approach**

Evaluating the relevance of responses to user queries can be challenging. For this purpose, we could use a crowdsourcing platform. Assessors will be provided with a series of prompts and answers not only to assess relevance but also to detect hallucinations. We consider the following metrics:

- **Average Relevance Score** of the direct questions.
- **Average Relevance Score** of follow-up questions.
- **Hallucination Rate**: The percentage of responses that contain hallucinated content. Responses are considered hallucinated if they include information not supported by facts or the input prompt.

**Assessment Method**:

- Assessors will be provided with a series of prompts and answers to evaluate both their relevance and accuracy. Alongside the 5-point relevance scale (5 = full match, 1 = no relevance), assessors will use a binary scale (Yes/No) to indicate whether each response contains hallucinated information.
- For nuanced analysis, we can further categorize hallucinations by severity, with minor inaccuracies noted separately from outright fabrications. A sketch of how these labels roll up into the metrics above follows this list.
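
A minimal aggregation sketch, in plain Python, of how the collected labels could be turned into the two headline metrics. The record layout, the sample data, and the majority-vote rule for the hallucination flag are illustrative assumptions rather than part of any crowdsourcing platform's API.

```python
from collections import defaultdict
from statistics import mean

# One judgment per assessor: (task_id, relevance 1-5, hallucinated yes/no).
# With an overlap of 5, every task_id appears five times.
judgments = [
    ("q1", 5, False), ("q1", 4, False), ("q1", 5, False), ("q1", 4, True), ("q1", 5, False),
    ("q2", 2, True),  ("q2", 1, True),  ("q2", 2, True),  ("q2", 3, False), ("q2", 2, True),
]

def aggregate(judgments):
    by_task = defaultdict(list)
    for task_id, relevance, hallucinated in judgments:
        by_task[task_id].append((relevance, hallucinated))

    task_relevance = {}     # mean relevance per task across its assessors
    task_hallucinated = {}  # majority vote on the hallucination flag
    for task_id, votes in by_task.items():
        task_relevance[task_id] = mean(r for r, _ in votes)
        task_hallucinated[task_id] = sum(h for _, h in votes) > len(votes) / 2

    avg_relevance_score = mean(task_relevance.values())
    hallucination_rate = mean(task_hallucinated.values())  # share of tasks flagged as hallucinated
    return avg_relevance_score, hallucination_rate

ars, hr = aggregate(judgments)
print(f"Average Relevance Score: {ars:.2f}")  # 3.30 for the sample data
print(f"Hallucination Rate: {hr:.0%}")        # 50% for the sample data
```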

**Platform and Settings**:

- **Platform Choice**: Yandex.Toloka, Amazon Mechanical Turk, etc.
- **Total Assessors**: 100
- **Query-Answer Pairs for Direct Questions**: 500
- **Query-Answer Pairs for Follow-up Questions**: 500

**Terminology**:

- **Task**: One Query-Answer Pair; the single item being assessed.
- **Pool**: A page with multiple tasks for assessors to evaluate.
- **Overlap**: The number of different assessors who evaluate the same task; higher overlap improves label reliability.

**Cost Calculation**:

- **Pool Price**: $0.05
- **Total Tasks**: 1000 (500 for direct questions and 500 for follow-up questions)
- Cost of follow-up questions: 0.05 * (500 / 5) * 5 = $25 (a worked sketch follows the budget notes below)

**Budget Adjustments**:

- **Total Cost of Direct and Follow-up Questions**: $50

The settings can be adapted based on the budget, with potential increases to accommodate the additional complexity of assessing hallucinations.
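
For reference, a worked sketch of the cost arithmetic above. The pool price, the number of query-answer pairs, and the $50 total come from the settings; reading the two factors of 5 as "tasks per pool" and "overlap" is an assumption inferred from the formula, not something the settings state explicitly.

```python
# Cost model implied by the formula 0.05 * (500 / 5) * 5 = $25.
POOL_PRICE = 0.05    # $ paid per pool, per assessor
TASKS_PER_POOL = 5   # assumed: each pool page holds 5 query-answer pairs
OVERLAP = 5          # assumed: each task is judged by 5 different assessors

def labeling_cost(pairs: int) -> float:
    """Cost of labeling `pairs` query-answer pairs with the settings above."""
    pools = pairs / TASKS_PER_POOL       # number of pool pages needed
    return POOL_PRICE * pools * OVERLAP  # each pool is paid once per assessor

direct_cost = labeling_cost(500)     # $25.00
follow_up_cost = labeling_cost(500)  # $25.00
print(f"direct: ${direct_cost:.2f}, follow-up: ${follow_up_cost:.2f}, "
      f"total: ${direct_cost + follow_up_cost:.2f}")
# -> direct: $25.00, follow-up: $25.00, total: $50.00
```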

**Special Considerations for Niche Domains**: The evaluation approach above works well for widely known domains. For niche or specialized domains, we can instead rely on local experts who are familiar with the context.
#### ii. A/B Tests
**Hypothesis**

- **Primary Hypothesis**: Based on offline metrics and evaluation with a crowdsourcing platform, we expect to improve the **Average Relevance Score**.
- **Secondary Hypothesis**: The system will deliver responses within an average of 1 minute, supporting efficient user interaction without sacrificing quality or accuracy.

**Termination Criteria.**

- The average response time exceeds 1.5 minutes.
- The percentage of reports with offensive or improper responses exceeds 1%.

If the termination criteria are met, the experiment will be paused and resumed after corrections.

**Key Metrics**

- **Average Relevance Score**: The mean score of how relevant the system's answers are to user queries (based on positive/negative user feedback).
- **Offensive or Improper Responses**: The rate at which the system produces inappropriate or offensive content, tracked via offensive or improper response reports.

**Control metrics**

- **Time to Retrieve (TTR)**: Measures the average time taken by the system to fetch and display results after a query is submitted.
- **Average number of clarification questions**: Tracks the average number of additional questions the system needs to ask users to clarify their initial queries.
- **Average time of dialogue**: Measures the average duration of an interaction session between the user and the system, from the initial query to the final response.

**Auxiliary metrics**

- Total Document Count
- Daily New Documents
- Total User Count
- New Users per Day
- Session Count per Day

**Splitting Strategy.** Users will be split into two groups by their IDs.
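
A minimal sketch of the ID-based split. The hashing scheme is an assumption (the document does not prescribe one); salting with an experiment name keeps the assignment deterministic for a given user yet independent across experiments.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "rag-relevance-ab") -> str:
    """Deterministically assign a user to group 'A' or 'B' by their ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same user always lands in the same group, giving a stable ~50/50 split.
print(assign_group("user-42"))
```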
**Experiment Duration.**

The experiment will last two weeks. After one week, groups will swap configurations to mitigate any biases introduced by variable user experiences and external factors.

**Statistical Criteria.** Statistical significance will be determined using Welch’s t-test, with a significance level set at 5% and the type II error at 10%.
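
A sketch of the statistical procedure, assuming per-user Average Relevance Scores are collected for each group (the arrays below are synthetic placeholders). `scipy.stats.ttest_ind` with `equal_var=False` is Welch's t-test; the power calculation shows how the 5% significance level and 10% type II error (90% power) translate into a required sample size for an assumed effect size.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Synthetic per-user Average Relevance Scores for the two groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=3.4, scale=1.0, size=500)  # baseline
group_b = rng.normal(loc=3.6, scale=1.0, size=500)  # new solution

# Welch's t-test: does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant at 5%: {p_value < 0.05}")

# Users needed per group at alpha = 0.05 and power = 0.90 (type II error 10%),
# assuming a standardized effect size (Cohen's d) of 0.2.
n_required = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05, power=0.90)
print(f"required users per group: {int(np.ceil(n_required))}")
```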
**Future Steps for Experiment Improvement.**

To further validate our experimental setup, we propose incorporating an A/A test.

At the end of the experiment, a comprehensive report will be generated. This will include:

- Key, control, and auxiliary metric results with a 95% confidence interval (a bootstrap sketch follows below).
- Distribution plots showing metric trends over time.
- Absolute numbers for all collected data.
- Detailed descriptions of each tested approach with links to full documentation.
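
As an illustration of the confidence-interval reporting, a percentile bootstrap is one simple, assumption-light way to attach a 95% CI to any of the metrics above; the score array is a synthetic placeholder.

```python
import numpy as np

def bootstrap_ci(values, n_boot: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values)
    rng = np.random.default_rng(seed)
    # Resample with replacement and record the mean of each resample.
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), (lower, upper)

# Synthetic per-user relevance scores from one experiment arm.
scores = np.random.default_rng(1).normal(loc=3.5, scale=1.0, size=500)
point, (lo, hi) = bootstrap_ci(scores)
print(f"Average Relevance Score: {point:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```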