You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
# Code from webVoyager https://github.com/MinorJerry/WebVoyager/blob/main/evaluation/auto_eval.py
2
+
importargparse
3
+
importos
4
+
importjson
5
+
importtime
6
+
importre
7
+
importbase64
8
+
9
+
fromopenaiimportOpenAI
10
+
11
+
SYSTEM_PROMPT="""As an evaluator, you will be presented with three primary components to assist you in your role:
12
+
13
+
1. Web Task Instruction: This is a clear and specific directive provided in natural language, detailing the online activity to be carried out. These requirements may include conducting searches, verifying information, comparing prices, checking availability, or any other action relevant to the specified web service (such as Amazon, Apple, ArXiv, BBC News, Booking etc).
14
+
15
+
2. Result Screenshots: This is a visual representation of the screen showing the result or intermediate state of performing a web task. It serves as visual proof of the actions taken in response to the instruction.
16
+
17
+
3. Result Response: This is a textual response obtained after the execution of the web task. It serves as textual result in response to the instruction.
18
+
19
+
-- You DO NOT NEED to interact with web pages or perform actions such as booking flights or conducting searches on websites.
20
+
-- You SHOULD NOT make assumptions based on information not presented in the screenshot when comparing it to the instructions.
21
+
-- Your primary responsibility is to conduct a thorough assessment of the web task instruction against the outcome depicted in the screenshot and in the response, evaluating whether the actions taken align with the given instructions.
22
+
-- NOTE that the instruction may involve more than one task, for example, locating the garage and summarizing the review. Failing to complete either task, such as not providing a summary, should be considered unsuccessful.
23
+
-- NOTE that the screenshot is authentic, but the response provided by LLM is generated at the end of web browsing, and there may be discrepancies between the text and the screenshots.
24
+
-- Note the difference: 1) Result response may contradict the screenshot, then the content of the screenshot prevails, 2) The content in the Result response is not mentioned on the screenshot, choose to believe the content.
25
+
26
+
You should elaborate on how you arrived at your final evaluation and then provide a definitive verdict on whether the task has been successfully accomplished, either as 'SUCCESS' or 'NOT SUCCESS'."""
0 commit comments