feat: add evaluation service module for RAG and Agent #2070

Merged
merged 10 commits into from
Oct 18, 2024

Conversation

Aries-ckt
Collaborator

@Aries-ckt Aries-ckt commented Oct 15, 2024

Close #2023

Description

Add an evaluation service module for RAG and Agent.
For more details, see http://docs.dbgpt.cn/docs/api/evaluation

How Has This Been Tested?

Evaluate for RAG recall score

  • Curl
SPACE_ID={YOUR_SPACE_ID}

curl --location --request POST 'http://localhost:5670/api/v2/serve/evaluate/evaluation' \
--header 'Content-Type: application/json' \
-d '{
  "scene_key": "recall",
  "scene_value": "'"$SPACE_ID"'",
  "context":{"top_k":5},
  "evaluate_metrics":["RetrieverHitRateMetric","RetrieverMRRMetric","RetrieverSimilarityMetric"],
  "datasets": [{
            "query": "what awel talked about",
            "doc_name":"awel.md"
        }]
}'
  • Python
import asyncio

from dbgpt.client import Client
from dbgpt.client.evaluation import run_evaluation
from dbgpt.serve.evaluate.api.schemas import EvaluateServeRequest

DBGPT_API_KEY = "dbgpt"
# Replace with the ID of your knowledge space
SPACE_ID = "{YOUR_SPACE_ID}"


async def main():
    client = Client(api_key=DBGPT_API_KEY)
    request = EvaluateServeRequest(
        # The evaluation scene type; supported values include "app" and "recall"
        scene_key="recall",
        # The app ID when scene_key is "app"; the space ID when scene_key is "recall"
        scene_value=SPACE_ID,
        context={"top_k": 5},
        evaluate_metrics=[
            "RetrieverHitRateMetric",
            "RetrieverMRRMetric",
            "RetrieverSimilarityMetric",
        ],
        datasets=[
            {
                "query": "what awel talked about",
                "doc_name": "awel.md",
            }
        ],
    )
    # run_evaluation is a coroutine, so it must be awaited inside an event loop
    data = await run_evaluation(client, request=request)
    print(data)


asyncio.run(main())
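
For readers unfamiliar with the recall metrics requested above, here is a minimal sketch of how hit rate and mean reciprocal rank (MRR) are conventionally defined. It is not the implementation behind RetrieverHitRateMetric or RetrieverMRRMetric in this PR; the function names and the sample chunk names are hypothetical and only illustrate the usual definitions.

```python
from typing import List, Sequence, Tuple


def hit_rate(retrieved: Sequence[str], relevant: Sequence[str]) -> float:
    """Return 1.0 if any relevant item appears among the retrieved items, else 0.0."""
    relevant_set = set(relevant)
    return 1.0 if any(item in relevant_set for item in retrieved) else 0.0


def reciprocal_rank(retrieved: Sequence[str], relevant: Sequence[str]) -> float:
    """Return 1/rank of the first relevant item (ranks start at 1), or 0.0 if none."""
    relevant_set = set(relevant)
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average the reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical example: the relevant chunk is ranked third out of top_k = 5.
retrieved_chunks = ["chunk_a", "chunk_b", "awel_intro", "chunk_c", "chunk_d"]
relevant_chunks = ["awel_intro"]
print(hit_rate(retrieved_chunks, relevant_chunks))         # 1.0
print(reciprocal_rank(retrieved_chunks, relevant_chunks))  # 1/3
```

Under these definitions, a larger top_k can only keep or raise the hit rate, while MRR rewards placing the relevant chunk near the top of the list.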

Evaluate for Agent answer score

  • Curl
APP_ID={YOUR_APP_ID}
PROMPT_ID={YOUR_PROMPT_ID}

curl --location --request POST 'http://localhost:5670/api/v2/serve/evaluate/evaluation' \
--header 'Authorization: Bearer dbgpt' \
--header 'Content-Type: application/json' \
-d '{
  "scene_key": "app",
  "scene_value": "'"$APP_ID"'",
  "context":{"top_k":5, "prompt":"'"$PROMPT_ID"'","model":"zhipu_proxyllm"},
  "evaluate_metrics":["AnswerRelevancyMetric"],
  "datasets": [{
            "query": "what awel talked about",
            "doc_name":"awel.md"
        }]
}'
  • Python
import asyncio

from dbgpt.client import Client
from dbgpt.client.evaluation import run_evaluation
from dbgpt.serve.evaluate.api.schemas import EvaluateServeRequest

DBGPT_API_KEY = "dbgpt"


async def main():
    client = Client(api_key=DBGPT_API_KEY)
    request = EvaluateServeRequest(
        # The evaluation scene type; supported values include "app" and "recall"
        scene_key="app",
        # The app ID when scene_key is "app"; the space ID when scene_key is "recall"
        scene_value="2c76eea2-83b6-11ef-b482-acde48001122",
        context={
            "top_k": 5,
            # Prompt ID
            "prompt": "942acd7e33b54ce28565f89f9b278044",
            # LLM model name
            "model": "zhipu_proxyllm",
        },
        evaluate_metrics=[
            "AnswerRelevancyMetric",
        ],
        datasets=[
            {
                "query": "what awel talked about",
                "doc_name": "awel.md",
            }
        ],
    )
    # run_evaluation is a coroutine, so it must be awaited inside an event loop
    data = await run_evaluation(client, request=request)
    print(data)


asyncio.run(main())
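
Splicing shell variables into a single-quoted JSON body, as the curl examples do, is easy to get wrong. As a hedged alternative, the request body can be built with Python's standard json module. The field names below are taken from the examples in this description; the helper name build_evaluation_payload and the placeholder IDs are hypothetical.

```python
import json
from typing import Any, Dict, List, Optional


def build_evaluation_payload(
    scene_key: str,
    scene_value: str,
    evaluate_metrics: List[str],
    datasets: List[Dict[str, Any]],
    context: Optional[Dict[str, Any]] = None,
) -> str:
    """Serialize an evaluation request body; json.dumps handles all quoting."""
    payload = {
        "scene_key": scene_key,
        "scene_value": scene_value,
        "context": context or {},
        "evaluate_metrics": evaluate_metrics,
        "datasets": datasets,
    }
    return json.dumps(payload, ensure_ascii=False)


# Placeholder IDs; substitute your own app ID and prompt ID.
body = build_evaluation_payload(
    scene_key="app",
    scene_value="YOUR_APP_ID",
    evaluate_metrics=["AnswerRelevancyMetric"],
    datasets=[{"query": "what awel talked about", "doc_name": "awel.md"}],
    context={"top_k": 5, "prompt": "YOUR_PROMPT_ID", "model": "zhipu_proxyllm"},
)
print(body)
```

The printed string can be saved to a file and sent with curl using -d @payload.json, which avoids shell quoting entirely.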

Snapshots:

Include snapshots for easier review.

Checklist:

  • My code follows the style guidelines of this project
  • I have already rebased the commits and made the commit messages conform to the project standard.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • Any dependent changes have been merged and published in downstream modules

@github-actions github-actions bot added the enhancement New feature or request label Oct 15, 2024
@fangyinc
Collaborator

There is an error here:

image

export SPACE_ID="awel_tutorial"

curl --location --request POST 'http://localhost:5670/api/v2/serve/evaluate/evaluation' \
--header 'Content-Type: application/json' \
-d'{
  "scene_key": "recall",
  "scene_value": "'"$SPACE_ID"'",
  "context":{"top_k":5},
  "evaluate_metrics":["RetrieverHitRateMetric","RetrieverMRRMetric","RetrieverSimilarityMetric"],
  "datasets": [{
            "query": "what awel talked about",
            "doc_name":"awel.md"
        }]
}'
{"success":false,"err_code":"E0003","err_msg":"Service.get_chunk_list() missing 2 required positional arguments: 'page' and 'page_size'","data":null}
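
The err_msg above has the shape of the TypeError that Python raises when a method is called without its required positional parameters. The sketch below reproduces that class of failure with a hypothetical stand-in (ChunkService is not the DB-GPT Service class) and shows one possible remedy, default values for the paging parameters. How this PR actually resolved the error is not shown in this thread.

```python
from typing import List


class ChunkService:
    """Hypothetical stand-in for a service with a paginated listing method."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = list(chunks)

    def get_chunk_list(self, page: int, page_size: int) -> List[str]:
        # Both paging parameters are required here, so callers must pass them.
        start = (page - 1) * page_size
        return self._chunks[start:start + page_size]


class ChunkServiceWithDefaults(ChunkService):
    """Same behavior, but callers may omit the paging parameters."""

    def get_chunk_list(self, page: int = 1, page_size: int = 20) -> List[str]:
        return super().get_chunk_list(page, page_size)


service = ChunkService(["c1", "c2", "c3"])
try:
    service.get_chunk_list()  # paging arguments omitted, as in the reported error
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

print(ChunkServiceWithDefaults(["c1", "c2", "c3"]).get_chunk_list())
```

The other obvious fix is to pass page and page_size explicitly at the call site; which one is preferable depends on whether the evaluation path needs every chunk or only the first page.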

@csunny
Collaborator

csunny commented Oct 18, 2024

  1. The assignment of space_id is confusing. In the browser URL I see knowledge_id="xx", but its actual value is the space name rather than the space ID.

http://127.0.0.1:5670/chat?scene=chat_knowledge&id=8cf40c88-8d1b-11ef-b405-3ea07eeef889&knowledge_id=%E6%B5%8B%E8%AF%95

Maybe it should be http://127.0.0.1:5670/chat?scene=chat_knowledge&id=8cf40c88-8d1b-11ef-b405-3ea07eeef889&space_id=20

  2. oss2
    oss2 is a business-layer dependency and should not be a required dependency.
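
One common way to keep a package such as oss2 optional is to import it lazily and fail with a clear message only when the feature is used. The sketch below is an assumption about how that could look, not code from this PR; the helper name optional_import and the feature label are hypothetical.

```python
import importlib
from types import ModuleType


def optional_import(module_name: str, feature: str) -> ModuleType:
    """Import a module on demand, raising a descriptive error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for {feature}; "
            f"install it to enable this feature."
        ) from exc


def upload_to_oss(data: bytes) -> None:
    """Hypothetical feature entry point; only this code path needs oss2."""
    oss2 = optional_import("oss2", "OSS storage")
    # Calls into oss2 would go here; they are omitted in this sketch.
    del oss2, data
```

With this pattern the core service starts without oss2 installed, and only users who call the storage feature need the extra dependency.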

@Aries-ckt
Collaborator Author

{
    "success": true,
    "err_code": null,
    "err_msg": null,
    "data": [
        [
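            {
                "prediction": "User question: what awel talked about\n\nExtracted text content:\nResource 1 provides a detailed description of AWEL. AWEL is an intelligent agent workflow expression language designed specifically for large model application development. It adopts a layered API design, divided into the operator layer, the AgentFream layer and the DSL layer. The operator layer comprises the basic operation atoms of LLM application development, such as retrieval, vectorization, model interaction and prompt processing. The AgentFream layer further encapsulates operators and can perform chain calculations based on them. The DSL layer provides a standard structured representation language; writing DSL statements can complete the operations of AgentFream and operators, making it more deterministic to write large model applications around data.\n\nSummary:\nAWEL is an intelligent agent workflow expression language designed for large model application development. Through its layered API design, comprising the operator layer, the AgentFream layer and the DSL layer, it provides functionality and flexibility. This lets developers focus on the business logic of LLM applications without attending to model and environment details, while the DSL makes application programming more deterministic.\n\nOutput: AWEL mainly discusses an intelligent agent workflow expression language designed for large model application development, which provides development flexibility and determinism through a layered API design comprising the operator layer, the AgentFream layer and the DSL layer.",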
                "contexts": [
                    "\"What is AWEL?\": Agentic Workflow Expression Language(AWEL) is a set of intelligent agent workflow expression language specially designed for large model application\ndevelopment. It provides great functionality and flexibility. Through the AWEL API, you can focus on the development of business logic for LLMs applications\nwithout paying attention to cumbersome model and environment details.  \nAWEL adopts a layered API design. AWEL's layered API design architecture is shown in the figure below.  \n<p align=\"left\">\n<img src={'/img/awel.png'} width=\"480px\"/>\n</p>",
                    "\"What is AWEL?-Operators-Example of API-RAG-Example of LLM + cache\": <p align=\"left\">\n<img src={'/img/awel_cache_flow.png'} width=\"360px\" />\n</p>",
                    "\"What is AWEL?-AWEL Design\": AWEL is divided into three levels in deign, namely the operator layer, AgentFream layer and DSL layer. The following is a brief introduction\nto the three levels.  \n- **Operator layer**\nThe operator layer refers to the most basic operation atoms in the LLM application development process,\nsuch as when developing a RAG application. Retrieval, vectorization, model interaction, prompt processing, etc.\nare all basic operators. In the subsequent development, the framework will further abstract and standardize the design of operators.\nA set of operators can be quickly implemented based on standard APIs  \n- **AgentFream layer**\nThe AgentFream layer further encapsulates operators and can perform chain calculations based on operators.\nThis layer of chain computing also supports distribution, supporting a set of chain computing operations such as filter, join, map, reduce, etc. More calculation logic will be supported in the future.  \n- **DSL layer**\nThe DSL layer provides a set of standard structured representation languages, which can complete the operations of AgentFream and operators by writing DSL statements, making it more deterministic to write large model applications around data, avoiding the uncertainty of writing in natural language, and making it easier to write around data. Application programming with large models becomes deterministic application programming.",
                    "\"What is AWEL?-Examples\": The preliminary version of AWEL has alse been released, and we have provided some built-in usage examples.",
                    "\"What is AWEL?-Executable environment\": - Stand-alone environment\n- Ray environment",
                    "\"What is AWEL?-Operators-DSL Example\": ``` python\nCREATE WORKFLOW RAG AS\nBEGIN\nDATA requestData = RECEIVE REQUEST FROM\nhttp_source(\"/examples/rags\", method = \"post\");  \nDATA processedData = TRANSFORM requestData USING embedding(model = \"text2vec\");\nDATA retrievedData = RETRIEVE DATA\nFROM vstore(database = \"chromadb\", key = processedData)\nON ERROR FAIL;  \nDATA modelResult = APPLY LLM \"vicuna-13b\"\nWITH DATA retrievedData AND PARAMETERS (temperature = 0.7)\nON ERROR RETRY 2 TIMES;  \nRESPOND TO http_source WITH modelResult\nON ERROR LOG \"Failed to respond to request\";\nEND;\n```",
                    "\"What is AWEL?-Operators-Example of API-RAG\": You can find [source code](https://github.com/eosphoros-ai/DB-GPT/blob/main/examples/awel/simple_rag_example.py) from `examples/awel/simple_rag_example.py`\n```python\nwith DAG(\"simple_rag_example\") as dag:\ntrigger_task = HttpTrigger(\n\"/examples/simple_rag\", methods=\"POST\", request_body=ConversationVo\n)\nreq_parse_task = RequestParseOperator()\n# TODO should register prompt template first\nprompt_task = PromptManagerOperator()\nhistory_storage_task = ChatHistoryStorageOperator()\nhistory_task = ChatHistoryOperator()\nembedding_task = EmbeddingEngingOperator()\nchat_task = BaseChatOperator()\nmodel_task = ModelOperator()\noutput_parser_task = MapOperator(lambda out: out.to_dict()[\"text\"])  \n(\ntrigger_task\n>> req_parse_task\n>> prompt_task\n>> history_storage_task\n>> history_task\n>> embedding_task\n>> chat_task\n>> model_task\n>> output_parser_task\n)  \n```\nBit operations will arrange the entire process in the form of DAG  \n<p align=\"left\">\n<img src={'/img/awel_dag_flow.png'} width=\"360px\" />\n</p>",
                    "\"What is AWEL?-Operators-AgentFream Example\": ```python\naf = AgentFream(HttpSource(\"/examples/run_code\", method = \"post\"))\nresult = (\naf\n.text2vec(model=\"text2vec\")\n.filter(vstore, store = \"chromadb\", db=\"default\")\n.llm(model=\"vicuna-13b\", temperature=0.7)\n.map(code_parse_func)\n.map(run_sql_func)\n.reduce(lambda a, b: a + b)\n)\nresult.write_to_sink(type='source_slink')\n```",
                    "\"What is AWEL?-Currently supported operators\": - **Basic Operators**\n- BaseOperator\n- JoinOperator\n- ReduceOperator\n- MapOperator\n- BranchOperator\n- InputOperator\n- TriggerOperator\n- **Stream Operators**\n- StreamifyAbsOperator\n- UnstreamifyAbsOperator\n- TransformStreamAbsOperator",
                    "\"\": <font style=\"color:#24292E;\">OceanBase通过Root Service管理各个节点间的负载均衡。不同类型的副本需求的资源各不相同,Root Service在执行分区管理操作时需要考虑的因素包括每台ObServer上的CPU、磁盘使用量、内存使用量、IOPS使用情况、避免同一张表格的分区全部落到少数几台ObServer,等等。让耗内存多的副本和耗内存少的副本位于同一台机器上,让占磁盘空间多的副本和占磁盘空间少的副本位于同一台机器上。经过负载均衡,最终会使得所有机器的各类型资源占用都处于一种比较均衡的状态,充分利用每台机器的所有资源。</font>  \n<font style=\"color:#24292E;\">负载均衡分机器、unit两个粒度,前者负责机器之间的均衡,选择一些 unit 整体从负载高的机器迁移到负载低的机器上;后者负责两个unit之间的均衡,从负载高的 unit 搬迁副本到负载低的 unit。</font>  \n|  | **机器负载均衡** | **unit负载均衡** |\n| --- | --- | --- |\n| 均衡对象 | 机器 | unit |\n| 搬迁内容 | unit | 副本 |\n| 均衡粒度 | 粗 | 细 |\n| 均衡范围 | zone 内均衡 | zone 内均衡 |  \n<font style=\"color:#24292E;\">一个租户拥有若干个资源池,这些资源池的集合描述了这个租户所能使用的所有资源。一个资源池由具有相同资源规格(Unit Config)的若干个UNIT(资源单元)组成。每个UNIT描述了位于一个Server上的一组计算和存储资源,可以视为一个轻量级虚拟机,包括若干CPU资源,内存资源,磁盘资源等。一个资源池只能属于一个租户,一个租户在同一个Server上最多有一个UNIT。对于每个Zone,根据UNIT的动态调度,达到均衡的策略。</font>  \n+ <font style=\"color:#24292E;\">属于同一个租户的若干个UNIT,会均匀分散在不同的server上</font>\n+ <font style=\"color:#24292E;\">属于同一个租户组的若干个UNIT,会尽量均匀分散在不同的server上</font>\n+ <font style=\"color:#24292E;\">当一个Zone内机器整体磁盘使用率超过一定阈值时,通过交换或迁移UNIT降低磁盘水位线</font>\n+ <font style=\"color:#24292E;\">否则,根据UNIT的CPU和内存规格,通过交换或迁移UNIT降低CPU和内存的平均水位线</font>"
                ],
                "score": 4.0,
                "passing": true,
                "metric_name": "AnswerRelevancyMetric",
                "prediction_cost": -15,
                "query": "what awel talked about",
                "raw_dataset": {
                    "query": "what awel talked about",
                    "doc_name": "awel.md",
                    "factual": [
                        "\"What is AWEL?\": Agentic Workflow Expression Language(AWEL) is a set of intelligent agent workflow expression language specially designed for large model application\ndevelopment. It provides great functionality and flexibility. Through the AWEL API, you can focus on the development of business logic for LLMs applications\nwithout paying attention to cumbersome model and environment details.  \nAWEL adopts a layered API design. AWEL's layered API design architecture is shown in the figure below.  \n<p align=\"left\">\n<img src={'/img/awel.png'} width=\"480px\"/>\n</p>",
                        "\"What is AWEL?-Operators-Example of API-RAG-Example of LLM + cache\": <p align=\"left\">\n<img src={'/img/awel_cache_flow.png'} width=\"360px\" />\n</p>",
                        "\"What is AWEL?-AWEL Design\": AWEL is divided into three levels in deign, namely the operator layer, AgentFream layer and DSL layer. The following is a brief introduction\nto the three levels.  \n- **Operator layer**\nThe operator layer refers to the most basic operation atoms in the LLM application development process,\nsuch as when developing a RAG application. Retrieval, vectorization, model interaction, prompt processing, etc.\nare all basic operators. In the subsequent development, the framework will further abstract and standardize the design of operators.\nA set of operators can be quickly implemented based on standard APIs  \n- **AgentFream layer**\nThe AgentFream layer further encapsulates operators and can perform chain calculations based on operators.\nThis layer of chain computing also supports distribution, supporting a set of chain computing operations such as filter, join, map, reduce, etc. More calculation logic will be supported in the future.  \n- **DSL layer**\nThe DSL layer provides a set of standard structured representation languages, which can complete the operations of AgentFream and operators by writing DSL statements, making it more deterministic to write large model applications around data, avoiding the uncertainty of writing in natural language, and making it easier to write around data. Application programming with large models becomes deterministic application programming.",
                        "\"What is AWEL?-Examples\": The preliminary version of AWEL has alse been released, and we have provided some built-in usage examples.",
                        "\"What is AWEL?-Executable environment\": - Stand-alone environment\n- Ray environment",
                        "\"What is AWEL?-Operators-DSL Example\": ``` python\nCREATE WORKFLOW RAG AS\nBEGIN\nDATA requestData = RECEIVE REQUEST FROM\nhttp_source(\"/examples/rags\", method = \"post\");  \nDATA processedData = TRANSFORM requestData USING embedding(model = \"text2vec\");\nDATA retrievedData = RETRIEVE DATA\nFROM vstore(database = \"chromadb\", key = processedData)\nON ERROR FAIL;  \nDATA modelResult = APPLY LLM \"vicuna-13b\"\nWITH DATA retrievedData AND PARAMETERS (temperature = 0.7)\nON ERROR RETRY 2 TIMES;  \nRESPOND TO http_source WITH modelResult\nON ERROR LOG \"Failed to respond to request\";\nEND;\n```",
                        "\"What is AWEL?-Operators-Example of API-RAG\": You can find [source code](https://github.com/eosphoros-ai/DB-GPT/blob/main/examples/awel/simple_rag_example.py) from `examples/awel/simple_rag_example.py`\n```python\nwith DAG(\"simple_rag_example\") as dag:\ntrigger_task = HttpTrigger(\n\"/examples/simple_rag\", methods=\"POST\", request_body=ConversationVo\n)\nreq_parse_task = RequestParseOperator()\n# TODO should register prompt template first\nprompt_task = PromptManagerOperator()\nhistory_storage_task = ChatHistoryStorageOperator()\nhistory_task = ChatHistoryOperator()\nembedding_task = EmbeddingEngingOperator()\nchat_task = BaseChatOperator()\nmodel_task = ModelOperator()\noutput_parser_task = MapOperator(lambda out: out.to_dict()[\"text\"])  \n(\ntrigger_task\n>> req_parse_task\n>> prompt_task\n>> history_storage_task\n>> history_task\n>> embedding_task\n>> chat_task\n>> model_task\n>> output_parser_task\n)  \n```\nBit operations will arrange the entire process in the form of DAG  \n<p align=\"left\">\n<img src={'/img/awel_dag_flow.png'} width=\"360px\" />\n</p>",
                        "\"What is AWEL?-Operators-AgentFream Example\": ```python\naf = AgentFream(HttpSource(\"/examples/run_code\", method = \"post\"))\nresult = (\naf\n.text2vec(model=\"text2vec\")\n.filter(vstore, store = \"chromadb\", db=\"default\")\n.llm(model=\"vicuna-13b\", temperature=0.7)\n.map(code_parse_func)\n.map(run_sql_func)\n.reduce(lambda a, b: a + b)\n)\nresult.write_to_sink(type='source_slink')\n```",
                        "\"What is AWEL?-Currently supported operators\": - **Basic Operators**\n- BaseOperator\n- JoinOperator\n- ReduceOperator\n- MapOperator\n- BranchOperator\n- InputOperator\n- TriggerOperator\n- **Stream Operators**\n- StreamifyAbsOperator\n- UnstreamifyAbsOperator\n- TransformStreamAbsOperator",
                        "\"\": <font style=\"color:#24292E;\">OceanBase通过Root Service管理各个节点间的负载均衡。不同类型的副本需求的资源各不相同,Root Service在执行分区管理操作时需要考虑的因素包括每台ObServer上的CPU、磁盘使用量、内存使用量、IOPS使用情况、避免同一张表格的分区全部落到少数几台ObServer,等等。让耗内存多的副本和耗内存少的副本位于同一台机器上,让占磁盘空间多的副本和占磁盘空间少的副本位于同一台机器上。经过负载均衡,最终会使得所有机器的各类型资源占用都处于一种比较均衡的状态,充分利用每台机器的所有资源。</font>  \n<font style=\"color:#24292E;\">负载均衡分机器、unit两个粒度,前者负责机器之间的均衡,选择一些 unit 整体从负载高的机器迁移到负载低的机器上;后者负责两个unit之间的均衡,从负载高的 unit 搬迁副本到负载低的 unit。</font>  \n|  | **机器负载均衡** | **unit负载均衡** |\n| --- | --- | --- |\n| 均衡对象 | 机器 | unit |\n| 搬迁内容 | unit | 副本 |\n| 均衡粒度 | 粗 | 细 |\n| 均衡范围 | zone 内均衡 | zone 内均衡 |  \n<font style=\"color:#24292E;\">一个租户拥有若干个资源池,这些资源池的集合描述了这个租户所能使用的所有资源。一个资源池由具有相同资源规格(Unit Config)的若干个UNIT(资源单元)组成。每个UNIT描述了位于一个Server上的一组计算和存储资源,可以视为一个轻量级虚拟机,包括若干CPU资源,内存资源,磁盘资源等。一个资源池只能属于一个租户,一个租户在同一个Server上最多有一个UNIT。对于每个Zone,根据UNIT的动态调度,达到均衡的策略。</font>  \n+ <font style=\"color:#24292E;\">属于同一个租户的若干个UNIT,会均匀分散在不同的server上</font>\n+ <font style=\"color:#24292E;\">属于同一个租户组的若干个UNIT,会尽量均匀分散在不同的server上</font>\n+ <font style=\"color:#24292E;\">当一个Zone内机器整体磁盘使用率超过一定阈值时,通过交换或迁移UNIT降低磁盘水位线</font>\n+ <font style=\"color:#24292E;\">否则,根据UNIT的CPU和内存规格,通过交换或迁移UNIT降低CPU和内存的平均水位线</font>"
                    ]
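                },
                "feedback": "The generated answer is highly relevant to the reference content and correctly summarizes AWEL's features and advantages, explaining in detail its layered API design, including the operator layer, the AgentFream layer and the DSL layer, and its importance for large model application development. The answer fully addresses the user's question, so it is given a score of 4."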
            }
        ]
    ]
}
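
To review a response like the one above without scrolling through the retrieved contexts, the per-metric results can be reduced to a short summary. This is a minimal sketch that assumes the layout shown here (data is a list of lists of result objects with query, metric_name, score and passing fields); the helper name summarize_evaluation and the trimmed sample are hypothetical.

```python
from typing import Any, Dict, List


def summarize_evaluation(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the nested result lists into one row per (query, metric)."""
    if not response.get("success"):
        raise RuntimeError(f"{response.get('err_code')}: {response.get('err_msg')}")
    rows = []
    for dataset_results in response.get("data") or []:
        for result in dataset_results:
            rows.append(
                {
                    "query": result.get("query"),
                    "metric_name": result.get("metric_name"),
                    "score": result.get("score"),
                    "passing": result.get("passing"),
                }
            )
    return rows


# Trimmed sample with the same shape as the response above.
sample_response = {
    "success": True,
    "err_code": None,
    "err_msg": None,
    "data": [
        [
            {
                "query": "what awel talked about",
                "metric_name": "AnswerRelevancyMetric",
                "score": 4.0,
                "passing": True,
            }
        ]
    ],
}

for row in summarize_evaluation(sample_response):
    print(row)
```

A failed response, such as the E0003 error reported earlier in this thread, raises RuntimeError with the server's error code and message instead of returning rows.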

Collaborator

@csunny csunny left a comment


r+

Collaborator

@yhjun1026 yhjun1026 left a comment


r+

@csunny csunny merged commit 811ce63 into eosphoros-ai:main Oct 18, 2024
2 checks passed
@Aries-ckt Aries-ckt added the hacktoberfest-accepted label Oct 18, 2024
Labels
enhancement (New feature or request), hacktoberfest, hacktoberfest-accepted
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature][Evaluate] Agent and RAG evaluate module
4 participants