Updating Evaluator Notebook to LM Eval Harness (NVIDIA#185)
Co-authored-by: Chris Alexiuk <chris@alexiuk.ca>
chrisalexiuk-nvidia and chris-alexiuk-1 authored Sep 18, 2024
1 parent e85ddc1 commit 9720d61
Showing 1 changed file with 14 additions and 53 deletions.
@@ -67,13 +67,13 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Baseline Evaluation of Llama 3.1 8B Instruct with BigBench\n",
"## Baseline Evaluation of Llama 3.1 8B Instruct with LM Evaluation Harness\n",
"\n",
"The Nemo Evaluator microservice allows users to run a number of academic benchmarks, all of which are accessible through the Nemo Evaluator API.\n",
"\n",
"> NOTE: For more details on what evaluations are available, please head to the [Evaluation documentation](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html)\n",
"\n",
"For this notebook, we will be running the BigBench evaluation (details available [here](https://developer.nvidia.com/docs/nemo-microservices/evaluation/source/evaluations.html#bigbench))! This benchmark consists of 200+ tasks for evaluating LLMs."
"For this notebook, we will be running the LM Evaluation Harness evaluation!"
]
},
{
@@ -90,7 +90,6 @@
"outputs": [],
"source": [
"model_config = {\n",
" \"llm_type\": \"nvidia-nemo-nim\",\n",
" \"llm_name\": \"my-customized-model\",\n",
" \"inference_url\": \"MY_NIM_URL/v1\",\n",
" \"use_chat_endpoint\": False,\n",
@@ -103,11 +102,9 @@
"source": [
"Now we can initialize our evaluation config, which is how we communicate which benchmark tasks, subtasks, etc. to use during evaluation. \n",
"\n",
"For this evaluation, we'll focus on a small subset of BigBench by choosing the `intent_recognition` task. \n",
"For this evaluation, we'll focus on the [GSM8K](https://arxiv.org/abs/2110.14168) evaluation which uses Eleuther AI's [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.3) as a backend. \n",
"\n",
"`intent_recognition` is a task specifically tailored to determine if the model is good at recognizing a given utterance's intent. More details available [here](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/intent_recognition).\n",
"\n",
"We'll also select the `tydiqa_goldp.en` task to see how Llama 3 8B Instruct stacks up on the English subset of the `TyDi QA` benchmark. More details available [here](https://github.com/google-research-datasets/tydiqa)."
"The LM Evaluation Harness supports more than 60 standard academic benchmarks for LLMs!"
]
},
{
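As an aside, the same benchmark can be run directly against the open-source harness, which is useful for sanity-checking results outside the microservice. A minimal sketch, assuming `lm-eval==0.4.3` is installed and an illustrative Hugging Face checkpoint:

    import lm_eval

    # 5-shot GSM8K, mirroring the task settings used in the evaluation config below.
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=meta-llama/Meta-Llama-3.1-8B-Instruct",  # illustrative checkpoint
        tasks=["gsm8k"],
        num_fewshot=5,
        batch_size=16,
    )
    print(results["results"]["gsm8k"])  # per-metric scores such as exact match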
@@ -118,49 +115,17 @@
"source": [
"evaluation_config = {\n",
" \"eval_type\": \"automatic\",\n",
" \"eval_subtype\": \"bigbench\",\n",
" \"standard_tasks\": [\n",
" \"intent_recognition\",\n",
" ],\n",
" \"tydiqa_tasks\": [\n",
" \"tydiqa_goldp.en\",\n",
" ],\n",
" \"standard_tasks_args\": \"--max_length=64 --json_shots='0,2'\",\n",
" \"tydiqa_tasks_args\": \"--max_length=16 --json_shots='1,8'\",\n",
" \"few_shot_example_separator_override\": {\n",
" \"standard_tasks\": {\n",
" \"default\": None\n",
" },\n",
" \"tydiqa_tasks\": {\n",
" \"default\": None\n",
" }\n",
" },\n",
" \"example_input_prefix_override\": {\n",
" \"standard_tasks\": {\n",
" \"default\": None\n",
" },\n",
" \"tydiqa_tasks\": {\n",
" \"default\": None\n",
" }\n",
" },\n",
" \"example_output_prefix_override\": {\n",
" \"standard_tasks\": {\n",
" \"default\": None,\n",
" \"abstract_narrative_understanding\": None\n",
" },\n",
" \"tydiqa_tasks\": {\n",
" \"default\": None\n",
" }\n",
" },\n",
" \"stop_string_override\": {\n",
" \"standard_tasks\": {\n",
" \"default\": None,\n",
" \"abstract_narrative_understanding\": None\n",
" },\n",
" \"tydiqa_tasks\": {\n",
" \"default\": None\n",
" \"eval_subtype\": \"lm_eval_harness\",\n",
" \"tasks\": [\n",
" {\n",
" \"task_name\" : \"gsm8k\",\n",
" \"task_config\" : None,\n",
" \"num_fewshot\" : 5,\n",
" \"batch_size\" : 16,\n",
" \"bootstrap_iters\" : 1000,\n",
" \"limit\" : -1\n",
" }\n",
" }\n",
" ]\n",
"}"
]
},
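A few notes on these fields: `num_fewshot: 5` requests 5-shot prompting (the standard GSM8K setup), `bootstrap_iters: 1000` sets the number of bootstrap resamples used to estimate the metric's standard error, and `limit: -1` appears to disable sample truncation so the full task is evaluated. Once both dictionaries are defined, submission to the Evaluator service might look like the sketch below; the base URL, route, and payload shape are assumptions for illustration, not the documented API (see the evaluation docs linked earlier for the real schema):

    import requests

    EVALUATOR_BASE_URL = "http://my-evaluator-host:7331"  # hypothetical address

    # model_config and evaluation_config are the dictionaries defined in the cells above.
    # The route and payload shape here are illustrative, not the documented Evaluator API.
    response = requests.post(
        f"{EVALUATOR_BASE_URL}/v1/evaluations",
        json={"model": model_config, "evaluation": evaluation_config},
    )
    response.raise_for_status()
    print(response.json())  # typically returns a job ID that can be polled for results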
@@ -375,9 +340,7 @@
"outputs": [],
"source": [
"model_config = {\n",
" \"llm_type\" : \"nvidia-nemo-nim\",\n",
" \"llm_name\" : \"my-customized-model\",\n",
" \"container\" : \"my-customized-container\",\n",
" \"inference_url\" : \"my-customized-inference-url\",\n",
" \"use_chat_endpoint\" : False,\n",
"}"
@@ -522,9 +485,7 @@
"outputs": [],
"source": [
"model_config = {\n",
" \"llm_type\" : \"nvidia-nemo-nim\",\n",
" \"llm_name\" : \"my-customized-model\",\n",
" \"container\" : \"my-customized-container\",\n",
" \"inference_url\" : \"my-customized-inference-url\",\n",
" \"use_chat_endpoint\" : False,\n",
"}"