feat(route): add the hamel blog #17428

Merged · 1 commit · Nov 5, 2024

Conversation

liyaozhong
Contributor

Involved Issue / 该 PR 相关 Issue

Close #

Example for the Proposed Route(s) / 路由地址示例

/hamel/blog

New RSS Route Checklist / 新 RSS 路由检查表

  • New Route / 新的路由
  • Anti-bot or rate limit / 反爬/频率限制
    • If yes, does your code handle it? / 如果有, 是否有对应的措施?
  • Date and time / 日期和时间
    • Parsed / 可以解析
    • Correct time zone / 时区正确
  • New package added / 添加了新的包
  • Puppeteer

Note / 说明

Hamel Husain's blog (hamel.dev) delves into AI, data science, and machine learning with a strong focus on practical applications and open-source contributions. Hamel, a prominent figure in AI and GitHub development, shares tutorials, insights on AI tooling, and real-world case studies, offering valuable guidance for developers and data scientists alike.

github-actions bot added the Route label Nov 3, 2024
Contributor

github-actions bot commented Nov 3, 2024

Successfully generated as follows:

http://localhost:1200/hamel/blog - Success ✔️
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Hamel&#39;s Blog</title>
    <link>https://hamel.dev</link>
    <atom:link href="http://localhost:1200/hamel/blog" rel="self" type="application/rss+xml"></atom:link>
    <description>Hamel&#39;s Blog - Powered by RSSHub</description>
    <generator>RSSHub</generator>
    <webMaster>contact@rsshub.app (RSSHub)</webMaster>
    <language>en</language>
    <lastBuildDate>Sun, 03 Nov 2024 08:19:36 GMT</lastBuildDate>
    <ttl>5</ttl>
    <item>
      <title>Creating a LLM-as-a-Judge That Drives Business Results</title>
      <description>&lt;header id=&quot;title-block-header&quot; class=&quot;quarto-title-block default&quot;&gt;
        &lt;div class=&quot;quarto-title&quot;&gt;
        &lt;h1 class=&quot;title&quot;&gt;Creating a LLM-as-a-Judge That Drives Business Results&lt;/h1&gt;
        &lt;div class=&quot;quarto-categories&quot;&gt;
        &lt;div class=&quot;quarto-category&quot;&gt;LLMs&lt;/div&gt;
        &lt;div class=&quot;quarto-category&quot;&gt;evals&lt;/div&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div&gt;
        &lt;div class=&quot;description&quot;&gt;
        A step-by-step guide with my learnings from 30+ AI implementations.
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta&quot;&gt;
        &lt;div&gt;
        &lt;div class=&quot;quarto-title-meta-heading&quot;&gt;Author&lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta-contents&quot;&gt;
        &lt;p&gt;Hamel Husain &lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div&gt;
        &lt;div class=&quot;quarto-title-meta-heading&quot;&gt;Published&lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta-contents&quot;&gt;
        &lt;p class=&quot;date&quot;&gt;October 29, 2024&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/header&gt;
        &lt;nav id=&quot;TOC-body&quot; role=&quot;doc-toc&quot;&gt;
        &lt;h2 id=&quot;toc-title&quot;&gt;Table Of Contents&lt;/h2&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-problem-ai-teams-are-drowning-in-data&quot; id=&quot;toc-the-problem-ai-teams-are-drowning-in-data&quot;&gt;The Problem: AI Teams Are Drowning in Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-1-find-the-principal-domain-expert&quot; id=&quot;toc-step-1-find-the-principal-domain-expert&quot;&gt;Step 1: Find &lt;em&gt;The&lt;/em&gt; Principal Domain Expert&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#next-steps&quot; id=&quot;toc-next-steps&quot;&gt;Next Steps&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-2-create-a-dataset&quot; id=&quot;toc-step-2-create-a-dataset&quot;&gt;Step 2: Create a Dataset&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#why-a-diverse-dataset-matters&quot; id=&quot;toc-why-a-diverse-dataset-matters&quot;&gt;Why a Diverse Dataset Matters&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#dimensions-for-structuring-your-dataset&quot; id=&quot;toc-dimensions-for-structuring-your-dataset&quot;&gt;Dimensions for Structuring Your Dataset&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#examples-of-features-scenarios-and-personas&quot; id=&quot;toc-examples-of-features-scenarios-and-personas&quot;&gt;Examples of Features, Scenarios, and Personas&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#this-taxonomy-is-not-universal&quot; id=&quot;toc-this-taxonomy-is-not-universal&quot;&gt;This taxonomy is not universal&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#generating-data&quot; id=&quot;toc-generating-data&quot;&gt;Generating Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#example-llm-prompts-for-generating-user-inputs&quot; id=&quot;toc-example-llm-prompts-for-generating-user-inputs&quot;&gt;Example LLM Prompts for Generating User Inputs&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#generating-synthetic-data&quot; id=&quot;toc-generating-synthetic-data&quot;&gt;Generating Synthetic Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#next-steps-1&quot; id=&quot;toc-next-steps-1&quot;&gt;Next Steps&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot; id=&quot;toc-step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot;&gt;Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#why-are-simple-passfail-metrics-important&quot; id=&quot;toc-why-are-simple-passfail-metrics-important&quot;&gt;Why are simple pass/fail metrics important?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-role-of-critiques&quot; id=&quot;toc-the-role-of-critiques&quot;&gt;The Role of Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#examples-of-good-critiques&quot; id=&quot;toc-examples-of-good-critiques&quot;&gt;Examples of Good Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#dont-stray-from-binary-passfail-judgments-when-starting-out&quot; id=&quot;toc-dont-stray-from-binary-passfail-judgments-when-starting-out&quot;&gt;Don’t stray from binary pass/fail judgments when starting out&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#make-it-easy-for-the-domain-expert-to-review-data&quot; id=&quot;toc-make-it-easy-for-the-domain-expert-to-review-data&quot;&gt;Make it easy for the domain expert to review data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-many-examples-do-you-need&quot; id=&quot;toc-how-many-examples-do-you-need&quot;&gt;How many examples do you need?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-4-fix-errors&quot; id=&quot;toc-step-4-fix-errors&quot;&gt;Step 4: Fix Errors&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-5-build-your-llm-as-a-judge-iteratively&quot; id=&quot;toc-step-5-build-your-llm-as-a-judge-iteratively&quot;&gt;Step 5: Build Your LLM as A Judge, Iteratively&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-hidden-power-of-critiques&quot; id=&quot;toc-the-hidden-power-of-critiques&quot;&gt;The Hidden Power of Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#start-with-expert-examples&quot; id=&quot;toc-start-with-expert-examples&quot;&gt;Start with Expert Examples&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot; id=&quot;toc-keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot;&gt;Keep Iterating on the Prompt Until Convergence With Domain Expert&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-to-optimize-the-llm-judge-prompt&quot; id=&quot;toc-how-to-optimize-the-llm-judge-prompt&quot;&gt;How to Optimize the LLM Judge Prompt?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-human-side-of-the-process&quot; id=&quot;toc-the-human-side-of-the-process&quot;&gt;The Human Side of the Process&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-often-should-you-evaluate&quot; id=&quot;toc-how-often-should-you-evaluate&quot;&gt;How Often Should You Evaluate?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-if-this-doesnt-work&quot; id=&quot;toc-what-if-this-doesnt-work&quot;&gt;What if this doesn’t work?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#mistakes-ive-noticed-in-llm-judge-prompts&quot; id=&quot;toc-mistakes-ive-noticed-in-llm-judge-prompts&quot;&gt;Mistakes I’ve noticed in LLM judge prompts&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-6-perform-error-analysis&quot; id=&quot;toc-step-6-perform-error-analysis&quot;&gt;Step 6: Perform Error Analysis&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#classify-traces&quot; id=&quot;toc-classify-traces&quot;&gt;Classify Traces&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#an-interactive-walkthrough-of-error-analysis&quot; id=&quot;toc-an-interactive-walkthrough-of-error-analysis&quot;&gt;An Interactive Walkthrough of Error Analysis&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#fix-your-errors-again&quot; id=&quot;toc-fix-your-errors-again&quot;&gt;Fix Your Errors, Again&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#doing-this-well-requires-data-literacy&quot; id=&quot;toc-doing-this-well-requires-data-literacy&quot;&gt;Doing this well requires data literacy&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-7-create-more-specialized-llm-judges-if-needed&quot; id=&quot;toc-step-7-create-more-specialized-llm-judges-if-needed&quot;&gt;Step 7: Create More Specialized LLM Judges (if needed)&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#recap-of-critique-shadowing&quot; id=&quot;toc-recap-of-critique-shadowing&quot;&gt;Recap of Critique Shadowing&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#its-not-the-judge-that-created-value-afterall&quot; id=&quot;toc-its-not-the-judge-that-created-value-afterall&quot;&gt;It’s Not The Judge That Created Value, Afterall&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#do-you-really-need-this&quot; id=&quot;toc-do-you-really-need-this&quot;&gt;Do You Really Need This?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#faq&quot; id=&quot;toc-faq&quot;&gt;FAQ&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use&quot; id=&quot;toc-if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use&quot;&gt;If I have a good judge LLM, isn’t that also the LLM I’d also want to use?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#do-you-recommend-fine-tuning-judges&quot; id=&quot;toc-do-you-recommend-fine-tuning-judges&quot;&gt;Do you recommend fine-tuning judges?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#whats-wrong-with-off-the-shelf-llm-judges&quot; id=&quot;toc-whats-wrong-with-off-the-shelf-llm-judges&quot;&gt;What’s wrong with off-the-shelf LLM judges?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-evaluate-the-llm-judge&quot; id=&quot;toc-how-do-you-evaluate-the-llm-judge&quot;&gt;How Do you evaluate the LLM judge?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-model-do-you-use-for-the-llm-judge&quot; id=&quot;toc-what-model-do-you-use-for-the-llm-judge&quot;&gt;What model do you use for the LLM judge?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-about-guardrails&quot; id=&quot;toc-what-about-guardrails&quot;&gt;What about guardrails?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.&quot; id=&quot;toc-im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.&quot;&gt;I’m using LLM as a judge, and getting tremendous value but I didn’t follow this approach.&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations&quot; id=&quot;toc-how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations&quot;&gt;How do you choose between traditional ML techniques, LLM-as-a-judge and human annotations?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#can-you-make-judges-from-small-models&quot; id=&quot;toc-can-you-make-judges-from-small-models&quot;&gt;Can you make judges from small models?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-ensure-consistency-when-updating-your-llm-model&quot; id=&quot;toc-how-do-you-ensure-consistency-when-updating-your-llm-model&quot;&gt;How do you ensure consistency when updating your LLM model?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-phase-out-human-in-the-loop-to-scale-this&quot; id=&quot;toc-how-do-you-phase-out-human-in-the-loop-to-scale-this&quot;&gt;How do you phase out human in the loop to scale this?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#resources&quot; id=&quot;toc-resources&quot;&gt;Resources&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#stay-connected&quot; id=&quot;toc-stay-connected&quot;&gt;Stay Connected&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/nav&gt;
        &lt;p&gt;Earlier this year, I wrote &lt;a href=&quot;https://hamel.dev/blog/posts/evals/&quot;&gt;Your AI product needs evals&lt;/a&gt;. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over &lt;a href=&quot;https://parlance-labs.com/&quot;&gt;30 companies&lt;/a&gt; set up their evaluation systems.&lt;/p&gt;
        &lt;section id=&quot;the-problem-ai-teams-are-drowning-in-data&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;the-problem-ai-teams-are-drowning-in-data&quot;&gt;The Problem: AI Teams Are Drowning in Data&lt;/h2&gt;
        &lt;p&gt;Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Too Many Metrics&lt;/strong&gt;: Creating numerous measurements that become unmanageable.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Arbitrary Scoring Systems&lt;/strong&gt;: Using uncalibrated scales (like 1-5) across multiple dimensions, where the difference between scores is unclear and subjective. What makes something a 3 versus a 4? Nobody knows, and different evaluators often interpret these scales differently.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Ignoring Domain Experts&lt;/strong&gt;: Not involving the people who understand the subject matter deeply.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Unvalidated Metrics&lt;/strong&gt;: Using measurements that don’t truly reflect what matters to the users or the business.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;The result? Teams end up buried under mountains of metrics or data they don’t trust and can’t use. Progress grinds to a halt. Everyone gets frustrated.&lt;/p&gt;
        &lt;p&gt;For example, it’s not uncommon for me to see dashboards that look like this:&lt;/p&gt;
        &lt;div class=&quot;quarto-figure quarto-figure-center&quot;&gt;
        &lt;figure class=&quot;figure&quot;&gt;
        &lt;p&gt;&lt;img src=&quot;https://hamel.dev/blog/posts/llm-judge/blog_header.png&quot; class=&quot;img-fluid figure-img&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/p&gt;
        &lt;figcaption&gt;An illustrative example of a bad eval dashboard&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;/div&gt;
        &lt;p&gt;Tracking a bunch of scores on a 1-5 scale is often a sign of a bad eval process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is to use a technique that I call &lt;strong&gt;“Critique Shadowing”&lt;/strong&gt;. Here’s how to do it, step by step.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-1-find-the-principal-domain-expert&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-1-find-the-principal-domain-expert&quot;&gt;Step 1: Find &lt;em&gt;The&lt;/em&gt; Principal Domain Expert&lt;/h2&gt;
        &lt;p&gt;In most organizations, there are usually one or two key individuals whose judgment is crucial to the success of your AI product. These are the people who have deep domain expertise or who represent your target users. Identifying and involving this &lt;strong&gt;Principal Domain Expert&lt;/strong&gt; early in the process is critical.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Why is finding the right domain expert so important?&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;They Set the Standard&lt;/strong&gt;: This person not only defines what is acceptable technically, but also helps you understand if you’re building something users actually want.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Unspoken Expectations&lt;/strong&gt;: By involving them, you uncover their preferences and expectations, which they might not be able to fully articulate upfront. Through the evaluation process, you help them clarify what a “passable” AI interaction looks like.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency in Judgment&lt;/strong&gt;: People in your organization may have different opinions about the AI’s performance. Focusing on the principal expert ensures that evaluations are consistent and aligned with the most critical standards.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Sense of Ownership&lt;/strong&gt;: Involving the expert gives them a stake in the AI’s development. They feel invested because they’ve had a hand in shaping it. In the end, they are more likely to approve of the AI.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Examples of Principal Domain Experts:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;A &lt;strong&gt;psychologist&lt;/strong&gt; for a mental health AI assistant.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;lawyer&lt;/strong&gt; for an AI that analyzes legal documents.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;customer service director&lt;/strong&gt; for a support chatbot.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;lead teacher or curriculum developer&lt;/strong&gt; for an educational AI tool.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;div class=&quot;callout callout-style-default callout-note callout-titled&quot;&gt;
        &lt;div class=&quot;callout-header d-flex align-content-center&quot;&gt;
        &lt;div class=&quot;callout-icon-container&quot;&gt;
        &lt;i class=&quot;callout-icon&quot;&gt;&lt;/i&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-title-container flex-fill&quot;&gt;
        Exceptions
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-body-container callout-body&quot;&gt;
        &lt;p&gt;In a smaller company, this might be the CEO or founder. If you are an independent developer, you should be the domain expert (but be honest with yourself about your expertise).&lt;/p&gt;
        &lt;p&gt;If you must rely on leadership, you should regularly validate their assumptions against real user feedback.&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;p&gt;Many developers attempt to act as the domain expert themselves, or find a convenient proxy (ex: their superior). This is a recipe for disaster. People will have varying opinions about what is acceptable, and you can’t make everyone happy. What’s important is that your principal domain expert is satisfied.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; This doesn’t have to take a lot of the domain expert’s time. Later in this post, I’ll discuss how you can make the process efficient. Their involvement is absolutely critical to the AI’s success.&lt;/p&gt;
        &lt;section id=&quot;next-steps&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;next-steps&quot;&gt;Next Steps&lt;/h3&gt;
        &lt;p&gt;Once you’ve found your expert, we need to give them the right data to review. Let’s talk about how to do that next.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-2-create-a-dataset&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-2-create-a-dataset&quot;&gt;Step 2: Create a Dataset&lt;/h2&gt;
        &lt;p&gt;With your principal domain expert on board, the next step is to build a dataset that captures problems that your AI will encounter. It’s important that the dataset is diverse and represents the types of interactions that your AI will have in production.&lt;/p&gt;
        &lt;section id=&quot;why-a-diverse-dataset-matters&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;why-a-diverse-dataset-matters&quot;&gt;Why a Diverse Dataset Matters&lt;/h3&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Comprehensive Testing&lt;/strong&gt;: Ensures your AI is evaluated across a wide range of situations.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Realistic Interactions&lt;/strong&gt;: Reflects actual user behavior for more relevant evaluations.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Identifies Weaknesses&lt;/strong&gt;: Helps uncover areas where the AI may struggle or produce errors.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/section&gt;
        &lt;section id=&quot;dimensions-for-structuring-your-dataset&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;dimensions-for-structuring-your-dataset&quot;&gt;Dimensions for Structuring Your Dataset&lt;/h3&gt;
        &lt;p&gt;You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Specific functionalities of your AI product.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Scenarios&lt;/strong&gt;: Situations or problems the AI may encounter and needs to handle.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Personas&lt;/strong&gt;: Representative user profiles with distinct characteristics and needs.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;/section&gt;
        &lt;section id=&quot;examples-of-features-scenarios-and-personas&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;examples-of-features-scenarios-and-personas&quot;&gt;Examples of Features, Scenarios, and Personas&lt;/h3&gt;
        &lt;section id=&quot;features&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;features&quot;&gt;Features&lt;/h4&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 30%&quot;&gt;
        &lt;col style=&quot;width: 69%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Email Summarization&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Condensing lengthy emails into key points.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Meeting Scheduler&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Automating the scheduling of meetings across time zones.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Order Tracking&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Providing shipment status and delivery updates.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Contact Search&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Finding and retrieving contact information from a database.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Language Translation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Translating text between languages.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Content Recommendation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Suggesting articles or products based on user interests.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;section id=&quot;scenarios&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;scenarios&quot;&gt;Scenarios&lt;/h4&gt;
        &lt;p&gt;Scenarios are situations the AI needs to handle (not based on the outcome of the AI’s response).&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 31%&quot;&gt;
        &lt;col style=&quot;width: 68%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Multiple Matches Found&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User’s request yields multiple results that need narrowing down. For example: User asks “Where’s my order?” but has three active orders (#123, #124, #125). AI must help identify which specific order they’re asking about.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;No Matches Found&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User’s request yields no results, requiring alternatives or corrections. For example: User searches for order #ABC-123 which doesn’t exist. AI should explain valid order formats and suggest checking their confirmation email.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Ambiguous Request&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User input lacks necessary specificity. For example: User says “I need to change my delivery” without specifying which order or what aspect of delivery (date, address, etc.) they want to change.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Invalid Data Provided&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User provides incorrect data type or format. For example: User tries to track a return using a regular order number instead of a return authorization (RMA) number.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;System Errors&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Technical issues prevent normal operation. For example: While looking up an order, the inventory database is temporarily unavailable. AI needs to explain the situation and provide alternatives.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Incomplete Information&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User omits required details. For example: User wants to initiate a return but hasn’t provided the order number or reason. AI needs to collect this information step by step.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Unsupported Feature&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User requests functionality that doesn’t exist. For example: User asks to change payment method after order has shipped. AI must explain why this isn’t possible and suggest alternatives.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;section id=&quot;personas&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;personas&quot;&gt;Personas&lt;/h4&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 27%&quot;&gt;
        &lt;col style=&quot;width: 72%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Persona&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;New User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Unfamiliar with the system; requires guidance.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Expert User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Experienced; expects efficiency and advanced features.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Non-Native Speaker&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;May have language barriers; uses non-standard expressions.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Busy Professional&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Values quick, concise responses; often multitasking.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Technophobe&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Uncomfortable with technology; needs simple instructions.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Elderly User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;May not be tech-savvy; requires patience and clear guidance.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;this-taxonomy-is-not-universal&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;this-taxonomy-is-not-universal&quot;&gt;This taxonomy is not universal&lt;/h3&gt;
        &lt;p&gt;This taxonomy (features, scenarios, personas) is not universal. For example, it may not make sense to even have personas if users aren’t directly engaging with your AI. The idea is you should outline dimensions that make sense for your use case and generate data that covers them. You’ll likely refine these after the first round of evaluations.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;generating-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;generating-data&quot;&gt;Generating Data&lt;/h3&gt;
        &lt;p&gt;To build your dataset, you can:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Use Existing Data&lt;/strong&gt;: Sample real user interactions or behaviors from your AI system.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Generate Synthetic Data&lt;/strong&gt;: Use LLMs to create realistic user inputs covering various features, scenarios, and personas.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Often, you’ll do a combination of both to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.&lt;/p&gt;
        &lt;p&gt;Regardless of whether you use existing data or synthetic data, you want good coverage across the dimensions you’ve defined.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Incorporating System Information&lt;/strong&gt;&lt;/p&gt;
        &lt;p&gt;When making test data, use your APIs and databases where appropriate. This will create realistic data and trigger the right scenarios. Sometimes you’ll need to write simple programs to get this information. That’s what the “Assumptions” column is referring to in the examples below.&lt;/p&gt;
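        &lt;p&gt;As a minimal sketch of what coverage across these dimensions can look like (Python; the dimension values are placeholders drawn from the tables above), enumerating every combination of feature, scenario, and persona is a reasonable starting point:&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;from itertools import product

        # Placeholder dimension values; substitute the ones you defined.
        features = ['Order Tracking', 'Contact Search', 'Meeting Scheduler']
        scenarios = ['No Matches Found', 'Ambiguous Request', 'System Errors']
        personas = ['New User', 'Expert User', 'Busy Professional']

        # Enumerate every combination so no pocket of the test space is skipped.
        test_plan = [
            {'feature': f, 'scenario': s, 'persona': p}
            for f, s, p in product(features, scenarios, personas)
        ]
        print(len(test_plan))  # 27 combinations to seed user inputs from&lt;/code&gt;&lt;/pre&gt;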
        &lt;/section&gt;
        &lt;section id=&quot;example-llm-prompts-for-generating-user-inputs&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;example-llm-prompts-for-generating-user-inputs&quot;&gt;Example LLM Prompts for Generating User Inputs&lt;/h3&gt;
        &lt;p&gt;Here are some example prompts that illustrate how to use an LLM to generate synthetic &lt;strong&gt;user inputs&lt;/strong&gt; for different combinations of features, scenarios, and personas (a short code sketch follows the table):&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 3%&quot;&gt;
        &lt;col style=&quot;width: 11%&quot;&gt;
        &lt;col style=&quot;width: 11%&quot;&gt;
        &lt;col style=&quot;width: 10%&quot;&gt;
        &lt;col style=&quot;width: 35%&quot;&gt;
        &lt;col style=&quot;width: 26%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;ID&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Persona&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;LLM Prompt to Generate User Input&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;Assumptions (not directly in the prompt)&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Order Tracking&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Invalid Data Provided&lt;/td&gt;
        &lt;td&gt;Frustrated Customer&lt;/td&gt;
        &lt;td&gt;“Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number &lt;strong&gt;#1234567890&lt;/strong&gt;. Include hints of previous negative experiences.”&lt;/td&gt;
        &lt;td&gt;Order number &lt;strong&gt;#1234567890&lt;/strong&gt; does &lt;strong&gt;not&lt;/strong&gt; exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Contact Search&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Multiple Matches Found&lt;/td&gt;
        &lt;td&gt;New User&lt;/td&gt;
        &lt;td&gt;“Create a user input from someone who seems unfamiliar with the system, using hesitant language and asking for help to find contact information for a person named ‘Alex’. The user should appear unsure about what information is needed.”&lt;/td&gt;
        &lt;td&gt;Multiple contacts named ‘Alex’ exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;3&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Meeting Scheduler&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Ambiguous Request&lt;/td&gt;
        &lt;td&gt;Busy Professional&lt;/td&gt;
        &lt;td&gt;“Simulate a user input from someone who is clearly in a hurry, using abbreviated language and minimal details to request scheduling a meeting. The message should feel rushed and lack specific information.”&lt;/td&gt;
        &lt;td&gt;N/A&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;4&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Content Recommendation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;No Matches Found&lt;/td&gt;
        &lt;td&gt;Expert User&lt;/td&gt;
        &lt;td&gt;“Produce a user input from someone who demonstrates in-depth knowledge of their industry, using specific terminology to request articles on sustainable supply chain management. Use the information in this article involving sustainable supply chain management to formulate a plausible query: {{article}}”&lt;/td&gt;
        &lt;td&gt;No articles on ‘Emerging trends in sustainable supply chain management’ exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
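        &lt;p&gt;A minimal sketch of turning one row of this table into a synthetic input (Python; &lt;code&gt;call_llm&lt;/code&gt; is a stand-in for whatever LLM client you use, and the template wording is an assumption, not a prescribed prompt):&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;# call_llm is a placeholder for your LLM client, not a real library call.
        PROMPT_TEMPLATE = (
            'Generate a realistic user input for the {feature} feature. '
            'Scenario: {scenario}. Persona: {persona}. '
            'Write only the user message, nothing else.'
        )

        def generate_user_input(row, call_llm):
            prompt = PROMPT_TEMPLATE.format(**row)
            return call_llm(prompt)  # one synthetic user message&lt;/code&gt;&lt;/pre&gt;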
        &lt;/section&gt;
        &lt;section id=&quot;generating-synthetic-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;generating-synthetic-data&quot;&gt;Generating Synthetic Data&lt;/h3&gt;
        &lt;p&gt;When generating synthetic data, you only need to create the user inputs. You then feed these inputs into your AI system to generate the AI’s responses. It’s important that you log everything so you can evaluate your AI. To recap, here’s the process (sketched in code after the list):&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Generate User Inputs&lt;/strong&gt;: Use the LLM prompts to create realistic user inputs.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Feed Inputs into Your AI System&lt;/strong&gt;: Input the user interactions into your AI as it currently exists.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Capture AI Responses&lt;/strong&gt;: Record the AI’s responses to form complete interactions.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Organize the Interactions&lt;/strong&gt;: Create a table to store the user inputs, AI responses, and relevant metadata.&lt;/li&gt;
        &lt;/ol&gt;
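        &lt;p&gt;A minimal sketch of that four-step loop (Python; &lt;code&gt;ai_system&lt;/code&gt; and &lt;code&gt;call_llm&lt;/code&gt; are placeholders for your own stack, and the CSV layout is just one way to organize the interactions):&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;import csv

        def build_dataset(test_plan, generate_user_input, ai_system, call_llm):
            rows = []
            for combo in test_plan:
                user_input = generate_user_input(combo, call_llm)  # step 1
                ai_response = ai_system(user_input)                # steps 2 and 3
                rows.append({**combo, 'user_input': user_input,    # step 4
                             'ai_response': ai_response})
            # Log everything in one table: dimensions, input, and response.
            with open('interactions.csv', 'w', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
                writer.writeheader()
                writer.writerows(rows)
            return rows&lt;/code&gt;&lt;/pre&gt;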
        &lt;section id=&quot;how-much-data-should-you-generate&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;how-much-data-should-you-generate&quot;&gt;How much data should you generate?&lt;/h4&gt;
        &lt;p&gt;There is no right answer here. At a minimum, you want to generate enough data so that you have examples for each combination of dimensions (in this toy example: features, scenarios, and personas). However, you also want to keep generating more data until you feel like you have stopped seeing new failure modes. The amount of data I generate varies significantly depending on the use case.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;does-synthetic-data-actually-work&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;does-synthetic-data-actually-work&quot;&gt;Does synthetic data actually work?&lt;/h4&gt;
        &lt;p&gt;You might be skeptical of using synthetic data. After all, it’s not real data, so how can it be a good proxy? In my experience, it works surprisingly well. Some of my favorite AI products, like &lt;a href=&quot;https://hex.tech/&quot;&gt;Hex&lt;/a&gt;, use synthetic data to power their evals:&lt;/p&gt;
        &lt;blockquote class=&quot;blockquote&quot;&gt;
        &lt;p&gt;“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.” &lt;em&gt;&lt;a href=&quot;https://www.linkedin.com/in/bryan-bischof/&quot;&gt;Bryan Bischof&lt;/a&gt;, Head of AI Engineering at Hex&lt;/em&gt;&lt;/p&gt;
        &lt;/blockquote&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;next-steps-1&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;next-steps-1&quot;&gt;Next Steps&lt;/h3&gt;
        &lt;p&gt;With your dataset ready, now comes the most important part: getting your principal domain expert to evaluate the interactions.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot;&gt;Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques&lt;/h2&gt;
        &lt;p&gt;The domain expert’s job is to focus on one thing: &lt;strong&gt;“Did the AI achieve the desired outcome?”&lt;/strong&gt; No complex scoring scales or multiple metrics. Just a clear &lt;strong&gt;pass or fail&lt;/strong&gt; decision. In addition to the pass/fail decision, the domain expert should write a critique that explains their reasoning.&lt;/p&gt;
        &lt;section id=&quot;why-are-simple-passfail-metrics-important&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;why-are-simple-passfail-metrics-important&quot;&gt;Why are simple pass/fail metrics important?&lt;/h3&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Clarity and Focus&lt;/strong&gt;: A binary decision forces everyone to consider what truly matters. It simplifies the evaluation to a single, crucial question.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Actionable Insights&lt;/strong&gt;: Pass/fail judgments are easy to interpret and act upon. They help you quickly identify whether the AI meets the user’s needs.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Forces Articulation of Expectations&lt;/strong&gt;: When domain experts must decide if an interaction passes or fails, they are compelled to articulate their expectations clearly. This process uncovers nuances and unspoken assumptions about how the AI should behave.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient Use of Resources&lt;/strong&gt;: Keeps the evaluation process manageable, especially when starting out. You avoid getting bogged down in detailed metrics that might not be meaningful yet.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/section&gt;
        &lt;section id=&quot;the-role-of-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;the-role-of-critiques&quot;&gt;The Role of Critiques&lt;/h3&gt;
        &lt;p&gt;Alongside a binary pass/fail judgment, it’s important to write a detailed critique of the LLM-generated output. These critiques:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Nuances&lt;/strong&gt;: The critique allows you to note if something was mostly correct but had areas for improvement.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Guide Improvement&lt;/strong&gt;: Detailed feedback provides specific insights into how the AI can be enhanced.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance Simplicity with Depth&lt;/strong&gt;: While the pass/fail offers a clear verdict, the critique offers the depth needed to understand the reasoning behind the judgment.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Why Write Critiques?:&lt;/strong&gt;&lt;/p&gt;
        &lt;p&gt;In practice, domain experts may not have fully internalized all the judgment criteria. By forcing them to make a pass/fail decision and explain their reasoning, they clarify their expectations and provide valuable guidance for refining the AI.&lt;/p&gt;
        &lt;p&gt;&lt;em&gt;“But my problem is complex!”&lt;/em&gt; Trust me—starting simple forces you to focus on what truly matters. You can introduce more complexity later if needed.&lt;/p&gt;
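        &lt;p&gt;One way to store these judgments (a sketch; the field names are assumptions rather than a prescribed schema):&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;from dataclasses import dataclass

        @dataclass
        class LabeledExample:
            user_input: str
            ai_response: str
            passed: bool   # a binary verdict only; no 1-5 scales
            critique: str  # detailed enough that a new employee could follow it&lt;/code&gt;&lt;/pre&gt;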
        &lt;/section&gt;
        &lt;section id=&quot;examples-of-good-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;examples-of-good-critiques&quot;&gt;Examples of Good Critiques&lt;/h3&gt;
        &lt;p&gt;To illustrate how simple pass/fail judgments combined with detailed critiques work in practice, here’s a table showcasing examples of user interactions with an AI assistant. The table includes both &lt;strong&gt;passes&lt;/strong&gt; and &lt;strong&gt;fails&lt;/strong&gt;, with critiques explaining why the AI received that judgment. In cases where the AI passed despite critical issues, the critique highlights these aspects and justifies why it still passed overall. For failed interactions, the critique explains the critical elements that led to the failure.&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 51%&quot;&gt;
        &lt;col style=&quot;width: 24%&quot;&gt;
        &lt;col style=&quot;width: 24%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;User Interaction with AI&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Judgment&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Critique&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I need to cancel my flight booking for next week.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Your flight booking for next week has been canceled. A confirmation email has been sent to you.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI successfully executed the user’s request by canceling the flight and sending a confirmation, achieving the primary goal. &lt;strong&gt;However&lt;/strong&gt;, it should have asked for confirmation before proceeding to prevent accidental cancellations. Despite this critical oversight, the AI fulfilled the user’s request, so it &lt;strong&gt;passes&lt;/strong&gt;, but adding a confirmation step would improve user safety.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “What’s the capital of France?”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “The capital of France is Paris.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI correctly answered the user’s question, providing accurate information promptly. &lt;strong&gt;Although&lt;/strong&gt; the response is sufficient, the AI could enhance the interaction by offering additional interesting facts or asking if the user has more questions. Despite this opportunity for added engagement, the AI met the user’s primary need, so it &lt;strong&gt;passes&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I can’t log into my account; it says my password is incorrect.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Please reset your password by clicking ‘Forgot Password’.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI provided a generic solution without acknowledging the user’s frustration or checking for other issues like account lockout. It failed to offer empathetic support or additional assistance. Due to the lack of personalized help and poor user experience, it &lt;strong&gt;fails&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “Book a table for two at an Italian restaurant tonight at 7 PM.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Table for two at Luigi’s Italian Restaurant booked for tonight at 7 PM. Confirmation number is 12345.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI efficiently completed the booking as requested, providing all necessary details and a confirmation number. &lt;strong&gt;However&lt;/strong&gt;, it could have asked if the user had seating preferences or dietary restrictions. The AI accomplished the primary task successfully, so it &lt;strong&gt;passes&lt;/strong&gt;, with minor opportunities for enhanced service.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I need help with my billing statement.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Please provide your account number and full name.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI requested sensitive information without verifying the user’s identity through secure means, potentially compromising security. It should guide the user through a secure authentication process first. Due to this critical oversight in user data protection, it &lt;strong&gt;fails&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;p&gt;These examples demonstrate how the AI can receive both &lt;strong&gt;“Pass”&lt;/strong&gt; and &lt;strong&gt;“Fail”&lt;/strong&gt; judgments. In the critiques:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;For &lt;strong&gt;passes&lt;/strong&gt;, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;For &lt;strong&gt;fails&lt;/strong&gt;, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors like user experience or security.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Most importantly, &lt;strong&gt;the critique should be detailed enough so that you can use it in a few-shot prompt for an LLM judge&lt;/strong&gt;. In other words, it should be detailed enough that a new employee could understand it. Being too terse is a common mistake.&lt;/p&gt;
        &lt;p&gt;Note that the example user interactions with the AI are simplified for brevity, but you might need to give the domain expert more context to make a judgment. More on that later.&lt;/p&gt;
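        &lt;p&gt;As a sketch of what “detailed enough for a few-shot prompt” can mean in practice, each expert judgment becomes one worked example for the judge (Python; assumes the hypothetical &lt;code&gt;LabeledExample&lt;/code&gt; records sketched above):&lt;/p&gt;
        &lt;pre&gt;&lt;code&gt;def build_judge_prompt(task_description, labeled_examples, new_interaction):
            # new_interaction is the formatted User/AI exchange to be judged.
            parts = [task_description]
            for ex in labeled_examples:
                verdict = 'PASS' if ex.passed else 'FAIL'
                parts.append(
                    f'User: {ex.user_input}\nAI: {ex.ai_response}\n'
                    f'Judgment: {verdict}\nCritique: {ex.critique}'
                )
            # Ask the judge to reply in the same verdict-plus-critique format.
            parts.append(f'{new_interaction}\nJudgment:')
            return '\n\n'.join(parts)&lt;/code&gt;&lt;/pre&gt;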
        &lt;div class=&quot;callout callout-style-default callout-note callout-titled&quot;&gt;
        &lt;div class=&quot;callout-header d-flex align-content-center&quot;&gt;
        &lt;div class=&quot;callout-icon-container&quot;&gt;
        &lt;i class=&quot;callout-icon&quot;&gt;&lt;/i&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-title-container flex-fill&quot;&gt;
        Note
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-body-container callout-body&quot;&gt;
        &lt;p&gt;At this point, you don’t need to perform a root cause analysis into the technical reasons behind why the AI failed. Many times, it’s useful to get a sense of overall behavior before diving into the weeds.&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/section&gt;
        &lt;section id=&quot;dont-stray-from-binary-passfail-judgments-when-starting-out&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;dont-stray-from-binary-passfail-judgments-when-starting-out&quot;&gt;Don’t stray from binary pass/fail judgments when starting out&lt;/h3&gt;
        &lt;p&gt;A common mistake is straying from binary pass/fail judgments. Let’s revisit the dashboard from earlier:&lt;/p&gt;
        &lt;p&gt;&lt;img src=&quot;https://hamel.dev/blog/posts/llm-judge/dashboard.png&quot; class=&quot;img-fluid&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/p&gt;
        &lt;p&gt;If your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale (or any other scale), you’re doing it wrong. Let’s unpack why.&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;It’s not actionable&lt;/strong&gt;: People don’t know what to do with a 3 or 4. It’s not immediately obvious how this number is better than a 2. You need to be able to say “this interaction passed because…” and “this interaction failed because…”.&lt;/li&gt;
        &lt;li&gt;More often than not, &lt;strong&gt;these metrics do not matter&lt;/strong&gt;. Every time I’ve analyzed data on domain expert judgments, they tend not to correlate with these kinds of metrics. By having a domain expert make a binary judgment, you can figure out what truly matters.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;This is why I hate off-the-shelf metrics that come with many evaluation frameworks. They tend to lead people astray.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Common Objections to Pass/Fail Judgments:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;“The business said that these 8 dimensions are important, so we need to evaluate all of them.”&lt;/li&gt;
        &lt;li&gt;“We need to be able to say why an interaction passed or failed.”&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;I can guarantee you that if someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for. They are just guessing. You have to let the domain expert drive and make a pass/fail judgment with critiques so you can figure out what truly matters. Stand your ground here.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;make-it-easy-for-the-domain-expert-to-review-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;make-it-easy-for-the-domain-expert-to-review-data&quot;&gt;Make it easy for the domain expert to review data&lt;/h3&gt;
        &lt;p&gt;Finally, you need to remove all friction from reviewing data. I’ve written about this &lt;a href=&quot;https://hamel.dev/notes/llm/finetuning/data_cleaning.html&quot;&gt;here&lt;/a&gt;. Sometimes, you can just use a spreadsheet. It’s a judgment call as to what is easiest for the domain expert. I’ve found that I often have to provide additional context to help the domain expert understand the user interaction, such as:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;Metadata about the user, such as their location, subscription tier, etc.&lt;/li&gt;
        &lt;li&gt;Additional context about the system, such as the current time, inventory levels, etc.&lt;/li&gt;
        &lt;li&gt;Resources so you can check if the AI’s response is correct (ex: ability to search a database, etc.)&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;All of this data needs to be presented on a single screen so the domain expert can review it without jumping through hoops. That’s why I recommend building &lt;a href=&quot;https://hamel.dev/notes/llm/finetuning/data_cleaning.html&quot;&gt;a simple web app&lt;/a&gt; to review data.&lt;/p&gt;
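        &lt;p&gt;To make this concrete, here is a minimal sketch of what such a single-screen review app could look like. It uses Streamlit purely for illustration; the framework choice, the &lt;code&gt;load_traces&lt;/code&gt; helper, and the trace field names are all assumptions for this sketch, not something prescribed in this post:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Minimal single-screen review app (illustrative sketch only).
        # Everything the expert needs -- interaction, metadata, context -- on one page,
        # plus a binary judgment and a critique box. load_traces() and the trace
        # field names are hypothetical.
        import json
        import streamlit as st

        def load_traces(path=&quot;traces.jsonl&quot;):
            with open(path) as f:
                return [json.loads(line) for line in f]

        traces = load_traces()
        idx = st.number_input(&quot;Trace #&quot;, min_value=0, max_value=len(traces) - 1, value=0)
        trace = traces[idx]

        st.subheader(&quot;User interaction&quot;)
        st.write(trace[&quot;conversation&quot;])
        st.subheader(&quot;Context for the expert&quot;)  # user metadata, system state, lookups
        st.json({k: trace[k] for k in (&quot;user_metadata&quot;, &quot;system_context&quot;)})

        judgment = st.radio(&quot;Judgment&quot;, [&quot;pass&quot;, &quot;fail&quot;])
        critique = st.text_area(&quot;Critique (explain your reasoning)&quot;)
        if st.button(&quot;Save&quot;):
            record = {&quot;id&quot;: trace[&quot;id&quot;], &quot;judgment&quot;: judgment, &quot;critique&quot;: critique}
            with open(&quot;labels.jsonl&quot;, &quot;a&quot;) as f:
                f.write(json.dumps(record) + &quot;\n&quot;)&lt;/code&gt;&lt;/pre&gt;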
        &lt;/section&gt;
        &lt;section id=&quot;how-many-examples-do-you-need&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;how-many-examples-do-you-need&quot;&gt;How many examples do you need?&lt;/h3&gt;
        &lt;p&gt;The number of examples you need depends on the complexity of the task. My heuristic is to start with around 30 examples and keep adding more until I stop seeing new failure modes; from there, I continue until I’m no longer learning anything new.&lt;/p&gt;
        &lt;p&gt;Next, we’ll look at how to use this data to build an LLM judge.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-4-fix-errors&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-4-fix-errors&quot;&gt;Step 4: Fix Errors&lt;/h2&gt;
        &lt;p&gt;After looking at the data, it’s likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it’s totally fine if you find them earlier!&lt;/p&gt;
        &lt;p&gt;If you have already developed &lt;a href=&quot;https://hamel.dev/blog/posts/evals&quot;&gt;Level 1 evals as outlined in my previous post&lt;/a&gt;, you should not have any pervasive errors. However, these errors can sometimes slip through the cracks. If you find pervasive errors, fix them and go back to step 3. Keep iterating until you feel like you have stabilized your system.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-5-build-your-llm-as-a-judge-iteratively&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-5-build-your-llm-as-a-judge-iteratively&quot;&gt;Step 5: Build Your LLM as A Judge, Iteratively&lt;/h2&gt;
        &lt;section id=&quot;the-hidden-power-of-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;the-hidden-power-of-critiques&quot;&gt;The Hidden Power of Critiques&lt;/h3&gt;
        &lt;p&gt;You cannot write a good judge prompt until you’ve seen the data. &lt;a href=&quot;https://arxiv.org/abs/2404.12272&quot;&gt;The paper from Shankar et al.&lt;/a&gt;, “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences”, summarizes this well:&lt;/p&gt;
        &lt;blockquote class=&quot;blockquote&quot;&gt;
        &lt;p&gt;to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.&lt;/p&gt;
        &lt;/blockquote&gt;
        &lt;/section&gt;
        &lt;section id=&quot;start-with-expert-examples&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;start-with-expert-examples&quot;&gt;Start with Expert Examples&lt;/h3&gt;
        &lt;p&gt;Let me share a real-world example of building an LLM judge you can apply to your own use case. When I was helping Honeycomb build their &lt;a href=&quot;https://www.honeycomb.io/blog/introducing-query-assistant&quot;&gt;Query Assistant feature&lt;/a&gt;, we needed a way to evaluate if the AI was generating good queries. Here’s what our LLM judge prompt looked like, including few-shot examples of critiques from our domain expert, &lt;a href=&quot;https://x.com/_cartermp&quot;&gt;Phillip&lt;/a&gt;:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
        You understand the nuances of the Honeycomb query language, including what is likely to be
        most useful from an analytics perspective.
        Here is information about the Honeycomb query language:
        {{query_language_info}}
        Here are some guidelines for evaluating queries:
        {{guidelines}}
        Example evaluations:
        &amp;lt;examples&amp;gt;
        &amp;lt;example-1&amp;gt;
        &amp;lt;nlq&amp;gt;show me traces where ip is 10.0.2.90&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;breakdowns&quot;: [&quot;trace.trace_id&quot;],
        &quot;calculations&quot;: [{&quot;op&quot;: &quot;COUNT&quot;}],
        &quot;filters&quot;: [{&quot;column&quot;: &quot;net.host.ip&quot;, &quot;op&quot;: &quot;=&quot;, &quot;value&quot;: &quot;10.0.2.90&quot;}]
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;The query correctly filters for traces with an IP address of 10.0.2.90
        and counts the occurrences of those traces, grouped by trace.trace_id. The response
        is good as it meets the requirement of showing traces from a specific IP address
        without additional complexities.&quot;,
        &quot;outcome&quot;: &quot;good&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-1&amp;gt;
        &amp;lt;example-2&amp;gt;
        &amp;lt;nlq&amp;gt;show me slowest trace&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;calculations&quot;: [{&quot;column&quot;: &quot;duration_ms&quot;, &quot;op&quot;: &quot;MAX&quot;}],
        &quot;orders&quot;: [{&quot;column&quot;: &quot;duration_ms&quot;, &quot;op&quot;: &quot;MAX&quot;, &quot;order&quot;: &quot;descending&quot;}],
        &quot;limit&quot;: 1
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;While the query attempts to find the slowest trace using MAX(duration_ms)
        and ordering correctly, it fails to group by trace.trace_id. Without this grouping,
        the query only shows the MAX(duration_ms) measurement over time, not the actual
        slowest trace.&quot;,
        &quot;outcome&quot;: &quot;bad&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-2&amp;gt;
        &amp;lt;example-3&amp;gt;
        &amp;lt;nlq&amp;gt;count window-hash where window-hash exists per hour&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;breakdowns&quot;: [&quot;window-hash&quot;],
        &quot;calculations&quot;: [{&quot;op&quot;: &quot;COUNT&quot;}],
        &quot;filters&quot;: [{&quot;column&quot;: &quot;window-hash&quot;, &quot;op&quot;: &quot;exists&quot;}],
        &quot;time_range&quot;: 3600
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;While the query correctly counts window-hash occurrences, the time_range
        of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say &#39;per hour&#39;,
        we need a time_range of at least 36000 seconds to show meaningful hourly patterns.&quot;,
        &quot;outcome&quot;: &quot;bad&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-3&amp;gt;
        &amp;lt;/examples&amp;gt;
        For the following query, first write a detailed critique explaining your reasoning,
        then provide a pass/fail judgment in the same format as above.
        &amp;lt;nlq&amp;gt;{{user_input}}&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {{generated_query}}
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;Notice how each example includes:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;The natural language query (NLQ) in &lt;code&gt;&amp;lt;nlq&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;li&gt;The generated query in &lt;code&gt;&amp;lt;query&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;li&gt;The critique and outcome in &lt;code&gt;&amp;lt;critique&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;In the prompt above, the example critiques are fixed. An advanced approach is to include examples dynamically based upon the item you are judging. You can learn more in &lt;a href=&quot;https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/&quot;&gt;this post about Continual In-Context Learning&lt;/a&gt;.&lt;/p&gt;
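        &lt;p&gt;For concreteness, here is a minimal sketch of how a judge prompt like the one above could be executed. The placeholder names mirror the template (rewritten as Python format placeholders), but the model choice, the &lt;code&gt;judge_prompt.txt&lt;/code&gt; file, and the JSON-extraction step are illustrative assumptions, not details from the original post:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Illustrative sketch: render the judge prompt, then ask a model for a
        # critique plus a pass/fail outcome. Model name and file path are assumptions.
        import json
        from openai import OpenAI

        client = OpenAI()
        # The prompt shown above, saved with {query_language_info}, {guidelines},
        # {user_input} and {generated_query} as format placeholders.
        JUDGE_TEMPLATE = open(&quot;judge_prompt.txt&quot;).read()

        def judge(nlq, generated_query, query_language_info, guidelines):
            prompt = JUDGE_TEMPLATE.format(
                query_language_info=query_language_info,
                guidelines=guidelines,
                user_input=nlq,
                generated_query=json.dumps(generated_query, indent=2),
            )
            resp = client.chat.completions.create(
                model=&quot;gpt-4o&quot;,  # assumption; use whatever judge model you align with
                messages=[{&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: prompt}],
            )
            text = resp.choices[0].message.content
            # The few-shot examples steer the model toward a JSON critique object;
            # production code should handle parse failures instead of assuming one.
            return json.loads(text[text.find(&quot;{&quot;) : text.rfind(&quot;}&quot;) + 1])&lt;/code&gt;&lt;/pre&gt;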
        &lt;/section&gt;
        &lt;section id=&quot;keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot; class=&

@github-actions github-actions bot added the Auto: Route Test Complete label Nov 3, 2024
lib/routes/hamel/index.ts (outdated review comment, resolved)
fix(route): fix pr issue
Contributor

github-actions bot commented Nov 5, 2024

Successfully generated as following:

http://localhost:1200/hamel/blog - Success ✔️
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
  <channel>
    <title>Hamel&#39;s Blog</title>
    <link>https://hamel.dev</link>
    <atom:link href="http://localhost:1200/hamel/blog" rel="self" type="application/rss+xml"></atom:link>
    <description>Hamel&#39;s Blog - Powered by RSSHub</description>
    <generator>RSSHub</generator>
    <webMaster>contact@rsshub.app (RSSHub)</webMaster>
    <language>en</language>
    <lastBuildDate>Tue, 05 Nov 2024 01:21:46 GMT</lastBuildDate>
    <ttl>5</ttl>
    <item>
      <title>Creating a LLM-as-a-Judge That Drives Business Results</title>
      <description>&lt;header id=&quot;title-block-header&quot; class=&quot;quarto-title-block default&quot;&gt;
        &lt;div class=&quot;quarto-title&quot;&gt;
        &lt;h1 class=&quot;title&quot;&gt;Creating a LLM-as-a-Judge That Drives Business Results&lt;/h1&gt;
        &lt;div class=&quot;quarto-categories&quot;&gt;
        &lt;div class=&quot;quarto-category&quot;&gt;LLMs&lt;/div&gt;
        &lt;div class=&quot;quarto-category&quot;&gt;evals&lt;/div&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div&gt;
        &lt;div class=&quot;description&quot;&gt;
        A step-by-step guide with my learnings from 30+ AI implementations.
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta&quot;&gt;
        &lt;div&gt;
        &lt;div class=&quot;quarto-title-meta-heading&quot;&gt;Author&lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta-contents&quot;&gt;
        &lt;p&gt;Hamel Husain &lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div&gt;
        &lt;div class=&quot;quarto-title-meta-heading&quot;&gt;Published&lt;/div&gt;
        &lt;div class=&quot;quarto-title-meta-contents&quot;&gt;
        &lt;p class=&quot;date&quot;&gt;October 29, 2024&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/header&gt;
        &lt;nav id=&quot;TOC-body&quot; role=&quot;doc-toc&quot;&gt;
        &lt;h2 id=&quot;toc-title&quot;&gt;Table Of Contents&lt;/h2&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-problem-ai-teams-are-drowning-in-data&quot; id=&quot;toc-the-problem-ai-teams-are-drowning-in-data&quot;&gt;The Problem: AI Teams Are Drowning in Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-1-find-the-principal-domain-expert&quot; id=&quot;toc-step-1-find-the-principal-domain-expert&quot;&gt;Step 1: Find &lt;em&gt;The&lt;/em&gt; Principal Domain Expert&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#next-steps&quot; id=&quot;toc-next-steps&quot;&gt;Next Steps&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-2-create-a-dataset&quot; id=&quot;toc-step-2-create-a-dataset&quot;&gt;Step 2: Create a Dataset&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#why-a-diverse-dataset-matters&quot; id=&quot;toc-why-a-diverse-dataset-matters&quot;&gt;Why a Diverse Dataset Matters&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#dimensions-for-structuring-your-dataset&quot; id=&quot;toc-dimensions-for-structuring-your-dataset&quot;&gt;Dimensions for Structuring Your Dataset&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#examples-of-features-scenarios-and-personas&quot; id=&quot;toc-examples-of-features-scenarios-and-personas&quot;&gt;Examples of Features, Scenarios, and Personas&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#this-taxonomy-is-not-universal&quot; id=&quot;toc-this-taxonomy-is-not-universal&quot;&gt;This taxonomy is not universal&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#generating-data&quot; id=&quot;toc-generating-data&quot;&gt;Generating Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#example-llm-prompts-for-generating-user-inputs&quot; id=&quot;toc-example-llm-prompts-for-generating-user-inputs&quot;&gt;Example LLM Prompts for Generating User Inputs&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#generating-synthetic-data&quot; id=&quot;toc-generating-synthetic-data&quot;&gt;Generating Synthetic Data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#next-steps-1&quot; id=&quot;toc-next-steps-1&quot;&gt;Next Steps&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot; id=&quot;toc-step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot;&gt;Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#why-are-simple-passfail-metrics-important&quot; id=&quot;toc-why-are-simple-passfail-metrics-important&quot;&gt;Why are simple pass/fail metrics important?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-role-of-critiques&quot; id=&quot;toc-the-role-of-critiques&quot;&gt;The Role of Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#examples-of-good-critiques&quot; id=&quot;toc-examples-of-good-critiques&quot;&gt;Examples of Good Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#dont-stray-from-binary-passfail-judgments-when-starting-out&quot; id=&quot;toc-dont-stray-from-binary-passfail-judgments-when-starting-out&quot;&gt;Don’t stray from binary pass/fail judgments when starting out&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#make-it-easy-for-the-domain-expert-to-review-data&quot; id=&quot;toc-make-it-easy-for-the-domain-expert-to-review-data&quot;&gt;Make it easy for the domain expert to review data&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-many-examples-do-you-need&quot; id=&quot;toc-how-many-examples-do-you-need&quot;&gt;How many examples do you need?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-4-fix-errors&quot; id=&quot;toc-step-4-fix-errors&quot;&gt;Step 4: Fix Errors&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-5-build-your-llm-as-a-judge-iteratively&quot; id=&quot;toc-step-5-build-your-llm-as-a-judge-iteratively&quot;&gt;Step 5: Build Your LLM as A Judge, Iteratively&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-hidden-power-of-critiques&quot; id=&quot;toc-the-hidden-power-of-critiques&quot;&gt;The Hidden Power of Critiques&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#start-with-expert-examples&quot; id=&quot;toc-start-with-expert-examples&quot;&gt;Start with Expert Examples&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot; id=&quot;toc-keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot;&gt;Keep Iterating on the Prompt Until Convergence With Domain Expert&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-to-optimize-the-llm-judge-prompt&quot; id=&quot;toc-how-to-optimize-the-llm-judge-prompt&quot;&gt;How to Optimize the LLM Judge Prompt?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#the-human-side-of-the-process&quot; id=&quot;toc-the-human-side-of-the-process&quot;&gt;The Human Side of the Process&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-often-should-you-evaluate&quot; id=&quot;toc-how-often-should-you-evaluate&quot;&gt;How Often Should You Evaluate?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-if-this-doesnt-work&quot; id=&quot;toc-what-if-this-doesnt-work&quot;&gt;What if this doesn’t work?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#mistakes-ive-noticed-in-llm-judge-prompts&quot; id=&quot;toc-mistakes-ive-noticed-in-llm-judge-prompts&quot;&gt;Mistakes I’ve noticed in LLM judge prompts&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-6-perform-error-analysis&quot; id=&quot;toc-step-6-perform-error-analysis&quot;&gt;Step 6: Perform Error Analysis&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#classify-traces&quot; id=&quot;toc-classify-traces&quot;&gt;Classify Traces&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#an-interactive-walkthrough-of-error-analysis&quot; id=&quot;toc-an-interactive-walkthrough-of-error-analysis&quot;&gt;An Interactive Walkthrough of Error Analysis&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#fix-your-errors-again&quot; id=&quot;toc-fix-your-errors-again&quot;&gt;Fix Your Errors, Again&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#doing-this-well-requires-data-literacy&quot; id=&quot;toc-doing-this-well-requires-data-literacy&quot;&gt;Doing this well requires data literacy&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#step-7-create-more-specialized-llm-judges-if-needed&quot; id=&quot;toc-step-7-create-more-specialized-llm-judges-if-needed&quot;&gt;Step 7: Create More Specialized LLM Judges (if needed)&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#recap-of-critique-shadowing&quot; id=&quot;toc-recap-of-critique-shadowing&quot;&gt;Recap of Critique Shadowing&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#its-not-the-judge-that-created-value-afterall&quot; id=&quot;toc-its-not-the-judge-that-created-value-afterall&quot;&gt;It’s Not The Judge That Created Value, After All&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#do-you-really-need-this&quot; id=&quot;toc-do-you-really-need-this&quot;&gt;Do You Really Need This?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#faq&quot; id=&quot;toc-faq&quot;&gt;FAQ&lt;/a&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use&quot; id=&quot;toc-if-i-have-a-good-judge-llm-isnt-that-also-the-llm-id-also-want-to-use&quot;&gt;If I have a good judge LLM, isn’t that also the LLM I’d want to use?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#do-you-recommend-fine-tuning-judges&quot; id=&quot;toc-do-you-recommend-fine-tuning-judges&quot;&gt;Do you recommend fine-tuning judges?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#whats-wrong-with-off-the-shelf-llm-judges&quot; id=&quot;toc-whats-wrong-with-off-the-shelf-llm-judges&quot;&gt;What’s wrong with off-the-shelf LLM judges?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-evaluate-the-llm-judge&quot; id=&quot;toc-how-do-you-evaluate-the-llm-judge&quot;&gt;How do you evaluate the LLM judge?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-model-do-you-use-for-the-llm-judge&quot; id=&quot;toc-what-model-do-you-use-for-the-llm-judge&quot;&gt;What model do you use for the LLM judge?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#what-about-guardrails&quot; id=&quot;toc-what-about-guardrails&quot;&gt;What about guardrails?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.&quot; id=&quot;toc-im-using-llm-as-a-judge-and-getting-tremendous-value-but-i-didnt-follow-this-approach.&quot;&gt;I’m using LLM as a judge and getting tremendous value, but I didn’t follow this approach.&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations&quot; id=&quot;toc-how-do-you-choose-between-traditional-ml-techniques-llm-as-a-judge-and-human-annotations&quot;&gt;How do you choose between traditional ML techniques, LLM-as-a-judge and human annotations?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#can-you-make-judges-from-small-models&quot; id=&quot;toc-can-you-make-judges-from-small-models&quot;&gt;Can you make judges from small models?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-ensure-consistency-when-updating-your-llm-model&quot; id=&quot;toc-how-do-you-ensure-consistency-when-updating-your-llm-model&quot;&gt;How do you ensure consistency when updating your LLM model?&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#how-do-you-phase-out-human-in-the-loop-to-scale-this&quot; id=&quot;toc-how-do-you-phase-out-human-in-the-loop-to-scale-this&quot;&gt;How do you phase out human in the loop to scale this?&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#resources&quot; id=&quot;toc-resources&quot;&gt;Resources&lt;/a&gt;&lt;/li&gt;
        &lt;li&gt;&lt;a href=&quot;https://hamel.dev/blog/posts/llm-judge/index.html#stay-connected&quot; id=&quot;toc-stay-connected&quot;&gt;Stay Connected&lt;/a&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/nav&gt;
        &lt;p&gt;Earlier this year, I wrote &lt;a href=&quot;https://hamel.dev/blog/posts/evals/&quot;&gt;Your AI product needs evals&lt;/a&gt;. Many of you asked, “How do I get started with LLM-as-a-judge?” This guide shares what I’ve learned after helping over &lt;a href=&quot;https://parlance-labs.com/&quot;&gt;30 companies&lt;/a&gt; set up their evaluation systems.&lt;/p&gt;
        &lt;section id=&quot;the-problem-ai-teams-are-drowning-in-data&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;the-problem-ai-teams-are-drowning-in-data&quot;&gt;The Problem: AI Teams Are Drowning in Data&lt;/h2&gt;
        &lt;p&gt;Ever spend weeks building an AI system, only to realize you have no idea if it’s actually working? You’re not alone. I’ve noticed teams repeat the same mistakes when using LLMs to evaluate AI outputs:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Too Many Metrics&lt;/strong&gt;: Creating numerous measurements that become unmanageable.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Arbitrary Scoring Systems&lt;/strong&gt;: Using uncalibrated scales (like 1-5) across multiple dimensions, where the difference between scores is unclear and subjective. What makes something a 3 versus a 4? Nobody knows, and different evaluators often interpret these scales differently.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Ignoring Domain Experts&lt;/strong&gt;: Not involving the people who understand the subject matter deeply.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Unvalidated Metrics&lt;/strong&gt;: Using measurements that don’t truly reflect what matters to the users or the business.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;The result? Teams end up buried under mountains of metrics or data they don’t trust and can’t use. Progress grinds to a halt. Everyone gets frustrated.&lt;/p&gt;
        &lt;p&gt;For example, it’s not uncommon for me to see dashboards that look like this:&lt;/p&gt;
        &lt;div class=&quot;quarto-figure quarto-figure-center&quot;&gt;
        &lt;figure class=&quot;figure&quot;&gt;
        &lt;p&gt;&lt;img src=&quot;https://hamel.dev/blog/posts/llm-judge/blog_header.png&quot; class=&quot;img-fluid figure-img&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/p&gt;
        &lt;figcaption&gt;An illustrative example of a bad eval dashboard&lt;/figcaption&gt;
        &lt;/figure&gt;
        &lt;/div&gt;
        &lt;p&gt;Tracking a bunch of scores on a 1-5 scale is often a sign of a bad eval process (I’ll discuss why later). In this post, I’ll show you how to avoid these pitfalls. The solution is to use a technique that I call &lt;strong&gt;“Critique Shadowing”&lt;/strong&gt;. Here’s how to do it, step by step.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-1-find-the-principal-domain-expert&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-1-find-the-principal-domain-expert&quot;&gt;Step 1: Find &lt;em&gt;The&lt;/em&gt; Principal Domain Expert&lt;/h2&gt;
        &lt;p&gt;In most organizations, there are usually one (maybe two) key individuals whose judgment is crucial for the success of your AI product. These are the people with deep domain expertise or who best represent your target users. Identifying and involving this &lt;strong&gt;Principal Domain Expert&lt;/strong&gt; early in the process is critical.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Why is finding the right domain expert so important?&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;They Set the Standard&lt;/strong&gt;: This person not only defines what is acceptable technically, but also helps you understand if you’re building something users actually want.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Unspoken Expectations&lt;/strong&gt;: By involving them, you uncover their preferences and expectations, which they might not be able to fully articulate upfront. Through the evaluation process, you help them clarify what a “passable” AI interaction looks like.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Consistency in Judgment&lt;/strong&gt;: People in your organization may have different opinions about the AI’s performance. Focusing on the principal expert ensures that evaluations are consistent and aligned with the most critical standards.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Sense of Ownership&lt;/strong&gt;: Involving the expert gives them a stake in the AI’s development. They feel invested because they’ve had a hand in shaping it. In the end, they are more likely to approve of the AI.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Examples of Principal Domain Experts:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;A &lt;strong&gt;psychologist&lt;/strong&gt; for a mental health AI assistant.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;lawyer&lt;/strong&gt; for an AI that analyzes legal documents.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;customer service director&lt;/strong&gt; for a support chatbot.&lt;/li&gt;
        &lt;li&gt;A &lt;strong&gt;lead teacher or curriculum developer&lt;/strong&gt; for an educational AI tool.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;div class=&quot;callout callout-style-default callout-note callout-titled&quot;&gt;
        &lt;div class=&quot;callout-header d-flex align-content-center&quot;&gt;
        &lt;div class=&quot;callout-icon-container&quot;&gt;
        &lt;i class=&quot;callout-icon&quot;&gt;&lt;/i&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-title-container flex-fill&quot;&gt;
        Exceptions
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-body-container callout-body&quot;&gt;
        &lt;p&gt;In a smaller company, this might be the CEO or founder. If you are an independent developer, you should be the domain expert (but be honest with yourself about your expertise).&lt;/p&gt;
        &lt;p&gt;If you must rely on leadership, you should regularly validate their assumptions against real user feedback.&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;p&gt;Many developers attempt to act as the domain expert themselves, or find a convenient proxy (ex: their superior). This is a recipe for disaster. People will have varying opinions about what is acceptable, and you can’t make everyone happy. What’s important is that your principal domain expert is satisfied.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Remember:&lt;/strong&gt; This doesn’t have to take a lot of the domain expert’s time. Later in this post, I’ll discuss how you can make the process efficient. Their involvement is absolutely critical to the AI’s success.&lt;/p&gt;
        &lt;section id=&quot;next-steps&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;next-steps&quot;&gt;Next Steps&lt;/h3&gt;
        &lt;p&gt;Once you’ve found your expert, we need to give them the right data to review. Let’s talk about how to do that next.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-2-create-a-dataset&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-2-create-a-dataset&quot;&gt;Step 2: Create a Dataset&lt;/h2&gt;
        &lt;p&gt;With your principal domain expert on board, the next step is to build a dataset that captures problems that your AI will encounter. It’s important that the dataset is diverse and represents the types of interactions that your AI will have in production.&lt;/p&gt;
        &lt;section id=&quot;why-a-diverse-dataset-matters&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;why-a-diverse-dataset-matters&quot;&gt;Why a Diverse Dataset Matters&lt;/h3&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Comprehensive Testing&lt;/strong&gt;: Ensures your AI is evaluated across a wide range of situations.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Realistic Interactions&lt;/strong&gt;: Reflects actual user behavior for more relevant evaluations.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Identifies Weaknesses&lt;/strong&gt;: Helps uncover areas where the AI may struggle or produce errors.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/section&gt;
        &lt;section id=&quot;dimensions-for-structuring-your-dataset&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;dimensions-for-structuring-your-dataset&quot;&gt;Dimensions for Structuring Your Dataset&lt;/h3&gt;
        &lt;p&gt;You want to define dimensions that make sense for your use case. For example, here are ones that I often use for B2C applications:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Features&lt;/strong&gt;: Specific functionalities of your AI product.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Scenarios&lt;/strong&gt;: Situations or problems the AI may encounter and needs to handle.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Personas&lt;/strong&gt;: Representative user profiles with distinct characteristics and needs.&lt;/li&gt;
        &lt;/ol&gt;
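        &lt;p&gt;One way to operationalize these dimensions is to enumerate their combinations as a test matrix. A minimal sketch, using a few of the example values discussed in this section:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch: enumerate feature x scenario x persona combinations.
        # The dimension values are illustrative samples from this post.
        from itertools import product

        features = [&quot;Order Tracking&quot;, &quot;Contact Search&quot;, &quot;Meeting Scheduler&quot;]
        scenarios = [&quot;Multiple Matches Found&quot;, &quot;No Matches Found&quot;, &quot;Ambiguous Request&quot;]
        personas = [&quot;New User&quot;, &quot;Expert User&quot;, &quot;Busy Professional&quot;]

        test_matrix = [
            {&quot;feature&quot;: f, &quot;scenario&quot;: s, &quot;persona&quot;: p}
            for f, s, p in product(features, scenarios, personas)
        ]
        print(len(test_matrix))  # 27 combinations to seed data generation&lt;/code&gt;&lt;/pre&gt;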
        &lt;/section&gt;
        &lt;section id=&quot;examples-of-features-scenarios-and-personas&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;examples-of-features-scenarios-and-personas&quot;&gt;Examples of Features, Scenarios, and Personas&lt;/h3&gt;
        &lt;section id=&quot;features&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;features&quot;&gt;Features&lt;/h4&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 30%&quot;&gt;
        &lt;col style=&quot;width: 69%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Email Summarization&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Condensing lengthy emails into key points.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Meeting Scheduler&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Automating the scheduling of meetings across time zones.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Order Tracking&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Providing shipment status and delivery updates.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Contact Search&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Finding and retrieving contact information from a database.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Language Translation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Translating text between languages.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Content Recommendation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Suggesting articles or products based on user interests.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;section id=&quot;scenarios&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;scenarios&quot;&gt;Scenarios&lt;/h4&gt;
        &lt;p&gt;Scenarios are situations the AI needs to handle; they describe the incoming request, not the outcome of the AI’s response.&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 31%&quot;&gt;
        &lt;col style=&quot;width: 68%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Multiple Matches Found&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User’s request yields multiple results that need narrowing down. For example: User asks “Where’s my order?” but has three active orders (#123, #124, #125). AI must help identify which specific order they’re asking about.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;No Matches Found&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User’s request yields no results, requiring alternatives or corrections. For example: User searches for order #ABC-123 which doesn’t exist. AI should explain valid order formats and suggest checking their confirmation email.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Ambiguous Request&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User input lacks necessary specificity. For example: User says “I need to change my delivery” without specifying which order or what aspect of delivery (date, address, etc.) they want to change.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Invalid Data Provided&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User provides incorrect data type or format. For example: User tries to track a return using a regular order number instead of a return authorization (RMA) number.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;System Errors&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Technical issues prevent normal operation. For example: While looking up an order, the inventory database is temporarily unavailable. AI needs to explain the situation and provide alternatives.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Incomplete Information&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User omits required details. For example: User wants to initiate a return but hasn’t provided the order number or reason. AI needs to collect this information step by step.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Unsupported Feature&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;User requests functionality that doesn’t exist. For example: User asks to change payment method after order has shipped. AI must explain why this isn’t possible and suggest alternatives.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;section id=&quot;personas&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;personas&quot;&gt;Personas&lt;/h4&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 27%&quot;&gt;
        &lt;col style=&quot;width: 72%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;Persona&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;New User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Unfamiliar with the system; requires guidance.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Expert User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Experienced; expects efficiency and advanced features.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Non-Native Speaker&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;May have language barriers; uses non-standard expressions.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Busy Professional&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Values quick, concise responses; often multitasking.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Technophobe&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Uncomfortable with technology; needs simple instructions.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;Elderly User&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;May not be tech-savvy; requires patience and clear guidance.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;this-taxonomy-is-not-universal&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;this-taxonomy-is-not-universal&quot;&gt;This taxonomy is not universal&lt;/h3&gt;
        &lt;p&gt;This taxonomy (features, scenarios, personas) is not universal. For example, it may not make sense to even have personas if users aren’t directly engaging with your AI. The idea is you should outline dimensions that make sense for your use case and generate data that covers them. You’ll likely refine these after the first round of evaluations.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;generating-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;generating-data&quot;&gt;Generating Data&lt;/h3&gt;
        &lt;p&gt;To build your dataset, you can:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;strong&gt;Use Existing Data&lt;/strong&gt;: Sample real user interactions or behaviors from your AI system.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Generate Synthetic Data&lt;/strong&gt;: Use LLMs to create realistic user inputs covering various features, scenarios, and personas.&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Often, you’ll do a combination of both to ensure comprehensive coverage. Synthetic data is not as good as real data, but it’s a good starting point. Also, we are only using LLMs to generate the user inputs, not the LLM responses or internal system behavior.&lt;/p&gt;
        &lt;p&gt;Regardless of whether you use existing data or synthetic data, you want good coverage across the dimensions you’ve defined.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Incorporating System Information&lt;/strong&gt;&lt;/p&gt;
        &lt;p&gt;When making test data, use your APIs and databases where appropriate. This will create realistic data and trigger the right scenarios. Sometimes you’ll need to write simple programs to get this information. That’s what the “Assumptions” column is referring to in the examples below.&lt;/p&gt;
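        &lt;p&gt;As a minimal sketch of such a “simple program”, the snippet below checks real system state before seeding a scenario; the database, schema, and table names are hypothetical:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch: pull real system state to back the &quot;Assumptions&quot; column.
        # Here we confirm an order number does NOT exist before using it in an
        # &quot;Invalid Data Provided&quot; test case. Schema and table names are hypothetical.
        import sqlite3

        conn = sqlite3.connect(&quot;shop.db&quot;)
        row = conn.execute(
            &quot;SELECT 1 FROM orders WHERE order_number = ?&quot;, (&quot;1234567890&quot;,)
        ).fetchone()
        assert row is None, &quot;pick a different order number for the no-match scenario&quot;&lt;/code&gt;&lt;/pre&gt;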
        &lt;/section&gt;
        &lt;section id=&quot;example-llm-prompts-for-generating-user-inputs&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;example-llm-prompts-for-generating-user-inputs&quot;&gt;Example LLM Prompts for Generating User Inputs&lt;/h3&gt;
        &lt;p&gt;Here are some example prompts that illustrate how to use an LLM to generate synthetic &lt;strong&gt;user inputs&lt;/strong&gt; for different combinations of features, scenarios, and personas:&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 3%&quot;&gt;
        &lt;col style=&quot;width: 11%&quot;&gt;
        &lt;col style=&quot;width: 11%&quot;&gt;
        &lt;col style=&quot;width: 10%&quot;&gt;
        &lt;col style=&quot;width: 35%&quot;&gt;
        &lt;col style=&quot;width: 26%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;ID&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Persona&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;LLM Prompt to Generate User Input&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;Assumptions (not directly in the prompt)&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;1&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Order Tracking&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Invalid Data Provided&lt;/td&gt;
        &lt;td&gt;Frustrated Customer&lt;/td&gt;
        &lt;td&gt;“Generate a user input from someone who is clearly irritated and impatient, using short, terse language to demand information about their order status for order number &lt;strong&gt;#1234567890&lt;/strong&gt;. Include hints of previous negative experiences.”&lt;/td&gt;
        &lt;td&gt;Order number &lt;strong&gt;#1234567890&lt;/strong&gt; does &lt;strong&gt;not&lt;/strong&gt; exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Contact Search&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Multiple Matches Found&lt;/td&gt;
        &lt;td&gt;New User&lt;/td&gt;
        &lt;td&gt;“Create a user input from someone who seems unfamiliar with the system, using hesitant language and asking for help to find contact information for a person named ‘Alex’. The user should appear unsure about what information is needed.”&lt;/td&gt;
        &lt;td&gt;Multiple contacts named ‘Alex’ exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;3&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Meeting Scheduler&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;Ambiguous Request&lt;/td&gt;
        &lt;td&gt;Busy Professional&lt;/td&gt;
        &lt;td&gt;“Simulate a user input from someone who is clearly in a hurry, using abbreviated language and minimal details to request scheduling a meeting. The message should feel rushed and lack specific information.”&lt;/td&gt;
        &lt;td&gt;N/A&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;4&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Content Recommendation&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;No Matches Found&lt;/td&gt;
        &lt;td&gt;Expert User&lt;/td&gt;
        &lt;td&gt;“Produce a user input from someone who demonstrates in-depth knowledge of their industry, using specific terminology to request articles on sustainable supply chain management. Use the information in this article involving sustainable supply chain management to formulate a plausible query: {{article}}”&lt;/td&gt;
        &lt;td&gt;No articles on ‘Emerging trends in sustainable supply chain management’ exist in the system.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
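        &lt;p&gt;A minimal sketch of turning one row of this table into a generation prompt; the template wording is illustrative, not a canonical template from this post:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch: build a user-input generation prompt from one test-matrix row.
        # System-state assumptions (e.g. a nonexistent order number) are arranged by
        # your test harness, as in the &quot;Assumptions&quot; column above.
        def generation_prompt(feature, scenario, persona, details):
            return (
                f&quot;Generate a realistic user input exercising the &#39;{feature}&#39; feature. &quot;
                f&quot;The situation: {scenario}. &quot;
                f&quot;Write it in the voice of this persona: {persona}. &quot;
                f&quot;Extra instructions: {details}&quot;
            )

        print(generation_prompt(
            &quot;Order Tracking&quot;, &quot;Invalid Data Provided&quot;, &quot;Frustrated Customer&quot;,
            &quot;Short, terse language; demands status for order #1234567890.&quot;,
        ))&lt;/code&gt;&lt;/pre&gt;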
        &lt;/section&gt;
        &lt;section id=&quot;generating-synthetic-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;generating-synthetic-data&quot;&gt;Generating Synthetic Data&lt;/h3&gt;
        &lt;p&gt;When generating synthetic data, you only need to create the user inputs. You then feed these inputs into your AI system to generate the AI’s responses. It’s important that you log everything so you can evaluate your AI. To recap, here’s the process:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;Generate User Inputs&lt;/strong&gt;: Use the LLM prompts to create realistic user inputs.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Feed Inputs into Your AI System&lt;/strong&gt;: Input the user interactions into your AI as it currently exists.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Capture AI Responses&lt;/strong&gt;: Record the AI’s responses to form complete interactions.&lt;/li&gt;
        &lt;li&gt;&lt;strong&gt;Organize the Interactions&lt;/strong&gt;: Create a table to store the user inputs, AI responses, and relevant metadata.&lt;/li&gt;
        &lt;/ol&gt;
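        &lt;p&gt;A minimal sketch of that four-step loop, reusing the test matrix and prompt builder sketched earlier; &lt;code&gt;generate_user_input&lt;/code&gt; and &lt;code&gt;run_ai_system&lt;/code&gt; are hypothetical stand-ins for your LLM call and your actual application:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch of the recap loop: generate inputs, run the AI, log interactions.
        # generate_user_input() and run_ai_system() are hypothetical stand-ins.
        import json

        rows = []
        for case in test_matrix:                     # from the dimensions sketch
            user_input = generate_user_input(case)   # step 1: synthetic user input
            ai_response = run_ai_system(user_input)  # steps 2-3: feed input, capture output
            rows.append({**case, &quot;input&quot;: user_input, &quot;output&quot;: ai_response})

        with open(&quot;interactions.jsonl&quot;, &quot;w&quot;) as f:   # step 4: organize for review
            for row in rows:
                f.write(json.dumps(row) + &quot;\n&quot;)&lt;/code&gt;&lt;/pre&gt;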
        &lt;section id=&quot;how-much-data-should-you-generate&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;how-much-data-should-you-generate&quot;&gt;How much data should you generate?&lt;/h4&gt;
        &lt;p&gt;There is no right answer here. At a minimum, you want to generate enough data so that you have examples for each combination of dimensions (in this toy example: features, scenarios, and personas). However, you also want to keep generating more data until you feel like you have stopped seeing new failure modes. The amount of data I generate varies significantly depending on the use case.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;does-synthetic-data-actually-work&quot; class=&quot;level4&quot;&gt;
        &lt;h4 class=&quot;anchored&quot; data-anchor-id=&quot;does-synthetic-data-actually-work&quot;&gt;Does synthetic data actually work?&lt;/h4&gt;
        &lt;p&gt;You might be skeptical of using synthetic data. After all, it’s not real data, so how can it be a good proxy? In my experience, it works surprisingly well. Some of my favorite AI products, like &lt;a href=&quot;https://hex.tech/&quot;&gt;Hex&lt;/a&gt;, use synthetic data to power their evals:&lt;/p&gt;
        &lt;blockquote class=&quot;blockquote&quot;&gt;
        &lt;p&gt;“LLMs are surprisingly good at generating excellent - and diverse - examples of user prompts. This can be relevant for powering application features, and sneakily, for building Evals. If this sounds a bit like the Large Language Snake is eating its tail, I was just as surprised as you! All I can say is: it works, ship it.” &lt;em&gt;&lt;a href=&quot;https://www.linkedin.com/in/bryan-bischof/&quot;&gt;Bryan Bischof&lt;/a&gt;, Head of AI Engineering at Hex&lt;/em&gt;&lt;/p&gt;
        &lt;/blockquote&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;next-steps-1&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;next-steps-1&quot;&gt;Next Steps&lt;/h3&gt;
        &lt;p&gt;With your dataset ready, now comes the most important part: getting your principal domain expert to evaluate the interactions.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-3-direct-the-domain-expert-to-make-passfail-judgments-with-critiques&quot;&gt;Step 3: Direct The Domain Expert to Make Pass/Fail Judgments with Critiques&lt;/h2&gt;
        &lt;p&gt;The domain expert’s job is to focus on one thing: &lt;strong&gt;“Did the AI achieve the desired outcome?”&lt;/strong&gt; No complex scoring scales or multiple metrics. Just a clear &lt;strong&gt;pass or fail&lt;/strong&gt; decision. In addition to the pass/fail decision, the domain expert should write a critique that explains their reasoning.&lt;/p&gt;
        &lt;section id=&quot;why-are-simple-passfail-metrics-important&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;why-are-simple-passfail-metrics-important&quot;&gt;Why are simple pass/fail metrics important?&lt;/h3&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Clarity and Focus&lt;/strong&gt;: A binary decision forces everyone to consider what truly matters. It simplifies the evaluation to a single, crucial question.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Actionable Insights&lt;/strong&gt;: Pass/fail judgments are easy to interpret and act upon. They help you quickly identify whether the AI meets the user’s needs.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Forces Articulation of Expectations&lt;/strong&gt;: When domain experts must decide if an interaction passes or fails, they are compelled to articulate their expectations clearly. This process uncovers nuances and unspoken assumptions about how the AI should behave.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Efficient Use of Resources&lt;/strong&gt;: Keeps the evaluation process manageable, especially when starting out. You avoid getting bogged down in detailed metrics that might not be meaningful yet.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;/section&gt;
        &lt;section id=&quot;the-role-of-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;the-role-of-critiques&quot;&gt;The Role of Critiques&lt;/h3&gt;
        &lt;p&gt;Alongside a binary pass/fail judgment, it’s important to write a detailed critique of the LLM-generated output. These critiques:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Capture Nuances&lt;/strong&gt;: The critique allows you to note if something was mostly correct but had areas for improvement.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Guide Improvement&lt;/strong&gt;: Detailed feedback provides specific insights into how the AI can be enhanced.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;&lt;strong&gt;Balance Simplicity with Depth&lt;/strong&gt;: While the pass/fail offers a clear verdict, the critique offers the depth needed to understand the reasoning behind the judgment.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;&lt;strong&gt;Why Write Critiques?&lt;/strong&gt;&lt;/p&gt;
        &lt;p&gt;In practice, domain experts may not have fully internalized all the judgment criteria. When forced to make a pass/fail decision and explain their reasoning, they clarify their expectations and provide valuable guidance for refining the AI.&lt;/p&gt;
        &lt;p&gt;&lt;em&gt;“But my problem is complex!”&lt;/em&gt; Trust me—starting simple forces you to focus on what truly matters. You can introduce more complexity later if needed.&lt;/p&gt;
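        &lt;p&gt;The artifact the expert produces per interaction can be as small as the record sketched below; the field names are illustrative:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch: the record a domain expert produces for each interaction.
        # A binary judgment plus a critique detailed enough to reuse later in a
        # few-shot judge prompt. Field names are illustrative.
        from dataclasses import dataclass

        @dataclass
        class ExpertLabel:
            interaction_id: str
            judgment: str  # &quot;pass&quot; or &quot;fail&quot; -- deliberately binary, no 1-5 scale
            critique: str  # the reasoning, written so a new employee could follow it

        label = ExpertLabel(
            interaction_id=&quot;trace-0042&quot;,
            judgment=&quot;fail&quot;,
            critique=(&quot;Generic password-reset answer; never checked for account &quot;
                      &quot;lockout or acknowledged the user&#39;s frustration.&quot;),
        )&lt;/code&gt;&lt;/pre&gt;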
        &lt;/section&gt;
        &lt;section id=&quot;examples-of-good-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;examples-of-good-critiques&quot;&gt;Examples of Good Critiques&lt;/h3&gt;
        &lt;p&gt;To illustrate how simple pass/fail judgments combined with detailed critiques work in practice, here’s a table showcasing examples of user interactions with an AI assistant. The table includes both &lt;strong&gt;passes&lt;/strong&gt; and &lt;strong&gt;fails&lt;/strong&gt;, with critiques explaining why the AI received that judgment. In cases where the AI passed despite critical issues, the critique highlights these aspects and justifies why it still passed overall. For failed interactions, the critique explains the critical elements that led to the failure.&lt;/p&gt;
        &lt;table class=&quot;caption-top table&quot;&gt;
        &lt;colgroup&gt;
        &lt;col style=&quot;width: 51%&quot;&gt;
        &lt;col style=&quot;width: 24%&quot;&gt;
        &lt;col style=&quot;width: 24%&quot;&gt;
        &lt;/colgroup&gt;
        &lt;thead&gt;
        &lt;tr class=&quot;header&quot;&gt;
        &lt;th&gt;&lt;strong&gt;User Interaction with AI&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Judgment&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;Critique&lt;/strong&gt;&lt;/th&gt;
        &lt;/tr&gt;
        &lt;/thead&gt;
        &lt;tbody&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I need to cancel my flight booking for next week.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Your flight booking for next week has been canceled. A confirmation email has been sent to you.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI successfully executed the user’s request by canceling the flight and sending a confirmation, achieving the primary goal. &lt;strong&gt;However&lt;/strong&gt;, it should have asked for confirmation before proceeding to prevent accidental cancellations. Despite this critical oversight, the AI fulfilled the user’s request, so it &lt;strong&gt;passes&lt;/strong&gt;, but adding a confirmation step would improve user safety.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “What’s the capital of France?”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “The capital of France is Paris.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI correctly answered the user’s question, providing accurate information promptly. &lt;strong&gt;Although&lt;/strong&gt; the response is sufficient, the AI could enhance the interaction by offering additional interesting facts or asking if the user has more questions. Despite this opportunity for added engagement, the AI met the user’s primary need, so it &lt;strong&gt;passes&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I can’t log into my account; it says my password is incorrect.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Please reset your password by clicking ‘Forgot Password’.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI provided a generic solution without acknowledging the user’s frustration or checking for other issues like account lockout. It failed to offer empathetic support or additional assistance. Due to the lack of personalized help and poor user experience, it &lt;strong&gt;fails&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;even&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “Book a table for two at an Italian restaurant tonight at 7 PM.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Table for two at Luigi’s Italian Restaurant booked for tonight at 7 PM. Confirmation number is 12345.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Pass&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI efficiently completed the booking as requested, providing all necessary details and a confirmation number. &lt;strong&gt;However&lt;/strong&gt;, it could have asked if the user had seating preferences or dietary restrictions. The AI accomplished the primary task successfully, so it &lt;strong&gt;passes&lt;/strong&gt;, with minor opportunities for enhanced service.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr class=&quot;odd&quot;&gt;
        &lt;td&gt;&lt;strong&gt;User:&lt;/strong&gt; “I need help with my billing statement.”&lt;br&gt;&lt;strong&gt;AI:&lt;/strong&gt; “Please provide your account number and full name.”&lt;/td&gt;
        &lt;td&gt;&lt;strong&gt;Fail&lt;/strong&gt;&lt;/td&gt;
        &lt;td&gt;The AI requested sensitive information without verifying the user’s identity through secure means, potentially compromising security. It should guide the user through a secure authentication process first. Due to this critical oversight in user data protection, it &lt;strong&gt;fails&lt;/strong&gt;.&lt;/td&gt;
        &lt;/tr&gt;
        &lt;/tbody&gt;
        &lt;/table&gt;
        &lt;p&gt;These examples demonstrate how the AI can receive both &lt;strong&gt;“Pass”&lt;/strong&gt; and &lt;strong&gt;“Fail”&lt;/strong&gt; judgments. In the critiques:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;&lt;p&gt;For &lt;strong&gt;passes&lt;/strong&gt;, we explain why the AI succeeded in meeting the user’s primary need, even if there were critical aspects that could be improved. We highlight these areas for enhancement while justifying the overall passing judgment.&lt;/p&gt;&lt;/li&gt;
        &lt;li&gt;&lt;p&gt;For &lt;strong&gt;fails&lt;/strong&gt;, we identify the critical elements that led to the failure, explaining why the AI did not meet the user’s main objective or compromised important factors like user experience or security.&lt;/p&gt;&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;Most importantly, &lt;strong&gt;the critique should be detailed enough to use in a few-shot prompt for an LLM judge&lt;/strong&gt;. In other words, a new employee should be able to understand it. Being too terse is a common mistake.&lt;/p&gt;
        &lt;p&gt;Note that these example user interactions are simplified for brevity; in practice, you might need to give the domain expert more context to make a judgment. More on that later.&lt;/p&gt;
        &lt;div class=&quot;callout callout-style-default callout-note callout-titled&quot;&gt;
        &lt;div class=&quot;callout-header d-flex align-content-center&quot;&gt;
        &lt;div class=&quot;callout-icon-container&quot;&gt;
        &lt;i class=&quot;callout-icon&quot;&gt;&lt;/i&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-title-container flex-fill&quot;&gt;
        Note
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;div class=&quot;callout-body-container callout-body&quot;&gt;
        &lt;p&gt;At this point, you don’t need to perform a root cause analysis of the technical reasons the AI failed. It’s often useful to get a sense of overall behavior before diving into the weeds.&lt;/p&gt;
        &lt;/div&gt;
        &lt;/div&gt;
        &lt;/section&gt;
        &lt;section id=&quot;dont-stray-from-binary-passfail-judgments-when-starting-out&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;dont-stray-from-binary-passfail-judgments-when-starting-out&quot;&gt;Don’t stray from binary pass/fail judgments when starting out&lt;/h3&gt;
        &lt;p&gt;A common mistake is straying from binary pass/fail judgments. Let’s revisit the dashboard from earlier:&lt;/p&gt;
        &lt;p&gt;&lt;img src=&quot;https://hamel.dev/blog/posts/llm-judge/dashboard.png&quot; class=&quot;img-fluid&quot; referrerpolicy=&quot;no-referrer&quot;&gt;&lt;/p&gt;
        &lt;p&gt;If your evaluations consist of a bunch of metrics that LLMs score on a 1-5 scale (or any other scale), you’re doing it wrong. Let’s unpack why.&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;&lt;strong&gt;It’s not actionable&lt;/strong&gt;: People don’t know what to do with a 3 or a 4; it’s not immediately obvious why either is better than a 2. You need to be able to say “this interaction passed because…” and “this interaction failed because…”.&lt;/li&gt;
        &lt;li&gt;More often than not, &lt;strong&gt;these metrics do not matter&lt;/strong&gt;. Every time I’ve analyzed data on domain expert judgments, the judgments have shown little correlation with these kinds of metrics. By having a domain expert make a binary judgment, you can figure out what truly matters.&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;This is why I hate off-the-shelf metrics that come with many evaluation frameworks. They tend to lead people astray.&lt;/p&gt;
        &lt;p&gt;&lt;strong&gt;Common Objections to Pass/Fail Judgments:&lt;/strong&gt;&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;“The business said that these 8 dimensions are important, so we need to evaluate all of them.”&lt;/li&gt;
        &lt;li&gt;“We need to be able to say why an interaction passed or failed.”&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;I can guarantee you that if someone says you need to measure 8 things on a 1-5 scale, they don’t know what they are looking for. They are just guessing. You have to let the domain expert drive and make a pass/fail judgment with critiques so you can figure out what truly matters. Stand your ground here.&lt;/p&gt;
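        &lt;p&gt;If you want to test this claim on your own data, one quick check is to measure how well a 1-5 metric lines up with your expert’s pass/fail labels. Here is a minimal sketch, assuming you have both recorded for the same interactions (the numbers shown are placeholders):&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sanity check: does a 1-5 metric track expert pass/fail labels?
        from scipy.stats import pointbiserialr

        expert_pass = [1, 1, 0, 1, 0, 0, 1, 0]   # binary judgments from the domain expert
        metric_1to5 = [4, 3, 4, 5, 3, 4, 2, 3]   # the off-the-shelf 1-5 score

        corr, p_value = pointbiserialr(expert_pass, metric_1to5)
        print(f&quot;correlation={corr:.2f}, p={p_value:.3f}&quot;)
        # A correlation near zero suggests the metric is not measuring
        # what your expert actually cares about.&lt;/code&gt;&lt;/pre&gt;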
        &lt;/section&gt;
        &lt;section id=&quot;make-it-easy-for-the-domain-expert-to-review-data&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;make-it-easy-for-the-domain-expert-to-review-data&quot;&gt;Make it easy for the domain expert to review data&lt;/h3&gt;
        &lt;p&gt;Finally, you need to remove all friction from reviewing data. I’ve written about this &lt;a href=&quot;https://hamel.dev/notes/llm/finetuning/data_cleaning.html&quot;&gt;here&lt;/a&gt;. Sometimes you can just use a spreadsheet; it’s a judgment call as to what is easiest for the domain expert. I’ve found that I often have to provide additional context to help the domain expert understand the user interaction, such as:&lt;/p&gt;
        &lt;ul&gt;
        &lt;li&gt;Metadata about the user, such as their location, subscription tier, etc.&lt;/li&gt;
        &lt;li&gt;Additional context about the system, such as the current time, inventory levels, etc.&lt;/li&gt;
        &lt;li&gt;Resources to check whether the AI’s response is correct (e.g., the ability to search a database)&lt;/li&gt;
        &lt;/ul&gt;
        &lt;p&gt;All of this data needs to be presented on a single screen so the domain expert can review it without jumping through hoops. That’s why I recommend building &lt;a href=&quot;https://hamel.dev/notes/llm/finetuning/data_cleaning.html&quot;&gt;a simple web app&lt;/a&gt; to review data.&lt;/p&gt;
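        &lt;p&gt;As one possible starting point, here is a minimal sketch of such a review app using Streamlit. The CSV file and column names are placeholders for however you store your traces; any lightweight web framework would work just as well:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Minimal single-screen review app (Streamlit). Column names are illustrative.
        import pandas as pd
        import streamlit as st

        df = pd.read_csv(&quot;interactions.csv&quot;)  # user_input, ai_output, user_metadata
        idx = st.session_state.setdefault(&quot;idx&quot;, 0)
        row = df.iloc[idx]

        # Everything the expert needs on one screen: the interaction plus its context.
        st.subheader(f&quot;Example {idx + 1} of {len(df)}&quot;)
        st.markdown(f&quot;**User:** {row.user_input}&quot;)
        st.markdown(f&quot;**AI:** {row.ai_output}&quot;)
        st.caption(f&quot;Context: {row.user_metadata}&quot;)

        judgment = st.radio(&quot;Judgment&quot;, [&quot;pass&quot;, &quot;fail&quot;], horizontal=True)
        critique = st.text_area(&quot;Critique (detailed enough for a new employee)&quot;)

        if st.button(&quot;Save and next&quot;):
            df.loc[idx, &quot;judgment&quot;] = judgment
            df.loc[idx, &quot;critique&quot;] = critique
            df.to_csv(&quot;interactions.csv&quot;, index=False)
            st.session_state[&quot;idx&quot;] = idx + 1
            st.rerun()&lt;/code&gt;&lt;/pre&gt;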
        &lt;/section&gt;
        &lt;section id=&quot;how-many-examples-do-you-need&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;how-many-examples-do-you-need&quot;&gt;How many examples do you need?&lt;/h3&gt;
        &lt;p&gt;The number of examples you need depends on the complexity of the task. My heuristic is to start with around 30 examples and keep going until I stop seeing new failure modes; in other words, until I’m no longer learning anything new.&lt;/p&gt;
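        &lt;p&gt;One way to operationalize this heuristic is to tag each reviewed failure with a short failure-mode label as you go, then stop once a recent batch of reviews surfaces nothing new. A minimal sketch (the function, batch size, and tags are illustrative):&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Stop once a full batch of recent reviews surfaces no new failure modes.
        # Each entry is the set of failure-mode tags assigned to one example
        # (an empty set for a pass).
        def should_stop(modes_per_example: list, batch_size: int = 10) -&amp;gt; bool:
            if len(modes_per_example) &amp;lt; 30:  # always review ~30 examples first
                return False
            earlier = set().union(*modes_per_example[:-batch_size])
            latest = set().union(*modes_per_example[-batch_size:])
            return latest.issubset(earlier)   # nothing new in the latest batch&lt;/code&gt;&lt;/pre&gt;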
        &lt;p&gt;Next, we’ll look at how to use this data to build an LLM judge.&lt;/p&gt;
        &lt;/section&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-4-fix-errors&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-4-fix-errors&quot;&gt;Step 4: Fix Errors&lt;/h2&gt;
        &lt;p&gt;After looking at the data, it’s likely you will find errors in your AI system. Instead of plowing ahead and building an LLM judge, you want to fix any obvious errors. Remember, the whole point of the LLM as a judge is to help you find these errors, so it’s totally fine if you find them earlier!&lt;/p&gt;
        &lt;p&gt;If you have already developed &lt;a href=&quot;https://hamel.dev/blog/posts/evals&quot;&gt;Level 1 evals as outlined in my previous post&lt;/a&gt;, you should not have any pervasive errors. However, these errors can sometimes slip through the cracks. If you find pervasive errors, fix them and go back to step 3. Keep iterating until you feel like you have stabilized your system.&lt;/p&gt;
        &lt;/section&gt;
        &lt;section id=&quot;step-5-build-your-llm-as-a-judge-iteratively&quot; class=&quot;level2&quot;&gt;
        &lt;h2 class=&quot;anchored&quot; data-anchor-id=&quot;step-5-build-your-llm-as-a-judge-iteratively&quot;&gt;Step 5: Build Your LLM as A Judge, Iteratively&lt;/h2&gt;
        &lt;section id=&quot;the-hidden-power-of-critiques&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;the-hidden-power-of-critiques&quot;&gt;The Hidden Power of Critiques&lt;/h3&gt;
        &lt;p&gt;You cannot write a good judge prompt until you’ve seen the data. &lt;a href=&quot;https://arxiv.org/abs/2404.12272&quot;&gt;The paper from Shankar et al.&lt;/a&gt;, “Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences,” summarizes this well:&lt;/p&gt;
        &lt;blockquote class=&quot;blockquote&quot;&gt;
        &lt;p&gt;to grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria. We dub this phenomenon criteria drift, and it implies that it is impossible to completely determine evaluation criteria prior to human judging of LLM outputs.&lt;/p&gt;
        &lt;/blockquote&gt;
        &lt;/section&gt;
        &lt;section id=&quot;start-with-expert-examples&quot; class=&quot;level3&quot;&gt;
        &lt;h3 class=&quot;anchored&quot; data-anchor-id=&quot;start-with-expert-examples&quot;&gt;Start with Expert Examples&lt;/h3&gt;
        &lt;p&gt;Let me share a real-world example of building an LLM judge you can apply to your own use case. When I was helping Honeycomb build their &lt;a href=&quot;https://www.honeycomb.io/blog/introducing-query-assistant&quot;&gt;Query Assistant feature&lt;/a&gt;, we needed a way to evaluate if the AI was generating good queries. Here’s what our LLM judge prompt looked like, including few-shot examples of critiques from our domain expert, &lt;a href=&quot;https://x.com/_cartermp&quot;&gt;Phillip&lt;/a&gt;:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;You are a Honeycomb query evaluator with advanced capabilities to judge if a query is good or not.
        You understand the nuances of the Honeycomb query language, including what is likely to be
        most useful from an analytics perspective.
        Here is information about the Honeycomb query language:
        {{query_language_info}}
        Here are some guidelines for evaluating queries:
        {{guidelines}}
        Example evaluations:
        &amp;lt;examples&amp;gt;
        &amp;lt;example-1&amp;gt;
        &amp;lt;nlq&amp;gt;show me traces where ip is 10.0.2.90&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;breakdowns&quot;: [&quot;trace.trace_id&quot;],
        &quot;calculations&quot;: [{&quot;op&quot;: &quot;COUNT&quot;}],
        &quot;filters&quot;: [{&quot;column&quot;: &quot;net.host.ip&quot;, &quot;op&quot;: &quot;=&quot;, &quot;value&quot;: &quot;10.0.2.90&quot;}]
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;The query correctly filters for traces with an IP address of 10.0.2.90
        and counts the occurrences of those traces, grouped by trace.trace_id. The response
        is good as it meets the requirement of showing traces from a specific IP address
        without additional complexities.&quot;,
        &quot;outcome&quot;: &quot;good&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-1&amp;gt;
        &amp;lt;example-2&amp;gt;
        &amp;lt;nlq&amp;gt;show me slowest trace&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;calculations&quot;: [{&quot;column&quot;: &quot;duration_ms&quot;, &quot;op&quot;: &quot;MAX&quot;}],
        &quot;orders&quot;: [{&quot;column&quot;: &quot;duration_ms&quot;, &quot;op&quot;: &quot;MAX&quot;, &quot;order&quot;: &quot;descending&quot;}],
        &quot;limit&quot;: 1
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;While the query attempts to find the slowest trace using MAX(duration_ms)
        and ordering correctly, it fails to group by trace.trace_id. Without this grouping,
        the query only shows the MAX(duration_ms) measurement over time, not the actual
        slowest trace.&quot;,
        &quot;outcome&quot;: &quot;bad&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-2&amp;gt;
        &amp;lt;example-3&amp;gt;
        &amp;lt;nlq&amp;gt;count window-hash where window-hash exists per hour&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {
        &quot;breakdowns&quot;: [&quot;window-hash&quot;],
        &quot;calculations&quot;: [{&quot;op&quot;: &quot;COUNT&quot;}],
        &quot;filters&quot;: [{&quot;column&quot;: &quot;window-hash&quot;, &quot;op&quot;: &quot;exists&quot;}],
        &quot;time_range&quot;: 3600
        }
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;
        {
        &quot;critique&quot;: &quot;While the query correctly counts window-hash occurrences, the time_range
        of 3600 seconds (1 hour) is insufficient for per-hour analysis. When we say &#39;per hour&#39;,
        we need a time_range of at least 36000 seconds to show meaningful hourly patterns.&quot;,
        &quot;outcome&quot;: &quot;bad&quot;
        }
        &amp;lt;/critique&amp;gt;
        &amp;lt;/example-3&amp;gt;
        &amp;lt;/examples&amp;gt;
        For the following query, first write a detailed critique explaining your reasoning,
        then provide a pass/fail judgment in the same format as above.
        &amp;lt;nlq&amp;gt;{{user_input}}&amp;lt;/nlq&amp;gt;
        &amp;lt;query&amp;gt;
        {{generated_query}}
        &amp;lt;/query&amp;gt;
        &amp;lt;critique&amp;gt;&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;Notice how each example includes:&lt;/p&gt;
        &lt;ol type=&quot;1&quot;&gt;
        &lt;li&gt;The natural language query (NLQ) in &lt;code&gt;&amp;lt;nlq&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;li&gt;The generated query in &lt;code&gt;&amp;lt;query&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;li&gt;The critique and outcome in &lt;code&gt;&amp;lt;critique&amp;gt;&lt;/code&gt; tags&lt;/li&gt;
        &lt;/ol&gt;
        &lt;p&gt;In the prompt above, the example critiques are fixed. A more advanced approach is to select examples dynamically based on the item you are judging, as sketched below. You can learn more in &lt;a href=&quot;https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/&quot;&gt;this post about Continual In-Context Learning&lt;/a&gt;.&lt;/p&gt;
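        &lt;p&gt;Here is a sketch of that dynamic variant: embed your labeled examples once, then include only the few most similar to the item being judged. The &lt;code&gt;embed&lt;/code&gt; function below is a stand-in for whatever embedding model you use:&lt;/p&gt;
        &lt;pre class=&quot;text&quot;&gt;&lt;code&gt;# Sketch: choose the k most similar labeled examples as few-shot critiques.
        import numpy as np

        def nearest_examples(nlq, examples, embed, k=3):
            q = np.asarray(embed(nlq))
            vecs = np.asarray([embed(e[&quot;nlq&quot;]) for e in examples])
            # cosine similarity between the new query and every labeled example
            sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q))
            top = np.argsort(sims)[::-1][:k]  # indices of the k most similar
            return [examples[i] for i in top]&lt;/code&gt;&lt;/pre&gt;
        &lt;p&gt;You would then render the returned examples into the &lt;code&gt;&amp;lt;examples&amp;gt;&lt;/code&gt; block of the judge prompt before each call.&lt;/p&gt;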
        &lt;/section&gt;
        &lt;section id=&quot;keep-iterating-on-the-prompt-until-convergence-with-domain-expert&quot; class=&

@TonyRL merged commit 0b813e0 into DIYgod:master on Nov 5, 2024
26 of 27 checks passed