| 1 | +# Getting Started: LLM as a Judge with Promptolution |
| 2 | + |
| 3 | +## Welcome to Promptolution! |
| 4 | + |
| 5 | +Discover a powerful tool for evolving and optimizing your LLM prompts. This notebook provides a friendly introduction to one of Promptolution's most advanced features: LLM as a Judge. |
| 6 | + |
| 7 | +While the standard getting_started notebook shows how to optimize for classification tasks, this guide will focus on something different. We'll optimize prompts for a creative task where there's no single "correct" answer: *Finding an optimal argument for a statement*! |
| 8 | + |
| 9 | +## Intro |
| 10 | +In traditional machine learning and prompt optimization, we often rely on labeled data. For a classification task, you need an input (x) and a corresponding ground-truth label (y). The goal is to find a prompt that helps the model predict y correctly. |
| 11 | +But what if your task is more subjective? How do you "label" things like: |
| 12 | + |
| 13 | +- The quality of a generated argument? |
| 14 | +- The creativity of a story? |
| 15 | +- The helpfulness of a summary? |
| 16 | +- The persuasiveness of an essay? |
| 17 | + |
| 18 | +This is where LLM as a Judge comes in. Instead of relying on a pre-defined dataset of labels, we use another powerful Language Model (the "judge") to score the output of our prompts. The process looks like this: |
| 19 | + |
| 20 | +1. A candidate prompt is used to generate a response (e.g., an argument). |
| 21 | +1. A "judge" LLM then evaluates this response based on the task provided and assigns a score. |
| 22 | +1. Promptolution's optimizer uses these scores to identify which prompts are best and evolves them to generate even better responses. |
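The loop above can be sketched in plain Python. This is a toy illustration under stated assumptions, not Promptolution's implementation: `generate_response`, `judge_score`, and `score_prompt` are hypothetical names, and the two LLM calls are replaced by deterministic stubs so the example runs offline.

```python
from statistics import mean


def generate_response(prompt: str, topic: str) -> str:
    """Hypothetical stand-in for the LLM that writes the argument."""
    return f"{prompt} {topic}"


def judge_score(task: str, topic: str, response: str) -> float:
    """Hypothetical stand-in for the judge LLM; returns a score in [0, 1].

    A real judge would read the task and the response. This stub only
    rewards responses that mention evidence.
    """
    return 1.0 if "evidence" in response.lower() else 0.5


def score_prompt(prompt: str, topics: list[str], task: str) -> float:
    """Average judge score of one candidate prompt over a batch of topics."""
    return mean(
        judge_score(task, topic, generate_response(prompt, topic))
        for topic in topics
    )


task = "Given a statement, find the best argument supporting it."
topics = [
    "Payday loans should be banned",
    "Intelligence tests bring more harm than good",
]
candidates = [
    "Argue convincingly for this position:",
    "Write a persuasive argument with evidence for this statement:",
]

# Score every candidate and keep the one the (stub) judge likes best.
scores = {prompt: score_prompt(prompt, topics, task) for prompt in candidates}
best_prompt = max(scores, key=scores.get)
print(best_prompt, scores[best_prompt])
```

Note that no label column appears anywhere: the only supervision signal is the number returned by the judge.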
| 23 | + |
| 24 | +The beauty of this approach is its flexibility. If a correct answer exists, you can provide ground truths and let the judge LLM decide whether the prediction and the correct answer are equivalent, but you don't have to. |
| 25 | + |
| 26 | +*New to Promptolution? If you haven't seen our classification tutorial yet, check out `getting_started.ipynb` first! It covers the basics of prompt optimization with simpler tasks like text classification. This notebook builds on those concepts but tackles more complex, subjective tasks.* |
| 27 | + |
| 28 | +## Installation |
| 29 | +Install Promptolution with a single command: |
| 30 | + |
| 31 | + |
| 32 | +```python |
| 33 | +! pip install promptolution[api] |
| 34 | +``` |
| 35 | + |
| 36 | +## Imports |
| 37 | + |
| 38 | + |
| 39 | +```python |
| 40 | +import pandas as pd |
| 41 | +from promptolution.utils import ExperimentConfig |
| 42 | +from promptolution.helpers import run_experiment |
| 43 | +import nest_asyncio |
| 44 | + |
| 45 | +nest_asyncio.apply() # Required for notebook environments |
| 46 | +``` |
| 47 | + |
| 48 | +## Setting Up Your Experiment |
| 49 | + |
| 50 | +### Prepare the data |
| 51 | + |
| 52 | +For this tutorial, we're using IBM's Argument Quality Ranking dataset - a collection of crowd-sourced arguments on controversial topics like capital punishment, abortion rights, and climate change. |
| 53 | + |
| 54 | +Unlike classification tasks where you have clear input-output pairs, here we're working with debate topics that we want to generate compelling arguments for. |
| 55 | + |
| 56 | + |
| 57 | +```python |
| 58 | +df = pd.read_csv("hf://datasets/ibm-research/argument_quality_ranking_30k/dev.csv").sample(300) |
| 59 | +``` |
| 60 | + |
| 61 | + |
| 62 | +```python |
| 63 | +print("\nSample topics:") |
| 64 | +for topic in df["topic"].unique()[:3]: |
| 65 | +    print(f"- {topic}") |
| 66 | +``` |
| 67 | + |
| 68 | + |
| 69 | + Sample topics: |
| 70 | + - We should adopt a zero-tolerance policy in schools |
| 71 | + - Payday loans should be banned |
| 72 | + - Intelligence tests bring more harm than good |
| 73 | + |
| 74 | + |
| 75 | +Our task: **Given a controversial statement, generate the strongest possible argument supporting that position.** |
| 76 | + |
| 78 | + |
| 79 | +### Creating Initial Prompts |
| 80 | + |
| 81 | +Here are some starter prompts for generating compelling arguments. Feel free to experiment with your own! |
| 82 | + |
| 83 | + |
| 84 | +```python |
| 85 | +init_prompts = [ |
| 86 | + "Create a strong argument for this position with clear reasoning and examples:", |
| 87 | + "Write a persuasive argument supporting this statement. Include evidence and address counterarguments:", |
| 88 | + "Make a compelling case for this viewpoint using logical reasoning and real examples:", |
| 89 | + "Argue convincingly for this position. Provide supporting points and evidence:", |
| 90 | + "Build a strong argument for this statement with clear structure and solid reasoning:", |
| 91 | + "Generate a persuasive argument supporting this position. Use facts and logical flow:", |
| 92 | + "Create a well-reasoned argument for this viewpoint with supporting evidence:", |
| 93 | + "Write a convincing argument for this position. Include examples and counter opposing views:", |
| 94 | + "Develop a strong case supporting this statement using clear logic and evidence:", |
| 95 | + "Construct a persuasive argument for this position with solid reasoning and examples:", |
| 96 | +] |
| 97 | +``` |
| 98 | + |
| 99 | +### Configure Your LLM |
| 100 | + |
| 101 | +For this demonstration, we will again use the DeepInfra API, but you can easily switch to other providers like Anthropic or OpenAI by simply changing the `api_url` and `model_id`. |
| 102 | + |
| 103 | + |
| 104 | +```python |
| 105 | +api_key = "YOUR_API_KEY" # Replace with the API key for your LLM provider (DeepInfra in this example) |
| 106 | +``` |
| 107 | + |
| 108 | +Here are the key parameters for LLM-as-a-Judge tasks: |
| 109 | + |
| 110 | + |
| 111 | +```python |
| 112 | +config = ExperimentConfig( |
| 113 | + optimizer="evopromptga", |
| 114 | + task_description="Given a statement, find the best argument supporting it.", |
| 115 | + x_column="topic", |
| 116 | + prompts=init_prompts, |
| 117 | + n_steps=3, |
| 118 | + n_subsamples=10, |
| 119 | + subsample_strategy="random_subsample", |
| 120 | + api_url="https://api.deepinfra.com/v1/openai", |
| 121 | + model_id="meta-llama/Meta-Llama-3-8B-Instruct", |
| 122 | + api_key=api_key, |
| 123 | + task_type="judge", |
| 124 | +) |
| 125 | +``` |
| 126 | + |
| 127 | +- `task_type="judge"` - This tells Promptolution to use LLM evaluation instead of accuracy metrics |
| 128 | +- `x_column="topic"` - We specify which column contains our input (debate topics) |
| 129 | +- `optimizer="evopromptga"` - In the classification tutorial we showcased CAPO; here we use EvoPrompt, a strong evolutionary prompt optimizer. |
| 130 | +- No y column needed - the judge will evaluate quality without ground truth labels! |
| 131 | + |
| 132 | +## Run Your Experiment |
| 133 | + |
| 134 | +With everything configured, you're ready to optimize your prompts! The `run_experiment` function will: |
| 135 | + |
| 136 | +1. Evaluate your initial prompts by generating arguments and having the judge LLM score them |
| 137 | +1. Use evolutionary operators (mutation, crossover) to create new prompt variations from the best-performing ones |
| 138 | +1. Test these new prompt candidates and select the fittest ones for the next generation |
| 139 | +1. Repeat this evolutionary process for the specified number of steps, gradually improving prompt quality |
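The evolutionary loop described above can be sketched as a toy genetic algorithm. Everything here is an assumption made for illustration: `toy_score` stands in for the averaged judge score, `toy_crossover` splices words together (EvoPrompt asks an LLM to combine and mutate prompts instead), mutation is left out for brevity, and none of these names come from Promptolution.

```python
import random

KEYWORDS = ("evidence", "logic", "examples", "counterarguments")


def toy_score(prompt: str) -> float:
    """Hypothetical fitness in [0, 1]; stands in for the averaged judge score."""
    return sum(word in prompt.lower() for word in KEYWORDS) / len(KEYWORDS)


def toy_crossover(parent_a: str, parent_b: str) -> str:
    """Hypothetical operator: first half of one prompt plus second half of another."""
    words_a, words_b = parent_a.split(), parent_b.split()
    return " ".join(words_a[: len(words_a) // 2] + words_b[len(words_b) // 2 :])


def evolve(population: list[str], n_steps: int, rng: random.Random) -> list[str]:
    """Run n_steps generations and return the final population, best first."""
    size = len(population)
    for _ in range(n_steps):
        # Rank the current prompts and breed children from the top half.
        ranked = sorted(population, key=toy_score, reverse=True)
        parents = ranked[: max(2, size // 2)]
        children = [toy_crossover(*rng.sample(parents, 2)) for _ in range(size)]
        # Keep the fittest unique prompts. Old prompts stay in the pool,
        # so the best score can never get worse between generations.
        pool = list(dict.fromkeys(ranked + children))
        population = sorted(pool, key=toy_score, reverse=True)[:size]
    return population


initial = [
    "Create a strong argument with clear reasoning and examples:",
    "Write a persuasive argument. Include evidence and address counterarguments:",
    "Make a compelling case using logic and real examples:",
    "Argue convincingly for this position:",
]
final = evolve(initial, n_steps=3, rng=random.Random(0))
print(final[0], toy_score(final[0]))
```

Keeping the previous generation in the selection pool is one simple design choice (elitism); it trades some diversity for the guarantee that a good prompt is never lost.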
| 140 | + |
| 141 | + |
| 142 | +```python |
| 143 | +prompts = run_experiment(df, config) |
| 144 | +``` |
| 145 | + |
| 146 | + 🔥 Starting optimization... |
| 147 | + |
| 148 | + |
| 149 | +You can expect this to take several minutes as the optimizer generates arguments, evaluates them with the judge, and evolves the prompts. |
| 150 | + |
| 151 | + |
| 152 | +```python |
| 153 | +prompts |
| 154 | +``` |
| 155 | + |
| 156 | + |
| 157 | + |
| 158 | + |
| 159 | +<div> |
| 173 | +<table border="1" class="dataframe"> |
| 174 | + <thead> |
| 175 | + <tr style="text-align: right;"> |
| 176 | + <th></th> |
| 177 | + <th>prompt</th> |
| 178 | + <th>score</th> |
| 179 | + </tr> |
| 180 | + </thead> |
| 181 | + <tbody> |
| 182 | + <tr> |
| 183 | + <th>0</th> |
| 184 | + <td>Construct a persuasive argument supporting the given statement, relying on logical coherence and evidence-based reasoning.</td> |
| 185 | + <td>0.931500</td> |
| 186 | + </tr> |
| 187 | + <tr> |
| 188 | + <th>1</th> |
| 189 | + <td>Develop a strong case supporting this statement using clear logic and evidence:</td> |
| 190 | + <td>0.924167</td> |
| 191 | + </tr> |
| 192 | + <tr> |
| 193 | + <th>2</th> |
| 194 | + <td>Construct a convincing case supporting the stated argument, providing evidence and responding to potential objections.</td> |
| 195 | + <td>0.915833</td> |
| 196 | + </tr> |
| 197 | + <tr> |
| 198 | + <th>3</th> |
| 199 | + <td>Develop a well-reasoned argument in favor of the given statement, incorporating reliable examples and addressing potential counterpoints.</td> |
| 200 | + <td>0.913333</td> |
| 201 | + </tr> |
| 202 | + <tr> |
| 203 | + <th>4</th> |
| 204 | + <td>Write a persuasive argument supporting this statement. Include evidence and address counterarguments:</td> |
| 205 | + <td>0.907500</td> |
| 206 | + </tr> |
| 207 | + <tr> |
| 208 | + <th>5</th> |
| 209 | + <td>Present a convincing case for this assertion, incorporating logical premises and applicable examples.</td> |
| 210 | + <td>0.903333</td> |
| 211 | + </tr> |
| 212 | + <tr> |
| 213 | + <th>6</th> |
| 214 | + <td>Fortify the provided statement with a robust and well-reasoned argument, underscoring logical relationships and leveraging empirical support to build a compelling case, while also anticipating and addressing potential counterpoints.</td> |
| 215 | + <td>0.902500</td> |
| 216 | + </tr> |
| 217 | + <tr> |
| 218 | + <th>7</th> |
| 219 | + <td>Construct a strong claim in support of this statement, employing a logical framework and relevant examples to make a convincing case.</td> |
| 220 | + <td>0.891667</td> |
| 221 | + </tr> |
| 222 | + <tr> |
| 223 | + <th>8</th> |
| 224 | + <td>Create a well-reasoned argument for this viewpoint with supporting evidence:</td> |
| 225 | + <td>0.888333</td> |
| 226 | + </tr> |
| 227 | + <tr> |
| 228 | + <th>9</th> |
| 229 | + <td>Extract the most compelling supporting argument for this statement, grounding it in logical reasoning and bolstered by relevant evidence and examples.</td> |
| 230 | + <td>0.697500</td> |
| 231 | + </tr> |
| 232 | + </tbody> |
| 233 | +</table> |
| 234 | +</div> |
| 235 | + |
| 236 | + |
| 237 | + |
| 238 | +The best prompts aren't always the most obvious ones - let the optimizer surprise you with what works! |
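To reuse the winning prompt elsewhere, sort the results by score. A minimal sketch, assuming the result is a pandas DataFrame with `prompt` and `score` columns as in the table above; the two rows below are copied from that table rather than produced by a fresh run, and `results` is a hypothetical variable name.

```python
import pandas as pd

# Two rows copied from the results table, deliberately out of order.
results = pd.DataFrame(
    {
        "prompt": [
            "Develop a strong case supporting this statement using clear logic and evidence:",
            "Construct a persuasive argument supporting the given statement, "
            "relying on logical coherence and evidence-based reasoning.",
        ],
        "score": [0.924167, 0.931500],
    }
)

# Highest-scoring prompt first.
best = results.sort_values("score", ascending=False).iloc[0]
print(f"{best['score']:.3f}  {best['prompt']}")
```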
| 239 | + |
| 240 | + |
| 241 | +Happy prompt optimizing! 🚀✨ We can't wait to see what you build with Promptolution! 🤖💡 |
| 242 | + |
| 243 | + |