[MLOB-5297] Prompt Optimization #34154
base: master
Conversation
Force-pushed from f8d61ab to 22bece4
Force-pushed from 22bece4 to 528df03
cswatt left a comment
I pushed some changes to make this page private. I also did a review pass with a few questions and suggestions.
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
Prompt Optimization automates the process of improving LLM prompts through systematic evaluation and AI-powered refinement. Instead of manually testing and tweaking prompts, the Prompt Optimizer uses advanced reasoning models to analyze performance data, identify failure patterns, and suggest targeted improvements.

The system runs your LLM application on a dataset with the current prompt, measures performance using your custom metrics, and then uses a reasoning model (such as OpenAI's o3-mini or Claude 3.5 Sonnet) to analyze the results and generate an improved prompt. This cycle repeats until your target metrics are achieved or the maximum number of iterations is reached.
- What does "the system" refer to?
- For "such as OpenAI's o3-mini or Claude 3.5 Sonnet" — do we expect to be able to update this line whenever new models become available?
- By "the system" I meant the program, this feature, or the Prompt Optimization feature. Not sure which one is best.
- For the reasoning model, maybe we can skip the examples as it will indeed evolve over time. WDYT?
- Let's go with "Prompt Optimization runs your LLM application..."
- I think we should skip the examples.
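For readers of this thread, a minimal sketch of the loop the paragraph above describes. Everything here is illustrative: the function names (`run_experiment`, `compute_metrics`, `suggest_improved_prompt`, `optimize_prompt`) are placeholders, not the documented SDK API.

```python
# Illustrative only: a hand-rolled version of the evaluate-analyze-improve loop
# described above, with placeholder helpers standing in for the real components.
def run_experiment(prompt, dataset):
    # Placeholder: run the LLM application on every record with this prompt.
    return [{"input": row, "output": f"{prompt} -> {row}"} for row in dataset]

def compute_metrics(results):
    # Placeholder: score the results with your custom metrics (for example, precision).
    return 0.0

def suggest_improved_prompt(prompt, results):
    # Placeholder: ask a reasoning model to analyze failures and rewrite the prompt.
    return prompt + " (revised)"

def optimize_prompt(prompt, dataset, target_score=0.9, max_iterations=5):
    current_prompt = prompt
    for _ in range(max_iterations):
        results = run_experiment(current_prompt, dataset)  # run on the dataset
        score = compute_metrics(results)                   # measure performance
        if score >= target_score:                          # stop once the target metric is met
            break
        current_prompt = suggest_improved_prompt(current_prompt, results)
    return current_prompt
```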
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
- Parallel experiment execution for rapid iteration
- Full integration with LLM Observability for tracking and debugging

**Current support:** Prompt Optimization has been validated on Boolean detection tasks and classification use cases, though the architecture supports any output type including structured data extraction, free-form text generation, and numerical predictions.
Suggested change:
Prompt Optimization supports Boolean detection tasks and classification use cases. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.
The documentation should not discuss things in terms of being "current"—everything in the documentation is implied to be current. For example, if support for Feature X is "coming soon," then documentation would state that "Feature X is not supported".
I will change this to:
Prompt Optimization supports any use case where the expected output is known and there is a defined way to score the model’s predictions. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.
WDYT?
Yes that sounds great!
The Prompt Optimizer automates the iterative refinement process through three core components:

1. **Evaluation System**: Runs your LLM application on a dataset and measures performance using custom metrics you define
2. **Analysis Engine**: Categorizes results into meaningful groups, presents performance data and examples to an AI reasoning model
3. **Improvement Loop**: AI generates an improved prompt, the system tests it on the full dataset, compares performance, and repeats
Are these components necessary to explain? I think you can build user trust by showing them what's "under the hood", but the concepts here are never referenced again, and I don't think they build a reader's understanding. The bolding of terms and high placement of the page suggest that these concepts are very necessary for a reader to understand, which I don't think is the case. I would recommend removing this segment. The information here could potentially be used in a blog post or other supplemental content.
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
```python
print(result.summary())
```

## Configuration options
How/where are these options set? Is it through the configuration dictionary mentioned in line 325?
It's true that it is not clear. `max_iterations` and `stopping_condition` are `prompt_optimization` parameters, `jobs` is a parameter of the `run` function, and the other configuration options are in the `config` parameter.
Should I create subsections to be clearer?
Ohhh I see, these are being set in the example code for Step 5. Because it's in a new section, I didn't see that the two are related. I have an idea for how to format this more clearly — when you're done making other changes, I'll make my own edits for this section.
Thanks a lot! I made the modifications!
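Based on the explanation above, a hedged sketch of where each option lives. The class and method names here are assumptions for illustration; only the parameter names (`max_iterations`, `stopping_condition`, `jobs`, `config`) come from the discussion, and `result.summary()` appears in the excerpt above.

```python
# Illustrative only -- not the documented SDK API.
prompt_optimization = PromptOptimization(
    max_iterations=5,                                     # cap on optimization rounds
    stopping_condition=lambda m: m["precision"] >= 0.95,  # stop early once the target is met
)

result = prompt_optimization.run(
    jobs=4,              # parallel experiment execution
    config={             # everything else is passed through the config parameter
        "prompt": "You are a classifier ...",
        "model": "gpt-4o-mini",
    },
)
print(result.summary())
```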
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
content/en/llm_observability/experiments/prompt_optimization.md (outdated; thread resolved)
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
ncybul left a comment
I left some minor suggestions -- I think generally this may be a bit verbose. The best practices section seems like overkill but I think it's fine to leave it in if it would be helpful.
```python
)
```

To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases
Suggested change:
To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases.
### 2. Define your task function

Implement the function that represents your LLM application. This function receives input data and configuration (including the prompt being optimized):
Suggested change:
Implement the function that represents your LLM application. This function receives input data and the configuration (including the prompt being optimized):
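For context, a minimal sketch of what such a task function could look like. The signature, the `config["prompt"]` key, and the OpenAI backend are assumptions for illustration, not the documented SDK interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def task(input_data, config):
    """Run the LLM application on one dataset record using the prompt under optimization."""
    prompt = config["prompt"]  # the prompt being optimized is assumed to arrive via config
    response = client.chat.completions.create(
        model=config.get("model", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": input_data["text"]},
        ],
    )
    return {"label": response.choices[0].message.content.strip()}
```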
```python
# Optimize for both precision and accuracy
return metrics['precision']
```
Suggested change:
```python
# Optimize for precision
return metrics['precision']
```
Thanks for these! Applied all suggested changes in latest commit
```python
}
```

- **Scoring function** can return one of the values returned by one of the summary evaluators (for example, precision) or a combination of multiple metrics:
Is this a prompt optimization-specific concept?
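For illustration, a scoring function might simply pass through one summary metric or blend several into a single score to optimize. A minimal sketch; the exact keys available in `metrics` are an assumption:

```python
def scoring_function(metrics):
    # Optimize for a single summary metric:
    # return metrics["precision"]
    # ...or combine several metrics into one score (the weights here are arbitrary).
    return 0.7 * metrics["precision"] + 0.3 * metrics["recall"]
```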
- **Individual evaluators** measure each output:

```python
def accuracy_evaluator(input_data, output_data, expected_output):
```
Isn't it more of a "confusion matrix"? https://en.wikipedia.org/wiki/Confusion_matrix
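Following that confusion-matrix framing, a hedged sketch of what an individual evaluator for a Boolean detection task could look like. The signature mirrors the `accuracy_evaluator` excerpt above; the field names and return convention are assumptions:

```python
def confusion_matrix_evaluator(input_data, output_data, expected_output):
    """Label one prediction as TP, FP, TN, or FN for a Boolean detection task."""
    predicted = output_data["label"] == "true"
    expected = expected_output["label"] == "true"
    if predicted and expected:
        return "TP"
    if predicted and not expected:
        return "FP"
    if not predicted and not expected:
        return "TN"
    return "FN"
```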
- **Labelization functions** categorize results for showing diverse examples to the optimizer:

```python
def labelization_function(individual_result):
```
Is this also necessary, or could we have a default?
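For context, a minimal sketch of a labelization function that buckets each result so the optimizer is shown a diverse mix of successes and failures. The shape of `individual_result` is an assumption based on the excerpt above:

```python
def labelization_function(individual_result):
    """Bucket each record so the optimizer sees diverse examples."""
    # Assumed shape: the per-record confusion-matrix label is stored on the result.
    label = individual_result["evaluation"]
    if label in ("TP", "TN"):
        return "correct"
    return f"misclassified_{label}"  # e.g. "misclassified_FP", "misclassified_FN"
```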
### 3. Define evaluation functions

Create functions that measure how well your task performs. You need four types of functions:
Suggested explanation:
To optimize your prompt, you need multiple layers of metric computation:
- At the record level, label the results with the confusion-matrix labels (TP, FP, TN, FN)
- [optional] Aggregate these labels into scores
- Compute the score you want to optimize the prompt for (e.g. precision, recall)
- Label the misclassified examples to feed to the prompt
Actually, is the "compute score" step necessary?
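To make that layering concrete, a hedged sketch of a summary evaluator that aggregates per-record confusion-matrix labels into the scores the optimizer targets. The function name and input shape are assumptions:

```python
from collections import Counter

def summary_evaluator(individual_results):
    """Aggregate per-record TP/FP/TN/FN labels into precision and recall."""
    counts = Counter(r["evaluation"] for r in individual_results)
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}
```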
```python
plt.show()
```

## Best practices
Could we add a cookbook that a user can just run?
https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks
Closes MLOB-5297