
Conversation

@tillwf (Contributor) commented Jan 27, 2026

Closes MLOB-5297

@tillwf tillwf requested a review from a team as a code owner January 27, 2026 09:13
@github-actions (bot) commented

Preview links (active after the build_preview check completes)

New or renamed files

@tillwf tillwf force-pushed the till.wohlfarth/MLOB-5297/prompt_optimizer branch from f8d61ab to 22bece4 on January 27, 2026 09:21
@github-actions github-actions bot added the Architecture Everything related to the Doc backend label Jan 27, 2026
@tillwf tillwf force-pushed the till.wohlfarth/MLOB-5297/prompt_optimizer branch from 22bece4 to 528df03 on January 27, 2026 09:50
@cswatt cswatt added the editorial review Waiting on a more in-depth review label Jan 27, 2026
@cswatt cswatt self-assigned this Jan 27, 2026
@github-actions github-actions bot removed the Architecture Everything related to the Doc backend label Jan 27, 2026
@cswatt (Contributor) left a comment

I pushed some changes to make this page private. Also, I did a review pass with a few questions and suggestions.


Prompt Optimization automates the process of improving LLM prompts through systematic evaluation and AI-powered refinement. Instead of manually testing and tweaking prompts, the Prompt Optimizer uses advanced reasoning models to analyze performance data, identify failure patterns, and suggest targeted improvements.

The system runs your LLM application on a dataset with the current prompt, measures performance using your custom metrics, and then uses a reasoning model (such as OpenAI's o3-mini or Claude 3.5 Sonnet) to analyze the results and generate an improved prompt. This cycle repeats until your target metrics are achieved or the maximum number of iterations is reached.
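(Illustrative only: a rough pseudocode sketch of the loop described above, with placeholder names rather than the actual SDK API.)

```python
# Placeholder pseudocode for the optimize-evaluate loop described above;
# none of these names are the real SDK API.
def optimize_prompt(run_task, score, suggest_prompt, dataset, initial_prompt,
                    target_score=0.95, max_iterations=5):
    prompt, best = initial_prompt, None
    for _ in range(max_iterations):
        # 1. Run the LLM application on the dataset with the current prompt
        results = [run_task(record, prompt) for record in dataset]
        # 2. Measure performance with the user-defined metric (for example, precision)
        current = score(results)
        if best is None or current > best[1]:
            best = (prompt, current)
        # 3. Stop once the target metric is reached
        if current >= target_score:
            break
        # 4. Otherwise, ask a reasoning model to analyze failures and propose a new prompt
        prompt = suggest_prompt(prompt, results)
    return best
```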
Contributor:

  • What does "the system" refer to?
  • For such as OpenAI's o3-mini or Claude 3.5 Sonnet — do we expect to be able to update this line whenever new models become available?

Contributor Author:

  • By the system I meant the program or this feature or the Prompt Optimization feature.
    Not sure which one is the best
  • For the reasoning model, maybe we can skip the examples as it will evolve in time indeed. WDYT?

Contributor:

  • Let's go with Prompt Optimization runs your LLM Application...
  • I think we should skip the examples

- Parallel experiment execution for rapid iteration
- Full integration with LLM Observability for tracking and debugging

**Current support:** Prompt Optimization has been validated on Boolean detection tasks and classification use cases, though the architecture supports any output type including structured data extraction, free-form text generation, and numerical predictions.
Contributor:

Suggested change
**Current support:** Prompt Optimization has been validated on Boolean detection tasks and classification use cases, though the architecture supports any output type including structured data extraction, free-form text generation, and numerical predictions.
Prompt Optimization supports Boolean detection tasks and classification use cases. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.

The documentation should not discuss things in terms of being "current"—everything in the documentation is implied to be current. For example, if support for Feature X is "coming soon," then documentation would state that "Feature X is not supported".

Contributor Author:

I will change this for:

Prompt Optimization supports any use case where the expected output is known and there is a defined way to score the model’s predictions. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.

WDYT?

Contributor:

Yes that sounds great!

Comment on lines 34 to 38
The Prompt Optimizer automates the iterative refinement process through three core components:

1. **Evaluation System**: Runs your LLM application on a dataset and measures performance using custom metrics you define
2. **Analysis Engine**: Categorizes results into meaningful groups, presents performance data and examples to an AI reasoning model
3. **Improvement Loop**: AI generates an improved prompt, the system tests it on the full dataset, compares performance, and repeats
Contributor:

Are these components necessary to explain? I think you can build user trust by showing them what's "under the hood", but the concepts here are never referenced again, and I don't think they build a reader's understanding. The bolding of terms and high placement of the page suggest that these concepts are very necessary for a reader to understand, which I don't think is the case. I would recommend removing this segment. The information here could potentially be used in a blog post or other supplemental content.

print(result.summary())
```

## Configuration options
Contributor:

How/where are these options set? Is it through the configuration dictionary mentioned in line 325?

Contributor Author:

It's true that it is not clear. max_iterations and stopping_condition are _prompt_optimization parameters, jobs is a parameter of the run function, and the other configuration options go in the config parameter.

Should I create subsections to make this clearer?
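Roughly, the split looks like this (simplified sketch only; the exact names and signatures may differ from the final SDK):

```python
# Simplified sketch; values are illustrative.

# Parameters of _prompt_optimization:
prompt_optimization_params = {
    "max_iterations": 10,
    "stopping_condition": lambda metrics: metrics["precision"] >= 0.95,
}

# Parameter of run():
jobs = 4

# Everything else goes in the config dict:
config = {
    "model": "gpt-4o-mini",
    "prompt": "You are a support-ticket classifier...",
}
```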

Contributor:

Ohhh I see, these are being set in the example code for Step 5. Because it's in a new section, I didn't see that the two are related. I have an idea for how to format this more clearly — when you're done making other changes, I'll make my own edits for this section.

Contributor Author:

Thanks a lot! I made the modifications!

tillwf and others added 4 commits January 28, 2026 09:47
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
@ncybul (Contributor) left a comment

I left some minor suggestions -- I think generally this may be a bit verbose. The best practices section seems like overkill but I think it's fine to leave it in if it would be helpful.

)
```

To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases
Contributor:

Suggested change
To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases
To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases.
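For illustration, a dataset with that kind of coverage might look like the following sketch (field names are placeholders, not a required schema):

```python
# Illustrative only: a small dataset mixing typical cases and edge cases.
dataset = [
    # Typical cases
    {"input": {"text": "I can't log into my account"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": "Great product, thanks!"}, "expected_output": {"is_support_request": False}},
    # Edge cases: terse, non-English, and empty inputs
    {"input": {"text": "pw reset??"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": "Mon compte est bloqué"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": ""}, "expected_output": {"is_support_request": False}},
]
```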


### 2. Define your task function

Implement the function that represents your LLM application. This function receives input data and configuration (including the prompt being optimized):
Contributor:

Suggested change
Implement the function that represents your LLM application. This function receives input data and configuration (including the prompt being optimized):
Implement the function that represents your LLM application. This function receives input data and the configuration (including the prompt being optimized):
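(For context, a minimal sketch of such a task function, assuming an OpenAI-style client; the names, config keys, and return shape are illustrative, not the documented API.)

```python
# Illustrative sketch; the client, config keys, and return shape are placeholders.
from openai import OpenAI

client = OpenAI()

def task(input_data, config):
    """Run the LLM application on one record, using the prompt under optimization."""
    response = client.chat.completions.create(
        model=config.get("model", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": config["prompt"]},  # the prompt being optimized
            {"role": "user", "content": input_data["text"]},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    return {"is_support_request": answer.startswith("yes")}
```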

Comment on lines 171 to 172
# Optimize for both precision and accuracy
return metrics['precision']
Contributor:

Suggested change
# Optimize for both precision and accuracy
return metrics['precision']
# Optimize for precision
return metrics['precision']

@cswatt (Contributor) commented Jan 28, 2026

> I left some minor suggestions -- I think generally this may be a bit verbose. The best practices section seems like overkill but I think it's fine to leave it in if it would be helpful.

Thanks for these! Applied all suggested changes in latest commit

}
```

- **Scoring function** can return a single value produced by one of the summary evaluators (for example, precision) or a combination of multiple metrics:
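An illustrative scoring function (the available metric keys depend on your summary evaluators):

```python
# Illustrative only; the metric keys depend on your summary evaluators.
def scoring_function(metrics):
    # Return a single summary metric...
    return metrics["precision"]

def combined_scoring_function(metrics):
    # ...or combine several metrics into one score (here, a simple F1-style blend).
    p, r = metrics["precision"], metrics["recall"]
    return 2 * p * r / (p + r) if (p + r) else 0.0
```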
Contributor:

Is this a prompt-optimization-specific concept?

- **Individual evaluators** measure each output:

```python
def accuracy_evaluator(input_data, output_data, expected_output):
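    # (The excerpt is truncated here; the body below is an illustrative
    # completion assuming Boolean outputs, not the exact code from the page.)
    predicted = output_data["label"]
    expected = expected_output["label"]
    if predicted and expected:
        return "TP"
    if predicted and not expected:
        return "FP"
    if not predicted and expected:
        return "FN"
    return "TN"
```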
Contributor:

isn't it more "confusion matrix"? https://en.wikipedia.org/wiki/Confusion_matrix

- **Labelization functions** categorize results for showing diverse examples to the optimizer:

```python
def labelization_function(individual_result):
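    # (The excerpt is truncated here; an illustrative completion only.)
    # Group each result by its confusion-matrix label so the optimizer is shown
    # a diverse mix of successes and failure modes.
    return individual_result["label"]  # for example "TP", "FP", "TN", or "FN"
```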
Contributor:

Is this also necessary, or could we have a default?


### 3. Define evaluation functions

Create functions that measure how well your task performs. You need four types of functions:
Contributor:

Suggested explanation:
To optimize your prompt, you need multiple layers of metric computation:

  1. At record level to label the results following the confusion matrix labels (TP/FP, TN, FP)
  2. [opt] Aggregate these labels into scores
  3. Compute the score you want to optimize the prompt for (e.g precision, recall...)
  4. Label the missclassification examples to feed to the prompt
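(An illustrative sketch of step 2, aggregating the record-level labels into the scores used in step 3; names are placeholders.)

```python
# Illustrative summary evaluator: aggregates the record-level confusion-matrix
# labels (step 1) into the scores the optimizer targets (step 3).
def summary_evaluator(individual_results):
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for result in individual_results:
        counts[result["label"]] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```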

Actually, is the "3) compute score" step necessary?

plt.show()
```

## Best practices
Contributor:

could we add a cookbook that a user can just run?

https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks
