
Conversation

@tillwf (Contributor) commented Jan 27, 2026

Closes MLOB-5297

@tillwf tillwf requested a review from a team as a code owner January 27, 2026 09:13
@github-actions (bot) commented

Preview links (active after the build_preview check completes)

New or renamed files

@tillwf tillwf force-pushed the till.wohlfarth/MLOB-5297/prompt_optimizer branch from f8d61ab to 22bece4 on January 27, 2026 09:21
@github-actions github-actions bot added the Architecture Everything related to the Doc backend label Jan 27, 2026
@tillwf tillwf force-pushed the till.wohlfarth/MLOB-5297/prompt_optimizer branch from 22bece4 to 528df03 on January 27, 2026 09:50
@cswatt cswatt added the editorial review Waiting on a more in-depth review label Jan 27, 2026
@cswatt cswatt self-assigned this Jan 27, 2026
@github-actions github-actions bot removed the Architecture Everything related to the Doc backend label Jan 27, 2026
@cswatt (Contributor) left a comment

I pushed some changes to make this page private. Also, I did a review pass with a few questions and suggestions.


Prompt Optimization automates the process of improving LLM prompts through systematic evaluation and AI-powered refinement. Instead of manually testing and tweaking prompts, the Prompt Optimizer uses advanced reasoning models to analyze performance data, identify failure patterns, and suggest targeted improvements.

The system runs your LLM application on a dataset with the current prompt, measures performance using your custom metrics, and then uses a reasoning model (such as OpenAI's o3-mini or Claude 3.5 Sonnet) to analyze the results and generate an improved prompt. This cycle repeats until your target metrics are achieved or the maximum number of iterations is reached.
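(Illustrative only: a rough pseudocode sketch of the loop described above, with placeholder names rather than the actual SDK API.)

```python
# Placeholder pseudocode for the optimize-evaluate loop described above;
# none of these names are the real SDK API.
def optimize_prompt(run_task, score, suggest_prompt, dataset, initial_prompt,
                    target_score=0.95, max_iterations=5):
    prompt, best = initial_prompt, None
    for _ in range(max_iterations):
        # 1. Run the LLM application on the dataset with the current prompt
        results = [run_task(record, prompt) for record in dataset]
        # 2. Measure performance with the user-defined metric (for example, precision)
        current = score(results)
        if best is None or current > best[1]:
            best = (prompt, current)
        # 3. Stop once the target metric is reached
        if current >= target_score:
            break
        # 4. Otherwise, ask a reasoning model to analyze failures and propose a new prompt
        prompt = suggest_prompt(prompt, results)
    return best
```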
Contributor:

  • What does "the system" refer to?
  • For such as OpenAI's o3-mini or Claude 3.5 Sonnet — do we expect to be able to update this line whenever new models become available?

Contributor Author:

  • By the system I meant the program or this feature or the Prompt Optimization feature.
    Not sure which one is the best
  • For the reasoning model, maybe we can skip the examples as it will evolve in time indeed. WDYT?

Contributor:

  • Let's go with Prompt Optimization runs your LLM Application...
  • I think we should skip the examples

- Parallel experiment execution for rapid iteration
- Full integration with LLM Observability for tracking and debugging

**Current support:** Prompt Optimization has been validated on Boolean detection tasks and classification use cases, though the architecture supports any output type including structured data extraction, free-form text generation, and numerical predictions.
Contributor:

Suggested change
**Current support:** Prompt Optimization has been validated on Boolean detection tasks and classification use cases, though the architecture supports any output type including structured data extraction, free-form text generation, and numerical predictions.
Prompt Optimization supports Boolean detection tasks and classification use cases. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.

The documentation should not discuss things in terms of being "current"—everything in the documentation is implied to be current. For example, if support for Feature X is "coming soon," then documentation would state that "Feature X is not supported".

Contributor Author:

I will change this for:

Prompt Optimization supports any use case where the expected output is known and there is a defined way to score the model’s predictions. Prompt Optimization's architecture supports any output type, including structured data extraction, free-form text generation, and numerical predictions.

WDYT?

Contributor:

Yes that sounds great!

Comment on lines 34 to 38
The Prompt Optimizer automates the iterative refinement process through three core components:

1. **Evaluation System**: Runs your LLM application on a dataset and measures performance using custom metrics you define
2. **Analysis Engine**: Categorizes results into meaningful groups, presents performance data and examples to an AI reasoning model
3. **Improvement Loop**: AI generates an improved prompt, the system tests it on the full dataset, compares performance, and repeats
Contributor:

Are these components necessary to explain? I think you can build user trust by showing them what's "under the hood", but the concepts here are never referenced again, and I don't think they build a reader's understanding. The bolding of terms and high placement of the page suggest that these concepts are very necessary for a reader to understand, which I don't think is the case. I would recommend removing this segment. The information here could potentially be used in a blog post or other supplemental content.

print(result.summary())
```

## Configuration options
Contributor:

How/where are these options set? Is it through the configuration dictionary mentioned in line 325?

Contributor Author:

It's true that it is not clear. max_iterations and stopping_condition are _prompt_optimization parameters, jobs is a parameter of the run function, and the other configuration options go in the config parameter.

Should I create subsections to make this clearer?
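Roughly, the split looks like this (simplified sketch only; the exact names and signatures may differ from the final SDK):

```python
# Simplified sketch; values are illustrative.

# Parameters of _prompt_optimization:
prompt_optimization_params = {
    "max_iterations": 10,
    "stopping_condition": lambda metrics: metrics["precision"] >= 0.95,
}

# Parameter of run():
jobs = 4

# Everything else goes in the config dict:
config = {
    "model": "gpt-4o-mini",
    "prompt": "You are a support-ticket classifier...",
}
```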

Contributor:

Ohhh I see, these are being set in the example code for Step 5. Because it's in a new section, I didn't see that the two are related. I have an idea for how to format this more clearly — when you're done making other changes, I'll make my own edits for this section.

Contributor Author:

Thanks a lot! I made the modifications!

tillwf and others added 4 commits January 28, 2026 09:47
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
Co-authored-by: cecilia saixue wat-kim <cecilia.watt@datadoghq.com>
@ncybul (Contributor) left a comment

I left some minor suggestions -- I think generally this may be a bit verbose. The best practices section seems like overkill but I think it's fine to leave it in if it would be helpful.

)
```

To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases
Contributor:

Suggested change
To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases
To help ensure robust optimization, Datadog recommends that you include diverse examples that cover typical cases and edge cases.
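For illustration, a dataset with that kind of coverage might look like the following sketch (field names are placeholders, not a required schema):

```python
# Illustrative only: a small dataset mixing typical cases and edge cases.
dataset = [
    # Typical cases
    {"input": {"text": "I can't log into my account"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": "Great product, thanks!"}, "expected_output": {"is_support_request": False}},
    # Edge cases: terse, non-English, and empty inputs
    {"input": {"text": "pw reset??"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": "Mon compte est bloqué"}, "expected_output": {"is_support_request": True}},
    {"input": {"text": ""}, "expected_output": {"is_support_request": False}},
]
```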


### 2. Define your task function

Implement the function that represents your LLM application. This function receives input data and configuration (including the prompt being optimized):
Contributor:

Suggested change
Implement the function that represents your LLM application. This function receives input data and configuration (including the prompt being optimized):
Implement the function that represents your LLM application. This function receives input data and the configuration (including the prompt being optimized):
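(For context, a minimal sketch of such a task function, assuming an OpenAI-style client; the names, config keys, and return shape are illustrative, not the documented API.)

```python
# Illustrative sketch; the client, config keys, and return shape are placeholders.
from openai import OpenAI

client = OpenAI()

def task(input_data, config):
    """Run the LLM application on one record, using the prompt under optimization."""
    response = client.chat.completions.create(
        model=config.get("model", "gpt-4o-mini"),
        messages=[
            {"role": "system", "content": config["prompt"]},  # the prompt being optimized
            {"role": "user", "content": input_data["text"]},
        ],
    )
    answer = response.choices[0].message.content.strip().lower()
    return {"is_support_request": answer.startswith("yes")}
```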

Comment on lines 171 to 172
# Optimize for both precision and accuracy
return metrics['precision']
Contributor:

Suggested change
# Optimize for both precision and accuracy
return metrics['precision']
# Optimize for precision
return metrics['precision']

@cswatt (Contributor) commented Jan 28, 2026

> I left some minor suggestions -- I think generally this may be a bit verbose. The best practices section seems like overkill but I think it's fine to leave it in if it would be helpful.

Thanks for these! Applied all suggested changes in latest commit

}
```

- **Scoring function** can return a single value produced by one of the summary evaluators (for example, precision) or a combination of multiple metrics:
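An illustrative scoring function (the available metric keys depend on your summary evaluators):

```python
# Illustrative only; the metric keys depend on your summary evaluators.
def scoring_function(metrics):
    # Return a single summary metric...
    return metrics["precision"]

def combined_scoring_function(metrics):
    # ...or combine several metrics into one score (here, a simple F1-style blend).
    p, r = metrics["precision"], metrics["recall"]
    return 2 * p * r / (p + r) if (p + r) else 0.0
```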
Contributor:

Is this a prompt-optimization-specific concept?

- **Individual evaluators** measure each output:

```python
def accuracy_evaluator(input_data, output_data, expected_output):
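    # (The excerpt is truncated here; the body below is an illustrative
    # completion assuming Boolean outputs, not the exact code from the page.)
    predicted = output_data["label"]
    expected = expected_output["label"]
    if predicted and expected:
        return "TP"
    if predicted and not expected:
        return "FP"
    if not predicted and expected:
        return "FN"
    return "TN"
```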
Contributor:

isn't it more "confusion matrix"? https://en.wikipedia.org/wiki/Confusion_matrix

- **Labelization functions** categorize results for showing diverse examples to the optimizer:

```python
def labelization_function(individual_result):
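    # (The excerpt is truncated here; an illustrative completion only.)
    # Group each result by its confusion-matrix label so the optimizer is shown
    # a diverse mix of successes and failure modes.
    return individual_result["label"]  # for example "TP", "FP", "TN", or "FN"
```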
Contributor:

Is this also necessary, or could we have a default?


### 3. Define evaluation functions

Create functions that measure how well your task performs. You need four types of functions:
Contributor:

Suggested explanation:
To optimize your prompt, you need multiple layers of metric computation:

  1. At record level to label the results following the confusion matrix labels (TP/FP, TN, FP)
  2. [opt] Aggregate these labels into scores
  3. Compute the score you want to optimize the prompt for (e.g precision, recall...)
  4. Label the missclassification examples to feed to the prompt
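(An illustrative sketch of step 2, aggregating the record-level labels into the scores used in step 3; names are placeholders.)

```python
# Illustrative summary evaluator: aggregates the record-level confusion-matrix
# labels (step 1) into the scores the optimizer targets (step 3).
def summary_evaluator(individual_results):
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for result in individual_results:
        counts[result["label"]] += 1
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```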

Actually, is the "3) compute score" step necessary?

plt.show()
```

## Best practices
Contributor:

could we add a cookbook that a user can just run?

https://github.com/DataDog/llm-observability/tree/main/experiments/notebooks
