Automated benchmark metrics #7

djriffle · 2025-07-08T01:27:41Z

This pull request extends the benchmarking framework in four areas: the agent system configuration, a new abstract metric class with a first concrete metric, benchmark integration in the interactive tester, and a new script for automated testing.

Enhancements to the agent system:

Updated the coder_agent prompt in system_blueprint.json to clarify its specialization in single-cell RNA analysis and to emphasize constraints such as avoiding file modifications and prioritizing incremental responses.
Expanded the description of the delegate_to_coder command to include analyzing single-cell RNA and spatial single-cell data.

Benchmarking framework improvements:

Introduced an abstract base class AutoMetric in AutoMetric.py to standardize metrics applied to AnnData objects, including JSON serialization for results.
Added a new metric implementation, CellCountMetric, to count the number of cells and genes in an AnnData object.
Enhanced MultiAgentTester.py to run benchmarks interactively: users can select benchmark modules and execute them during the testing loop.
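As a rough illustration of the design described above, the sketch below shows one way an abstract metric base class and a cell-count metric could fit together. The method names (`metric`, `run`), the JSON layout, and the `StandInAnnData` class are assumptions made for this example and are not taken from AutoMetric.py; the only AnnData attributes the sketch relies on are `n_obs` and `n_vars`.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict


class AutoMetric(ABC):
    """Hypothetical base class: one metric applied to an AnnData-like object."""

    @abstractmethod
    def metric(self, adata: Any) -> Dict[str, Any]:
        """Compute the metric and return a JSON-serializable dict."""

    def run(self, adata: Any) -> str:
        """Compute the metric and serialize the result to a JSON string."""
        result = {"metric": type(self).__name__, "result": self.metric(adata)}
        return json.dumps(result, sort_keys=True)


class CellCountMetric(AutoMetric):
    """Counts cells (observations) and genes (variables)."""

    def metric(self, adata: Any) -> Dict[str, Any]:
        # n_obs and n_vars are the AnnData attributes for rows and columns.
        return {"n_cells": int(adata.n_obs), "n_genes": int(adata.n_vars)}


@dataclass
class StandInAnnData:
    """Illustrative stand-in so the sketch runs without anndata installed."""

    n_obs: int
    n_vars: int


if __name__ == "__main__":
    print(CellCountMetric().run(StandInAnnData(n_obs=2700, n_vars=32738)))
```

Keeping serialization in the base class means each concrete metric only has to return a plain dict, which is one plausible reason to standardize metrics behind a shared interface.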

Automation and usability:

Created a new MultiAgentAutoTester.py script to automate agent system testing, including sandbox initialization, dataset handling, and benchmark execution. This script supports both interactive and automated workflows.
Added a run_automated.sh shell script to simplify the execution of the automated tester, enabling users to run tests directly from the command line.
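Selecting and executing benchmark modules, as both testers are described as doing, could be sketched with `importlib`. The directory layout, the helper names (`find_metric_modules`, `load_module`, `run_benchmarks`), and the convention that each module exposes a `run(adata)` function are all assumptions for illustration; the actual scripts may work differently.

```python
import importlib.util
from pathlib import Path
from types import ModuleType
from typing import Any, Dict, List


def find_metric_modules(directory: Path) -> List[Path]:
    """List candidate benchmark modules, skipping private files."""
    return sorted(p for p in directory.glob("*.py") if not p.name.startswith("_"))


def load_module(path: Path) -> ModuleType:
    """Import a Python file by path without requiring it to be on sys.path."""
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_benchmarks(directory: Path, adata: Any) -> Dict[str, Any]:
    """Run every module that defines a callable `run` (assumed convention)."""
    results: Dict[str, Any] = {}
    for path in find_metric_modules(directory):
        module = load_module(path)
        runner = getattr(module, "run", None)
        if callable(runner):
            results[path.stem] = runner(adata)
    return results
```

An interactive tester could present the output of `find_metric_modules` as a menu, while an automated tester could call `run_benchmarks` directly once the agent run finishes.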

Code quality and cleanup:

Removed unnecessary debug print statements in io_helpers.py to improve code readability.
Added rich.table imports and utilities to enhance the display of benchmark results in both MultiAgentTester.py and MultiAgentAutoTester.py.

Together, these changes make benchmarks runnable from both the interactive and the automated tester, with single-cell RNA analysis as the primary target workflow.

djriffle added 2 commits July 7, 2025 15:45

Added Benchmarking Support to Multi Agent Testing

012e379

Added Script to Automate Agent Runs

07e3c43

djriffle merged commit f2958b8 into main Jul 8, 2025
1 check passed

djriffle deleted the AutomatedBenchmarkMetrics branch July 8, 2025 15:13

2 participants
