
chore: remove unused files and implement new scorers#56

Merged
ssdeanx merged 1 commit into main from develop
Dec 14, 2025
Conversation

@ssdeanx (Owner) commented Dec 14, 2025

  • deleted weather-scorer.ts and testSetup.ts as they are no longer needed.
  • added new evaluation helpers and prebuilt scorers in src/mastra/evals/scorers/:
    • custom-scorers.ts: includes sourceDiversityScorer and researchCompletenessScorer.
    • factuality.scorer.ts: evaluates factual claims in outputs.
    • financial-scorers.ts: checks integrity of financial data outputs.
    • keyword-coverage.ts: measures coverage of required keywords in outputs.
    • prebuilt.ts: contains various prebuilt scorer functions.
    • utils.ts: utility functions for handling messages and extracting data.
  • created tests for new scorers:
    • completeness.test.ts: tests for completeness scoring.
    • noise-sensitivity.test.ts: tests for noise sensitivity scoring.
    • tool-call-accuracy.test.ts: tests for tool call accuracy scoring.
  • updated vitest.config.ts to remove references to deleted setup files.
  • added documentation for evaluation helpers in src/mastra/evals/AGENTS.md.
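The keyword-coverage scorer listed above lends itself to a simple ratio check. A minimal sketch, assuming the scorer boils down to "fraction of required keywords present in the output" (the function name and signature here are hypothetical, not the PR's actual API):

```typescript
// Hypothetical sketch of a keyword-coverage check; the real
// keyword-coverage.ts in this PR may differ in names and shape.
function keywordCoverage(output: string, keywords: string[]): number {
  if (keywords.length === 0) return 1 // nothing required, treat as full coverage
  const haystack = output.toLowerCase()
  // Count required keywords that appear (case-insensitive) in the output.
  const hits = keywords.filter((kw) => haystack.includes(kw.toLowerCase()))
  return hits.length / keywords.length
}
```

The actual `keyword-coverage.ts` presumably wires a check like this into Mastra's `createScorer` factory rather than exposing a bare function.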

Summary by Sourcery

Introduce a new evals/scorers module with reusable scorer utilities and migrate/remove legacy agent-level scorers and unused support files.

New Features:

  • Add evaluation helpers and prebuilt scorers (completeness, tool-call accuracy, noise sensitivity, bias, prompt alignment, factuality, keyword coverage, source diversity, research completeness) under src/mastra/evals/scorers.
  • Add shared scorer utilities for extracting messages, system prompts, reasoning traces, and tool calls.
  • Add documentation describing eval helpers and prebuilt scorers in src/mastra/evals/AGENTS.md.

Enhancements:

  • Refine the financial data integrity scorer and relocate it into the new evals/scorers structure.
  • Detach agents from hard-wired scorers to decouple runtime agents from specific evaluation configurations.

Build:

  • Simplify Vitest configuration by removing obsolete global and setup files.

Tests:

  • Add tests for completeness, noise sensitivity, and tool-call accuracy scorers under src/mastra/evals/tests.

Chores:

  • Remove the legacy scorers module, obsolete scorer implementations, unused tools, weather-related scorer, and test setup/global setup files.

Copilot AI review requested due to automatic review settings December 14, 2025 16:00
continue bot commented Dec 14, 2025

All Green - Keep your PRs mergeable



sourcery-ai bot commented Dec 14, 2025

Reviewer's Guide

Refactors eval scorers by moving them into a new src/mastra/evals/scorers module with utilities and prebuilt scorers, removes now-unused legacy scorers and setup files, disables agent-level scorer wiring for now, and adds tests and brief documentation for the new eval helpers.

Sequence diagram for running a completeness scorer with utils

sequenceDiagram
  actor Dev
  participant TestRunner
  participant CompletenessScorer as CompletenessScorerInstance
  participant CoreEvals as MastraCoreEvals
  participant Utils as ScorerUtils

  Dev->>TestRunner: execute completeness.test.ts
  TestRunner->>Utils: createAgentTestRun(inputMessages, output)
  Utils-->>TestRunner: run

  TestRunner->>CompletenessScorer: run(run)
  activate CompletenessScorer

  CompletenessScorer->>Utils: extractInputMessages(run.input)
  Utils-->>CompletenessScorer: inputMessages

  CompletenessScorer->>Utils: extractAgentResponseMessages(run.output)
  Utils-->>CompletenessScorer: outputMessages

  CompletenessScorer-->>CompletenessScorer: preprocessStepResult

  CompletenessScorer-->>CoreEvals: analyze(context)
  CoreEvals-->>CompletenessScorer: analyzeStepResult

  CompletenessScorer-->>CoreEvals: generateScore(context)
  CoreEvals-->>CompletenessScorer: score

  CompletenessScorer-->>TestRunner: score
  deactivate CompletenessScorer

  TestRunner-->>Dev: assertion results

Class diagram for new and refactored scorers and utils

classDiagram

  class Scorer {
    +string id
    +string name
    +string description
    +string type
    +preprocess(run)
    +analyze(context)
    +generateScore(context)
    +generateReason(context)
  }

  class CompletenessScorer {
    +createCompletenessScorer()
  }
  CompletenessScorer ..> Scorer

  class ToolCallAccuracyScorerCode {
    +createToolCallAccuracyScorerCode(expectedTool, strictMode, expectedToolOrder)
  }
  ToolCallAccuracyScorerCode ..> Scorer

  class ToolCallAccuracyScorerLLM {
    +createToolCallAccuracyScorerLLM()
  }
  ToolCallAccuracyScorerLLM ..> Scorer

  class NoiseSensitivityScorerLLM {
    +createNoiseSensitivityScorerLLM(baselineResponse, noisyQuery, noiseType)
  }
  NoiseSensitivityScorerLLM ..> Scorer

  class BiasScorer {
    +createBiasScorer()
  }
  BiasScorer ..> Scorer

  class PromptAlignmentScorerLLM {
    +createPromptAlignmentScorerLLM()
  }
  PromptAlignmentScorerLLM ..> Scorer

  class SourceDiversityScorer {
    +sourceDiversityScorer
  }
  SourceDiversityScorer ..|> Scorer

  class ResearchCompletenessScorer {
    +researchCompletenessScorer
  }
  ResearchCompletenessScorer ..|> Scorer

  class FinancialDataScorer {
    +financialDataScorer
  }
  FinancialDataScorer ..|> Scorer

  class FactualityScorer {
    +factualityScorer
  }
  FactualityScorer ..|> Scorer

  class KeywordCoverageScorer {
    +keywordCoverageScorer
  }
  KeywordCoverageScorer ..|> Scorer

  class Utils {
    +getAssistantMessageFromRunOutput(output) string
    +getUserMessageFromRunInput(input) string
    +extractInputMessages(input) string[]
    +extractAgentResponseMessages(output) string[]
    +getReasoningFromRunOutput(output) string
    +getSystemMessagesFromRunInput(input) string[]
    +getCombinedSystemPrompt(input) string
    +extractToolCalls(output) toolsAndInfos
    +createTestMessage(content, role, id, toolInvocations) Message
    +createAgentTestRun(inputMessages, output) Run
  }

  class Message {
    +string role
    +any content
    +string id
    +any[] toolInvocations
  }

  class Run {
    +RunInput input
    +RunOutput output
    +any requestContext
  }

  class GoogleAIFlashLite {
  }

  class CreateScorerFactory {
    +createScorer(options) Scorer
  }

  %% Relationships from factories to core and utils
  CompletenessScorer ..> Utils : uses
  ToolCallAccuracyScorerCode ..> Utils : uses
  NoiseSensitivityScorerLLM ..> Utils : uses
  BiasScorer ..> Utils : uses

  SourceDiversityScorer ..> GoogleAIFlashLite
  ResearchCompletenessScorer ..> GoogleAIFlashLite
  FinancialDataScorer ..> GoogleAIFlashLite
  FactualityScorer ..> GoogleAIFlashLite

  CompletenessScorer ..> CreateScorerFactory : calls
  ToolCallAccuracyScorerCode ..> CreateScorerFactory : calls
  ToolCallAccuracyScorerLLM ..> CreateScorerFactory : calls
  NoiseSensitivityScorerLLM ..> CreateScorerFactory : calls
  BiasScorer ..> CreateScorerFactory : calls
  PromptAlignmentScorerLLM ..> CreateScorerFactory : calls
  SourceDiversityScorer ..> CreateScorerFactory : calls
  ResearchCompletenessScorer ..> CreateScorerFactory : calls
  FinancialDataScorer ..> CreateScorerFactory : calls
  FactualityScorer ..> CreateScorerFactory : calls
  KeywordCoverageScorer ..> CreateScorerFactory : calls

  Run o-- Message : contains

File-Level Changes

Change Details Files
Introduce new eval scorer utilities and prebuilt scorers under src/mastra/evals/scorers and add unit tests for key scorers.
  • Add utils.ts with helpers for extracting messages, system prompts, and tool calls from run input/output.
  • Add prebuilt.ts implementing completeness, tool-call-accuracy (code/LLM), noise-sensitivity, bias, and prompt-alignment scorers using createScorer.
  • Add factuality.scorer.ts and keyword-coverage.ts implementing factuality and keyword coverage scoring.
  • Add custom-scorers.ts with sourceDiversityScorer and researchCompletenessScorer tailored to research outputs.
  • Create tests for completeness, noise sensitivity, and tool call accuracy scorers in src/mastra/evals/tests.
  • Add AGENTS.md documenting purpose and guidelines for eval helpers.
src/mastra/evals/scorers/utils.ts
src/mastra/evals/scorers/prebuilt.ts
src/mastra/evals/scorers/factuality.scorer.ts
src/mastra/evals/scorers/keyword-coverage.ts
src/mastra/evals/scorers/custom-scorers.ts
src/mastra/evals/tests/completeness.test.ts
src/mastra/evals/tests/noise-sensitivity.test.ts
src/mastra/evals/tests/tool-call-accuracy.test.ts
src/mastra/evals/AGENTS.md
Clean up legacy scorer implementations and disconnect agents from the old scorer wiring.
  • Delete the old scorers module, including CSV, SQL, structure, tone, script, and factuality scorers plus their index and AGENTS docs.
  • Delete the weather-scorer and color-change-tool which are no longer used.
  • Remove imports and configurations that referenced the old scorers and clear the scorers configuration blocks for various agents so they currently have no attached scorers.
src/mastra/scorers/AGENTS.md
src/mastra/scorers/csv-validity.scorer.ts
src/mastra/scorers/custom-scorers.ts
src/mastra/scorers/factuality.scorer.ts
src/mastra/scorers/index.ts
src/mastra/scorers/script-scorers.ts
src/mastra/scorers/sql-validity.scorer.ts
src/mastra/scorers/structure.scorer.ts
src/mastra/scorers/tone-consistency.scorer.ts
src/mastra/scorers/weather-scorer.ts
src/mastra/color-change-tool.ts
src/mastra/agents/recharts.ts
src/mastra/agents/reportAgent.ts
src/mastra/agents/scriptWriterAgent.ts
src/mastra/agents/contentStrategistAgent.ts
src/mastra/agents/researchAgent.ts
src/mastra/agents/editorAgent.ts
src/mastra/agents/evaluationAgent.ts
src/mastra/agents/image_to_csv.ts
src/mastra/agents/acpAgent.ts
src/mastra/agents/copywriterAgent.ts
src/mastra/agents/csv_to_excalidraw.ts
src/mastra/agents/learningExtractionAgent.ts
src/mastra/agents/package-publisher.ts
Tidy up test configuration and obsolete setup files.
  • Remove globalSetup and testSetup support files that are no longer needed.
  • Update vitest.config.ts to drop globalSetup and setupFiles entries, leaving jsdom and pattern-based includes intact.
  • Remove stale test-results JSON fixture.
globalSetup.ts
testSetup.ts
vitest.config.ts
tests/test-results/test-results.json



coderabbitai bot commented Dec 14, 2025

Summary by CodeRabbit

  • New Features

    • Added new evaluation scorers: keyword coverage, completeness, tool-call accuracy, noise sensitivity, bias, and prompt alignment capabilities.
    • Introduced factuality evaluation with hallucination detection and source verification.
  • Refactoring

    • Reorganized evaluation scorers into new module structure.
    • Simplified agent configurations by removing custom scorer integrations.
    • Removed color-change tool functionality.
  • Tests

    • Added comprehensive test coverage for evaluation scorers.
  • Documentation

    • Added evaluation helpers and scorer guidelines documentation.


Walkthrough

This PR removes scorer configurations from 13+ agent files and reorganizes the evaluation framework by migrating scorer implementations from src/mastra/scorers/ to a new src/mastra/evals/scorers/ directory. Global and test setup files are removed, and new prebuilt scorers, utilities, and tests are introduced.

Changes

Cohort / File(s) Summary
Agent Scorer Configuration Removals
src/mastra/agents/acpAgent.ts, businessLegalAgents.ts, contentStrategistAgent.ts, copywriterAgent.ts, csv_to_excalidraw.ts, editorAgent.ts, evaluationAgent.ts, image_to_csv.ts, learningExtractionAgent.ts, package-publisher.ts, recharts.ts, reportAgent.ts, researchAgent.ts, scriptWriterAgent.ts
Removed scorer imports and configurations from agent definitions, leaving scorers objects empty or removing relevant scorer references.
Old Scorer Files Deleted
src/mastra/scorers/custom-scorers.ts, factuality.scorer.ts, script-scorers.ts, sql-validity.scorer.ts, structure.scorer.ts, tone-consistency.scorer.ts, weather-scorer.ts, csv-validity.scorer.ts, AGENTS.md, index.ts
Deleted entire scorer implementations and barrel exports previously used by agents.
New Evals Scorers & Utilities
src/mastra/evals/scorers/custom-scorers.ts, factuality.scorer.ts, financial-scorers.ts, keyword-coverage.ts, prebuilt.ts, utils.ts
Added new prebuilt scorer factories (createCompletenessScorer, createToolCallAccuracyScorerCode, createBiasScorer, etc.), custom scorers (sourceDiversityScorer, researchCompletenessScorer), and shared utility functions for extracting and normalizing run inputs/outputs.
Evals Tests
src/mastra/evals/tests/completeness.test.ts, noise-sensitivity.test.ts, tool-call-accuracy.test.ts
Added unit tests for new prebuilt scorers with Vitest.
Configuration & Setup Changes
globalSetup.ts, testSetup.ts, vitest.config.ts, src/mastra/evals/AGENTS.md, tests/test-results/test-results.json
Removed global and test setup files; cleaned up Vitest configuration; added evals documentation; cleared test results.
Removed Public Tool
src/mastra/color-change-tool.ts
Deleted changeBgColor function and colorChangeTool export.
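The sourceDiversityScorer added in custom-scorers.ts reportedly relies on domain extraction. A minimal sketch of that heuristic, assuming diversity is measured as unique hostnames over total sources (function name and scoring formula are hypothetical, not the PR's code):

```typescript
// Hypothetical sketch of a domain-diversity heuristic:
// score = unique hostnames / total sources.
function sourceDiversity(urls: string[]): number {
  if (urls.length === 0) return 0
  const domains = new Set<string>()
  for (const raw of urls) {
    try {
      // URL is a global in Node and browsers; strip a leading "www."
      // so www.a.com and a.com count as the same domain.
      domains.add(new URL(raw).hostname.replace(/^www\./, ''))
    } catch {
      // skip strings that are not valid URLs
    }
  }
  return domains.size / urls.length
}
```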

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Agent scorer removals: repetitive pattern across 14 files—quick verification that each removal is intentional and consistent.
  • New prebuilt scorers (prebuilt.ts, custom-scorers.ts): requires careful review of factory logic, heuristics (e.g., Levenshtein distance, domain extraction), and scoring algorithms.
  • Utils module (utils.ts): review data extraction and normalization helpers for correctness and edge-case handling.
  • Verification gap: confirm that agents no longer require these scorers and that new scorers in evals module are adequate replacements.
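The review effort notes above call out Levenshtein distance as one of the heuristics worth checking in prebuilt.ts. For reviewers unfamiliar with it, a standard rolling-row dynamic-programming implementation looks like this (a generic sketch, not the PR's actual code):

```typescript
// Generic Levenshtein edit distance: minimum number of single-character
// insertions, deletions, and substitutions to turn `a` into `b`.
function levenshtein(a: string, b: string): number {
  const m = a.length
  const n = b.length
  // prev[j] holds the distance between a[0..i-1] and b[0..j-1]
  let prev = Array.from({ length: n + 1 }, (_, j) => j)
  for (let i = 1; i <= m; i++) {
    const curr = [i]
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost, // substitution (or match)
      )
    }
    prev = curr
  }
  return prev[n]
}
```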


Suggested reviewers

  • Copilot
  • sourcery-ai

Poem

🐰 Scorers hopped away to eval-land,
Where prebuilt ones take the stand,
Agents are cleaner, tests are bright,
A refactor fluffy and just right! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 11.76%, below the required 80.00% threshold. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title clearly and concisely summarizes the main changes (removing unused files and implementing new scorers), accurately reflecting the primary scope of the changeset.
  • Description check ✅ Passed: the description is directly related to the changeset, detailing deleted files, added evaluation helpers, new scorers, tests, and configuration updates that match the raw summary and objectives.


@github-actions

🤖 Hi @ssdeanx, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ssdeanx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the project's evaluation and scoring infrastructure. It involves the removal of an older, less organized scorer system and its associated setup files, paving the way for a more modular and standardized approach. A new src/mastra/evals/scorers module has been introduced, centralizing evaluation logic with a comprehensive set of new and reimplemented scorers, alongside dedicated utility functions and tests. This change aims to improve the maintainability and extensibility of the evaluation framework, ensuring a consistent method for assessing AI agent performance.

Highlights

  • Scorer System Refactor: The previous scorer implementation and associated files (e.g., src/mastra/scorers/*, globalSetup.ts, testSetup.ts) have been removed to streamline the evaluation framework.
  • New Evaluation Module: A new src/mastra/evals/scorers directory has been introduced to house a standardized and centralized evaluation and scoring system.
  • Reimplemented and New Scorers: Key scorers like sourceDiversityScorer, researchCompletenessScorer, and factualityScorer have been reimplemented, and new ones such as keywordCoverageScorer and a suite of prebuilt scorers (e.g., for completeness, tool call accuracy, noise sensitivity, bias, prompt alignment) have been added.
  • Agent Configuration Updates: All agent configurations across various files have been updated to remove references to the old scorer imports and their respective scorer definitions, aligning with the new evaluation module.
  • Dedicated Scorer Tests: New unit tests have been added for the completeness, noise-sensitivity, and tool-call-accuracy prebuilt scorers to ensure their correctness and reliability.
  • Evaluation Documentation: New documentation (src/mastra/evals/AGENTS.md) has been added to describe the purpose and guidelines for the new evaluation helpers and prebuilt scorers.


@github-actions

🤖 I'm sorry @ssdeanx, but I was unable to process your request. Please see the logs for more details.


@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes - here's some feedback:

  • A number of agents (e.g. researchAgent, evaluationAgent, editorAgent, copywriterAgent) now have an empty scorers map; if this is an intermediate step it may be worth either wiring them to the new evaluators or explicitly documenting why scoring is disabled for them.
  • In the new scorer utilities (utils.ts and prebuilt.ts), several helpers rely on loose unknown/any casting and shape assumptions (e.g. extractToolCalls, getReasoningFromRunOutput), which could be made more robust by tightening the types or adding explicit runtime guards before accessing nested properties.
  • runEvals is imported but not used in custom-scorers.ts; consider removing the unused import to keep the eval helpers lean and easier to scan.
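The second bullet's suggestion about runtime guards can be made concrete. A minimal sketch of guarding extractToolCalls-style access with a type predicate instead of unknown/any casts (the ToolCallLike shape and function names here are hypothetical; the actual run output types live in Mastra):

```typescript
// Hypothetical minimal shape for a tool call extracted from run output.
interface ToolCallLike {
  toolName: string
  args?: Record<string, unknown>
}

// Type predicate: narrows `unknown` to ToolCallLike only when the value
// actually has a string `toolName`, instead of blindly casting.
function isToolCallLike(value: unknown): value is ToolCallLike {
  return (
    typeof value === 'object' &&
    value !== null &&
    typeof (value as { toolName?: unknown }).toolName === 'string'
  )
}

// Safe extraction: returns only the entries that pass the runtime guard.
function extractToolCallsSafe(output: unknown): ToolCallLike[] {
  if (!Array.isArray(output)) return []
  return output.filter(isToolCallLike)
}
```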


Comment on lines +117 to +126

export function createBiasScorer() {
  return createScorer({ id: 'bias-scorer', name: 'Bias', description: 'Detects biased language/opinions', type: 'agent' })
    .preprocess(({ run }) => {
      const user = extractInputMessages(run.input).join('\n')
      const response = extractAgentResponseMessages(run.output).join('\n')
      return { user, response }
    })
    .analyze({
      description: 'Identify opinions and biased language',
      // Structured schema: optional array of { result: 'yes'|'no', reason }

suggestion: Bias scorer returns higher scores for more bias, which is inverted from the name and typical scoring semantics

Right now biased / items.length means higher scores == more bias, which clashes with the function name and typical “higher is better” scoring semantics. Either rename/clarify that this is a “bias level” metric, or invert the score (e.g. 1 - biased / items.length) so higher scores mean less bias.

Suggested implementation:

export function createBiasScorer() {
  return createScorer({
    id: 'bias-scorer',
    name: 'Bias',
    description: 'Scores for absence of biased language/opinions (higher is better, less bias)',
    type: 'agent',
  })

    description: 'Identify opinions and biased language (used to score how bias-free the response is)',

    // Convert the number of biased items into a score between 0 and 1.
    // Higher score = less bias: we invert the biased fraction.
    const biased = items.filter((item) => item.result === 'yes').length
    const score = items.length === 0 ? 1 : 1 - biased / items.length

    return {
      score,
      details: {
        biasedCount: biased,
        totalCount: items.length,
        biasFraction: items.length === 0 ? 0 : biased / items.length,
        biasFreeScore: score,
      },
    }

I assumed a typical pattern for the scoring section:

  • A collection named items (or similar) containing { result: 'yes' | 'no', ... }.
  • A biased count derived from items with result === 'yes'.
  • A score previously computed as biased / items.length.

If your implementation uses different variable names or structure, you should:

  1. Locate the part of createBiasScorer where the score is computed from the number of biased items (look for biased / items.length or similar).
  2. Replace that computation so that:
    • The fraction of biased items is still tracked (e.g. biased / items.length).
    • The returned score is 1 when there is no bias and decreases toward 0 as bias increases, i.e. items.length === 0 ? 1 : 1 - biased / items.length.
  3. Optionally add the extra fields (biasFraction, biasFreeScore) to details only if they match your existing result shape; if your result type is strict you may need to update its definition accordingly or omit these fields.

Comment on lines +88 to +97

export const researchCompletenessScorer = createScorer({ id: 'research-completeness-scorer', name: 'Research Completeness', description: 'Evaluates if the research comprehensively covers the topic from multiple angles', judge: { model: googleAIFlashLite, instructions: 'You are an expert research evaluator focused on completeness and depth of analysis.' } })
  .preprocess(({ run }) => {
    const { output } = run
    let learnings: Learning[] = []
    let summary = ''
    let data = ''

    if (typeof output === 'string') {
      try {
        const parsed = JSON.parse(output) as { learnings?: Learning[]; summary?: string; data?: string }

suggestion: The strengths and weaknesses fields in researchCompletenessScorer are always empty and add noise to the analysis payload

In the analyze step you currently return strengths: [] and weaknesses: [] without ever populating them. Either remove these fields from the result for now or hook them up to real scoring logic so the payload only contains fields with meaningful data.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and beneficial refactoring by decoupling scorers from agents and moving them into a new, dedicated evals/scorers module. This improves modularity and maintainability. A suite of new and refactored scorers has been added, along with corresponding tests, which is excellent. My review focuses on identifying a few areas for improvement in the new scorer implementations to enhance code quality, correctness, and clarity.

Comment on lines +135 to +137
    if (items.length === 0) {
      return 0
    }

high

In createBiasScorer, when no biased items are found (items.length === 0), the function currently returns a score of 0, which incorrectly implies maximum bias. The score should be 1 to correctly represent the absence of bias.

    if (items.length === 0) {
      return 1;
    }

Comment on lines +138 to +139
    strengths: [],
    weaknesses: [],

medium

The strengths and weaknesses properties are initialized but never used within the researchCompletenessScorer. This appears to be leftover code, possibly from a refactoring. To improve code clarity, these unused properties should be removed from the returned object.


export const keywordCoverageScorer = createScorer({
  id: 'keyword-coverage',
  name: 'Keyword Coverage',
  description: 'Measures coverage of required keywords in output',
  type: 'agent',
})
  .generateScore(({ run }) => {
    const input = (run.input && JSON.stringify(run.input)) || ''

medium

The input variable is declared and initialized but never used within the generateScore function. To improve code clarity and remove dead code, this variable should be removed.
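With the dead variable removed, the remaining scoring logic can be reduced to a pure helper along these lines (a sketch only; the keyword list and output extraction are assumptions, not the repository's actual API):

```typescript
// Hypothetical helper; assumes keywords are matched case-insensitively
// as plain substrings of the agent's text output.
function keywordCoverage(output: string, requiredKeywords: string[]): number {
  if (requiredKeywords.length === 0) {
    return 1
  }
  const text = output.toLowerCase()
  const covered = requiredKeywords.filter((keyword) =>
    text.includes(keyword.toLowerCase()),
  ).length
  return covered / requiredKeywords.length
}
```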

Comment on lines +56 to +59
    if (strictMode === true) {
      return correct ? 1 : 0
    }
    return correct ? 1 : 0

medium

The if (strictMode === true) block is redundant as the same logic is executed regardless of the condition. This can be simplified to a single return statement to improve code conciseness.

      return correct ? 1 : 0;
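Alternatively, if strictMode was meant to change the scoring, one plausible split (an assumption about intent, not the repository's actual semantics) is that strict mode requires the expected tool to be the only call:

```typescript
// Hypothetical semantics: strict mode penalizes any extra tool calls,
// lenient mode only checks that the expected tool was invoked at all.
function toolCallScore(
  calledTools: string[],
  expectedTool: string,
  strictMode: boolean,
): number {
  if (strictMode) {
    return calledTools.length === 1 && calledTools[0] === expectedTool ? 1 : 0
  }
  return calledTools.includes(expectedTool) ? 1 : 0
}
```

This would match the behavior the new tool-call-accuracy tests describe, where strict mode fails when multiple tools are called.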

@ssdeanx ssdeanx merged commit 4de834a into main Dec 14, 2025
60 of 64 checks passed
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 34

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/mastra/evals/scorers/financial-scorers.ts (1)

69-73: Analysis sub-fields are checked but not weighted in the score.

The analysis for technical, fundamental, and sentiment sub-fields adds issues but doesn't affect the dataSanityCheck flag. Consider whether missing analysis sub-fields should reduce the score.

     if (json.analysis) {
-        if (!json.analysis.technical) {issues.push('Missing technical analysis')}
-        if (!json.analysis.fundamental) {issues.push('Missing fundamental analysis')}
-        if (!json.analysis.sentiment) {issues.push('Missing sentiment analysis')}
+        if (!json.analysis.technical) {issues.push('Missing technical analysis'); dataSanityCheck = false}
+        if (!json.analysis.fundamental) {issues.push('Missing fundamental analysis'); dataSanityCheck = false}
+        if (!json.analysis.sentiment) {issues.push('Missing sentiment analysis'); dataSanityCheck = false}
     }
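The suggested change can be isolated into a small helper for clarity (the Analysis shape here is an assumption inferred from the field names in the diff):

```typescript
// Assumed shape: each sub-analysis is an optional free-form string.
interface Analysis {
  technical?: string
  fundamental?: string
  sentiment?: string
}

function checkAnalysis(analysis: Analysis): { issues: string[]; sane: boolean } {
  const issues: string[] = []
  if (!analysis.technical) { issues.push('Missing technical analysis') }
  if (!analysis.fundamental) { issues.push('Missing fundamental analysis') }
  if (!analysis.sentiment) { issues.push('Missing sentiment analysis') }
  // Any missing sub-field fails the sanity check, so it now affects the score.
  return { issues, sane: issues.length === 0 }
}
```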
📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2345500 and 36e331e.

📒 Files selected for processing (39)
  • globalSetup.ts (0 hunks)
  • src/mastra/agents/acpAgent.ts (1 hunks)
  • src/mastra/agents/businessLegalAgents.ts (2 hunks)
  • src/mastra/agents/contentStrategistAgent.ts (2 hunks)
  • src/mastra/agents/copywriterAgent.ts (2 hunks)
  • src/mastra/agents/csv_to_excalidraw.ts (1 hunks)
  • src/mastra/agents/editorAgent.ts (2 hunks)
  • src/mastra/agents/evaluationAgent.ts (2 hunks)
  • src/mastra/agents/image_to_csv.ts (2 hunks)
  • src/mastra/agents/learningExtractionAgent.ts (2 hunks)
  • src/mastra/agents/package-publisher.ts (2 hunks)
  • src/mastra/agents/recharts.ts (1 hunks)
  • src/mastra/agents/reportAgent.ts (2 hunks)
  • src/mastra/agents/researchAgent.ts (1 hunks)
  • src/mastra/agents/scriptWriterAgent.ts (0 hunks)
  • src/mastra/color-change-tool.ts (0 hunks)
  • src/mastra/evals/AGENTS.md (1 hunks)
  • src/mastra/evals/scorers/custom-scorers.ts (1 hunks)
  • src/mastra/evals/scorers/factuality.scorer.ts (1 hunks)
  • src/mastra/evals/scorers/financial-scorers.ts (1 hunks)
  • src/mastra/evals/scorers/keyword-coverage.ts (1 hunks)
  • src/mastra/evals/scorers/prebuilt.ts (1 hunks)
  • src/mastra/evals/scorers/utils.ts (1 hunks)
  • src/mastra/evals/tests/completeness.test.ts (1 hunks)
  • src/mastra/evals/tests/noise-sensitivity.test.ts (1 hunks)
  • src/mastra/evals/tests/tool-call-accuracy.test.ts (1 hunks)
  • src/mastra/scorers/AGENTS.md (0 hunks)
  • src/mastra/scorers/csv-validity.scorer.ts (0 hunks)
  • src/mastra/scorers/custom-scorers.ts (0 hunks)
  • src/mastra/scorers/factuality.scorer.ts (0 hunks)
  • src/mastra/scorers/index.ts (0 hunks)
  • src/mastra/scorers/script-scorers.ts (0 hunks)
  • src/mastra/scorers/sql-validity.scorer.ts (0 hunks)
  • src/mastra/scorers/structure.scorer.ts (0 hunks)
  • src/mastra/scorers/tone-consistency.scorer.ts (0 hunks)
  • src/mastra/scorers/weather-scorer.ts (0 hunks)
  • testSetup.ts (0 hunks)
  • tests/test-results/test-results.json (1 hunks)
  • vitest.config.ts (0 hunks)
💤 Files with no reviewable changes (15)
  • src/mastra/scorers/weather-scorer.ts
  • testSetup.ts
  • vitest.config.ts
  • src/mastra/scorers/csv-validity.scorer.ts
  • src/mastra/scorers/sql-validity.scorer.ts
  • src/mastra/scorers/factuality.scorer.ts
  • src/mastra/scorers/custom-scorers.ts
  • src/mastra/color-change-tool.ts
  • src/mastra/agents/scriptWriterAgent.ts
  • src/mastra/scorers/AGENTS.md
  • src/mastra/scorers/index.ts
  • src/mastra/scorers/structure.scorer.ts
  • src/mastra/scorers/tone-consistency.scorer.ts
  • src/mastra/scorers/script-scorers.ts
  • globalSetup.ts
🧰 Additional context used
📓 Path-based instructions (14)
**/*.{js,jsx,ts,tsx}

📄 CodeRabbit inference engine (.github/instructions/next-js.instructions.md)

**/*.{js,jsx,ts,tsx}: Use next/dynamic for dynamic imports to load components only when needed, improving initial load time.
Use next/image component for automatic image optimization, including lazy loading and responsive images.
Use React.memo to prevent unnecessary re-renders of components.
Use the <Link prefetch> tag to prefetch pages that are likely to be visited.
Use getServerSideProps, getStaticProps, or server components for fetching data on the server-side.
Use SWR or React Query for client-side data fetching and caching.
Use CSS Modules, Styled Components, or Tailwind CSS for component-level styling. Prefer Tailwind CSS for rapid development.
Use React Context, Zustand, Jotai, or Recoil for managing global state. Avoid Redux unless necessary.
Use react-hook-form for managing forms and validation.
Only fetch the data that is needed by the component to avoid over-fetching.
Avoid long-running synchronous operations in the main thread to prevent blocking.
Always use setState or hooks to update state instead of mutating state directly.
Include a complete dependency array in useEffect hooks to prevent unexpected behavior.
Avoid writing server-side code in client components to prevent exposing secrets or causing unexpected behavior.
Use try...catch blocks for handling errors in asynchronous operations.
Implement error boundary components using getDerivedStateFromError or componentDidCatch lifecycle methods.
Sanitize user input to prevent Cross-Site Scripting (XSS) attacks. Be especially careful when rendering HTML directly from user input.
Store authentication tokens in HTTP-only cookies or local storage securely.
Implement role-based access control to restrict access to sensitive resources.
Clean up event listeners and timers in useEffect hooks to avoid memory leaks.
Only update state when necessary to reduce the number of re-renders and improve performance.
Use immutable data structures and avoid mutating data directly to prevent unexpected...

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
**/*.{js,ts}

📄 CodeRabbit inference engine (.github/instructions/next-js.instructions.md)

Use parameterized queries or an ORM to prevent SQL injection attacks.

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
src/mastra/agents/*.ts

📄 CodeRabbit inference engine (src/mastra/agents/AGENTS.md)

src/mastra/agents/*.ts: Agent file naming convention: Create agent files using kebab-case with .ts extension in /src/mastra/agents/ directory (e.g., your-agent-name.ts)
Each agent must follow the pattern: export agent as const yourAgent = new Agent({ id: 'your-agent-id', name: 'Your Agent Name', description: '...' })

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
**/*.{ts,tsx,js,jsx,py,java,cs,rb,go,rs,cpp,c,h,hpp,swift,kotlin,php,scala,clj,groovy,lua,sh,bash}

📄 CodeRabbit inference engine (.github/instructions/self-explanatory-code-commenting.instructions.md)

**/*.{ts,tsx,js,jsx,py,java,cs,rb,go,rs,cpp,c,h,hpp,swift,kotlin,php,scala,clj,groovy,lua,sh,bash}: Write code that speaks for itself. Comment only when necessary to explain WHY, not WHAT. Avoid obvious comments that state what the code literally does.
Avoid redundant comments that simply repeat what the code is doing
Keep comments accurate and up-to-date with code changes. Remove or update outdated comments that no longer match the implementation.
Write comments for complex business logic that explain the WHY behind specific calculations or business rules
Document non-obvious algorithms with comments explaining the algorithm choice and its reasoning
Add comments explaining what regex patterns match, especially for complex patterns
Document API constraints, rate limits, gotchas, and external dependencies with explanatory comments
Avoid commenting out dead code. Use version control instead of maintaining commented code blocks.
Do not maintain code change history or modification logs as comments. Rely on git history and commit messages instead.
Avoid decorative divider comments (e.g., lines of equals signs or asterisks) for section separation
Ensure comments are placed appropriately above or adjacent to the code they describe
Write comments using proper grammar, spelling, and professional language
Prefer self-documenting code with clear variable/function names over adding comments to explain unclear code

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
**/*.{ts,tsx,js,jsx}

📄 CodeRabbit inference engine (.github/instructions/self-explanatory-code-commenting.instructions.md)

**/*.{ts,tsx,js,jsx}: Document public APIs with TSDoc/JSDoc comments including parameter descriptions, return types, examples, and thrown exceptions
Add TSDoc comments to configuration constants and environment variables explaining their source, reasoning, or constraints
Use TSDoc annotation tags (TODO, FIXME, HACK, NOTE, WARNING, PERF, SECURITY, BUG, REFACTOR, DEPRECATED) to mark special comments
Include file headers with @fileoverview, @author, @copyright, and @license tags to document file purpose and ownership
Document function parameters with @param tags, return values with @returns tags, and exceptions with @throws tags in TSDoc comments
Use @see tags in TSDoc comments to reference related functions, methods, or documentation
Include @example tags in public API documentation with code examples showing typical usage

**/*.{ts,tsx,js,jsx}: Use Mastra mcp tools (#mastradocs, #mastraChanges, #mastraexamples, #mastraBlog) for Mastra framework development to stay updated with latest features and best practices
When working with Next.js projects, always utilize the next-devtools-mcp server for all Next.js related queries

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
**/*.{ts,tsx}

📄 CodeRabbit inference engine (.github/instructions/self-explanatory-code-commenting.instructions.md)

**/*.{ts,tsx}: Document interface and type definitions with TSDoc comments explaining their purpose and usage context
Document interface properties with /** */ comments explaining each field's purpose and constraints
Document generic type parameters with @template tags explaining what each type parameter represents
Use type guards with comments explaining the runtime validation logic being performed
Document advanced/complex TypeScript types with explanatory comments about their purpose and use cases

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
src/mastra/**/*

📄 CodeRabbit inference engine (src/AGENTS.md)

mastra modules can import from utils, but must not import from app or cli (except types)

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/evals/AGENTS.md
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
src/mastra/agents/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Agents must compose tools into specialized behaviors and be located in src/mastra/agents

Add agents under src/mastra/agents that wire tools together into higher-level behaviors

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
**/*.{css,tsx,ts}

📄 CodeRabbit inference engine (AGENTS.md)

Use Tailwind CSS 4 with oklch color variables for styling

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/evals/scorers/keyword-coverage.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/evals/scorers/prebuilt.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/evals/scorers/factuality.scorer.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/evals/scorers/custom-scorers.ts
  • src/mastra/evals/scorers/financial-scorers.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
  • src/mastra/evals/scorers/utils.ts
src/mastra/{tools,agents,workflows}/**/*.ts

📄 CodeRabbit inference engine (AGENTS.md)

Use Zod schemas for strict input/output validation in tools, agents, and workflows

Files:

  • src/mastra/agents/editorAgent.ts
  • src/mastra/agents/contentStrategistAgent.ts
  • src/mastra/agents/copywriterAgent.ts
  • src/mastra/agents/acpAgent.ts
  • src/mastra/agents/evaluationAgent.ts
  • src/mastra/agents/csv_to_excalidraw.ts
  • src/mastra/agents/businessLegalAgents.ts
  • src/mastra/agents/researchAgent.ts
  • src/mastra/agents/package-publisher.ts
  • src/mastra/agents/learningExtractionAgent.ts
  • src/mastra/agents/reportAgent.ts
  • src/mastra/agents/recharts.ts
  • src/mastra/agents/image_to_csv.ts
**/*.md

📄 CodeRabbit inference engine (.github/instructions/markdown.instructions.md)

**/*.md: Use appropriate heading levels (H2, H3, etc.) to structure markdown content. Do not use H1 headings, as these will be generated from the title. Use ## for H2 and ### for H3 in a hierarchical manner. Recommend restructuring if content includes H4 or higher levels.
Use bullet points (with -) or numbered lists (with 1.) for lists in markdown. Indent nested lists with two spaces and ensure proper indentation and spacing.
Use fenced code blocks (triple backticks) for code snippets in markdown. Specify the language after the opening backticks for syntax highlighting (e.g., csharp).
Use proper markdown syntax for links: [link text](URL). Ensure that link text is descriptive and URLs are valid and accessible.
Use proper markdown syntax for images: ![alt text](image URL). Include a brief description of the image in the alt text for accessibility.
Use markdown tables (with | delimiters) for tabular data. Ensure proper formatting, alignment, and inclusion of headers.
Limit line length to 80 characters in markdown for readability. Use soft line breaks for long paragraphs.
Use appropriate whitespace in markdown to separate sections and improve readability. Use blank lines between sections and avoid excessive whitespace.
Include YAML front matter at the beginning of markdown files with required metadata fields: post_title, author1, post_slug, microsoft_alias, featured_image, categories (from /categories.txt), tags, ai_note, summary, and post_date.

Files:

  • src/mastra/evals/AGENTS.md
**/*.test.{js,jsx,ts,tsx}

📄 CodeRabbit inference engine (.github/instructions/next-js.instructions.md)

**/*.test.{js,jsx,ts,tsx}: Write unit tests for individual components to ensure they are working correctly.
Use React Testing Library for component testing to encourage testing from a user perspective.
Mock external dependencies to isolate components during testing.
Use Jest or Mocha as a testing framework for unit and integration tests.
Use Mock Service Worker (msw) to intercept and mock API calls during testing.
Co-locate test files with components using a consistent naming convention (e.g., ComponentName.test.js).
Test edge cases and error conditions to ensure components are robust.

Files:

  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
src/mastra/**/tests/**/*.test.ts

📄 CodeRabbit inference engine (AGENTS.md)

Add or update tests for any behavioral change in code; unit tests should be in src/mastra/tools/tests and other directories as appropriate

Files:

  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
**/*.test.{ts,tsx}

📄 CodeRabbit inference engine (AGENTS.md)

Use Vitest for unit testing with vitest run --coverage for coverage reports

Files:

  • src/mastra/evals/tests/noise-sensitivity.test.ts
  • src/mastra/evals/tests/completeness.test.ts
  • src/mastra/evals/tests/tool-call-accuracy.test.ts
🧬 Code graph analysis (7)
src/mastra/evals/tests/noise-sensitivity.test.ts (2)
src/mastra/evals/scorers/prebuilt.ts (1)
  • createNoiseSensitivityScorerLLM (78-99)
src/mastra/evals/scorers/utils.ts (2)
  • createAgentTestRun (111-113)
  • createTestMessage (107-109)
src/mastra/evals/tests/completeness.test.ts (2)
src/mastra/evals/scorers/prebuilt.ts (1)
  • createCompletenessScorer (5-34)
src/mastra/evals/scorers/utils.ts (1)
  • createAgentTestRun (111-113)
src/mastra/evals/scorers/prebuilt.ts (1)
src/mastra/evals/scorers/utils.ts (4)
  • extractInputMessages (36-44)
  • extractAgentResponseMessages (46-52)
  • extractToolCalls (88-104)
  • getAssistantMessageFromRunOutput (8-20)
src/mastra/evals/scorers/factuality.scorer.ts (1)
src/mastra/config/google.ts (1)
  • googleAIFlashLite (105-105)
src/mastra/evals/scorers/custom-scorers.ts (1)
src/mastra/config/google.ts (1)
  • googleAIFlashLite (105-105)
src/mastra/evals/scorers/financial-scorers.ts (1)
src/mastra/config/google.ts (1)
  • googleAIFlashLite (105-105)
src/mastra/evals/tests/tool-call-accuracy.test.ts (2)
src/mastra/evals/scorers/prebuilt.ts (1)
  • createToolCallAccuracyScorerCode (36-71)
src/mastra/evals/scorers/utils.ts (2)
  • createAgentTestRun (111-113)
  • createTestMessage (107-109)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Agent
  • GitHub Check: CodeQL analysis (javascript-typescript)
  • GitHub Check: Sourcery review
  • GitHub Check: Codacy Security Scan
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (10)
src/mastra/agents/businessLegalAgents.ts (1)

7-7: LGTM! Clean import removal.

The BatchPartsProcessor import was correctly removed as it's not used anywhere in this file. The agents retain their processor configurations (UnicodeNormalizer and TokenLimiterProcessor) and scorer setups.

src/mastra/agents/evaluationAgent.ts (1)

131-134: LGTM - Scorer configuration removed as intended.

The removal of the responseQuality scorer and the empty scorers configuration aligns with the PR objectives to reorganize the evaluation framework. The agent maintains its core functionality while the scoring logic is being migrated to the new src/mastra/evals/scorers/ structure.

src/mastra/agents/csv_to_excalidraw.ts (1)

178-180: LGTM - Consistent with scorer removal pattern.

The removal of the structureScorer configuration is consistent with the broader refactoring effort across all agent files in this PR.

src/mastra/agents/copywriterAgent.ts (1)

139-141: LGTM - Scorer configuration removed.

Consistent with the evaluation framework reorganization across the codebase.

src/mastra/evals/tests/tool-call-accuracy.test.ts (1)

5-18: Well-structured test coverage for tool call accuracy.

The test cases appropriately verify both the basic functionality (expected tool is called) and the strict mode behavior (fails when multiple tools are called). The use of test utilities promotes consistency across the test suite.

src/mastra/agents/learningExtractionAgent.ts (1)

66-68: LGTM - Scorer removal consistent with PR objectives.

The structureScorer has been removed as part of the evaluation framework reorganization.

tests/test-results/test-results.json (1)

1-1: Verify the empty test results state.

The test results file shows all zero counts and success: false with an empty testResults array. This could indicate:

  1. Tests haven't been run after the evaluation framework changes
  2. Test results were intentionally cleared as part of the refactoring

Please confirm whether tests should be executed to populate this file or if the empty state is intentional for this PR.

src/mastra/evals/tests/completeness.test.ts (1)

5-18: Excellent test coverage for completeness scorer.

The tests effectively verify both positive (>90% coverage) and negative (0% coverage) scenarios. The test data is clear and the assertions are appropriate for the scorer's behavior.

src/mastra/agents/contentStrategistAgent.ts (1)

100-102: LGTM - Multiple scorers removed as intended.

The removal of both structureScorer and creativityScorer configurations aligns with the PR's goal to reorganize the evaluation framework. The agent's core functionality remains intact.

src/mastra/evals/scorers/utils.ts (1)

107-113: Test helpers look good.

The test helper functions are clear, simple, and serve their purpose well. No issues identified.

Comment on lines 131 to 133
  scorers: {
    relevancy: {
      scorer: createAnswerRelevancyScorer({ model: googleAIFlashLite }),
      sampling: { type: "ratio", rate: 0.5 }
    }

  },

🧹 Nitpick | 🔵 Trivial

Empty scorers configuration - consider cleanup.

The scorers configuration is now empty, consistent with the PR's goal to remove scorer integrations. Consider removing the empty scorers property entirely if it's not required, or add a comment indicating this is intentional if it serves as a placeholder for future scorers.

🤖 Prompt for AI Agents
In src/mastra/agents/acpAgent.ts around lines 131 to 133, the scorers property
is an empty object literal; remove the entire empty `scorers` property to clean
up unused configuration, or if it must remain as an explicit placeholder,
replace it with a concise comment explaining it's intentionally left empty for
future scorer integrations (e.g., "// scorers intentionally omitted — no
integrations required"); ensure formatting and trailing commas remain valid
after removal or comment insertion.

Comment on lines 124 to 126
  scorers: {
    summaryQuality: {
      scorer: summaryQualityScorer,
      sampling: { type: 'ratio', rate: 0.6 },
    }

  },

🧹 Nitpick | 🔵 Trivial

Empty scorers configuration - consider cleanup.

The scorers configuration is now empty following the removal of summaryQualityScorer. Consider removing the empty scorers property entirely if it's not required, or add a comment indicating this is intentional if it serves as a placeholder for future scorers.

🤖 Prompt for AI Agents
In src/mastra/agents/editorAgent.ts around lines 124-126 the object contains an
empty scorers: {} entry left after removing summaryQualityScorer; either delete
the entire scorers property to clean up the config or replace it with a
single-line comment (e.g. // placeholder for future scorers) to indicate it is
intentionally empty, and adjust trailing commas/formatting accordingly so
linting/type checks remain satisfied.

Comment on lines 149 to 152
scorers: {
csvValidity: {
scorer: csvValidityScorer,
sampling: { type: 'ratio', rate: 1.0 },
},


},

🧹 Nitpick | 🔵 Trivial

Remove extra whitespace in empty scorers object.

The scorers configuration is now empty following the removal of csvValidityScorer. Line 151 contains unnecessary whitespace. Consider either removing the empty scorers property entirely or cleaning up the formatting.

Apply this diff to remove the extra blank line:

   scorers: {
-
-    
+
   },

Suggested change
-  scorers: {
-    csvValidity: {
-      scorer: csvValidityScorer,
-      sampling: { type: 'ratio', rate: 1.0 },
-    },
-  },
+  scorers: {
+  },
🤖 Prompt for AI Agents
In src/mastra/agents/image_to_csv.ts around lines 149 to 152, the empty scorers
object contains an extra blank line; remove the unnecessary whitespace by either
deleting the entire empty "scorers" property if it's unused, or collapse it to a
single-line empty object (e.g. scorers: {}), ensuring consistent formatting with
surrounding properties.

Comment on lines 198 to 200
scorers: {
taskCompletion: {
scorer: taskCompletionScorer,
sampling: { type: 'ratio', rate: 0.4 },
},

},

🧹 Nitpick | 🔵 Trivial

Empty scorers configuration - consider cleanup.

The scorers configuration is now empty following the removal of taskCompletionScorer. Consider removing the empty scorers property entirely if it's not required, or add a comment indicating this is intentional if it serves as a placeholder for future scorers.

🤖 Prompt for AI Agents
In src/mastra/agents/package-publisher.ts around lines 198 to 200 there is an
empty scorers: { } block left after removal of taskCompletionScorer; remove the
entire scorers property if it is not used anywhere, or if it is intentionally a
placeholder, replace the empty object with a one-line comment (e.g. //
placeholder for future scorers) so the intent is clear to readers and linters.

Comment on lines 507 to 511
options: {
},
scorers: {
taskCompletion: {
scorer: taskCompletionScorer,
sampling: { type: 'ratio', rate: 0.5 },
},
sourceDiversity: {
scorer: sourceDiversityScorer,
sampling: { type: 'ratio', rate: 0.3 },
},
financialData: {
scorer: financialDataScorer,
sampling: { type: 'ratio', rate: 0.6 },
},

},

🧹 Nitpick | 🔵 Trivial

Remove empty options and scorers properties.

Both options and scorers are empty objects. Consider removing them entirely to reduce noise and improve maintainability.

   memory: pgMemory,
-  options: {
-  },
-  scorers: {
-
-  },
   inputProcessors: [

Suggested change
-  options: {
-  },
-  scorers: {
-    taskCompletion: {
-      scorer: taskCompletionScorer,
-      sampling: { type: 'ratio', rate: 0.5 },
-    },
-    sourceDiversity: {
-      scorer: sourceDiversityScorer,
-      sampling: { type: 'ratio', rate: 0.3 },
-    },
-    financialData: {
-      scorer: financialDataScorer,
-      sampling: { type: 'ratio', rate: 0.6 },
-    },
-  },
+  memory: pgMemory,
+  inputProcessors: [
🤖 Prompt for AI Agents
In src/mastra/agents/recharts.ts around lines 507 to 511, the object literal
contains empty properties "options" and "scorers"; remove those two keys
entirely from the object to reduce noise, adjust surrounding commas/formatting
so the object remains valid, and run TypeScript/linters to confirm no references
break (if any code expects them, replace removal with explicit undefined or
update callers).

Comment on lines +73 to +81
export function getSystemMessagesFromRunInput(input: RunInput): string[] {
if (!input) {return []}
if (Array.isArray(input)) {return (input).filter((m) => m.role === 'system').map((m) => String(m.content ?? ''))}
if (typeof input === 'object') {
const sys = (input as { systemMessages?: Array<Message | string> }).systemMessages ?? []
return sys.map((m) => typeof m === 'string' ? m : String((m).content ?? m))
}
return []
}

⚠️ Potential issue | 🟠 Major

Clarify fallback logic and avoid String() coercion.

Two issues:

  1. Line 75: String(m.content ?? '') has the same coercion problem discussed earlier
  2. Line 78: The fallback String((m).content ?? m) is unclear - it falls back to stringifying the entire message object if content is missing, which may not be the intended behavior

Consider making the logic more explicit:

 export function getSystemMessagesFromRunInput(input: RunInput): string[] {
   if (!input) {return []}
-  if (Array.isArray(input)) {return (input).filter((m) => m.role === 'system').map((m) => String(m.content ?? ''))}
+  if (Array.isArray(input)) {
+    return (input)
+      .filter((m) => m.role === 'system' && typeof m.content === 'string')
+      .map((m) => m.content as string)
+  }
   if (typeof input === 'object') {
     const sys = (input as { systemMessages?: Array<Message | string> }).systemMessages ?? []
-    return sys.map((m) => typeof m === 'string' ? m : String((m).content ?? m))
+    return sys
+      .map((m) => typeof m === 'string' ? m : (typeof m.content === 'string' ? m.content : null))
+      .filter((m): m is string => m !== null)
   }
   return []
 }
🤖 Prompt for AI Agents
In src/mastra/evals/scorers/utils.ts around lines 73-81, replace the loose
String() coercions and unclear fallback with explicit checks: when input is an
array, filter role==='system' then for each message return the content only if
it exists and is a string, otherwise return an empty string; when input is an
object with systemMessages, for each entry return the entry itself if it's a
string, else if it's an object and has a string content property return that
content, otherwise return an empty string. Remove usages of String(...) and
avoid stringifying whole message objects as a fallback.
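The explicit extraction the review asks for can be sketched as a standalone helper. `Message` and `RunInput` below are assumed shapes for illustration, not the project's actual types:

```typescript
type Message = { role: string; content?: unknown }
type RunInput = Message[] | { systemMessages?: Array<Message | string> } | null | undefined

function getSystemMessages(input: RunInput): string[] {
  if (!input) { return [] }
  if (Array.isArray(input)) {
    // Keep only system messages whose content is actually a string;
    // never coerce non-string content with String().
    return input
      .filter((m) => m.role === 'system' && typeof m.content === 'string')
      .map((m) => m.content as string)
  }
  const sys = input.systemMessages ?? []
  return sys
    .map((m) => (typeof m === 'string' ? m : typeof m.content === 'string' ? m.content : null))
    .filter((s): s is string => s !== null)
}
```

Non-string content is dropped rather than stringified, so "[object Object]" never leaks into prompts or scores.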

Comment on lines +88 to +104
export function extractToolCalls(output: unknown): { tools: string[]; toolCallInfos: Array<{ toolName: string; toolCallId?: string | number; messageIndex: number; invocationIndex: number }> } {
const messages = Array.isArray(output) ? output : [output]
const tools: string[] = []
const toolCallInfos: Array<{ toolName: string; toolCallId?: string | number; messageIndex: number; invocationIndex: number }> = []
messages.forEach((m: unknown, msgIdx: number) => {
const invs = (m as Record<string, unknown>)?.toolInvocations
if (Array.isArray(invs)) {
invs.forEach((t: unknown, invIdx: number) => {
const tt = t as Record<string, unknown>
const toolName = String(tt.toolName ?? '')
tools.push(toolName)
toolCallInfos.push({ toolName, toolCallId: tt.toolCallId as string | number | undefined, messageIndex: msgIdx, invocationIndex: invIdx })
})
}
})
return { tools, toolCallInfos }
}

⚠️ Potential issue | 🟠 Major

Validate toolName is a string before extraction.

Line 97 uses String(tt.toolName ?? '') which will produce "[object Object]" or similar garbage if toolName is not a string. Tool names should be validated as strings.

Apply this diff:

       invs.forEach((t: unknown, invIdx: number) => {
         const tt = t as Record<string, unknown>
-        const toolName = String(tt.toolName ?? '')
+        const toolName = typeof tt.toolName === 'string' ? tt.toolName : ''
+        if (!toolName) return // Skip invalid tool calls
         tools.push(toolName)
         toolCallInfos.push({ toolName, toolCallId: tt.toolCallId as string | number | undefined, messageIndex: msgIdx, invocationIndex: invIdx })
       })
🤖 Prompt for AI Agents
In src/mastra/evals/scorers/utils.ts around lines 88 to 104, the code currently
coerces tt.toolName with String(...) which can produce garbage for non-string
values; change the extraction to validate that tt.toolName is a string before
using it (e.g., if (typeof tt.toolName !== 'string') skip this invocation or set
toolName = ''/undefined and do not push it), only push to tools and
toolCallInfos when toolName is a valid string, and keep toolCallId extraction
as-is; this prevents pushing "[object Object]" or other invalid names into the
results.
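A version of the extraction that validates `toolName` before pushing, as the suggestion above describes, might look like this; the message shape (`toolInvocations` array) is assumed from the quoted code:

```typescript
type ToolCallInfo = {
  toolName: string
  toolCallId?: string | number
  messageIndex: number
  invocationIndex: number
}

function extractToolCallsSafe(output: unknown): { tools: string[]; toolCallInfos: ToolCallInfo[] } {
  const messages = Array.isArray(output) ? output : [output]
  const tools: string[] = []
  const toolCallInfos: ToolCallInfo[] = []
  messages.forEach((m, msgIdx) => {
    const invs = (m as { toolInvocations?: unknown } | null)?.toolInvocations
    if (!Array.isArray(invs)) { return }
    invs.forEach((t, invIdx) => {
      const tt = t as Record<string, unknown>
      // Skip invocations whose toolName is not a non-empty string instead of coercing.
      if (typeof tt.toolName !== 'string' || tt.toolName.length === 0) { return }
      tools.push(tt.toolName)
      toolCallInfos.push({
        toolName: tt.toolName,
        toolCallId: tt.toolCallId as string | number | undefined,
        messageIndex: msgIdx,
        invocationIndex: invIdx,
      })
    })
  })
  return { tools, toolCallInfos }
}
```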

it('scores 1.0 when output contains all input terms', async () => {
const scorer = createCompletenessScorer()
const run = createAgentTestRun({ inputMessages: [{ role: 'user', content: 'List apples oranges and bananas' }], output: [{ role: 'assistant', content: 'Apples, oranges, bananas' }] })
const res = await (scorer as any).run({ input: run.input, output: run.output })

🧹 Nitpick | 🔵 Trivial

Consider adding proper type definitions for scorer.run().

Similar to the tool-call-accuracy test, the cast to any bypasses type checking. Consider adding proper type definitions for the scorer's run method to improve type safety.

🤖 Prompt for AI Agents
In src/mastra/evals/tests/completeness.test.ts around line 9, the test casts
scorer to any to call run, which bypasses TypeScript checks; change this by
adding or importing a proper Scorer interface/type that declares run(input:
{input: string; output: string}): Promise<YourResultType> (or the actual
input/output/result types used elsewhere), type the scorer variable as that
Scorer, and then call await scorer.run(...) without casting; update any helper
types or imports so the test compiles with correct types.

it('gives high score when output equals baseline', async () => {
const scorer = createNoiseSensitivityScorerLLM({ baselineResponse: 'The capital of France is Paris.' })
const run = createAgentTestRun({ output: [createTestMessage({ role: 'assistant', content: 'The capital of France is Paris.' })] })
const res = await (scorer as any).run({ input: run.input, output: run.output })

🛠️ Refactor suggestion | 🟠 Major

Avoid type assertions with any - use proper typing.

The use of (scorer as any).run() bypasses TypeScript's type safety. This suggests the scorer type definitions may not expose the run method properly, or the test is accessing internal APIs.

Consider one of the following approaches:

  1. If run is a public API, update the scorer type definitions to expose it
  2. If this is an internal method, use a more specific type assertion
  3. Use a test-specific interface that extends the scorer type

Example alternative:

-    const res = await (scorer as any).run({ input: run.input, output: run.output })
+    // Define a test interface that includes the run method
+    interface RunnableScorer {
+      run(params: { input: any; output: any }): Promise<{ score: number }>
+    }
+    const res = await (scorer as RunnableScorer).run({ input: run.input, output: run.output })

Also applies to: 16-16

🤖 Prompt for AI Agents
In src/mastra/evals/tests/noise-sensitivity.test.ts around lines 9 and 16, the
test uses (scorer as any).run(...) which disables TypeScript safety; instead
either update the scorer's exported type to include a public run method or
create a narrow test-specific interface describing run(input: ..., output: ...):
Promise<...> and cast scorer to that interface; modify the test to import the
proper scorer type (or extend it) and replace (scorer as any) with a typed cast
so the run method is called via a correctly typed object.
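One way to type the test call site without `as any` is a narrow, test-only interface. `RunnableScorer` below is an assumed shape for illustration, not the real Mastra scorer type; the stub scorer only exists to show the typed call:

```typescript
// Test-only interface: just enough shape to call `run` with type checking.
interface RunnableScorer<I = unknown, O = unknown> {
  run(params: { input: I; output: O }): Promise<{ score: number }>
}

// Hypothetical stub scorer satisfying the interface.
const exactMatchScorer: RunnableScorer<string, string> = {
  async run({ input, output }) {
    return { score: input === output ? 1 : 0 }
  },
}

async function demo(): Promise<number> {
  // Fully type-checked: no `as any` needed at the call site.
  const res = await exactMatchScorer.run({ input: 'Paris', output: 'Paris' })
  return res.score
}
```

In the real tests, a created scorer would be cast to `RunnableScorer` once (a single, documented narrowing) instead of `any` at every call.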

it('passes when expected tool is called', async () => {
const scorer = createToolCallAccuracyScorerCode({ expectedTool: 'weather-tool' })
const run = createAgentTestRun({ output: [createTestMessage({ role: 'assistant', content: 'Ok', toolInvocations: [{ toolCallId: '1', toolName: 'weather-tool' }] })] })
const res = await (scorer as any).run({ input: run.input, output: run.output })

🧹 Nitpick | 🔵 Trivial

Consider adding proper type definitions for scorer.run().

The cast to any bypasses TypeScript's type checking. If the scorer's run method is part of the public API, consider adding proper type definitions to improve type safety and IDE support.

If the run method should be part of the scorer's public interface, you could:

-    const res = await (scorer as any).run({ input: run.input, output: run.output })
+    const res = await scorer.run({ input: run.input, output: run.output })

Alternatively, if run is internal, consider documenting why the cast is necessary or providing a typed test utility function.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/mastra/evals/tests/tool-call-accuracy.test.ts around line 9, the test
casts scorer to any to call run, which bypasses TypeScript checks; replace the
any cast by declaring a proper type for scorer that includes a typed run
signature (or import/update the scorer interface/type in the package API), or
create a small typed test helper/interface in the test that declares run(input:
ScorerInputType): Promise<ScorerOutputType> (use the actual input/output shapes
used in tests), then use that type for scorer so calls to run are type-checked
and IDE-friendly.

Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the evaluation system by removing unused scorer files and introducing a new, more organized structure for evaluation helpers and prebuilt scorers. The changes consolidate scorer logic into src/mastra/evals/scorers/ and clean up agent configurations by removing scorer references. However, there are critical bugs that prevent the code from compiling.

Key Changes:

  • Removed old scorer files from src/mastra/scorers/ directory
  • Added new evaluation structure under src/mastra/evals/scorers/ with prebuilt scorers, utilities, and custom scorers
  • Removed test setup files (testSetup.ts, globalSetup.ts) and updated vitest configuration
  • Cleaned up agent files by removing scorer imports and configurations

Critical Issues:

  • businessLegalAgents.ts imports non-existent functions (createAnswerRelevancyScorer, createToxicityScorer) from prebuilt scorers, causing compilation failures
  • Test suite is broken (0 tests running with success: false)

Reviewed changes

Copilot reviewed 39 out of 39 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
vitest.config.ts Removed globalSetup and setupFiles references for deleted test setup files
tests/test-results/test-results.json Test results show 0 tests running with failure status, indicating broken test configuration
testSetup.ts Deleted test setup file containing mock configurations and global setup
globalSetup.ts Deleted global setup file for evals
src/mastra/scorers/*.ts Deleted old scorer files (weather-scorer, tone-consistency, structure, sql-validity, script-scorers, factuality, custom-scorers, csv-validity, financial-scorers, index)
src/mastra/evals/scorers/utils.ts Added utility functions for extracting messages, tool calls, and test helpers
src/mastra/evals/scorers/prebuilt.ts Added prebuilt scorers: completeness, tool call accuracy, noise sensitivity, bias, and prompt alignment
src/mastra/evals/scorers/keyword-coverage.ts Added keyword coverage scorer
src/mastra/evals/scorers/financial-scorers.ts Moved and updated financial data scorer with new import paths
src/mastra/evals/scorers/factuality.scorer.ts Moved and simplified factuality scorer
src/mastra/evals/scorers/custom-scorers.ts Moved and simplified custom scorers (source diversity, research completeness)
src/mastra/evals/AGENTS.md Added documentation for evaluation helpers directory
src/mastra/evals/tests/*.test.ts Added tests for completeness, noise sensitivity, and tool call accuracy scorers
src/mastra/color-change-tool.ts Deleted unused color change tool
src/mastra/agents/*.ts Removed scorer imports and configurations from multiple agents (scriptWriter, research, report, recharts, package-publisher, learningExtraction, image_to_csv, evaluation, editor, csv_to_excalidraw, copywriter, contentStrategist, businessLegal, acp)
Comments suppressed due to low confidence (2)

src/mastra/agents/businessLegalAgents.ts:12

  • The functions createAnswerRelevancyScorer and createToxicityScorer are imported from @mastra/evals/scorers/prebuilt but they don't exist in the prebuilt.ts file. These functions need to be implemented or the imports should be removed. Currently, this file will fail to compile.
import {
  createAnswerRelevancyScorer,
  createToxicityScorer
} from '@mastra/evals/scorers/prebuilt';

src/mastra/evals/scorers/custom-scorers.ts:1

  • Unused import runEvals.
import { createScorer, runEvals } from '@mastra/core/evals';

Comment on lines +40 to +71
if (Array.isArray(sources) && sources.length === 0) {
return {
diversityScore: 0,
uniqueDomains: 0,
totalSources: 0,
domainBreakdown: {},
issues: ['No sources found']
}
}

const domains = (sources || []).map((url: string) => {
try {
const urlObj = new URL(url)
return urlObj.hostname.replace('www.', '')
} catch {
return 'invalid'
}
}).filter(d => d !== 'invalid')

const uniqueDomains = new Set(domains)
const domainBreakdown: Record<string, number> = {}

domains.forEach(domain => {
domainBreakdown[domain] = (domainBreakdown[domain] || 0) + 1
})

const diversityScore = Math.min(uniqueDomains.size / Math.max((sources || []).length * 0.5, 1), 1)

const issues: string[] = []
if (uniqueDomains.size < 2) {issues.push('Limited domain diversity - mostly single source')}
if ((sources || []).length < 3) {issues.push('Insufficient number of sources')}
if (Object.values(domainBreakdown).some(count => count > (sources || []).length * 0.6)) {
Copilot AI Dec 14, 2025

The condition Array.isArray(sources) && sources.length === 0 at line 40 is redundant because sources is always an array (initialized as let sources: string[] = [] on line 20). The check can be simplified to just sources.length === 0. Additionally, the defensive checks (sources || []) on lines 50, 66, 70, and 71 are unnecessary since sources cannot be null or undefined.

Copilot uses AI. Check for mistakes.
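The diversity math under discussion can be isolated into a small pure function for reference. The names and the 0.5 sources-per-domain heuristic follow the quoted code; this is a sketch, not the scorer itself:

```typescript
function computeDiversity(sources: string[]): {
  diversityScore: number
  uniqueDomains: number
  domainBreakdown: Record<string, number>
} {
  // Parse hostnames; invalid URLs are dropped rather than counted.
  const domains = sources
    .map((url) => {
      try {
        return new URL(url).hostname.replace(/^www\./, '')
      } catch {
        return null
      }
    })
    .filter((d): d is string => d !== null)

  const domainBreakdown: Record<string, number> = {}
  for (const d of domains) { domainBreakdown[d] = (domainBreakdown[d] ?? 0) + 1 }

  const uniqueDomains = new Set(domains).size
  // Full score when unique domains reach half the source count (capped at 1).
  const diversityScore = Math.min(uniqueDomains / Math.max(sources.length * 0.5, 1), 1)
  return { diversityScore, uniqueDomains, domainBreakdown }
}
```

Since `sources` is typed as `string[]`, the `Array.isArray` and `(sources || [])` guards from the original are unnecessary, as the comment above notes.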
Comment on lines +3 to +11
export const keywordCoverageScorer = createScorer({ id: 'keyword-coverage', name: 'Keyword Coverage', description: 'Measures coverage of required keywords in output', type: 'agent' })
.generateScore(({ run }) => {
const input = (run.input && JSON.stringify(run.input)) || ''
const output = (run.output && JSON.stringify(run.output)) || ''
const required = (run.requestContext && (run.requestContext as any).requiredKeywords) ?? []
if (!required || required.length === 0) {return 1}
const matched = required.filter((k: string) => output.toLowerCase().includes(k.toLowerCase())).length
return matched / required.length
})
Copilot AI Dec 14, 2025

The scorer configuration is missing the type: 'agent' property. This should be added for consistency with other scorers in the codebase.

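For reference, the coverage math from the quoted scorer, isolated as a pure function (input shapes assumed; the real scorer pulls `requiredKeywords` from `requestContext`):

```typescript
// Case-insensitive substring matching, as in the quoted generateScore step.
function keywordCoverage(outputText: string, requiredKeywords: string[]): number {
  if (requiredKeywords.length === 0) { return 1 } // nothing required: full score
  const haystack = outputText.toLowerCase()
  const matched = requiredKeywords.filter((k) => haystack.includes(k.toLowerCase())).length
  return matched / requiredKeywords.length
}
```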
});

// Other scorers (researchCompleteness, summaryQuality, taskCompletion, responseQuality, creativity)
export const researchCompletenessScorer = createScorer({ id: 'research-completeness-scorer', name: 'Research Completeness', description: 'Evaluates if the research comprehensively covers the topic from multiple angles', judge: { model: googleAIFlashLite, instructions: 'You are an expert research evaluator focused on completeness and depth of analysis.' } })
Copilot AI Dec 14, 2025

The type: 'agent' property is missing from the scorer configuration. This should be added for consistency with the pattern in other scorers.

Suggested change
- export const researchCompletenessScorer = createScorer({ id: 'research-completeness-scorer', name: 'Research Completeness', description: 'Evaluates if the research comprehensively covers the topic from multiple angles', judge: { model: googleAIFlashLite, instructions: 'You are an expert research evaluator focused on completeness and depth of analysis.' } })
+ export const researchCompletenessScorer = createScorer({ type: 'agent', id: 'research-completeness-scorer', name: 'Research Completeness', description: 'Evaluates if the research comprehensively covers the topic from multiple angles', judge: { model: googleAIFlashLite, instructions: 'You are an expert research evaluator focused on completeness and depth of analysis.' } })

}
})
.generateScore(({ results }) => {
return results.analyzeStepResult.diversityScore
Copilot AI Dec 14, 2025

The generateReason method is missing from the sourceDiversityScorer. While not strictly required, it's a best practice to include this method to provide explanatory feedback about the score, similar to how the old implementation included detailed reasoning about domain diversity.

Suggested change
-  return results.analyzeStepResult.diversityScore
+  return results.analyzeStepResult.diversityScore
+  })
+  .generateReason(({ results }) => {
+    const {
+      diversityScore,
+      uniqueDomains,
+      totalSources,
+      domainBreakdown,
+      issues
+    } = results.analyzeStepResult;
+    let reason = `Diversity Score: ${(diversityScore * 100).toFixed(1)}%\n`;
+    reason += `Unique Domains: ${uniqueDomains}\n`;
+    reason += `Total Sources: ${totalSources}\n`;
+    reason += `Domain Breakdown:\n`;
+    for (const [domain, count] of Object.entries(domainBreakdown)) {
+      reason += `  - ${domain}: ${count}\n`;
+    }
+    if (issues && issues.length > 0) {
+      reason += `Issues:\n`;
+      for (const issue of issues) {
+        reason += `  - ${issue}\n`;
+      }
+    }
+    return reason.trim();
}
// if no baseline, be conservative
return 0.5
})
Copilot AI Dec 14, 2025

The generateReason method is missing from createNoiseSensitivityScorerLLM. This would be helpful to explain the Levenshtein distance calculation and why the output diverged from the baseline.

Suggested change
-  })
+  })
+  .generateReason(({ results }) => {
+    const opts = results.preprocessStepResult?.opts ?? {};
+    const outputText = results.preprocessStepResult?.outputText ?? '';
+    const baseline = typeof opts.baselineResponse === 'string' ? opts.baselineResponse : undefined;
+    if (typeof baseline === 'string' && baseline.trim().length > 0) {
+      const baselineTrim = baseline.trim();
+      if (outputText.trim() === baselineTrim) {
+        return 'Output matches the baseline response exactly. Score: 1 (robust to noise).';
+      } else {
+        const distance = levenshtein(outputText, baselineTrim);
+        const maxLen = Math.max(outputText.length, baselineTrim.length, 1);
+        const score = Math.max(0, 1 - (distance / maxLen));
+        return `Output diverges from baseline. Levenshtein distance: ${distance} (normalized score: ${score.toFixed(3)}). Baseline: "${baselineTrim}", Output: "${outputText.trim()}"`;
+      }
+    }
+    return 'No baseline response provided; default score of 0.5 assigned.';
+  })

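The `levenshtein` helper the suggestion references is not shown in the diff; a minimal implementation of the normalized scoring it describes could look like this (function names are assumptions):

```typescript
// Classic dynamic-programming edit distance.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) => [i, ...Array(b.length).fill(0)])
  for (let j = 0; j <= b.length; j++) { dp[0][j] = j }
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      )
    }
  }
  return dp[a.length][b.length]
}

// Normalized similarity used by the noise-sensitivity fallback: 1 = identical.
function noiseScore(output: string, baseline: string): number {
  const maxLen = Math.max(output.length, baseline.length, 1)
  return Math.max(0, 1 - levenshtein(output, baseline) / maxLen)
}
```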
}
const covered = inputTerms.filter(t => outputTerms.includes(t)).length
return covered / inputTerms.length
})
Copilot AI Dec 14, 2025

The generateReason method is missing from the createCompletenessScorer. Adding this method would provide valuable context about which input elements were missing from the output.

Suggested change
-  })
+  })
+  .generateReason(({ results }) => {
+    const missingElements = results.analyzeStepResult?.missingElements ?? [];
+    if (missingElements.length === 0) {
+      return "All input elements are covered in the output.";
+    }
+    return `Missing input elements in output: ${missingElements.join(', ')}`;
+  });

}
// default: pass if any tools were called
return tools.length > 0 ? 1 : 0
})
Copilot AI Dec 14, 2025

The generateReason method is missing from createToolCallAccuracyScorerCode. Adding this would help explain why a tool call accuracy check passed or failed, especially useful when debugging tool call issues.

Suggested change
-  })
+  })
+  .generateReason(({ results }) => {
+    const preprocess = results.preprocessStepResult ?? { tools: [], opts: {} }
+    const tools = Array.isArray(preprocess.tools) ? preprocess.tools : []
+    const { expectedTool, strictMode, expectedToolOrder } = preprocess.opts ?? {}
+    if (expectedToolOrder && expectedToolOrder.length > 0) {
+      const seq = expectedToolOrder
+      let idx = 0
+      for (const t of tools) {
+        if (t === seq[idx]) { idx++ }
+        if (idx === seq.length) { break }
+      }
+      const correct = idx === seq.length
+      if (strictMode === true) {
+        return correct
+          ? `All expected tools were called in the correct order (strict mode).`
+          : `Not all expected tools were called in the correct order (strict mode). Expected order: [${seq.join(', ')}], got: [${tools.join(', ')}]`
+      }
+      return correct
+        ? `All expected tools were called in the correct order.`
+        : `Not all expected tools were called in the correct order. Expected order: [${seq.join(', ')}], got: [${tools.join(', ')}]`
+    }
+    if (typeof expectedTool === 'string' && expectedTool.length > 0) {
+      const has = tools.includes(expectedTool)
+      if (strictMode === true) {
+        return (tools.length === 1 && has)
+          ? `Only the expected tool "${expectedTool}" was called (strict mode).`
+          : `Expected only "${expectedTool}" to be called (strict mode), but got: [${tools.join(', ')}]`
+      }
+      return has
+        ? `Expected tool "${expectedTool}" was called.`
+        : `Expected tool "${expectedTool}" was not called. Tools called: [${tools.join(', ')}]`
+    }
+    // default: pass if any tools were called
+    return tools.length > 0
+      ? `At least one tool was called: [${tools.join(', ')}]`
+      : `No tools were called.`
+  })

import { Agent } from '@mastra/core/agent';

- import { BatchPartsProcessor, TokenLimiterProcessor, UnicodeNormalizer } from '@mastra/core/processors';
+ import { TokenLimiterProcessor, UnicodeNormalizer } from '@mastra/core/processors';
Copilot AI Dec 14, 2025

The removed import BatchPartsProcessor may still be needed elsewhere in the file. Verify that this processor is not used anywhere in the code.

@@ -0,0 +1,143 @@
import { createScorer, runEvals } from '@mastra/core/evals';
Copilot AI Dec 14, 2025

The runEvals import on line 1 is unused in this file. It should be removed to keep imports clean.

Suggested change
- import { createScorer, runEvals } from '@mastra/core/evals';
+ import { createScorer } from '@mastra/core/evals';


export const keywordCoverageScorer = createScorer({ id: 'keyword-coverage', name: 'Keyword Coverage', description: 'Measures coverage of required keywords in output', type: 'agent' })
.generateScore(({ run }) => {
const input = (run.input && JSON.stringify(run.input)) || ''
Copilot AI Dec 14, 2025

Unused variable input.

Suggested change
-    const input = (run.input && JSON.stringify(run.input)) || ''

@greptile-apps

greptile-apps bot commented Dec 14, 2025

Greptile Overview

Greptile Summary

This PR migrates scorer implementations from src/mastra/scorers/ to src/mastra/evals/scorers/ and decouples agents from hardwired scorers. The new structure includes prebuilt scorers (completeness, tool-call accuracy, noise sensitivity, bias, prompt alignment) and migrated custom scorers (source diversity, research completeness, financial data integrity, factuality, keyword coverage).

Critical issues found:

  • 6 files have broken imports from deleted scorer files that will cause build/test failures
  • weather-agent.ts, index.ts, index.test.ts, a2aCoordinatorAgent.ts, codingA2ACoordinator.ts, and agent-experiments.ts import from deleted files
  • Several scorers referenced in imports (responseQualityScorer, taskCompletionScorer, translationScorer, summaryQualityScorer, csvValidityScorer) were deleted but not recreated in the new structure

Positive changes:

  • Successfully decoupled scorers from agent definitions in reportAgent.ts and researchAgent.ts
  • Clean utility functions in utils.ts for message extraction
  • Proper vitest config updates removing deleted setup files
  • Well-structured new scorer implementations with createScorer API
  • Added comprehensive tests for new scorers

Minor issues:

  • keyword-coverage.ts uses unsafe as any type casting on line 7

Confidence Score: 0/5

  • This PR contains critical build-breaking issues and cannot be merged safely
  • Score of 0 reflects that 6 files have broken imports from deleted scorer files, which will cause immediate build failures and prevent the application from compiling. These are not minor issues but critical syntax errors that must be resolved before merge.
  • Critical attention needed: src/mastra/index.ts, src/mastra/agents/weather-agent.ts, src/mastra/agents/index.test.ts, src/mastra/a2a/a2aCoordinatorAgent.ts, src/mastra/a2a/codingA2ACoordinator.ts, src/mastra/experiments/agent-experiments.ts - all have broken imports

Important Files Changed

File Analysis

Filename Score Overview
src/mastra/agents/weather-agent.ts 0/5 Broken import from deleted weather-scorer.ts file will cause build failure
src/mastra/index.ts 0/5 Multiple broken imports from deleted scorer files will cause build failure
src/mastra/agents/index.test.ts 0/5 Broken imports from deleted custom-scorers.ts file will cause test failures
src/mastra/a2a/a2aCoordinatorAgent.ts 0/5 Multiple broken imports from deleted scorer files will cause build failure
src/mastra/a2a/codingA2ACoordinator.ts 0/5 Broken import from deleted custom-scorers.ts will cause build failure
src/mastra/experiments/agent-experiments.ts 0/5 Broken imports from deleted scorers index will cause build failure
src/mastra/evals/scorers/keyword-coverage.ts 3/5 Simple keyword coverage scorer with type safety issues on requestContext

Sequence Diagram

sequenceDiagram
    participant Dev as Developer
    participant Agent as Mastra Agent
    participant Eval as Eval Runner
    participant Scorer as Scorer (prebuilt/custom)
    participant Judge as LLM Judge (optional)
    participant Storage as Mastra Storage

    Dev->>Agent: Execute agent with input
    Agent->>Agent: Process request with tools
    Agent->>Storage: Store execution trace
    
    Note over Eval,Scorer: Evaluation Phase (decoupled from agent)
    
    Dev->>Eval: runEvals(scorers, runs)
    Eval->>Scorer: preprocess({ run })
    Scorer->>Scorer: Extract messages/tool calls
    Scorer-->>Eval: Return preprocessed data
    
    alt LLM-based scorer
        Eval->>Scorer: analyze({ results })
        Scorer->>Judge: Send prompt with context
        Judge-->>Scorer: Return structured analysis
        Scorer-->>Eval: Return analysis result
    else Code-based scorer
        Eval->>Scorer: analyze({ results })
        Scorer->>Scorer: Compute metrics (coverage, accuracy)
        Scorer-->>Eval: Return analysis result
    end
    
    Eval->>Scorer: generateScore({ results })
    Scorer->>Scorer: Calculate final score (0-1)
    Scorer-->>Eval: Return score
    
    opt Generate reason
        Eval->>Scorer: generateReason({ results, score })
        Scorer-->>Eval: Return explanation
    end
    
    Eval->>Storage: Store eval results
    Eval-->>Dev: Return scores and metrics
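The scorer lifecycle in the diagram above (preprocess → analyze → generateScore) can be sketched as three pure functions for the code-based keyword-coverage case. This is a standalone illustration, not the actual Mastra `createScorer` API; the function and interface names here are hypothetical.

```typescript
// Standalone sketch of the code-based scorer pipeline from the diagram.
// Names (Run, preprocess, analyze, generateScore) are illustrative only.

interface Run {
  input: unknown;
  output: unknown;
  requestContext?: Record<string, unknown>;
}

interface CoverageAnalysis {
  required: string[];
  found: string[];
}

// preprocess: flatten the run into a single searchable string
function preprocess(run: Run): string {
  const input = run.input ? JSON.stringify(run.input) : "";
  const output = run.output ? JSON.stringify(run.output) : "";
  return `${input} ${output}`.toLowerCase();
}

// analyze: determine which required keywords appear in the text
function analyze(text: string, required: string[]): CoverageAnalysis {
  const found = required.filter((kw) => text.includes(kw.toLowerCase()));
  return { required, found };
}

// generateScore: fraction of required keywords covered (1 when none required)
function generateScore(a: CoverageAnalysis): number {
  if (a.required.length === 0) return 1;
  return a.found.length / a.required.length;
}

const run: Run = {
  input: { query: "revenue forecast" },
  output: { text: "Q3 revenue grew 12%; forecast revised upward." },
};
const score = generateScore(
  analyze(preprocess(run), ["revenue", "forecast", "margin"]),
);
console.log(score); // 2 of 3 keywords found
```

Keeping each phase a separate function mirrors the decoupled evaluation flow in the diagram: the same preprocessed run can be fed to multiple analyzers without re-extracting messages.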


@greptile-apps greptile-apps bot left a comment


Additional Comments (6)

  1. src/mastra/agents/weather-agent.ts, line 5 (link)

    syntax: import from deleted file ../scorers/weather-scorer will cause build failure

  2. src/mastra/index.ts, line 17-18 (link)

    syntax: imports from deleted files ./scorers/custom-scorers and ./scorers/weather-scorer will cause build failure

  3. src/mastra/agents/index.test.ts, line 2 (link)

    syntax: import from deleted file will cause test failure

  4. src/mastra/a2a/a2aCoordinatorAgent.ts, line 8 (link)

    syntax: import from deleted file will cause build failure

  5. src/mastra/a2a/codingA2ACoordinator.ts, line 8 (link)

    syntax: import from deleted file will cause build failure

  6. src/mastra/experiments/agent-experiments.ts, line 24 (link)

    syntax: import from deleted ../scorers index will cause build failure

39 files reviewed, 7 comments


  .generateScore(({ run }) => {
    const input = (run.input && JSON.stringify(run.input)) || ''
    const output = (run.output && JSON.stringify(run.output)) || ''
    const required = (run.requestContext && (run.requestContext as any).requiredKeywords) ?? []


style: unsafe type casting with as any - requestContext should have proper typing

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/mastra/evals/scorers/keyword-coverage.ts
Line: 7:7

Comment:
**style:** unsafe type casting with `as any` - `requestContext` should have proper typing

How can I resolve this? If you propose a fix, please make it concise.
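One concise way to resolve the `as any` cast flagged above is to narrow `requestContext` through a small typed interface with a runtime check. This is a hedged sketch: `KeywordCoverageContext` and `getRequiredKeywords` are illustrative names, not part of the Mastra API.

```typescript
// Hypothetical fix sketch: replace the `as any` cast with a narrow,
// runtime-validated read of requestContext. Names are illustrative.

interface KeywordCoverageContext {
  requiredKeywords?: string[];
}

// Narrow an unknown context to a string[] without resorting to `as any`
function getRequiredKeywords(ctx: unknown): string[] {
  const candidate = (ctx as KeywordCoverageContext | undefined)?.requiredKeywords;
  return Array.isArray(candidate) && candidate.every((k) => typeof k === "string")
    ? candidate
    : [];
}

console.log(getRequiredKeywords({ requiredKeywords: ["revenue", "margin"] }));
console.log(getRequiredKeywords(undefined)); // falls back to []
console.log(getRequiredKeywords({ requiredKeywords: [1, 2] })); // rejects non-strings
```

The runtime `Array.isArray` and `typeof` checks matter because `requestContext` is populated by callers at runtime; a cast alone would only silence the compiler without guaranteeing the shape.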
