Skip to content

Conversation

@quanru
Copy link
Collaborator

@quanru quanru commented Oct 17, 2025

Summary

This PR enforces that Planning functionality only supports vision language models (VL mode)
and removes DOM-based planning support. This change ensures better consistency and performance
by requiring visual models for planning operations.

Changes

Core Changes

  • ModelConfigManager Validation (packages/shared/src/env/model-config-manager.ts)

    • Added validation in getModelConfig() to enforce VL mode for Planning intent
    • Provides clear error message with all supported VL modes and configuration examples
    • References documentation link for users to learn more
  • Planning Implementation Simplification (packages/core/src/ai-model/llm-planning.ts)

    • Removed DOM mode support (describeUserPage, markupImageForLLM)
    • Simplified image processing logic to only support VL mode paths
    • Added assertion to ensure VL mode is configured
    • Removed conditional DOM description in message construction
  • Type Documentation (packages/shared/src/env/types.ts)

    • Added comprehensive JSDoc comments to IModelConfigForPlanning
    • Clearly documents VL mode requirement and lists all supported modes

Test Coverage

  • Added 6 new unit tests for Planning VL mode validation:

    1. Isolated mode - should throw error when planning has no VL mode
    2. Isolated mode - should succeed when planning has valid VL mode
    3. Normal mode - should throw error when planning has no VL mode
    4. Normal mode - should succeed when planning has valid VL mode
    5. Planning validation failure doesn't affect other intents
    6. Should accept all valid VL modes (tests all 7 VL mode values)
  • Fixed existing tests to provide VL mode for Planning intent

Supported VL Modes

  • qwen-vl
  • qwen3-vl
  • gemini
  • doubao-vision
  • vlm-ui-tars
  • vlm-ui-tars-doubao
  • vlm-ui-tars-doubao-1.5

Breaking Change ⚠️

Planning without VL mode configured will now throw an error.

Before this change, Planning could work with DOM-based mode. After this change,
users must configure a VL mode for Planning intent.

Error message example:

Planning requires a vision language model (VL model). DOM-based planning is not supported.

Please configure one of the following VL modes:
  - doubao-vision
  - gemini
  - qwen-vl
  - qwen3-vl
  - vlm-ui-tars
  - vlm-ui-tars-doubao
  - vlm-ui-tars-doubao-1.5
  
Configuration examples:
  - Environment variable: MIDSCENE_PLANNING_VL_MODE=qwen-vl
  - Or use modelConfig function with planning intent
  
Learn more: https://midscenejs.com/choose-a-model

Migration Guide

Users need to configure VL mode for Planning in one of two ways:

Option 1: Environment Variable

export MIDSCENE_PLANNING_VL_MODE=qwen-vl
export MIDSCENE_PLANNING_MODEL_NAME=qwen-vl-plus
export MIDSCENE_PLANNING_OPENAI_API_KEY=your-api-key

Option 2: modelConfig Function

const agent = new Agent({
  modelConfig: ({ intent }) => {
    if (intent === 'planning') {
      return {
        MIDSCENE_PLANNING_MODEL_NAME: 'qwen-vl-plus',
        MIDSCENE_PLANNING_OPENAI_API_KEY: 'your-api-key',
        MIDSCENE_PLANNING_VL_MODE: 'qwen-vl',
      };
    }
    // ... other intents
  },
});

Test Results

✅ All tests passing (17/17)
✅ Build successful
✅ TypeScript compilation clean

Impact Analysis

  • VQA not affected: VQA's domIncluded parameter functionality remains unchanged
  • Other intents not affected: Planning validation only applies to Planning intent
  • Configuration-level validation: Fails fast with clear error messages at initialization

🤖 Generated with Claude Code

This change ensures that Planning functionality only supports vision
language models (VL mode) and removes DOM-based planning support.

Changes:
- Add validation in ModelConfigManager.getModelConfig() to require
  VL mode for Planning intent
- Remove DOM mode logic from llm-planning.ts (describeUserPage,
  markupImageForLLM)
- Simplify image processing to only support VL mode paths
- Add comprehensive JSDoc documentation for Planning VL mode
  requirement
- Add 6 new unit tests covering Planning VL mode validation in both
  isolated and normal modes
- Fix existing tests to provide VL mode for Planning intent

Breaking Change:
- Planning without VL mode configured will now throw an error with
  clear instructions
- Error message includes all supported VL modes and configuration
  examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Enforce that Planning requires a VL mode and remove DOM-based planning paths to ensure consistency and performance.

  • Add validation in ModelConfigManager to require VL mode for planning and surface a detailed error message
  • Simplify planning implementation to pure vision pathways and remove DOM markup/description logic
  • Document VL mode requirement for planning in types and extend unit tests to cover validation and accepted modes

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File Description
packages/shared/src/env/model-config-manager.ts Adds VL mode validation for planning with detailed guidance in the thrown error
packages/core/src/ai-model/llm-planning.ts Removes DOM-based planning support, enforces VL mode, and simplifies image processing paths
packages/shared/src/env/types.ts Documents VL mode requirement and supported values for planning intent
packages/shared/tests/unit-test/env/modle-config-manager.test.ts Updates existing tests and adds new tests to validate VL mode enforcement and accepted modes

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

const manager = new ModelConfigManager(modelConfigFn);

expect(() => manager.getModelConfig('planning')).toThrow(
'Planning requires a vision language model (VL model). DOM-based planning is not supported.',
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion is brittle against the multi-line error message thrown by ModelConfigManager. Prefer a regex to assert the key part of the message, e.g. toThrow(/Planning requires a vision[- ]?language (model|mode)/), so formatting or additional guidance lines won't break the test.

Suggested change
'Planning requires a vision language model (VL model). DOM-based planning is not supported.',
/Planning requires a vision[- ]?language (model|mode)/,

Copilot uses AI. Check for mistakes.

// Planning should fail
expect(() => manager.getModelConfig('planning')).toThrow(
'Planning requires a vision language model',
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a regex for this partial-match assertion to avoid accidental false negatives if the error message format changes (e.g. toThrow(/Planning requires a vision[- ]?language/)).

Suggested change
'Planning requires a vision language model',
/Planning requires a vision[- ]?language/i,

Copilot uses AI. Check for mistakes.
Comment on lines +128 to +140
// Validate Planning must use VL mode
if (intent === 'planning' && !config.vlMode) {
throw new Error(
`Planning requires a vision language model (VL model). DOM-based planning is not supported.
Please configure one of the following VL modes:
${VL_MODES.map((mode) => `- ${mode}`).join('\n ')}
Configuration examples:
- Environment variable: MIDSCENE_PLANNING_VL_MODE=qwen-vl
- Or use modelConfig function with planning intent
Learn more: https://midscenejs.com/choose-a-model`,
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first sentence mixes 'VL model' with 'VL mode', while the rest of the message instructs users to configure VL modes. For consistency and clarity, consider: 'Planning requires a vision-language mode (VL mode). DOM-based planning is not supported.'

Copilot uses AI. Check for mistakes.
{ vlMode },
);
// Planning requires VL mode (validated by ModelConfigManager.getModelConfig)
assert(vlMode, 'Planning requires vlMode to be configured.');
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align this assertion message with the shared error wording used in ModelConfigManager for consistency (e.g. 'Planning requires a VL mode to be configured.').

Suggested change
assert(vlMode, 'Planning requires vlMode to be configured.');
assert(vlMode, 'Planning requires a VL mode to be configured.');

Copilot uses AI. Check for mistakes.
Comment on lines +62 to 66
// Reserved for qwen3-vl specific processing
// const paddedResult = await paddingToMatchBlockByBase64(imagePayload, 32);
// imageWidth = paddedResult.width;
// imageHeight = paddedResult.height;
// imagePayload = paddedResult.imageBase64;
Copy link

Copilot AI Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid keeping commented-out code; either remove it or replace with a TODO reference that links to a task/issue. If you plan to implement qwen3-vl specific processing, leave a concise TODO comment and remove the inactive code.

Suggested change
// Reserved for qwen3-vl specific processing
// const paddedResult = await paddingToMatchBlockByBase64(imagePayload, 32);
// imageWidth = paddedResult.width;
// imageHeight = paddedResult.height;
// imagePayload = paddedResult.imageBase64;
// TODO: Implement qwen3-vl specific image processing. See issue #1234.

Copilot uses AI. Check for mistakes.
@quanru quanru merged commit 80a2c97 into 1.0 Oct 17, 2025
2 checks passed
@quanru quanru deleted the feat/planning-vl-mode-only branch October 17, 2025 09:44
quanru added a commit that referenced this pull request Oct 17, 2025
This change ensures that Planning functionality only supports vision
language models (VL mode) and removes DOM-based planning support.

Changes:
- Add validation in ModelConfigManager.getModelConfig() to require
  VL mode for Planning intent
- Remove DOM mode logic from llm-planning.ts (describeUserPage,
  markupImageForLLM)
- Simplify image processing to only support VL mode paths
- Add comprehensive JSDoc documentation for Planning VL mode
  requirement
- Add 6 new unit tests covering Planning VL mode validation in both
  isolated and normal modes
- Fix existing tests to provide VL mode for Planning intent

Breaking Change:
- Planning without VL mode configured will now throw an error with
  clear instructions
- Error message includes all supported VL modes and configuration
  examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
quanru added a commit that referenced this pull request Oct 23, 2025
This change ensures that Planning functionality only supports vision
language models (VL mode) and removes DOM-based planning support.

Changes:
- Add validation in ModelConfigManager.getModelConfig() to require
  VL mode for Planning intent
- Remove DOM mode logic from llm-planning.ts (describeUserPage,
  markupImageForLLM)
- Simplify image processing to only support VL mode paths
- Add comprehensive JSDoc documentation for Planning VL mode
  requirement
- Add 6 new unit tests covering Planning VL mode validation in both
  isolated and normal modes
- Fix existing tests to provide VL mode for Planning intent

Breaking Change:
- Planning without VL mode configured will now throw an error with
  clear instructions
- Error message includes all supported VL modes and configuration
  examples

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants