
Add provider for CUA Cloud V2 batch job execution #1079

Open

r33drichards wants to merge 3 commits into main from claude/connect-cuabench-cloud-vm-tEIzq

Conversation

@r33drichards
Collaborator

@r33drichards r33drichards commented Feb 12, 2026

Summary

This PR adds support for running CUABench evaluations on Incus VMs through the CUA Cloud V2 API. The new IncusProvider enables batch job execution with automatic VM provisioning and solver container orchestration.

Key Changes

  • New IncusProvider class (libs/cua-bench/cua_bench/sessions/providers/incus.py):

    • Implements SessionProvider interface for Incus VM-based benchmark execution
    • Supports CUA Cloud V2 /v1/batch-jobs API for creating and managing batch jobs
    • Each batch job provisions N Incus VMs (one per task) running cua-xfce desktop + solver container
    • Handles API authentication via CUA_API_KEY environment variable or stored credentials
    • Implements core session lifecycle methods: start_session, get_session_status, stop_session, get_session_logs, get_results, and list_tasks
  • Updated provider factory (libs/cua-bench/cua_bench/sessions/manager.py):

    • Added support for the "incus" and "cloudv2" provider names in the make() factory function (a minimal sketch follows this list)
    • Updated the error message to document all supported providers
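
A minimal sketch of what the factory registration could look like; the exact make() signature, the import path, and the IncusProvider constructor arguments are assumptions for illustration, not the code as merged:

    # Hypothetical sketch of the provider registration in
    # cua_bench/sessions/manager.py; signatures are illustrative.
    from cua_bench.sessions.providers.incus import IncusProvider


    def make(provider: str, **kwargs):
        """Return a session provider instance for the given provider name."""
        if provider in ("incus", "cloudv2"):
            # Both names resolve to the CUA Cloud V2 batch-job backed provider.
            return IncusProvider(**kwargs)
        raise ValueError(
            f"Unknown provider '{provider}'. Supported providers: incus, cloudv2, ..."
        )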

Notable Implementation Details

  • Authentication: Supports both environment variable (CUA_API_KEY) and SQLite credential storage (~/.cua/cli.sqlite)
  • Configuration: Accepts flexible solver configuration including agent name, model, max steps, parallelism, and VM image selection
  • Environment variables: Automatically passes API keys (Anthropic, OpenAI, Google) through to solver containers
  • Status mapping: Maps CUA Cloud batch job phases to local status values (pending, starting, running, completed, failed, stopped); a sketch of this mapping and the env-var passthrough follows this list
  • Async HTTP client: Uses aiohttp with proper session management and timeout handling
  • Error handling: Provides specific error messages for authentication failures, rate limiting, and connection issues
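
As a rough illustration of the status mapping and API-key passthrough described above: the local status values come from the list, but the cloud-side phase names and the exact environment-variable names are assumptions, not taken from the PR:

    # Illustrative only: phase names and env-var names are assumed.
    import os

    # Map CUA Cloud batch-job phases onto the provider's local status values.
    PHASE_TO_STATUS = {
        "pending": "pending",
        "provisioning": "starting",
        "running": "running",
        "succeeded": "completed",
        "failed": "failed",
        "cancelled": "stopped",
    }

    # Forward well-known model-provider API keys into the solver containers.
    PASSTHROUGH_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY")


    def solver_env() -> dict:
        """Collect the API keys to inject into solver containers."""
        return {name: os.environ[name] for name in PASSTHROUGH_VARS if name in os.environ}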

https://claude.ai/code/session_01N62q5oNTPtXfTNZXsqiCyH

Summary by CodeRabbit

  • New Features
    • Introduced a new CUA Cloud provider enabling benchmark execution on cloud virtual machines, with support for both API key and local credential authentication
    • The cloud provider supports full session management, including batch job submission, status tracking, log retrieval, and results collection with optional pagination
    • Integrates seamlessly with the existing provider selection mechanism

Add a new session provider that hits the /v1/batch-jobs API to run CUABench
evaluations on Incus VMs via the CloudV2 infrastructure. Supports arbitrary
solver images, configurable parallelism, and per-task timeouts.

Register as 'incus' or 'cloudv2' provider in the session manager.

https://claude.ai/code/session_01N62q5oNTPtXfTNZXsqiCyH
@vercel
Contributor

vercel bot commented Feb 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions          | Updated (UTC)
docs    | Ready      | Preview, Comment | Feb 12, 2026 4:57am


@coderabbitai

coderabbitai bot commented Feb 12, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🏷️ Required labels (at least one)
  • rabbit

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Feb 12, 2026

📦 Publishable packages changed

  • pypi/bench

Add release:<service> labels to auto-release on merge (+ optional bump:minor or bump:major, default is patch).
Or add no-release to skip.

@r33drichards r33drichards changed the title from "Add Incus provider for CUA Cloud V2 batch job execution" to "Add provider for CUA Cloud V2 batch job execution" on Feb 12, 2026
Rename the provider file from incus.py to cua_cloud.py and the class
from IncusProvider to CuaCloudProvider. Register as 'cua_cloud' or
'cloudv2' in the session manager.

https://claude.ai/code/session_01N62q5oNTPtXfTNZXsqiCyH

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In libs/cua-bench/cua_bench/sessions/providers/cua_cloud.py:
- Around lines 58-75: The provider opens an aiohttp.ClientSession in
_get_http_client but never ensures _close_http_client is called, leaking
connections. Add async context manager support on the provider (implement
__aenter__ to return self and __aexit__ to await self._close_http_client()),
or expose a public async close() that awaits _close_http_client, and update
callers to use "async with <Provider>()" or call await provider.close().
Reference the existing _get_http_client and _close_http_client methods when
adding the lifecycle methods so the session is always closed.
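
A minimal sketch of the suggested lifecycle change, assuming the provider keeps a single aiohttp.ClientSession and already has the _get_http_client/_close_http_client helpers the review mentions; the class name, attribute names, and timeout value are illustrative:

    # Sketch only: class name, attribute names, and timeout are assumptions.
    from typing import Optional

    import aiohttp


    class CuaCloudProvider:
        def __init__(self) -> None:
            self._http: Optional[aiohttp.ClientSession] = None

        async def _get_http_client(self) -> aiohttp.ClientSession:
            # Lazily create one shared session with a default timeout.
            if self._http is None or self._http.closed:
                self._http = aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=60))
            return self._http

        async def _close_http_client(self) -> None:
            if self._http is not None and not self._http.closed:
                await self._http.close()
            self._http = None

        async def close(self) -> None:
            """Public cleanup hook so callers can release the HTTP session."""
            await self._close_http_client()

        async def __aenter__(self) -> "CuaCloudProvider":
            return self

        async def __aexit__(self, exc_type, exc, tb) -> None:
            await self._close_http_client()

Callers can then wrap usage in "async with CuaCloudProvider() as provider: ..." or call "await provider.close()" in a finally block, so the session is always released.
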
🧹 Nitpick comments (4)
libs/cua-bench/cua_bench/sessions/providers/cua_cloud.py (4)

40-51: SQLite connection should use context manager; silent exception swallowing may hide issues.

The connection isn't guaranteed to close if an exception occurs between connect() and close(). Additionally, catching all exceptions silently could mask important errors (e.g., permission issues, corrupted database).

♻️ Proposed fix using context manager
         if creds_path.exists():
             try:
                 import sqlite3

-                conn = sqlite3.connect(str(creds_path))
-                cursor = conn.cursor()
-                cursor.execute("SELECT value FROM credentials WHERE key = 'api_key'")
-                row = cursor.fetchone()
-                conn.close()
-                if row:
-                    return row[0]
-            except Exception:
-                pass
+                with sqlite3.connect(str(creds_path)) as conn:
+                    cursor = conn.cursor()
+                    cursor.execute("SELECT value FROM credentials WHERE key = 'api_key'")
+                    row = cursor.fetchone()
+                    if row:
+                        return row[0]
+            except (sqlite3.Error, OSError):
+                pass  # Fall through to raise ValueError below
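
One caveat on the proposed fix: sqlite3's connection context manager only manages transactions and does not close the connection, so guaranteeing the close needs contextlib.closing (or an explicit try/finally). A sketch, reusing the path and query from the diff above:

    import sqlite3
    from contextlib import closing
    from pathlib import Path

    creds_path = Path.home() / ".cua" / "cli.sqlite"
    api_key = None
    if creds_path.exists():
        try:
            # closing() guarantees conn.close() even if the query raises.
            with closing(sqlite3.connect(str(creds_path))) as conn:
                row = conn.execute(
                    "SELECT value FROM credentials WHERE key = 'api_key'"
                ).fetchone()
                if row:
                    api_key = row[0]
        except (sqlite3.Error, OSError):
            pass  # fall through to other credential sources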

220-236: Phase mapping is case-sensitive; consider normalizing.

If the API returns phases with different casing (e.g., "Pending" vs "pending"), the mapping will fall through to the raw value, potentially causing inconsistent status handling downstream.

♻️ Normalize phase to lowercase
         # Map batch job phase to local status
-        phase = result.get("phase", "unknown")
+        phase = result.get("phase", "unknown").lower()
         status_map = {

267-291: Method returns status summary, not logs; consider clarifying.

The method name get_session_logs suggests log retrieval, but it returns a status summary. This is likely adapting to the SessionProvider interface where actual logs aren't available from the batch API. The docstring correctly describes the behavior, but consider adding a note explaining why logs aren't available.


347-367: Client-side pagination could be inefficient for large task lists.

All results are fetched before applying the status filter and pagination locally. For batch jobs with many tasks, this fetches more data than needed. If the API supports server-side filtering/pagination, consider using it.
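
For reference, the client-side behavior described here amounts to a filter-then-slice over the fully fetched result list, roughly like the helper below (names are illustrative, not the PR's actual code):

    from typing import Optional


    def paginate_results(
        results: list[dict],
        status: Optional[str] = None,
        offset: int = 0,
        limit: Optional[int] = None,
    ) -> list[dict]:
        """Client-side filter + slice applied after all results are fetched."""
        if status is not None:
            results = [r for r in results if r.get("status") == status]
        end = offset + limit if limit is not None else None
        return results[offset:end]

If the batch-jobs API accepts filter or pagination parameters, pushing them into the request would avoid transferring results that are discarded locally.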

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@github-actions
Contributor

github-actions bot commented Feb 12, 2026

📦 Publishable packages changed

  • pypi/bench — will auto-release on merge


Labels

release:pypi/bench Release pypi/bench on merge

3 participants