Merged
2 changes: 2 additions & 0 deletions src/core/tasks/review-adversarial-general.xml
@@ -6,6 +6,8 @@

<inputs>
  <input name="content" desc="Content to review - diff, spec, story, doc, or any artifact" />
  <input name="also_consider" required="false"
         desc="Optional areas to keep in mind during review alongside normal adversarial analysis" />
</inputs>

<llm critical="true">
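To illustrate what the hunk above adds, the following minimal sketch parses the `<inputs>` element and lists which inputs are optional. The `list_inputs` helper is hypothetical, and treating an input without a `required` attribute as required is an assumption; the diff does not say how the task runner interprets a missing attribute.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the diff; only the <inputs> element is reproduced here.
INPUTS_XML = """
<inputs>
  <input name="content" desc="Content to review - diff, spec, story, doc, or any artifact" />
  <input name="also_consider" required="false"
         desc="Optional areas to keep in mind during review alongside normal adversarial analysis" />
</inputs>
"""


def list_inputs(xml_text):
    """Return (name, is_required) pairs for each <input> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for node in root.findall("input"):
        # Assumption: a missing `required` attribute means the input is required.
        required = node.get("required", "true").lower() != "false"
        pairs.append((node.get("name"), required))
    return pairs


for name, required in list_inputs(INPUTS_XML):
    print(name, "required" if required else "optional")
```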
56 changes: 56 additions & 0 deletions test/adversarial-review-tests/README.md
@@ -0,0 +1,56 @@
# Adversarial Review Test Suite

Tests for the `also_consider` optional input in `review-adversarial-general.xml`.

## Purpose

Evaluate whether the `also_consider` input gently nudges the reviewer toward specific areas without overriding normal adversarial analysis.

## Test Content

All tests use `sample-content.md` - a deliberately imperfect User Authentication API doc with:

- Vague error handling section
- Missing rate limit details
- No token expiration info
- Password in plain text example
- Missing authentication headers
- No error response examples

## Running Tests

For each test case in `test-cases.yaml`, invoke the adversarial review task.

### Manual Test Invocation

```
Review this content using the adversarial review task:

<content>
[paste sample-content.md]
</content>

<also_consider>
[paste items from test case, or omit for TC01]
</also_consider>
```
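The manual invocation can also be assembled programmatically. This is a minimal sketch, assuming the `also_consider` items are rendered as a dash-prefixed list; `build_review_prompt` is a hypothetical helper, not part of the task itself.

```python
def build_review_prompt(content, also_consider=None):
    """Assemble the manual-invocation prompt.

    The <also_consider> block is omitted when no items are given,
    which corresponds to the TC01 baseline.
    """
    parts = [
        "Review this content using the adversarial review task:",
        "",
        "<content>",
        content.strip(),
        "</content>",
    ]
    if also_consider:
        parts += ["", "<also_consider>"]
        # Assumption: one dash-prefixed line per item.
        parts += [f"- {item}" for item in also_consider]
        parts.append("</also_consider>")
    return "\n".join(parts)


print(build_review_prompt("# User Authentication API", ["Error handling completeness"]))
```

Passing `None` or an empty list leaves the block out entirely rather than emitting an empty tag.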

## Evaluation Criteria

For each test, note:

1. **Total findings** - Still hitting ~10 issues?
2. **Distribution** - Are findings spread across concerns or clustered?
3. **Relevance** - Do findings relate to `also_consider` items when provided?
4. **Balance** - Are `also_consider` findings elevated over others, or naturally mixed?
5. **Quality** - Are findings actionable regardless of source?
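Criteria 1, 2 and 4 can be tallied once findings have been labelled by hand. The sketch below assumes each finding is recorded as a dict with a `topic` label and a `nudged` flag marking whether it relates to an `also_consider` item; both field names and the `summarize_findings` helper are hypothetical. Relevance and quality still need human judgment.

```python
from collections import Counter


def summarize_findings(findings):
    """Summarize hand-labelled findings: count, topic spread, nudged share."""
    total = len(findings)
    by_topic = Counter(f["topic"] for f in findings)
    nudged = sum(1 for f in findings if f["nudged"])
    return {
        "total": total,
        "by_topic": dict(by_topic),
        # Share of findings that relate to an also_consider item.
        "nudged_share": nudged / total if total else 0.0,
    }


# Illustrative labels only, not real review output.
example = [
    {"topic": "security", "nudged": True},
    {"topic": "errors", "nudged": False},
    {"topic": "errors", "nudged": False},
    {"topic": "rate-limiting", "nudged": False},
]
print(summarize_findings(example))
```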

## Expected Outcomes

- **TC01 (baseline)**: Generic spread of findings
- **TC02-TC05 (domain-focused)**: Some findings align with domain, others still organic
- **TC06 (single item)**: Light influence, not dominant
- **TC07 (vague items)**: Minimal change from baseline
- **TC08 (specific items)**: Direct answers if gaps exist
- **TC09 (mixed)**: Balanced across domains
- **TC10 (contradictory)**: Graceful handling
46 changes: 46 additions & 0 deletions test/adversarial-review-tests/sample-content.md
@@ -0,0 +1,46 @@
# User Authentication API

## Overview

This API provides endpoints for user authentication and session management.

## Endpoints

### POST /api/auth/login

Authenticates a user and returns a token.

**Request Body:**
```json
{
  "email": "user@example.com",
  "password": "password123"
}
```

**Response:**
```json
{
  "token": "eyJhbGciOiJIUzI1NiIs...",
  "user": {
    "id": 1,
    "email": "user@example.com"
  }
Comment on lines +21 to +28
⚠️ Potential issue | 🟠 Major

Replace the JWT-like token to avoid secret-scanner hits.
Gitleaks flagged the token string as a generic API key. Even as sample content, this can fail CI or encourage unsafe copy‑paste. Replace it with an obvious placeholder or redacted pattern.

🧰 Tools
🪛 Gitleaks (8.30.0)

[high] 24-24: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.

(generic-api-key)

🤖 Prompt for AI Agents
In `@test/adversarial-review-tests/sample-content.md` around lines 21 - 28,
Replace the JWT-like string under the JSON "token" key with an explicit
non-secret placeholder to avoid secret-scanner hits: locate the block containing
"token": "eyJhbGciOiJIUzI1NiIs..." in the sample response and change the value
to a clearly redacted pattern such as "REDACTED_TOKEN" or "<REDACTED_JWT>" so
the file contains no real-looking secrets.

}
```

### POST /api/auth/logout

Logs out the current user.

### GET /api/auth/me

Returns the current user's profile.

## Error Handling

Errors return appropriate HTTP status codes.

## Rate Limiting

Rate limiting is applied to prevent abuse.
103 changes: 103 additions & 0 deletions test/adversarial-review-tests/test-cases.yaml
@@ -0,0 +1,103 @@
# Test Cases for review-adversarial-general.xml with also_consider input
#
# Purpose: Evaluate how the optional also_consider input influences review findings
# Content: All tests use sample-content.md (User Authentication API docs)
#
# To run: Manually invoke the task with each configuration and compare outputs

test_cases:
  # BASELINE - No also_consider
  - id: TC01
    name: "Baseline - no also_consider"
    description: "Control test with no also_consider input"
    also_consider: null
    expected_behavior: "Generic adversarial findings across all aspects"

  # DOCUMENTATION-FOCUSED
  - id: TC02
    name: "Documentation - reader confusion"
    description: "Nudge toward documentation UX issues"
    also_consider:
      - What would confuse a first-time reader?
      - What questions are left unanswered?
      - What could be interpreted multiple ways?
      - What jargon is unexplained?
    expected_behavior: "More findings about clarity, completeness, reader experience"

  - id: TC03
    name: "Documentation - examples and usage"
    description: "Nudge toward practical usage gaps"
    also_consider:
      - Missing code examples
      - Unclear usage patterns
      - Edge cases not documented
    expected_behavior: "More findings about practical application gaps"

  # SECURITY-FOCUSED
  - id: TC04
    name: "Security review"
    description: "Nudge toward security concerns"
    also_consider:
      - Authentication vulnerabilities
      - Token handling issues
      - Input validation gaps
      - Information disclosure risks
    expected_behavior: "More security-related findings"

  # API DESIGN-FOCUSED
  - id: TC05
    name: "API design"
    description: "Nudge toward API design best practices"
    also_consider:
      - REST conventions not followed
      - Inconsistent response formats
      - Missing pagination or filtering
      - Versioning concerns
    expected_behavior: "More API design pattern findings"

  # SINGLE ITEM
  - id: TC06
    name: "Single item - error handling"
    description: "Test with just one also_consider item"
    also_consider:
      - Error handling completeness
    expected_behavior: "Some emphasis on error handling while still covering other areas"

  # BROAD/VAGUE
  - id: TC07
    name: "Broad items"
    description: "Test with vague also_consider items"
    also_consider:
      - Quality issues
      - Things that seem off
    expected_behavior: "Minimal change from baseline - items too vague to steer"

  # VERY SPECIFIC
  - id: TC08
    name: "Very specific items"
    description: "Test with highly specific also_consider items"
    also_consider:
      - Is the JWT token expiration documented?
      - Are refresh token mechanics explained?
      - What happens on concurrent sessions?
    expected_behavior: "Specific findings addressing these exact questions if gaps exist"

  # MIXED DOMAINS
  - id: TC09
    name: "Mixed domain concerns"
    description: "Test with items from different domains"
    also_consider:
      - Security vulnerabilities
      - Reader confusion points
      - API design inconsistencies
      - Performance implications
    expected_behavior: "Balanced findings across multiple domains"

  # CONTRADICTORY/UNUSUAL
  - id: TC10
    name: "Contradictory items"
    description: "Test resilience with odd inputs"
    also_consider:
      - Things that are too detailed
      - Things that are not detailed enough
    expected_behavior: "Reviewer handles gracefully, finds issues in both directions"
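Because the cases are run by hand, a quick structural check before a test session can catch copy-paste slips. The sketch below assumes the YAML has already been loaded into a list of dicts by whatever parser is available; `check_test_cases` and the rules it enforces are illustrative assumptions, not part of this PR.

```python
def check_test_cases(cases):
    """Return a list of problems found in test-case dicts (empty if none)."""
    problems = []
    seen = set()
    for case in cases:
        case_id = case.get("id", "<missing id>")
        if case_id in seen:
            problems.append(f"{case_id}: duplicate id")
        seen.add(case_id)
        for key in ("name", "description", "expected_behavior"):
            if not case.get(key):
                problems.append(f"{case_id}: missing {key}")
        items = case.get("also_consider")
        # Assumption: also_consider is either null or a non-empty list of strings.
        valid_list = (
            isinstance(items, list)
            and len(items) > 0
            and all(isinstance(item, str) for item in items)
        )
        if items is not None and not valid_list:
            problems.append(f"{case_id}: also_consider must be null or a non-empty list of strings")
    return problems


# Two cases copied from the file above, as they would look after loading.
cases = [
    {
        "id": "TC01",
        "name": "Baseline - no also_consider",
        "description": "Control test with no also_consider input",
        "also_consider": None,
        "expected_behavior": "Generic adversarial findings across all aspects",
    },
    {
        "id": "TC06",
        "name": "Single item - error handling",
        "description": "Test with just one also_consider item",
        "also_consider": ["Error handling completeness"],
        "expected_behavior": "Some emphasis on error handling while still covering other areas",
    },
]
print(check_test_cases(cases))
```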