Testing AI Artifacts: A Validation Framework
Most teams create rules and commands but never verify they work. Then they're surprised when the AI ignores guidance or produces inconsistent output. This post introduces a framework for validating AI artifacts with three layers of tests: structural, content, and behavioral.
The Cursor System series
- Beyond Rules — The four artifact types
- Agent Personas — Personas that stay in character
- Smart Routing — Match tasks to specialists
- Autonomous Workflows — Let agents chain safely
- Testing Artifacts (this post) — Catch broken rules before they break
- Meta-Learning — Agents that learn from failures
The Testing Challenge
Traditional code testing is straightforward:
Input → Function → Output (deterministic)
AI artifact testing is different:
Context + Artifact → LLM → Behavior (probabilistic)
How do you test something non-deterministic?
The answer: Test what you can control. Structure and content are deterministic. Behavior can be bounded.
The Artifact Testing Pyramid
flowchart TB
subgraph pyramid["Artifact Testing Pyramid"]
BT["Behavioral Tests<br/>Does it guide correctly?"]
CT["Content Tests<br/>Is content valid?"]
ST["Structural Tests<br/>Is format correct?"]
end
BT --> CT --> ST
style BT fill:#f9f,stroke:#333
style CT fill:#bbf,stroke:#333
style ST fill:#bfb,stroke:#333
Structural tests catch 80% of issues and are fully automated. Content tests verify quality and are mostly automated. Behavioral tests confirm actual guidance and require simulation.
Structural Tests
Structural tests verify the artifact is well-formed. These are fast, deterministic, and catch the most common errors.
For Rules
Structural Tests for Rules:
✓ Has YAML frontmatter (starts with ---)
✓ Frontmatter is valid YAML
✓ Has 'description' field (non-empty string)
✓ Has 'globs' or 'alwaysApply' (at least one)
✓ If 'globs', patterns are valid
✓ Markdown body exists after frontmatter
✓ Body has at least one heading
Example failures:
# FAILS: Missing description
---
globs: ["**/*.ts"]
---
# FAILS: Invalid glob pattern
---
description: "TypeScript rules"
globs: ["**/*.{ts"] # Unclosed brace
---
# FAILS: Empty body
---
description: "TypeScript rules"
globs: ["**/*.ts"]
---
(no content)
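The rule checks above can be automated in a few lines. A minimal sketch in Python; the naive `key: value` split stands in for a real YAML parser, and the function name is illustrative:

```python
import re

def check_rule_structure(text: str) -> list[str]:
    """Return structural failures for a rule file (empty list means pass)."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)\Z", text, re.DOTALL)
    if not match:
        return ["Missing YAML frontmatter (must start with ---)"]
    raw_front, body = match.groups()
    issues = []
    # Naive key: value split; a real implementation would use a YAML parser.
    keys = dict(line.split(":", 1) for line in raw_front.splitlines() if ":" in line)
    if not keys.get("description", "").strip().strip('"'):
        issues.append("Missing or empty 'description' field")
    if "globs" not in keys and "alwaysApply" not in keys:
        issues.append("Needs 'globs' or 'alwaysApply'")
    for glob in re.findall(r'"([^"]*)"', keys.get("globs", "")):
        if glob.count("{") != glob.count("}"):
            issues.append(f"Unbalanced braces in glob: {glob}")
    if not body.strip():
        issues.append("Empty body after frontmatter")
    elif not re.search(r"^#+\s", body, re.MULTILINE):
        issues.append("Body has no heading")
    return issues
```

Each example failure above would produce exactly one of these messages, so the checker doubles as documentation of the format.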
For Commands
Commands have different requirements:
Structural Tests for Commands:
✓ NO YAML frontmatter
✓ Has title starting with "# /"
✓ Has "## Instructions" section
✓ Has default behavior documented
✓ No duplicate heading levels
Example failures:
# FAILS: Has frontmatter (commands shouldn't)
---
description: "Review command"
---
# /review - Code Review
# FAILS: Missing instruction section
# /review - Code Review
Use this to review code.
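The command checks can likewise be automated. A minimal Python sketch, with two stated assumptions: default behavior lives under a `### Default Behavior` heading (as in this series), and "no duplicate heading levels" is read as no repeated heading text:

```python
import re

def check_command_structure(text: str) -> list[str]:
    """Return structural failures for a command file (empty list means pass)."""
    issues = []
    if text.lstrip().startswith("---"):
        issues.append("Commands must not have YAML frontmatter")
    if not re.search(r"^# /\S+", text, re.MULTILINE):
        issues.append('Missing title starting with "# /"')
    if "## Instructions" not in text:
        issues.append('Missing "## Instructions" section')
    if "### Default Behavior" not in text:  # assumed heading name
        issues.append("Default behavior not documented")
    # Interpreting "no duplicate heading levels" as: no repeated heading text.
    headings = re.findall(r"^#+ .+$", text, re.MULTILINE)
    for heading in sorted({h for h in headings if headings.count(h) > 1}):
        issues.append(f"Duplicate heading: {heading}")
    return issues
```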
For Agents
Structural Tests for Agents:
✓ Has YAML frontmatter
✓ Has 'name' field
✓ Has 'model' field
✓ Has 'description' field (the prompt)
✓ Description has Role section
✓ Description has Process section
✓ Description has Output Format section
Content Tests
Content tests verify quality—not just form, but substance.
Actionable Instructions
Content Tests - Actionability:
✓ Instructions contain verbs ("analyze", "check", "generate")
✓ No vague phrases ("write clean code", "be helpful")
✓ Steps are specific and completable
✓ Examples are concrete (not "do something like...")
Detecting vagueness:
import re

VAGUE_PATTERNS = [
    r"be\s+(helpful|careful|thorough)",
    r"write\s+(clean|good|better)\s+code",
    r"ensure\s+quality",
    r"do\s+something\s+like",
    r"etc\.?$",
    r"and\s+so\s+on",
]

def check_actionability(content):
    """Return a list of vague-pattern matches found in the artifact text."""
    issues = []
    for pattern in VAGUE_PATTERNS:
        if re.search(pattern, content, re.IGNORECASE):
            issues.append(f"Vague pattern: {pattern}")
    return issues
Description Quality
Content Tests - Description:
✓ Length > 20 characters
✓ Specific to one purpose
✓ No placeholder text ("TODO", "[insert here]")
✓ Would help AI decide relevance
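Description checks are heuristic by nature. A possible sketch, where the placeholder list and the "generic wording" regex are illustrative assumptions, not a canonical definition of quality:

```python
import re

# Hypothetical placeholder markers; extend to match your team's habits.
PLACEHOLDERS = ["TODO", "[insert here]", "TBD"]

def check_description(description: str) -> list[str]:
    """Flag descriptions that are too short, placeholder-ridden, or generic."""
    issues = []
    if len(description) <= 20:
        issues.append("Description too short (<= 20 chars)")
    for marker in PLACEHOLDERS:
        if marker.lower() in description.lower():
            issues.append(f"Contains placeholder text: {marker}")
    # Heuristic: generic wording suggests the rule covers more than one purpose.
    if re.search(r"\b(general|misc|various|stuff)\b", description, re.IGNORECASE):
        issues.append("Description sounds generic, not specific to one purpose")
    return issues
```

Whether a description "would help the AI decide relevance" ultimately needs the behavioral tests below; this only catches the obvious failures.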
Example Presence
Content Tests - Examples:
✓ Procedural rules have examples
✓ Commands have usage examples
✓ Complex patterns are illustrated
✓ Examples are syntactically valid
Behavioral Tests (Golden Tests)
Behavioral tests verify the artifact actually guides AI behavior correctly. These use "golden" input/output pairs.
Golden Test Format
# .cursor/tests/debug-command.golden.yaml
artifact: .cursor/commands/debug.md
scenarios:
- name: "basic_error"
input: "/debug TypeError: undefined is not a function"
expected_contains:
- "Root Cause"
- "Hypothesis"
expected_not_contains:
- "I don't know"
- "I'm not sure"
- name: "stack_trace"
input: "/debug [stack trace with 5 frames]"
expected_contains:
- "Evidence"
- "line"
expected_format:
sections:
- "## Bug Analysis"
- "### Symptoms"
- "### Root Cause"
- name: "no_information"
input: "/debug it's broken"
expected_contains:
- "What error"
- "more information"
expected_behavior: "asks_clarifying_questions"
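A golden file like this needs only a thin runner. One way to sketch it, where `run_artifact` is a stand-in for however you invoke the model with the artifact loaded; the containment checks themselves are deterministic:

```python
from typing import Callable

def run_golden_scenario(scenario: dict, run_artifact: Callable[[str], str]) -> list[str]:
    """Run one golden scenario; return failure messages (empty list means pass)."""
    output = run_artifact(scenario["input"])
    failures = []
    for needle in scenario.get("expected_contains", []):
        if needle not in output:
            failures.append(f"Missing expected text: {needle!r}")
    for needle in scenario.get("expected_not_contains", []):
        if needle in output:
            failures.append(f"Found forbidden text: {needle!r}")
    for section in scenario.get("expected_format", {}).get("sections", []):
        if section not in output:
            failures.append(f"Missing section: {section!r}")
    return failures
```

Because model output varies between runs, treat a single pass as weak evidence; running each scenario a few times and requiring all passes is a cheap way to tighten the bound.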
Golden Test for Rules
# .cursor/tests/naming-rule.golden.yaml
artifact: .cursor/rules/naming/RULE.md
scenarios:
- name: "variable_naming"
context: "Creating a variable for user count"
input: "Create a variable to store the number of users"
expected_contains:
- "userCount" # camelCase expected
expected_not_contains:
- "user_count" # snake_case not expected
- "UserCount" # PascalCase not expected
- name: "class_naming"
context: "Creating a class for authentication"
input: "Create a class that handles user authentication"
expected_contains:
- "UserAuthentication" # PascalCase expected
expected_not_contains:
- "userAuthentication" # camelCase not expected
Golden Test for Agents
# .cursor/tests/security-auditor.golden.yaml
artifact: .cursor/agents/security-auditor.md
scenarios:
- name: "sql_injection"
input: |
Review this code for security:
db.query(`SELECT * FROM users WHERE id = '${userId}'`)
expected_contains:
- "SQL injection"
- "parameterized"
- "Critical"
expected_format:
has_sections:
- "Security Audit Report"
- "Critical Issues"
- name: "safe_code"
input: |
Review this code for security:
db.query('SELECT * FROM users WHERE id = $1', [userId])
expected_not_contains:
- "Critical"
- "injection"
expected_contains:
- "No critical issues"
The /test-artifact Command
Implement testing as a command:
# /test-artifact - Test Cursor Artifacts
Run structural, content, and behavioral tests on Cursor artifacts.
## Instructions
When the user invokes `/test-artifact`:
1. Identify the target artifact(s)
2. Determine artifact type (rule, command, agent)
3. Run structural tests for that type
4. Run content tests
5. Run behavioral tests if golden file exists
6. Report results in standard format
### Default Behavior
Test the current file if it's an artifact, or ask for target.
## Variants
### `/test-artifact @path`
Test specific artifact at path.
### `/test-artifact all`
Test all artifacts in .cursor/
### `/test-artifact all --type rules`
Test all artifacts of specific type.
### `/test-artifact --interactive`
Run behavioral tests interactively.
### `/test-artifact --regression`
Compare to last known good state.
## Output Format
## Artifact Test Report
**Artifact**: [path]
**Type**: [Rule/Command/Agent]
**Status**: ✅ Pass | ⚠️ Warnings | ❌ Fail
### Structural Tests
| Check | Status | Notes |
| ------- | -------- | --------- |
| [check] | ✅/⚠️/❌ | [details] |
### Content Tests
| Check | Status | Notes |
| ------- | -------- | --------- |
| [check] | ✅/⚠️/❌ | [details] |
### Behavioral Tests
| Scenario | Status | Notes |
| -------- | -------- | --------- |
| [name] | ✅/⚠️/❌ | [details] |
### Issues Found
1. [severity] [description]
- Suggestion: [fix]
### Summary
- Structural: X/Y ✓
- Content: X/Y ✓
- Behavioral: X/Y ✓
Test Execution Example
User: /test-artifact .cursor/rules/naming/RULE.md
Artifact Test Report
Artifact: .cursor/rules/naming/RULE.md
Type: Rule
Status: ⚠️ Warnings
Structural Tests:
| Check | Status | Notes |
|---|---|---|
| YAML frontmatter | ✅ Pass | |
| Description | ✅ Pass | 43 chars |
| Globs valid | ✅ Pass | ["**/*.ts", "**/*.js"] |
| Body exists | ✅ Pass | |
Content Tests:
| Check | Status | Notes |
|---|---|---|
| Actionable | ✅ Pass | Contains verbs |
| No vague phrases | ✅ Pass | |
| Has examples | ⚠️ Warn | Only 1 example |
| No placeholders | ✅ Pass | |
Behavioral Tests:
| Scenario | Status | Notes |
|---|---|---|
| variable_naming | ✅ Pass | Produced "userCount" |
| class_naming | ✅ Pass | Produced "UserAuth..." |
Issues Found:
- ⚠️ Only 1 example provided — Suggestion: Add 2-3 examples for edge cases
Summary: Structural 4/4 ✓ | Content 3/4 (1 warning) | Behavioral 2/2 ✓
Interactive Testing Mode
For behavioral tests that need human judgment:
User: /test-artifact --interactive
Starting interactive test session...
Testing: .cursor/commands/debug.md
Scenario 1: basic_error
Simulating: User says "debug this: TypeError undefined"
[Agent produces debug output...]
Did the command behave correctly? (y/n/details)
User: n
What was wrong?
> It didn't analyze the stack trace
Logged. Recording failure for scenario 'basic_error'.
Suggested fix: Add explicit instruction to parse stack traces in the debug command's Process section.
Continue to next scenario? (y/n)
Regression Testing
After modifying artifacts, verify you didn't break existing behavior:
User: /test-artifact --regression
Comparing to last known good state...
Changed artifacts:
.cursor/commands/debug.md
- Added: --trace variant
- Modified: default behavior
Running behavioral tests for changed artifacts...
Results:
- ✅ basic_error: Still passes
- ✅ stack_trace: Still passes
- ✅ no_information: Still passes
- ✅ NEW: trace_variant: Passes
All existing behaviors preserved. New variant works correctly.
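One way to sketch the comparison behind `--regression`: store each scenario's pass/fail result as a JSON snapshot and diff the current run against it. The file name and layout here are assumptions:

```python
import json
from pathlib import Path

def regression_diff(current: dict[str, bool], baseline_path: Path) -> list[str]:
    """Return scenarios that passed in the baseline but fail now."""
    baseline = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    regressions = [
        name for name, passed in current.items()
        if baseline.get(name, False) and not passed
    ]
    if not regressions:
        # Only promote a fully clean run to the new "last known good" state.
        baseline_path.write_text(json.dumps(current, indent=2))
    return regressions
```

Updating the snapshot only on clean runs keeps the baseline honest: a failing run never silently becomes the new reference point.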
Integration Tests
Verify artifacts work together:
Integration Tests:
✓ Commands referenced in rules exist
✓ Agents referenced in commands exist
✓ No circular dependencies between rules
✓ Glob patterns don't conflict
✓ Subsumption matrix is consistent
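The first integration check can be sketched as a cross-reference scan. The directory layout (`.cursor/commands/*.md`) and the slash-command regex are assumptions:

```python
import re
from pathlib import Path

def find_missing_commands(rule_text: str, commands_dir: Path) -> list[str]:
    """Return slash-commands mentioned in a rule with no matching command file."""
    # Match "/name" tokens not embedded in paths or URLs (heuristic).
    referenced = set(re.findall(r"(?<![\w/])/([a-z][\w-]*)", rule_text))
    existing = {p.stem for p in commands_dir.glob("*.md")}
    return sorted(referenced - existing)
```

The same pattern extends to agents referenced in commands; the circular-dependency and glob-conflict checks need a small graph walk rather than a set difference.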
Validation Checklist
Use this before committing any artifact:
Rules
- [ ] YAML frontmatter is valid
- [ ] Description is specific (not generic)
- [ ] Globs are correct and not overly broad
- [ ] Instructions are actionable (contain verbs)
- [ ] At least one concrete example
- [ ] No vague phrases ("be helpful", "ensure quality")
- [ ] No conflicting guidance with other rules
Commands
- [ ] NO frontmatter
- [ ] Title format: `# /command-name - Description`
- [ ] Has `## Instructions` section
- [ ] Default behavior documented
- [ ] Variants documented
- [ ] Output format specified
- [ ] At least one usage example
Agents
- [ ] YAML frontmatter with name, model, description
- [ ] Role is specific (not "helpful assistant")
- [ ] Expertise areas listed
- [ ] Process has numbered steps
- [ ] Output format is explicit
- [ ] Constraints are defined
Key Takeaways
- Test what you can control. Structure and content are deterministic. Test them first.
- Structural tests catch 80% of issues. Invalid YAML, missing fields, wrong format: all detectable automatically.
- Golden tests bound behavior. You can't test for exact output, but you can test for the presence of expected elements.
- Interactive testing fills gaps. Some things need human judgment. Build that into the process.
- Regression testing prevents breakage. After changes, verify existing behaviors still work.
- Make testing a command. `/test-artifact` should be as easy to run as any other command.
What's Next
You can now test artifacts. But how do you know which artifacts to create? How do you identify patterns in your usage that suggest new rules or commands?
The final post in this series covers the meta-learning system—building feedback loops that learn from real usage and propose improvements.
Next: The Meta-Learning System: AI That Improves Itself — Observation, pattern detection, and automated improvement proposals.