Testing AI Artifacts: A Validation Framework
Most teams create rules and commands but never verify they work. Then they're surprised when the AI ignores guidance or produces inconsistent output. This post introduces a framework for validating AI artifacts with three layers of tests: structural, content, and behavioral.
The Cursor System series
- Beyond Rules — The four artifact types
- Agent Personas — Personas that stay in character
- Smart Routing — Match tasks to specialists
- Autonomous Workflows — Let agents chain safely
- Testing Artifacts (this post) — Catch broken rules before they break
- Meta-Learning — Agents that learn from failures
The Testing Challenge
Traditional code testing is straightforward:
Input → Function → Output (deterministic)
AI artifact testing is different:
Context + Artifact → LLM → Behavior (probabilistic)
How do you test something non-deterministic?
The answer: Test what you can control. Structure and content are deterministic. Behavior can be bounded.
The Artifact Testing Pyramid
flowchart TB
subgraph pyramid["Artifact Testing Pyramid"]
BT["Behavioral Tests<br/>Does it guide correctly?"]
CT["Content Tests<br/>Is content valid?"]
ST["Structural Tests<br/>Is format correct?"]
end
BT --> CT --> ST
style BT fill:#f9f,stroke:#333
style CT fill:#bbf,stroke:#333
style ST fill:#bfb,stroke:#333
Structural tests catch 80% of issues and are fully automated. Content tests verify quality and are mostly automated. Behavioral tests confirm actual guidance and require simulation.
Structural Tests
Structural tests verify the artifact is well-formed. These are fast, deterministic, and catch the most common errors.
For Rules
Structural Tests for Rules:
✓ Has YAML frontmatter (starts with ---)
✓ Frontmatter is valid YAML
✓ Has 'description' field (non-empty string)
✓ Has 'globs' or 'alwaysApply' (at least one)
✓ If 'globs', patterns are valid
✓ Markdown body exists after frontmatter
✓ Body has at least one heading
Example failures:
# FAILS: Missing description
---
globs: ["**/*.ts"]
---
# FAILS: Invalid glob pattern
---
description: "TypeScript rules"
globs: ["**/*.{ts"] # Unclosed brace
---
# FAILS: Empty body
---
description: "TypeScript rules"
globs: ["**/*.ts"]
---
(no content)
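The rule checks above can be automated in a few lines. A minimal sketch in Python; the naive `key: value` split stands in for a real YAML parser, and the function name is illustrative:

```python
import re

def check_rule_structure(text: str) -> list[str]:
    """Return structural failures for a rule file (empty list means pass)."""
    match = re.match(r"^---\n(.*?)\n---\n?(.*)\Z", text, re.DOTALL)
    if not match:
        return ["Missing YAML frontmatter (must start with ---)"]
    raw_front, body = match.groups()
    issues = []
    # Naive key: value split; a real implementation would use a YAML parser.
    keys = dict(line.split(":", 1) for line in raw_front.splitlines() if ":" in line)
    if not keys.get("description", "").strip().strip('"'):
        issues.append("Missing or empty 'description' field")
    if "globs" not in keys and "alwaysApply" not in keys:
        issues.append("Needs 'globs' or 'alwaysApply'")
    for glob in re.findall(r'"([^"]*)"', keys.get("globs", "")):
        if glob.count("{") != glob.count("}"):
            issues.append(f"Unbalanced braces in glob: {glob}")
    if not body.strip():
        issues.append("Empty body after frontmatter")
    elif not re.search(r"^#+\s", body, re.MULTILINE):
        issues.append("Body has no heading")
    return issues
```

Each example failure above would produce exactly one of these messages, so the checker doubles as documentation of the format.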
For Commands
Commands have different requirements:
Structural Tests for Commands:
✓ NO YAML frontmatter
✓ Has title starting with "# /"
✓ Has "## Instructions" section
✓ Has default behavior documented
✓ No duplicate heading levels
Example failures:
# FAILS: Has frontmatter (commands shouldn't)
---
description: "Review command"
---
# /review - Code Review
# FAILS: Missing instruction section
# /review - Code Review
Use this to review code.
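The command checks can likewise be automated. A minimal Python sketch, with two stated assumptions: default behavior lives under a `### Default Behavior` heading (as in this series), and "no duplicate heading levels" is read as no repeated heading text:

```python
import re

def check_command_structure(text: str) -> list[str]:
    """Return structural failures for a command file (empty list means pass)."""
    issues = []
    if text.lstrip().startswith("---"):
        issues.append("Commands must not have YAML frontmatter")
    if not re.search(r"^# /\S+", text, re.MULTILINE):
        issues.append('Missing title starting with "# /"')
    if "## Instructions" not in text:
        issues.append('Missing "## Instructions" section')
    if "### Default Behavior" not in text:  # assumed heading name
        issues.append("Default behavior not documented")
    # Interpreting "no duplicate heading levels" as: no repeated heading text.
    headings = re.findall(r"^#+ .+$", text, re.MULTILINE)
    for heading in sorted({h for h in headings if headings.count(h) > 1}):
        issues.append(f"Duplicate heading: {heading}")
    return issues
```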
For Agents
Structural Tests for Agents:
✓ Has YAML frontmatter
✓ Has 'name' field
✓ Has 'model' field
✓ Has 'description' field (the prompt)
✓ Description has Role section
✓ Description has Process section
✓ Description has Output Format section
Content Tests
Content tests verify quality—not just form, but substance.
Actionable Instructions
Content Tests - Actionability:
✓ Instructions contain verbs ("analyze", "check", "generate")
✓ No vague phrases ("write clean code", "be helpful")
✓ Steps are specific and completable
✓ Examples are concrete (not "do something like...")
Detecting vagueness:
import re

VAGUE_PATTERNS = [
    r"be\s+(helpful|careful|thorough)",
    r"write\s+(clean|good|better)\s+code",
    r"ensure\s+quality",
    r"do\s+something\s+like",
    r"etc\.?$",
    r"and\s+so\s+on",
]

def check_actionability(content):
    """Return a list of vague-pattern matches found in the artifact text."""
    issues = []
    for pattern in VAGUE_PATTERNS:
        if re.search(pattern, content, re.IGNORECASE):
            issues.append(f"Vague pattern: {pattern}")
    return issues
Description Quality
Content Tests - Description:
✓ Length > 20 characters
✓ Specific to one purpose
✓ No placeholder text ("TODO", "[insert here]")
✓ Would help AI decide relevance
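Description checks are heuristic by nature. A possible sketch, where the placeholder list and the "generic wording" regex are illustrative assumptions, not a canonical definition of quality:

```python
import re

# Hypothetical placeholder markers; extend to match your team's habits.
PLACEHOLDERS = ["TODO", "[insert here]", "TBD"]

def check_description(description: str) -> list[str]:
    """Flag descriptions that are too short, placeholder-ridden, or generic."""
    issues = []
    if len(description) <= 20:
        issues.append("Description too short (<= 20 chars)")
    for marker in PLACEHOLDERS:
        if marker.lower() in description.lower():
            issues.append(f"Contains placeholder text: {marker}")
    # Heuristic: generic wording suggests the rule covers more than one purpose.
    if re.search(r"\b(general|misc|various|stuff)\b", description, re.IGNORECASE):
        issues.append("Description sounds generic, not specific to one purpose")
    return issues
```

Whether a description "would help the AI decide relevance" ultimately needs the behavioral tests below; this only catches the obvious failures.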
Example Presence
Content Tests - Examples:
✓ Procedural rules have examples
✓ Commands have usage examples
✓ Complex patterns are illustrated
✓ Examples are syntactically valid
Behavioral Tests (Golden Tests)
Behavioral tests verify the artifact actually guides AI behavior correctly. These use "golden" input/output pairs.
Golden Test Format
# .cursor/tests/debug-command.golden.yaml
artifact: .cursor/commands/debug.md
scenarios:
- name: "basic_error"
input: "/debug TypeError: undefined is not a function"
expected_contains:
- "Root Cause"
- "Hypothesis"
expected_not_contains:
- "I don't know"
- "I'm not sure"
- name: "stack_trace"
input: "/debug [stack trace with 5 frames]"
expected_contains:
- "Evidence"
- "line"
expected_format:
sections:
- "## Bug Analysis"
- "### Symptoms"
- "### Root Cause"
- name: "no_information"
input: "/debug it's broken"
expected_contains:
- "What error"
- "more information"
expected_behavior: "asks_clarifying_questions"
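A golden file like this needs only a thin runner. One way to sketch it, where `run_artifact` is a stand-in for however you invoke the model with the artifact loaded; the containment checks themselves are deterministic:

```python
from typing import Callable

def run_golden_scenario(scenario: dict, run_artifact: Callable[[str], str]) -> list[str]:
    """Run one golden scenario; return failure messages (empty list means pass)."""
    output = run_artifact(scenario["input"])
    failures = []
    for needle in scenario.get("expected_contains", []):
        if needle not in output:
            failures.append(f"Missing expected text: {needle!r}")
    for needle in scenario.get("expected_not_contains", []):
        if needle in output:
            failures.append(f"Found forbidden text: {needle!r}")
    for section in scenario.get("expected_format", {}).get("sections", []):
        if section not in output:
            failures.append(f"Missing section: {section!r}")
    return failures
```

Because model output varies between runs, treat a single pass as weak evidence; running each scenario a few times and requiring all passes is a cheap way to tighten the bound.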
Golden Test for Rules
# .cursor/tests/naming-rule.golden.yaml
artifact: .cursor/rules/naming/RULE.md
scenarios:
- name: "variable_naming"
context: "Creating a variable for user count"
input: "Create a variable to store the number of users"
expected_contains:
- "userCount" # camelCase expected
expected_not_contains:
- "user_count" # snake_case not expected
- "UserCount" # PascalCase not expected
- name: "class_naming"
context: "Creating a class for authentication"
input: "Create a class that handles user authentication"
expected_contains:
- "UserAuthentication" # PascalCase expected
expected_not_contains:
- "userAuthentication" # camelCase not expected
Golden Test for Agents
# .cursor/tests/security-auditor.golden.yaml
artifact: .cursor/agents/security-auditor.md
scenarios:
- name: "sql_injection"
input: |
Review this code for security:
db.query(`SELECT * FROM users WHERE id = '${userId}'`)
expected_contains:
- "SQL injection"
- "parameterized"
- "Critical"
expected_format:
has_sections:
- "Security Audit Report"
- "Critical Issues"
- name: "safe_code"
input: |
Review this code for security:
db.query('SELECT * FROM users WHERE id = $1', [userId])
expected_not_contains:
- "Critical"
- "injection"
expected_contains:
- "No critical issues"
The /test-artifact Command
Implement testing as a command:
# /test-artifact - Test Cursor Artifacts
Run structural, content, and behavioral tests on Cursor artifacts.
## Instructions
When the user invokes `/test-artifact`:
1. Identify the target artifact(s)
2. Determine artifact type (rule, command, agent)
3. Run structural tests for that type
4. Run content tests
5. Run behavioral tests if golden file exists
6. Report results in standard format
### Default Behavior
Test the current file if it's an artifact, or ask for target.
## Variants
### `/test-artifact @path`
Test specific artifact at path.
### `/test-artifact all`
Test all artifacts in .cursor/
### `/test-artifact all --type rules`
Test all artifacts of specific type.
### `/test-artifact --interactive`
Run behavioral tests interactively.
### `/test-artifact --regression`
Compare to last known good state.
## Output Format
## Artifact Test Report
**Artifact**: [path]
**Type**: [Rule/Command/Agent]
**Status**: ✅ Pass | ⚠️ Warnings | ❌ Fail
### Structural Tests
| Check | Status | Notes |
| ------- | -------- | --------- |
| [check] | ✅/⚠️/❌ | [details] |
### Content Tests
| Check | Status | Notes |
| ------- | -------- | --------- |
| [check] | ✅/⚠️/❌ | [details] |
### Behavioral Tests
| Scenario | Status | Notes |
| -------- | -------- | --------- |
| [name] | ✅/⚠️/❌ | [details] |
### Issues Found
1. [severity] [description]
- Suggestion: [fix]
### Summary
- Structural: X/Y ✓
- Content: X/Y ✓
- Behavioral: X/Y ✓
Test Execution Example
User: /test-artifact .cursor/rules/naming/RULE.md
Artifact Test Report
Artifact: .cursor/rules/naming/RULE.md
Type: Rule
Status: ⚠️ Warnings
Structural Tests:
| Check | Status | Notes |
|---|---|---|
| YAML frontmatter | ✅ Pass | |
| Description | ✅ Pass | 43 chars |
| Globs valid | ✅ Pass | ["**/*.ts", "**/*.js"] |
| Body exists | ✅ Pass | |
Content Tests:
| Check | Status | Notes |
|---|---|---|
| Actionable | ✅ Pass | Contains verbs |
| No vague phrases | ✅ Pass | |
| Has examples | ⚠️ Warn | Only 1 example |
| No placeholders | ✅ Pass | |
Behavioral Tests:
| Scenario | Status | Notes |
|---|---|---|
| variable_naming | ✅ Pass | Produced "userCount" |
| class_naming | ✅ Pass | Produced "UserAuth..." |
Issues Found:
- ⚠️ Only 1 example provided — Suggestion: Add 2-3 examples for edge cases
Summary: Structural 4/4 ✓ | Content 3/4 (1 warning) | Behavioral 2/2 ✓
Interactive Testing Mode
For behavioral tests that need human judgment:
User: /test-artifact --interactive
Starting interactive test session...
Testing: .cursor/commands/debug.md
Scenario 1: basic_error
Simulating: User says "debug this: TypeError undefined"
[Agent produces debug output...]
Did the command behave correctly? (y/n/details)
User: n
What was wrong?
> It didn't analyze the stack trace
Logged. Recording failure for scenario 'basic_error'.
Suggested fix: Add explicit instruction to parse stack traces in the debug command's Process section.
Continue to next scenario? (y/n)
Regression Testing
After modifying artifacts, verify you didn't break existing behavior:
User: /test-artifact --regression
Comparing to last known good state...
Changed artifacts:
.cursor/commands/debug.md
- Added: --trace variant
- Modified: default behavior
Running behavioral tests for changed artifacts...
Results:
- ✅ basic_error: Still passes
- ✅ stack_trace: Still passes
- ✅ no_information: Still passes
- ✅ NEW: trace_variant: Passes
All existing behaviors preserved. New variant works correctly.
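One way to sketch the comparison behind `--regression`: store each scenario's pass/fail result as a JSON snapshot and diff the current run against it. The file name and layout here are assumptions:

```python
import json
from pathlib import Path

def regression_diff(current: dict[str, bool], baseline_path: Path) -> list[str]:
    """Return scenarios that passed in the baseline but fail now."""
    baseline = json.loads(baseline_path.read_text()) if baseline_path.exists() else {}
    regressions = [
        name for name, passed in current.items()
        if baseline.get(name, False) and not passed
    ]
    if not regressions:
        # Only promote a fully clean run to the new "last known good" state.
        baseline_path.write_text(json.dumps(current, indent=2))
    return regressions
```

Updating the snapshot only on clean runs keeps the baseline honest: a failing run never silently becomes the new reference point.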
Integration Tests
Verify artifacts work together:
Integration Tests:
✓ Commands referenced in rules exist
✓ Agents referenced in commands exist
✓ No circular dependencies between rules
✓ Glob patterns don't conflict
✓ Subsumption matrix is consistent
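The first integration check can be sketched as a cross-reference scan. The directory layout (`.cursor/commands/*.md`) and the slash-command regex are assumptions:

```python
import re
from pathlib import Path

def find_missing_commands(rule_text: str, commands_dir: Path) -> list[str]:
    """Return slash-commands mentioned in a rule with no matching command file."""
    # Match "/name" tokens not embedded in paths or URLs (heuristic).
    referenced = set(re.findall(r"(?<![\w/])/([a-z][\w-]*)", rule_text))
    existing = {p.stem for p in commands_dir.glob("*.md")}
    return sorted(referenced - existing)
```

The same pattern extends to agents referenced in commands; the circular-dependency and glob-conflict checks need a small graph walk rather than a set difference.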
Validation Checklist
Use this before committing any artifact:
Rules
- [ ] YAML frontmatter is valid
- [ ] Description is specific (not generic)
- [ ] Globs are correct and not overly broad
- [ ] Instructions are actionable (contain verbs)
- [ ] At least one concrete example
- [ ] No vague phrases ("be helpful", "ensure quality")
- [ ] No conflicting guidance with other rules
Commands
- [ ] NO frontmatter
- [ ] Title format: `# /command-name - Description`
- [ ] Has `## Instructions` section
- [ ] Default behavior documented
- [ ] Variants documented
- [ ] Output format specified
- [ ] At least one usage example
Agents
- [ ] YAML frontmatter with name, model, description
- [ ] Role is specific (not "helpful assistant")
- [ ] Expertise areas listed
- [ ] Process has numbered steps
- [ ] Output format is explicit
- [ ] Constraints are defined
Key Takeaways
- Test what you can control. Structure and content are deterministic. Test them first.
- Structural tests catch 80% of issues. Invalid YAML, missing fields, wrong format: all detectable automatically.
- Golden tests bound behavior. You can't test for exact output, but you can test for the presence of expected elements.
- Interactive testing fills gaps. Some things need human judgment. Build that into the process.
- Regression testing prevents breakage. After changes, verify existing behaviors still work.
- Make testing a command. `/test-artifact` should be as easy to run as any other command.
What's Next
You can now test artifacts. But how do you know which artifacts to create? How do you identify patterns in your usage that suggest new rules or commands?
The final post in this series covers the meta-learning system—building feedback loops that learn from real usage and propose improvements.
Next: The Meta-Learning System: AI That Improves Itself — Observation, pattern detection, and automated improvement proposals.