Designing for Agents: Rules Make (or Break) Code Quality
Sep 12, 2025
Great agent output isn’t luck—it’s the product of programmable context. Treat design docs, goal intent, architecture, API contracts, user flows, and rules not as paperwork but as infrastructure your agents consume. The best teams don’t prompt harder; they architect the context.
Teams that front-load goal docs, lightweight architecture sketches, explicit API contracts, concrete user flows, and house rules see measurably fewer iterations and cleaner first-pass code.
Below is a plug-and-play playbook you can drop into your repo, plus tool-specific examples for Claude Code, OpenAI Codex, and Cursor.
Core Patterns (what the community says actually works)
1) Context Architecture (the Context Pyramid)
Successful teams layer context so the agent always has the right information at the right granularity.
Strategic Context → Project vision, success metrics, non-goals
System Context → Architecture, decisions, tech stack, API/database contracts
Module Context → Component purpose, dependencies, test coverage, change history
Task Context → workplan, acceptance criteria, edge cases, constraints
Code Context → similar examples, recent changes, .cursorrules / repo rules
Minimal artifact set that maps to the pyramid
```text
project-root/
├─ README.md                       # Vision, problem, success criteria, non-goals
├─ docs/
│  ├─ architecture.md              # Components, data flows, ADRs summary
│  ├─ decisions.md                 # ADR details (what/why/when)
│  ├─ api/
│  │  └─ openapi.yaml              # API contracts; single source of truth
│  └─ db/
│     └─ schema.sql                # DB schema / migrations reference
├─ src/
│  ├─ <module>/
│  │  ├─ AI_SUMMARY.md             # Module context (purpose, deps, invariants)
│  │  └─ examples/                 # “Good style” code patterns for the agent
│  └─ ...
├─ .cursorrules                    # Explicit rules & constraints (see below)
├─ CONTRIBUTING.md                 # Code style, tests, CI, commit/tag rules
└─ TASKS/
   └─ <YYYY-MM-DD>-<task-name>.md  # Workplans w/ acceptance criteria
```
Why this works: agents don’t need your whole codebase; they need curated context. This layout lets you hand each task a focused slice.
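To make “focused slice” concrete, the minimal sketch below concatenates just the relevant layers into a single file you can hand to an agent. It assumes the layout above; the script name, arguments, and output file are illustrative, not part of any tool.

```ts
// build-context.ts — assemble a focused context slice for one task (illustrative sketch)
import { readFileSync, writeFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical invocation: ts-node build-context.ts auth 2025-09-12-email-validation.md
const [moduleName, taskFile] = process.argv.slice(2);
if (!moduleName || !taskFile) {
  console.error("usage: build-context.ts <module> <task-file>");
  process.exit(1);
}

const layers = [
  "README.md",                              // strategic context
  "docs/architecture.md",                   // system context
  join("src", moduleName, "AI_SUMMARY.md"), // module context
  join("TASKS", taskFile),                  // task context
];

const slice = layers
  .filter((path) => existsSync(path))
  .map((path) => `\n--- ${path} ---\n${readFileSync(path, "utf8")}`)
  .join("\n");

writeFileSync("context-slice.md", slice);
console.log(`Wrote context-slice.md from ${layers.length} layers`);
```

Paste or attach `context-slice.md` at the start of a session instead of the whole repo.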
2) Task Decomposition (the Goldilocks rule)
Make tasks atomic & verifiable (15–30 minutes of work).
Good:
- “Add email validation to the signup form”
- “Create `Button` component with hover + disabled states and tests”

Bad:
- “Build authentication system”
- “Fix all bugs”

Each task file in `TASKS/` should include: objective, constraints, steps, acceptance criteria, test plan, out-of-scope.
3) Progressive Enhancement (ship a skeleton, then strengthen)
Stage work in predictable layers:
Foundation: files, types, minimal UI
Core: happy path, basic tests, API wiring
Robustness: validation, loading/error states, edge cases
Optimization: perf, caching, lazy loading, bundle size
Refinement: docs, a11y, polish, animations
This preserves a working state and gives agents immediate feedback.
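To make the staging concrete, here is a sketch of the same data-fetching helper at the Core stage versus the Robustness stage, using the `/weather` contract from later in this article; the function names are illustrative.

```ts
// Core stage: happy path only; enough to wire the UI and get feedback
async function fetchWeatherCore(city: string) {
  const res = await fetch(`/weather?city=${encodeURIComponent(city)}`);
  return res.json(); // assumes success; no validation yet
}

// Robustness stage: validation, error states, edge cases layered on top
interface Weather {
  city: string;
  tempC: number;
  updatedAt: string;
}

async function fetchWeatherRobust(city: string): Promise<Weather> {
  if (!city.trim()) throw new Error("City is required");
  const res = await fetch(`/weather?city=${encodeURIComponent(city)}`);
  if (res.status === 404) throw new Error(`Unknown city: ${city}`);
  if (!res.ok) throw new Error(`Weather API failed (${res.status})`);
  const body = await res.json();
  if (typeof body.tempC !== "number") throw new Error("Malformed payload");
  return body as Weather;
}
```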
4) Explicit Constraints (tell the agent what not to do)
Put constraints right next to the task:
```markdown
## Constraints
- Do NOT modify existing tests
- Do NOT change the DB schema
- ONLY touch files in `src/auth/*` and `tests/auth/*`
- Preserve all public API signatures
```
This single section eliminates most “helpful refactors.”
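Constraints bite harder when CI can check them. Below is a minimal TypeScript sketch (assuming a Node toolchain and an `origin/main` base branch) that fails the build when the diff touches files outside the allowed paths from the example above.

```ts
// check-scope.ts — fail CI if the diff strays outside the task's allowed paths (sketch)
import { execSync } from "node:child_process";

// Allowed paths copied from the task's Constraints section
const allowed = ["src/auth/", "tests/auth/"];

// Files changed relative to the base branch (assumes origin/main is fetched)
const changed = execSync("git diff --name-only origin/main...HEAD", { encoding: "utf8" })
  .split("\n")
  .filter(Boolean);

const outOfScope = changed.filter((file) => !allowed.some((prefix) => file.startsWith(prefix)));

if (outOfScope.length > 0) {
  console.error("Out-of-scope changes detected:\n" + outOfScope.join("\n"));
  process.exit(1);
}
console.log(`Scope OK: ${changed.length} changed file(s) within ${allowed.join(", ")}`);
```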
Templates you can copy-paste
A. Project Intent (Strategic Context)
```markdown
# Project Intent: Personal Weather Dashboard

## Vision
A clean, minimalist dashboard showing weather for multiple cities I care about.

## Problem
I check 3+ cities across different apps daily; I want one unified view.

## Success Criteria
- [ ] Shows current weather for 3+ cities
- [ ] Auto-updates every 30 minutes
- [ ] Desktop + mobile responsive
- [ ] TTI < 2s

## Constraints
- Technical: free weather APIs only
- Time: one weekend
- Resources: solo dev, modern web stack

## Non-Goals
- No accounts/personalization
- No >5-day forecasts
- No native apps
```
B. User Flow (System Context for UX tasks)
Bullet points are perfect—agents don’t need BPMN:
```text
Homepage → Search/City List → City Weather → Add to Dashboard → Dashboard

Checkout flow:
- Add to cart
- Apply coupon (optional)
- Enter billing + address
- Fraud check (Team A)
- Confirmation
```
C. API Contract (Single Source of Truth)
Keep it in `docs/api/openapi.yaml` and reference it in prompts.
```yaml
openapi: 3.0.3
info: { title: Weather API, version: 1.0.0 }
paths:
  /weather:
    get:
      parameters:
        - in: query
          name: city
          required: true
          schema: { type: string }
      responses:
        '200':
          description: Current weather for the city
          content:
            application/json:
              schema:
                type: object
                required: [city, tempC, updatedAt]
                properties:
                  city: { type: string }
                  tempC: { type: number }
                  updatedAt: { type: string, format: date-time }
        '404':
          description: City not found
```
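One way to make the contract bite at runtime is to mirror it with a schema at the API boundary; a minimal sketch, assuming Zod (the “Zod at boundaries” rule mentioned later in this article). The schema is hand-written here to match the `200` response above.

```ts
import { z } from "zod";

// Mirrors the 200 response of GET /weather in docs/api/openapi.yaml
export const WeatherResponse = z.object({
  city: z.string(),
  tempC: z.number(),
  updatedAt: z.string().datetime(),
});
export type WeatherResponse = z.infer<typeof WeatherResponse>;

// parse() at the boundary rejects hallucinated or missing fields before they reach the UI
export function parseWeather(payload: unknown): WeatherResponse {
  return WeatherResponse.parse(payload);
}
```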
D. Task Workplan (Task Context)
```markdown
# TASK: Add email validation to signup

## Objective
Reject invalid emails client-side and server-side, with helpful messaging.

## Constraints
- Do NOT modify DB schema
- ONLY edit `src/auth/signup/*` and `tests/auth/*`

## Steps
1) Add client-side `validateEmail(email)` w/ RFC-lite regex + tests
2) Show inline error state + aria-describedby
3) Add server-side validation in `POST /signup`
4) Tests: unit (client), integration (API), a11y snapshot

## Acceptance Criteria
- Invalid emails blocked client & server
- Error message meets a11y (role=alert)
- Tests pass: `pnpm test:auth`
```
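For a task like this, the deliverable might look roughly like the sketch below; the RFC-lite regex and the inline assertions are assumptions, and the real tests would live under `tests/auth/` so `pnpm test:auth` picks them up.

```ts
// src/auth/signup/validateEmail.ts — illustrative sketch, not the canonical implementation
import assert from "node:assert/strict";

// "RFC-lite": non-empty local part, a single @, a dotted domain, no whitespace
const EMAIL_RE = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

export function validateEmail(email: string): boolean {
  return EMAIL_RE.test(email.trim());
}

// Quick self-checks; the task's unit tests would cover far more edge cases
assert.equal(validateEmail("ada@example.com"), true);
assert.equal(validateEmail("ada@"), false);
assert.equal(validateEmail("ada lovelace@example.com"), false);
```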
Anti-patterns (avoid these traps)
Magic Prompt Fallacy: one mega-prompt won’t build your app → iterate with checkpoints
Context Overload: don’t paste the repo → curate module + task context
Yes-Man Trap: never accept changes without tests & review
Scope Creep Enablement: constraints stop “I also refactored X…”
Metrics (how you’ll know it’s working)
Goal: Make AI work observable and actionable. Track a few metrics, wire them to automatic checks, and define what happens when they drift.
Velocity
Iterations per task: % of tasks merged in ≤3 review cycles. Target: ≥80%.
Context prep time: % of task time spent on intent/flows/contracts/module summaries. Target: 10–20%.
Quality
Tests included: % of AI PRs that add/modify tests and keep coverage ≥ baseline. Target: ~100%, non-decreasing coverage.
Major revision rate: AI PRs needing substantial rework (>30% lines changed after first review or out-of-scope files touched). Target: <20%.
Process
Bug intro rate: Bugs per 1k LOC within 14 days of merge, AI vs human. Target: AI ≤ manual baseline.
Docs:Code ratio: LOC in docs/context vs LOC in code for each PR. Target: ≈1:4 (team-tunable band).
How to Track (minimal)
- One PR per task, title `TASK:<id>`.
- Label PRs `source:AI` vs `source:human`.
- CI gates (see the sketch after this list):
  - Block if code changed and no tests changed.
  - Coverage must not drop.
  - Post Docs:Code ratio on PR.
- Count review cycles from `CHANGES_REQUESTED` events.
- Attribute bugs to PRs; compute bugs/kloc by source label.
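A hedged sketch of the “tests changed” and Docs:Code gates, assuming a Node CI step, an `origin/main` base, and the repo layout from earlier; adapt the path checks to your conventions.

```ts
// ci-gates.ts — enforce "code changes need test changes" and report the Docs:Code ratio (sketch)
import { execSync } from "node:child_process";

// Added + deleted lines per file versus the base branch
const numstat = execSync("git diff --numstat origin/main...HEAD", { encoding: "utf8" })
  .trim()
  .split("\n")
  .filter(Boolean)
  .map((line) => {
    const [added, deleted, file] = line.split("\t");
    return { lines: (Number(added) + Number(deleted)) || 0, file }; // binary diffs count as 0
  });

const isTest = (f: string) => /(^|\/)tests?\//.test(f) || /\.(test|spec)\./.test(f);
const isDocs = (f: string) => f.startsWith("docs/") || f.startsWith("TASKS/") || f.endsWith(".md");
const isCode = (f: string) => f.startsWith("src/") && !isTest(f);

const sum = (pred: (f: string) => boolean) =>
  numstat.filter((n) => pred(n.file)).reduce((total, n) => total + n.lines, 0);

const codeLines = sum(isCode);
const testLines = sum(isTest);
const docLines = sum(isDocs);

if (codeLines > 0 && testLines === 0) {
  console.error("Gate failed: code changed but no tests changed.");
  process.exit(1);
}
console.log(`Docs:Code ratio ≈ ${docLines}:${codeLines}`); // post this on the PR
```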
What To Do When Metrics Move
| Signal | Likely Cause | First Action |
| --- | --- | --- |
| Iterations >3 | Tasks too big/fuzzy | Split into 15–30 min atoms; add explicit constraints |
| Major revisions >20% | Weak module context | Refresh module context (`AI_SUMMARY.md`) |
| Missing tests / coverage ↓ | Process gap | Block merge; have agent generate tests first |
| AI bugs/kloc > human | Context gaps / big diffs | Require repro test first; add static analysis gates |
| Docs:Code << target | Context starvation | Add acceptance criteria, flows, and constraints |
| Docs:Code >> target | Doc bloat | Consolidate into OpenAPI/ADRs; remove stale docs |
Tiny PR Template (keeps you honest)
```markdown
## Task
TASK:<id> — <title>

## Acceptance
- [ ] …

## Evidence
- Tests: <files/test names>
- Coverage: <auto from CI>
- Scope paths touched: <auto>
- Docs updated: [ ] AI_SUMMARY.md [ ] OpenAPI [ ] ADR
```
Cadence: review the dashboard weekly (iterations, tests/coverage, bugs/kloc, Docs:Code). In the monthly retro, convert top outliers into rule/template updates.
Tool-specific implementations
A) Claude Code (Anthropic)
Claude Code natively reads `CLAUDE.md` files and can be shaped via allowed tools, slash commands, and headless mode. Use these to encode your design/goal docs and rules. (Anthropic)
Place these files:
- `CLAUDE.md` (repo root): paste “GOAL / ARCH / RULES” summaries, common commands, test scripts, code style, gotchas; Claude will automatically load them. (Anthropic)
- `.claude/commands/*.md`: reusable, parameterized runbooks (e.g., “fix issue #…”, “apply API client migration”). (Anthropic)
- `.mcp.json`: pre-wire external tools (Puppeteer, Sentry, internal APIs) so every engineer (and Claude) has the same capabilities. (Anthropic)
Example: `CLAUDE.md` (excerpt)
```markdown
# Code style
- TS strict; no any in app/
- Prefer adapter seams & feature flags for migrations

# Workflow
- Always run: npm run typecheck && npm run test:unit
- For session work: use FEATURE_SESSION_V2 flag; adapter ISessionClient

# Commands (use slash menu)
- /project:fix-github-issue <id>
- /project:migrate-session-v2
```
Permissions & safety: pre-allow Edit
, git commit
, and safe bash ranges via /permissions
or settings; keep destructive tools on “ask”. Anthropic
Automate checks: run headless mode in CI to lint PR descriptions or triage issues with the same rules you coded in CLAUDE.md
. Anthropic
Why this fits: Anthropic’s best-practices explicitly recommend CLAUDE.md
for style/tests/workflow and curating allowed tools; teams also use custom commands to standardize multi-step tasks. Anthropic
B) OpenAI Codex (2025)
The latest Codex ships as a CLI + IDE extension (VS Code, Cursor, etc.) and can run locally or in an OpenAI cloud sandbox with approval modes. It consumes IDE context, so short prompts + strong repo docs work best. (OpenAI)
Wire in your docs/rules:
- Keep `GOAL.md`, `ARCH.md`, `FLOWS.md`, and `openapi.yaml` at repo root; open them in the IDE so Codex ingests them as context.
- Start in Read-only/Chat, then escalate to Agent/Full Access only after Codex restates the plan and fast-fail checks. (Habr)
- Prefer sandbox runs for long tasks (test suites, migrations); promote PRs from the sandbox. (Habr)
Kickoff prompt (IDE chat):
```text
Read GOAL.md, ARCH.md, and API/openapi.yaml.
Restate assumptions and propose a step-by-step plan (no edits yet).
List fast-fail checks (commands + expected signals). Wait for approval.
```
Approval prompt (after review):
```text
Proceed in Agent mode, local environment.
Scope: ApiClient.ts, SessionService.ts only.
Outputs: minimal diff, updated tests, and a validation script.
```
Why this fits: Codex’s IDE extension uses open files and selections as implicit context; the new releases emphasize agentic workflows and multi-hour tasks, so plans and approvals keep it aligned and safe. (Benchmarks and product notes also highlight agentic coding and IDE integration.) (TechRadar)
Prompt-engineering note: OpenAI’s guidance stresses concrete instructions, up-to-date models, and structured prompts, mirrored by the plan→validate pattern above. (OpenAI Help Center)
C) Cursor (IDE)
Cursor supports persistent Project Rules via `.cursor/rules/*.mdc` or the Rules UI. Treat these as your “policy as code” file set, encoding style, patterns, and step-by-step runbooks the agent and inline edits must follow. (Cursor)
Create rule files:
`.cursor/rules/session-v2.mdc`
```md
---
description: Migrate to Session API v2 via adapter+flag
globs: ["app/**", "lib/**"]
alwaysApply: true
---
# Constraints
- Use ISessionClient adapter; do not touch UI text
- Guard with FEATURE_SESSION_V2; default false
- Validate via "pnpm test e2e:login_flow"

# Steps
1) Update ApiClient.ts to call /api/session/v2 (see API/openapi.yaml)
2) Implement ISessionClient.getMe(), .login()
3) Add contract tests using example payloads
4) Run fast-fail commands; paste logs in PR
```
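To show what the adapter-plus-flag seam in this rule might look like, here is a hedged TypeScript sketch; the interface shape, endpoints, and flag wiring are assumptions drawn from the rule text, not Cursor output.

```ts
// lib/session/ISessionClient.ts — adapter seam guarded by FEATURE_SESSION_V2 (illustrative sketch)
export interface SessionUser {
  id: string;
  email: string;
}

export interface ISessionClient {
  getMe(): Promise<SessionUser>;
  login(email: string, password: string): Promise<void>;
}

// Assumed v2 endpoints under /api/session/v2; adjust to the real contract
class SessionClientV2 implements ISessionClient {
  async getMe(): Promise<SessionUser> {
    const res = await fetch("/api/session/v2/me");
    if (!res.ok) throw new Error(`session v2 getMe failed: ${res.status}`);
    return res.json();
  }
  async login(email: string, password: string): Promise<void> {
    const res = await fetch("/api/session/v2/login", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ email, password }),
    });
    if (!res.ok) throw new Error(`session v2 login failed: ${res.status}`);
  }
}

// Flag defaults to false, per the rule; the legacy client stays the fallback
export function makeSessionClient(legacy: ISessionClient): ISessionClient {
  return process.env.FEATURE_SESSION_V2 === "true" ? new SessionClientV2() : legacy;
}
```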
Operationalize knowledge:
- Use the Rules panel to attach or toggle rules per workspace; reference domain docs via `@DocName` for retrieval-augmented edits. (Cursor Community Forum)
- Keep a top-level core rule (“no `any` in app/”, “Zod at boundaries”, “Given–When–Then tests”). Community guidance shows `.cursorrules`/Rules reduce repetitive corrections. (Medium)
- For Composer/Agent workflows, several teams run a PRD-driven loop (PRD in repo, rules force one story at a time, `.ai/` folder for history). (Cursor Community Forum)
Why this fits: Cursor’s docs and community threads describe Rules as system-level, persistent instructions that the Agent and Inline Edit follow, exactly the place to encode your guidelines and flows. (Cursor)
Validation scaffolding the agent can run
Add a machine-runnable checklist so agents can prove alignment:
```bash
#!/usr/bin/env bash
# validate.sh
set -euo pipefail
pnpm install --frozen-lockfile
pnpm typecheck
pnpm test:unit
pnpm test:e2e --filter=login_flow
pnpm lint
echo "OK: typecheck, unit, e2e(login_flow), lint all green"
```
- Claude: call from a custom slash command (`/project:validate`) or in headless CI. (Anthropic)
- Codex: run in a sandbox or local Agent task; attach logs to the PR. (Habr)
- Cursor: reference it from a Rule or have the Agent run it post-edit. (Cursor)
Pitfalls this structure avoids
- Spec ambiguity → rework: API schema & examples prevent hallucinated fields and “works on mock” regressions. (codecentric AG)
- Style drift: repo-level rules (`CLAUDE.md`, Cursor Rules) keep edits consistent across files and sessions. (Anthropic)
- Unsafe automation: tool allowlists (Claude) and approval modes (Codex) gate risky actions until the plan is reviewed. (Anthropic)