TL;DR
AI coding agents like Claude Code are impressively good -- but they forget everything between sessions. In a project with 139 sessions, 24 rule files, and 14,000+ commits, I show how to build a "harness" that keeps your agent consistent, safe, and productive. It's not the prompts that make the difference. It's the system around them.
What Is a Harness?
When you work with an AI coding agent, you have a problem: the agent has no memory between sessions. Session 1 and session 100 start with the same knowledge. The agent knows neither your architecture, nor your conventions, nor the mistakes it made yesterday.
A harness is the solution. It's all the files and systems that give the agent the necessary context at every start. Think of it as an onboarding package for a new developer -- except that developer starts fresh every day.
In my case, the harness consists of four layers:
- CLAUDE.md: The main document. Commands, stack, architecture decisions.
- Rules: 24 rule files defining specific standards.
- Memory: Persistent information that applies across sessions.
- Session Handovers: 102 handover documents passing state between sessions.
Layer 1: CLAUDE.md as the Main Document
CLAUDE.md is the first thing the agent reads. It must be short, precise, and current. In my project, it contains:
- Commands: `bun dev`, `npm test`, `tsgo --noEmit`. The agent needs to know how to verify the code.
- Stack: Next.js 16.2.1, React 19, TypeScript 5.9, Tailwind 4. No guessing.
- Architecture: Three Supabase client patterns (Browser, Server, Admin). Who uses what when.
- Quality Gate: `tsgo --noEmit && bun run build && npm test`. Before every deploy.
A common mistake: CLAUDE.md gets too long. Everything gets stuffed in until it hits 500 lines. The agent skims it and forgets half. My solution: stay under 100 lines. Everything detailed lives in the Rules.
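As an illustration, a sub-100-line CLAUDE.md might open like this. The structure here is my sketch; the commands and stack details are the ones listed above:

```markdown
# CLAUDE.md

## Commands
- Dev server: bun dev
- Tests: npm test
- Type check: tsgo --noEmit
- Quality gate (before every deploy): tsgo --noEmit && bun run build && npm test

## Stack
Next.js 16.2.1, React 19, TypeScript 5.9, Tailwind 4. No guessing.

## Architecture
Three Supabase client patterns: Browser, Server, Admin.
Detailed standards live in .claude/rules/ -- read the relevant rule before editing.
```

The point is density: every line is something the agent acts on, and anything that needs elaboration points into the Rules instead of living here.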
Layer 2: Rules as Domain-Specific Standards
24 files in `.claude/rules/`. Each covers exactly one topic:

- `error-handling-catch.md`: No `.catch(() => {})`. Every catch must log.
- `api-response-consistency.md`: All API responses via `apiResponse.*()`. No raw `NextResponse.json()`.
- `test-colocation.md`: Tests belong in `__tests__/` directories next to the source.
- `voice-consistency.md`: First-person "I" perspective, no "we" on the website.
- `content-writing-standards.md`: No em-dashes. No AI buzzwords.
The key: each rule contains not just the rule, but also anti-patterns and correct examples. An agent that only reads "Use apiResponse" still makes mistakes. An agent that sees a concrete WRONG and CORRECT example makes the right decision.
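For the error-handling rule, the WRONG/CORRECT pair might look like this. This is a hedged sketch; `fetchUser` and the fallback behavior are illustrative, not the project's actual code:

```typescript
// WRONG: the error is swallowed, the failure becomes invisible.
// fetchUser(id).catch(() => {});

// CORRECT: every catch logs before handling or rethrowing.
async function fetchUserSafely(
  fetchUser: (id: string) => Promise<{ name: string }>,
  id: string
): Promise<{ name: string } | null> {
  try {
    return await fetchUser(id);
  } catch (err) {
    // The log line is the whole point of the rule: failures leave a trace.
    console.error(`fetchUser failed for id=${id}:`, err);
    return null; // explicit, logged fallback instead of a silent swallow
  }
}
```

Seeing both sides of the pattern removes the ambiguity that a one-line rule leaves open.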
Layer 3: Session Handovers as Memory Bridges
After every session, I write a handover document. 102 of them over 139 sessions. Each handover contains:
- What was done (commits, changes)
- What's open (open issues, known bugs)
- Next steps (prioritized)
The next session's agent reads the handover and immediately knows where to start. Without handovers, session 140 would start from zero and potentially redo work that was completed in session 138.
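A handover can be as simple as a short markdown file mirroring those three sections. The exact format is my sketch, not necessarily the project's:

```markdown
# Handover: Session 139

## Done
- <commits and changes>

## Open
- <open issues, known bugs>

## Next steps (prioritized)
1. <highest-priority task>
```

The agent reads this first, before touching any code.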
Layer 4: Automated Quality Gates
Humans forget things. So do agents. That's why I automate everything that can be checked:
- TypeScript: `tsgo --noEmit`. 0 errors as baseline. Non-negotiable.
- Build: `bun run build`. Missing translations, broken imports, everything surfaces here.
- Tests: 7,474 tests in 250 files. Each run takes 27 seconds.
- Content checks: 23 automated checks. Em-dashes, AI buzzwords, i18n parity, CTA repetitions.
- ESLint: Custom rules against `console.log` (PII leak prevention) and `error.message` in responses (SEC-015).
The quality gates run before every deploy. The agent cannot deploy without all checks being green.
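To make "content checks" concrete, here is a minimal sketch of one such check. The buzzword list and function name are assumptions; the real checks are more extensive:

```typescript
// Illustrative buzzword list -- the real project maintains its own.
const BUZZWORDS = ["delve", "leverage", "seamless"];

// Returns a list of issues found in a piece of content; empty means clean.
function checkContent(text: string): string[] {
  const issues: string[] = [];
  // "\u2014" is the em-dash character the writing standards forbid.
  if (text.includes("\u2014")) {
    issues.push("em-dash found");
  }
  for (const word of BUZZWORDS) {
    if (new RegExp(`\\b${word}\\b`, "i").test(text)) {
      issues.push(`buzzword: ${word}`);
    }
  }
  return issues;
}
```

Wired into the deploy pipeline, a non-empty result fails the build, so stylistic drift is caught the same way type errors are.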
Common Failure Modes Without a Harness
What happens without a harness:
Context Drift: Session 50 starts using patterns that session 30 already replaced. Without rules, there's no reference for what's "current."
Hallucination Accumulation: The agent invents a function that doesn't exist. Without quality gates, this only surfaces in production.
Architecture Erosion: Each session makes small deviations. After 20 sessions, you have five different error-handling patterns in the same project.
Repeated Mistakes: The agent makes the same mistake as 10 sessions ago. Without feedback memory, there's no learning effect.
What Worked Over 139 Sessions
A few patterns that proved themselves:
- SSOT files for everything that can change. Prices, biographical facts, service definitions. One file, one truth. The agent doesn't invent, it looks up.
- Composable middleware instead of copy-paste. `withErrorHandler()`, `withAdminAuth()`, `withRateLimit()`. Three wrappers that combine freely. No boilerplate across 62 admin routes.
- Rules with "why." Not just "Use apiResponse." But: "Because raw `NextResponse.json()` produces inconsistent error shapes and breaks the tests." Agents that know the "why" apply rules correctly even in edge cases.
- Feedback memory as a learning loop. When the agent makes a mistake and I correct it, I save that as memory. Next session: the same mistake doesn't occur.
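The composable-middleware idea can be sketched like this. The real project wraps Next.js route handlers; here a handler is simplified to a plain function, and the types and bodies are my assumptions:

```typescript
// Simplified stand-ins for a request and response.
type Req = { isAdmin?: boolean };
type Res = { status: number; body: string };
type Handler = (req: Req) => Promise<Res>;

// Catches anything the inner handler throws and logs it.
const withErrorHandler =
  (next: Handler): Handler =>
  async (req) => {
    try {
      return await next(req);
    } catch (err) {
      console.error("handler failed:", err);
      return { status: 500, body: "internal error" };
    }
  };

// Rejects non-admin requests before they reach the inner handler.
const withAdminAuth =
  (next: Handler): Handler =>
  async (req) =>
    req.isAdmin ? next(req) : { status: 403, body: "forbidden" };

// The wrappers compose freely -- no per-route boilerplate.
const route: Handler = withErrorHandler(
  withAdminAuth(async () => ({ status: 200, body: "ok" }))
);
```

Because each wrapper takes and returns a `Handler`, any combination composes without special cases, which is what keeps 62 admin routes boilerplate-free.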
The Harness Grows With the Project
In session 1, I had a CLAUDE.md with 30 lines. In session 139, I have 24 rules, 102 handovers, a memory system, and automated quality gates with 23 content checks.
The harness isn't a project you set up once. It grows with every problem you solve. Every mistake that occurs twice becomes a rule. Every architecture decision that needs context becomes an SSOT document.
Conclusion
AI coding agents are tools. Like any tool, the result depends on how you use it. The prompts are the obvious part. The harness is the invisible one. And often the decisive one.