TDD Guard for Claude Code

Cover image: Original photo by Pixabay on Pexels

Claude Code fit well into my workflow, but sticking to Test-Driven Development (TDD) principles still needed occasional reminders: one test at a time, fail first, no over-implementation. It worked, but it wasn’t effortless. So I built TDD Guard, a utility that handles that for me and keeps my attention focused on designing solutions rather than policing TDD.

Leveraging Hooks

The breakthrough occurred when Anthropic announced Hooks for Claude Code—coincidentally, the very day I was seeking a better solution. Hooks automatically execute commands based on predefined conditions, such as before or after an agent’s action, making them ideal for enforcing coding standards.

I began by creating a hook that intercepts all file modification operations before execution, triggering validation checks for common TDD violations (a configuration sketch follows the list):

  • Implementing functionality without a relevant failing test
  • Implementing more than necessary to pass a test
  • Adding more than one test at a time
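To make the mechanism concrete, here is roughly what registering such a hook in Claude Code’s settings looks like. Treat it as a minimal sketch: the matcher and the tdd-guard command reflect how a setup like mine can be wired in, not a definitive configuration.

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit|MultiEdit",
        "hooks": [
          {
            "type": "command",
            "command": "tdd-guard"
          }
        ]
      }
    ]
  }
}
```

The PreToolUse event fires before the matched tool runs, which gives the validator a chance to inspect the intended change and veto it.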

The validator integrates hook event data, the agent’s current todo list, and the latest test run output. It then invokes a separate Claude Code session to verify adherence to TDD principles.

Since Claude Code hooks run as separate processes, TDD Guard persists context data to files between each phase. This approach allows different hooks, like TDD validation before a change and lint checks after, to access shared state without relying on complex inter-process communication. It keeps the system reliable and easy to reason about.
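A minimal sketch of that file-based handoff, with hypothetical helper names and storage path; the real layout may differ:

```typescript
import { mkdir, readFile, writeFile } from 'node:fs/promises'
import { join } from 'node:path'

// Hypothetical location for shared hook state.
const DATA_DIR = join(process.cwd(), '.claude', 'tdd-guard', 'data')

// Each piece of context (todos, test output, lint results) is saved as its
// own file so any later hook process can read it without inter-process communication.
export async function saveContext(key: string, content: string): Promise<void> {
  await mkdir(DATA_DIR, { recursive: true })
  await writeFile(join(DATA_DIR, `${key}.json`), content, 'utf-8')
}

export async function loadContext(key: string): Promise<string | null> {
  try {
    return await readFile(join(DATA_DIR, `${key}.json`), 'utf-8')
  } catch {
    return null // Nothing persisted yet for this key.
  }
}
```

In this scheme, a test reporter writes the latest run with saveContext('test', …), and the pre-action validator reads it back with loadContext('test') before judging a modification.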

If no violations are detected, the hook proceeds without interference. But when a violation is found, it blocks the action and returns feedback clearly stating the issues, along with corrective guidance.
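Claude Code treats a hook’s exit code 2 as a block and feeds whatever the hook wrote to stderr back to the agent. A rough sketch of that final step, with the validation itself stubbed out as a hypothetical helper:

```typescript
// Hypothetical verdict shape produced by the validation step.
interface ValidationResult {
  decision: 'approve' | 'block'
  reason: string
}

// Placeholder: in reality this assembles the persisted context,
// calls the model, and parses its verdict.
async function validateModification(): Promise<ValidationResult> {
  return { decision: 'approve', reason: '' }
}

async function main(): Promise<void> {
  const result = await validateModification()

  if (result.decision === 'block') {
    console.error(result.reason) // corrective feedback returned to the agent
    process.exit(2) // exit code 2 blocks the pending tool call
  }

  process.exit(0) // no violation: let the operation proceed untouched
}

main()
```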

Dogfooding the Development

With the prototype integrated directly into my workflow, improvements were immediately apparent. Eliminating manual reminders was a refreshing change. Yet this initial success felt suspiciously smooth, raising the suspicion that the validation was being swayed by the TDD terminology saturating the project’s context rather than by the validation logic itself.

Testing TDD Guard in an unrelated project confirmed this suspicion. Some clear violations slipped through, while valid actions were mistakenly blocked, sometimes prompting humorous attempts by the agent to circumvent restrictions using terminal commands.

Integration Testing

That prompted me to invest in robust integration tests. I implemented comprehensive test data factories, sketched after this list, covering various scenarios:

  • Empty, in-progress, completed, or irrelevant todo lists
  • Diverse test outputs, including empty, irrelevant passes, and various failure types
  • Modifications including minimal implementations, excessive functionality, multiple tests, and diverse refactoring cases
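To give a flavour of those factories, here is a simplified sketch with invented names and fields; the real test data is richer:

```typescript
// Hypothetical shapes mirroring the context the validator receives.
interface Todo { content: string; status: 'pending' | 'in_progress' | 'completed' }
interface TestRun { output: string; passed: boolean }
interface Scenario { todos: Todo[]; testRun: TestRun | null; modification: string }

// Each factory accepts partial overrides so a test states only what it cares about.
export function todo(overrides: Partial<Todo> = {}): Todo {
  return { content: 'Implement add()', status: 'in_progress', ...overrides }
}

export function failingTestRun(overrides: Partial<TestRun> = {}): TestRun {
  return { output: 'Expected 3, received undefined', passed: false, ...overrides }
}

export function scenario(overrides: Partial<Scenario> = {}): Scenario {
  return { todos: [todo()], testRun: failingTestRun(), modification: '', ...overrides }
}
```

An over-implementation case, for instance, is simply scenario({ modification: ... }) where the modification does far more than the failing test demands.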

This strategy enabled easy adjustments to data structures and facilitated quick support for new languages like Python without extensive test rewrites. These tests also accommodated Claude Code’s distinct modification tools: Write, Edit, and MultiEdit.

Context Engineering

Each of Claude Code's file operation tools behaves differently. Write generates complete file contents, while Edit and MultiEdit apply targeted modifications to existing code. These differences meant a single static prompt was often inefficient and poorly aligned with the operation being validated.

One-size-fits-all prompts often led to confusing or conflicting behavior, especially when only some context was relevant to the operation. For example, Edit or MultiEdit might attempt to introduce three tests, two of which already existed and were simply refactored. In such cases, the prompt needed to include guidance for identifying genuinely new tests. But the same guidance would be irrelevant for a Write operation, which always starts from a clean state.

To address this, I built a dynamic system that assembles only the relevant instruction modules based on the operation type and situation. This ensures the model receives clear, concise context tailored to the task at hand.
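A simplified sketch of that assembly, with made-up module names; the point is that only instructions relevant to the tool in play are included:

```typescript
// Hypothetical instruction fragments; the real modules are longer prompts.
const CORE_RULES = 'Allow only one new failing test at a time. Block over-implementation.'
const EDIT_GUIDANCE = 'Compare old and new content to identify genuinely new tests.'
const WRITE_GUIDANCE = 'The file is written from scratch; treat every test in it as new.'

type ToolName = 'Write' | 'Edit' | 'MultiEdit'

export function buildValidationPrompt(tool: ToolName, context: string): string {
  const modules = [CORE_RULES]

  // Edit and MultiEdit need help telling moved tests from new ones;
  // Write starts from a clean slate, so that guidance would only add noise.
  modules.push(tool === 'Write' ? WRITE_GUIDANCE : EDIT_GUIDANCE)

  return [...modules, context].join('\n\n')
}
```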

Multi-Model Support

As test complexity increased, performance optimization became crucial. Slow integration tests impeded rapid iterations required to refine validation instructions effectively. To address this, I integrated Anthropic’s API into TDD Guard, enabling separate configurations for production and integration environments. This significantly accelerated feedback loops and also exposed unexpected parsing issues, as some models frequently elaborated or provided code examples before finalizing responses.
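A sketch of how that split can look, assuming a hypothetical environment flag and the official @anthropic-ai/sdk client; the actual configuration surface differs in its details:

```typescript
import Anthropic from '@anthropic-ai/sdk'

// Hypothetical switch: integration tests call the API directly for speed,
// while normal use keeps validating through a Claude Code session.
const useApi = process.env.TDD_GUARD_MODEL === 'api'

export async function runValidation(prompt: string): Promise<string> {
  if (useApi) {
    const client = new Anthropic() // reads ANTHROPIC_API_KEY from the environment
    const response = await client.messages.create({
      model: 'claude-3-5-haiku-latest', // a smaller, faster model for test runs
      max_tokens: 1024,
      messages: [{ role: 'user', content: prompt }],
    })
    const block = response.content[0]
    return block.type === 'text' ? block.text : ''
  }

  return runClaudeCodeSession(prompt)
}

// Placeholder for the production path described above.
async function runClaudeCodeSession(prompt: string): Promise<string> {
  throw new Error('not shown in this sketch')
}
```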

Rules vs. Mindset

Despite improvements, scenario tests still exhibited inconsistencies due to subjective model interpretations of TDD principles, notably during refactoring. Explicit rule enforcement resolved test inconsistencies but led to mechanical adherence by the agent, which did not translate into improved software quality.

In one experiment, I had Claude Code implement a shopping cart without explicit quality instructions. While TDD Guard effectively blocked violations, the resultant code demonstrated issues such as tight coupling, duplication, and poor overall design.

Inspired by my mentor, Per at factor10, who recently shared reflections on this topic, I began exploring how to encourage a genuine TDD mindset. Could aspects of this mindset be emulated using hooks?

Meaningful Refactoring

Initially, I considered employing a reviewing model to suggest meaningful refactorings. However, lacking sufficient system-wide context, this approach risked inadvertently increasing complexity rather than simplifying the solution, a common problem with local rather than global optimization. Providing comprehensive context to the model proved both impractical and prohibitively expensive.

Instead, I integrated lightweight linting tools such as Sonar to identify complexities and code standard violations, prompting refactoring tasks.

Unfortunately, the agent treated post-action feedback as optional, frequently deferring or ignoring the refactoring tasks. To strengthen enforcement, I stored the identified issues and required their resolution in subsequent pre-action validation phases. Deliberately withholding the specific issue details at first encouraged the agent to attempt meaningful refactorings on its own; persistent issues were later spelled out explicitly as feedback.
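A rough sketch of that enforcement loop, reusing the hypothetical saveContext/loadContext helpers from the earlier persistence sketch; names and shapes are illustrative:

```typescript
import { loadContext, saveContext } from './storage' // hypothetical helpers from earlier

interface LintIssue { file: string; rule: string; message: string }

// Post-action hook: record what the linter reported, but do not block yet.
export async function recordLintIssues(issues: LintIssue[]): Promise<void> {
  await saveContext('lint', JSON.stringify(issues))
}

// Pre-action hook: withhold approval while known issues remain unresolved.
export async function pendingRefactoringFeedback(attempt: number): Promise<string | null> {
  const raw = await loadContext('lint')
  const issues: LintIssue[] = raw ? JSON.parse(raw) : []
  if (issues.length === 0) return null // nothing pending, let the change through

  // The first nudge stays vague to invite a genuine refactoring attempt;
  // later nudges spell out the outstanding issues explicitly.
  return attempt === 0
    ? 'Refactoring is still pending. Review the recent changes and simplify before continuing.'
    : `Unresolved issues:\n${issues.map((i) => `${i.file}: ${i.message} (${i.rule})`).join('\n')}`
}
```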

Balancing Art and Science

Defining “meaningful refactoring” remains inherently subjective and context-dependent, with varying interpretations of quality standards, design principles, and complexity among developers and teams.

Moreover, automated linting rules primarily encouraged superficial changes, such as smaller functions, rather than genuinely better design. Recognizing this limitation, I concluded that TDD Guard should enforce basic linting rules during refactoring cycles and leave deeper, meaningful design considerations to the developers themselves. After all, engaging actively in system design and architecture remains one of programming’s most enjoyable aspects.

Wrapping up

The TDD Guard project began two weeks ago during a relaxed, collaborative yearly event, a BBQ at Jimmy’s place with colleagues from factor10. The initial prototype, a simple “cat detector” blocking modifications containing “mew”, served as a quick validation of core concepts.

I intentionally retained these early explorations in the project’s history as a testament to something I deeply believe in: that joy, authenticity, meaningful companionship, and psychological safety nurture random curiosity and creativity into tangible innovations.

Applying TDD principles and dogfooding throughout the development enabled rapid iterations and valuable community contributions, including Python support. The community’s warm reception, enthusiastic feedback, and collaborative spirit have been profoundly motivating.

I'm especially grateful to my mentor Martin, whose encouragement and belief in the idea helped bring it to life.

Feel free to explore TDD Guard, and please reach out with feedback or contributions. Your insights are invaluable and greatly appreciated.