<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>Daz</title>
    <subtitle>Software developer, creative coder, electronic musician</subtitle>
    <link rel="self" type="application/atom+xml" href="https://daz.is/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://daz.is"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-04-22T00:00:00+00:00</updated>
    <id>https://daz.is/atom.xml</id>
    <entry xml:lang="en">
        <title>Code I&#x27;ll Never Read</title>
        <published>2026-04-22T00:00:00+00:00</published>
        <updated>2026-04-22T00:00:00+00:00</updated>
        
        <author>
          <name>Unknown</name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/code-i-will-never-read/"/>
        <id>https://daz.is/blog/code-i-will-never-read/</id>
        
        <content type="html" xml:base="https://daz.is/blog/code-i-will-never-read/">&lt;p&gt;Six months ago I&#x27;d have laughed if you told me AI was writing all my code. I&#x27;ve written about this shift before, from
&lt;a href=&quot;&#x2F;blog&#x2F;rethinking-ai&#x2F;&quot;&gt;rethinking my position on AI&lt;&#x2F;a&gt; to building a
&lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;process&lt;&#x2F;a&gt; around plans, deviation logs, and targeted review. But there&#x27;s a
further step I wasn&#x27;t expecting to come so soon.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve started to question whether I&#x27;m even qualified to judge the code any more, because the code isn&#x27;t written for me.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-moment-it-clicked&quot;&gt;The moment it clicked&lt;&#x2F;h2&gt;
&lt;p&gt;I have a side project I&#x27;ve worked on for years. No deadlines, no clients. Just code written for the pleasure of writing
it. Some of the best code I&#x27;ve ever produced, by my own standards. I&#x27;ve always had a few side projects on the go as
somewhere to develop the craft without compromise. The quality of this code mattered to me, and I thought it was good.&lt;&#x2F;p&gt;
&lt;p&gt;I showed it to Claude Code and asked &quot;how easy would this codebase be for you to work with?&quot; I told it not to hold back.
Review it purely from an agentic coding perspective, ignore human aesthetics entirely.&lt;&#x2F;p&gt;
&lt;p&gt;The feedback wasn&#x27;t great.&lt;&#x2F;p&gt;
&lt;p&gt;One thing it flagged: I&#x27;d replaced the router with a custom macro that let me define routes inline with their handlers.
Elegant. Everything in one place. Open one file, see the route and its logic together. For me this was better than what
it replaced, a nested router definition scattered across different files.&lt;&#x2F;p&gt;
&lt;p&gt;Claude&#x27;s take: that pattern removed the router as a navigable index. An agent uses the router to get an overview of
available endpoints and to target changes precisely. My clever colocation made the codebase harder for an agent to
reason about.&lt;&#x2F;p&gt;
&lt;p&gt;You could argue a conventional router offers the same navigable overview to any human unfamiliar with the project, but
for me, as the main maintainer, the benefits outweighed the cost. It was certainly an improvement over what I had before:
the macro removed a lot of boilerplate, and the router was only one part of what it simplified.&lt;&#x2F;p&gt;
&lt;p&gt;I realised that in making the code better for me, I&#x27;d removed something that the agent expected to be there. And the
clever metaprogramming that made things better for a human reader had made it more difficult for the AI agent.&lt;&#x2F;p&gt;
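&lt;p&gt;The project&#x27;s actual macro isn&#x27;t shown here, but the trade-off can be sketched in a few lines of Python. Both styles register the same endpoints; the difference is whether a single, boring table of routes exists for an agent to read.&lt;&#x2F;p&gt;

```python
# Illustrative sketch only: not the project's actual macro or framework.

# Colocated style: each handler registers its own route where it is
# defined. Pleasant for the maintainer reading one file, but the set of
# endpoints is assembled at import time, scattered across modules, with
# no single source file an agent can open to see them all.
routes = {}

def route(path):
    def register(handler):
        routes[path] = handler
        return handler
    return register

@route("/users")
def list_users():
    return ["alice", "bob"]

# Central-table style: repetitive boilerplate, but the table itself is a
# navigable index. An agent (or an unfamiliar human) reads one literal
# and knows every endpoint in the application.
def get_orders():
    return []

ROUTING_TABLE = {
    "/users": list_users,
    "/orders": get_orders,
}
```

&lt;p&gt;Both produce the same mapping at runtime; only the second leaves a static index behind for a reader to navigate.&lt;&#x2F;p&gt;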
&lt;p&gt;That&#x27;s a small thing. But it opened a bigger question.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;down-the-rabbit-hole&quot;&gt;Down the rabbit hole&lt;&#x2F;h2&gt;
&lt;p&gt;I found a paper from April 2026,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arxiv.org&#x2F;html&#x2F;2604.07502&quot;&gt;&quot;Beyond Human-Readable: Rethinking Software Engineering Conventions for the Agentic Development Era&quot;&lt;&#x2F;a&gt;,
that was asking the same question. Their core argument: many practices we treat as anti-patterns may actually be virtues
when agents are the primary consumers of the code. They even proposed a &quot;program skeleton&quot;, a navigable high-level index
of the codebase, which is essentially what my custom macro had removed.&lt;&#x2F;p&gt;
&lt;p&gt;That prompted me to run the same experiment on other codebases I work with. The router wasn&#x27;t a one-off: similar
patterns kept showing up across different projects.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;who-is-the-code-for&quot;&gt;Who is the code for?&lt;&#x2F;h2&gt;
&lt;p&gt;Developers are starting to talk about what happens when we stop reading code. I&#x27;ve
&lt;a href=&quot;&#x2F;blog&#x2F;stop-reading-the-code&#x2F;&quot;&gt;written about this&lt;&#x2F;a&gt; myself: the scaling problem, the cognitive limits (400 lines an hour,
60 minutes before quality drops off a cliff), the increase in code volume that AI agents produce.&lt;&#x2F;p&gt;
&lt;p&gt;Most of that conversation focuses on how we maintain quality if we can&#x27;t read everything.&lt;&#x2F;p&gt;
&lt;p&gt;And there&#x27;s a compounding problem. Even when humans do review agent code, the quality of that review degrades. AI output
follows similar patterns, and reviewers start to skim rather than properly
analyse. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;asyncsquadlabs.com&#x2F;blog&#x2F;code-review-bottleneck-ai-era&#x2F;&quot;&gt;Template blindness&lt;&#x2F;a&gt;
sets in: everything looks plausible, so subtle bugs slip through. Human review isn&#x27;t just failing to scale, it&#x27;s
getting less reliable on the code it does cover.&lt;&#x2F;p&gt;
&lt;p&gt;What if the things we&#x27;ve always valued in code aren&#x27;t what matter any more?&lt;&#x2F;p&gt;
&lt;p&gt;We have decades of received wisdom about what makes code good. Clean abstractions. DRY. Colocation of related concerns.
Patterns that make the codebase a pleasure to navigate, if you&#x27;re a human holding the whole thing in your head.&lt;&#x2F;p&gt;
&lt;p&gt;But the code isn&#x27;t primarily for humans any more. If agents are writing it and agents are working with it, then &quot;good
code&quot; means something different.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;where-human-taste-and-agent-needs-diverge&quot;&gt;Where human taste and agent needs diverge&lt;&#x2F;h2&gt;
&lt;p&gt;I found a few common patterns from my research where things I&#x27;d instinctively do as a human developer worked against how
agents navigate and reason about code.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Boilerplate as signal.&lt;&#x2F;strong&gt; Humans like removing boilerplate. But that boilerplate is often the structural signal an
agent relies on to orient itself. A standard router definition is repetitive to read, but it&#x27;s instantly parseable. My
custom macro removed that repetition and, with it, the navigability.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Metaprogramming vs. common patterns.&lt;&#x2F;strong&gt; A custom DSL or macro is a delight once you learn it. But agents are trained on
millions of examples of conventional code; a bespoke abstraction they&#x27;ve never seen costs them context every time they touch it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Implicit conventions vs. explicit structure.&lt;&#x2F;strong&gt; Relying on things &quot;you just know&quot; doesn&#x27;t work for an agent that has to
rebuild its context every turn.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;DRY vs. tolerable duplication.&lt;&#x2F;strong&gt; Humans instinctively factor out repetition. But for agents, a repeated function
self-contained in each file is easier to reason about than a shared abstraction they have to trace across the codebase.
The indirection costs more than the duplication.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Global elegance vs. local self-explanation.&lt;&#x2F;strong&gt; Agents don&#x27;t hold the whole codebase in their head. They reward code
that makes sense locally, file by file.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-are-humans-actually-reviewing-for&quot;&gt;What are humans actually reviewing for?&lt;&#x2F;h2&gt;
&lt;p&gt;I still review, and sometimes it pays off.&lt;&#x2F;p&gt;
&lt;p&gt;I had the agent working on a Rust project recently. &lt;code&gt;sqlx&lt;&#x2F;code&gt; gives you compile-time SQL checking: the compiler validates
your queries against the actual database schema. The agent hit a build error because the database hadn&#x27;t had migrations
applied. Rather than fix the migration issue, it quietly switched from the compile-time macros to the runtime query
functions. Build passed. But I&#x27;d lost a compile-time guarantee, replaced with a runtime check that would only fail in
production.&lt;&#x2F;p&gt;
&lt;p&gt;I always instruct my agent to keep a deviation log, and luckily, that caught it. I didn&#x27;t have to review every line
because the agent flagged where it diverged from the plan and why. That&#x27;s the kind of thing humans should be catching.
That&#x27;s a different job from &quot;does this code look clean to me.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Looking back, that was a context failure. The agent&#x27;s priority was to finish the task, and it did, by trading a
compile-time guarantee for a runtime one. The fix here isn&#x27;t more human review. It&#x27;s making the constraints explicit in
the context.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-uncomfortable-conclusion&quot;&gt;The uncomfortable conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A year ago I wouldn&#x27;t have believed this. But, as I&#x27;m becoming more comfortable with the idea that I won&#x27;t be reading
all the code, I&#x27;m now making the uncomfortable assertion that perhaps humans aren&#x27;t qualified to review agent code
anyway.&lt;&#x2F;p&gt;
&lt;p&gt;If agents are the primary consumers of the codebase, then optimising for human readability might actively make things
worse. Some of what we&#x27;d flag in code review (repetition instead of abstraction, explicit over implicit)
might be exactly what an agent needs. And some of what we&#x27;d praise (elegant metaprogramming, terse abstractions,
patterns that feel clever) might be actively hostile to agent workflows.&lt;&#x2F;p&gt;
&lt;p&gt;Quality hasn&#x27;t gone away; it means something different now. Quality is whether the code is correct,
verifiable, and productive for whoever is working with it. Right now, that&#x27;s increasingly not us.&lt;&#x2F;p&gt;
&lt;aside class=&quot;aside-callout&quot;&gt;
  &lt;span class=&quot;aside-callout__label&quot;&gt;Aside&lt;&#x2F;span&gt;
  &lt;p&gt;There&#x27;s a gap here I don&#x27;t have a clean answer for. Agents have local coherence but not global coherence. They make
changes that work within the files they can see, but they don&#x27;t hold the shape of the whole system across sessions. In
theory, that&#x27;s still a human job. But if I&#x27;m arguing the code isn&#x27;t for humans, I can&#x27;t also argue humans should be
reading it for structural consistency. And if the instincts we&#x27;d bring to that review push us toward patterns that work
against the agent, we might be making things worse. My best guess at the moment is that the things that actually
matter (broken integrations, contradictory assumptions across modules) should be caught by deterministic checks, not
eyeballs.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;what-next&quot;&gt;What next?&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m still working this out. I don&#x27;t yet feel comfortable not reading the code, so I still review it, though I know
I&#x27;ll do so less and less as models and agents improve. When I do review, I try to catch myself before applying
outdated human opinions about what good code looks like.&lt;&#x2F;p&gt;
&lt;p&gt;We need to let go of what &quot;good code&quot; used to mean to us.&lt;&#x2F;p&gt;
&lt;p&gt;What should we be doing?&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;better specifications: making constraints explicit in the context so the agent doesn&#x27;t have to guess your priorities&lt;&#x2F;li&gt;
&lt;li&gt;stronger verification: type systems, compile-time checks, integration tests. If you&#x27;re not reading every line, you
need deterministic checks&lt;&#x2F;li&gt;
&lt;li&gt;production-like testing: staging environments that stress the system under realistic conditions. Deterministic checks
prove correctness in isolation, but you need to know the whole thing holds together before it ships&lt;&#x2F;li&gt;
&lt;li&gt;observability: instrumentation to catch unexpected behaviour&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;We don&#x27;t actually know yet what makes code optimally readable for agents. Anything we do spot will be a property of
today&#x27;s models and tooling, not a deep truth.&lt;&#x2F;p&gt;
&lt;p&gt;The craft hasn&#x27;t gone yet. It&#x27;s moving from the aesthetics of the code to the quality of the specification and the
strength of the verification. The hardest part isn&#x27;t adopting new tools. It&#x27;s unlearning the instincts that made you
good at the old ones.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Why AI Fails at Scale</title>
        <published>2026-03-25T00:00:00+00:00</published>
        <updated>2026-03-25T00:00:00+00:00</updated>
        
        <author>
          <name>Unknown</name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/why-ai-fails-at-scale/"/>
        <id>https://daz.is/blog/why-ai-fails-at-scale/</id>
        
        <content type="html" xml:base="https://daz.is/blog/why-ai-fails-at-scale/">&lt;p&gt;Last week I read
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.theregister.com&#x2F;2026&#x2F;03&#x2F;17&#x2F;ai_businesses_faking_it_reckoning_coming_codestrap&#x2F;&quot;&gt;AI still doesn&#x27;t work very well, businesses are faking it, and a reckoning is coming&lt;&#x2F;a&gt;
in The Register. The picture it paints is bleak: enterprise AI is mostly failing, the metrics are gamed, and the bill is
coming due. I&#x27;ve been trying to reconcile that with my own experience, because my experience has been the opposite. AI
coding agents have been a massive performance gain for me. They&#x27;ve made me a better developer.&lt;&#x2F;p&gt;
&lt;p&gt;In December 2025, Claude Code and Opus 4.5 crossed a threshold. The models got good enough that agents actually worked
well. I started experimenting in December, and by January I&#x27;d gone deep. That&#x27;s when I realised my old way of working
was dead. I had to mourn my craft of 25 years while adapting.&lt;&#x2F;p&gt;
&lt;p&gt;It hasn&#x27;t been smooth. I&#x27;ve had failures. Outputs I threw away, approaches that turned into dead ends, entire days where
the AI confidently produced something that looked right but wasn&#x27;t. The difference is that I&#x27;ve learned to treat those
failures as signal. I feed them back into how I work. And that&#x27;s made all the difference.&lt;&#x2F;p&gt;
&lt;p&gt;The Register article isn&#x27;t wrong. The numbers are bad. But my experience says something different is possible. I&#x27;ve
been trying to make sense of that gap.&lt;&#x2F;p&gt;
&lt;p&gt;Two ideas. The first is a way to map the problem space: a simple quadrant that clarifies what kind of problem you&#x27;re
actually trying to solve with AI. The second is the outer loop: the feedback practice where you keep improving &lt;em&gt;how&lt;&#x2F;em&gt;
you work with AI over time, not just what the AI outputs.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-numbers-if-you-want-them&quot;&gt;The numbers, if you want them&lt;&#x2F;h2&gt;
&lt;p&gt;The scale of failure is well-documented.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fortune.com&#x2F;2025&#x2F;08&#x2F;18&#x2F;mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo&#x2F;&quot;&gt;MIT&#x27;s &lt;em&gt;GenAI
Divide&lt;&#x2F;em&gt; report&lt;&#x2F;a&gt;
found 95% of enterprise AI pilots delivered no measurable P&amp;amp;L impact. A
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.pertamapartners.com&#x2F;insights&#x2F;ai-project-failure-statistics-2026&quot;&gt;broader analysis&lt;&#x2F;a&gt; across 2,400+
initiatives put the figure at over 80% of an estimated $684 billion in 2025 AI investment failing to deliver intended
value. 42% of companies
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.theregister.com&#x2F;2026&#x2F;03&#x2F;17&#x2F;ai_businesses_faking_it_reckoning_coming_codestrap&#x2F;&quot;&gt;scrapped most of their AI initiatives&lt;&#x2F;a&gt;
in 2025, up from 17% the year before.&lt;&#x2F;p&gt;
&lt;p&gt;And yet workers
at &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fortune.com&#x2F;2025&#x2F;08&#x2F;19&#x2F;shadow-ai-economy-mit-study-genai-divide-llm-chatbots&#x2F;&quot;&gt;90% of companies surveyed&lt;&#x2F;a&gt;
report using personal AI tools daily, often outperforming the corporate tools their employers are spending millions on.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-quadrant-mapping-the-problem-space&quot;&gt;The quadrant: mapping the problem space&lt;&#x2F;h2&gt;
&lt;p&gt;When I hear about people getting mixed results with AI, the first thing I wonder is: what kind of problem were they
trying to solve? Because not all problems are the same, and I think a lot of failure comes from not being clear about
this upfront.&lt;&#x2F;p&gt;
&lt;p&gt;Two dimensions matter:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Data: structured vs. unstructured.&lt;&#x2F;strong&gt; Is the input clean, tabular, and well-defined? Or is it messy, ambiguous, and
human-generated?&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Process: deterministic vs. non-deterministic.&lt;&#x2F;strong&gt; Does the workflow demand the same outcome every time? Or does it
require judgement, interpretation, and tolerance for variation?&lt;&#x2F;p&gt;
&lt;p&gt;This gives you four zones:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;why-ai-fails-at-scale&#x2F;quadrant.png&quot; alt=&quot;automation quadrants&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
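&lt;p&gt;The quadrant is simple enough to express directly. A sketch, where the zone names paraphrase the section headings below:&lt;&#x2F;p&gt;

```python
# The two axes as booleans; the four zones as outcomes.
# Zone names paraphrase this post's section headings.
def zone(structured_data, deterministic_process):
    if structured_data and deterministic_process:
        return "just automate it"          # traditional automation wins
    if structured_data:
        return "AI sweet spot"             # solid data, varying judgement
    if deterministic_process:
        return "interpret, then execute"   # AI classifies, rules act
    return "the frontier"                  # open-ended data and process
```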
&lt;h3 id=&quot;structured-data-deterministic-process-just-automate-it&quot;&gt;Structured data + deterministic process: just automate it&lt;&#x2F;h3&gt;
&lt;p&gt;Data integrations. Asset delivery. Metadata pipelines. Compliance reporting. I&#x27;ve built a lot of these over the years:
structured inputs, defined schemas, the correct answer is the same every time. Most enterprise integration work lives
here. You&#x27;re mapping fields, transforming formats, moving data between systems. Traditional automation handles this
well. AI introduces risk for no gain. You don&#x27;t want a probabilistic answer to a field mapping.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;zapier.com&#x2F;blog&#x2F;deterministic-ai&#x2F;&quot;&gt;Forrester found&lt;&#x2F;a&gt; gen AI still orchestrates less than 1% of core business
processes. Conventional automation still runs most of this work, and it should.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;structured-data-non-deterministic-process-ai-s-sweet-spot-today&quot;&gt;Structured data + non-deterministic process: AI&#x27;s sweet spot today&lt;&#x2F;h3&gt;
&lt;p&gt;Matching people and experience to proposals. Cleaning and standardising company data. Reporting and analytics. Content
recommendation. Risk scoring. Scheduling. The data is solid but the judgement about what to do with it varies. AI can
find patterns humans miss, and structured data gives it something reliable to work with. MIT found the biggest ROI in
exactly this zone: back-office automation, cutting outsourcing costs, streamlining operations. Vendor-built tools in
this
space &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fortune.com&#x2F;2025&#x2F;08&#x2F;18&#x2F;mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo&#x2F;&quot;&gt;succeed about 67% of the time&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unstructured-data-deterministic-process-interpret-then-execute&quot;&gt;Unstructured data + deterministic process: interpret, then execute&lt;&#x2F;h3&gt;
&lt;p&gt;Email triage. Document classification. Compliance screening. Contract review. The input is messy but the downstream
workflow is rule-based. AI handles the interpretation; deterministic logic handles what happens next. This is the hybrid
pattern: AI reads and classifies, then rules enforce the outcome. It works well when you get the boundary right.
Salesforce has been &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.salesforce.com&#x2F;blog&#x2F;deterministic-ai&#x2F;&quot;&gt;shifting toward exactly this architecture&lt;&#x2F;a&gt; in
Agentforce, combining LLM flexibility with rule-based execution.&lt;&#x2F;p&gt;
&lt;p&gt;This is also classic ML territory. Supervised learning was purpose-built for this: take unstructured input, classify it
into a structured category, hand off to a deterministic system. Spam detection, sentiment analysis, fraud scoring, image
recognition. A fine-tuned BERT model will often do this faster, cheaper, and more reliably than a generative model. Not
every AI problem needs an LLM. The hype has pulled organisations toward frontier tools for problems that a classifier
would handle better, and that mismatch accounts for a lot of the failure.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;unstructured-data-non-deterministic-process-the-frontier&quot;&gt;Unstructured data + non-deterministic process: the frontier&lt;&#x2F;h3&gt;
&lt;p&gt;Coding. Strategy. Creative work. Research. Novel problem-solving. Both the input and the process are open-ended. This is
where individual power users report the biggest gains and where enterprise failure rates are highest.&lt;&#x2F;p&gt;
&lt;p&gt;Andrej Karpathy describes the challenge here as the
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;venturebeat.com&#x2F;technology&#x2F;karpathys-march-of-nines-shows-why-90-ai-reliability-isnt-even-close-to&quot;&gt;March of Nines&lt;&#x2F;a&gt;. The maths
is simple but brutal. Imagine a 10-step agentic workflow where each step succeeds 90% of the time. That sounds decent.
But 0.9^10 is roughly 0.35. Your end-to-end success rate is 35%. That&#x27;s your demo. It looks impressive when it works, and it
fails quietly most of the time.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;why-ai-fails-at-scale&#x2F;march-of-nines.png&quot; alt=&quot;march of nines&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Getting from 90% to 99% per step is one order of magnitude of effort. Getting from 99% to 99.9% is another, roughly
equal in difficulty. Each additional nine costs about as much engineering as the last one. At 99% per step, your 10-step
workflow lands at 90% end-to-end. At 99.9%, you&#x27;re at 99%. Production-grade reliability means marching through those
nines, and each one takes real engineering to reach.&lt;&#x2F;p&gt;
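&lt;p&gt;The arithmetic is easy to check for yourself. Assuming independent steps, end-to-end reliability is just the per-step rate raised to the number of steps:&lt;&#x2F;p&gt;

```python
# End-to-end success of a sequential workflow, assuming each step
# succeeds independently with the same probability.
def end_to_end(per_step, steps=10):
    return per_step ** steps

# Each extra nine per step buys roughly one extra nine end to end:
# 0.90 per step over 10 steps gives about 0.35
# 0.99 per step gives about 0.90
# 0.999 per step gives about 0.99
for p in (0.90, 0.99, 0.999):
    print(p, round(end_to_end(p), 2))
```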
&lt;p&gt;Prompting and agent skills get you to 90%. They&#x27;re necessary but not sufficient. The remaining nines come from &lt;strong&gt;harness
engineering&lt;&#x2F;strong&gt;: putting AI systems on deterministic rails. Validation at each step. State management so you can resume or
retry. Programmatic control over what the model can and can&#x27;t do. Structured outputs. Assertions on intermediate
results. The kind of engineering that isn&#x27;t exciting but makes the difference between a demo and a system you can trust.&lt;&#x2F;p&gt;
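&lt;p&gt;In miniature, and purely as a hypothetical sketch (none of these names come from a real framework), harness engineering means every probabilistic step runs inside a deterministic wrapper that validates the result, retries on failure, and records completed steps so a workflow can resume rather than rerun:&lt;&#x2F;p&gt;

```python
# Hypothetical harness sketch: deterministic rails around a
# non-deterministic step. Not a real framework's API.
def run_step(name, produce, validate, state, retries=3):
    """Run one workflow step with validation, retry, and resumable state."""
    if name in state:              # completed in a previous run: resume, don't redo
        return state[name]
    for attempt in range(retries):
        result = produce()         # the probabilistic part, e.g. a model call
        if validate(result):       # the deterministic check on its output
            state[name] = result   # persist so a rerun skips this step
            return result
    raise RuntimeError(f"step {name!r} failed validation after {retries} attempts")
```

&lt;p&gt;Each validate call is one of the assertions on intermediate results; the state dict is the minimum viable version of resumable state management.&lt;&#x2F;p&gt;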
&lt;p&gt;This is why the frontier quadrant has the highest failure rate. The compounding error problem is intrinsic to multi-step
AI workflows, and most organisations stop at the demo. They see 90% per step and call it good enough. They don&#x27;t invest
in the harness engineering that turns a promising prototype into something reliable. And when it fails in production,
they blame the model.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-frontier-is-moving&quot;&gt;The frontier is moving&lt;&#x2F;h3&gt;
&lt;p&gt;The difference between success and failure in this quadrant isn&#x27;t the model, it&#x27;s the engineering around it. And I
believe this quadrant is going to keep expanding: any work that can be codified the way software can is vulnerable to
the same kind of disruption. But only if people learn to work with the AI effectively, which most haven&#x27;t yet.&lt;&#x2F;p&gt;
&lt;p&gt;This is the frontier, and it&#x27;s moving quickly: better models, better tooling around them, and a growing understanding of
how to actually work with them.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-outer-loop-getting-better-at-getting-better&quot;&gt;The outer loop: getting better at getting better&lt;&#x2F;h2&gt;
&lt;p&gt;Most talk about AI in practice focuses on the &lt;strong&gt;inner loop&lt;&#x2F;strong&gt;: the agent loop. The model receives a prompt, reasons,
takes actions, gets feedback, iterates. This is the cycle inside the tool. It&#x27;s what the AI does.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;outer loop&lt;&#x2F;strong&gt; is what &lt;em&gt;you&lt;&#x2F;em&gt; do. It&#x27;s the feedback practice where you evaluate whether your way of working with AI
is actually producing good outcomes, and then adapt your process based on what you learn.&lt;&#x2F;p&gt;
&lt;aside class=&quot;aside-callout&quot;&gt;
  &lt;span class=&quot;aside-callout__label&quot;&gt;Aside&lt;&#x2F;span&gt;
  &lt;p&gt;A note on the term: inner&#x2F;outer loop shows up in several places and means different things depending on
who&#x27;s talking. In DevOps it&#x27;s the local dev cycle vs. CI&#x2F;CD. Jeff Huber uses it for
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;jeffhuber.substack.com&#x2F;p&#x2F;the-rise-of-context-engineering&quot;&gt;context engineering&lt;&#x2F;a&gt;: the inner loop assembles
context for this generation step, the outer loop improves your context pipeline over time. Gene Kim and Steve Yegge
propose a
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;itrevolution.com&#x2F;articles&#x2F;the-three-developer-loops-a-new-framework-for-ai-assisted-coding&#x2F;&quot;&gt;three-loop model&lt;&#x2F;a&gt;
in their &lt;em&gt;Vibe Coding&lt;&#x2F;em&gt; book, where the outer loop is strategic architecture. Kief Morris at Thoughtworks writes about
humans being &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;martinfowler.com&#x2F;articles&#x2F;exploring-gen-ai&#x2F;humans-and-agents.html&quot;&gt;&quot;on the loop&quot;&lt;&#x2F;a&gt;, maintaining the
harness rather than supervising every output. I&#x27;m using the term at a different level from all of these. Not the system
getting better at context, or the organisation managing architecture, but the &lt;em&gt;practitioner&lt;&#x2F;em&gt; getting better at working
with AI through deliberate reflection on outcomes.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;p&gt;I use AI, and sometimes it fails. I examine why, and I adjust my process: how I structure the task, what I verify, where I
intervene, what I delegate. Over time, this compounds. I&#x27;m not just using a tool; I&#x27;m developing a practice.&lt;&#x2F;p&gt;
&lt;p&gt;The model and the harness improve together. The way I work with AI adapts as the model changes. Some of my process will
probably be redundant when models improve further. Maybe. I&#x27;m not sure which parts yet, and that uncertainty is exactly
why the outer loop matters. Without it, you&#x27;re either clinging to a process that has become overhead, or you&#x27;re
abandoning discipline still earning its keep. You can&#x27;t tell which is which unless you&#x27;re paying attention to outcomes.&lt;&#x2F;p&gt;
&lt;p&gt;As
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.theregister.com&#x2F;2026&#x2F;03&#x2F;17&#x2F;ai_businesses_faking_it_reckoning_coming_codestrap&#x2F;&quot;&gt;the Codestrap founders put it&lt;&#x2F;a&gt;,
companies are measuring lines of code and pull requests (activity metrics) instead of deployment frequency, change
failure rate, and mean time to restore (outcome metrics). Without measuring outcomes, there&#x27;s no signal to feed back
into process improvement. The inner loop runs, produces outputs, and nobody asks whether the whole system is actually
working.&lt;&#x2F;p&gt;
&lt;p&gt;Rich Sutton&#x27;s
&lt;a rel=&quot;external&quot; href=&quot;http:&#x2F;&#x2F;www.incompleteideas.net&#x2F;IncIdeas&#x2F;BitterLesson.html&quot;&gt;&lt;em&gt;The Bitter Lesson&lt;&#x2F;em&gt;&lt;&#x2F;a&gt;
is relevant here: over 70 years, general methods that leverage computation have consistently beaten
elaborate human-designed scaffolding. As models improve, some of the scaffolding you built becomes overhead. But it&#x27;s
not a simple story. Prompt injection isn&#x27;t solved. Hallucination isn&#x27;t solved. Deterministic verification matters
where you can get it. Scaffolding that constrains the problem space or validates outputs isn&#x27;t fighting the model,
it&#x27;s good engineering. The outer loop is the practice of continuously reassessing what&#x27;s earning its keep. Scaffolding
isn&#x27;t categorically good or bad.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;putting-it-together&quot;&gt;Putting it together&lt;&#x2F;h2&gt;
&lt;p&gt;Before deploying AI, understand the nature of the problem. Is the data structured or messy? Is the process rule-based or
judgement-based?&lt;&#x2F;p&gt;
&lt;p&gt;The research points to several reasons AI initiatives fail: poor data quality, lack of clear success metrics, losing
executive sponsorship, treating AI as an IT project rather than a business transformation. These all matter. But one
pattern that the quadrant helps explain is tool-problem mismatch:
deploying frontier AI into problems that need traditional automation, or throwing chatbots at work that demands
carefully engineered hybrid systems. It&#x27;s not the only cause of failure, but it&#x27;s one that clearer thinking upfront can
prevent.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Match the harness to the zone.&lt;&#x2F;strong&gt; Each quadrant needs a different approach. Deterministic problems need deterministic
tools. Hybrid problems need a clear boundary between what the AI interprets and what rules enforce. Frontier problems
need harness engineering that earns each nine of reliability, with human judgement actually in the loop.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Run the outer loop.&lt;&#x2F;strong&gt; Whatever quadrant you&#x27;re in, build a feedback mechanism that evaluates outcomes, not activity.
Are you actually shipping better code, making better decisions, producing better analysis? If you&#x27;re not measuring this,
you don&#x27;t know whether AI is helping or generating expensive noise.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Hold the tensions.&lt;&#x2F;strong&gt; The Bitter Lesson says don&#x27;t over-engineer scaffolding the model will outgrow. But deterministic
verification genuinely matters where you can get it. Prompt injection and hallucination are unsolved. Models are getting
better fast, but &quot;better&quot; doesn&#x27;t mean &quot;trustworthy in all contexts.&quot; The outer loop is how you navigate these tensions,
by continuously reassessing what&#x27;s working rather than picking a side and staying there.&lt;&#x2F;p&gt;
&lt;p&gt;The 95% failure rate isn&#x27;t a verdict on AI. It&#x27;s a verdict on how organisations are thinking about it, or not thinking
about it clearly enough. The people succeeding aren&#x27;t using better models. They&#x27;re thinking more clearly about where AI
fits, and they&#x27;re learning and iterating toward getting the best outcomes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;further-reading&quot;&gt;Further reading&lt;&#x2F;h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;fortune.com&#x2F;2025&#x2F;08&#x2F;18&#x2F;mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo&#x2F;&quot;&gt;MIT NANDA, &lt;em&gt;GenAI
Divide&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; (2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.pertamapartners.com&#x2F;insights&#x2F;ai-project-failure-statistics-2026&quot;&gt;Pertama Partners, AI project failure statistics&lt;&#x2F;a&gt; (2026)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.theregister.com&#x2F;2026&#x2F;03&#x2F;17&#x2F;ai_businesses_faking_it_reckoning_coming_codestrap&#x2F;&quot;&gt;The Register &#x2F; Codestrap on enterprise AI failures&lt;&#x2F;a&gt; (2026)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;http:&#x2F;&#x2F;www.incompleteideas.net&#x2F;IncIdeas&#x2F;BitterLesson.html&quot;&gt;Rich Sutton, &lt;em&gt;The Bitter Lesson&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; (2019)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;venturebeat.com&#x2F;technology&#x2F;karpathys-march-of-nines-shows-why-90-ai-reliability-isnt-even-close-to&quot;&gt;Andrej Karpathy, The March of Nines&lt;&#x2F;a&gt; (2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.oneusefulthing.org&#x2F;p&#x2F;the-bitter-lesson-versus-the-garbage&quot;&gt;Ethan Mollick, &lt;em&gt;The Bitter Lesson vs. the
Garbage&lt;&#x2F;em&gt;&lt;&#x2F;a&gt; (2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;servicepath.co&#x2F;2025&#x2F;09&#x2F;ai-integration-crisis-enterprise-hybrid-ai&#x2F;&quot;&gt;ServicePath, deterministic guardrails for AI&lt;&#x2F;a&gt; (2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;zapier.com&#x2F;blog&#x2F;deterministic-ai&#x2F;&quot;&gt;Zapier, hybrid deterministic + non-deterministic architectures&lt;&#x2F;a&gt; (2026)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.salesforce.com&#x2F;blog&#x2F;deterministic-ai&#x2F;&quot;&gt;Salesforce, rule-based execution in Agentforce&lt;&#x2F;a&gt; (2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.cio.com&#x2F;article&#x2F;4114010&#x2F;2026-the-year-ai-roi-gets-real.html&quot;&gt;CIO.com, enterprise AI ROI pressure&lt;&#x2F;a&gt; (2026)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Five Rewrites in a Week</title>
        <published>2026-03-18T00:00:00+00:00</published>
        <updated>2026-03-18T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/five-rewrites-in-a-week/"/>
        <id>https://daz.is/blog/five-rewrites-in-a-week/</id>
        
        <content type="html" xml:base="https://daz.is/blog/five-rewrites-in-a-week/">&lt;p&gt;I recently built an internal data tool my team needed. Six months ago it wouldn&#x27;t have been viable. It would have just
been manual work. Nobody would have dedicated engineering time to automate this task before. But with Claude Code and
Opus 4.6, I built a production-ready tool in days that replaced all of that manual work.&lt;&#x2F;p&gt;
&lt;p&gt;AI changed the economics enough to make it worth doing. Is it worth the engineering investment for a tool that won&#x27;t be
needed long term? Before AI, the answer was no. The tool would not have got built. The work would have stayed manual
with some SQL and Excel.&lt;&#x2F;p&gt;
&lt;p&gt;Anish Acharya at a16z coined the term &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;a16z.com&#x2F;disposable-software&#x2F;&quot;&gt;&quot;disposable software&quot;&lt;&#x2F;a&gt; to describe how
software creation used to be constrained by ROI, but is now constrained by imagination. His examples are mostly consumer
and personal. The enterprise version of this argument is more consequential: internal tools that cross the ROI threshold
because the cost of development is now so much lower.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-first-prototype&quot;&gt;The first prototype&lt;&#x2F;h2&gt;
&lt;p&gt;It started as a simple Python script. Load data into memory, process, and output the results. This is where the tool
would have stopped in the old world. A script on someone&#x27;s laptop. Maybe a Notion page explaining how to run it. Good
enough. Move on.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;fast-iteration&quot;&gt;Fast iteration&lt;&#x2F;h2&gt;
&lt;p&gt;Each step below is a decision gate where someone would traditionally ask &quot;is this worth the effort?&quot; None of this would
have been worth the effort without AI, but with AI it meant I could quickly iterate through multiple prototypes and not
be scared about throwing away code.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;might-as-well-put-a-ui-on-this&quot;&gt;&quot;Might as well put a UI on this&quot;&lt;&#x2F;h3&gt;
&lt;p&gt;I moved to Rust. I&#x27;m an experienced Rust developer, so it was a natural move, and Claude Code made quick work of
translating the code from Python. Rust is the right tool for a memory-sensitive data processing task, and the core logic
was already well understood from the Python prototype, so rewriting it was cheap. The app now serves up a simple HTML
form.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;this-needs-to-scale&quot;&gt;&quot;This needs to scale&quot;&lt;&#x2F;h3&gt;
&lt;p&gt;The first version loaded everything into memory. That won&#x27;t work with real data volumes and the memory constraints of a
containerised environment. I needed a streaming diff algorithm. I had a vague idea of how it should work, but I didn&#x27;t
have to spend long working out the details: as I started explaining it to Claude Code, it could fill in
the gaps. I&#x27;m directing, but not &lt;a href=&quot;&#x2F;blog&#x2F;ai-engineer&#x2F;&quot;&gt;vibe coding&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;we-need-to-bulk-run-this&quot;&gt;&quot;We need to bulk run this&quot;&lt;&#x2F;h3&gt;
&lt;p&gt;Bolt on a CLI interface for batch operations. Straightforward addition but another thing that wouldn&#x27;t have been worth
the effort if I was implementing it manually. This will potentially save a lot more manual work. Who knows? It might not
get used, but it was a simple addition to the spec.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;this-needs-to-run-in-production&quot;&gt;&quot;This needs to run in production&quot;&lt;&#x2F;h3&gt;
&lt;p&gt;There&#x27;s a lot that needs adding to turn a simple tool into a production-ready tool. Container config, observability,
workers, cloud infrastructure, IAM roles, secrets management. The boilerplate of getting something actually running.&lt;&#x2F;p&gt;
&lt;p&gt;This is where I&#x27;ve used another AI coding pattern called &lt;strong&gt;&quot;style transfer&quot;&lt;&#x2F;strong&gt;. Point Claude Code at an existing
production service and say &quot;make it like that.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Infrastructure config encodes institutional knowledge. How your organisation organises the configurations needed, what
your deployment conventions are, how you handle secrets. AI pattern-matches against existing services without you having
to write a docs page or copy-paste configs manually. You get something that follows your org&#x27;s conventions because it
learned them from a working example.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-the-research-is-showing&quot;&gt;What the research is showing&lt;&#x2F;h2&gt;
&lt;p&gt;Anthropic&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.anthropic.com&#x2F;research&#x2F;agentic-coding-trends&quot;&gt;2026 Agentic Coding report&lt;&#x2F;a&gt; describes tasks that
required weeks of cross-team coordination becoming focused working sessions. MIT Technology
Review &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.technologyreview.com&#x2F;2025&#x2F;12&#x2F;16&#x2F;1108441&#x2F;ai-coding-is-now-everywhere-which-means-you-need-to-know-what-it-can-and-cant-do&#x2F;&quot;&gt;reported&lt;&#x2F;a&gt;
on developers surrendering control over individual lines and focusing on overall architecture.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;rust-as-an-ai-augmented-development-language&quot;&gt;Rust as an AI-augmented development language&lt;&#x2F;h2&gt;
&lt;p&gt;The Rust compiler is essentially a second reviewer for AI-generated code. When Claude Code writes Rust, it gets
immediate, precise, actionable feedback from &lt;code&gt;cargo check&lt;&#x2F;code&gt;. Memory safety, lifetime issues, ownership violations, all
caught at compile time, not in production.&lt;&#x2F;p&gt;
&lt;p&gt;The tight feedback loop matters enormously for AI agents. The compiler doesn&#x27;t just say &quot;error.&quot; It says what&#x27;s wrong,
where, and often how to fix it. That&#x27;s ideal for an agentic coding tool iterating in a loop.&lt;&#x2F;p&gt;
&lt;p&gt;AI plus Rust&#x27;s compiler creates a verification pipeline that lets you trust AI-generated code faster. I wrote about the
importance of &lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;verification in my process&lt;&#x2F;a&gt;
previously. The compiler is an automated verification step that runs on every iteration. And the compiler error messages
with Rust are brilliant and help AI coding agents track down and fix issues much faster.&lt;&#x2F;p&gt;
&lt;p&gt;Language choice for AI-augmented development should optimise for the strength of the automated verification feedback
loop.&lt;&#x2F;p&gt;
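&lt;p&gt;The loop itself is simple enough to sketch. This is a hypothetical harness outline, not my actual setup: &lt;code&gt;cargo_check&lt;&#x2F;code&gt; and &lt;code&gt;ask_agent_to_fix&lt;&#x2F;code&gt; are stand-ins for whatever compiler invocation and LLM call your harness makes.&lt;&#x2F;p&gt;

```python
# Hypothetical sketch of a compiler-in-the-loop harness. The names
# cargo_check and ask_agent_to_fix are illustrative stand-ins.
import subprocess

def cargo_check(project_dir):
    """Run `cargo check` and return (exit_code, diagnostics)."""
    result = subprocess.run(
        ["cargo", "check", "--message-format=short"],
        cwd=project_dir, capture_output=True, text=True,
    )
    return result.returncode, result.stderr

def check_until_clean(run_check, ask_agent_to_fix, max_rounds=5):
    for _ in range(max_rounds):
        code, diagnostics = run_check()
        if code == 0:
            return True  # it compiles: a whole class of bugs is ruled out
        # Rust diagnostics name the file, the line, and often the fix,
        # which is exactly the feedback an agent needs to iterate on.
        ask_agent_to_fix(diagnostics)
    return False
```

&lt;p&gt;Every iteration gets a free, precise verification pass before a human ever looks at anything.&lt;&#x2F;p&gt;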
&lt;p&gt;Others are arriving at the same conclusion independently. Adam Benenson argues
in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.adambenenson.com&#x2F;blog&#x2F;the-compiler-is-the-harness&quot;&gt;&quot;The Compiler Is the Harness&quot;&lt;&#x2F;a&gt; that Rust&#x27;s strictness
is what makes it &lt;em&gt;easy&lt;&#x2F;em&gt; for AI agents. Agentic coding lives or dies on feedback loops. If code compiles, it has already
satisfied a whole class of nontrivial constraints. Mykhailo Chalyi makes a complementary point
in &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;misha-chalyi.github.io&#x2F;posts&#x2F;rust-winning-ai-code-gen&#x2F;&quot;&gt;&quot;Rust Is Winning the AI Code Generation Race&quot;&lt;&#x2F;a&gt;: the
writability problem of Rust disappears with AI agents, while the readability, type safety, and performance advantages
remain.&lt;&#x2F;p&gt;
&lt;p&gt;Anthropic&#x27;s own C compiler project is telling here too. Sixteen parallel Claude agents produced 100,000 lines of
Rust. The choice of Rust was deliberate. The type system and ownership model serve as natural guardrails, and
test-driven development with tight feedback loops was the critical enabler.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-shifting-economics&quot;&gt;The shifting economics&lt;&#x2F;h2&gt;
&lt;p&gt;The tool I made is temporary and won&#x27;t be needed for long. In the pre-AI world, it would never have been built. But
it&#x27;s saved a lot of manual, error-prone work that would otherwise have been needed.&lt;&#x2F;p&gt;
&lt;p&gt;Existing software becoming cheaper to build is true but not the interesting part. The interesting part is that new
categories of short-lived, purpose-built internal tools become viable. Things your team needs but nobody would dedicate
engineering time to. Data migration utilities, reconciliation jobs, debugging aids, one-off reporting tools.&lt;&#x2F;p&gt;
&lt;p&gt;I couldn&#x27;t have justified building the final product upfront. I had to discover the requirements iteratively. AI made
each iteration cheap enough to keep going.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;observations&quot;&gt;Observations&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;The compiler is your safety net.&lt;&#x2F;strong&gt; When choosing a stack for AI-augmented work, optimise for the quality of the
automated verification loop. Rust&#x27;s borrow checker and type system aren&#x27;t friction. They&#x27;re a trust accelerator for
AI-generated code.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Style transfer for infrastructure.&lt;&#x2F;strong&gt; You already have production services that encode your org&#x27;s patterns. Use them as
templates. Point the AI at a working example and say &quot;match that.&quot; This is where AI saves the most tedious, error-prone
time.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Iterative discovery over upfront planning.&lt;&#x2F;strong&gt; AI makes it cheap to explore the design space. Start with a script. See
if it&#x27;s useful. Add a UI. Hit scaling limits. Redesign. Each pivot is cheap. You learn what you actually need by
building.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The ROI threshold has moved.&lt;&#x2F;strong&gt; Recalibrate what&#x27;s &quot;worth building.&quot; Short-lived internal tools, data migration
utilities, debugging aids. Things your team needs but nobody would dedicate engineering time to. These are now viable.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-s-worth-building-now&quot;&gt;What&#x27;s worth building now?&lt;&#x2F;h2&gt;
&lt;p&gt;The interesting question isn&#x27;t &quot;how fast can AI write code.&quot; It&#x27;s &quot;what becomes worth building when the cost drops this
much?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;I think we&#x27;re still in the early days of answering that. The threshold has moved more than most people realise.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Advanced Tool Calling Patterns for AI Agents</title>
        <published>2026-03-13T00:00:00+00:00</published>
        <updated>2026-03-13T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/advanced-tool-calling-patterns/"/>
        <id>https://daz.is/blog/advanced-tool-calling-patterns/</id>
        
        <content type="html" xml:base="https://daz.is/blog/advanced-tool-calling-patterns/">&lt;p&gt;I&#x27;ve already &lt;a href=&quot;&#x2F;blog&#x2F;context-engineering-is-the-job&#x2F;&quot;&gt;written about context engineering&lt;&#x2F;a&gt; as the core discipline of
building AI systems. I&#x27;ve been experimenting with my own AI tools for coding, research, and automation, and I&#x27;ve
noticed that tool calling consumes more and more context, so we need strategies for scaling it.&lt;&#x2F;p&gt;
&lt;p&gt;My stack is Rust-based, using Rig for LLM abstraction, Restate for durable execution, Postgres, and a hypermedia
architecture with Maud and HTMX. It works well. But as I&#x27;ve added more tools and connected more MCP servers, context
usage is creeping up.&lt;&#x2F;p&gt;
&lt;p&gt;Every tool definition (name, description, JSON schema) eats tokens before the conversation even starts. A modest setup
with a few MCP servers can consume 50,000+ tokens just on tool schemas.&lt;&#x2F;p&gt;
&lt;p&gt;Also, each tool call is a full inference round-trip. The model calls a tool, waits for the result, processes it, calls
the next one. A workflow that touches five tools means five round-trips, plus all the intermediate reasoning. It&#x27;s slow
and eats up tokens.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tool-search-load-what-you-need-when-you-need-it&quot;&gt;Tool search: load what you need, when you need it&lt;&#x2F;h2&gt;
&lt;p&gt;Instead of stuffing every tool definition into the context upfront, you load only a small set of frequently used tools
plus a special tool search tool. Everything else is deferred. When the agent needs a capability it doesn&#x27;t have, it
searches for it, gets back lightweight summaries, and then the full schema of the selected tool gets loaded for the rest
of the conversation.&lt;&#x2F;p&gt;
&lt;p&gt;Anthropic&#x27;s research shows an 85% reduction in context usage, and accuracy on tool selection improved from 49% to 74% on
Opus 4. On Opus 4.5 it went from 79.5% to 88.1%.&lt;&#x2F;p&gt;
&lt;p&gt;Anthropic offer a server-side implementation where you mark tools with &lt;code&gt;defer_loading: true&lt;&#x2F;code&gt; in the API request and they
handle the search internally. But the more interesting version, for my purposes, is client-side. You build a tool
registry that indexes tool names and descriptions, expose a &lt;code&gt;tool_search&lt;&#x2F;code&gt; tool that returns lightweight summaries, and
on selection inject the full schema into context. This is model-agnostic. It&#x27;s just a tool that returns tool
definitions.&lt;&#x2F;p&gt;
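&lt;p&gt;The client-side version fits in a few dozen lines. A minimal sketch, with made-up tools and naive keyword matching standing in for real semantic retrieval:&lt;&#x2F;p&gt;

```python
# Minimal sketch of a client-side tool registry. The tools are invented,
# and keyword overlap stands in for embedding-based retrieval.
REGISTRY = {
    "get_weather": {
        "summary": "Look up current weather for a city",
        "schema": {"type": "object",
                   "properties": {"city": {"type": "string"}}},
    },
    "send_invoice": {
        "summary": "Email an invoice to a customer",
        "schema": {"type": "object",
                   "properties": {"customer_id": {"type": "string"}}},
    },
}

def tool_search(query):
    """The only tool loaded upfront: returns lightweight summaries."""
    words = set(query.lower().split())
    return [
        {"name": name, "summary": entry["summary"]}
        for name, entry in REGISTRY.items()
        if words.intersection(entry["summary"].lower().split())
    ]

def load_tool(name):
    """On selection, inject the full schema for the rest of the session."""
    entry = REGISTRY[name]
    return {"name": name, "input_schema": entry["schema"]}
```

&lt;p&gt;Only &lt;code&gt;tool_search&lt;&#x2F;code&gt; costs context upfront; every other schema is paid for on demand.&lt;&#x2F;p&gt;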
&lt;p&gt;It turns out &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;docs.rig.rs&#x2F;&quot;&gt;Rig&lt;&#x2F;a&gt;, the Rust LLM framework I&#x27;m already using, has a version of this built in.
Rig&#x27;s &quot;RAG-enabled tools&quot; let you implement a &lt;code&gt;ToolEmbedding&lt;&#x2F;code&gt; trait on your tools, store them in a vector store, and
retrieve the most relevant ones at query time using &lt;code&gt;.dynamic_tools(n, vector_store_index, toolset)&lt;&#x2F;code&gt;. It&#x27;s the
client-side tool search pattern, using embedding-based semantic retrieval rather than keyword matching. The mechanism is
the same as document RAG, applied to tool definitions instead of documents. I hadn&#x27;t realised the utility of this
before, but the infrastructure for tool search is already in my stack.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ll probably take a hybrid approach by keeping a few core tools always loaded and deferring everything else.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;programmatic-tool-calling-let-the-llm-write-code&quot;&gt;Programmatic tool calling: let the LLM write code&lt;&#x2F;h2&gt;
&lt;p&gt;Instead of calling tools one at a time through the standard tool-calling protocol, the LLM writes code that orchestrates
multiple tool calls, processes results with proper programming constructs (loops, conditionals, aggregation), and
returns only the final output. The code runs in a sandbox with no direct network access. Tool calls inside the generated
code go through a bridge back to the host application, which handles authentication and routing.&lt;&#x2F;p&gt;
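&lt;p&gt;Stripped to its essentials, the bridge is just a function handed into the sandbox. A toy sketch, where Python&#x27;s &lt;code&gt;exec&lt;&#x2F;code&gt; with empty builtins stands in for a real isolate and is emphatically not a secure sandbox:&lt;&#x2F;p&gt;

```python
# Toy sketch of the host-side bridge for programmatic tool calling.
# exec() with empty builtins is NOT a real sandbox; it only illustrates
# the shape: generated code sees call_tool and nothing else.
def make_bridge(handlers):
    def call_tool(name, **kwargs):
        # The host owns authentication, routing, and rate limiting here.
        return handlers[name](**kwargs)
    return call_tool

def run_generated(code, call_tool):
    scope = {"call_tool": call_tool, "result": None}
    exec(code, {"__builtins__": {}}, scope)  # no imports, no network
    return scope["result"]
```

&lt;p&gt;The model writes ordinary code with loops and conditionals, and only the final &lt;code&gt;result&lt;&#x2F;code&gt; comes back into context.&lt;&#x2F;p&gt;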
&lt;p&gt;This approach can achieve higher accuracy with much lower token usage. Anthropic reports average token usage dropping
from 43,588 to 27,297 (a 37% reduction) on complex research tasks, and accuracy improvements on the GAIA benchmark from
46.5% to 51.2%. A third-party test by The AI Automators backed this up: a budget compliance check across 20 team members
took 56 tool calls and 76,000 tokens with traditional calling and still missed a result. The same task with programmatic
calling took 4 to 12 tool calls, used fewer tokens, and got all results correct.&lt;&#x2F;p&gt;
&lt;p&gt;Cloudflare has two takes on this. Their original Code Mode converts MCP tool schemas into TypeScript type definitions and
runs generated code in V8 isolates. Their newer Code Mode MCP server takes it further, working against Cloudflare&#x27;s
OpenAPI spec rather than MCP schemas. The model writes JavaScript to call &lt;code&gt;search()&lt;&#x2F;code&gt; and &lt;code&gt;execute()&lt;&#x2F;code&gt;, exposing the
entire Cloudflare API through just two tools and consuming around 1,000 tokens regardless of how many API endpoints sit
behind it. When I first saw this approach, I joked it was RCE-as-a-Service, but it actually looks
quite cool if you can get the sandboxing and permissions worked out.&lt;&#x2F;p&gt;
&lt;p&gt;For my Rust stack, the sandbox question is still open. Pydantic&#x27;s Monty is appealing because it&#x27;s a Rust-based Python
interpreter that boots in single-digit microseconds. But it only supports a subset of Python. I&#x27;m also curious about
what could be achieved with something like Rhai, a pure Rust embeddable scripting language. There&#x27;s a lot to think
about and get right here, including sandboxing, expressiveness, how well LLMs can actually generate code for the target
language, security, and performance.&lt;&#x2F;p&gt;
&lt;p&gt;I still think for recurring, well-defined tasks, it&#x27;s better to use pre-written scripts (a &quot;skills&quot; system) rather than
having the LLM generate code every time. Programmatic tool calling is most valuable for novel, ad-hoc queries where the
specific combination of tools and logic can&#x27;t be predicted in advance. I want to experiment with this, but I don&#x27;t
have a specific use case right now.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;tool-use-examples-few-shot-prompting-for-tools&quot;&gt;Tool use examples: few-shot prompting for tools&lt;&#x2F;h2&gt;
&lt;p&gt;The third pattern is simpler. JSON schemas define structure but can&#x27;t express usage patterns. Tool use examples provide
concrete input&#x2F;output demonstrations that show the LLM exactly how to call a tool correctly.&lt;&#x2F;p&gt;
&lt;p&gt;Anthropic&#x27;s testing showed parameter accuracy improved from 72% to 90% with examples. The best practices are to add one
to five examples per tool, use realistic data, show variety in how the tool can be called, and focus on cases where
correct usage isn&#x27;t obvious from the schema alone.&lt;&#x2F;p&gt;
&lt;p&gt;Tool search and tool use examples aren&#x27;t compatible in Anthropic&#x27;s current API. If you need examples for a specific
tool, that tool needs to stay in standard (non-deferred) mode. A skills-based approach can serve a similar purpose,
though. When the agent loads a skill file, it gets instructions and example invocations as part of the context,
achieving the same effect through context engineering rather than a separate API feature.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-i-m-building-next&quot;&gt;What I&#x27;m building next&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m going to try the client-side tool registry with search. This is low-effort, high-impact, and it works with any
model. Second, I want to try adding sandboxed code execution once I&#x27;ve figured out the right sandbox approach for a Rust
host.&lt;&#x2F;p&gt;
&lt;p&gt;I also still think the skills-based approach offers the best value. This means using skill descriptions and providing a
CLI or scripts to access additional capabilities. The Skill + CLI combination is hard to beat because it&#x27;s powerful and
understandable.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ll write more as I build this out. If you&#x27;re working on similar problems, or if you&#x27;ve already implemented any of
these patterns, I&#x27;d love to hear what you&#x27;ve found. &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;anthropic.com&#x2F;engineering&#x2F;advanced-tool-use&quot;&gt;Advanced Tool Use&lt;&#x2F;a&gt; (Anthropic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;anthropic.com&#x2F;engineering&#x2F;code-execution-with-mcp&quot;&gt;Code Execution with MCP&lt;&#x2F;a&gt; (Anthropic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;anthropic.com&#x2F;engineering&#x2F;effective-context-engineering-for-ai-agents&quot;&gt;Effective Context Engineering for AI Agents&lt;&#x2F;a&gt;
(Anthropic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;platform.claude.com&#x2F;docs&#x2F;en&#x2F;agents-and-tools&#x2F;tool-use&#x2F;tool-search-tool&quot;&gt;Tool Search Docs&lt;&#x2F;a&gt; (Anthropic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;code-mode&quot;&gt;Code Mode&lt;&#x2F;a&gt; (Cloudflare)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;blog.cloudflare.com&#x2F;code-mode-mcp&quot;&gt;Code Mode MCP&lt;&#x2F;a&gt; (Cloudflare)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;pydantic&#x2F;monty&quot;&gt;Pydantic Monty&lt;&#x2F;a&gt; (Pydantic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;manus.im&#x2F;blog&#x2F;Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus&quot;&gt;Context Engineering for AI Agents&lt;&#x2F;a&gt;
(Manus)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>What Happens When You Stop Reading the Code?</title>
        <published>2026-03-11T00:00:00+00:00</published>
        <updated>2026-03-11T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/stop-reading-the-code/"/>
        <id>https://daz.is/blog/stop-reading-the-code/</id>
        
        <content type="html" xml:base="https://daz.is/blog/stop-reading-the-code/">&lt;p&gt;I recently wrote about &lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;how I work with AI coding agents&lt;&#x2F;a&gt; and
about &lt;a href=&quot;&#x2F;blog&#x2F;code-review-ai-augmented-development&#x2F;&quot;&gt;code review in AI-augmented development&lt;&#x2F;a&gt;. I meant every word of
both. But parts of them are already not quite where my thinking is now.&lt;&#x2F;p&gt;
&lt;p&gt;This is not a retraction. The ground keeps moving under our feet. The only irresponsible position right now is
certainty. We have to be open to changing our minds as the AI models and harnesses improve, and as we discover how best
to work with this technology.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-four-steps&quot;&gt;The four steps&lt;&#x2F;h2&gt;
&lt;p&gt;Dan Shapiro recently &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.danshapiro.com&#x2F;blog&#x2F;2026&#x2F;02&#x2F;you-dont-write-the-code&#x2F;&quot;&gt;wrote about&lt;&#x2F;a&gt; what StrongDM&#x27;s CTO
Justin McCarthy learned building a software factory. The progression is simple:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;Recognise you&#x27;re not the best person to write the code any more. The AI writes the code.&lt;&#x2F;li&gt;
&lt;li&gt;Accept that if you&#x27;re not writing the code, but you&#x27;re still reading every line, you are the bottleneck. Stop reading
the code too.&lt;&#x2F;li&gt;
&lt;li&gt;Recognise that this creates an enormous pile of terrifying problems.&lt;&#x2F;li&gt;
&lt;li&gt;Realise that solving those problems is now your actual job.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;I think this describes the trajectory we&#x27;re on. Shapiro describes a destination. What I&#x27;m trying to describe is
being mid-journey, somewhere on this path. But exactly where I am depends entirely on context.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;where-i-actually-am&quot;&gt;Where I actually am&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;Side projects:&lt;&#x2F;strong&gt; I&#x27;m experimenting freely. Steps 1 and 2 feel natural. I let the AI generate, I don&#x27;t read every line,
and I&#x27;m building verification instead. I&#x27;m focusing on carefully reviewing the plans, and developing AI assisted code
review. The cost of failure is low. The learning is high.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;At work:&lt;&#x2F;strong&gt; I&#x27;m closer to traditional review. SOC 2, ISO 27001, compliance requirements mean I need evidence that a
human understood what shipped. &quot;An AI agent healed it&quot; is not an answer our compliance team can work with yet. Nor
should it be. I&#x27;m thinking about how AI can help scale this, but I&#x27;m working in a team, and so other factors need to be
taken into account.&lt;&#x2F;p&gt;
&lt;p&gt;I can see the destination Shapiro describes. I&#x27;m not fully there yet. And that&#x27;s fine. The interesting question isn&#x27;t
&quot;have you arrived?&quot; but &quot;what has to be true before you can move further along the path?&quot;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;why-letting-go-is-less-scary-than-it-sounds&quot;&gt;Why letting go is less scary than it sounds&lt;&#x2F;h2&gt;
&lt;p&gt;Human code review was never very good at finding bugs. The empirical evidence backs this up.&lt;&#x2F;p&gt;
&lt;p&gt;What&#x27;s more interesting is what code review actually delivered as side effects: shared understanding of the codebase,
consistency across the team, accountability for what shipped, knowledge transfer between engineers. Those are real and
valuable.&lt;&#x2F;p&gt;
&lt;p&gt;But they&#x27;re not what most engineers think they&#x27;re defending when they resist the idea of not reading every line. When
you realise you&#x27;re grieving familiarity and shared understanding rather than bug-catching capability, it reframes the
problem. Those are solvable problems. They just have different solutions than line-by-line review.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;from-reviewer-to-feedback-loop-designer&quot;&gt;From reviewer to feedback loop designer&lt;&#x2F;h2&gt;
&lt;p&gt;If you&#x27;re not writing the code, and you&#x27;re not reading every line, what is your job?&lt;&#x2F;p&gt;
&lt;p&gt;Not: &quot;Did the AI write good code?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;But: &quot;Have I built an environment where bad code can&#x27;t survive?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;This is closer to SRE thinking than traditional code review. You&#x27;re designing systems that keep AI-generated output on
track: verification pipelines, observability, feedback loops, automated gates. The discipline doesn&#x27;t disappear when you
let go of reading every line. It moves. From inspecting output to designing the systems that inspect output for you.&lt;&#x2F;p&gt;
&lt;p&gt;I wrote about &lt;a href=&quot;&#x2F;blog&#x2F;mechanical-sympathy&#x2F;&quot;&gt;mechanical sympathy&lt;&#x2F;a&gt; recently, the idea that every generation of engineers
needs to understand the layer beneath their abstraction. The same principle applies here. You need to understand how
AI-generated code fails (quietly, confidently, locally-coherent-but-globally-inconsistent) to design feedback loops that
catch those specific failure modes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;verifiable-over-deterministic&quot;&gt;Verifiable over deterministic&lt;&#x2F;h2&gt;
&lt;p&gt;My earlier thinking drew a hard line: use deterministic tools (linters, type checkers, compilers, tests) for everything
you can, and only use AI for the rest. I still believe that. But it&#x27;s incomplete. The real requirement isn&#x27;t
determinism. It&#x27;s &lt;strong&gt;verifiability&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a spectrum:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Best: verifiable and deterministic.&lt;&#x2F;strong&gt; Linters, type systems, compilers, test suites. Same input, same output. You can
prove correctness. This is the gold standard and you should push as much as possible into this category.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Useful: verifiable but non-deterministic.&lt;&#x2F;strong&gt; AI code review that flags concerns with evidence. Human review.
Property-based testing with AI-generated cases. The process isn&#x27;t repeatable, but you can assess whether the output is
right. You can show your working.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Dangerous: unverifiable and non-deterministic.&lt;&#x2F;strong&gt; Trusting AI output with no mechanism to assess correctness. No tests,
no review, no evidence trail. This is where things go wrong, and it&#x27;s where most &quot;vibe coding&quot; sits when done
carelessly.&lt;&#x2F;p&gt;
&lt;p&gt;The question isn&#x27;t &quot;is this check deterministic?&quot; It&#x27;s &quot;can I verify the result, and can I show evidence of that
verification?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;This is also where compliance frameworks might eventually meet AI-augmented workflows. The intent of SOC 2 and ISO 27001
isn&#x27;t &quot;a human read every line.&quot; It&#x27;s &quot;you can demonstrate control and correctness.&quot; Auditable, evidenced verification
could satisfy that intent even as the mechanism shifts. Not today, necessarily. But that&#x27;s the direction.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-needs-to-be-true&quot;&gt;What needs to be true&lt;&#x2F;h2&gt;
&lt;p&gt;Before organisations can move further along Shapiro&#x27;s four steps, several things need to happen.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Verification tooling needs to mature.&lt;&#x2F;strong&gt; Not just linters and tests, but AI-assisted review that produces auditable
evidence. We need tools that don&#x27;t just say &quot;this looks fine&quot; but show why, with traces that an auditor could follow.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Compliance frameworks need to catch up.&lt;&#x2F;strong&gt; Or at least be interpreted in ways that recognise systematic verification as
a valid control. The current assumption in most audit frameworks is that a human reviewed the change. That assumption
will need to evolve, but it won&#x27;t evolve until the alternative demonstrably works.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;The specification layer needs proper tooling.&lt;&#x2F;strong&gt; If intent documents and specs become the durable artefact (and I think
they will), they need consistency checking, dead requirement detection, contradiction detection. Right now, a repo full
of markdown specs is just files. No compiler tells you when two specs contradict each other. No linter catches a
requirement that&#x27;s been superseded but never removed.&lt;&#x2F;p&gt;
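&lt;p&gt;As a sketch of what that tooling could look like: assume a hypothetical convention where every spec file declares an &lt;code&gt;id:&lt;&#x2F;code&gt; line and a &lt;code&gt;supersedes:&lt;&#x2F;code&gt; line for anything it replaces. A few lines of Python would then catch the superseded-but-never-removed case. Nothing here is real tooling, just the shape of it:&lt;&#x2F;p&gt;

```python
import re

# Hypothetical convention: each spec starts with header lines like
#   id: SPEC-012
#   supersedes: SPEC-007
ID_RE = re.compile(r"^id:\s*(\S+)", re.M)
SUPERSEDES_RE = re.compile(r"^supersedes:\s*(\S+)", re.M)

def lint_specs(specs):
    """specs maps filename to spec text. Returns a list of warnings."""
    ids = {}
    superseded = {}
    for name, text in specs.items():
        m = ID_RE.search(text)
        if m:
            ids[m.group(1)] = name
        for old in SUPERSEDES_RE.findall(text):
            superseded[old] = name
    warnings = []
    for old_id, successor in superseded.items():
        # A superseded requirement that is still present is dead weight:
        # nothing flags it, and it quietly contradicts its successor.
        if old_id in ids:
            warnings.append(
                f"{ids[old_id]}: {old_id} superseded by {successor} but never removed"
            )
    return warnings
```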
&lt;p&gt;&lt;strong&gt;Teams need new ways to maintain shared understanding.&lt;&#x2F;strong&gt; Code review served a knowledge-sharing function that had
nothing to do with finding bugs. If that goes away, something else needs to replace it. AI-generated explanations of
what changed and why, targeted at humans rather than machines, might serve that purpose. But the tooling isn&#x27;t there
yet.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Trust needs to be built incrementally.&lt;&#x2F;strong&gt; Side projects first. Low-stakes features. Gradually expanding the boundary as
confidence in verification systems grows. This is how every new practice earns legitimacy in engineering organisations,
and AI-augmented workflows shouldn&#x27;t be an exception.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;this-post-has-a-shelf-life-too&quot;&gt;This post has a shelf life too&lt;&#x2F;h2&gt;
&lt;p&gt;My previous posts described how I work and how I think about code review. This one describes how both of those are
shifting and why.&lt;&#x2F;p&gt;
&lt;p&gt;I expect to write another one when the ground moves again. It will.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s not a failure of thinking. It&#x27;s the appropriate response to a situation that is genuinely shifting under us. The
only irresponsible position right now is certainty.&lt;&#x2F;p&gt;
&lt;p&gt;The discipline is the same as it&#x27;s always been in engineering: understand the layer beneath the one you&#x27;re working at.
The layer has changed. &lt;a href=&quot;&#x2F;blog&#x2F;mechanical-sympathy&#x2F;&quot;&gt;The discipline hasn&#x27;t&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;re on this path too, wherever you are on it, I&#x27;d love to hear where you&#x27;ve landed. &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Operational Debt</title>
        <published>2026-03-04T00:00:00+00:00</published>
        <updated>2026-03-04T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/operational-debt/"/>
        <id>https://daz.is/blog/operational-debt/</id>
        
        <content type="html" xml:base="https://daz.is/blog/operational-debt/">&lt;p&gt;Years of running production systems give you something that&#x27;s not in the code. You learn the real-world usage patterns,
the failures that only show up under load, the degradation behaviour that creeps in over months. You learn which alerts
actually matter and which are noise.&lt;&#x2F;p&gt;
&lt;p&gt;That knowledge is earned incrementally. Through building, observing, failing, and iterating. It lives in people, not in
repositories.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve been thinking about what happens to that knowledge when code generation speeds up by an order of magnitude.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cognitive-debt-briefly&quot;&gt;Cognitive debt, briefly&lt;&#x2F;h2&gt;
&lt;p&gt;The term &lt;strong&gt;cognitive debt&lt;&#x2F;strong&gt; was brought into software engineering
by &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;margaretstorey.com&#x2F;blog&#x2F;2026&#x2F;02&#x2F;09&#x2F;cognitive-debt&#x2F;&quot;&gt;Margaret-Anne Storey&lt;&#x2F;a&gt; earlier this year: the gap between
what AI-generated code does and how well the developers actually understand it.
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;martinfowler.com&#x2F;fragments&#x2F;2026-02-13.html&quot;&gt;Martin Fowler&lt;&#x2F;a&gt;
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;simonwillison.net&#x2F;2026&#x2F;Feb&#x2F;15&#x2F;cognitive-debt&#x2F;&quot;&gt;Simon Willison&lt;&#x2F;a&gt; have since amplified it, and it&#x27;s gained
serious traction. Five independent research groups converged on the same finding in a single week: AI agents generate
code 5-7x faster than developers can comprehend it.&lt;&#x2F;p&gt;
&lt;p&gt;Storey followed up with a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;margaretstorey.com&#x2F;blog&#x2F;2026&#x2F;02&#x2F;18&#x2F;cognitive-debt-revisited&#x2F;&quot;&gt;second post&lt;&#x2F;a&gt; exploring
the implications further. Anthropic&#x27;s own research showed AI coding assistance reduces developer skill mastery by 17%.
Developers who delegated code generation scored below 40% on comprehension tests.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve written about this before from the &lt;a href=&quot;&#x2F;blog&#x2F;code-review-ai-augmented-development&#x2F;&quot;&gt;code review angle&lt;&#x2F;a&gt;
and &lt;a href=&quot;&#x2F;blog&#x2F;build-fast-learn-slow&quot;&gt;build fast learn slow&lt;&#x2F;a&gt;, but there&#x27;s a piece I haven&#x27;t named until now.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;operational-debt&quot;&gt;Operational debt&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s what I want to put a name to:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Operational debt&lt;&#x2F;strong&gt; is code generated faster than teams can earn the &lt;em&gt;operational knowledge&lt;&#x2F;em&gt; to run it reliably in
production.&lt;&#x2F;p&gt;
&lt;p&gt;Cognitive debt is about understanding what the code does. Operational debt is about understanding what happens when it
runs. They&#x27;re related but distinct.&lt;&#x2F;p&gt;
&lt;p&gt;Operational knowledge is a specific thing. It&#x27;s knowing the real-world usage patterns of your system. It&#x27;s knowing which
metrics actually correlate with user pain and which are just noise. It&#x27;s understanding how the system degrades under
pressure, not the clean failure modes you designed for, but the messy ones that emerge over time.&lt;&#x2F;p&gt;
&lt;p&gt;This knowledge grows through lived experience with a running system. You can&#x27;t generate it. You can&#x27;t shortcut it. You
earn it by operating the system with real users doing unpredictable things, over months and years. Speed up code
generation 5-7x and this knowledge doesn&#x27;t keep pace. It can&#x27;t.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-pattern-that-s-hard-to-ignore&quot;&gt;The pattern that&#x27;s hard to ignore&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;operational-debt&#x2F;.&#x2F;status-image.png&quot; alt=&quot;Screenshot of fictional service status bar&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Check the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;status.anthropic.com&quot;&gt;Claude status page&lt;&#x2F;a&gt;.
Check &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.githubstatus.com&quot;&gt;GitHub&#x27;s recent reliability track record&lt;&#x2F;a&gt;. Both companies leaning heavily into
AI-generated code. Both struggling with operational reliability.&lt;&#x2F;p&gt;
&lt;p&gt;I know, correlation isn&#x27;t causation. There are many possible explanations: rapid growth, scaling challenges,
organisational complexity. But the pattern is there. The companies most aggressively adopting AI for their own codebases
are also the ones with the most visible reliability issues. It&#x27;s worth asking why.&lt;&#x2F;p&gt;
&lt;p&gt;My hypothesis is that when you generate code faster than your team can build operational understanding of it, your
ability to run that code reliably degrades. Not because the code is bad. Because nobody has had time to learn how it
behaves in the wild.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-they-compound&quot;&gt;How they compound&lt;&#x2F;h2&gt;
&lt;p&gt;These problems compound:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cognitive debt&lt;&#x2F;strong&gt;: you can&#x27;t understand the code fast enough&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Review bottleneck&lt;&#x2F;strong&gt;: you can&#x27;t &lt;a href=&quot;&#x2F;blog&#x2F;code-review-ai-augmented-development&#x2F;&quot;&gt;review it&lt;&#x2F;a&gt; fast enough to maintain
quality gates&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Operational debt&lt;&#x2F;strong&gt;: you can&#x27;t earn production knowledge fast enough to run it reliably&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;Now put them together. Code you don&#x27;t fully understand, that wasn&#x27;t thoroughly reviewed, running in production
environments you haven&#x27;t had time to learn the operational characteristics of.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s not a hypothetical. That&#x27;s a reliability crisis happening right now.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-to-do-about-it&quot;&gt;What to do about it&lt;&#x2F;h2&gt;
&lt;p&gt;I don&#x27;t have a fully worked-out answer. But I think it starts with recognising that &quot;how fast can we generate code&quot; is
the wrong metric. The right question is: &quot;how well do we understand what we&#x27;re running?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;The productivity gains are real. But productivity measured only in code output is measuring the wrong thing. I&#x27;m not
saying we should slow down, but we shouldn&#x27;t focus just on the speed of generation if our operational knowledge can&#x27;t
keep up.&lt;&#x2F;p&gt;
&lt;p&gt;Some things I think help:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Match generation speed to learning speed.&lt;&#x2F;strong&gt; Give teams time to build operational understanding before the next wave
of changes lands. Easier said than done when you need to keep up with the new pace of software development.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Invest in observability before you invest in generation.&lt;&#x2F;strong&gt; If you can&#x27;t see how your system behaves, generating more
code just makes the blind spot bigger.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Treat operational knowledge as a first-class asset.&lt;&#x2F;strong&gt; Document failure modes as you discover them. Run postmortems
that capture institutional knowledge, not just action items. Make sure ops understand what changed recently.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Be honest about the gap.&lt;&#x2F;strong&gt; If your team has generated more system than it can operate, that&#x27;s a risk. Name it.
Factor it into planning.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The goal isn&#x27;t to slow down. It&#x27;s to make sure understanding keeps up with generation. Augmented development is powerful
precisely because it lets experienced practitioners move faster. But the &lt;em&gt;experience&lt;&#x2F;em&gt; has to keep pace, not just the
&lt;em&gt;speed&lt;&#x2F;em&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This is thinking in progress. If you&#x27;re seeing this pattern in your own teams, or if you think I&#x27;m wrong about the
connection, I&#x27;d genuinely like to hear about it. &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>How I Work with AI Coding Agents</title>
        <published>2026-03-01T00:00:00+00:00</published>
        <updated>2026-03-01T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/how-i-work-with-ai-coding-agents/"/>
        <id>https://daz.is/blog/how-i-work-with-ai-coding-agents/</id>
        
<content type="html" xml:base="https://daz.is/blog/how-i-work-with-ai-coding-agents/">&lt;p&gt;I&#x27;ve been building software for over 25 years and I&#x27;ve been through many changes to how we work in that time: Git,
CI&#x2F;CD, the cloud, and containers, to name a few. This is the biggest change in the shortest space of time. In December 2025,
Claude Code and Opus 4.5 crossed a threshold.&lt;&#x2F;p&gt;
&lt;p&gt;Since then I&#x27;ve been focusing all my energy on this: working with AI coding agents in production every day,
experimenting, noticing what works, feeding that back into my process. The approaches out there range from spec-driven
development to fully autonomous vibe coding. What follows is where I&#x27;ve landed, built from daily use on real projects.
It keeps changing.&lt;&#x2F;p&gt;
&lt;p&gt;This isn&#x27;t science. It&#x27;s field reporting from a practitioner going deep on this every day, testing what works under real
conditions. Some of what I&#x27;ve found aligns with what others in the field are discovering independently. Some, I think,
goes further.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-core-principle&quot;&gt;The core principle&lt;&#x2F;h2&gt;
&lt;p&gt;One observation underpins everything: &lt;strong&gt;LLMs are stateless&lt;&#x2F;strong&gt;. They have no memory between requests. Output quality is
bounded by context quality.&lt;&#x2F;p&gt;
&lt;p&gt;Better models don&#x27;t fix bad context. They produce more confident, more fluent slop. In
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;survey.stackoverflow.co&#x2F;2025&#x2F;&quot;&gt;2025 Stack Overflow Developer Survey&lt;&#x2F;a&gt;, trust in AI accuracy dropped from 40%
to 29% year-over-year, even as 84% of developers kept using the tools. People keep using them because they&#x27;re genuinely
useful. The output quality is the problem to solve.&lt;&#x2F;p&gt;
&lt;p&gt;The difference between shipping quality and drowning in rework comes down to how deliberately you manage what goes into
the context window and how rigorously you verify what comes out.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;a-toolkit-not-a-pipeline&quot;&gt;A toolkit, not a pipeline&lt;&#x2F;h2&gt;
&lt;p&gt;My process is not a linear pipeline. It&#x27;s a toolkit of distinct steps that I assemble into a custom flow for each piece
of work. The shape depends on the outcome I&#x27;m after. Each step runs in a fresh context window, with the output
compressed into a focused artefact for the next. Context goes down at each stage while specificity goes up.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve tried to formalise this into a deterministic workflow, a controlled set of steps I can repeat. I might be
converging on something: a custom orchestration and review tool built to maximise human leverage at the points where it
matters most. But that&#x27;s early days. For now, the value is in keeping it flexible, experimenting with how the pieces fit
together, and adjusting as I learn what actually holds up under daily use.&lt;&#x2F;p&gt;
&lt;p&gt;The available steps and a typical flow:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;.&#x2F;workflow.png&quot; alt=&quot;typical workflow&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;Ideation, design, and research feed each other iteratively, sometimes in parallel. Requirements crystallise from that
exploration. Then the three review stages (validate, evaluate, verify) each check different things at different points.&lt;&#x2F;p&gt;
&lt;p&gt;These steps aren&#x27;t always all present. For a small bug fix, several collapse into one session. For a substantial
feature, each is a distinct conversation with its own context window. I assemble the flow to fit the work, not the other
way around.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;research&quot;&gt;Research&lt;&#x2F;h2&gt;
&lt;p&gt;This is the highest-leverage phase. The goal is to map the problem space: relevant files, functions, data flows,
constraints, prior decisions. Not to write code.&lt;&#x2F;p&gt;
&lt;p&gt;Sub-agents do the noisy work in isolated context windows: file exploration, code search, dependency tracing. Their
compressed summaries come back to the main context clean, without the search noise that would pollute it. This is
encapsulation, applied to attention rather than code.&lt;&#x2F;p&gt;
&lt;p&gt;The pattern is &lt;strong&gt;gather, then glean&lt;&#x2F;strong&gt;. Cast a wide net first (maximise recall), then cull to the minimal set that
matters (maximise precision). The most dangerous information isn&#x27;t the obviously irrelevant stuff. It&#x27;s information that
&lt;em&gt;looks&lt;&#x2F;em&gt; relevant but isn&#x27;t. A hallucinated assumption about how the auth system works isn&#x27;t a code-level error. It&#x27;s a
research-level error. Everything built on top of it will be wrong.&lt;&#x2F;p&gt;
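&lt;p&gt;The gather-then-glean pattern is simple enough to sketch. The two scoring functions here are hypothetical stand-ins: in practice the cheap pass might be keyword or embedding search, and the strict pass an expensive sub-agent judgement:&lt;&#x2F;p&gt;

```python
def gather_then_glean(query, corpus, cheap_score, strict_score, keep=5):
    """Two-stage retrieval: cast a wide net, then cull hard.

    cheap_score and strict_score are hypothetical relevance functions
    supplied by the caller; only their shape matters here.
    """
    # Gather: maximise recall with a cheap, permissive filter.
    candidates = [doc for doc in corpus if cheap_score(query, doc) > 0]
    # Glean: maximise precision with a strict, expensive check,
    # keeping only the few items that genuinely matter.
    ranked = sorted(candidates, key=lambda doc: strict_score(query, doc), reverse=True)
    return ranked[:keep]
```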
&lt;h2 id=&quot;plan&quot;&gt;Plan&lt;&#x2F;h2&gt;
&lt;p&gt;An execution blueprint. Every step numbered, sequential, unambiguous. Include test criteria and code snippets where they
remove ambiguity. The target: a plan so specific that implementation becomes almost mechanical.&lt;&#x2F;p&gt;
&lt;p&gt;This is where small mistakes get expensive fast. A bad step in a plan produces hundreds of wrong lines. I&#x27;ve had a
single missed detail in a plan generate a cascade of not-quite-right code across multiple files, all internally
consistent, all confidently wrong. The earlier you apply human judgement, the cheaper the correction.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;implement&quot;&gt;Implement&lt;&#x2F;h2&gt;
&lt;p&gt;Should be the simplest phase. Feed the plan and only the specific files needed. For larger tasks, break implementation
into chunks, each in a fresh context window, to stay below roughly 40% context window utilisation, where I&#x27;ve found
output quality starts to drop off noticeably.&lt;&#x2F;p&gt;
&lt;p&gt;In practice, this means I can run plan steps through an implementation loop: feed a step, execute, commit, fresh
context, next step. This is close to what people are calling a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ghuntley.com&#x2F;loop&#x2F;&quot;&gt;Ralph loop&lt;&#x2F;a&gt; (Geoffrey
Huntley&#x27;s pattern of running an agent repeatedly with git as the memory layer), but structured around a plan rather than
re-running the same prompt until it converges.&lt;&#x2F;p&gt;
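&lt;p&gt;The loop itself is tiny. Here&#x27;s a sketch with the two interesting parts stubbed out: &lt;code&gt;run_agent&lt;&#x2F;code&gt; and &lt;code&gt;commit&lt;&#x2F;code&gt; are hypothetical hooks standing in for &quot;start a fresh agent session with only this step&quot; and &quot;record the result in git&quot;:&lt;&#x2F;p&gt;

```python
def run_step_loop(plan_steps, run_agent, commit):
    """Feed a step, execute, commit, fresh context, next step.

    run_agent(step) starts a brand-new agent session (fresh context
    window) and returns a one-line summary of what it did. commit(msg)
    records the result; git is the durable memory layer between
    sessions, not the conversation history.
    """
    log = []
    for i, step in enumerate(plan_steps, start=1):
        summary = run_agent(step)          # fresh context every time
        commit(f"plan step {i}: {summary}")
        log.append((i, summary))
    return log
```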
&lt;p&gt;What I add on top is a &lt;strong&gt;deviation log&lt;&#x2F;strong&gt;. During implementation, any point where the AI diverges from the plan gets
flagged with a reason. I review and annotate these. This turns code review from reading every line to targeted
investigation of the places where plan and reality didn&#x27;t match.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;three-review-points-not-one&quot;&gt;Three review points, not one&lt;&#x2F;h2&gt;
&lt;p&gt;Most workflows put review at the end. I apply human judgement at three distinct points, each in a fresh context to avoid
bias from the previous phase.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Validate&lt;&#x2F;strong&gt; (after requirements): Are we solving the right problem? Are the requirements correct, complete, and
feasible? This catches scope errors before any planning or code exists.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Evaluate&lt;&#x2F;strong&gt; (after plan): Is the approach sound? Is the work broken into chunks that fit within the AI&#x27;s context sweet
spot? Does each chunk specify the context it needs? A plan that looks right but is poorly chunked for execution will
produce inconsistent output.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Verify&lt;&#x2F;strong&gt; (after implementation): Does the output match the plan and requirements? This is where all forms of review
converge:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Static analysis first.&lt;&#x2F;strong&gt; Types, linters, automated tests, security scanners. I write Rust, and the compiler&#x27;s error
messages are detailed enough that the agent can interpret and fix them directly. Never send an LLM or a human to do a
linter&#x27;s job.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Architecture second.&lt;&#x2F;strong&gt; Check structural decisions: dependencies, patterns, interfaces, how the new code fits the
existing system.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;AI-specific failure modes last.&lt;&#x2F;strong&gt; AI-generated code tends to have local coherence (each module works in isolation)
but poor global coherence (three modules solving overlapping problems differently, abstractions that don&#x27;t compose,
naming drift). Security is where the failures get dangerous. AI won&#x27;t add CSRF protection, rate limiting, or input
validation unless specifically prompted. It builds what you ask for, not what you need.&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;The research is clear: &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;veracode.com&#x2F;blog&#x2F;ai-generated-code-security-risks&#x2F;&quot;&gt;45% of AI-generated code&lt;&#x2F;a&gt; contains
security
vulnerabilities. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.coderabbit.ai&#x2F;blog&#x2F;state-of-ai-vs-human-code-generation-report&quot;&gt;AI pull requests average 1.7x more issues&lt;&#x2F;a&gt;
than human PRs. If you only verify at the end, you&#x27;re trying to catch all of that in code review. Validate and evaluate
earlier, and many of those issues never get generated in the first place.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;context-management&quot;&gt;Context management&lt;&#x2F;h2&gt;
&lt;p&gt;I think about context quality across four dimensions, a framing from Dex
Horthy&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rmvDxxNubIg&quot;&gt;&quot;No Vibes Allowed&quot;&lt;&#x2F;a&gt; talk that I&#x27;ve found genuinely useful:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Correctness&lt;&#x2F;strong&gt;: Is everything in context accurate?&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Completeness&lt;&#x2F;strong&gt;: Is anything important missing?&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Size&lt;&#x2F;strong&gt;: All signal, minimal noise. Keep the model in its smart zone.&lt;&#x2F;li&gt;
&lt;li&gt;&lt;strong&gt;Trajectory&lt;&#x2F;strong&gt;: Does the conversation flow help the model reason well?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;h3 id=&quot;the-roughly-40-guideline&quot;&gt;The roughly 40% guideline&lt;&#x2F;h3&gt;
&lt;p&gt;The 40% figure comes from Dex Horthy. My experience confirms it: best performance is below 50% utilisation, and quality
drops noticeably beyond that. Chroma&#x27;s &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;context rot research&lt;&#x2F;a&gt; confirms the
underlying principle: model performance decreases as input length grows, even on simple tasks. More context usually
means worse output, not better. The practical rule: if you&#x27;re approaching the limit, start a fresh context or delegate
to a sub-agent.&lt;&#x2F;p&gt;
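&lt;p&gt;In code, the practical rule is little more than a traffic light. The thresholds and token counts are illustrative; how you measure utilisation depends entirely on your model and tooling:&lt;&#x2F;p&gt;

```python
def context_action(tokens_used, window_size, soft_limit=0.4):
    """Rough traffic-light check against the ~40% guideline."""
    utilisation = tokens_used / window_size
    if utilisation > soft_limit * 1.25:
        # Past roughly 50%: quality drops noticeably.
        return "compact or start a fresh context"
    if utilisation > soft_limit:
        # Approaching the limit: keep the sprawl out of the main agent.
        return "delegate noisy work to a sub-agent"
    return "keep going"
```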
&lt;h3 id=&quot;summarise-and-delegate&quot;&gt;Summarise and delegate&lt;&#x2F;h3&gt;
&lt;p&gt;Two strategies for keeping context under control.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Summarise&lt;&#x2F;strong&gt; is reactive. Compact accumulated context between phases. The output of research becomes a compressed
summary for planning. The plan becomes a compressed spec for implementation. Each transition is an intentional
reduction.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Delegate&lt;&#x2F;strong&gt; is preventive. Hand work to sub-agents with isolated context windows so token sprawl never reaches the main
agent. Sub-agents explore different parts of a codebase in parallel; only their compressed summaries come back. The
sprawl never enters the main context at all.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.anthropic.com&#x2F;engineering&#x2F;effective-context-engineering-for-ai-agents&quot;&gt;Anthropic&#x27;s guidance on context engineering&lt;&#x2F;a&gt;
formalises these as four strategies: write, select, compress, and isolate. My summarise maps to their compress; my
delegate maps to their isolate. The underlying principle is the same: every token in the context window competes for the
model&#x27;s attention, so be deliberate about what goes in.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;configuration-deterministic-vs-instructed&quot;&gt;Configuration: deterministic vs instructed&lt;&#x2F;h2&gt;
&lt;p&gt;A sharp distinction runs through my entire setup. Anything that can be checked mechanically is enforced via hooks or
automated verification steps, not by instructing the LLM. Linting, type checking, test runs, security scans, formatting.
These run automatically because the toolchain demands it. The LLM doesn&#x27;t need instructions to follow rules that are
enforced by the compiler.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.factory.ai&#x2F;using-linters-to-direct-agents&quot;&gt;Factory.ai&lt;&#x2F;a&gt; put this well: &quot;Agents write the code; linters write
the law.&quot; When you encode your architecture and standards directly into the code generation loop, the AI generates code,
gets automatic feedback, and iterates until clean. Lint passing becomes a proxy for &quot;conforms to architecture and best
practices.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Only non-deterministic behaviour controls go in instruction files like &lt;code&gt;CLAUDE.md&lt;&#x2F;code&gt;. Coding conventions that linters
don&#x27;t capture, architectural preferences, domain-specific patterns, interaction style, when to ask for clarification vs
proceed.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;...use the AskUserQuestion tool...&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The single most important instruction I give the agent: ask me rather than assume. If the context isn&#x27;t enough, if
there&#x27;s a trade-off to resolve, if research turns up conflicting options, stop and ask. Most AI failures I&#x27;ve seen trace
back to the model filling gaps with confident guesses instead of flagging uncertainty. Prompting for this aggressively
has done more for my output quality than any other single instruction.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Skills&lt;&#x2F;strong&gt; extend these non-deterministic instructions with progressive disclosure. Modular prompt definitions loaded
only when relevant. They keep the base context lean and bring in specialised instructions on demand: commit conventions,
review criteria, planning templates, domain-specific patterns.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Hooks&lt;&#x2F;strong&gt; are how the deterministic side gets enforced. Claude Code fires hooks on events like file saves and tool
calls. I use them to enforce rules, so the agent gets immediate feedback without being told to check. The agent fixes
issues in the same loop. No instruction needed, no judgement required.&lt;&#x2F;p&gt;
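&lt;p&gt;The enforcement side of a hook doesn&#x27;t need to be clever. This sketch shows the shape: run the deterministic checks, collect the failures, and hand them straight back to the agent. How you wire it to save or tool-call events is tool-specific, so check your agent&#x27;s documentation for the actual hook configuration:&lt;&#x2F;p&gt;

```python
import subprocess

def run_checks(commands):
    """Run deterministic checks and collect machine-readable feedback.

    commands is a list of argv lists, e.g. your linter, formatter and
    test runner. A hook wires something like this to file-save or
    tool-call events so failures flow straight back into the loop.
    """
    failures = []
    for argv in commands:
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((argv[0], result.stdout + result.stderr))
    return failures  # empty list means all checks passed
```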
&lt;p&gt;&lt;strong&gt;MCP servers&lt;&#x2F;strong&gt; are powerful but hungry. Every tool description loaded into context competes for the same attention
budget as the actual task. Be selective. Only connect what you&#x27;ll actually use for the current work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-review-bottleneck&quot;&gt;The review bottleneck&lt;&#x2F;h2&gt;
&lt;p&gt;AI has scaled code production. Human review capacity hasn&#x27;t changed. The research summarised in &lt;em&gt;Making Software&lt;&#x2F;em&gt;
(Oram &amp;amp; Wilson, O&#x27;Reilly) is consistent: roughly 400 lines per hour for effective review, with a hard wall at about 60
minutes of sustained attention. Beyond that, defect detection falls off a cliff.&lt;&#x2F;p&gt;
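&lt;p&gt;Those two numbers pin down a review budget with back-of-envelope arithmetic:&lt;&#x2F;p&gt;

```python
def review_budget(lines_changed, rate_per_hour=400, wall_minutes=60):
    """Back-of-envelope sizing against the review research numbers.

    At roughly 400 lines per hour, with attention falling off after
    about 60 minutes, one sitting covers at most ~400 lines. Anything
    bigger needs splitting into multiple right-sized changes.
    """
    per_sitting = rate_per_hour * wall_minutes // 60
    sittings = -(-lines_changed // per_sitting)  # ceiling division
    return per_sitting, sittings
```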
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gitclear.com&#x2F;ai_assistant_code_quality_2025_research&quot;&gt;AI-generated code has a 41% higher churn rate&lt;&#x2F;a&gt; than
human-written code. And an eight-month study of 200 employees
found &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hbr.org&#x2F;2026&#x2F;02&#x2F;ai-doesnt-reduce-work-it-intensifies-it&quot;&gt;83% said AI increased their workload&lt;&#x2F;a&gt; through
scope expansion and dissolved work boundaries.&lt;&#x2F;p&gt;
&lt;p&gt;This is the central constraint. My strategies for working within it:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Right-size the unit of work.&lt;&#x2F;strong&gt; Size tasks to stay within both the AI&#x27;s context sweet spot and the human review budget.
These constraints push in the same direction, which is convenient.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Validate and evaluate, not just verify.&lt;&#x2F;strong&gt; Human attention is most valuable at the requirements and plan level, where
AI is weakest and the cost of errors is highest.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Make verification deterministic.&lt;&#x2F;strong&gt; Strongly typed languages, linters, automated tests, contract tests, security
scanners. These go from helpful to essential in AI-augmented workflows. They handle the mechanical correctness that
humans shouldn&#x27;t spend review time on.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Triage before deep review.&lt;&#x2F;strong&gt; Fast architectural pass first, then focus on risk areas: security, data validation, error
handling, concurrency.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Make the AI account for deviations.&lt;&#x2F;strong&gt; The deviation log from implementation turns review into targeted investigation
rather than line-by-line reading.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;the-no-review-shortcut&quot;&gt;The no-review shortcut&lt;&#x2F;h3&gt;
&lt;p&gt;There&#x27;s a growing school of thought that if you checked the plan and the code seems to work, you can skip review and
ship. I understand the appeal. You can&#x27;t inspect all the code the way we did before. The volume has changed.&lt;&#x2F;p&gt;
&lt;p&gt;But AI code goes wrong in different ways than human code did. The failure modes I described above, poor global
coherence, missing security controls, naming drift, aren&#x27;t the kind of bugs that surface immediately in a demo.
They accumulate. Skipping review because the code appears to work ignores exactly the problems that don&#x27;t surface until production.&lt;&#x2F;p&gt;
&lt;p&gt;Then there&#x27;s compliance. Both SOC 2 and ISO 27001 have controls that require change management and peer review. The
purpose of code review isn&#x27;t just catching bugs. It&#x27;s establishing an auditable trail of authorisation. Could you
substitute automated testing, static analysis, and post-deploy monitoring as compensating controls? Maybe, in some
configurations. But you&#x27;d need to document that thoroughly, get buy-in from your auditor, and demonstrate it&#x27;s equally
effective. Most organisations would find it far easier to just do code reviews than justify the alternative to an
auditor.&lt;&#x2F;p&gt;
&lt;p&gt;The answer isn&#x27;t to skip review. It&#x27;s to scale it, focus it, and make it sustainable. Which is what everything above is
trying to do.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;cognitive-debt-and-operational-knowledge&quot;&gt;Cognitive debt and operational knowledge&lt;&#x2F;h2&gt;
&lt;p&gt;There&#x27;s a concept gaining traction called &lt;strong&gt;cognitive debt&lt;&#x2F;strong&gt;: the gap between the code your team ships and the code your
team actually understands. &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;margaretstorey.com&#x2F;blog&#x2F;2026&#x2F;02&#x2F;09&#x2F;cognitive-debt&#x2F;&quot;&gt;Margaret Storey&lt;&#x2F;a&gt; framed it well,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;simonwillison.net&#x2F;2026&#x2F;Feb&#x2F;15&#x2F;cognitive-debt&#x2F;&quot;&gt;Simon Willison amplified it&lt;&#x2F;a&gt;. The research suggests AI
generates code 5-7x faster than humans can comprehend it.&lt;&#x2F;p&gt;
&lt;p&gt;I think the problem goes deeper than code comprehension. AI can build fast. You cannot compress the learning that comes
from running a system in production with real users over time.&lt;&#x2F;p&gt;
&lt;p&gt;An LLM will build what you ask for but won&#x27;t volunteer what you haven&#x27;t thought to ask for. And the things you haven&#x27;t
thought to ask about are exactly what matters most in production. Payment timeouts. Reservation expiry race conditions.
Idempotency edge cases. I&#x27;ve encountered all of these through operating my own systems, not through planning or design.&lt;&#x2F;p&gt;
&lt;p&gt;The gap between what you can build and what you can operate is where trust breaks. AI-augmented development widens this
gap by accelerating the build side without touching the operational learning side.&lt;&#x2F;p&gt;
&lt;p&gt;The practical response: build deliberately. Simple first. Real usage from day one. Complexity only as the system proves
itself. AI assists everywhere, but the human decides everywhere.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;where-ai-adds-genuine-value&quot;&gt;Where AI adds genuine value&lt;&#x2F;h2&gt;
&lt;p&gt;Not everywhere.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Where it works:&lt;&#x2F;strong&gt; Unstructured-to-structured transformation (parsing inconsistent data formats that would previously
require brittle regex or hand-coding). Natural language interfaces, always with a human in the loop. Code generation
with disciplined context management. Parallel research and exploration via sub-agents.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Where it doesn&#x27;t:&lt;&#x2F;strong&gt; Replacing deterministic workflows. There is no good reason to replace a reliable cron job,
webhook, or message queue with a non-deterministic alternative. Unsupervised autonomous operation: an AI agent with API
keys and shell access on a timer is a security incident waiting to happen. And anywhere robustness matters more than
novelty. If the existing solution works reliably, the burden of proof is on the AI replacement.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;how-this-compares-to-the-field&quot;&gt;How this compares to the field&lt;&#x2F;h2&gt;
&lt;p&gt;This process aligns with several emerging practices. The research-plan-implement workflow mirrors
what &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;humanlayer&#x2F;advanced-context-engineering-for-coding-agents&#x2F;blob&#x2F;main&#x2F;ace-fca.md&quot;&gt;Dex Horthy&lt;&#x2F;a&gt;, &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.anthropic.com&#x2F;engineering&#x2F;effective-context-engineering-for-ai-agents&quot;&gt;Anthropic&lt;&#x2F;a&gt;,
and &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;simonwillison.net&quot;&gt;Simon Willison&lt;&#x2F;a&gt; independently advocate. Context engineering as the central discipline
matches &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;Jeff Huber&#x27;s&lt;&#x2F;a&gt; framing. Plan-first, spec-driven development has
become the consensus position, replacing the early &quot;vibe coding&quot; enthusiasm.&lt;&#x2F;p&gt;
&lt;p&gt;Where I think this approach diverges:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Deterministic enforcement over LLM instructions.&lt;&#x2F;strong&gt; Most guides put everything in &lt;code&gt;CLAUDE.md&lt;&#x2F;code&gt; or similar files. I
reserve instruction files for genuinely non-deterministic guidance and enforce everything else through hooks and
tooling. If a machine can check it, a machine should enforce it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Operational knowledge as the constraint, not code generation speed.&lt;&#x2F;strong&gt; The industry conversation focuses on how fast
you can ship. I think the gap between build speed and operational understanding is the primary risk. Cognitive debt at
the code level is real, but the knowledge that only comes from production is the harder problem.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Collaboration over autonomy.&lt;&#x2F;strong&gt; The mainstream is moving towards more agent autonomy. I&#x27;m betting that the best
outcomes come from effective collaboration between AI and experienced, product-focused engineers. The human brings
domain knowledge, system-wide judgement, and operational experience. The AI brings speed, parallel exploration, and
tireless execution. Neither alone matches what they produce together. That&#x27;s what AI-augmented development means.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;metr.org&#x2F;blog&#x2F;2025-07-10-early-2025-ai-experienced-os-dev-study&#x2F;&quot;&gt;METR study&lt;&#x2F;a&gt; (mid-2025) found experienced
developers were 19% slower with AI on their own large codebases. This doesn&#x27;t match my experience, and I attribute the
difference to two things: context management discipline (most developers in the study used AI without structured
workflows), and the step change in model quality and tooling that arrived in December 2025.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-hard-part-was-never-typing&quot;&gt;The hard part was never typing&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.svpg.com&#x2F;four-big-risks&#x2F;&quot;&gt;Marty Cagan&lt;&#x2F;a&gt; describes four product risks: value (will people use it?),
usability (can they figure it out?), feasibility (can we build it?), and business viability (does it work for the
business?). AI has reduced feasibility risk significantly. It has not reduced the others. If anything, by making it
cheaper to build, it shifts attention back to value risk: are we building the right thing?&lt;&#x2F;p&gt;
&lt;p&gt;The process in a sentence: assemble the right steps for the work, fresh context per step, compress between transitions,
enforce deterministically what you can, instruct the AI only on what requires judgement, and validate and evaluate
before you verify.&lt;&#x2F;p&gt;
&lt;p&gt;This keeps evolving. I&#x27;ll be wrong about parts of it in six months. But the underlying bet, that disciplined
collaboration between human judgement and AI capability beats either alone, is the one I&#x27;m most confident in.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;re working through this yourself, I&#x27;d genuinely love to hear what&#x27;s working for you. &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rmvDxxNubIg&quot;&gt;No Vibes Allowed: Solving Hard Problems in Complex Codebases&lt;&#x2F;a&gt; (Dex
Horthy, HumanLayer)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;humanlayer&#x2F;advanced-context-engineering-for-coding-agents&#x2F;blob&#x2F;main&#x2F;ace-fca.md&quot;&gt;Advanced Context Engineering for Coding Agents&lt;&#x2F;a&gt; (
HumanLayer)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;Context Rot: How Increasing Input Tokens Impacts LLM Performance&lt;&#x2F;a&gt; (Chroma
Research)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.anthropic.com&#x2F;engineering&#x2F;effective-context-engineering-for-ai-agents&quot;&gt;Effective Context Engineering for AI Agents&lt;&#x2F;a&gt; (
Anthropic)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.factory.ai&#x2F;using-linters-to-direct-agents&quot;&gt;Using Linters to Direct Agents&lt;&#x2F;a&gt; (Factory.ai)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;veracode.com&#x2F;blog&#x2F;ai-generated-code-security-risks&#x2F;&quot;&gt;AI-Generated Code Security Risks&lt;&#x2F;a&gt; (Veracode, 2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gitclear.com&#x2F;ai_assistant_code_quality_2025_research&quot;&gt;AI Assistant Code Quality 2025 Research&lt;&#x2F;a&gt; (GitClear)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.coderabbit.ai&#x2F;blog&#x2F;state-of-ai-vs-human-code-generation-report&quot;&gt;State of AI vs Human Code Generation&lt;&#x2F;a&gt; (
CodeRabbit, 2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;survey.stackoverflow.co&#x2F;2025&#x2F;&quot;&gt;2025 Developer Survey&lt;&#x2F;a&gt; (Stack Overflow)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hbr.org&#x2F;2026&#x2F;02&#x2F;ai-doesnt-reduce-work-it-intensifies-it&quot;&gt;AI Doesn&#x27;t Reduce Work, It Intensifies It&lt;&#x2F;a&gt; (HBR,
2026)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;margaretstorey.com&#x2F;blog&#x2F;2026&#x2F;02&#x2F;09&#x2F;cognitive-debt&#x2F;&quot;&gt;Cognitive Debt&lt;&#x2F;a&gt; (Margaret Storey)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;metr.org&#x2F;blog&#x2F;2025-07-10-early-2025-ai-experienced-os-dev-study&#x2F;&quot;&gt;Impact of AI on Experienced Developer Productivity&lt;&#x2F;a&gt; (
METR, 2025)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.svpg.com&#x2F;four-big-risks&#x2F;&quot;&gt;The Four Big Risks&lt;&#x2F;a&gt; (Marty Cagan, SVPG)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ghuntley.com&#x2F;loop&#x2F;&quot;&gt;The Ralph Loop&lt;&#x2F;a&gt; (Geoffrey Huntley)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.oreilly.com&#x2F;library&#x2F;view&#x2F;making-software&#x2F;9780596808310&#x2F;&quot;&gt;Making Software&lt;&#x2F;a&gt; (Oram &amp;amp; Wilson, O&#x27;Reilly)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Code Review in the Age of AI-Augmented Development</title>
        <published>2026-02-26T00:00:00+00:00</published>
        <updated>2026-02-26T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/code-review-ai-augmented-development/"/>
        <id>https://daz.is/blog/code-review-ai-augmented-development/</id>
        
        <content type="html" xml:base="https://daz.is/blog/code-review-ai-augmented-development/">&lt;p&gt;These days I spend much more of my development time reviewing code than writing it myself. I&#x27;ve also found myself
thinking more deeply about &lt;em&gt;what&lt;&#x2F;em&gt; to build, and how to specify it, before anything gets generated. I wrote recently
about &lt;a href=&quot;&#x2F;blog&#x2F;thinking-in-plans-not-code&#x2F;&quot;&gt;thinking in plans, not code&lt;&#x2F;a&gt; and how the leverage has shifted upstream to
research and planning. This post is about the other side: what happens downstream, when the code arrives and you have to
decide whether it&#x27;s right.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-human-constant&quot;&gt;The Human Constant&lt;&#x2F;h2&gt;
&lt;p&gt;There&#x27;s a chapter in &lt;em&gt;Making Software&lt;&#x2F;em&gt; (Oram &amp;amp; Wilson, O&#x27;Reilly) that summarises two studies on code review
effectiveness.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;code-review-ai-augmented-development&#x2F;code-review.png&quot; alt=&quot;Code Review Charts&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The first (Dunsmore 2000) mapped defect detection over time. Early in a review, the relationship is linear: roughly one
defect found every ten minutes. But around the 60-minute mark, there&#x27;s a sharp drop-off. Another ten minutes no longer
reliably turns up another defect. The brain hits a wall.&lt;&#x2F;p&gt;
&lt;p&gt;The second (Cohen 2006) looked at around 2,500 reviews and measured the effect of review speed. Below about 400 lines of
code per hour, defect density spreads naturally across reviews. Some code is simple with few defects, some is complex
with many. That spread is normal. Above 400-500 LOC&#x2F;hour, high defect density reviews virtually disappear. Not because
the defects aren&#x27;t there. Because the reviewer is moving too fast to find them.&lt;&#x2F;p&gt;
&lt;p&gt;The conclusion: at most one hour, at most 400 lines. Review more than that in a single sitting and you&#x27;re not going to
be effective.&lt;&#x2F;p&gt;
&lt;p&gt;These are cognitive limits. They haven&#x27;t changed. What&#x27;s changed is the volume of code arriving at your desk. An AI
coding assistant can produce in minutes what used to take a developer a day. The production side has scaled. The review
side is still bounded by the same brain it always was.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-coherence-problem&quot;&gt;The Coherence Problem&lt;&#x2F;h2&gt;
&lt;p&gt;AI-generated code tends to look fine in isolation. Each function is reasonable. Each module makes sense on its own.
What&#x27;s easy to miss isn&#x27;t bad code. It&#x27;s code that doesn&#x27;t make sense when you look at the bigger picture.&lt;&#x2F;p&gt;
&lt;p&gt;Three modules that solve overlapping problems in slightly different ways. Abstractions that don&#x27;t compose because they
were never designed together. Naming conventions that drift across files. Local coherence, but not global coherence.&lt;&#x2F;p&gt;
&lt;p&gt;In a traditional team, this kind of drift happens too, but it happens slowly. Over weeks, conversations and reviews
naturally surface the divergence. Someone says &quot;wait, didn&#x27;t we already solve this?&quot; and the team realigns. With AI, the
same mess can accumulate in an afternoon. The code all looks clean, so the signals that would normally trigger a course
correction don&#x27;t fire.&lt;&#x2F;p&gt;
&lt;p&gt;I had a telling example recently. I&#x27;d specified RustFS in my requirements for integration testing some S3 code. By the
time the AI-generated plan came back, that had quietly become Minio, the more widely known option. The substitution
looked perfectly reasonable at a glance. I missed it. One line in a plan that would have been a trivial correction
became an extra round of implementation to revert and swap out the dependency.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s the leverage problem in miniature. Catching it in the plan costs you a one-line edit. Catching it in the code
costs you a cycle of rework.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;right-size-the-unit-of-work&quot;&gt;Right-Size the Unit of Work&lt;&#x2F;h2&gt;
&lt;p&gt;One response to the review bottleneck is to control what you&#x27;re generating in the first place. I&#x27;ve found a rough
guideline: size tasks and planning phases so they fit within about 40% of the AI&#x27;s context window. That&#x27;s around where
context rot starts to bite, the gradual degradation in output quality as the context fills up.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s not a hard rule. But it serves two purposes. It keeps the AI&#x27;s output consistent and reliable. And it keeps each
chunk of output within a budget that a human can actually review properly, given the cognitive limits above.&lt;&#x2F;p&gt;
&lt;p&gt;Approval checkpoints need the same kind of sizing. Too many interrupts and you overwhelm the human reviewer with
constant context-switching. Too few and drift goes unchecked until it&#x27;s expensive to fix.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s no formula for this yet. Both sides, human and AI, are developing intuition for what works. That intuition
builds through practice, is context-dependent, and isn&#x27;t something you can read off a chart.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;review-plans-not-just-code&quot;&gt;Review Plans, Not Just Code&lt;&#x2F;h2&gt;
&lt;p&gt;Human attention is most valuable at the levels where AI is weakest: specifications, requirements, and architectural
coherence. The review question shifts from &quot;is this code correct?&quot; to &quot;is this spec complete?&quot; and &quot;does this still hang
together?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Going back to that RustFS example, if I had reviewed the plan more carefully, catching the substitution would have been
a one-line correction. Instead, I caught it after implementation, and it cost a rework cycle. The same principle applies
at every scale: the earlier you apply human judgment, the cheaper the correction.&lt;&#x2F;p&gt;
&lt;p&gt;Senior developers&#x27; experience and context matter most here. The ability to hold the bigger picture, to spot when a plan
is subtly drifting from what the system needs, to ask &quot;have we already solved this differently elsewhere?&quot; That&#x27;s the
work.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;make-verification-deterministic&quot;&gt;Make Verification Deterministic&lt;&#x2F;h2&gt;
&lt;p&gt;Every check you can make deterministic is a check you take off the human reviewer&#x27;s plate. Strongly typed languages
catch entire categories of error at compile time. Linters enforce consistency across files without anyone reading them.
Automated tests verify behaviour. Security scanners flag known patterns. Contract tests confirm that modules still talk
to each other correctly.&lt;&#x2F;p&gt;
&lt;p&gt;None of this is new. But in an AI-augmented workflow these tools go from helpful to essential. They&#x27;re what make the
review budget viable, because they remove whole classes of concern from the pile of things a human has to think about.
The more you push into deterministic verification, the smaller the surface area of judgment-dependent review becomes.&lt;&#x2F;p&gt;
&lt;p&gt;Never send an LLM (or a human) to do a linter&#x27;s job.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;triage-before-you-review&quot;&gt;Triage Before You Review&lt;&#x2F;h2&gt;
&lt;p&gt;When you do sit down to review, start with a fast architectural pass. Does the overall shape make sense? Do the modules
fit together? Are the boundaries in the right places? Only then focus your attention on the parts that carry the most
risk: security boundaries, data validation, error handling, concurrency.&lt;&#x2F;p&gt;
&lt;p&gt;AI can help here too, as a first pass. Use it to triage, directing your attention rather than replacing your
judgment: let it flag the areas that look unusual or complex, then spend your limited review time on those.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;make-the-ai-account-for-its-decisions&quot;&gt;Make the AI Account for Its Decisions&lt;&#x2F;h2&gt;
&lt;p&gt;One practice I&#x27;ve found useful: after implementation, ask the AI to report where and why it deviated from the plan. It
won&#x27;t catch everything (it can be blind to its own substitutions), and it tends to elaborate on where it followed the
plan rather than where it deviated. But when the prompt lands, it shifts review from reading every line looking for
surprises to a targeted investigation of the places that actually need your attention.&lt;&#x2F;p&gt;
&lt;p&gt;Again, back to that RustFS-to-Minio example. This should not be something you have to spot by chance but rather
something that gets surfaced for you. The AI might tell you &quot;I used Minio instead of RustFS because the test container
support is more mature.&quot; Now you have a decision to make rather than a detail to catch. That&#x27;s a better use of your
attention.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-changes&quot;&gt;What Changes&lt;&#x2F;h2&gt;
&lt;p&gt;This requires a genuine shift in how senior developers think about their work. The high-leverage activity isn&#x27;t reading
code line by line any more. It&#x27;s writing specs tight enough that generation is constrained, reviewing plans before they
become implementations, and maintaining the coherent bigger picture that no individual AI context window can hold.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s harder to measure than lines reviewed. It&#x27;s harder to put in a standup update. But it&#x27;s where the bottleneck
actually is, and it&#x27;s where experienced developers can make the work better, or, by not doing it, let it quietly
degrade.&lt;&#x2F;p&gt;
&lt;p&gt;If any of this resonates, or if you&#x27;ve found approaches that work differently, &lt;a href=&quot;&#x2F;contact&quot;&gt;drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Build Fast, Learn Slow</title>
        <published>2026-02-17T00:00:00+00:00</published>
        <updated>2026-02-17T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/build-fast-learn-slow/"/>
        <id>https://daz.is/blog/build-fast-learn-slow/</id>
        
        <content type="html" xml:base="https://daz.is/blog/build-fast-learn-slow/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Update — 2026-03-04&lt;&#x2F;span&gt;
  &lt;p&gt;I outlined this post a while ago but never finished it. I&#x27;m posting it now because I think it&#x27;s interesting background
thinking to my &lt;a href=&quot;&#x2F;blog&#x2F;operational-debt&quot;&gt;operational debt&lt;&#x2F;a&gt; post.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;p&gt;If an AI-augmented engineer can build an app in a weekend, what happens to SaaS?&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m a tech lead for data and integrations at a SaaS company. But I also
run &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;zero-waste-tickets.com&quot;&gt;Zero Waste Tickets&lt;&#x2F;a&gt;, a small side project, with real users.&lt;&#x2F;p&gt;
&lt;p&gt;I see software from both sides: from inside a mature product, and as a solo operator building from scratch.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-code-was-never-the-hard-part&quot;&gt;The code was never the hard part&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve rebuilt Zero Waste Tickets a few times. Each time the technology changed completely. Different stack, different
architecture, different approach. What carried over was the operational knowledge. Everything I&#x27;d learned about what
goes wrong.&lt;&#x2F;p&gt;
&lt;p&gt;AI coding tools are extraordinary. You can build in a weekend what used to take months. But you can&#x27;t &lt;em&gt;learn how to
operate&lt;&#x2F;em&gt; what you&#x27;ve built at the same pace. The code races ahead of your understanding. The gap between &quot;it works in a
demo&quot; and &quot;I&#x27;d trust it with someone&#x27;s money&quot; is where all the interesting problems live.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sounds-like-an-edge-case&quot;&gt;&quot;Sounds like an edge case&quot;&lt;&#x2F;h2&gt;
&lt;p&gt;I recently spoke to someone who had vibe-coded their own ticket-selling application. Looked great. I asked how they
prevented overselling. What happens when more people try to buy tickets than are available, all at the same time?&lt;&#x2F;p&gt;
&lt;p&gt;They hadn&#x27;t thought about it. &quot;Sounds like an edge case.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Overselling is not an edge case in a ticketing system. It&#x27;s &lt;em&gt;the&lt;&#x2F;em&gt; core integrity problem of the domain. That&#x27;s like
building a banking app and calling incorrect balances an edge case. But this person wasn&#x27;t careless or incompetent. They
just hadn&#x27;t encountered the problem yet because they hadn&#x27;t operated the system under real conditions. The LLM that
generated their code hadn&#x27;t raised it either, because they hadn&#x27;t thought to ask.&lt;&#x2F;p&gt;
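&lt;p&gt;To make the point concrete, here&#x27;s roughly what the missing guard looks like. This is a hedged sketch in Python with SQLite, with an invented one-row-per-ticket-type schema, not anyone&#x27;s production code: the availability check and the decrement happen in a single atomic statement, so two racing buyers can&#x27;t both claim the last ticket.&lt;&#x2F;p&gt;

```python
import sqlite3

# Illustrative schema: one row per ticket type with a remaining count.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticket_types (id TEXT PRIMARY KEY, available INTEGER)")
conn.execute("INSERT INTO ticket_types VALUES ('general', 2)")
conn.commit()

def reserve(conn, ticket_type, quantity):
    """Atomically claim tickets; the WHERE clause is the oversell guard."""
    cur = conn.execute(
        "UPDATE ticket_types SET available = available - ? "
        "WHERE id = ? AND available >= ?",
        (quantity, ticket_type, quantity),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means not enough stock

print(reserve(conn, "general", 1))  # True
print(reserve(conn, "general", 2))  # False: only 1 left, no oversell
```

&lt;p&gt;Concurrent purchases all hit the same conditional UPDATE; the database serialises them, and at most one succeeds for the final ticket.&lt;&#x2F;p&gt;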
&lt;p&gt;An LLM will build what you ask for. It won&#x27;t volunteer the things that matter most in production.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-payment-timeout-lesson&quot;&gt;The payment timeout lesson&lt;&#x2F;h2&gt;
&lt;p&gt;In an earlier iteration of Zero Waste Tickets I had a payment error from a production edge case I hadn&#x27;t considered
during design. A user started buying tickets. They got to the payment step, where the bank sometimes asks for additional
verification. Then they walked away from their computer.&lt;&#x2F;p&gt;
&lt;p&gt;Completely reasonable human behaviour. But here&#x27;s what happened underneath: the system had reserved their tickets. After
a long period of inactivity it returned the reservation to the pool, as designed. Those tickets got bought by someone
else. Then, hours later, the original payment completed. The bank said yes, money moved, but the order was now invalid
because the tickets were gone. I had taken into account many cases, including declined transactions and payment
processing delays, but I hadn&#x27;t considered this particular case where the verification was delayed.&lt;&#x2F;p&gt;
&lt;p&gt;Three systems had each done the correct thing. But collectively it was broken. My reservation pool, my order state, and
Stripe&#x27;s payment intent all behaved correctly in isolation. The fix wasn&#x27;t just atomic updates to reservations and
orders, which I&#x27;d already been careful about across all three rebuilds. It was cleaning up the payment intent on
Stripe&#x27;s side when a reservation expired. I had thought about other delays in checkout, but nobody had ever walked away
from their screen for that long mid-verification.&lt;&#x2F;p&gt;
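&lt;p&gt;The shape of that fix, as a hedged sketch: the reservation structure is invented and the payment provider&#x27;s cancel call is injected as a plain function (this is not Stripe&#x27;s actual API), but the essential move is that releasing the tickets and cancelling the pending payment intent happen together.&lt;&#x2F;p&gt;

```python
def expire_reservation(reservation, cancel_intent):
    """Release the tickets and kill the pending payment intent together,
    so a bank verification completed hours later can no longer move money
    for tickets that have been returned to the pool.

    cancel_intent is injected (a thin wrapper around the payment
    provider's cancel endpoint) so the expiry logic stays testable.
    """
    reservation["status"] = "expired"
    reservation["tickets_released"] = True
    if reservation.get("payment_intent_id"):
        cancel_intent(reservation["payment_intent_id"])

# Stub the provider call to show the behaviour.
cancelled = []
res = {"status": "pending", "payment_intent_id": "pi_123"}
expire_reservation(res, cancelled.append)
print(cancelled)  # ['pi_123']
```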
&lt;p&gt;I learned a similar lesson with idempotency keys. Get them wrong and you enable double payments. That sounds like a
technical detail until a real person sees two charges on their bank statement and loses trust in your system instantly.&lt;&#x2F;p&gt;
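&lt;p&gt;For illustration, the core of an idempotency-key guard looks something like this. Names are invented and the store is an in-memory dict where a real system would persist keys in the same transaction as the charge:&lt;&#x2F;p&gt;

```python
import threading

class IdempotentCharger:
    """Sketch: dedupe charge requests by idempotency key, so a retried
    request replays the stored result instead of charging twice."""

    def __init__(self, charge_fn):
        self._charge = charge_fn
        self._results = {}          # key: result of the original attempt
        self._lock = threading.Lock()

    def charge(self, key, amount):
        with self._lock:
            if key in self._results:
                return self._results[key]  # replay, not a second charge
            result = self._charge(amount)
            self._results[key] = result
            return result

# Stub gateway to show the dedup behaviour.
calls = []
def fake_charge(amount):
    calls.append(amount)
    return f"charged {amount}"

charger = IdempotentCharger(fake_charge)
print(charger.charge("order-42", 1000))  # charged 1000
print(charger.charge("order-42", 1000))  # same result, no second charge
print(len(calls))  # 1
```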
&lt;p&gt;Perhaps these are things you could anticipate by being smarter. But there will always be things you only learn by
operating the system with real users, real money, and real behaviour over years.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-you-re-actually-paying-for&quot;&gt;What you&#x27;re actually paying for&lt;&#x2F;h2&gt;
&lt;p&gt;This brings me back to the SaaS question. I&#x27;ve worked in many software organisations. A lot of engineering time goes to
handling complexity that only reveals itself at scale, over time, across thousands of different customer environments.&lt;&#x2F;p&gt;
&lt;p&gt;When you pay for a mature SaaS product, you&#x27;re not paying for code. Code is increasingly cheap. You&#x27;re paying for the
operational knowledge baked into that system over years. Every edge case discovered. Every failure mode handled. Every
&quot;sounds unlikely&quot; scenario that turned out to happen on the third Tuesday of every month.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.svpg.com&#x2F;four-big-risks&#x2F;&quot;&gt;Marty Cagan&lt;&#x2F;a&gt; talks about the cost of supporting a product as a key product
question. For my side project, this is critical: I have limited time, I want to keep it fun, and I need to be honest
about what I can actually operate and support. I&#x27;ve grown Zero Waste Tickets deliberately. Simple first. Real money from
day one. Added complexity only as the system proved itself. Invited other event organisers by word of mouth once I was
confident it could handle the responsibility.&lt;&#x2F;p&gt;
&lt;p&gt;That deliberate pace isn&#x27;t a weakness. It&#x27;s the discipline. Every feature I added, I could also &lt;em&gt;support&lt;&#x2F;em&gt;. I understood
the failure modes because I&#x27;d lived with the system long enough to encounter them.&lt;&#x2F;p&gt;
&lt;p&gt;This is what I was getting at in my post about &lt;a href=&quot;&#x2F;blog&#x2F;bot-protection-weekend-project&#x2F;&quot;&gt;overengineering a login form&lt;&#x2F;a&gt;.
Agentic coding decouples build speed from operational understanding. That&#x27;s both its power and its risk. You can
generate a system far more complex than you can comprehend, operate, or support. When something goes wrong, you won&#x27;t
have the mental model to diagnose it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-knowledge-that-doesn-t-compress&quot;&gt;The knowledge that doesn&#x27;t compress&lt;&#x2F;h2&gt;
&lt;p&gt;Is SaaS under threat from AI coding? For simple, low-stakes tools, probably. If the consequences of failure are a minor
inconvenience, generating something bespoke might make perfect sense.&lt;&#x2F;p&gt;
&lt;p&gt;But for anything involving money, trust, security, or reliability under pressure? The operational knowledge is the moat.
Not because AI can&#x27;t write the code. It can, and it keeps getting better. But because knowing &lt;em&gt;what&lt;&#x2F;em&gt; code to write
requires having encountered the problems that only show up in production, over time, with real users doing unpredictable
things.&lt;&#x2F;p&gt;
&lt;p&gt;Security is another example. AI coding agents won&#x27;t typically add CSRF protection unless you specifically ask. How
many other security considerations are you not thinking to ask about? You don&#x27;t know. That&#x27;s the point.&lt;&#x2F;p&gt;
&lt;p&gt;The real value of mature software isn&#x27;t the codebase. It&#x27;s the deep domain knowledge that gets baked into the system
and its operation.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-s-next&quot;&gt;What&#x27;s next&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m thinking a lot about where software goes as interactions become increasingly agent-to-agent rather than
human-to-human. Headless software where there&#x27;s no web UI at all, just APIs and agents talking to each other. That
changes what &quot;software&quot; even means, and I think it has implications for what matters most: security, monitoring,
measuring outcomes, improving over time. But that&#x27;s a post for another day.&lt;&#x2F;p&gt;
&lt;p&gt;For now, my advice to anyone building with AI coding tools: enjoy the speed. It&#x27;s genuinely transformative. But respect
the gap between what you can build and what you can operate. That gap is where your users get hurt.&lt;&#x2F;p&gt;
&lt;p&gt;If the thing you&#x27;re building handles someone else&#x27;s money or trust, maybe consider whether a conversation with someone
who&#x27;s been through the wars might be worth more than the monthly SaaS fee suggests.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;d love to hear from others who are thinking about this. &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>How to Overengineer a Login Form</title>
        <published>2026-02-16T00:00:00+00:00</published>
        <updated>2026-02-16T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/bot-protection-weekend-project/"/>
        <id>https://daz.is/blog/bot-protection-weekend-project/</id>
        
        <content type="html" xml:base="https://daz.is/blog/bot-protection-weekend-project/">&lt;p&gt;Yes, the irony of using a bot to build bot protection is not lost on me. But the experience taught me something.
Development hasn&#x27;t gotten easier with AI. It&#x27;s gotten more intense.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-postmark-incident&quot;&gt;The Postmark Incident&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a href=&quot;&#x2F;work&#x2F;zero-waste-tickets&#x2F;&quot;&gt;Zero Waste Tickets&lt;&#x2F;a&gt; is a side project of mine. Real users, real traffic, nothing massive.
The login flow is passwordless. You enter your email address and the app sends you a code. No passwords to manage, no
credentials to store. Simple.&lt;&#x2F;p&gt;
&lt;p&gt;Too simple, it turns out, if you don&#x27;t protect the form.&lt;&#x2F;p&gt;
&lt;p&gt;Last September, Postmark paused sending on my account. Polite email, no drama, but the message was clear: they&#x27;d spotted
anomalous sending patterns and flagged it as potential abuse.&lt;&#x2F;p&gt;
&lt;p&gt;You can see in the graph below that the site doesn&#x27;t have that many users. It&#x27;s a small side project in a closed
beta, so it&#x27;s only really used by friends and friends of friends. But you can also see that email bounces had been
slowly increasing, then surged on 23rd September:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;bot-protection-weekend-project&#x2F;img.png&quot; alt=&quot;img.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The investigation didn&#x27;t take long. Bots had been hammering the login form. Every submission triggered an email with a
login code. Postmark&#x27;s message suggested that my API token might have been compromised. It hadn&#x27;t, it was just that my
basic bot protection had failed.&lt;&#x2F;p&gt;
&lt;p&gt;When I first built the site several years ago it had no protection. But I noticed some fake login attempts in the
logs, so I implemented a basic honeypot: a field that&#x27;s invisible to regular users but that bots fill in. If the field
came back with a value, I rejected the submission. That worked fine for years. Then the error rate started to climb
slowly, the honeypot stopped catching them, and the volume was enough to trip Postmark&#x27;s detection.&lt;&#x2F;p&gt;
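&lt;p&gt;The server-side half of a honeypot is tiny. Here&#x27;s a sketch, with &quot;website&quot; as a hypothetical decoy field name (hidden from humans with CSS, but typically filled in by naive bots):&lt;&#x2F;p&gt;

```javascript
// Honeypot check sketch. "website" is a made-up decoy field name; real
// users never see the field, so any value in it suggests an automated
// submission that blindly filled in every input on the form.
function isLikelyBot(formData) {
  const decoy = formData.website;
  return typeof decoy === "string" ? decoy.length > 0 : false;
}
```

&lt;p&gt;The weakness, as I found out, is that smarter bots eventually learn to leave the decoy empty.&lt;&#x2F;p&gt;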
&lt;h2 id=&quot;the-weekend-fix&quot;&gt;The Weekend Fix&lt;&#x2F;h2&gt;
&lt;p&gt;I put Cloudflare in front of the site as an emergency response, which bought some time. But Cloudflare was having its
own reliability issues around then, and I&#x27;d rather not make my users&#x27; access to a side project contingent on a third
party. I like to keep dependencies minimal, and this is a project I use for learning and experimenting. I wanted to
understand the problem, not outsource it.&lt;&#x2F;p&gt;
&lt;p&gt;What I didn&#x27;t want was a captcha. Annoying UX, terrible privacy. I don&#x27;t want my users identifying motorbikes and fire
hydrants to log in.&lt;&#x2F;p&gt;
&lt;p&gt;I hate proof-of-work in principle, because of the wasted effort. It goes against the Zero Waste Tickets ethos. But I
needed something that would stay out of the users&#x27; way while tripping up attackers, or at least slowing them down to
the point where it&#x27;s not worth it. I was only adding it to the login form, as the rest of the site is protected by the
login session, so I figured the waste was minimal for that one form if it stopped the spammers.&lt;&#x2F;p&gt;
&lt;p&gt;I built it by hand over a weekend. Before the server accepts a form submission, the browser has to solve a small
computational puzzle. A hash challenge running in a Web Worker so it wouldn&#x27;t block the UI. The server generates a
challenge, the client computes the answer, the server verifies it before processing the form. Nothing fancy. Rust on the
backend, a bit of JavaScript on the front.&lt;&#x2F;p&gt;
&lt;p&gt;It worked. The spam dropped off. Postmark was happy. I moved on.&lt;&#x2F;p&gt;
&lt;p&gt;That could have been the end of the story.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-descent&quot;&gt;The Descent&lt;&#x2F;h2&gt;
&lt;p&gt;A few months later I came back to the problem. Not because the proof of work stopped working. It&#x27;s still working fine. I
came back because I&#x27;m helping my wife get her site off Squarespace and she needs a contact form. That means bot
protection. So what if I extracted the bot protection from ZWT and put it into its own reusable service?&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s where things escalated.&lt;&#x2F;p&gt;
&lt;p&gt;Before AI, a &quot;weekend project&quot; for me was: implement a proof-of-work challenge on a login form. Research the approach,
write the hash function, wire up the Web Worker, build the server verification, test it, ship it. A focused,
self-contained piece of work.&lt;&#x2F;p&gt;
&lt;p&gt;After AI, a &quot;weekend project&quot; is: multiple challenge algorithms, a broker that selects the right one based on risk
signals, dynamic difficulty scaling, behavioural analysis. You&#x27;re halfway to accidentally reinventing Cloudflare.&lt;&#x2F;p&gt;
&lt;p&gt;Over-engineering used to be self-limiting because building things took time. You&#x27;d think &quot;what if I added dynamic
difficulty scaling?&quot; and then you&#x27;d put it on the ever-growing list of things to maybe get to later. That brake is gone.
With Claude Code, every one of those ideas is achievable in the time it used to take to build just one.&lt;&#x2F;p&gt;
&lt;p&gt;And &quot;weekend&quot; is generous too. It&#x27;s really a few hours here and there, squeezed in when I find time.&lt;&#x2F;p&gt;
&lt;p&gt;The answer isn&#x27;t to resist every impulse to overengineer. Some of that expanded scope is genuinely good. The challenge
broker is real architecture that solves a real problem. Dynamic difficulty is good protection.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;being-honest&quot;&gt;Being Honest&lt;&#x2F;h2&gt;
&lt;p&gt;Zero Waste Tickets doesn&#x27;t get enough traffic to justify any of this. The original proof of work solved the problem.&lt;&#x2F;p&gt;
&lt;p&gt;The Postmark incident was real. The learning was real. The increased potential is real. But so is the cognitive load.
Every &quot;what if&quot; that the AI makes achievable is another thing to evaluate, review, and maintain. The temptation to
overengineer isn&#x27;t free. It takes mental energy to resist it, and more energy when you don&#x27;t.&lt;&#x2F;p&gt;
&lt;p&gt;A recent HBR article by Ranganathan and Ye,
&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;hbr.org&#x2F;2026&#x2F;02&#x2F;ai-doesnt-reduce-work-it-intensifies-it&quot;&gt;&quot;AI Doesn&#x27;t Reduce Work—It Intensifies It&quot;&lt;&#x2F;a&gt;, found
exactly this. They studied 200 employees at a tech company over eight months. Nobody was asked to do more. But with AI
tools available, they voluntarily expanded their own workloads. The researchers described &quot;a sense of always juggling,
even as the work felt productive.&quot; That&#x27;s the feeling.&lt;&#x2F;p&gt;
&lt;p&gt;I had a realisation recently while in a supermarket. There was one person on the old-style tills, scanning items,
chatting to people, making the experience human. And there was one person on the self-scan checkouts dealing with twelve
tills at once, running from one to the next, helping frustrated customers whose machines weren&#x27;t working, in constant
demand. That&#x27;s what coding with AI agents is like. You&#x27;re not doing less. You&#x27;re supervising more, across more fronts,
with less downtime between decisions. Except nobody made you move to the self-scan area. You walked over there yourself,
because the machines looked faster.&lt;&#x2F;p&gt;
&lt;p&gt;Development hasn&#x27;t really gotten any easier with AI. It&#x27;s gotten more intense.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Context Engineering Is the Job</title>
        <published>2026-02-15T00:00:00+00:00</published>
        <updated>2026-02-15T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/context-engineering-is-the-job/"/>
        <id>https://daz.is/blog/context-engineering-is-the-job/</id>
        
        <content type="html" xml:base="https://daz.is/blog/context-engineering-is-the-job/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Update — 2026-03-01&lt;&#x2F;span&gt;
  &lt;p&gt;This post has been superseded by &lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;How I Work with AI Coding Agents&lt;&#x2F;a&gt;. I&#x27;ve kept
it here rather than archiving it because I think it&#x27;s interesting to show how my thinking changed as I developed my
working processes. If you&#x27;re just after my latest compilation of how I&#x27;m working, you might want to check that more
recent post instead.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;p&gt;In my previous post on &lt;a href=&quot;&#x2F;blog&#x2F;ai-engineer&#x2F;&quot;&gt;AI engineering&lt;&#x2F;a&gt;, I talked a lot about how I think it&#x27;s largely about context
management. Keep the context clean. Stay in the smart zone. Don&#x27;t let the model guess.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve been researching this more, and I&#x27;ve got a lot of insights from listening to Jeff Huber. He&#x27;s the CEO of Chroma,
the company behind the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;context rot research&lt;&#x2F;a&gt; I referenced in that post.
He&#x27;s been across several podcasts making a case that I find compelling: context engineering isn&#x27;t just a technique. It&#x27;s
&lt;em&gt;the&lt;&#x2F;em&gt; discipline of building AI systems.&lt;&#x2F;p&gt;
&lt;p&gt;Huber comes at this from the search and retrieval side as he&#x27;s building infrastructure for agentic search. But the
principles he&#x27;s articulating extend well beyond search. I&#x27;ve been finding them just as applicable to agentic coding, and
I suspect they hold for any system where an LLM needs the right information at the right time.&lt;&#x2F;p&gt;
&lt;p&gt;I spent some time pulling together his key ideas from
a &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;podcasts.apple.com&#x2F;us&#x2F;podcast&#x2F;episode-65-the-rise-of-agentic-search&#x2F;id1610318868?i=1000741941190&quot;&gt;Vanishing Gradients episode&lt;&#x2F;a&gt;
and a few other appearances. Here&#x27;s what stuck with me.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;stop-saying-rag&quot;&gt;Stop saying RAG&lt;&#x2F;h2&gt;
&lt;p&gt;Huber refuses to use the term &quot;RAG.&quot; His argument is that it conflates three separate things (retrieval, augmentation,
and generation) into one. The term that&#x27;s becoming standard instead is &lt;strong&gt;context engineering&lt;&#x2F;strong&gt;: the discipline of
figuring out what should be in the context window for any given LLM generation step. It&#x27;s a better name because it
describes the actual job. And it gives the work the status it deserves. This isn&#x27;t prompt fiddling, it&#x27;s engineering.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;In a traditional MVC CRUD app, your business logic is encoded in controllers. In an AI app, your business logic is
encoded in context.&lt;&#x2F;p&gt;
&lt;p&gt;— Jeff Huber&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The key architectural decisions in an AI system are about what the model sees and when. This follows from the insight
that an LLM is stateless, and its output depends entirely on its input. And the performance comes from what we build
around it to support feeding it the right thing. I&#x27;m starting to think about agentic AI systems as having four key
concerns: model choice, the agentic harness, context engineering, and orchestration. But of those four, context
engineering is what we&#x27;re talking about here.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;two-loops&quot;&gt;Two loops&lt;&#x2F;h2&gt;
&lt;p&gt;Huber breaks context engineering into an inner loop and an outer loop.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;inner loop&lt;&#x2F;strong&gt; is what goes into the context window right now, for this specific generation step. You have N
candidate chunks of information and Y available slots. The job is to curate from potentially millions of candidates down
to the handful that matter for this exact moment.&lt;&#x2F;p&gt;
&lt;p&gt;The &lt;strong&gt;outer loop&lt;&#x2F;strong&gt; is how you get better at the inner loop over time. Build, test, deploy, monitor, iterate. The classic
software development cycle, applied to context quality.&lt;&#x2F;p&gt;
&lt;p&gt;This framing is useful because it separates two different kinds of work. The inner loop is the mechanics of assembling
context, including retrieval, filtering, reranking, prompt construction. The outer loop is about measurement, feedback,
and systematic improvement. It&#x27;s easy to focus almost entirely on the inner loop and barely touch the outer.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;gather-then-glean&quot;&gt;Gather, then glean&lt;&#x2F;h2&gt;
&lt;p&gt;For the inner loop, Huber describes a two-stage process:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Stage one: gather.&lt;&#x2F;strong&gt; Cast a wide net. Maximise recall. Use semantic search, keyword search, metadata filters, API
calls, conversation history. You&#x27;ll grab irrelevant things. That&#x27;s fine.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Stage two: glean.&lt;&#x2F;strong&gt; Cull the candidates to the minimal set that actually matters. Rerank using cross-encoders,
reciprocal rank fusion, or increasingly just LLMs directly. Go from a few hundred down to the 20 or so that belong in
the context window.&lt;&#x2F;p&gt;
&lt;p&gt;The two stages optimise for different things. Gather optimises for not missing anything important. Glean optimises
for not including anything distracting. You need both.&lt;&#x2F;p&gt;
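&lt;p&gt;The two stages can be sketched with a toy corpus and a crude term-overlap score standing in for real semantic search and rerankers (the documents, scoring, and slot count are all illustrative):&lt;&#x2F;p&gt;

```javascript
// Toy gather-then-glean pipeline over a made-up corpus.
const corpus = [
  { id: 1, text: "How to reset a forgotten password" },
  { id: 2, text: "Password hashing with bcrypt" },
  { id: 3, text: "Quarterly sales report" },
  { id: 4, text: "Rotating compromised API passwords" },
];

const terms = (q) => q.toLowerCase().split(/\s+/);

// Stage one: gather. Cast a wide net and maximise recall by keeping
// any document that shares at least one term with the query.
function gather(query, docs) {
  const qs = terms(query);
  return docs.filter((d) =>
    qs.some((t) => d.text.toLowerCase().includes(t))
  );
}

// Stage two: glean. Rerank by term overlap and keep only the top k,
// since context-window slots are the scarce resource.
function glean(query, candidates, k) {
  const qs = terms(query);
  const score = (d) =>
    qs.filter((t) => d.text.toLowerCase().includes(t)).length;
  return candidates
    .map((d) => ({ doc: d, s: score(d) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((x) => x.doc);
}

const picked = glean("password reset", gather("password reset", corpus), 2);
```

&lt;p&gt;The useful property of the shape is that gather can over-collect freely, because glean is the only thing deciding what spends a context slot.&lt;&#x2F;p&gt;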
&lt;p&gt;Huber&#x27;s framing here is search-specific, but the underlying problem applies everywhere. It&#x27;s about context assembly and
selecting the right parts from a larger pool. For agentic coding, I&#x27;m still doing this fairly manually as I learn what
works. It&#x27;s something I&#x27;m actively working on improving and automating.&lt;&#x2F;p&gt;
&lt;p&gt;Huber also makes an important point here that the most dangerous information isn&#x27;t the obviously irrelevant stuff. It&#x27;s
the information that &lt;em&gt;looks&lt;&#x2F;em&gt; relevant but isn&#x27;t, for some subtle reason. That&#x27;s what causes the model to confidently go
down the wrong path. Tight gleaning protects against this.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-outer-loop-is-key&quot;&gt;The outer loop is key&lt;&#x2F;h2&gt;
&lt;p&gt;The outer loop is where the real leverage is. You observe what your system actually does, compare it to what it
should have done, and feed that back into how you build context next time. Without this, every change is a guess. With
it, you&#x27;re doing engineering.&lt;&#x2F;p&gt;
&lt;p&gt;Huber&#x27;s version of this, coming from search, is the &lt;strong&gt;golden dataset&lt;&#x2F;strong&gt;. He recommends a spreadsheet of query-information
pairs that define what your system should retrieve for given inputs. His advice for creating one is disarmingly simple:
get your team together for an evening, buy some pizzas, spend a few hours writing pairs for every use case you can think
of. Then improve it over time by studying what users actually query, analysing what succeeded and what failed, and
wiring the results into CI.&lt;&#x2F;p&gt;
&lt;p&gt;For agentic coding, I&#x27;m finding the outer loop looks different but follows the same shape. It&#x27;s about studying where the
agent followed the plan and where it diverged, what context was missing when it made a bad decision, what assumptions it
hallucinated because the right information wasn&#x27;t in the window. Each of those failure cases becomes a lesson that feeds
back into how I structure research, write plans, and assemble context for the next session.
The &lt;a href=&quot;&#x2F;blog&#x2F;ai-engineer&#x2F;&quot;&gt;research-plan-implement cycle&lt;&#x2F;a&gt; I described previously is really an inner loop. The outer loop
is how that cycle gets refined through experience.&lt;&#x2F;p&gt;
&lt;p&gt;The underlying principle is the same regardless of domain: you need a way to measure whether your context engineering is
actually getting better. Huber calls the gap between demo and production &quot;alchemy.&quot; The outer loop is what turns it into
engineering.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;keeping-context-under-control&quot;&gt;Keeping context under control&lt;&#x2F;h2&gt;
&lt;p&gt;Agentic workflows pile up tokens through multi-step interactions. You need strategies for keeping context windows clean.
In my experience, there are two: &lt;strong&gt;summarise&lt;&#x2F;strong&gt; and &lt;strong&gt;delegate&lt;&#x2F;strong&gt;. They look similar but work at different points.&lt;&#x2F;p&gt;
&lt;p&gt;Summarising deals with context that&#x27;s already accumulated. As a conversation grows, you extract what matters and discard
the rest. This is what Dex called &lt;a href=&quot;&#x2F;blog&#x2F;ai-engineer&#x2F;&quot;&gt;intentional compaction&lt;&#x2F;a&gt;. The research-plan-implement cycle I&#x27;ve
written about is essentially this. Each phase produces a compressed artefact that replaces the sprawl of the previous
phase. It&#x27;s reactive. When the context has grown, you compact it.&lt;&#x2F;p&gt;
&lt;p&gt;Delegating prevents the tokens from entering the main context in the first place. You hand work to a sub-agent that
operates in its own isolated context window. It does the messy, token-heavy exploration, and only a concise result
crosses back into the parent. Huber frames this as encapsulation, borrowing from software engineering, and I think
that&#x27;s exactly right. The same principle as keeping functions small and interfaces narrow, applied to context windows.
The sprawl never reaches the main agent at all.&lt;&#x2F;p&gt;
&lt;p&gt;I use both. Sub-agents explore different parts of a codebase in parallel, each in a fresh context. Only their compressed
summaries come back. And within a conversation, I compact between phases rather than letting history accumulate.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;scaffolding-has-a-shelf-life&quot;&gt;Scaffolding has a shelf life&lt;&#x2F;h2&gt;
&lt;p&gt;Huber makes a strong argument that the scaffolding around LLMs should get &lt;em&gt;simpler&lt;&#x2F;em&gt; as models improve, not more complex.
Teams that build elaborate workarounds for model weaknesses end up maintaining dead weight when the next model doesn&#x27;t
have those weaknesses. He points out that Manus has been re-architected five times since March 2024. Anthropic regularly
strips out Claude Code&#x27;s agent scaffolding as models get more capable.&lt;&#x2F;p&gt;
&lt;p&gt;I can relate to this directly. A few years ago at Peppy, I wasn&#x27;t building the RAG system itself, but I was building
components around it and could see what was going on. There was a lot of scaffolding in place to compensate for model
limitations. Looking back, much of that could be dramatically simplified now. I&#x27;ve always aimed to build things out of
smaller, replaceable parts. I haven&#x27;t always managed to achieve that in practice. But that instinct serves you well
here. If you expect the scaffolding to have a shelf life, composability isn&#x27;t just good engineering, it&#x27;s mandatory.&lt;&#x2F;p&gt;
&lt;p&gt;Even as models improve, though, I don&#x27;t really want the model generating information. I want it synthesising from
what&#x27;s been provided. Which brings it right back to context engineering: make sure the right information is in the
window.&lt;&#x2F;p&gt;
&lt;p&gt;Huber also argues that the cost of rebuilding is dramatically lower now, so teams should lean into impermanence. I&#x27;ve
done some experiments with natural language specs and rebuilding parts of systems, and I can see the direction of
travel. But I still think we&#x27;re in the early days of learning how building with these tools actually works. I don&#x27;t want
to claim more confidence than I have on that one.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-i-m-taking-from-this&quot;&gt;What I&#x27;m taking from this&lt;&#x2F;h2&gt;
&lt;p&gt;These are the main insights I&#x27;m taking from Huber that are influencing my own work now:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Name the primitives.&lt;&#x2F;strong&gt; Don&#x27;t say &quot;RAG.&quot; Be explicit about the components that make up context engineering. Retrieval,
filtering, reranking, context assembly, evaluation are separate concerns you can reason about, measure, and improve
independently.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Close the outer loop.&lt;&#x2F;strong&gt; Find a way to measure context quality over time. &quot;Does this feel better?&quot; isn&#x27;t good enough.
Instrumentation matters, and so does evaluation against known data.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Respect context rot.&lt;&#x2F;strong&gt; I was already doing this for coding, but it applies to every AI system. Tight, structured
contexts beat maximal windows. Always.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Embrace the rebuild.&lt;&#x2F;strong&gt; Stop trying to build permanent AI infrastructure. Build for the current model generation, keep
things simple enough to rip out, and accept that the next model might change everything.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Start simple, stay simple.&lt;&#x2F;strong&gt; Exhaust prompt engineering and basic workflows before reaching for agents and complex
retrieval. The premature complexity trap is real, and it&#x27;s expensive.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;There&#x27;s a lot more in
the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;podcasts.apple.com&#x2F;us&#x2F;podcast&#x2F;episode-65-the-rise-of-agentic-search&#x2F;id1610318868?i=1000741941190&quot;&gt;full episode&lt;&#x2F;a&gt;.
Huber goes deep on hybrid search tradeoffs, evaluation practices, and the demo-to-production gap. Worth the listen if
you&#x27;re building anything that puts information in front of an LLM.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m curious whether others are finding the same things. Is context engineering the frame you&#x27;re using, or something
different? &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;. I&#x27;d love to hear what&#x27;s working for you.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;Sources&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;podcasts.apple.com&#x2F;us&#x2F;podcast&#x2F;episode-65-the-rise-of-agentic-search&#x2F;id1610318868?i=1000741941190&quot;&gt;Vanishing Gradients Ep. 65: The Rise of Agentic Search&lt;&#x2F;a&gt; (
Jeff Huber with Hugo Bowne-Anderson)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;Context Rot: How Increasing Input Tokens Impacts LLM Performance&lt;&#x2F;a&gt; (Chroma
Research)&lt;&#x2F;li&gt;
&lt;li&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.latent.space&#x2F;p&#x2F;chroma&quot;&gt;Latent Space: RAG is Dead, Context Engineering is King&lt;&#x2F;a&gt; (
Jeff Huber)&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Thinking in Plans, Not Code</title>
        <published>2026-02-12T00:00:00+00:00</published>
        <updated>2026-02-12T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/thinking-in-plans-not-code/"/>
        <id>https://daz.is/blog/thinking-in-plans-not-code/</id>
        
        <content type="html" xml:base="https://daz.is/blog/thinking-in-plans-not-code/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Update — 2026-03-01&lt;&#x2F;span&gt;
  &lt;p&gt;This post has been superseded by &lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;How I Work with AI Coding Agents&lt;&#x2F;a&gt;. I&#x27;ve kept
it here rather than archiving it because I think it&#x27;s interesting to show how my thinking changed as I developed my
working processes. If you&#x27;re just after my latest compilation of how I&#x27;m working, you might want to check that more
recent post instead.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;thinking-in-code&quot;&gt;Thinking in Code&lt;&#x2F;h2&gt;
&lt;p&gt;The thing I realise with AI-assisted coding is just how quickly I would previously have jumped into writing code. That&#x27;s
how I would have naturally thought about and explored problems. Open the editor, start sketching something out, let the
shape of the solution emerge through the act of building it.&lt;&#x2F;p&gt;
&lt;p&gt;With AI coding, I realise we have far more leverage at the research and planning phases than we do at implementation.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m having to train myself to spend more time planning each change. It feels a bit like procrastination. But I can also
see how valuable it is.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-gap&quot;&gt;The Gap&lt;&#x2F;h2&gt;
&lt;p&gt;There&#x27;s a gap between high-level planning and implementation. In my experience, that gap used to be bridged inside the
developer&#x27;s head. You&#x27;d read the requirements, form a mental model, and start coding. The translation from &quot;what needs
to happen&quot; to &quot;how it happens in code&quot; was implicit, happening almost unconsciously as you typed.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;thinking-in-plans&quot;&gt;Thinking in Plans&lt;&#x2F;h2&gt;
&lt;p&gt;What works now is different. It&#x27;s a progressive refinement: requirements, to plan, to detailed plan, to even more
detailed plan, to &lt;em&gt;maybe this plan is finally detailed enough&lt;&#x2F;em&gt;, to let&#x27;s go implement. Each layer adds specificity and
reduces ambiguity before the AI ever writes a line of code.&lt;&#x2F;p&gt;
&lt;p&gt;This is new territory for people who think in code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;not-big-design-up-front&quot;&gt;Not Big Design Up Front&lt;&#x2F;h2&gt;
&lt;p&gt;I know what this sounds like. But it&#x27;s not Big Design Up Front. BDUF happens over weeks or months, tries to anticipate
everything, and produces documents that are outdated before implementation begins.&lt;&#x2F;p&gt;
&lt;p&gt;What I&#x27;m describing is a continuous refinement within a single flow of work. For a substantial build, that planning
phase might be a couple of days working with the LLM in different personas to stress-test requirements for security,
performance, implementability, consistency, compliance. Then refining from requirements to high-level plan, and down
through multiple levels of increasingly concrete detail. Implementation then happens across multiple sessions, working
through the detailed plans and checking the code at each point.&lt;&#x2F;p&gt;
&lt;p&gt;A few days to plan and build a system that would have taken weeks before. That&#x27;s the difference.&lt;&#x2F;p&gt;
&lt;p&gt;You&#x27;re taking the next piece of work and progressively adding detail until execution becomes so obvious that the AI
can&#x27;t really get it wrong.&lt;&#x2F;p&gt;
&lt;p&gt;And there&#x27;s a new skill emerging here that I don&#x27;t think has a name yet: developing intuition for the right size for a
piece of work for an AI to build in one go, and for the level of detail needed to make execution almost inevitable. Too
vague and the AI makes bad assumptions. Too large and it loses coherence. Get the granularity and specificity right, and
the code practically writes itself. And the quality is higher.&lt;&#x2F;p&gt;
&lt;p&gt;That intuition is something you can only build through experience. Nobody&#x27;s teaching it. We&#x27;re all just stumbling into
it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;feasibility-risk-isn-t-dead&quot;&gt;Feasibility Risk Isn&#x27;t Dead&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.svpg.com&#x2F;four-big-risks&#x2F;&quot;&gt;Marty Cagan&lt;&#x2F;a&gt; calls out four types of risks in software development:&lt;&#x2F;p&gt;
&lt;ol&gt;
&lt;li&gt;value risk (whether customers will buy it or users will choose to use it)&lt;&#x2F;li&gt;
&lt;li&gt;usability risk (whether users can figure out how to use it)&lt;&#x2F;li&gt;
&lt;li&gt;feasibility risk (whether our engineers can build what we need with the time, skills, and technology we have)&lt;&#x2F;li&gt;
&lt;li&gt;business viability risk (whether this solution also works for the various aspects of our business)&lt;&#x2F;li&gt;
&lt;&#x2F;ol&gt;
&lt;p&gt;There&#x27;s a position gaining traction in product circles that feasibility risk (which used to be one of the biggest risks
in product development) is now irrelevant. That value risk is what matters most.&lt;&#x2F;p&gt;
&lt;p&gt;AI development has made many more things viable from an implementation perspective. There are things you can build now
that would have been impractical two years ago.&lt;&#x2F;p&gt;
&lt;p&gt;But I&#x27;m pretty convinced that feasibility risk is still a factor. I&#x27;m happy to be wrong about this, but unless you&#x27;re
guiding the AI from an engineering and developer point of view, you&#x27;re going to end up with an unmaintainable, expensive
mess. The AI can produce working code quickly. But working code and code that&#x27;s maintainable, performant, secure, and
fits coherently into an existing system are very different things.&lt;&#x2F;p&gt;
&lt;p&gt;The feasibility risk hasn&#x27;t disappeared. It&#x27;s shifted. It used to be &quot;can we build this?&quot; Now it&#x27;s &quot;can we plan this so
it gets built &lt;em&gt;well&lt;&#x2F;em&gt;?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;And that still requires someone who thinks like an engineer.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m getting good results with this approach, but I have a feeling I may be erring on the side of caution with overly
detailed plans. I know vibe coders would dismiss a lot of this. Where are you at with this? &lt;a href=&quot;&#x2F;contact&quot;&gt;Drop me a line&lt;&#x2F;a&gt;
if you want to discuss.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Maybe All Intelligence is Artificial</title>
        <published>2026-02-11T00:00:00+00:00</published>
        <updated>2026-02-11T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/all-intelligence-is-artificial/"/>
        <id>https://daz.is/blog/all-intelligence-is-artificial/</id>
        
        <content type="html" xml:base="https://daz.is/blog/all-intelligence-is-artificial/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Warning&lt;&#x2F;span&gt;
  &lt;p&gt;This post is a little different to my usual technical blog posts. I asked Claude to review this post,
and this is what it said:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;It doesn&#x27;t survive close scrutiny as an argument because it relies on loaded definitions, unexamined
metaphysics, and a narrative so tidy it papers over the messiness of actual history and biology.&quot;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;You have been warned.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;p&gt;This isn&#x27;t my usual territory. I spend most of my time building things with code, not writing about fungal networks and
Mesopotamian irrigation. But during a quiet moment in nature, an aphorism surfaced: &quot;maybe all intelligence is
artificial?&quot;&lt;&#x2F;p&gt;
&lt;p&gt;I sat with it for a long while. Slowly, the whole trajectory of human civilisation started to look like a single,
accelerating story of separation from source, driven by &quot;intelligence&quot;.&lt;&#x2F;p&gt;
&lt;p&gt;I don&#x27;t have this fully worked out. But let me try and explain...&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Intelligence manipulates. It abstracts, optimises, and solves. It builds tools, constructs models, and generates
language. It lets a mathematician write a proof, an octopus unscrew a jar from the inside, or a machine produce a
coherent paragraph. Intelligence is impressive, and it&#x27;s useful, but it&#x27;s always &lt;em&gt;doing&lt;&#x2F;em&gt; something. It operates on the
world.&lt;&#x2F;p&gt;
&lt;p&gt;Intelligence doesn&#x27;t have to mean disconnection. An ape is intelligent, as are many animals, but it&#x27;s rooted. It
participates in the ecology it acts on. For this discussion, what I&#x27;m calling artificial isn&#x27;t intelligence itself, but
intelligence that has become disconnected.&lt;&#x2F;p&gt;
&lt;p&gt;Wisdom is different. What I mean by wisdom here is not good judgment or the accumulation of experience. I mean something
older and less personal, a kind of knowing that doesn&#x27;t separate itself from what it knows. Wisdom doesn&#x27;t operate. It
participates. In traditions that recognise a universal consciousness, wisdom is the capacity to be in connection with
all living things. It&#x27;s not constructed. It&#x27;s received, and ancient.&lt;&#x2F;p&gt;
&lt;p&gt;Consider a forest. Beneath the soil, fungal networks connect the roots of trees across vast distances, distributing
nutrients from the strong to the struggling, mediating the boundary between life and death. It&#x27;s tempting to call this
intelligence. But there is a difference. The fungal network has no model of the forest. It doesn&#x27;t stand apart from the
system it serves. It&#x27;s the forest&#x27;s connective tissue. This is not intelligence. This is wisdom.&lt;&#x2F;p&gt;
&lt;p&gt;I realise I&#x27;m using these words in a slightly unusual way. Wisdom usually means something like good judgment born from
experience, and intelligence can be rooted. But I need handles for these two very different modes of knowing, and these
are the closest words I have. Bear with me.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;When action stays connected to its source, it builds within ecology. It builds the way a beaver builds a dam or a coral
builds a reef. It participates in the living system that feeds back into the whole. The dam becomes habitat. The reef
becomes an ecosystem. The construction does not stand apart from nature. It&#x27;s nature building itself. It&#x27;s life
perpetuating itself through action that remains in relationship with source.&lt;&#x2F;p&gt;
&lt;p&gt;Even early human construction had this quality. Vernacular architecture built from local materials that would return to
the soil. Indigenous land management that used fire, rest, and rotation to increase the vitality of ecosystems rather
than extract from them. Traditional agriculture that worked within the rhythms of living systems rather than overriding
them. This was intelligence still tethered to wisdom.&lt;&#x2F;p&gt;
&lt;p&gt;The separation happens gradually. It starts when intelligence begins to build things that no longer participate in the
living systems they depend on.&lt;&#x2F;p&gt;
&lt;p&gt;Agriculture scales up and becomes monoculture. Irrigation feeds civilisations but salts the soil beneath them. Cities
emerge as environments constructed entirely by intelligence, abstracted from the ecology that sustains them. Economies
develop that treat ecosystems as inputs to be optimised. At each stage, intelligence creates results that move it
further from source.&lt;&#x2F;p&gt;
&lt;p&gt;This is not new. It&#x27;s a trajectory as old as civilisation itself. Mesopotamian irrigation systems fed the first great
civilisations but left behind salt-crusted earth that hasn&#x27;t recovered in four thousand years. The land that gave us
writing, mathematics, and agriculture is now desert. Intelligence built something extraordinary there, and what it built
destroyed what it was built on.&lt;&#x2F;p&gt;
&lt;p&gt;The deforestation of the Mediterranean basin. The drainage of wetlands, the enclosure of commons, the industrial
conversion of landscapes into machinery. Each era builds further from ecology, and each era&#x27;s intelligence is more
sophisticated and more severed from source.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;The attempt to live according to the notion that the fragments are really separate is, in essence, what has
led to the growing series of extremely urgent crises that is confronting us today.&lt;&#x2F;p&gt;
&lt;p&gt;-- David Bohm, Theoretical Physicist&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;What accelerates is not just the power of intelligence but the depth of its disconnection. Early agriculture was
intelligence one step removed from the source. Industrial manufacturing was several steps removed. A global financial
system that algorithmically trades futures on crop yields while the soil those crops grow in erodes? That is
intelligence so far from the source that it can destroy itself without noticing.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;But the connection to source hasn&#x27;t been entirely severed. It&#x27;s been marginalised.&lt;&#x2F;p&gt;
&lt;p&gt;Permaculture designs food systems by observing how ecosystems actually work. Not by imposing intelligence onto land, but
learning from the land&#x27;s own patterns of renewal. Indigenous ecological traditions, many of them thousands of years old
and still practised, manage landscapes through relationship rather than extraction. They don&#x27;t treat the living world as
a problem to be solved. They participate in it. And meditation, prayer, deep sustained attention to the natural world
are all practices of reconnecting intelligence to source. Of slowing down enough to receive what cannot be computed.&lt;&#x2F;p&gt;
&lt;p&gt;These are not relics. They are living proof that intelligence can remain in relationship with wisdom. That the
trajectory of disconnection, however old and however powerful, is not inevitable. The path back exists.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;Artificial intelligence is just the latest stage of this trajectory. It&#x27;s not a break from the pattern. It&#x27;s the
pattern&#x27;s culmination.&lt;&#x2F;p&gt;
&lt;p&gt;The difference is that AI is intelligence with no connection to source. Human intelligence retains at least the
possibility of reconnection to source. A person can be intelligent &lt;em&gt;and&lt;&#x2F;em&gt; wise. This is what contemplative traditions
have always been about. Quieting the mind so it can receive what it cannot construct.&lt;&#x2F;p&gt;
&lt;p&gt;AI has no such possibility. A large language model can produce text that mimics insight, arrange words in patterns that
resemble understanding, but it does so without any contact with universal consciousness, without participation in the
living fabric that sustains and connects all things. It isn&#x27;t intelligence that has lost its connection to source. It is
intelligence that never had one.&lt;&#x2F;p&gt;
&lt;p&gt;And it&#x27;s fast. Intelligence disconnected from source was already dangerous when it moved at the speed of human thought.
Wisdom requires patience. AI has no such constraint. It moves at the speed of computation, making decisions that affect
living systems at a pace that leaves no room for wisdom.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;The environmental crisis and the crisis of artificial intelligence are not two separate problems. They are the
culmination of the same trajectory.&lt;&#x2F;p&gt;
&lt;p&gt;When intelligence separated from ecology, it built civilisations that could not sustain themselves without degrading the
living systems they depended on. When intelligence separated further, it built industrial economies that accelerated
that degradation to a planetary scale. Now, intelligence has separated so completely that it&#x27;s building new forms of
itself. Forms of intelligence that have no memory of the source, no relationship to the living world, and no capacity
for wisdom.&lt;&#x2F;p&gt;
&lt;p&gt;The acceleration is not merely technological. We are building disconnected minds and entrusting them with decisions that
affect the fabric of biological existence on Earth. We are building minds without understanding what a mind is for.&lt;&#x2F;p&gt;
&lt;p&gt;The most important question we can ask about any intelligence, biological or digital, is not how powerful it is, but
whether it has any relationship to source. And, through artificial intelligence, we are about to find out what it looks
like when the answer is no.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;em&gt;If any of this resonates, or if you think I&#x27;ve got it completely wrong, then I&#x27;d genuinely love to hear from you.&lt;&#x2F;em&gt;&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>AI Engineer or Sloperator?</title>
        <published>2026-02-04T00:00:00+00:00</published>
        <updated>2026-02-04T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/ai-engineer/"/>
        <id>https://daz.is/blog/ai-engineer/</id>
        
        <content type="html" xml:base="https://daz.is/blog/ai-engineer/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Update — 2026-03-01&lt;&#x2F;span&gt;
  &lt;p&gt;This post has been superseded by &lt;a href=&quot;&#x2F;blog&#x2F;how-i-work-with-ai-coding-agents&#x2F;&quot;&gt;How I Work with AI Coding Agents&lt;&#x2F;a&gt;. I&#x27;ve kept
it here rather than archiving it because I think it&#x27;s interesting to show how my thinking changed as I developed my
working processes. If you&#x27;re just after the latest summary of how I&#x27;m working, you might want to read that more
recent post instead.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;p&gt;Last year I was using AI Chat and Copilot but hadn&#x27;t gone all in on coding agents yet. I was seeing AI slop everywhere.
But in Dec 2025 everything changed and I &lt;a href=&quot;&#x2F;blog&#x2F;rethinking-ai&#x2F;&quot;&gt;reevaluated&lt;&#x2F;a&gt;
my whole approach.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;When the facts change, I change my mind.&quot;
-- &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;quoteinvestigator.com&#x2F;2011&#x2F;07&#x2F;22&#x2F;keynes-change-mind&#x2F;&quot;&gt;John Maynard Keynes&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;The facts changed. So did I.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-paradox&quot;&gt;The paradox&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;ai-engineer&#x2F;img_5.png&quot; alt=&quot;img_5.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;I looked at the research and found conflicting data.&lt;&#x2F;p&gt;
&lt;p&gt;Controlled studies consistently show 20–30% individual coding speed improvements [1]. But research also shows that 45%
of AI-generated code contains security vulnerabilities [2], AI code has a 41% higher churn rate, revised or deleted
within two weeks [3], and in the &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;survey.stackoverflow.co&#x2F;2025&#x2F;&quot;&gt;2025 Stack Overflow Developer Survey&lt;&#x2F;a&gt;, 66% of
developers said they suffered a productivity overhead from not-quite-right AI code.&lt;&#x2F;p&gt;
&lt;p&gt;You&#x27;re faster, but the output quality creates drag that can eat those gains and then some.&lt;&#x2F;p&gt;
&lt;p&gt;The question isn&#x27;t whether AI coding tools are useful. They clearly are. The question is whether you end up as an AI
engineer or a &lt;em&gt;sloperator&lt;&#x2F;em&gt;. You are producing more code, faster, but is most of it slop?&lt;&#x2F;p&gt;
&lt;p&gt;For greenfield projects, simple standalone apps, small, well-defined scopes, it&#x27;s much easier to get good results from
AI. In a few hours you can ship what would have taken days before. But for complex tasks in 10-year-old legacy codebases
with intricate dependencies and undocumented conventions, that&#x27;s where the slop factory kicks in.&lt;&#x2F;p&gt;
&lt;p&gt;The models are getting better, and learning the right techniques makes the difference.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-fundamental-constraint&quot;&gt;The fundamental constraint&lt;&#x2F;h2&gt;
&lt;p&gt;The insight that underpins everything else: &lt;strong&gt;LLMs are stateless&lt;&#x2F;strong&gt;. They have no memory between requests. The only thing
they have to work with is the context you give them.&lt;&#x2F;p&gt;
&lt;p&gt;Context is everything. Output quality is directly bounded by context quality.&lt;&#x2F;p&gt;
&lt;p&gt;I think about context quality across four dimensions, a framing from Dex&#x27;s &quot;No Vibes Allowed&quot; talk [5] that crystallised
much of what I&#x27;d been stumbling towards. I&#x27;ve mixed in my own experience and pulled from other sources [6][7][8], but
Dex&#x27;s framework is the backbone of this post.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Correctness&lt;&#x2F;strong&gt;: is everything in the context actually accurate? One wrong assumption about how the auth system works
and everything downstream is built on sand.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Completeness&lt;&#x2F;strong&gt;: is anything important missing? If the model doesn&#x27;t know about a critical constraint, it can&#x27;t account
for it.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Size&lt;&#x2F;strong&gt;: is the context all signal with minimal noise? This one is counterintuitive, and it&#x27;s the most important.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Trajectory&lt;&#x2F;strong&gt;: does the shape and flow of the conversation help the model reason well? A meandering back-and-forth
produces worse results than a clean, focused prompt.&lt;&#x2F;p&gt;
&lt;p&gt;Get all four right and you get great output. Any one of them off and you get slop.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;context-rot-and-the-smart-zone&quot;&gt;Context rot and the smart zone&lt;&#x2F;h2&gt;
&lt;blockquote&gt;
&lt;p&gt;As you use more tokens the model can pay attention to less and can reason less effectively&lt;&#x2F;p&gt;
&lt;p&gt;— Jeff Huber (Chroma)&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;At first it might seem counterintuitive that more context usually means worse output.&lt;&#x2F;p&gt;
&lt;p&gt;As you fill up the context window with more tokens, the model&#x27;s ability to pay attention to all of it decreases. Its
reasoning quality degrades. I&#x27;ve seen this called &lt;em&gt;context rot&lt;&#x2F;em&gt; [6]. Performance peaks when the context is focused and
clean, then drops off.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;ai-engineer&#x2F;img.png&quot; alt=&quot;img.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;After about 40% context window utilisation, you&#x27;re in diminishing returns territory. Some call this the &quot;dumb zone&quot; [7].&lt;&#x2F;p&gt;
&lt;p&gt;This explains so much of the AI slop problem. People stuff context windows full, thinking more information means better
results. The opposite is true.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s a DJ analogy [8]: &quot;if you&#x27;re redlining, you ain&#x27;t headlining.&quot; In audio engineering, redlining means pushing
your levels past the maximum. The signal clips, distorts, sounds terrible. The pros keep headroom. They stay within the
limits. That&#x27;s where the clean sound is.&lt;&#x2F;p&gt;
&lt;p&gt;Same with LLMs. Stay in the smart zone. Keep headroom.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;ai-engineer&#x2F;img_2.png&quot; alt=&quot;img_2.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-solution-research-plan-implement&quot;&gt;The solution: Research, Plan, Implement&lt;&#x2F;h2&gt;
&lt;p&gt;If cramming context is the problem, intentional compaction is the solution. And the shape of that solution will look
familiar to anyone who&#x27;s been engineering for a while: research first, plan second, build third. That&#x27;s not a new idea.
What&#x27;s new is why it matters so much more with AI. When a human developer skips the planning phase, they still carry
implicit context in their head. When an AI agent skips it, it has nothing. The model only knows what&#x27;s in the context
window. If something isn&#x27;t in the context window, the model falls back on its training data, and that&#x27;s where
hallucinations start to creep in.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;img src=&quot;https:&#x2F;&#x2F;daz.is&#x2F;blog&#x2F;ai-engineer&#x2F;img_3.png&quot; alt=&quot;img_3.png&quot; &#x2F;&gt;&lt;&#x2F;p&gt;
&lt;p&gt;The framework I&#x27;m using has three main phases, each in a separate conversation with a fresh context window. The output
of each phase is a compressed artefact that becomes the input for the next.&lt;&#x2F;p&gt;
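&lt;p&gt;One way to picture the flow of artefacts between the phases (the file names here are illustrative, not something the tooling requires):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;Phase 1: Research    in: the codebase            out: research.md
Phase 2: Plan        in: research.md             out: plan.md
Phase 3: Implement   in: plan.md + target files  out: tested code changes

Each phase runs in a fresh conversation. Only the compressed
artefact crosses the boundary; the noise stays behind.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;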
&lt;h3 id=&quot;phase-1-research&quot;&gt;Phase 1: Research&lt;&#x2F;h3&gt;
&lt;p&gt;Start with high context: lots of code, lots of files. Explore the codebase. Navigate the file structure, read key
modules, trace data flows. Identify patterns: coding conventions, architectural decisions, existing abstractions. Map
dependencies: what touches what, where the integration points are.&lt;&#x2F;p&gt;
&lt;p&gt;The output is a compressed markdown summary. Not a raw dump of files. A focused, curated document that captures what
matters. AI subagents are excellent at this. You can spin them up to explore different parts of the codebase in parallel
and consolidate the results.&lt;&#x2F;p&gt;
&lt;p&gt;This is the highest-leverage phase. A hallucinated assumption about how your authentication works isn&#x27;t a code-level
error. It&#x27;s a research-level error. Everything built on top of it will be wrong.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;phase-2-plan&quot;&gt;Phase 2: Plan&lt;&#x2F;h3&gt;
&lt;p&gt;Take the compressed research and produce an execution blueprint. Every step numbered, sequential, unambiguous. Include
explicit test criteria: how to verify each step works. Include actual code snippets from the existing codebase to anchor
the implementation to real patterns. Think through edge cases and risks.&lt;&#x2F;p&gt;
&lt;p&gt;The goal: a plan so detailed that the dumbest model in the world won&#x27;t screw it up.&lt;&#x2F;p&gt;
&lt;p&gt;One bad step in the plan can produce a hundred lines of wrong code. Review plans with the same rigour you review code.
Maybe more.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;phase-3-implement&quot;&gt;Phase 3: Implement&lt;&#x2F;h3&gt;
&lt;p&gt;This should be the simplest phase. If research and planning are done well, implementation becomes almost mechanical.&lt;&#x2F;p&gt;
&lt;p&gt;Feed the AI only the plan and the specific files it needs to modify. Phase large tasks into chunks, each with a fresh
context window. Test after each step. Build intuition for task size versus context consumption.&lt;&#x2F;p&gt;
&lt;p&gt;Don&#x27;t dump the entire codebase. Don&#x27;t let one conversation run forever. Don&#x27;t skip testing. Don&#x27;t assume more
information means better output.&lt;&#x2F;p&gt;
&lt;p&gt;The pattern across all three phases: context goes down at each stage while specificity goes up.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-hierarchy-of-leverage&quot;&gt;The hierarchy of leverage&lt;&#x2F;h2&gt;
&lt;p&gt;Not all errors are created equal [5].&lt;&#x2F;p&gt;
&lt;p&gt;A bad line of code is a bad line of code. You&#x27;ll probably catch it in review. A bad step in a plan could produce a
hundred lines of wrong code before anyone notices. A fundamental misunderstanding of how the system works, a
research-level error, means your entire feature is built on a wrong assumption.&lt;&#x2F;p&gt;
&lt;p&gt;Don&#x27;t just review code. Review plans. Review research. Catch errors before they multiply.&lt;&#x2F;p&gt;
&lt;p&gt;You can use the AI itself as a reviewer, but know where it&#x27;s reliable. At the code level, it&#x27;s excellent: syntax errors,
logic bugs, missing edge cases. At the plan level, it&#x27;s moderately useful. It can spot gaps and inconsistencies but
still needs human judgement. At the research level, it&#x27;s less reliable because it requires the kind of deep system
understanding the model may not have.&lt;&#x2F;p&gt;
&lt;p&gt;Human review is non-negotiable at the research and plan level. AI review amplifies your coverage at the code level.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;don-t-let-it-guess&quot;&gt;Don&#x27;t let it guess&lt;&#x2F;h2&gt;
&lt;p&gt;The default behaviour of most models is to be helpful. When they encounter ambiguity, they make a plausible-sounding
decision and keep going.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s the most dangerous failure mode.&lt;&#x2F;p&gt;
&lt;p&gt;A compiler error is obvious. A failed test is obvious. A hallucinated line of code, you&#x27;ll probably catch it in review.
But a confidently wrong architectural choice buries itself in your codebase and surfaces weeks later. A hallucinated
assumption about how your auth system works poisons everything downstream.&lt;&#x2F;p&gt;
&lt;p&gt;The fix: force the model to ask rather than guess. In every prompt, explicitly instruct it to only use the provided
context and ask for clarification when anything is unclear. Use an AGENTS.md or CLAUDE.md file to set interaction-style
rules that get included automatically in every prompt. Set it once, applies everywhere.&lt;&#x2F;p&gt;
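&lt;p&gt;As a sketch of what those interaction-style rules can look like (the wording here is mine, so adapt it to your project), an AGENTS.md excerpt might read:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;## Interaction rules

- Use only the context provided in this conversation.
- If a requirement, API, or convention is unclear, stop and ask a
  clarifying question instead of guessing.
- Never invent file paths, function names, or configuration keys.
- List any assumptions you make at the top of your response.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;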
&lt;p&gt;Yes, sometimes this means the AI agent asks too many questions. I&#x27;d rather it ask &quot;does this service use JWT or session
tokens?&quot; than confidently guess wrong and build an entire feature on a bad assumption.&lt;&#x2F;p&gt;
&lt;p&gt;An interruption is cheap. A hallucinated assumption is expensive.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;configuration-as-free-performance&quot;&gt;Configuration as free performance&lt;&#x2F;h2&gt;
&lt;p&gt;A quick note on setup: research shows 10–20% improvement in output quality from getting configuration right. That&#x27;s free
performance you&#x27;re leaving on the table if you skip it.&lt;&#x2F;p&gt;
&lt;p&gt;Three areas matter. &lt;strong&gt;AGENTS.md &#x2F; CLAUDE.md&lt;&#x2F;strong&gt; defines your coding conventions, project-specific rules, and interaction
style. It&#x27;s included in every request automatically. &lt;strong&gt;MCPs&lt;&#x2F;strong&gt; (Model Context Protocol servers) are powerful
integrations, but they eat context, so be selective and disable what you&#x27;re not using in this session. &lt;strong&gt;Skills&lt;&#x2F;strong&gt;
are progressive disclosure: specialised knowledge provided only when needed, not loaded all at once.&lt;&#x2F;p&gt;
&lt;p&gt;Everything is a context budget decision. Every MCP, every file, every instruction consumes tokens from your smart zone
budget.&lt;&#x2F;p&gt;
&lt;p&gt;For structured prompts, I use six elements: &lt;strong&gt;role&lt;&#x2F;strong&gt; (sets expertise level), &lt;strong&gt;goal&lt;&#x2F;strong&gt; (defines success criteria),
&lt;strong&gt;context&lt;&#x2F;strong&gt; (constrains the solution), &lt;strong&gt;format&lt;&#x2F;strong&gt; (specifies deliverables), &lt;strong&gt;examples&lt;&#x2F;strong&gt; (anchors to your patterns), and
&lt;strong&gt;constraints&lt;&#x2F;strong&gt; (makes security and performance requirements explicit). You don&#x27;t always need all six, but for complex
work, the more explicit you are, the less the model guesses.&lt;&#x2F;p&gt;
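&lt;p&gt;To make that concrete, here&#x27;s a hypothetical prompt using all six elements (the project details, file names, and endpoint are invented for illustration):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code&gt;Role: You are a senior backend engineer working in this Django codebase.
Goal: Add rate limiting to the login endpoint; done when the new tests
  pass and the existing suite stays green.
Context: Use only the attached plan.md and the two files below.
Format: A unified diff plus a short summary of the changes.
Examples: Follow the decorator pattern used in api&#x2F;throttle.py.
Constraints: No new dependencies; limits must be configurable via
  settings.&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;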
&lt;h2 id=&quot;better-models-amplify-everything&quot;&gt;Better models amplify everything&lt;&#x2F;h2&gt;
&lt;p&gt;The models are getting better fast. Opus 4.5 was a genuine step change for coding. But a better model doesn&#x27;t fix bad
context management. It just produces more confident, more fluent slop.&lt;&#x2F;p&gt;
&lt;p&gt;These practices become &lt;em&gt;more&lt;&#x2F;em&gt; valuable as models improve because you&#x27;re amplifying a stronger base capability.&lt;&#x2F;p&gt;
&lt;p&gt;Clean context plus a great model equals extraordinary results. Noisy context plus a great model equals expensive slop.&lt;&#x2F;p&gt;
&lt;p&gt;Same principle. The hard work here is in context management, not in writing more code.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;will-this-age&quot;&gt;Will this age?&lt;&#x2F;h2&gt;
&lt;p&gt;An obvious question: context windows are getting bigger, tools are getting smarter at managing context automatically,
agents can search and index codebases on their own. Will any of this matter in a year?&lt;&#x2F;p&gt;
&lt;p&gt;Some of the specifics won&#x27;t. The 40% utilisation threshold will shift. The manual three-phase workflow will probably get
automated. The tooling around AGENTS.md and MCPs will evolve or be replaced entirely.&lt;&#x2F;p&gt;
&lt;p&gt;But I think the underlying principles hold. &quot;Be intentional about what the model knows&quot; is a constraint of attention,
not just of window size. A million-token context window doesn&#x27;t help if the model is paying equal attention to
everything and nothing is prioritised. &quot;Review at the highest leverage point&quot; is just good engineering.
&quot;Don&#x27;t let it guess&quot; is about the nature of language models, not the current generation of them.&lt;&#x2F;p&gt;
&lt;p&gt;The tools will change. The thinking won&#x27;t. Or at least, that&#x27;s my bet.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-difference&quot;&gt;The difference&lt;&#x2F;h2&gt;
&lt;p&gt;The gap between drowning in AI slop and shipping quality code comes down to four things:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Intentional context management.&lt;&#x2F;strong&gt; Understand the smart zone. Keep context clean, compressed, and focused. Less is
more.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Research, Plan, Implement.&lt;&#x2F;strong&gt; Separate your phases. Compress between each one. Fresh context windows. Specificity up,
noise down.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Human review at the highest-leverage points.&lt;&#x2F;strong&gt; Don&#x27;t just review code. Review plans and research. Catch errors before
they multiply.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Never let it guess.&lt;&#x2F;strong&gt; Force the model to ask questions. An interruption is cheap. A wrong assumption is expensive.&lt;&#x2F;p&gt;
&lt;p&gt;These aren&#x27;t complicated ideas. They&#x27;re intentional ones. And that intentionality is what separates AI engineers from
sloperators.&lt;&#x2F;p&gt;
&lt;p&gt;If you&#x27;re using AI coding tools and have found practices that work for you, or if you think I&#x27;ve got it wrong, I&#x27;d love
to hear about it. Drop me a line.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;p&gt;&lt;strong&gt;References&lt;&#x2F;strong&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[1] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;arc.dev&#x2F;talent-blog&#x2F;impact-of-ai-on-code&#x2F;&quot;&gt;The Impact of AI on Code&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[2] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;veracode.com&#x2F;blog&#x2F;ai-generated-code-security-risks&#x2F;&quot;&gt;AI-Generated Code Security Risks&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[3] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;gitclear.com&#x2F;ai_assistant_code_quality_2025_research&quot;&gt;AI Assistant Code Quality 2025 Research&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[4] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;stackoverflow.blog&#x2F;2026&#x2F;01&#x2F;02&#x2F;a-new-worst-coder-has-entered-the-chat-vibe-coding-without-code-knowledge&#x2F;&quot;&gt;A New Worst Coder Has Entered the Chat&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[5] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;www.youtube.com&#x2F;watch?v=rmvDxxNubIg&quot;&gt;No Vibes Allowed: Solving Hard Problems in Complex Codebases&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[6] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;research.trychroma.com&#x2F;context-rot&quot;&gt;Context Rot: How Increasing Input Tokens Impacts LLM Performance&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[7] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;github.com&#x2F;humanlayer&#x2F;advanced-context-engineering-for-coding-agents&#x2F;blob&#x2F;main&#x2F;ace-fca.md&quot;&gt;Getting AI to Work in Complex Codebases&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;p&gt;[8] &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;ghuntley.com&#x2F;redlining&#x2F;&quot;&gt;If You&#x27;re Redlining, You Ain&#x27;t Headlining&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Moltbot: The Bored Ape of Integration Patterns</title>
        <published>2026-01-28T00:00:00+00:00</published>
        <updated>2026-01-28T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/moltbot/"/>
        <id>https://daz.is/blog/moltbot/</id>
        
        <content type="html" xml:base="https://daz.is/blog/moltbot/">&lt;p&gt;I&#x27;m not an AI sceptic; &lt;a href=&quot;&#x2F;blog&#x2F;rethinking-ai&quot;&gt;that ship has sailed&lt;&#x2F;a&gt;. Not since realising that Opus 4.5, with strict
supervision, can produce better code than me. But Moltbot feels like the AI equivalent of primate-themed NFTs. Hand an
AI agent your API keys? Email accounts? And shell access? Then let it wake itself up via cron to &quot;do things&quot; on your
behalf. What could possibly go wrong?&lt;&#x2F;p&gt;
&lt;p&gt;Remember the Bored Apes and the whole NFT hype? A toy novelty dressed up as inevitable innovation, with &#x27;everyone will
be doing this soon&#x27; narratives obscuring real debate over whether anyone should be.&lt;&#x2F;p&gt;
&lt;p&gt;My day job is tech lead for a data and integrations team. Earlier in my career I worked with the Kendraio Foundation on
interoperability, building systems that help data flow between services in structured, reliable ways. This background
gives me a particular lens on what Moltbot is doing. That lens says: &lt;strong&gt;we&#x27;ve already solved many of these problems, and
the solutions were boring on purpose&lt;&#x2F;strong&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;When I see people giving an LLM access to their email and shell, scheduling it to wake up autonomously and &quot;handle
things&quot;, I don&#x27;t see innovation. I see the same old integration problems we&#x27;ve been solving for decades, now wrapped in
non-determinism and a security nightmare.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-hidden-and-not-so-hidden-costs&quot;&gt;The Hidden (and not so hidden) Costs&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;strong&gt;Financial costs are unpredictable.&lt;&#x2F;strong&gt; Token costs vary wildly based on input complexity. A simple task might cost
pennies; parsing a complex email thread could burn through dollars. Failed attempts consume the budget with zero output.
Debugging costs tokens because the AI has to examine its own errors. DataCamp estimates $10–150&#x2F;month depending on
usage, but that assumes things work. Multi-attempt workflows? Nobody&#x27;s budgeting for &quot;tried 47 times before success.&quot;
Traditional integration costs are predictable. These are not.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Security costs are severe.&lt;&#x2F;strong&gt; The security picture here is genuinely alarming. Researchers have found hundreds of
exposed Moltbot instances on the open internet. API keys, OAuth tokens, conversation histories: all accessible to anyone
who knows where to look. In one demonstrated attack, a researcher sent a prompt injection via email to a Moltbot
instance. The AI read the email, believed it was legitimate instructions, and forwarded the user&#x27;s last five emails to
an attacker address. It took five minutes.&lt;&#x2F;p&gt;
&lt;p&gt;The core issue isn&#x27;t implementation bugs (though there are plenty). It&#x27;s architectural. You&#x27;re handing API keys to an
unsupervised agent that processes untrusted input. Credentials are stored in plaintext JSON and Markdown files. Audit
trails become &quot;the AI decided to&quot;, which isn&#x27;t going to fly in an enterprise environment where SOC2 or ISO compliance
matters. Credential rotation becomes a nightmare when you don&#x27;t know what the AI might have done with them. One security
firm found that 22% of their enterprise customers have employees actively using Moltbot, likely without IT approval.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Operational costs compound silently.&lt;&#x2F;strong&gt; Context drift over extended runs means the AI gradually loses the thread of
what it&#x27;s supposed to be doing. Non-deterministic behaviour creates chaos in systems expecting predictability. Errors
compound without human checkpoints. You don&#x27;t discover the problem until the damage is done.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-already-works&quot;&gt;What Already Works&lt;&#x2F;h2&gt;
&lt;p&gt;Traditional integration patterns already solve deterministic workflow problems. They do it well. They&#x27;ve done it well
for years.&lt;&#x2F;p&gt;
&lt;p&gt;Structured data transformations. Predictable API orchestration. Webhook-based triggers. Scheduled data syncs. Message
queues. ETL pipelines. These aren&#x27;t exciting, but they&#x27;re deterministic, debuggable, and auditable. When something
fails, you know what failed and why. When something succeeds, you can reproduce it.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s no good reason to replace deterministic workflows with non-deterministic alternatives. &quot;The AI handles it&quot; is
not an improvement over &quot;the cron job runs this integration at 3am.&quot; The cron job will do the same thing every time. The
AI might do something different because the phrasing of an email changed or because it hallucinated a slightly different
interpretation of your intent.&lt;&#x2F;p&gt;
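&lt;p&gt;The contrast is easy to make concrete. A nightly sync as a plain shell script plus a crontab entry does exactly the same thing on every run. The steps and names below are placeholders, not a real integration:&lt;&#x2F;p&gt;

```shell
#!/bin/sh
# nightly-sync.sh -- a deterministic integration: same steps, same order,
# every run. All names here are illustrative placeholders.
set -eu

log() { echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) $1"; }

log "sync started"
# 1. fetch the previous day's export (stand-in for a real curl/psql pipeline)
# 2. load it into the warehouse
# 3. leave a success marker so failures are loud, not silent
log "sync finished"

# Installed once, runs identically every night:
#   0 3 * * * /usr/local/bin/nightly-sync.sh
```

No phrasing changes, no reinterpretation of intent: the same bytes run at 3am every night, and the log tells you exactly what happened.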
&lt;h2 id=&quot;where-ai-actually-adds-value&quot;&gt;Where AI Actually Adds Value&lt;&#x2F;h2&gt;
&lt;p&gt;This isn&#x27;t an argument against AI in integrations. It&#x27;s an argument for using AI where it actually helps.&lt;&#x2F;p&gt;
&lt;p&gt;AI enables automations that weren&#x27;t previously possible. The key is recognising what those are.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Unstructured data processing.&lt;&#x2F;strong&gt; Parsing inconsistent PDFs, emails, and documents. Extracting structured information
from variable-format vendor data. Handling inputs that don&#x27;t conform to expected schemas. Before LLMs, this required
either brittle regex hell or expensive human processing. Now there&#x27;s a middle option.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Natural language interfaces.&lt;&#x2F;strong&gt; Processing natural language inputs as workflow triggers. Intent classification for
routing. Human-friendly interaction layers where the human is genuinely in the loop. &quot;Hey, can you pull last week&#x27;s
sales data and send it to finance?&quot; is a valuable capability when a human is there to confirm the action before it
happens.&lt;&#x2F;p&gt;
&lt;p&gt;The key distinction: AI for unstructured-to-structured transformation, not for deterministic execution that traditional
tools already handle reliably.&lt;&#x2F;p&gt;
&lt;p&gt;Some use cases may benefit from non-deterministic execution layers. Genuinely novel situations where the appropriate
action isn&#x27;t predictable. But this shouldn&#x27;t be the default approach. It should be the exception, applied carefully,
with human oversight.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-pragmatic-path-forward&quot;&gt;The Pragmatic Path Forward&lt;&#x2F;h2&gt;
&lt;p&gt;Before adding AI to an integration, ask two questions:&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Does this undermine robustness?&lt;&#x2F;strong&gt; Consider the costs (financial, security, operational) against what you&#x27;re gaining.
If you&#x27;re replacing a reliable cron job with an AI agent because it&#x27;s cooler, you&#x27;re making your system worse. If you&#x27;re
adding attack surface, non-determinism, and unpredictable costs to a workflow that worked fine without them, reconsider.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;strong&gt;Does this unlock something previously impossible?&lt;&#x2F;strong&gt; Specifically: does this handle unstructured data or natural
language in ways traditional integrations can&#x27;t? If yes, there&#x27;s potentially real value. If no, you&#x27;re adding complexity
for its own sake.&lt;&#x2F;p&gt;
&lt;p&gt;If the answer is yes to the first question and no to the second, stop.&lt;&#x2F;p&gt;
&lt;p&gt;The actual opportunity here is boring. Keep using traditional integrations where they work. Add AI where it unlocks new
capabilities through unstructured data handling. Don&#x27;t replace proven patterns with fragile ones just because AI is
available.&lt;&#x2F;p&gt;
&lt;p&gt;We don&#x27;t need to choose between &quot;AI for everything&quot; and &quot;AI for nothing.&quot; We need engineering judgment about where it
actually improves outcomes.&lt;&#x2F;p&gt;
&lt;p&gt;The Moltbot excitement feels like people rediscovering that automation is useful and then choosing the least reliable
form of automation available. Yes, you can give an LLM access to your shell and let it wake up via cron to &quot;help.&quot; You
can also write the shell script. The shell script will work the same way every time.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Rethinking My Position on AI</title>
        <published>2026-01-14T00:00:00+00:00</published>
        <updated>2026-01-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/rethinking-ai/"/>
        <id>https://daz.is/blog/rethinking-ai/</id>
        
        <content type="html" xml:base="https://daz.is/blog/rethinking-ai/">&lt;aside class=&quot;update-callout&quot;&gt;
  &lt;span class=&quot;update-callout__label&quot;&gt;Update — 2026-02-07&lt;&#x2F;span&gt;
  &lt;p&gt;Re-phrased some parts to be clearer and to add the important Nolan Lawson
insight &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;nolanlawson.com&#x2F;2026&#x2F;02&#x2F;07&#x2F;we-mourn-our-craft&#x2F;&quot;&gt;We Mourn Our Craft&lt;&#x2F;a&gt;&quot;.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;It Is Difficult to Get a Man to Understand Something When His Salary Depends Upon His Not Understanding It&quot;
-- &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;quoteinvestigator.com&#x2F;2017&#x2F;11&#x2F;30&#x2F;salary&#x2F;&quot;&gt;Upton Sinclair&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;When the facts change, I change my mind.&quot;
-- &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;quoteinvestigator.com&#x2F;2011&#x2F;07&#x2F;22&#x2F;keynes-change-mind&#x2F;&quot;&gt;John Maynard Keynes&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Last year I was using AI chat and Copilot but hadn&#x27;t gone all in on coding agents yet. I was seeing AI slop everywhere
and saw code review bots fixating on trivia or getting completely confused. The tools were useful for research and code
completion, but agents felt like more hype than substance. And they were. They genuinely weren&#x27;t ready.&lt;&#x2F;p&gt;
&lt;p&gt;Then December happened.&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;It is hard to communicate how much programming has changed due to AI in the last 2 months: not gradually and over time
in the &quot;progress as usual&quot; way, but specifically this last December. There are a number of asterisks but imo coding
agents basically didn&#x27;t work before December and basically work since -- the models have significantly higher quality,
long-term coherence and tenacity and they can power through large and long tasks, well past enough that it is
extremely disruptive to the default programming workflow.&lt;&#x2F;p&gt;
&lt;p&gt;-- &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;karpathy.github.io&#x2F;&quot;&gt;Andrej Karpathy&lt;&#x2F;a&gt;&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Karpathy nails it. Agents before December were a novelty. After, they actually work. The models got better at holding
context over long tasks, at recovering from mistakes, at understanding what you actually want. It wasn&#x27;t gradual. It was
a step change.&lt;&#x2F;p&gt;
&lt;p&gt;Two things pushed me further than I expected.&lt;&#x2F;p&gt;
&lt;p&gt;A friend who wouldn&#x27;t shut up about spec-driven development. He explained his workflow in detail, and I pushed back to
defend my craft. Kept doing things my way.&lt;&#x2F;p&gt;
&lt;p&gt;Then I hit a patch of ice on my bicycle. Broke my elbow, messed up my shoulder, ribs, and wrist. Suddenly, I couldn&#x27;t
type properly. I was forced to lean on AI agents far more heavily than I&#x27;d planned, and dictation software for
everything else. No choice but to figure out how to make them actually work. (The dictation, by the way, turned out to
be amazing. I&#x27;m not sure I&#x27;ll go back.)&lt;&#x2F;p&gt;
&lt;p&gt;The timing was fortunate, if you can call breaking your elbow fortunate. Opus 4.5 had just dropped, and it&#x27;s shockingly
good at coding. I haven&#x27;t written any code since the accident, but my output has gone up, not down. That&#x27;s a strange
thing to sit with.&lt;&#x2F;p&gt;
&lt;p&gt;Nolan Lawson put it well in &quot;&lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;nolanlawson.com&#x2F;2026&#x2F;02&#x2F;07&#x2F;we-mourn-our-craft&#x2F;&quot;&gt;We Mourn Our Craft&lt;&#x2F;a&gt;&quot;: &quot;The worst
fact about these tools is that they work.&quot; He frames it as grief, not conversion. He had a mortgage, a family, and
junior colleagues strapping on &quot;bazooka-powered jetpacks.&quot; The mind-change came not from an argument won, but from a
reality that refused to wait for his permission.&lt;&#x2F;p&gt;
&lt;p&gt;It&#x27;s OK to mourn our craft. I&#x27;ve permitted myself to do so. But I&#x27;m learning to build a new craft on the bones of the
old one.&lt;&#x2F;p&gt;
&lt;p&gt;The effectiveness of these tools opens up a huge dilemma. Opting out entirely means giving up any influence over how
this goes.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-contradiction-i-m-sitting-with&quot;&gt;The Contradiction I&#x27;m Sitting With&lt;&#x2F;h2&gt;
&lt;p&gt;Here&#x27;s where I&#x27;m at: I need to adapt to stay relevant. Accumulated expertise doesn&#x27;t evaporate overnight, but the speed
of change is faster than I expected. At the same time, I genuinely believe the current structure of AI development is
concentrating power, replicating the worst patterns of Big Tech, and creating environmental costs we&#x27;re not seriously
reckoning with.&lt;&#x2F;p&gt;
&lt;p&gt;What do you do when you need to use tools that you think are contributing to harmful outcomes?&lt;&#x2F;p&gt;
&lt;aside class=&quot;aside-callout&quot;&gt;
  &lt;span class=&quot;aside-callout__label&quot;&gt;Aside&lt;&#x2F;span&gt;
  &lt;p&gt;I should also flag the irony of leaning on Fuller below. His techno-optimism, &quot;doing more with less,&quot; designing our way
out of systemic problems, is exactly the rhetoric Silicon Valley adopted to justify moving fast and breaking things. The
same language I use about abundance and shared infrastructure could come straight from a startup pitch deck. Fuller
isn&#x27;t wrong, but those ideas get co-opted easily. The power concentration, the environmental costs, the broken
economics. Those are the parts of this article I stand behind most.&lt;&#x2F;p&gt;

&lt;&#x2F;aside&gt;
&lt;h2 id=&quot;what-would-bucky-do&quot;&gt;What would Bucky do?&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;ve been thinking about this through the lens of Buckminster Fuller, partly because I&#x27;ve been reading his work
recently, and partly because he spent a lot of time thinking about exactly this kind of bind. Fuller studied what he
called the &quot;Great Pirates&quot;, powerful maritime traders who operated across national boundaries, accumulated comprehensive
knowledge, and eventually became the invisible power brokers behind modern finance and corporate structures. But he
didn&#x27;t study them to emulate them. He studied them to understand how power concentrates, and how to design alternatives.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;distinguishing-the-tool-from-the-structure&quot;&gt;Distinguishing the Tool from the Structure&lt;&#x2F;h2&gt;
&lt;p&gt;Using AI effectively isn&#x27;t the same as endorsing the concentration of its development in a few corporations, or the
extractive data practices, or the environmental costs. I can be pragmatic about using the tools while being vocal about
the structural problems.&lt;&#x2F;p&gt;
&lt;p&gt;Fuller didn&#x27;t refuse to use electricity because power companies were monopolistic. He designed systems for more
distributed energy.&lt;&#x2F;p&gt;
&lt;p&gt;For me this means learning to work with AI while pushing for open-source alternatives, better regulation, and
environmental accountability. Being the person in the room who can say &quot;this is impressive technically AND here&#x27;s why
the current trajectory is dangerous.&quot;&lt;&#x2F;p&gt;
&lt;p&gt;Deep expertise gives me standing that pure critics don&#x27;t have.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;sharing-knowledge-not-hoarding-it&quot;&gt;Sharing Knowledge, Not Hoarding It&lt;&#x2F;h2&gt;
&lt;p&gt;Fuller&#x27;s response to the pirates&#x27; legacy was essentially: what if we made all knowledge accessible? What if we designed
for everyone&#x27;s success, not competitive advantage? What if we operated from abundance rather than scarcity?&lt;&#x2F;p&gt;
&lt;p&gt;My expertise becomes more valuable when I give it away, not less. I&#x27;m trying to document what I&#x27;m learning about AI
publicly. The &quot;competitive moat&quot; thinking is pirate logic. Fuller would say security comes from being genuinely useful
to the whole system.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;new-tools-old-discipline&quot;&gt;New Tools, Old Discipline&lt;&#x2F;h2&gt;
&lt;p&gt;The best practices for working with AI aren&#x27;t new.&lt;&#x2F;p&gt;
&lt;p&gt;Write clear specs before you start coding. Break work into well-defined tasks. Review output carefully. Give good
context. Think about architecture before implementation. We were supposed to have been doing this for decades.&lt;&#x2F;p&gt;
&lt;p&gt;My friend who kept banging on about spec-driven development? He was right. Writing a proper spec before handing work to
an AI agent produces dramatically better results than prompting and hoping. The spec forces you to think first. The
thinking was always the valuable bit.&lt;&#x2F;p&gt;
&lt;p&gt;Anthropic published a piece
about &lt;a rel=&quot;external&quot; href=&quot;https:&#x2F;&#x2F;claude.com&#x2F;blog&#x2F;how-anthropic-teams-use-claude-code&quot;&gt;how their own teams use Claude Code&lt;&#x2F;a&gt;. They treat AI
agents like a development team. Give them proper context. Plan before executing. Maintain human oversight. Review before
deploying. It reads less like a technology manual and more like a management handbook. Because that&#x27;s what it is.&lt;&#x2F;p&gt;
&lt;p&gt;The people thriving with these tools aren&#x27;t the ones who learned prompt engineering from scratch. They&#x27;re the ones who
already valued clear thinking, systematic review, and well-structured work. The craft didn&#x27;t die. It shifted from typing
to thinking. From writing code to specifying intent, reviewing output, and knowing when something&#x27;s wrong.&lt;&#x2F;p&gt;
&lt;p&gt;That&#x27;s why accumulated expertise still matters, even as the tools change underneath you. You need to know what good
looks like before you can judge whether an AI produced it.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-economic-argument&quot;&gt;The Economic Argument&lt;&#x2F;h2&gt;
&lt;p&gt;The economics of AI are fundamentally broken.&lt;&#x2F;p&gt;
&lt;p&gt;Billions invested in training runs. Models obsolete in months. Massive duplication of effort across competing companies.
Each company is rebuilding similar capabilities from scratch. Energy and compute wasted on redundant training. Race
dynamics forcing premature releases and corner-cutting.&lt;&#x2F;p&gt;
&lt;p&gt;Fuller would see this and say: this is competition-based scarcity thinking producing artificial scarcity while
simultaneously creating massive waste. It&#x27;s exactly backwards. He believed humanity&#x27;s problems weren&#x27;t resource
problems, they were design and coordination problems. We have enough for everyone if we design efficiently and
collaborate.&lt;&#x2F;p&gt;
&lt;p&gt;What if the massive investment was collaborative rather than competitive? Shared base models, openly developed.
Companies compete on applications and implementations, not on rebuilding foundation models. Like how we don&#x27;t have
competing internets, we have shared infrastructure with competition at other layers.&lt;&#x2F;p&gt;
&lt;p&gt;What if we designed for longevity rather than obsolescence? Smaller, more efficient models that actually get refined
over time. Focus on getting more capability from less compute. Sustainable rather than race-to-the-bottom dynamics.&lt;&#x2F;p&gt;
&lt;p&gt;The current model only &quot;works&quot; because venture capital and tech giants can sustain losses hoping for future monopoly.
The race dynamic forces everyone to participate or be left behind. It&#x27;s a prisoner&#x27;s dilemma. Everyone would be better
off cooperating, but no one can unilaterally stop competing.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;being-a-trim-tab&quot;&gt;Being a Trim Tab&lt;&#x2F;h2&gt;
&lt;p&gt;Fuller&#x27;s favourite metaphor was the trim tab. The small rudder that turns the big rudder that turns the ship. You don&#x27;t
have to move the whole ship yourself. You find the leverage point where a small action creates a larger change.&lt;&#x2F;p&gt;
&lt;p&gt;I can&#x27;t change that major AI models are controlled by a few companies, or the massive energy consumption, or the global
race dynamics. But I can change what problems I work on, how I share knowledge, what tools and alternatives I support,
and what voice I lend to which conversations.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;what-this-means-in-practice&quot;&gt;What This Means in Practice&lt;&#x2F;h2&gt;
&lt;p&gt;For me, it means focusing on problems that actually help people. Not extraction and manipulation. Is this work helping
people do more with less? Is it reducing drudgery? Creating genuine value?&lt;&#x2F;p&gt;
&lt;p&gt;The scarcity mindset says, &quot;AI is taking my job, I need to protect my turf.&quot; I&#x27;m trying to think differently. AI can
handle routine work, freeing me up for problems I haven&#x27;t had capacity to address.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m deploying AI across my workflow now, orchestrating multiple agents, refining a process around human oversight at the
points with most leverage. The bottleneck has shifted from writing code to reviewing it. Scaling the human judgement
side is the interesting problem.&lt;&#x2F;p&gt;
&lt;p&gt;My expertise isn&#x27;t a scarce resource to protect. It&#x27;s a foundation to build something better on.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-uncomfortable-reality&quot;&gt;The Uncomfortable Reality&lt;&#x2F;h2&gt;
&lt;p&gt;I don&#x27;t have this fully resolved. The tension is real. The risks are real. But sitting it out isn&#x27;t an option either.&lt;&#x2F;p&gt;
&lt;p&gt;The test isn&#x27;t whether AI is good or bad. It&#x27;s whether we can shape how it develops and who it benefits. That needs
people who understand both the technology and its dangers to actually be in the room.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;m still concerned. But I&#x27;m building with eyes open and values intact. If you&#x27;re sitting with the same contradiction,
I&#x27;d genuinely love to hear how you&#x27;re thinking about it.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Do More With Less: My Web Stack for 2026</title>
        <published>2026-01-03T00:00:00+00:00</published>
        <updated>2026-01-03T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/my-2026-stack/"/>
        <id>https://daz.is/blog/my-2026-stack/</id>
        
        <content type="html" xml:base="https://daz.is/blog/my-2026-stack/">&lt;p&gt;Over the past few months, in my spare time, I&#x27;ve been working on my
side-project, &lt;a href=&quot;&#x2F;work&#x2F;zero-waste-tickets&#x2F;&quot;&gt;Zero Waste Tickets&lt;&#x2F;a&gt;, where I make heavy use of HTMX, server-rendered HTML, and
a few bits of vanilla JavaScript for interactions. I&#x27;m able to do more with less. Much more than you might expect for a
single dev working in my spare time. The result is fast to load, and I can reason about the whole codebase in one place.&lt;&#x2F;p&gt;
&lt;p&gt;I keep reaching for simpler tools and getting better results. The defaults in web development have drifted towards
complexity that most projects don&#x27;t need. React had its moment, but it got used everywhere, including lots of places it
shouldn&#x27;t. And it created a whole ecosystem of build tooling, state-management, and not-quite-right abstractions
patching other not-quite-right abstractions.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-philosophy-do-more-with-less&quot;&gt;The Philosophy: Do More With Less&lt;&#x2F;h2&gt;
&lt;p&gt;The best stack is the one where you&#x27;ve removed everything that isn&#x27;t pulling its weight.&lt;&#x2F;p&gt;
&lt;p&gt;Splitting an application into a heavy frontend and a separate backend roughly doubles the surface area. Two codebases,
two deployment pipelines, two sets of state to manage, and a contract between them that needs constant maintenance.
Business logic gets scattered. Some in the frontend for &quot;responsiveness,&quot; some in the backend for &quot;security.&quot; Testing
becomes harder. Onboarding new developers takes longer. Why do we keep doing this to ourselves?&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;&quot;Simplicity is a great virtue, but it requires hard work to achieve it and education to appreciate it. And to make
matters worse: complexity sells better.&quot; — Edsger Dijkstra&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That last line is the whole story of React&#x27;s dominance. Complexity sold very well indeed.&lt;&#x2F;p&gt;
&lt;p&gt;The alternative is to co-locate business logic in the backend and treat HTML as a first-class output format. This isn&#x27;t
a new idea. It&#x27;s how the web worked for most of its early history. But it deserves a serious look now that HTML and CSS
have got genuinely good.&lt;&#x2F;p&gt;
&lt;p&gt;Modern HTML gives you dialogs, form validation, lazy loading, and semantic elements that screen readers understand out
of the box. CSS handles layouts that used to require JavaScript. If you lean into the platform, you get performance and
accessibility by default instead of fighting for them.&lt;&#x2F;p&gt;
&lt;p&gt;That doesn&#x27;t mean zero JavaScript. It means JavaScript where it actually adds value, not as the default for everything.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;the-rust-backend-axum-sqlx-and-postgres&quot;&gt;The Rust Backend: Axum, SQLx, and Postgres&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m writing web services in Rust using Axum. The type system catches entire categories of bugs at compile time, and the
performance characteristics mean I don&#x27;t need to think about scaling until much later than I would with other languages.&lt;&#x2F;p&gt;
&lt;p&gt;Axum itself is straightforward. It&#x27;s a thin layer over the Tokio async runtime with good ergonomics for routing and
middleware. Unlike the full-featured frameworks you get in other languages, Axum is deliberately a low-level building
block. I appreciate that as a design goal, even if it means you&#x27;re assembling more pieces yourself. I&#x27;ve ended up
writing my own macro-based utilities on top of it to make common patterns more declarative, which is exactly the kind of
flexibility this approach allows.&lt;&#x2F;p&gt;
&lt;p&gt;For database access, I&#x27;m using SQLx with Postgres. SQLx checks your SQL queries against your actual database schema at
compile time. No ORM abstraction layer, no runtime query building, just raw SQL with the safety of knowing it will work
before you deploy.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres continues to be the right default for most applications. It handles JSON, full-text search, and geospatial data
without needing separate systems. That said, it&#x27;s not without its quirks. If you try to build queue systems on top of
it, or rely heavily on locks and NOTIFY, you can run into global lock contention that&#x27;s easy to miss until it becomes a
problem. It&#x27;s quite possible to shoot yourself in the foot.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve been using Postgres long enough that I know where the rough edges are and how to work around them. Over the years
I&#x27;ve watched people move to trendier databases only to spend time rebuilding features that a relational database gives
you out of the box.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;html-first-with-maud-and-htmx&quot;&gt;HTML-First with Maud and HTMX&lt;&#x2F;h2&gt;
&lt;p&gt;For rendering HTML, I&#x27;m using Maud, a macro-based templating library for Rust. Templates are type-checked at compile
time, and because they&#x27;re just Rust code, you get all the refactoring and IDE support you&#x27;re used to.&lt;&#x2F;p&gt;
&lt;p&gt;For styling, I&#x27;m using a small macro to collect and aggregate CSS snippets from across the codebase.&lt;&#x2F;p&gt;
&lt;p&gt;When I need interactivity beyond what HTML provides natively, I reach for HTMX. It lets you make AJAX requests and
update parts of the page using HTML attributes. The mental model stays simple: the server returns HTML, and HTMX swaps
it into the DOM.&lt;&#x2F;p&gt;
&lt;p&gt;The result is fast and accessible. The pages are small. The server does the work it&#x27;s good at. And because the core
content is server-rendered HTML, there&#x27;s a graceful baseline even without JavaScript.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;infrastructure-you-can-understand-docker-compose-on-a-vps&quot;&gt;Infrastructure You Can Understand: Docker Compose on a VPS&lt;&#x2F;h2&gt;
&lt;p&gt;I&#x27;m currently in the process of moving my side projects from cloud services to a self-hosted VPS. The primary motivation
is cost. Running something like Zero Waste Tickets on managed cloud infrastructure costs more than it needs to. But
there&#x27;s also value in having infrastructure I can fully understand and control.&lt;&#x2F;p&gt;
&lt;p&gt;For orchestration, Docker Compose is enough. I define my services in a single file, run &lt;code&gt;docker compose up&lt;&#x2F;code&gt;, and
everything works. No service mesh, no ingress controllers, no YAML spread across dozens of files.&lt;&#x2F;p&gt;
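&lt;p&gt;For a setup like the one described, the single file can be as small as this. Service names, ports, and image tags are illustrative, not my actual configuration:&lt;&#x2F;p&gt;

```yaml
# docker-compose.yml -- everything in one place, started with
# "docker compose up". All values below are placeholders.
services:
  app:
    build: .
    ports:
      - "8080:8080"
    environment:
      DATABASE_URL: postgres://app:app@db:5432/app
    depends_on:
      - db
  db:
    image: postgres:16
    volumes:
      - dbdata:/var/lib/postgresql/data
volumes:
  dbdata:
```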
&lt;p&gt;For object storage, I use whatever S3-compatible option the host provides, or RustFS if I need to run it myself. The S3
API is a reasonable standard, and there&#x27;s no reason to couple to a specific cloud provider for blob storage.&lt;&#x2F;p&gt;
&lt;p&gt;We use Kubernetes heavily at work, and I understand why it exists. For large-scale systems with dedicated platform
teams, the complexity is justified. But for most projects, it&#x27;s overhead that doesn&#x27;t pay for itself.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;durable-execution-for-the-agentic-future-restate&quot;&gt;Durable Execution for the Agentic Future: Restate&lt;&#x2F;h2&gt;
&lt;p&gt;The piece of this stack I&#x27;m most excited about is Restate.&lt;&#x2F;p&gt;
&lt;p&gt;Restate is a durable execution platform. It stores the progress of your workflows as they run, so if something crashes,
execution resumes from where it left off rather than restarting from the beginning. Retries, idempotency, and state
persistence are handled by the runtime rather than being your problem.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve been using it in side projects for long-running workflows and agent orchestration. If you&#x27;ve ever built a multistep
process, such as an agent that calls external APIs, waits for responses, and makes decisions over minutes or hours, you
know the pattern: you end up hand-rolling state machines, writing retry logic, building idempotency checks, and tracking
progress in a database. And then you do it all again for the next workflow, slightly differently each time.&lt;&#x2F;p&gt;
&lt;p&gt;Restate makes that entire category of problem go away. You write your workflow as straightforward code, and the runtime
handles persistence, retries, and recovery. If something crashes, execution picks up where it left off.&lt;&#x2F;p&gt;
&lt;p&gt;As AI agents start doing more real work, spanning multiple steps, calling external services, running over extended time
periods, this kind of infrastructure stops being nice-to-have. I&#x27;m planning to expand my use of it and introduce it at
work where it makes sense.&lt;&#x2F;p&gt;
&lt;h2 id=&quot;wrapping-up&quot;&gt;Wrapping Up&lt;&#x2F;h2&gt;
&lt;p&gt;None of these are particularly exotic choices. Postgres has been around for decades. HTML-first development is older
than React. Docker Compose is deliberately simple. What&#x27;s changed is that I&#x27;m choosing them deliberately rather than
defaulting to complexity. Doing more with less.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;d love to know what your stack looks like right now. What are you excited about this year? What have you dropped that
you used to think was essential? Drop me a line. I&#x27;m always up for that conversation.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>New Year, New Blog</title>
        <published>2026-01-01T00:00:00+00:00</published>
        <updated>2026-01-01T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://daz.is/blog/new-year-new-blog/"/>
        <id>https://daz.is/blog/new-year-new-blog/</id>
        
        <content type="html" xml:base="https://daz.is/blog/new-year-new-blog/">&lt;p&gt;I&#x27;m bringing the blog back.&lt;&#x2F;p&gt;
&lt;p&gt;The world I&#x27;ve spent my career in is changing fast, and I&#x27;ve been doing a lot of thinking about where things are
heading. I want a place to work through those ideas properly, not in throwaway social media posts but in something I own
and can build on.&lt;&#x2F;p&gt;
&lt;p&gt;I&#x27;ve also been digging through my archives, recovering old posts from previous incarnations of this site. I&#x27;ll be
backfilling those over time, mostly for nostalgia. Not much of that older thinking still holds up, but it&#x27;s interesting to see
how things have changed.&lt;&#x2F;p&gt;
&lt;p&gt;There&#x27;s music and other projects I want to get up here too.&lt;&#x2F;p&gt;
&lt;p&gt;For now, stick around. I&#x27;ve got a lot to share about this strange moment we&#x27;re all in.&lt;&#x2F;p&gt;
</content>
        
    </entry>
</feed>
