The Evaluator
Your go-to blog for insights on AI observability and evaluation.
AI agent evaluation: How to test, debug, and improve agents in production
Lessons from building and shipping Alyx, our AI agent
Swarm management in agent harnesses: owning long-running agents
As we have built our own harness management tools internally at Arize, and watched external systems like Devin @cognition start managing other Devins, managed agents at @AnthropicAI and long-running…
What is an evaluation harness?
An evaluation harness is the standardized infrastructure that decides what gets evaluated, runs the evaluation, and acts on the result.
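The three responsibilities named above (deciding what gets evaluated, running the evaluation, and acting on the result) can be sketched as three small functions. This is a minimal illustration, not the design from the post: the names `Example`, `select`, `run`, `act`, and `harness`, the exact-match evaluator, and the 0.8 pass-rate threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One recorded interaction to evaluate (hypothetical shape)."""
    prompt: str
    output: str
    expected: str


@dataclass
class Result:
    example: Example
    passed: bool


def select(examples: List[Example], predicate: Callable[[Example], bool]) -> List[Example]:
    """Stage 1: decide what gets evaluated."""
    return [ex for ex in examples if predicate(ex)]


def run(examples: List[Example], evaluator: Callable[[Example], bool]) -> List[Result]:
    """Stage 2: run the evaluation on each selected example."""
    return [Result(ex, evaluator(ex)) for ex in examples]


def act(results: List[Result], threshold: float) -> str:
    """Stage 3: act on the result. Here that is only a decision string."""
    if not results:
        return "skip"
    pass_rate = sum(r.passed for r in results) / len(results)
    return "promote" if pass_rate >= threshold else "block"


def exact_match(ex: Example) -> bool:
    """A deliberately simple evaluator; real harnesses often use richer checks."""
    return ex.output.strip() == ex.expected.strip()


def harness(examples: List[Example], threshold: float = 0.8) -> str:
    # Assumption for this sketch: only examples with a non-empty prompt are in scope.
    selected = select(examples, lambda ex: bool(ex.prompt))
    return act(run(selected, exact_match), threshold)
```

Keeping the stages separate means the selection rule, the evaluator, and the follow-up action can each be swapped without touching the others.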
MCP vs. CLI Skills for agents: what our eval found (and which you should use)
Twitter said pick a side. The eval said the question was wrong. Six months ago, MCP (Model Context Protocol) was the hot new thing: tool usage with a built-in discovery…
Why agent telemetry needs standards
Enterprise agents are moving from demos into production workflows, which creates a basic problem: teams need to understand what those agents actually did.
Prompt templates as configs, not code
This post was written in April 2026. Cloud products, feature maturity, and recommended patterns change over time, so readers should treat these examples as directional guidance. For teams already using Arize, there is a natural extension of that pattern. Prompt Playground can sit upstream of the config layer as the place where prompts are edited, compared, and versioned before they are promoted into whatever config system the company already trusts in production.
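As a rough illustration of treating prompt templates as config rather than code, the sketch below keeps versioned templates in a JSON document and renders whichever version is marked active. The config layout, the `summarize` prompt, and the helper names `load_prompts` and `render_prompt` are hypothetical and chosen for this example. They are not the post's schema and not an Arize API.

```python
import json
from string import Template

# Hypothetical config document. In practice this would live in whatever
# config system the team already uses, not inline in application code.
CONFIG_JSON = """
{
  "summarize": {
    "active": "v2",
    "versions": {
      "v1": "Summarize this text: $text",
      "v2": "Summarize this text for $audience: $text"
    }
  }
}
"""


def load_prompts(raw: str) -> dict:
    """Parse the prompt config from its JSON representation."""
    return json.loads(raw)


def render_prompt(prompts: dict, name: str, version: str = None, **variables) -> str:
    """Render the named prompt, using the active version unless one is pinned."""
    entry = prompts[name]
    chosen = version or entry["active"]
    # Template.substitute raises KeyError if a placeholder has no value,
    # which surfaces config/code mismatches early.
    return Template(entry["versions"][chosen]).substitute(**variables)
```

With this shape, promoting a new prompt means changing the `active` field in config, and rolling back means changing it again, with no code deploy in either case.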
Using context graphs: build a data moat like Google’s using your enterprise data
Enterprise software is on the verge of its first compounding data loop, the same kind of self-reinforcing mechanism that built the most valuable consumer businesses of the last twenty years…
Context management in agent harnesses: memory, files, and subagents
A version of this article originally appeared on X. Every agent harness runs into the same limit: the context window is too small for everything the model might want to remember…
What is an agent harness?
A version of this article originally appeared on X. Someone asked me at a hacker event last week: “Can anyone actually tell me what a harness really is?” It was…
Beyond models: How context and evals make agents work in production
Building an AI agent has never been easier. But getting a reliable one into production is still hard. Most teams can ship a working demo in a day. The agent…