Evaling an AI agent

How do you measure how well an open-ended agent works?

Create’s agent takes natural language and turns it into production software. It isn’t deterministic. “Build a mentoring marketplace with Stripe payments for sessions” has no canonical answer. Benchmarks don’t exist, golden outputs are brittle, and LLM-as-a-judge misses the part that matters: does the app it builds actually work when someone uses it?

So we built our own measurement system.

What we shipped

Over the last month we stood up a daily evaluation framework, Create Evals, that simulates user sessions, scores the results, and gives engineers a fast signal on whether the agent is getting better or worse.

The loop is simple:

  1. Prompt sets. Curated scenarios drawn from real sessions, edge cases, and synthetic flows. Everything from “add auth to this site” to full builds like “personal finance tracker with charts and login.” Each set holds ~50 examples, including long conversational sequences.
  2. Scorers. We layer three perspectives:
    • LLM judges for relevance and coherence.
    • Computer-use agents (CUA) that launch the generated app in a browser, click through flows, and observe state via the selectors we instrument (data-testids, etc.). Here’s a CUA in action scoring a generated site.
    • Humans-in-the-loop for fast side-by-sides on selected failures.
  3. Daily tracking. Scores, logs, visual diffs, and qualitative notes land in our observability stack. Engineers review regressions in minutes, not days.
Braintrust Evals

The system runs every morning across a curated batch of prompts. If a change lands (in the models, in the planner, in the runtime) we see the effect the same day.
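To make the loop concrete, here is a rough sketch of what a single run reduces to. The types and helper names (runDailyEvals, buildApp, and so on) are illustrative stand-ins, not our internal APIs.

```ts
// Illustrative sketch of one daily eval run; types and helpers are
// hypothetical stand-ins, not production APIs.
type PromptExample = { id: string; messages: string[] };
type ScoreRecord = { exampleId: string; scorer: string; score: number; notes?: string };

async function runDailyEvals(
  promptSet: PromptExample[],
  scorers: { name: string; score: (appUrl: string, example: PromptExample) => Promise<number> }[],
  buildApp: (example: PromptExample) => Promise<string>, // agent builds the app, returns a preview URL
  log: (record: ScoreRecord) => Promise<void>,           // lands in the observability stack
): Promise<ScoreRecord[]> {
  const records: ScoreRecord[] = [];
  for (const example of promptSet) {
    const appUrl = await buildApp(example);
    for (const scorer of scorers) {
      // Each perspective (LLM judge, CUA, human spot check) plugs in as a scorer.
      const score = await scorer.score(appUrl, example);
      const record = { exampleId: example.id, scorer: scorer.name, score };
      await log(record);
      records.push(record);
    }
  }
  return records;
}
```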

Why this is necessary

Traditional software tests assert exact outputs; agentic systems don’t work that way. We tried the common approaches:

  • LLM-as-a-judge hallucinated success and missed logic bugs when it judged from code alone.
  • String or DOM comparisons broke whenever the agent chose a different but valid layout.
  • Manual QA didn’t scale with the combinatorics of agent behavior.

Create Evals solved practical problems immediately:

  • Caught regressions before they hit users.
  • Flagged prompt templates that degraded output quality.
  • Drove improvements in long-session coherence (we stretched one conversation eval from 8 to 165+ turns and watched stability hold).
  • Let us quantify tradeoffs—when we added a design-reasoning step, we saw UX improve and latency increase; the decision became data-informed.

The goal isn’t maximum coverage; it’s a small set of evals that correlate tightly with user experience, run automatically, and are interpretable by engineers.

Hard parts

  • CUA accuracy. Browser agents top out around 70–90% precision on some tasks. We compensate with instrumentation and human spot checks (see the sketch after this list).
  • LLM bias. Models still hallucinate success when they see “good-looking” code. Running the app is non-negotiable.
  • Eval design. We avoid stylized prompts that nobody would actually type.
  • Human review. Every eval we keep has to be understandable in minutes. If engineers can’t reason about a failure, the metric loses value.
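On the CUA point: the instrumentation that backs it up is conventional browser automation. Here is a hedged sketch of the kind of deterministic check we can layer underneath, written with Playwright against data-testid selectors; the specific test ids and the PREVIEW_URL environment variable are made-up examples, not a fixed contract.

```ts
import { test, expect } from "@playwright/test";

// Deterministic check layered under the CUA. The data-testid values below
// are hypothetical examples of the instrumentation the agent emits.
test("generated finance tracker exposes core flows", async ({ page }) => {
  await page.goto(process.env.PREVIEW_URL!); // deployed preview of the generated app

  // Auth flow renders and accepts input.
  await page.getByTestId("login-email").fill("eval@example.com");
  await page.getByTestId("login-submit").click();

  // After login, the dashboard and chart mount with real state.
  await expect(page.getByTestId("dashboard")).toBeVisible();
  await expect(page.getByTestId("spending-chart")).toBeVisible();
});
```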

Inside the agent

When you prompt Create, the agent reads the repo, the chat, and the available tools. It plans, edits files, updates the UI, wires auth/payments, deploys, and spins up environments. Failures can happen anywhere: indexing, compilation, deployment, business logic. Evals give us a lens into the entire stack.
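One way to keep that lens interpretable is to attribute every failure to a stage. A minimal sketch of the tagging, where the stage names mirror the pipeline above but the types themselves are illustrative rather than our actual schema:

```ts
// Illustrative failure-attribution type; stage names mirror the pipeline
// described above, the shape itself is a sketch rather than a real schema.
type AgentStage =
  | "indexing"      // reading the repo, the chat, and the available tools
  | "planning"
  | "editing"       // file edits and UI updates
  | "integration"   // wiring auth, payments, other tools
  | "compilation"
  | "deployment"
  | "runtime";      // business logic observed by the CUA

interface EvalFailure {
  sessionId: string;
  stage: AgentStage;
  summary: string;  // short and human-readable: reviewable in minutes
}
```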

Industry context

Academic benchmarks like SWE-bench, WebArena, and AppBench are useful but narrow. They focus on code reasoning or single-turn tasks. We care about full workflows that end with a working product. To our knowledge, Create is the first text-to-app platform running structured, daily evals grounded in production behavior.

Why it matters

Our vision is simple: anyone should be able to go from idea to business using natural language. That includes auth, payments, databases, deployment, integrations, the unglamorous parts. To get there, the agent must be reliable. Reliability demands metrics, feedback loops, and tooling that make debugging agents tractable.

Evals are the first hill to climb before you get production data. Every model swap, planner tweak, UI change, or tool integration flows through this system. It lets us ship quickly without flying blind.

Still, there's a limit to their usefulness. The ultimate test is how the agent operates across many production sessions. So we're now building systems that can simulate changes to the agent across previous sessions, as well as preference-testing setups that make it quick to collect pairwise rankings between sessions and agent versions.
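The aggregation side of pairwise preference testing is straightforward. As a sketch, assuming a simple comparison record rather than our real session metadata, collapsing rankings into per-version win rates looks roughly like this (Bradley-Terry or Elo would be the natural next step):

```ts
// Sketch: collapse pairwise preferences between agent versions into win
// rates. Shapes are illustrative; real comparisons carry session metadata.
type Preference = { winner: string; loser: string };

function winRates(prefs: Preference[]): Map<string, number> {
  const wins = new Map<string, number>();
  const total = new Map<string, number>();
  for (const { winner, loser } of prefs) {
    wins.set(winner, (wins.get(winner) ?? 0) + 1);
    total.set(winner, (total.get(winner) ?? 0) + 1);
    total.set(loser, (total.get(loser) ?? 0) + 1);
  }
  const rates = new Map<string, number>();
  for (const [version, comparisons] of total) {
    rates.set(version, (wins.get(version) ?? 0) / comparisons);
  }
  return rates;
}
```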

We’re still early, and there’s a lot left to build.

If shaping measurable, trustworthy behavior in powerful AI systems sounds like your kind of work, we’re hiring in SF.