
01 Intro
Every AI feature looks incredible in a demo. You craft the perfect prompt, feed it a clean input, and the output is magic. Stakeholders are impressed. The team is excited. Ship it.
Then production happens. Real users don't type clean inputs. They misspell things, ask ambiguous questions, chain requests together in ways nobody anticipated. The model that looked flawless in a controlled environment starts quietly failing, and unlike a traditional bug, there's no stack trace. No error log. Just a user getting a confident, wrong answer and losing trust in your product.
We've shipped enough AI features into production at Labrys to know this pattern well. The demo always works. The first week in production is where the real engineering starts. And the gap between those two states is not a prompting problem. It's a measurement problem.
02 Where Production AI Breaks
This is the gap most teams underestimate. Traditional software either works or it doesn't. AI works most of the time, and that's the dangerous part. You don't know which 15% of requests are failing until someone complains.
Most production AI failures fall into three categories.
Hallucinations. The model generates something plausible but factually wrong. It invents data, references things that don't exist, or confidently answers questions it should decline. In a demo this rarely surfaces because you're controlling the input. In production, it's inevitable.
Edge case blindness. Your system handles the happy path perfectly but falls apart at the boundaries. Unusual formatting, unexpected user intent, inputs that sit between two categories. These aren't rare; they're the majority of real-world usage that your test cases didn't cover.
Prompt brittleness. Small changes in input phrasing produce wildly different outputs. The model handles "find me a restaurant" perfectly but breaks on "where should I eat." Same intent, different result. This compounds as your system scales: every new feature or prompt change risks breaking something that was working.
The cost isn't just bad outputs. It's support tickets, manual overrides, users who silently churn, and engineering time spent firefighting instead of building. Most teams respond by patching prompts. Adding more instructions, more examples, more edge case handling. The prompt bloats, gets fragile, and the cycle repeats.
There's a better way. You need a system that catches these failures before your users do.
03 Evals
Evals, when done right, provide a reliable way to benchmark AI-powered systems. They aren't "run it a few times and check if it feels right." They're a systematic, repeatable quality layer, closer to a test suite than a QA pass. Think of them as the CI pipeline for AI behaviour. Without them, every change to your system is a gamble.
Start With Tracing
Before any of this works, you need tracing. Traces are structured logs of what your AI system actually did. The input, the output, any tool calls, retrieved context, intermediate steps. Without them you're evaluating blind. You can't write assertions against outputs you never captured, and you can't feed an LLM judge a conversation you never logged.
Instrument your system to capture traces first. At minimum you want the full request/response pair, any retrieval or tool-use steps in between, latency per step, and the model version that produced the output. If you're using a framework like LangChain or Vercel AI SDK, most of this comes almost free. If you're hitting APIs directly, you'll need to build the logging layer yourself. Either way, it's not optional. Everything else builds on top of that.
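As a concrete starting point, here's a minimal sketch of a hand-rolled tracing layer, assuming a JSONL file as the sink. The record shape, `log_trace`, and `my_model_call` are illustrative names, not part of any framework:

```python
import json
import time
import uuid
from datetime import datetime, timezone

def my_model_call(prompt: str) -> str:
    """Stand-in for your real inference call."""
    return "Try the noodle bar on 5th."

def log_trace(request: str, response: str, steps: list[dict],
              model_version: str, path: str = "traces.jsonl") -> None:
    """Append one structured trace per request to a JSONL file."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "request": request,
        "response": response,
        # each step: a retrieval or tool call, with its own latency
        "steps": steps,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

start = time.time()
answer = my_model_call("where should I eat")
log_trace(
    request="where should I eat",
    response=answer,
    steps=[
        {"type": "tool_call", "name": "restaurant_search", "latency_ms": 212},
        {"type": "generation", "latency_ms": round((time.time() - start) * 1000)},
    ],
    model_version="model-2024-08-01",
)
```

In production you'd swap the file for your logging pipeline, but the record shape (request, response, per-step latency, model version) is the part that matters.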
With tracing in place, there are three levels of evals:
Unit Tests
Strict, code-based assertions that run fast and cheap on every change. Does the JSON parse? Did it call the right tool? Is there a hallucinated UUID in the output?
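For example, here's a minimal pytest-style sketch covering exactly those three checks. `run_feature`, the tool name, and the output schema are stand-ins for your own system:

```python
import json
import re

UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)

def run_feature(user_input: str) -> str:
    """Stand-in for your system under test; replace with your real pipeline."""
    return '{"tool": "restaurant_search", "args": {"query": "ramen"}}'

# IDs that actually exist in your database (hypothetical values)
KNOWN_IDS = {"0b8f8c2e-1111-4222-8333-444455556666"}

def test_output_parses_as_json():
    json.loads(run_feature("find me a ramen place"))  # raises on broken JSON

def test_correct_tool_selected():
    output = json.loads(run_feature("find me a ramen place"))
    assert output["tool"] == "restaurant_search"

def test_no_hallucinated_ids():
    output = run_feature("find me a ramen place")
    for candidate in UUID_RE.findall(output):
        assert candidate in KNOWN_IDS, f"model invented ID {candidate}"
```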
Break your AI's surface area into features and scenarios, and write scoped assertions for each. You don't need production data: use an LLM to generate synthetic test cases and run assertions against those. These won't catch everything, but they filter out the obvious stuff so your more expensive evaluation layers can focus on the more complicated cases.
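A sketch of that generation step, using the OpenAI client as one example. Any LLM client works; the model name and prompt wording are assumptions:

```python
import json
from openai import OpenAI

client = OpenAI()

def run_feature(user_input: str) -> str:
    """Stand-in for your system under test (same as the unit-test sketch)."""
    return '{"tool": "restaurant_search", "args": {"query": "ramen"}}'

def generate_synthetic_cases(feature: str, n: int = 20) -> list[str]:
    """Ask a model for varied user inputs for one feature or scenario."""
    prompt = (
        f"Generate {n} realistic user messages for this feature: {feature}. "
        "Include misspellings, ambiguous phrasing, and boundary cases. "
        "Return one message per line, with no numbering."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any cheap, capable model will do
        messages=[{"role": "user", "content": prompt}],
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip() for line in lines if line.strip()]

# Run every synthetic input through the same structural assertions
# your unit tests already define.
for case in generate_synthetic_cases("restaurant search"):
    json.loads(run_feature(case))
```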
The key discipline here is treating these like real tests, not one-off scripts. They should run in CI. They should block merges when they fail. If your AI feature doesn't have unit tests, you don't have a feature. You have a prototype.
LLM as a Judge
This is the real meat. Unit tests catch structural failures. But what about when the output parses fine, calls the right tool, and is still wrong? That's where you need something that can evaluate natural language quality. And that means using an LLM to judge the output of your LLM.
Best practices:
- Find one domain expert. Not a committee. One person whose judgment defines "good." Their pass/fail decisions become the ground truth you align your LLM judge against.
- Binary pass/fail only. No 1-5 scales. Scales are ambiguous, unactionable, and don't correlate with what actually matters. Pass or fail. That's it.
- Require written critiques with every judgment. The critique is where the real value lives. It forces the expert to articulate why, and those critiques become your few-shot examples for the LLM judge prompt.
- Ignore off-the-shelf metrics. Generic "helpfulness" or "coherence" scores from evaluation frameworks create a false sense of measurement. Let failure patterns emerge from your actual data, not from someone else's defaults.
- Use the most powerful model you can afford. The judge runs offline, so it can be slower and more expensive than your production model. Advanced reasoning capability matters more here than latency.
- Build the judge iteratively. Start with the expert's critiques as examples in the prompt, run the LLM judge on the same data, compare agreement, and iterate on the prompt until convergence (see the sketch below).
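Putting those practices together, here's a minimal judge sketch. The few-shot critiques, prompt wording, and model name are all illustrative; your expert's real critiques go where the placeholder examples are:

```python
from openai import OpenAI

client = OpenAI()

# The expert's pass/fail decisions and written critiques become few-shot
# examples. These two are invented placeholders.
FEW_SHOT = """\
Input: where should I eat
Output: I recommend Luigi's, rated 4.8 stars.
Judgment: FAIL
Critique: Luigi's is not in our venue database; the rating is invented.

Input: find me a ramen place
Output: Menya Ramen on 5th Ave is open until 10pm.
Judgment: PASS
Critique: Venue exists in the database and the opening hours are correct.
"""

JUDGE_PROMPT = """You are evaluating a restaurant assistant's answers.
Judge strictly: PASS only if the answer is grounded and correct.

Expert judgments to calibrate against:
{few_shot}

Now judge this one.
Input: {input}
Output: {output}

Reply with exactly one word, PASS or FAIL, then a one-line critique."""

def judge(user_input: str, model_output: str) -> tuple[bool, str]:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: the strongest model you can afford; it runs offline
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                few_shot=FEW_SHOT, input=user_input, output=model_output
            ),
        }],
    )
    text = resp.choices[0].message.content.strip()
    verdict, _, critique = text.partition("\n")
    return verdict.strip().upper().startswith("PASS"), critique.strip()
```

To measure convergence, run `judge` over the expert-labelled sample and track how often the two agree; keep iterating on the prompt until that agreement is high enough to trust.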
A/B Testing
Only appropriate once unit tests and LLM-as-judge are solid. This is standard product experimentation applied to AI. Split traffic between two versions of your system and measure which one performs better against your eval criteria.
The prerequisite is that you already have reliable evals telling you what "better" means. Without that, you're A/B testing blind. Most teams aren't ready for this until their product is mature and they've already squeezed significant gains from the first two levels. Don't jump here early.
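The mechanics of the split are simple; the measurement is the hard part. Here's a minimal sketch of deterministic traffic assignment (the function and experiment names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.5) -> str:
    """Deterministically bucket a user into control or treatment.

    Hashing the user ID with the experiment name keeps assignment sticky
    across sessions and independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform in [0, 1]
    return "treatment" if bucket < split else "control"

# Route each request, then score both arms against the same eval
# criteria your unit tests and LLM judge already define.
variant = assign_variant("user-123", "new-system-prompt")
```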
04 Tooling
You don't need to build this infrastructure from scratch. The evals ecosystem has matured and there are solid options depending on your needs.
Braintrust is strong on experiment tracking and comparing eval runs side by side. If your workflow is heavy on iteration, it's a good fit. Promptfoo is open source and CLI-first, which makes it easy to drop into existing CI pipelines. Langfuse is open source and self-hostable, which matters if you have data residency requirements. It handles tracing, scoring, and dataset management in one place. LangSmith provides tracing and a prompt playground from the LangChain ecosystem, but doesn't require LangChain to use.
Pick one and start. The tool matters far less than actually looking at your data. A spreadsheet with pass/fail judgments beats a fancy dashboard nobody opens.
05 Where to Start
If you're reading this and your AI feature is already in production without evals, start by instrumenting traces. Get visibility into what your system is actually doing on real requests. Then write unit tests for the obvious structural failures. These are cheap, fast, and they'll catch the things that should never have shipped.
Next, get one domain expert making pass/fail judgments with written critiques on a sample of real outputs. Use those critiques to build an LLM judge. Run it on every change. Track your pass rate over time.
The delta between a working demo and a reliable production system isn't closed by better prompts. It's closed by knowing what's breaking, how often, and whether your changes are making it better or worse. The teams shipping reliable AI products aren't using secret models or magic frameworks. They're just measuring. Systematically, repeatedly, and honestly. It's the same discipline that underpins any secure, scalable software. Don't overcomplicate it, but don't skip it either.