How We Use Harness Design to Make Pandy Reliable
Tools, rules, and feedback loops do more for reliability than raw model power. Here's how that plays out inside Pandy.
April 17, 2026
Reliable AI systems are built on strong scaffolding around the model. Labs like OpenAI and Anthropic call this harness design, and it's what separates a flashy demo from a system you can trust in production.
Two recent posts lay this out well. Anthropic wrote about a three-agent system (planner, generator, evaluator) for long coding sessions, and OpenAI shipped a real product whose code was written entirely by an agent. The domains differ, but the lesson is the same: tools, rules, and feedback loops do more for reliability than raw model power.
Here's a tour of how that plays out under the hood in Pandy.
How a Pandy reply happens
When a customer message lands in Pandy, the AI enters a tool loop. The model picks a tool - search the knowledge base, look up an order, check a customer's plan, verify identity, send a reply - runs it, reads the result, and decides what to do next. A loop like this runs on every turn of a chat. If it hits its iteration ceiling, it gives up and escalates the chat to a human; that ceiling exists so customers never sit waiting on an AI that's stuck.
Most replies don't really need many turns. A simple FAQ might take two: one knowledge base search, one reply. A Shopify order question might take four or five: order lookup, history check, generating an address change OTP, sending the reply. The loop adapts to whatever the request actually needs.
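Stripped to its essentials, the loop described above looks something like the sketch below. `run_tool_loop`, `Action`, and the ceiling of 8 are illustrative stand-ins, not Pandy's actual internals.

```python
from dataclasses import dataclass

MAX_ITERATIONS = 8  # assumed ceiling; the real limit is an internal setting

@dataclass
class Action:
    kind: str       # "tool" or "reply"
    name: str = ""  # which tool to run when kind == "tool"
    text: str = ""  # reply text when kind == "reply"

def run_tool_loop(message, decide, tools, escalate):
    """Feed tool results back to the model until it replies or hits the ceiling."""
    context = [("customer", message)]
    for _ in range(MAX_ITERATIONS):
        action = decide(context)               # model picks the next step
        if action.kind == "reply":
            return action.text                 # done: send the reply
        result = tools[action.name]()          # run the chosen tool
        context.append((action.name, result))  # read the result, go again
    return escalate(message)                   # ceiling hit: hand off to a human
```

A simple FAQ really does take two turns in this shape: `decide` asks for one knowledge base search, reads the hit, and replies.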
When the conversation needs structure: procedures and tasks
A lot of what customers ask for goes beyond Q&A. Cancellations, refunds, address changes, and identity verification (in banking, say) are full workflows: information has to be collected and specific steps taken before the work counts as done. Pandy handles all of these through something we call a procedure.
A procedure is a contract between us and the AI. It spells out exactly what information needs to be collected from the customer, which tools must run, and what counts as done. Once the AI enters procedure mode, it commits to the contract and can't drift off into small talk or decide halfway through that the work is good enough. The procedure system tracks every field collected and every tool executed, and won't mark the work complete until the contract is satisfied.
Active procedure: Subscription Cancellation

Required fields
- Account email: sarah@example.com
- Reason: switching providers
- Identity verified: OTP confirmed

Required actions
- Offer retention deal: declined 30% off
- Cancel subscription: effective immediately
- Prorate refund: $14.20 to original card
- Send confirmation: email sent to customer
The pause-and-resume part is what makes this feel human. Customers don't follow scripts. They start a return, then ask about shipping on something else, then come back ten minutes later. The procedure record sits in the database with pending progress, and when the customer circles back, the AI can pick up exactly where it left off without losing context.
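The contract can be pictured as a small record that refuses to complete until everything is satisfied. The `Procedure` class and its method names below are hypothetical, not Pandy's real schema:

```python
class Procedure:
    """A contract: required fields to collect, required tools to run."""

    def __init__(self, name, required_fields, required_actions):
        self.name = name
        self.required_fields = set(required_fields)
        self.required_actions = set(required_actions)
        self.fields = {}           # values collected from the customer
        self.actions_done = set()  # tools that have actually executed

    def collect(self, field, value):
        self.fields[field] = value

    def record_action(self, action):
        self.actions_done.add(action)

    def is_complete(self):
        # Not done until every field is collected AND every action has run.
        return (self.required_fields <= set(self.fields)
                and self.required_actions <= self.actions_done)
```

Because the record is plain state, persisting it mid-conversation and reloading it when the customer circles back is exactly the pause-and-resume behavior.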
Inside any single conversation, there's a second discipline: tasks. If a customer asks for three things at once - "cancel my subscription, refund last month, and send me an invoice" - the AI puts them into a task list with one rule: only one task is in progress at a time. Models tend to start everything in parallel and leave work unfinished, so forcing sequential execution is how we keep a multi-request chat from turning into a tangle of half-done work.
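The one-at-a-time rule is simple enough to sketch; this queue is illustrative rather than Pandy's implementation:

```python
from collections import deque

class TaskList:
    """Holds a customer's requests; only the head task may be worked on."""

    def __init__(self, requests):
        self.queue = deque(requests)
        self.done = []

    def in_progress(self):
        # The single task the AI is allowed to work on right now.
        return self.queue[0] if self.queue else None

    def complete_current(self):
        # Finishing the head task is the only way to unlock the next one.
        self.done.append(self.queue.popleft())
```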
What we decided not to leave to the AI
A customer interaction is part judgment and part logic. The harness mindset is about being deliberate about which is which.
Before the AI sees a message, Pandy's workflow engine runs. It checks rules - is the customer on the premium plan, does the message contain a refund keyword, is the chat coming from a specific channel - and acts on them by routing the chat to a team, setting priority, adding tags, or opening a Jira issue. Some workflows block the AI from responding entirely (if a senior agent should handle this customer, the bot doesn't take over). Others run async without blocking the customer experience. These routing decisions are deterministic, so we handle them in code on the way in.
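In spirit, the engine is a rule table evaluated in plain code before the model gets a turn; the rule shape below is an assumption for illustration:

```python
def run_workflows(chat, rules):
    """Apply deterministic routing rules before the model sees the message."""
    block_ai = False
    for rule in rules:
        if rule["when"](chat):      # e.g. premium plan, refund keyword
            rule["then"](chat)      # route, tag, set priority, open a ticket...
            block_ai = block_ai or rule.get("blocks_ai", False)
    return block_ai  # True: a human handles this chat and the bot stays out

# Example rules in the (hypothetical) shape the engine expects:
rules = [
    {"when": lambda c: "refund" in c["message"].lower(),
     "then": lambda c: c["tags"].append("refund")},
    {"when": lambda c: c["plan"] == "premium",
     "then": lambda c: c.update(team="senior-agents"),
     "blocks_ai": True},
]
```

Keeping this layer deterministic means the same message always routes the same way, which is exactly what you want from routing.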
Context is the other half of this. The OpenAI harness post puts it this way: "What the agent can't see doesn't exist." In practice, that means a model with no context to draw on will confidently make up the answer.
Context assembled before the AI responds:
- Message: "Where's my order?"
- Language: English
- Timezone: EST (UTC-5)
- Customer: Premium, 14 months
- Knowledge base: 3 articles ready
- Procedure: none active
- History: 30-day chain
Pandy's prompt builder assembles this context on every turn. Replies come back in the customer's language. The AI uses the customer's timezone, so when it says "tomorrow at 9am" it means the customer's tomorrow. Active procedures tell it when it's mid-cancellation. Knowledge base hits are pre-fetched on the first message of a chat, so the AI has facts ready before it even tries to use them. Conversation history is chained for thirty days so longer chats don't lose earlier context. None of this was designed in the abstract: each feature exists because we hit a real conversation where its absence produced a wrong answer.
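A prompt builder along these lines gathers each piece before the model runs. The keys and the 30-item window are illustrative (the text describes a 30-day chain, approximated here with a fixed-size slice):

```python
def build_context(chat):
    """Gather everything the model should see for this turn (sketch)."""
    customer = chat["customer"]
    return {
        "language": customer["language"],           # reply in the customer's language
        "timezone": customer["timezone"],           # "tomorrow at 9am" means THEIR tomorrow
        "procedure": chat.get("active_procedure"),  # e.g. mid-cancellation awareness
        "kb_articles": chat.get("kb_hits", []),     # pre-fetched on the first message
        "history": chat["history"][-30:],           # stand-in for the 30-day chain
    }
```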
How we know it's working
Pandy has two layers of feedback for catching when the AI gets something wrong.
The first is observability. Every Pandy reply stores the full tool call trace, the raw model input, the raw model output, the LLM response ID, and token counts. When a customer or an operator says "this reply was wrong," we don't have to guess. We can replay the whole sequence and see exactly where it went off. Without this, an AI system is a black box that produces complaints you can't trace.
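One stored trace per reply might look roughly like this; the field names echo the list above, but the schema itself is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ReplyTrace:
    """Everything needed to replay one AI reply after the fact."""
    model_input: str                                # raw prompt sent to the model
    model_output: str                               # raw completion that came back
    response_id: str                                # the LLM response ID
    token_counts: dict                              # prompt / completion tokens
    tool_calls: list = field(default_factory=list)  # full tool call trace, in order

    def replay_steps(self):
        # Walk the sequence exactly as it happened to find where it went off.
        return [("input", self.model_input), *self.tool_calls,
                ("output", self.model_output)]
```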
The second is evaluation. Pandy ships with a chat evaluator that grades conversations on resolution rate, sentiment, knowledge base usage, response quality, and a handful of other dimensions. It runs on a separate model with a low temperature so the scores stay stable across runs. The evaluator powers post-chat analytics and is crucial for finding gaps in the knowledge base and training the team.
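Conceptually, the evaluator is a second, low-temperature model looped over scoring dimensions. The dimension names come from the text; `grade` is a stand-in for that model call:

```python
DIMENSIONS = ["resolution", "sentiment", "kb_usage", "response_quality"]

def evaluate_chat(transcript, grade):
    """Score a finished conversation on each dimension.

    `grade` is the separate low-temperature model call, stubbed here.
    """
    return {dim: grade(transcript, dim) for dim in DIMENSIONS}
```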
Our evaluator doesn't yet run inside the live response loop. The Anthropic harness uses its evaluator in real time - it grades the generator's work, fails sprints that don't meet the bar, and forces the generator to try again. Running our evaluator the same way would let us catch weak replies while they can still be rewritten, and that's the next piece of the harness we're building.
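Moving the evaluator into the live loop would look something like this generate-grade-retry sketch; the quality bar, attempt budget, and function shapes are all assumptions:

```python
QUALITY_BAR = 0.8   # assumed minimum score before a reply goes out
MAX_ATTEMPTS = 3    # assumed retry budget before sending the best draft we have

def reply_with_evaluator(message, generate, evaluate):
    """Grade each draft in real time and regenerate while it misses the bar."""
    draft = generate(message, feedback=None)
    for _ in range(MAX_ATTEMPTS - 1):
        score = evaluate(draft)
        if score >= QUALITY_BAR:
            break                                  # good enough: send it
        draft = generate(message, feedback=score)  # try again, informed by the grade
    return draft
```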
Why all this matters if you're shopping for a CX platform
The point of all this: simply bolting an LLM onto a CX platform is asking for reputational damage. The model is the easiest part of the system these days - most teams have swapped models several times in the past two years. The harness around the model is what determines whether the system holds up as your support load grows. The tools, the procedures, the workflows, the context layer, the observability, and the evaluation are the parts doing the actual work.
Production handles cases a demo never sees, and the scaffolding is what makes those cases tractable. That's where we've put our time.
Want to see how the harness holds up?
Try Pandy free or book a walkthrough of the internals.