The LLM Shipping Checklist: From Demo to Production (Without Getting Burned)


The Demo That Lied

It always starts the same way.

You build a tiny LLM feature in a weekend:

  • a support copilot that drafts replies
  • a “chat with your docs” box
  • an agent that fills forms, triages bugs, or writes SQL

It works beautifully in the demo.

Then you ship.

And within a week you get the three messages that haunt every LLM product:

  1. “It gave me the wrong answer confidently.”
  2. “It used my words but said something I never approved.”
  3. “Why did my bill jump from $40 to $900?”

The model didn’t get worse.

Reality got wider.

More users, more edge cases, more ambiguous prompts, more retries, more concurrency, more context… and suddenly your “cool AI thing” is just another production system.

So here’s the checklist I wish every team used before calling an LLM feature “done.”


0) Decide what you’re shipping: assistant vs. system

There are two very different products that get lumped together as “an LLM feature.”

  • Assistant UX: the model suggests, the human decides (drafts, summaries, copilots)
  • System UX: the model acts and changes state (agents, automation, tool-calling)

If a human is always the final gate, you can tolerate more uncertainty.

If the model can do things (send an email, change a record, run a workflow), you need guardrails like you’re designing a payments system.


1) Make outputs boring: constrain the format

The fastest way to reduce “LLM chaos” is to stop asking for freeform text when you don’t need it.

If your downstream code expects a structure, demand a structure.

  • Use JSON schemas / structured outputs when available
  • Keep fields small and typed
  • Validate everything like it came from the internet (because it did)
example/validation.ts
// Pseudocode: `llm.generate` stands in for whatever client SDK you use
const result = await llm.generate({
  prompt: userMessage,
  schema: {
    type: 'object',
    required: ['category', 'confidence', 'next_action'],
    properties: {
      category: { type: 'string' },
      confidence: { type: 'number' },
      next_action: { type: 'string' },
    },
  },
})

// Treat the parsed object as untrusted input: check it before acting on it
if (result.confidence < 0.7) {
  // fall back, ask a human, or route differently
}
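
If you want a concrete runtime check on top of whatever structured-output mode your provider offers, here's a minimal sketch using zod (assuming zod is in your stack; `rawModelOutput` is the JSON object your provider returned):

example/validate-with-zod.ts
// Sketch: re-validate the model's output at runtime before your code trusts it
import { z } from 'zod'

declare const rawModelOutput: unknown // placeholder: the parsed JSON from the model

const TriageSchema = z.object({
  category: z.string(),
  confidence: z.number().min(0).max(1),
  next_action: z.string(),
})

const parsed = TriageSchema.safeParse(rawModelOutput)

if (!parsed.success) {
  // schema violation: retry, fall back, or route to a human
} else if (parsed.data.confidence < 0.7) {
  // low confidence: same story
}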

Freeform text is great for humans.

Production systems love boring.


2) Treat prompts like code (version them, test them)

Prompts are not “copy.” They’re logic.

So give them the same treatment:

  • store prompts in the repo
  • add a prompt version to every request
  • write tests for the behavior you care about

A simple pattern:

  • prompts/support_draft/v3.txt
  • log prompt_version=v3 with every output
  • roll forward/back like a feature flag

When something goes weird, “Which prompt was live?” should be answerable in one query.
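
A minimal sketch of that pattern (the `llm` client, `logger`, `userMessage`, and `requestId` are placeholders for your own stack, not a specific library):

example/prompt-version.ts
// Sketch: load a versioned prompt from the repo and stamp the version on every request
import { readFileSync } from 'node:fs'

const PROMPT_VERSION = 'v3'
const systemPrompt = readFileSync(`prompts/support_draft/${PROMPT_VERSION}.txt`, 'utf8')

const response = await llm.generate({
  system: systemPrompt,
  prompt: userMessage,
})

// Log the version next to the output so "which prompt was live?" is one query away
logger.info({ prompt_version: PROMPT_VERSION, request_id: requestId })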


3) Build evals early (or you’ll debug in production)

Classic software has unit tests.

LLM software needs evals.

Not academic benchmarks—product-specific evals:

  • “Does it cite sources when it claims facts?”
  • “Does it refuse to answer when docs are missing?”
  • “Does it follow our style guide?”
  • “Does it ever leak secrets from the prompt?”

Start small:

  • 25–100 golden prompts
  • expected labels/outputs
  • a judge model or simple rule-based checks

Then run them:

  • on every prompt change
  • when you switch models
  • on a nightly schedule

The goal isn’t perfection.

The goal is knowing when you made it worse.
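
A rule-based runner doesn't need a framework. A minimal sketch (the golden cases, the citation regex, and the `llm` client are illustrative, not a real dataset or API):

example/evals.ts
// Sketch: run a small golden set with rule-based checks on every prompt change
type GoldenCase = { prompt: string; mustCite: boolean }

const goldenSet: GoldenCase[] = [
  { prompt: 'What is our refund window?', mustCite: true },
  { prompt: 'Summarize this ticket: ...', mustCite: false },
]

async function runEvals(): Promise<void> {
  let failures = 0
  for (const c of goldenSet) {
    const output = await llm.generate({ prompt: c.prompt })
    const hasCitation = /\[source:/.test(output.text) // however your citations are formatted
    if (c.mustCite && !hasCitation) failures += 1
  }
  console.log(`evals: ${goldenSet.length - failures}/${goldenSet.length} passed`)
}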


4) RAG is not magic. It’s plumbing.

“Let’s add RAG” is the 2026 version of “let’s add microservices.”

Sometimes it’s correct.

Sometimes it just makes the system harder to reason about.

If you do retrieval, treat it like a search product:

  • chunking strategy matters
  • metadata matters (source, timestamp, permissions)
  • ranking matters
  • freshness matters

And the biggest rule:

Never let retrieval bypass authorization.

If a user can’t read a document in your product, the model shouldn’t be able to quote it either.
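
Concretely, that means re-checking permissions on every retrieved chunk instead of trusting the index. A sketch, assuming a `vectorStore` and a `canRead` permission check that come from your own product:

example/retrieval-authz.ts
// Sketch: retrieval results are filtered by the caller's permissions before the model sees them
async function retrieveForUser(userId: string, query: string) {
  const candidates = await vectorStore.search(query, { topK: 20 })

  const allowed = []
  for (const doc of candidates) {
    // Re-check authorization per document; the vector index is not an access-control system
    if (await canRead(userId, doc.metadata.source)) allowed.push(doc)
  }

  return allowed.slice(0, 5)
}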


5) Add a refusal mode (the feature users respect most)

The best LLM products have a superpower:

They say “I don’t know” quickly.

A practical pattern:

  • ask for an answer and a confidence score
  • require citations when answering from knowledge
  • if missing citations or confidence low → refuse or ask a clarifying question
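
A minimal sketch of that gate (thresholds and field names are illustrative):

example/refusal.ts
// Sketch: refuse (or ask a clarifying question) when citations are missing or confidence is low
interface Answer {
  text: string
  confidence: number
  citations: string[]
}

function shouldRefuse(answer: Answer): boolean {
  return answer.confidence < 0.6 || answer.citations.length === 0
}

// In the handler: if shouldRefuse(answer), return "I don't know" or a clarifying question
// instead of the draft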

In other words:

  • don’t optimize for “always answer”
  • optimize for “never mislead”

Users forgive refusal.

They do not forgive confident nonsense.


6) Put a budget on tokens (before your CFO does)

LLMs are the first feature where “one more paragraph” has a direct cost.

So put guardrails around spend:

  • cap context window growth
  • cap tool calls per request
  • cap retries
  • cap streaming duration
  • compress history (summarize)

And log:

  • tokens in/out
  • latency
  • tool-call count
  • model name + version

If you can’t answer “what does this feature cost per 1,000 uses?” you’re not shipping—you’re gambling.
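
A sketch of the caps and the per-request log line, with illustrative numbers and placeholder names (`logger`, `usage`, `modelName`, and friends are stand-ins for your own instrumentation):

example/budget.ts
// Sketch: hard caps per request, plus the fields you need to price the feature later
const BUDGET = {
  maxInputTokens: 8_000,
  maxOutputTokens: 1_000,
  maxToolCalls: 5,
  maxRetries: 2,
  maxStreamSeconds: 60,
}

// One log line per request answers "what does this cost per 1,000 uses?"
logger.info({
  model: modelName,              // model + version that actually served the request
  prompt_version: PROMPT_VERSION,
  tokens_in: usage.inputTokens,
  tokens_out: usage.outputTokens,
  tool_calls: toolCallCount,
  latency_ms: latencyMs,
})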


7) Design for retries and duplicates (yes, again)

LLM requests time out.

Users hit refresh.

Workers retry.

So if the model can trigger side effects, you need idempotency:

  • idempotency keys
  • dedupe for tool calls
  • stored responses for repeats
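
A sketch of the tool-call side, assuming an `idempotencyStore` and a `mailer` that are stand-ins for your own infrastructure:

example/idempotency.ts
// Sketch: the same request id always produces at most one side effect
async function sendEmailOnce(requestId: string, email: { to: string; body: string }) {
  const key = `email:${requestId}`

  const existing = await idempotencyStore.get(key)
  if (existing) return existing            // a retry or a refresh: return the stored result

  const result = await mailer.send(email)  // the actual side effect runs once
  await idempotencyStore.set(key, result)
  return result
}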

LLM agents are just distributed systems wearing a new hat.


8) Observe the right things (not just “it failed”)

For LLM features, you want observability at three layers:

a) Request layer

  • prompt version
  • model + parameters
  • retrieval hits
  • tool calls

b) Output layer

  • schema validation failures
  • refusal rate
  • citation coverage
  • toxicity / policy flags (if relevant)

c) Product layer

  • user edits (how much did they change the draft?)
  • acceptance rate
  • time saved
  • escalation rate

The metric “LLM success rate” is meaningless.

The metric “users accept the draft without edits” is product truth.
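
The simplest version of that product metric is a diff between the draft and what the user actually sent (names are illustrative):

example/acceptance.ts
// Sketch: "accepted without edits" means the user sent the draft unchanged
function draftAccepted(draft: string, sent: string): boolean {
  return draft.trim() === sent.trim()
}

// Aggregate over a day or a week: acceptedCount / draftsShown is the number to watch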


9) Have a “kill switch”

At some point, the model will:

  • degrade
  • change behavior
  • hit a provider outage
  • trip a safety filter
  • start responding in Shakespearean riddles (it happens)

You need a switch to:

  • fall back to a smaller/cheaper model
  • disable certain tools
  • force refusal mode
  • turn the feature off
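
In practice this is just a flag check in front of every model call. A sketch, with a hypothetical `featureFlags` client and flag names:

example/kill-switch.ts
// Sketch: every model call goes through one gate that ops can flip without a deploy
async function resolveRuntimeConfig() {
  const flags = await featureFlags.get('llm_support_draft')

  return {
    enabled: flags.enabled,                                           // turn the feature off
    model: flags.degraded ? flags.fallbackModel : flags.primaryModel, // smaller/cheaper fallback
    tools: flags.toolsDisabled ? [] : allTools,                       // disable risky tools
    forceRefusal: flags.refusalOnly,                                  // answer only "I don't know"
  }
}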

This is not paranoia.

This is how you sleep.


Closing: The Trend Isn’t “LLMs.” The Trend Is Reliability.

Anyone can ship a demo.

The teams that win in 2026 ship LLM features that:

  • behave predictably
  • fail safely
  • cost what they’re supposed to cost
  • earn user trust over time

LLMs are new.

Production is not.
