The LLM Shipping Checklist: From Demo to Production (Without Getting Burned)


The Demo That Lied

It always starts the same way.

You build a tiny LLM feature in a weekend:

  • a support copilot that drafts replies
  • a “chat with your docs” box
  • an agent that fills forms, triages bugs, or writes SQL

It works beautifully in the demo.

Then you ship.

And within a week you get the three messages that haunt every LLM product:

  1. “It gave me the wrong answer confidently.”
  2. “It used my words but said something I never approved.”
  3. “Why did my bill jump from $40 to $900?”

The model didn’t get worse.

Reality got wider.

More users, more edge cases, more ambiguous prompts, more retries, more concurrency, more context… and suddenly your “cool AI thing” is just another production system.

So here’s the checklist I wish every team used before calling an LLM feature “done.”


0) Decide what you’re shipping: assistant vs. system

There are two very different products that get lumped together as “an LLM feature.”

  • Assistant UX: the model suggests, the human decides (drafts, summaries, copilots)
  • System UX: the model acts and changes state (agents, automation, tool-calling)

If a human is always the final gate, you can tolerate more uncertainty.

If the model can do things (send an email, change a record, run a workflow), you need guardrails like you’re designing a payments system.


1) Make outputs boring: constrain the format

The fastest way to reduce “LLM chaos” is to stop asking for freeform text when you don’t need it.

If your downstream code expects a structure, demand a structure.

  • Use JSON schemas / structured outputs when available
  • Keep fields small and typed
  • Validate everything like it came from the internet (because it did)
example/validation.ts
// Pseudocode: `llm.generate` stands in for whatever client SDK you use
const result = await llm.generate({
  prompt: userMessage,
  schema: {
    type: 'object',
    required: ['category', 'confidence', 'next_action'],
    properties: {
      category: { type: 'string' },
      confidence: { type: 'number' },
      next_action: { type: 'string' },
    },
  },
})

// Treat the parsed object as untrusted input: check it before acting on it
if (result.confidence < 0.7) {
  // fall back, ask a human, or route differently
}
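
If you want a concrete runtime check on top of whatever structured-output mode your provider offers, here's a minimal sketch using zod (assuming zod is in your stack; `rawModelOutput` is the JSON object your provider returned):

example/validate-with-zod.ts
// Sketch: re-validate the model's output at runtime before your code trusts it
import { z } from 'zod'

declare const rawModelOutput: unknown // placeholder: the parsed JSON from the model

const TriageSchema = z.object({
  category: z.string(),
  confidence: z.number().min(0).max(1),
  next_action: z.string(),
})

const parsed = TriageSchema.safeParse(rawModelOutput)

if (!parsed.success) {
  // schema violation: retry, fall back, or route to a human
} else if (parsed.data.confidence < 0.7) {
  // low confidence: same story
}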

Freeform text is great for humans.

Production systems love boring.


2) Treat prompts like code (version them, test them)

Prompts are not “copy.” They’re logic.

So give them the same treatment:

  • store prompts in the repo
  • add a prompt version to every request
  • write tests for the behavior you care about

A simple pattern:

  • prompts/support_draft/v3.txt
  • log prompt_version=v3 with every output
  • roll forward/back like a feature flag

When something goes weird, “Which prompt was live?” should be answerable in one query.
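
A minimal sketch of that pattern (the `llm` client, `logger`, `userMessage`, and `requestId` are placeholders for your own stack, not a specific library):

example/prompt-version.ts
// Sketch: load a versioned prompt from the repo and stamp the version on every request
import { readFileSync } from 'node:fs'

const PROMPT_VERSION = 'v3'
const systemPrompt = readFileSync(`prompts/support_draft/${PROMPT_VERSION}.txt`, 'utf8')

const response = await llm.generate({
  system: systemPrompt,
  prompt: userMessage,
})

// Log the version next to the output so "which prompt was live?" is one query away
logger.info({ prompt_version: PROMPT_VERSION, request_id: requestId })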


3) Build evals early (or you’ll debug in production)

Classic software has unit tests.

LLM software needs evals.

Not academic benchmarks—product-specific evals:

  • “Does it cite sources when it claims facts?”
  • “Does it refuse to answer when docs are missing?”
  • “Does it follow our style guide?”
  • “Does it ever leak secrets from the prompt?”

Start small:

  • 25–100 golden prompts
  • expected labels/outputs
  • a judge model or simple rule-based checks

Then run them:

  • on every prompt change
  • when you switch models
  • on a nightly schedule

The goal isn’t perfection.

The goal is knowing when you made it worse.
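
A rule-based runner doesn't need a framework. A minimal sketch (the golden cases, the citation regex, and the `llm` client are illustrative, not a real dataset or API):

example/evals.ts
// Sketch: run a small golden set with rule-based checks on every prompt change
type GoldenCase = { prompt: string; mustCite: boolean }

const goldenSet: GoldenCase[] = [
  { prompt: 'What is our refund window?', mustCite: true },
  { prompt: 'Summarize this ticket: ...', mustCite: false },
]

async function runEvals(): Promise<void> {
  let failures = 0
  for (const c of goldenSet) {
    const output = await llm.generate({ prompt: c.prompt })
    const hasCitation = /\[source:/.test(output.text) // however your citations are formatted
    if (c.mustCite && !hasCitation) failures += 1
  }
  console.log(`evals: ${goldenSet.length - failures}/${goldenSet.length} passed`)
}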


4) RAG is not magic. It’s plumbing.

“Let’s add RAG” is the 2026 version of “let’s add microservices.”

Sometimes it’s correct.

Sometimes it just makes the system harder to reason about.

If you do retrieval, treat it like a search product:

  • chunking strategy matters
  • metadata matters (source, timestamp, permissions)
  • ranking matters
  • freshness matters

And the biggest rule:

Never let retrieval bypass authorization.

If a user can’t read a document in your product, the model shouldn’t be able to quote it either.
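
Concretely, that means re-checking permissions on every retrieved chunk instead of trusting the index. A sketch, assuming a `vectorStore` and a `canRead` permission check that come from your own product:

example/retrieval-authz.ts
// Sketch: retrieval results are filtered by the caller's permissions before the model sees them
async function retrieveForUser(userId: string, query: string) {
  const candidates = await vectorStore.search(query, { topK: 20 })

  const allowed = []
  for (const doc of candidates) {
    // Re-check authorization per document; the vector index is not an access-control system
    if (await canRead(userId, doc.metadata.source)) allowed.push(doc)
  }

  return allowed.slice(0, 5)
}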


5) Add a refusal mode (the feature users respect most)

The best LLM products have a superpower:

They say “I don’t know” quickly.

A practical pattern:

  • ask for an answer and a confidence score
  • require citations when answering from knowledge
  • if missing citations or confidence low → refuse or ask a clarifying question
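
A minimal sketch of that gate (thresholds and field names are illustrative):

example/refusal.ts
// Sketch: refuse (or ask a clarifying question) when citations are missing or confidence is low
interface Answer {
  text: string
  confidence: number
  citations: string[]
}

function shouldRefuse(answer: Answer): boolean {
  return answer.confidence < 0.6 || answer.citations.length === 0
}

// In the handler: if shouldRefuse(answer), return "I don't know" or a clarifying question
// instead of the draft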

In other words:

  • don’t optimize for “always answer”
  • optimize for “never mislead”

Users forgive refusal.

They do not forgive confident nonsense.


6) Put a budget on tokens (before your CFO does)

LLMs are the first feature where “one more paragraph” has a direct cost.

So put guardrails around spend:

  • cap context window growth
  • cap tool calls per request
  • cap retries
  • cap streaming duration
  • compress history (summarize)

And log:

  • tokens in/out
  • latency
  • tool-call count
  • model name + version

If you can’t answer “what does this feature cost per 1,000 uses?” you’re not shipping—you’re gambling.
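
A sketch of the caps and the per-request log line, with illustrative numbers and placeholder names (`logger`, `usage`, `modelName`, and friends are stand-ins for your own instrumentation):

example/budget.ts
// Sketch: hard caps per request, plus the fields you need to price the feature later
const BUDGET = {
  maxInputTokens: 8_000,
  maxOutputTokens: 1_000,
  maxToolCalls: 5,
  maxRetries: 2,
  maxStreamSeconds: 60,
}

// One log line per request answers "what does this cost per 1,000 uses?"
logger.info({
  model: modelName,              // model + version that actually served the request
  prompt_version: PROMPT_VERSION,
  tokens_in: usage.inputTokens,
  tokens_out: usage.outputTokens,
  tool_calls: toolCallCount,
  latency_ms: latencyMs,
})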


7) Design for retries and duplicates (yes, again)

LLM requests time out.

Users hit refresh.

Workers retry.

So if the model can trigger side effects, you need idempotency:

  • idempotency keys
  • dedupe for tool calls
  • stored responses for repeats
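
A sketch of the tool-call side, assuming an `idempotencyStore` and a `mailer` that are stand-ins for your own infrastructure:

example/idempotency.ts
// Sketch: the same request id always produces at most one side effect
async function sendEmailOnce(requestId: string, email: { to: string; body: string }) {
  const key = `email:${requestId}`

  const existing = await idempotencyStore.get(key)
  if (existing) return existing            // a retry or a refresh: return the stored result

  const result = await mailer.send(email)  // the actual side effect runs once
  await idempotencyStore.set(key, result)
  return result
}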

LLM agents are just distributed systems wearing a new hat.


8) Observe the right things (not just “it failed”)

For LLM features, you want observability at three layers:

a) Request layer

  • prompt version
  • model + parameters
  • retrieval hits
  • tool calls

b) Output layer

  • schema validation failures
  • refusal rate
  • citation coverage
  • toxicity / policy flags (if relevant)

c) Product layer

  • user edits (how much did they change the draft?)
  • acceptance rate
  • time saved
  • escalation rate

The metric “LLM success rate” is meaningless.

The metric “users accept the draft without edits” is product truth.
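
The simplest version of that product metric is a diff between the draft and what the user actually sent (names are illustrative):

example/acceptance.ts
// Sketch: "accepted without edits" means the user sent the draft unchanged
function draftAccepted(draft: string, sent: string): boolean {
  return draft.trim() === sent.trim()
}

// Aggregate over a day or a week: acceptedCount / draftsShown is the number to watch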


9) Have a “kill switch”

At some point, the model will:

  • degrade
  • change behavior
  • hit a provider outage
  • trip a safety filter
  • start responding in Shakespearean riddles (it happens)

You need a switch to:

  • fall back to a smaller/cheaper model
  • disable certain tools
  • force refusal mode
  • turn the feature off
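
In practice this is just a flag check in front of every model call. A sketch, with a hypothetical `featureFlags` client and flag names:

example/kill-switch.ts
// Sketch: every model call goes through one gate that ops can flip without a deploy
async function resolveRuntimeConfig() {
  const flags = await featureFlags.get('llm_support_draft')

  return {
    enabled: flags.enabled,                                           // turn the feature off
    model: flags.degraded ? flags.fallbackModel : flags.primaryModel, // smaller/cheaper fallback
    tools: flags.toolsDisabled ? [] : allTools,                       // disable risky tools
    forceRefusal: flags.refusalOnly,                                  // answer only "I don't know"
  }
}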

This is not paranoia.

This is how you sleep.


Closing: The Trend Isn’t “LLMs.” The Trend Is Reliability.

Anyone can ship a demo.

The teams that win in 2026 ship LLM features that:

  • behave predictably
  • fail safely
  • cost what they’re supposed to cost
  • earn user trust over time

LLMs are new.

Production is not.
