LLMOps: The Missing Playbook
MLOps was built for a world where models are trained, tested, versioned, and deployed. LLMs broke that world.
You don't train a foundation model. You don't own the weights. You don't control when the provider ships a new version. Your "code" is a natural language prompt that behaves differently depending on the temperature, the context window, and whatever silent update the provider rolled out last Tuesday.
MLOps gave us pipelines. LLMOps needs something fundamentally different: an operating discipline for systems you don't fully control.
This is the playbook I wish I had when I started building LLM-powered production systems. Everything here is hard-won — from real deployments, real failures, and real costs.
In traditional MLOps, you control the model — you train it, you version it, you deploy it. In LLMOps, the model is a black box behind an API that changes without notice. This inverts the operational model: instead of controlling the model and monitoring the data, you control the prompts and monitor the model. Everything else follows from this inversion.
The LLMOps Stack
The stack has three layers: development (prompts and evals), deployment (the gateway and serving path), and operations (observability and cost control). Most teams only build the middle one, deployment, and wonder why their systems degrade over time.
Prompt Management Is Version Control
Your prompts are production code. Treat them that way.
The single biggest operational mistake I see in LLM systems is prompts stored as string literals inside application code. This creates three problems:
- No audit trail. When a prompt changes, you can't diff it against the previous version or correlate it with quality changes.
- No rollback path. If a prompt update degrades quality, you need to revert the application deployment — not just the prompt.
- No separation of concerns. The people who should be iterating on prompts (domain experts, product managers) can't do so without a code deployment.
The Prompt Registry Pattern
Store prompts in a dedicated registry — a database, a config service, or even a Git repo with a CI/CD pipeline. Each prompt has:
- A unique identifier (e.g., summarize-email-v3)
- The template with variable placeholders
- Metadata: model target, temperature, max tokens, stop sequences
- An eval score: the last measured performance on your eval suite
- A rollback pointer: the previous version to revert to
The simplest prompt registry that actually works: a Git repo where each prompt is a YAML file. CI runs evals on every PR. Merge to main triggers deployment to a config service your app reads from. This gives you version history, code review, automated testing, and rollback — all with tools your team already knows. Don't over-engineer this. A Git repo beats a custom database 90% of the time.
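As a concrete illustration, here is what one registry entry and its loader might look like in Python with PyYAML. The field names mirror the list above; the file layout, the model name, and the loader itself are assumptions for illustration, not any particular tool's format.

```python
# A minimal sketch of a prompt registry entry stored as YAML and loaded at
# startup. Field names mirror the registry fields described above.
from dataclasses import dataclass, field

import yaml  # PyYAML

PROMPT_YAML = """
id: summarize-email-v3
template: |
  Summarize the following email in three bullet points.

  Email:
  {email_body}
model: gpt-4o-mini          # hypothetical model target
temperature: 0.2
max_tokens: 256
stop: []
eval_score: 0.87            # last measured score on the eval suite
rollback_to: summarize-email-v2
"""

@dataclass
class PromptVersion:
    id: str
    template: str
    model: str
    temperature: float
    max_tokens: int
    stop: list = field(default_factory=list)
    eval_score: float = 0.0
    rollback_to: str | None = None

    def render(self, **variables: str) -> str:
        # Fill the template's placeholders, e.g. {email_body}.
        return self.template.format(**variables)

def load_prompt(yaml_text: str) -> PromptVersion:
    return PromptVersion(**yaml.safe_load(yaml_text))

prompt = load_prompt(PROMPT_YAML)
print(prompt.render(email_body="Quarterly numbers attached..."))
```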
Evals Are Your Test Suite
In traditional software, you write unit tests. In LLMOps, you write evals.
An eval is a structured test that measures whether your LLM system produces acceptable outputs for a set of known inputs. This sounds simple. It is profoundly difficult in practice, because "acceptable" is subjective and the output space is enormous.
The Eval Taxonomy
Layer 1: Deterministic evals. Exact match, regex, JSON schema validation. These are fast, cheap, and should cover your structured outputs entirely. If your system extracts dates, the date should be parseable. If it classifies sentiment, the output should be one of the valid labels.
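A minimal sketch of Layer 1 checks, assuming a sentiment classifier and a JSON date extractor like the examples above; the check functions are illustrative and can be driven by pytest or a plain CI script.

```python
# Layer 1 deterministic evals: a label check and a date-parsing check.
import json
from datetime import datetime

VALID_LABELS = {"positive", "negative", "neutral"}

def check_sentiment_label(model_output: str) -> bool:
    # The output must be exactly one of the allowed labels.
    return model_output.strip().lower() in VALID_LABELS

def check_extracted_date(model_output: str) -> bool:
    # The output must be valid JSON with a parseable ISO-8601 date field.
    try:
        payload = json.loads(model_output)
        datetime.fromisoformat(payload["date"])
        return True
    except (json.JSONDecodeError, KeyError, ValueError):
        return False

assert check_sentiment_label("Positive")
assert check_extracted_date('{"date": "2024-07-01"}')
assert not check_extracted_date('{"date": "next Tuesday"}')
```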
Layer 2: Semantic evals. Embedding similarity, ROUGE, BERTScore. These measure whether the meaning of the output is close to the expected answer, even if the exact wording differs. Essential for summarization, paraphrasing, and any generative task.
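A minimal sketch of an embedding-similarity check, assuming the sentence-transformers package is installed; the model name and the 0.8 threshold are illustrative starting points, not recommendations from this playbook.

```python
# Layer 2 semantic eval: cosine similarity between embeddings of the model
# output and a reference answer.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_match(output: str, reference: str, threshold: float = 0.8) -> bool:
    vecs = _model.encode([output, reference])
    a, b = vecs[0], vecs[1]
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return similarity >= threshold

print(semantic_match(
    "The meeting moved to Friday afternoon.",
    "The meeting has been rescheduled to Friday.",
))
```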
Layer 3: LLM-as-Judge. Use a stronger model (or the same model with a carefully crafted rubric) to evaluate the output. This is increasingly standard and surprisingly effective — but it introduces its own calibration challenges (see: Dutch Books).
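A minimal LLM-as-judge sketch, using the OpenAI Python SDK as one example provider; the judge model, rubric wording, and 1-5 scale are assumptions you would calibrate against periodic human review.

```python
# LLM-as-judge: a stronger model grades a summary against its source using
# a fixed rubric and returns an integer score.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """You are grading a summary against its source text.
Score 1-5: 5 = faithful and complete, 3 = partially correct, 1 = wrong or fabricated.
Reply with only the integer score."""

def judge_summary(source: str, summary: str, judge_model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=judge_model,       # hypothetical choice of judge model
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Source:\n{source}\n\nSummary:\n{summary}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```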
Layer 4: Human review. Expensive and slow, but irreplaceable for calibrating your automated judges. Run human evals periodically to ensure your LLM judges haven't drifted.
The workflow I've found most effective is eval-driven development: define what "good" looks like before you touch the prompt, measure a baseline, and ship only when the eval improves. This is TDD for LLMs. It sounds obvious. Almost nobody does it. The teams that do are the ones whose systems don't mysteriously degrade at 3am.
The Eval-Driven Development Loop
- Write the eval first. Before you touch the prompt, define what "good" looks like with 20-50 test cases.
- Measure the baseline. Run the current prompt against the eval suite. Record the score.
- Iterate on the prompt. Make changes. Run the eval. Repeat.
- Ship only if the eval improves (or holds steady on a regression suite). A CI gate can enforce this automatically; see the sketch after this list.
- Monitor the eval in production. Sample 1-5% of production traffic and run the eval continuously.
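A sketch of the CI gate that closes this loop: run the suite, compare the score to the recorded baseline, and block the merge on a regression. The file paths, the run_system hook, and the exact-match scorer are placeholders for your own system and checks.

```python
# CI gate for the eval-driven development loop: fail the build if the
# rolling eval score regresses past a tolerance.
import json
import sys
from pathlib import Path

BASELINE_FILE = Path("evals/baseline.json")   # e.g. {"score": 0.86}
CASES_FILE = Path("evals/cases.json")         # list of {"input": ..., "expected": ...}
MAX_REGRESSION = 0.02                          # tolerate at most a 2-point drop

def run_system(case_input: str) -> str:
    raise NotImplementedError("call your LLM system (via the gateway) here")

def score(output: str, expected: str) -> float:
    # Exact match by default; swap in semantic or judge-based checks.
    return 1.0 if output.strip() == expected.strip() else 0.0

def main() -> int:
    cases = json.loads(CASES_FILE.read_text())
    scores = [score(run_system(c["input"]), c["expected"]) for c in cases]
    current = sum(scores) / len(scores)
    baseline = json.loads(BASELINE_FILE.read_text())["score"]
    print(f"eval score: {current:.3f} (baseline {baseline:.3f})")
    if current < baseline - MAX_REGRESSION:
        print("eval regression: blocking this prompt change")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```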
The Gateway Pattern
Every production LLM system should have a gateway — a single service that sits between your application and the model providers.
The gateway handles:
- Routing: Which model handles which request
- Rate limiting: Respect provider quotas, distribute load
- Caching: Return cached results for repeated queries
- Fallback: If Provider A is down, route to Provider B
- Logging: Capture every request/response for debugging and analytics
- Cost tracking: Tag each request with cost metadata
- Prompt injection defense: Scan inputs before they reach the model
A gateway is not optional infrastructure — it's your control plane. Without it, you can't answer basic questions: "How much are we spending per feature?" "What's our p95 latency?" "What happened when quality dropped last Tuesday?" If you're calling LLM APIs directly from application code, you're flying blind. Every production outage I've debugged in LLM systems could have been caught earlier with a proper gateway.
Open-source options: LiteLLM, Portkey, and Kong's AI Gateway all provide this pattern. Or build your own — it's a thin proxy with middleware, not a complex system.
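For orientation, a stripped-down gateway might look like the sketch below. The provider hooks are placeholders, and a real deployment would use persistent caching and structured log shipping rather than in-process state.

```python
# Gateway pattern sketch: one entry point that adds caching, logging,
# fallback, and per-request tagging around raw model calls.
import hashlib
import json
import logging
import time

log = logging.getLogger("llm_gateway")
_cache: dict[str, str] = {}

def call_primary(prompt: str) -> str:
    raise NotImplementedError("provider A client call")

def call_fallback(prompt: str) -> str:
    raise NotImplementedError("provider B client call")

def complete(prompt: str, feature: str, prompt_version: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:                      # caching
        return _cache[key]

    start = time.monotonic()
    try:
        output = call_primary(prompt)      # routing
        provider = "primary"
    except Exception:
        output = call_fallback(prompt)     # fallback
        provider = "fallback"

    log.info(json.dumps({                  # logging + cost/analytics tags
        "feature": feature,
        "prompt_version": prompt_version,
        "provider": provider,
        "latency_ms": round((time.monotonic() - start) * 1000),
        "prompt_chars": len(prompt),
        "output_chars": len(output),
    }))
    _cache[key] = output
    return output
```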
Observability: What to Measure
LLM observability is different from traditional APM. You're not just measuring latency and error rates. You need to measure semantic quality — whether the system is producing good answers.
The LLMOps Dashboard
The metrics that matter:
| Metric | What it tells you | Alert threshold |
|--------|-------------------|-----------------|
| Latency (p50/p95/p99) | Provider performance | p95 > 2x baseline |
| Token usage (in/out) | Cost efficiency | > 20% increase |
| Error rate | Provider reliability | > 1% |
| Cache hit rate | Caching effectiveness | < 15% (too low) |
| Eval score (rolling) | Output quality | > 5% drop from baseline |
| Confidence distribution | Calibration drift | Bimodal shift |
| Cost per query | Unit economics | > budget threshold |
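One way to encode a subset of these alert rules as configuration; the thresholds mirror the table, and the structure is illustrative rather than tied to any particular monitoring tool.

```python
# Alert rules keyed by metric name; each rule compares the current value
# against a recorded baseline.
ALERT_RULES = {
    "latency_p95_ms": lambda v, base: v > 2 * base,     # p95 > 2x baseline
    "token_usage":    lambda v, base: v > 1.2 * base,   # > 20% increase
    "error_rate":     lambda v, base: v > 0.01,         # > 1%
    "cache_hit_rate": lambda v, base: v < 0.15,         # < 15%
    "eval_score":     lambda v, base: v < 0.95 * base,  # > 5% drop
}

def fire_alerts(current: dict, baseline: dict) -> list[str]:
    return [name for name, breached in ALERT_RULES.items()
            if breached(current[name], baseline[name])]
```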
Drift Detection
The silent killer of LLM systems is drift. It happens in three ways:
- Model drift. The provider updates their model. Your prompts, tuned to the old behavior, start producing different outputs. No error. No alert. Just quietly worse results.
- Data drift. The distribution of user queries changes. Your eval suite, based on historical queries, no longer represents production traffic. Your metrics look fine, but users are unhappy.
- Prompt drift. Someone edits a prompt without running evals. The change looks harmless. It breaks an edge case that shows up three weeks later.
Model drift is the hardest operational problem in LLMOps because it produces no errors. The API returns 200. The response looks plausible. But the behavior has subtly changed. I've seen a provider model update silently change the JSON formatting of extraction outputs, breaking a downstream parser that had worked flawlessly for months. Continuous eval on production traffic is the only reliable defense.
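A minimal sketch of that defense: sample a small fraction of production traffic, score it with the same checks used offline, and alert when the rolling score drops below baseline. The sample rate, window size, and scoring hook are assumptions to tune for your volume.

```python
# Continuous eval on sampled production traffic as a model-drift detector.
import random
from collections import deque

SAMPLE_RATE = 0.02            # evaluate ~2% of production requests
WINDOW = 500                  # rolling window of sampled scores
DRIFT_THRESHOLD = 0.05        # alert on a 5-point drop vs. baseline

_scores: deque[float] = deque(maxlen=WINDOW)

def score(request: str, response: str) -> float:
    raise NotImplementedError("reuse your offline eval checks here")

def alert(message: str) -> None:
    print(f"[ALERT] {message}")   # wire this to your paging system

def observe(request: str, response: str, baseline: float) -> None:
    if random.random() > SAMPLE_RATE:
        return
    _scores.append(score(request, response))
    if len(_scores) == WINDOW:
        rolling = sum(_scores) / WINDOW
        if rolling < baseline - DRIFT_THRESHOLD:
            alert(f"eval drift: rolling {rolling:.2f} vs baseline {baseline:.2f}")
```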
Cost Management
LLM costs are deceptive. They start small and compound quickly with scale.
The cost model is: (input tokens + output tokens) × price per token × number of requests. Every dimension can explode independently:
- A prompt that grows by 200 tokens adds $0.002 per request. At 100K requests/day, that's $200/day — $6,000/month — from a single prompt change.
- A feature that makes 3 LLM calls instead of 1 triples your cost overnight.
- A retry loop on failures can 10x your spend during an incident.
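A worked version of the cost model above, reproducing the 200-extra-token example; the price and volume are the ones used in this section, not universal figures.

```python
# Monthly cost of extra tokens: tokens/request x price/token x requests x days.
def monthly_cost(tokens_per_request: int, price_per_million: float,
                 requests_per_day: int, days: int = 30) -> float:
    per_request = tokens_per_request * price_per_million / 1_000_000
    return per_request * requests_per_day * days

# 200 extra tokens, $10 per million tokens, 100K requests/day:
print(monthly_cost(200, 10.0, 100_000))  # -> 6000.0
```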
The Cost Control Playbook
- Set hard budget limits per service, per day. Trip a circuit breaker before you get a surprise bill (see the sketch after this list).
- Track cost per feature, not just per service. Know which product features are expensive.
- Implement token budgets per request. Cap input + output tokens. Truncate context if needed.
- Cache everything cacheable. Semantic caching with embedding similarity catches ~30% of redundant queries.
- Batch what you can. Batch APIs are 50% cheaper. If latency isn't critical, use them.
- Review monthly. Costs drift upward silently. Monthly reviews catch the creep.
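A minimal sketch of the first item, assuming hypothetical service names and in-process state; in production the spend counter would live in Redis or your metrics store rather than process memory.

```python
# Per-service daily budget that trips a circuit breaker instead of
# producing a surprise bill.
from datetime import date

DAILY_BUDGET_USD = {"email-summarizer": 50.0, "support-bot": 200.0}
_spend: dict[tuple[str, date], float] = {}

class BudgetExceeded(RuntimeError):
    pass

def record_spend(service: str, cost_usd: float) -> None:
    key = (service, date.today())
    _spend[key] = _spend.get(key, 0.0) + cost_usd
    if _spend[key] > DAILY_BUDGET_USD[service]:
        # Trip the breaker: callers should degrade gracefully (cached or
        # canned responses) instead of calling the model.
        raise BudgetExceeded(f"{service} exceeded its daily LLM budget")
```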
A real scenario I've seen: a developer added "additional context" to a system prompt during debugging, forgot to remove it, and it shipped to production. 200 extra tokens per request. At 100K daily requests on a $10/M token model, that's $6,000/month from a forgotten debug string. Token budgets and prompt-level cost tracking would have caught this on day one.
The Checklist
Before you ship an LLM feature to production, verify:
- [ ] Prompts are stored in a versioned registry, not hardcoded
- [ ] An eval suite exists with at least 30 test cases
- [ ] The eval runs in CI — prompt changes are blocked if evals fail
- [ ] A gateway/proxy sits between your app and model providers
- [ ] Fallback providers are configured and tested
- [ ] Every request is logged with: tokens, latency, cost, model, prompt version
- [ ] Caching is enabled for repeated/similar queries
- [ ] Rate limits and budget caps are set per service
- [ ] Production quality is monitored via continuous eval sampling
- [ ] There's a rollback procedure that takes under 5 minutes
- [ ] The team knows who to call when the eval score drops
The Maturity Model
Where does your team sit?
Most teams are at Level 0. Getting to Level 1 takes a week. Getting to Level 2 takes a month. Levels 3 and 4 are where the real competitive advantage lives — and where most of the cost savings come from.
The gap between Level 0 and Level 2 is where 80% of the operational failures happen. Close that gap first. Optimize later.
Getting from Level 0 to Level 1 is the highest-ROI investment in LLMOps. It takes roughly a week: move prompts to a YAML config, write 30 test cases, set up a cost dashboard. That's it. No fancy infrastructure. No ML platform. Just basic hygiene. This single step prevents the majority of production incidents I've seen in LLM systems.
Closing Thought
LLMOps isn't MLOps with a new name. It's a fundamentally different discipline because the core asset — the model — isn't yours. You're operating on rented intelligence.
That means your operational leverage comes from everything around the model: how you prompt it, how you evaluate it, how you route to it, how you cache it, and how you recover when it breaks.
The model is the commodity. The operations are the moat.
Build accordingly.