
The 2026 AI Ops Stack: A Practical Guide to Running AI in Production

What an AI ops stack actually is, what goes in each layer, and how to choose tools that work together. Covers LLMs, observability, vector DBs, agents, deployment, and evaluation.

“AI ops stack” is the 2026 phrase for the collection of tools you actually need to run an AI application in production. The terminology is messy — sometimes called LLMOps, sometimes AI infrastructure, sometimes just “the stack.” Whatever you call it, the components are converging into a recognizable set of layers.

This guide walks through what each layer is, what tools dominate it in 2026, and how to compose them into a working stack. For each layer there’s a deeper article linked.

TL;DR — the eight layers at a glance

  1. LLM provider (the model itself): Claude, GPT, Gemini, self-hosted
  2. Coding assistants (what you build the app with): Cursor, Claude Code, Copilot
  3. Orchestration / agents (multi-step logic): LangGraph, CrewAI, Pydantic AI
  4. MCP / tool integration (LLM-to-data connections): Filesystem, GitHub, Postgres, Slack
  5. Vector DB / retrieval (RAG storage): Pinecone, Weaviate, Qdrant, pgvector
  6. Deployment runtime (model hosting): Modal, Replicate, Together, vLLM
  7. Observability (tracing + cost + quality): Langfuse, LangSmith, Helicone, Arize
  8. Evaluation (quality measurement): Inspect, Promptfoo, custom

Not every application needs all eight:

  • Simple chatbot: layers 1, 2, 7
  • Serious RAG agent: layers 1-3, 5, 7
  • Multi-agent research system: all eight

Start minimal. Add layers as the application's needs grow, not before.

Layer 1: The LLM provider

The model itself. The decision here drives everything downstream.

Commercial APIs (most production):

  • Anthropic Claude (Opus 4.x, Sonnet 4.x, Haiku 4.x) — leads on complex reasoning, long context, agentic workflows
  • OpenAI GPT-5 family — strong all-rounder, biggest ecosystem
  • Google Gemini 2.x — competitive on multimodal, integrated with GCP services
  • Mistral, Cohere, others — niche or regional positions

Self-hosted and open-weight (specific use cases):

  • Llama 3.3 and Llama 4 — Meta’s open-weight models (Llama 4 released April 2025 with Scout and Maverick variants); the license has restrictions but is workable for most uses
  • Qwen 2.5 and 3.x — Alibaba; Apache 2.0 license; strong multilingual
  • DeepSeek family — strong reasoning, very permissive licensing
  • Phi-4 family — Microsoft; small but clever

For local development and indie use cases, see our best local LLMs for MacBook M5 16GB guide. For production, commercial APIs almost always make more sense than self-hosting unless cost or data residency dictates otherwise.
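
For orientation, this is what the top of the stack looks like in code: a single call to a commercial API. A minimal sketch using the Anthropic Python SDK; the model name and prompt are placeholders, not recommendations.

```python
# Minimal sketch: one call to a commercial LLM API via the Anthropic Python SDK.
# The model id and prompt are placeholders; check the provider's docs for
# current model identifiers before relying on them.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)

print(response.content[0].text)
```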

Layer 2: Coding assistants

What your team uses to write the AI application is itself an AI ops decision. The shape of your AI app often determines which coding tool fits best.

  • Cursor — best in-editor experience for AI-augmented coding ($20/month or $16/month annual). See our Cursor pricing breakdown.
  • Claude Code — CLI agent that excels at multi-file refactors and MCP-integrated workflows. See our Claude Code workflow guide.
  • OpenAI Codex CLI — OpenAI’s open-source CLI agent (launched April 2025).
  • GitHub Copilot — Pro $10/month, Pro+ $39/month, Business $19/user/month; the least agentic of the group; widely deployed.

The Claude Code vs Codex CLI vs Cursor Agent comparison covers the real workflow differences.

For most professional teams in 2026: Cursor for the editor + Claude Code for agentic tasks is the common stack.

Layer 3: Orchestration and agent frameworks

When your app does more than one LLM call, you need orchestration logic. The 2026 landscape:

  • LangGraph — most popular; graph-based agent state machines
  • CrewAI — role-based multi-agent patterns; growing
  • Pydantic AI — newer, type-safe, Python-native; well-regarded among Python developers
  • AutoGen — Microsoft; enterprise-leaning
  • Custom Python or TypeScript — most production apps still use this; frameworks are valuable when you have multi-agent or complex state needs

For multi-step workflows or autonomous agents, a framework usually pays back. For “call LLM, parse response, return to user” applications, custom code is simpler.
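
To make the custom-code case concrete, here is a rough sketch of the call-parse-return pattern with no framework at all. It reuses the Anthropic SDK from Layer 1; the prompt, JSON shape, and fallback values are illustrative assumptions.

```python
# Minimal sketch of "call LLM, parse response, return to user" orchestration
# with no framework. Assumes the Anthropic SDK from the Layer 1 example;
# the prompt and expected JSON fields are illustrative, not a fixed schema.
import json
import anthropic

client = anthropic.Anthropic()

def classify_ticket(ticket_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": (
                "Classify this support ticket. Respond with JSON only, "
                'shaped like {"category": "...", "urgency": "low|medium|high"}.\n\n'
                + ticket_text
            ),
        }],
    )
    raw = response.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a safe default instead of crashing on a malformed reply.
        return {"category": "unknown", "urgency": "medium"}

print(classify_ticket("My invoice was charged twice this month."))
```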

Layer 4: MCP and tool integration

The Model Context Protocol (MCP), announced by Anthropic in November 2024, became the standardized way to connect LLMs to external data and tools through 2025-2026. It works across Claude, Cursor, Continue.dev, Cline, Zed, and others — vendor-neutral by design.

For internal AI tooling, MCP is now the default way to connect to filesystems, GitHub, databases, internal APIs. If you’re building developer-facing AI tools, supporting MCP is becoming table stakes.
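
To give a sense of scale, a tool server is often a few dozen lines. A hedged sketch using the FastMCP helper from the official Python SDK (the mcp package); the runbook tool is a made-up example standing in for a real internal data source.

```python
# Sketch of an MCP server exposing one tool, using the FastMCP helper from the
# official Python SDK ("mcp" package). The tool here is a made-up example;
# real servers would wrap your own data sources or APIs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

@mcp.tool()
def lookup_runbook(service: str) -> str:
    """Return the runbook entry for a service (illustrative stub)."""
    runbooks = {"billing": "Restart the worker, then check the dead-letter queue."}
    return runbooks.get(service, "No runbook found for that service.")

if __name__ == "__main__":
    # Serves over stdio so MCP clients (Claude Code, Cursor, etc.) can launch it.
    mcp.run()
```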

Layer 5: Vector DB and retrieval

For Retrieval Augmented Generation (RAG) — pulling relevant context from your own data into the LLM’s prompt — you need a vector store.

Production-grade vector DBs:

  • Pinecone — managed; mature; production-friendly
  • Weaviate — open-source plus managed cloud; flexible schema
  • Qdrant — open-source plus managed cloud; strong performance characteristics
  • Chroma — open-source; great for development, used in production by some teams
  • Postgres with pgvector — for teams that want vector search without adopting a new database
  • Turbopuffer — newer; designed for cost-efficient large-scale retrieval

The Postgres-with-pgvector pattern has grown a lot in 2025-2026 — many teams realize they don’t need a separate vector DB at their actual data volume. For very large embedding collections (50M+ vectors), a dedicated vector DB still wins.
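
For teams weighing the pgvector route, here is roughly what it looks like from Python. A sketch assuming psycopg (v3) and an embedding function you already have; the table name, column names, and vector dimension are placeholders.

```python
# Rough sketch of the Postgres + pgvector pattern: store embeddings in an
# ordinary table and query by cosine distance. Assumes psycopg (v3) and an
# embed() function you already have; names and dimensions are placeholders.
import psycopg

def embed(text: str) -> list[float]:
    raise NotImplementedError("call your embedding provider here")

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts the textual form '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"

with psycopg.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(1536)  -- dimension depends on your embedding model
        )
    """)
    query_vec = to_pgvector(embed("How do refunds work?"))
    # <=> is pgvector's cosine-distance operator; smaller means more similar.
    cur.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (query_vec,),
    )
    for (body,) in cur.fetchall():
        print(body)
```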

Layer 6: Deployment runtime

Where the model actually runs, especially for fine-tuned or open-weight models you serve yourself.

  • Modal — Python-native serverless GPU; popular for inference workloads
  • Replicate — API for running open-source models; pay per second
  • Together AI — hosted inference for Llama, Qwen, others; competitive pricing
  • Anyscale — Ray-based; for complex distributed workloads
  • RunPod, Vast.ai, Lambda Labs — raw GPU rentals; cheapest but more setup
  • vLLM (self-hosted) — high-throughput open-source inference server

For most teams using commercial LLM APIs, this layer is skipped entirely — Anthropic, OpenAI, and Google host the model. The deployment runtime matters when you self-host or serve fine-tuned variants.
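
If you do end up self-hosting, vLLM’s offline API gives a feel for this layer. A minimal sketch; the model name is a placeholder, a capable GPU is assumed, and in production you would more likely run vLLM’s OpenAI-compatible server instead.

```python
# Minimal sketch of self-hosted inference with vLLM's offline API.
# The model name is a placeholder and a suitable GPU is assumed; production
# deployments typically run vLLM's OpenAI-compatible server instead.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize the refund policy in two sentences."], params)
for out in outputs:
    print(out.outputs[0].text)
```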

Layer 7: Observability

Without observability, an LLM application is a black box. You cannot tell whether the model is returning bad answers, whether costs are escalating, or whether latency is degrading. The category exploded in 2024-2025 and has consolidated around several leaders in 2026.

The dedicated guide: Best LLM Observability Tools in 2026.

Quick orientation:

  • Langfuse — open-source-first, full-stack tracing plus eval plus prompt management
  • LangSmith — by LangChain; tightest integration with LangGraph
  • Helicone — proxy-based; minimal code changes
  • Arize Phoenix — strong on eval and drift detection
  • Weights and Biases Weave — extends the W&B platform to LLM workloads
  • Galileo — enterprise-focused; eval-heavy

For most teams: Langfuse (self-hosted) or LangSmith (managed) handles 80% of needs. Helicone is the easiest to bolt onto an existing app.
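
In practice, “minimal code changes” means pointing an existing client at the proxy. A sketch of the Helicone pattern with the OpenAI Python SDK; the gateway URL and auth header follow Helicone’s documented proxy setup, but verify them against the current docs, and the model name is a placeholder.

```python
# Sketch of proxy-based observability: route an existing OpenAI client through
# Helicone by changing the base URL and adding an auth header. The URL and
# header follow Helicone's documented setup; verify against current docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model id
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)
```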

Layer 8: Evaluation

Evaluating LLM applications systematically — beyond “the answer looks right” — is the operational discipline that separates research-quality demos from production reliability.

Eval frameworks:

  • Inspect (UK AI Safety Institute) — open-source; rigorous; growing adoption
  • Promptfoo — open-source; YAML-based; lightweight
  • OpenAI Evals — open-source; OpenAI’s reference framework
  • Custom evals in your observability tool — most production teams end up here

The pattern that works: start with a small golden set of 50-200 examples, run the LLM application against them, score the outputs (manually or with an LLM judge), and gate deployments on eval score regressions. Tools support this; the discipline of maintaining and updating evals is the actual work.
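
Here is a hedged sketch of that golden-set loop in plain Python, with an LLM judge doing the scoring. The dataset, judge prompt, model name, and pass-rate threshold are all assumptions you would tune for your own application.

```python
# Sketch of the golden-set pattern: run the app over known examples, score each
# output with an LLM judge, and fail the build if the pass rate regresses.
# The dataset, judge prompt, model id, and threshold are illustrative assumptions.
import sys
import anthropic

client = anthropic.Anthropic()

GOLDEN_SET = [
    {"input": "My invoice was charged twice.", "expected": "billing"},
    # ... 50-200 examples in practice
]

def app_under_test(text: str) -> str:
    # Replace with a call into your actual application.
    raise NotImplementedError

def judge(inp: str, expected: str, actual: str) -> bool:
    verdict = client.messages.create(
        model="claude-haiku-4-5",  # placeholder model id for a cheap judge
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": f"Input: {inp}\nExpected: {expected}\nActual: {actual}\n"
                       "Does the actual answer match the expected one? Reply YES or NO.",
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("YES")

passed = sum(judge(ex["input"], ex["expected"], app_under_test(ex["input"]))
             for ex in GOLDEN_SET)
rate = passed / len(GOLDEN_SET)
print(f"pass rate: {rate:.0%}")
sys.exit(0 if rate >= 0.9 else 1)  # gate deployments on regressions
```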

The honest decision framework

The temptation is to adopt every layer. The reality is that most teams should run lean and add layers as the application's needs grow.

For a solo developer or side project

  • LLM API (Claude or OpenAI directly)
  • Cursor or Claude Code
  • A vector DB only if doing RAG (Postgres pgvector is fine)
  • Helicone or Langfuse cloud for basic observability
  • Skip everything else

Monthly cost: $30-150 depending on usage.

For a startup with a serious AI product

  • LLM API plus fallback (Claude primary, OpenAI fallback for redundancy)
  • LangGraph or Pydantic AI for orchestration
  • Pinecone or Qdrant for vector storage
  • LangSmith for observability plus eval
  • Promptfoo or Inspect for systematic eval
  • A custom prompt-management layer (Langfuse covers this)

Monthly cost: $500-5,000.

For an enterprise with significant volume

  • Multi-provider LLM routing (Anthropic plus OpenAI plus Azure OpenAI plus Bedrock)
  • A managed orchestration layer (LangGraph plus custom)
  • Dedicated vector DB at production scale (Pinecone or Weaviate)
  • Enterprise observability (Arize, Galileo, or LangSmith Enterprise)
  • Internal eval infrastructure with golden sets per use case
  • MCP servers for internal data sources
  • AI gateway or proxy layer for compliance

Monthly cost: tens of thousands to millions, dominated by LLM API spend.

What’s emerging in late 2026 and beyond

A few categories worth watching:

  • AI gateways — proxies in front of LLM APIs that add caching, fallback, cost control, prompt versioning (Portkey, Helicone, Langfuse all moving here)
  • Long-running agent infrastructure — agents that run for hours unattended with checkpoints (Devin-style; commercializing in 2026)
  • Multi-modal observability — tracing for systems that handle images, audio, video alongside text
  • AI red-teaming and safety tooling — adversarial testing infrastructure for production LLM apps
  • Hosted MCP server marketplaces — paid hosting for trusted MCP integrations

If you’re building a serious AI product, these are worth keeping an eye on, but most aren’t yet mandatory in 2026.

Common mistakes

  • Adopting an agent framework too early — most apps work fine with custom Python or TypeScript orchestration until they don’t
  • Skipping observability — leads to silent regressions, surprise cost spikes, debugging hell
  • Over-investing in a vector DB — Postgres pgvector covers more cases than people realize
  • Not running evals — every team that skips this regrets it within a quarter
  • Locking into one LLM provider — multi-provider routing has matured; lock-in is increasingly unnecessary
  • Building tooling that MCP already standardizes — if you’re connecting LLMs to GitHub, Postgres, Slack, etc., check MCP first

Where to go next

For specific layers, follow the deeper guides linked in each section above.

The AI ops stack is still moving, but the layers are stable. Tools shift; the architecture pattern of LLM plus orchestration plus retrieval plus observability plus evaluation is durable.