
The 2026 AI Ops Stack: A Practical Guide to Running AI in Production

What an AI ops stack actually is, what goes in each layer, and how to choose tools that work together. Covers LLMs, observability, vector DBs, agents, deployment, and evaluation.

“AI ops stack” is the 2026 phrase for the collection of tools you actually need to run an AI application in production. The terminology is messy — sometimes called LLMOps, sometimes AI infrastructure, sometimes just “the stack.” Whatever you call it, the components are converging into a recognizable set of layers.

This guide walks through what each layer is, what tools dominate it in 2026, and how to compose them into a working stack. For each layer there’s a deeper article linked.

TL;DR — the eight layers at a glance

  1. LLM provider (the model itself): Claude, GPT, Gemini, self-hosted
  2. Coding assistants (what you build the app with): Cursor, Claude Code, Copilot
  3. Orchestration / agents (multi-step logic): LangGraph, CrewAI, Pydantic AI
  4. MCP / tool integration (LLM-to-data connections): Filesystem, GitHub, Postgres, Slack
  5. Vector DB / retrieval (RAG storage): Pinecone, Weaviate, Qdrant, pgvector
  6. Deployment runtime (model hosting): Modal, Replicate, Together, vLLM
  7. Observability (tracing + cost + quality): Langfuse, LangSmith, Helicone, Arize
  8. Evaluation (quality measurement): Inspect, Promptfoo, custom

Not every application needs all eight:

  • Simple chatbot: layers 1, 2, 7
  • Serious RAG agent: layers 1-3, 5, 7
  • Multi-agent research system: all eight

Start minimal. Add layers as the application's needs grow, not before.

Layer 1: The LLM provider

The model itself. The decision here drives everything downstream.

Commercial APIs (most production):

  • Anthropic Claude (Opus 4.x, Sonnet 4.x, Haiku 4.x) — leads on complex reasoning, long context, agentic workflows
  • OpenAI GPT-5 family — strong all-rounder, biggest ecosystem
  • Google Gemini 2.x — competitive on multimodal, integrated with GCP services
  • Mistral, Cohere, others — niche or regional positions

Self-hosted and open-weight (specific use cases):

  • Llama 3.3 and Llama 4 — Meta’s open-weight models (Llama 4 released April 2025 with Scout and Maverick variants); the license has restrictions but is workable for most uses
  • Qwen 2.5 and 3.x — Alibaba; Apache 2.0 license; strong multilingual
  • DeepSeek family — strong reasoning, very permissive licensing
  • Phi-4 family — Microsoft; small but clever

For local development and indie use cases, see our best local LLMs for MacBook M5 16GB guide. For production, commercial APIs almost always make more sense than self-hosting unless cost or data residency dictates otherwise.
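
For orientation, this is what the top of the stack looks like in code: a single call to a commercial API. A minimal sketch using the Anthropic Python SDK; the model name and prompt are placeholders, not recommendations.

```python
# Minimal sketch: one call to a commercial LLM API via the Anthropic Python SDK.
# The model id and prompt are placeholders; check the provider's docs for
# current model identifiers before relying on them.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
)

print(response.content[0].text)
```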

Layer 2: Coding assistants

What your team uses to write the AI application is itself an AI ops decision. The shape of your AI app often determines which coding tool fits best.

  • Cursor — best in-editor experience for AI-augmented coding ($20/month or $16/month annual). See our Cursor pricing breakdown.
  • Claude Code — CLI agent that excels at multi-file refactors and MCP-integrated workflows. See our Claude Code workflow guide.
  • OpenAI Codex CLI — OpenAI’s open-source CLI agent (launched April 2025).
  • GitHub Copilot — Pro $10/month, Pro+ $39/month, Business $19/user/month; the least agentic of the group; widely deployed.

The Claude Code vs Codex CLI vs Cursor Agent comparison covers the real workflow differences.

For most professional teams in 2026: Cursor for the editor + Claude Code for agentic tasks is the common stack.

Layer 3: Orchestration and agent frameworks

When your app does more than one LLM call, you need orchestration logic. The 2026 landscape:

  • LangGraph — most popular; graph-based agent state machines
  • CrewAI — role-based multi-agent patterns; growing
  • Pydantic AI — newer, type-safe, Python-native; well-regarded among Python developers
  • AutoGen — Microsoft; enterprise-leaning
  • Custom Python or TypeScript — most production apps still use this; frameworks are valuable when you have multi-agent or complex state needs

For multi-step workflows or autonomous agents, a framework usually pays back. For “call LLM, parse response, return to user” applications, custom code is simpler.
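
To make the custom-code case concrete, here is a rough sketch of the call-parse-return pattern with no framework at all. It reuses the Anthropic SDK from Layer 1; the prompt, JSON shape, and fallback values are illustrative assumptions.

```python
# Minimal sketch of "call LLM, parse response, return to user" orchestration
# with no framework. Assumes the Anthropic SDK from the Layer 1 example;
# the prompt and expected JSON fields are illustrative, not a fixed schema.
import json
import anthropic

client = anthropic.Anthropic()

def classify_ticket(ticket_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": (
                "Classify this support ticket. Respond with JSON only, "
                'shaped like {"category": "...", "urgency": "low|medium|high"}.\n\n'
                + ticket_text
            ),
        }],
    )
    raw = response.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to a safe default instead of crashing on a malformed reply.
        return {"category": "unknown", "urgency": "medium"}

print(classify_ticket("My invoice was charged twice this month."))
```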

Layer 4: MCP and tool integration

The Model Context Protocol (MCP), announced by Anthropic in November 2024, became the standardized way to connect LLMs to external data and tools through 2025-2026. It works across Claude, Cursor, Continue.dev, Cline, Zed, and others — vendor-neutral by design.

For internal AI tooling, MCP is now the default way to connect to filesystems, GitHub, databases, internal APIs. If you’re building developer-facing AI tools, supporting MCP is becoming table stakes.
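
To give a sense of scale, a tool server is often a few dozen lines. A hedged sketch using the FastMCP helper from the official Python SDK (the mcp package); the runbook tool is a made-up example standing in for a real internal data source.

```python
# Sketch of an MCP server exposing one tool, using the FastMCP helper from the
# official Python SDK ("mcp" package). The tool here is a made-up example;
# real servers would wrap your own data sources or APIs.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-docs")

@mcp.tool()
def lookup_runbook(service: str) -> str:
    """Return the runbook entry for a service (illustrative stub)."""
    runbooks = {"billing": "Restart the worker, then check the dead-letter queue."}
    return runbooks.get(service, "No runbook found for that service.")

if __name__ == "__main__":
    # Serves over stdio so MCP clients (Claude Code, Cursor, etc.) can launch it.
    mcp.run()
```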

Layer 5: Vector DB and retrieval

For Retrieval Augmented Generation (RAG) — pulling relevant context from your own data into the LLM’s prompt — you need a vector store.

Production-grade vector DBs:

  • Pinecone — managed; mature; production-friendly
  • Weaviate — open-source plus managed cloud; flexible schema
  • Qdrant — open-source plus managed cloud; strong performance characteristics
  • Chroma — open-source; great for development, used in production by some teams
  • Postgres with pgvector — for teams that want vector search without adopting a new database
  • Turbopuffer — newer; designed for cost-efficient large-scale retrieval

The Postgres-with-pgvector pattern has grown a lot in 2025-2026 — many teams realize they don’t need a separate vector DB at their actual data volume. For very large embedding collections (50M+ vectors), a dedicated vector DB still wins.
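
For teams weighing the pgvector route, here is roughly what it looks like from Python. A sketch assuming psycopg (v3) and an embedding function you already have; the table name, column names, and vector dimension are placeholders.

```python
# Rough sketch of the Postgres + pgvector pattern: store embeddings in an
# ordinary table and query by cosine distance. Assumes psycopg (v3) and an
# embed() function you already have; names and dimensions are placeholders.
import psycopg

def embed(text: str) -> list[float]:
    raise NotImplementedError("call your embedding provider here")

def to_pgvector(vec: list[float]) -> str:
    # pgvector accepts the textual form '[0.1,0.2,...]'
    return "[" + ",".join(str(x) for x in vec) + "]"

with psycopg.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id bigserial PRIMARY KEY,
            body text,
            embedding vector(1536)  -- dimension depends on your embedding model
        )
    """)
    query_vec = to_pgvector(embed("How do refunds work?"))
    # <=> is pgvector's cosine-distance operator; smaller means more similar.
    cur.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 5",
        (query_vec,),
    )
    for (body,) in cur.fetchall():
        print(body)
```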

Layer 6: Deployment runtime

Where the model actually runs, especially for fine-tuned or open-weight models you serve yourself.

  • Modal — Python-native serverless GPU; popular for inference workloads
  • Replicate — API for running open-source models; pay per second
  • Together AI — hosted inference for Llama, Qwen, others; competitive pricing
  • Anyscale — Ray-based; for complex distributed workloads
  • RunPod, Vast.ai, Lambda Labs — raw GPU rentals; cheapest but more setup
  • vLLM (self-hosted) — high-throughput open-source inference server

For most teams using commercial LLM APIs, this layer is skipped entirely — Anthropic, OpenAI, and Google host the model. The deployment runtime matters when you self-host or serve fine-tuned variants.
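
If you do end up self-hosting, vLLM’s offline API gives a feel for this layer. A minimal sketch; the model name is a placeholder, a capable GPU is assumed, and in production you would more likely run vLLM’s OpenAI-compatible server instead.

```python
# Minimal sketch of self-hosted inference with vLLM's offline API.
# The model name is a placeholder and a suitable GPU is assumed; production
# deployments typically run vLLM's OpenAI-compatible server instead.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")  # placeholder model id
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize the refund policy in two sentences."], params)
for out in outputs:
    print(out.outputs[0].text)
```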

Layer 7: Observability

Without observability, an LLM application is a black box. You cannot tell whether the model is returning bad answers, whether costs are escalating, or whether latency is degrading. The category exploded in 2024-2025 and has consolidated around several leaders in 2026.

The dedicated guide: Best LLM Observability Tools in 2026.

Quick orientation:

  • Langfuse — open-source-first, full-stack tracing plus eval plus prompt management
  • LangSmith — by LangChain; tightest integration with LangGraph
  • Helicone — proxy-based; minimal code changes
  • Arize Phoenix — strong on eval and drift detection
  • Weights and Biases Weave — extends the W&B platform to LLM workloads
  • Galileo — enterprise-focused; eval-heavy

For most teams: Langfuse (self-hosted) or LangSmith (managed) handles 80% of needs. Helicone is the easiest to bolt onto an existing app.
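
In practice, “minimal code changes” means pointing an existing client at the proxy. A sketch of the Helicone pattern with the OpenAI Python SDK; the gateway URL and auth header follow Helicone’s documented proxy setup, but verify them against the current docs, and the model name is a placeholder.

```python
# Sketch of proxy-based observability: route an existing OpenAI client through
# Helicone by changing the base URL and adding an auth header. The URL and
# header follow Helicone's documented setup; verify against current docs.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://oai.helicone.ai/v1",
    default_headers={"Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"},
)

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model id
    messages=[{"role": "user", "content": "Ping"}],
)
print(response.choices[0].message.content)
```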

Layer 8: Evaluation

Evaluating LLM applications systematically — beyond “the answer looks right” — is the operational discipline that separates research-quality demos from production reliability.

Eval frameworks:

  • Inspect (UK AI Safety Institute) — open-source; rigorous; growing adoption
  • Promptfoo — open-source; YAML-based; lightweight
  • OpenAI Evals — open-source; OpenAI’s reference framework
  • Custom evals in your observability tool — most production teams end up here

The pattern that works: start with a small golden set of 50-200 examples, run the LLM application against them, score the outputs (manually or with an LLM judge), and gate deployments on eval score regressions. Tools support this; the discipline of maintaining and updating evals is the actual work.
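
Here is a hedged sketch of that golden-set loop in plain Python, with an LLM judge doing the scoring. The dataset, judge prompt, model name, and pass-rate threshold are all assumptions you would tune for your own application.

```python
# Sketch of the golden-set pattern: run the app over known examples, score each
# output with an LLM judge, and fail the build if the pass rate regresses.
# The dataset, judge prompt, model id, and threshold are illustrative assumptions.
import sys
import anthropic

client = anthropic.Anthropic()

GOLDEN_SET = [
    {"input": "My invoice was charged twice.", "expected": "billing"},
    # ... 50-200 examples in practice
]

def app_under_test(text: str) -> str:
    # Replace with a call into your actual application.
    raise NotImplementedError

def judge(inp: str, expected: str, actual: str) -> bool:
    verdict = client.messages.create(
        model="claude-haiku-4-5",  # placeholder model id for a cheap judge
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": f"Input: {inp}\nExpected: {expected}\nActual: {actual}\n"
                       "Does the actual answer match the expected one? Reply YES or NO.",
        }],
    )
    return verdict.content[0].text.strip().upper().startswith("YES")

passed = sum(judge(ex["input"], ex["expected"], app_under_test(ex["input"]))
             for ex in GOLDEN_SET)
rate = passed / len(GOLDEN_SET)
print(f"pass rate: {rate:.0%}")
sys.exit(0 if rate >= 0.9 else 1)  # gate deployments on regressions
```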

The honest decision framework

The temptation is to adopt every layer. The reality is that most teams should run lean and add layers as the application's needs grow.

For a solo developer or side project

  • LLM API (Claude or OpenAI directly)
  • Cursor or Claude Code
  • A vector DB only if doing RAG (Postgres pgvector is fine)
  • Helicone or Langfuse cloud for basic observability
  • Skip everything else

Monthly cost: $30-150 depending on usage.

For a startup with a serious AI product

  • LLM API plus fallback (Claude primary, OpenAI fallback for redundancy)
  • LangGraph or Pydantic AI for orchestration
  • Pinecone or Qdrant for vector storage
  • LangSmith for observability plus eval
  • Promptfoo or Inspect for systematic eval
  • A custom prompt-management layer (Langfuse covers this)

Monthly cost: $500-5,000.

For an enterprise with significant volume

  • Multi-provider LLM routing (Anthropic plus OpenAI plus Azure OpenAI plus Bedrock)
  • A managed orchestration layer (LangGraph plus custom)
  • Dedicated vector DB at production scale (Pinecone or Weaviate)
  • Enterprise observability (Arize, Galileo, or LangSmith Enterprise)
  • Internal eval infrastructure with golden sets per use case
  • MCP servers for internal data sources
  • AI gateway or proxy layer for compliance

Monthly cost: tens of thousands to millions, dominated by LLM API spend.

What’s emerging in late 2026 and beyond

A few categories worth watching:

  • AI gateways — proxies in front of LLM APIs that add caching, fallback, cost control, prompt versioning (Portkey, Helicone, Langfuse all moving here)
  • Long-running agent infrastructure — agents that run for hours unattended with checkpoints (Devin-style; commercializing in 2026)
  • Multi-modal observability — tracing for systems that handle images, audio, video alongside text
  • AI red-teaming and safety tooling — adversarial testing infrastructure for production LLM apps
  • Hosted MCP server marketplaces — paid hosting for trusted MCP integrations

If you’re building a serious AI product, these are worth keeping an eye on, but most aren’t yet mandatory in 2026.

Common mistakes

  • Adopting an agent framework too early — most apps work fine with custom Python or TypeScript orchestration until they don’t
  • Skipping observability — leads to silent regressions, surprise cost spikes, debugging hell
  • Over-investing in a vector DB — Postgres pgvector covers more cases than people realize
  • Not running evals — every team that skips this regrets it within a quarter
  • Locking into one LLM provider — multi-provider routing has matured; lock-in is increasingly unnecessary
  • Building tooling that MCP already standardizes — if you’re connecting LLMs to GitHub, Postgres, Slack, etc., check MCP first

Where to go next

For specific layers, follow the deeper guides linked in each section above.

The AI ops stack is still moving, but the layers are stable. Tools shift; the architecture pattern of LLM plus orchestration plus retrieval plus observability plus evaluation is durable.