How to Choose an Observability Platform in 2026: A 7-Step Decision Framework
A practical framework for choosing between Datadog, Grafana Cloud, New Relic, Better Stack, Dynatrace, and the self-hosted alternatives. Seven concrete decisions that determine which platform actually fits your team.
TL;DR: Choosing an observability platform is mostly about matching the pricing model to your growth trajectory, validating OpenTelemetry support, and being honest about your team’s operational capacity. Feature differentiation matters less than vendors imply; cost models matter more. This guide walks through seven concrete decisions that determine the right answer for your team.
Why This Decision Is Harder Than It Looks
The observability platform decision is one of the most consequential infrastructure choices a team makes, for three reasons. First, the platform becomes deeply embedded in operational workflows — dashboards, alert configurations, runbooks all wrap around it. Second, the cost grows non-linearly with telemetry volume, and telemetry volume grows with the codebase. Third, switching costs are real: a serious migration takes 6-12 months and significant engineering capacity.
This is the decision teams most often regret in retrospect, and the regret is almost always about cost rather than capability. A platform that fit the budget at year one becomes punishingly expensive at year three, after the team has built so much around it that migrating is economically painful.
The framework below is designed to surface these issues upfront, before the contract is signed.
Step 1: Define Your Telemetry Growth Trajectory
Most observability cost models assume linear growth. Real telemetry growth is super-linear because three factors compound:
- Service count growth as the system decomposes into more microservices
- Per-service telemetry growth as engineers add more logging, more spans, more custom metrics
- Traffic growth as the product itself grows
Project telemetry volume at year 1, year 2, and year 3 separately. If you cannot articulate a reasonable estimate, that itself is a finding — you need to measure baseline volume in your current setup before evaluating new platforms.
The output of this step is a number: projected total ingest (in GB or TB/month) at three time horizons. Every cost comparison after this step uses these numbers.
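A minimal sketch of this projection, assuming the three growth factors multiply together each year. The function name, the multiplicative model, and every input value are illustrative assumptions; replace them with measurements from your own baseline.

```python
def project_ingest(baseline_gb, service_growth, per_service_growth, traffic_growth, years=3):
    """Return projected monthly ingest (GB) at the end of each year.

    Growth rates are annual fractions (0.20 means +20% per year).
    Assumption: the three factors multiply, so the combined rate
    outpaces any single one of them.
    """
    annual_factor = (
        (1 + service_growth) * (1 + per_service_growth) * (1 + traffic_growth)
    )
    return [baseline_gb * annual_factor ** year for year in range(1, years + 1)]


# Hypothetical inputs: 3,000 GB/month today, each factor growing 20-30% a year.
projection = project_ingest(
    3000, service_growth=0.20, per_service_growth=0.25, traffic_growth=0.30
)
for year, gb in enumerate(projection, start=1):
    print(f"Year {year}: {gb / 1000:.1f} TB/month")
```

Three individually modest growth rates combine into a near-doubling every year, which is why the year-3 figure, not the year-1 figure, should drive the comparison.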
Step 2: Model Cost Against Each Pricing Structure
Observability vendors use four pricing structures, and each behaves differently as your stack scales:
| Pricing model | Examples | Scales with | Worst-case scenario |
|---|---|---|---|
| Per-host | Datadog, Dynatrace | Infrastructure footprint | Auto-scaling explosions, ephemeral workloads |
| Per-ingested-GB | New Relic, Grafana Cloud | Telemetry verbosity | High-cardinality metrics, verbose logging |
| Flat tier / seat-based | Better Stack, some SigNoz tiers | Team size + included quotas | Hitting tier ceilings as data grows |
| Self-hosted infrastructure | LGTM, SigNoz self-hosted | Cluster size and engineering time | Operational burden if telemetry grows faster than platform team |
Model the projected year-3 cost under each pricing model for each vendor on your shortlist. The variance is often larger than expected — by year 3, a per-host model might cost 4x what a per-GB model would for the same workload, or vice versa, depending on which dimension grows faster in your environment.
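One way to sketch that year-3 comparison is a small function per pricing structure. Every price, tier, host count, and salary figure below is an invented placeholder, not a quote from any vendor; the point is only to run the same projected workload through each model.

```python
def per_host_cost(hosts, price_per_host_month):
    """Annual cost when billing scales with infrastructure footprint."""
    return hosts * price_per_host_month * 12


def per_gb_cost(gb_per_month, price_per_gb):
    """Annual cost when billing scales with ingested volume."""
    return gb_per_month * price_per_gb * 12


def flat_tier_cost(gb_per_month, tiers):
    """Annual cost for the smallest tier that covers the volume.

    tiers: list of (included_gb_per_month, monthly_price), sorted ascending.
    """
    for included_gb, monthly_price in tiers:
        if gb_per_month <= included_gb:
            return monthly_price * 12
    raise ValueError("volume exceeds the largest tier; a custom quote is needed")


def self_hosted_cost(infra_per_month, fte_fraction, fte_annual_cost):
    """Annual cost of infrastructure plus the engineering time to run it."""
    return infra_per_month * 12 + fte_fraction * fte_annual_cost


# Hypothetical year-3 workload and placeholder prices.
year3 = {"hosts": 400, "gb_per_month": 15_000}
costs = {
    "per-host": per_host_cost(year3["hosts"], price_per_host_month=50),
    "per-GB": per_gb_cost(year3["gb_per_month"], price_per_gb=0.50),
    "flat tier": flat_tier_cost(
        year3["gb_per_month"], tiers=[(5_000, 2_000), (20_000, 6_000)]
    ),
    "self-hosted": self_hosted_cost(
        infra_per_month=5_000, fte_fraction=0.3, fte_annual_cost=200_000
    ),
}
for model, annual in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:12s} ${annual:,.0f}/year")
```

Rerunning this with the year-1 and year-2 projections shows whether the ranking holds or flips as hosts and volume grow at different rates.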
Step 3: Verify OpenTelemetry Native Support
OpenTelemetry has become the de facto instrumentation standard. Every major platform now claims OTel support, but the quality varies. The questions to answer:
- Does the platform ingest OTLP (HTTP and gRPC) natively, or does it require a vendor-specific collector in front?
- Do the platform’s core features (APM, RUM, log correlation) work with OTel-instrumented services without the proprietary agent?
- Is OTel publicly stated as the primary instrumentation path in the vendor’s product roadmap, or is it positioned as a secondary import path?
The reason this matters is optionality. If you instrument with OTel and the platform supports it natively, you can switch backends later without re-instrumenting code. If the platform’s best features require the proprietary agent, you have lock-in even on top of OTel-instrumented services.
Step 4: Audit Cost Control Features
This step separates platforms that take cost management seriously from platforms that monetize cost surprises. The cost control features that matter:
- Sampling controls for distributed traces — without sampling, trace ingest scales with traffic times average span count, which compounds fast
- Log filtering before ingest — drop noisy logs at the collector rather than storing them and querying around them
- Configurable retention by signal type — keep traces for 7 days, metrics for 13 months, logs for 30 days, and pay for each independently
- Per-service telemetry budgeting — visibility into which services are generating cost so you can target cleanup
- In-product cost dashboards — track spend in near-real-time, not on the monthly invoice
A platform that hides cost-control features behind enterprise tiers is signaling its pricing philosophy. Take that signal seriously.
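To see why sampling controls sit at the top of that list, here is a rough estimate of trace ingest as traffic times span count, with and without head sampling. The request rate, spans per request, and bytes per span are assumed values for illustration; measure your own before relying on the result.

```python
def monthly_trace_gb(requests_per_second, spans_per_request, bytes_per_span, sample_rate):
    """Estimate monthly trace ingest in GB for a head-sampled workload."""
    if not 0 <= sample_rate <= 1:
        raise ValueError("sample_rate must be between 0 and 1")
    seconds_per_month = 30 * 24 * 3600
    spans = requests_per_second * spans_per_request * seconds_per_month * sample_rate
    return spans * bytes_per_span / 1e9


# Hypothetical service: 2,000 req/s, 25 spans per request, ~500 bytes per span.
unsampled = monthly_trace_gb(2000, 25, 500, sample_rate=1.0)
sampled = monthly_trace_gb(2000, 25, 500, sample_rate=0.1)
print(f"No sampling:  {unsampled:,.0f} GB/month")
print(f"10% sampling: {sampled:,.0f} GB/month")
```

Under these assumptions, traces alone run to tens of terabytes a month without sampling, so a platform that lacks sampling controls (or charges extra for them) changes the cost picture substantially.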
Step 5: Be Honest About Your Operational Capacity
Self-hosted observability is dramatically cheaper than managed at scale, but only if your team can actually operate it. The honest question is: do you have a platform engineering team that already operates other production infrastructure, and can it absorb another distributed system (or four, in the case of the LGTM stack of Loki, Grafana, Tempo, and Mimir)?
| Operational capacity | Right answer |
|---|---|
| No dedicated platform team, observability is one engineer’s part-time job | Managed only (Datadog, Grafana Cloud, New Relic, Better Stack) |
| Small platform team, observability is one full-time owner | Managed primary; self-hosted single-application option (SigNoz, OpenObserve) acceptable |
| Mature platform team operating Kubernetes, databases, message queues | Self-hosted LGTM stack viable; managed still simpler if cost permits |
| Infrastructure-as-product team with deep operational experience | Self-hosted preferred if cost dominates |
The mistake teams make is choosing self-hosted because the licensing cost looks attractive, then absorbing the operational burden without budgeting for it. Six months later the platform team is firefighting the observability stack instead of building product infrastructure.
Step 6: Test Real Workloads in Proof-of-Concept
Vendor demos always look good. Your workload is the only thing that actually predicts behavior. The PoC should include:
- Real ingestion volume: route a representative slice of production telemetry to the candidate platform for at least two weeks
- Real query patterns: have your on-call rotation use the platform during an actual incident, not in a sandbox
- Real cost data: at the end of the PoC, the vendor should provide an accurate cost projection based on observed ingest, not a list-price estimate
- Real integration depth: connect the candidate platform to your existing alerting, on-call, and runbook systems
Two weeks is the minimum. Six weeks is more realistic if the platform will become deeply embedded in operational workflows. Skipping or compressing this step is the single most common source of post-purchase regret.
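As a sanity check on whatever projection the vendor hands back, the observed PoC ingest can be scaled up to a full month of full production traffic. This sketch assumes the routed slice is representative and that volume is roughly steady across the PoC window; the function name and the sample figures are hypothetical.

```python
def extrapolate_monthly_gb(observed_gb, poc_days, traffic_fraction):
    """Scale PoC ingest to a 30-day month at 100% of production traffic."""
    if poc_days <= 0:
        raise ValueError("poc_days must be positive")
    if not 0 < traffic_fraction <= 1:
        raise ValueError("traffic_fraction must be in (0, 1]")
    daily_gb = observed_gb / poc_days
    return daily_gb * 30 / traffic_fraction


# Hypothetical PoC: 140 GB ingested over 14 days from a 10% traffic slice.
estimate = extrapolate_monthly_gb(observed_gb=140, poc_days=14, traffic_fraction=0.10)
print(f"Estimated full-production ingest: {estimate:,.0f} GB/month")
```

If this estimate differs sharply from the baseline used in Step 1, revisit the growth projection before trusting any cost model built on it.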
Step 7: Negotiate Procurement Around What Actually Matters
By the time you reach procurement, you know what the platform costs at your projected scale. The negotiation should focus on:
- Multi-year commitment vs. annual flexibility — multi-year discounts are real, but lock you in. Annual is more expensive but lets you switch.
- Overage pricing — what happens when you exceed your committed volume? Some vendors charge punitive overage rates; others auto-scale at the same unit price.
- Renewal escalators — annual price increases at renewal are common. Negotiate this upfront, not at year two.
- Module bundling — if you need APM, RUM, and log management, bundle pricing is meaningfully better than buying SKUs individually.
- Migration assistance — if you are coming from another platform, vendors will often subsidize migration costs to win the deal.
Procurement negotiation typically saves 15-30% on list price. Going into the conversation with concrete year-3 cost projections from competing vendors is the single strongest leverage point.
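The trade-off between a discounted multi-year commitment and annual renewals with an escalator can be put into numbers before the negotiation starts. This is a simplified sketch that holds usage constant; the list price, discount, and escalator values are assumptions chosen for illustration.

```python
def contract_total(annual_price, years, discount=0.0, escalator=0.0):
    """Total spend over `years`.

    discount:  fraction taken off every year's price (e.g. a multi-year commit).
    escalator: fractional price increase compounding at each renewal.
    """
    total = 0.0
    for year in range(years):
        total += annual_price * (1 - discount) * (1 + escalator) ** year
    return total


# Hypothetical terms on a $200k/year list price.
three_year_commit = contract_total(200_000, years=3, discount=0.20)
annual_renewals = contract_total(200_000, years=3, escalator=0.07)
flexibility_premium = annual_renewals - three_year_commit

print(f"3-year commit:   ${three_year_commit:,.0f}")
print(f"Annual renewals: ${annual_renewals:,.0f}")
print(f"Price of staying flexible: ${flexibility_premium:,.0f}")
```

Framing the difference as the price of flexibility makes it easier to judge whether the option to switch is worth what it costs.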
A Worked Example
A 60-engineer SaaS company in 2026 evaluating Datadog, Grafana Cloud, and self-hosted SigNoz, with 3 TB/month projected to grow to 15 TB/month over three years:
| Vendor | Year 1 cost | Year 3 cost | Total 3-year cost |
|---|---|---|---|
| Datadog (per host + APM + Logs SKUs) | ~$120k | ~$380k | ~$750k |
| Grafana Cloud (per GB, all signals) | ~$45k | ~$185k | ~$370k |
| Self-hosted SigNoz (infra + 0.3 FTE platform engineer) | ~$80k | ~$120k | ~$300k |
These numbers are illustrative, not authoritative; the point is the shape of the variance. Over three years, the most expensive option costs roughly 2.5x the cheapest (about 3.2x in year three alone), and the order can flip depending on which dimension grows fastest: here Grafana Cloud is cheapest in year one, while self-hosted SigNoz is cheapest by year three. Without modeling this explicitly, teams pick based on year-one list pricing, which is the wrong basis for a multi-year decision.
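The ratios in the table can be checked with a few lines of arithmetic. The figures below are copied from the illustrative table (in thousands of dollars); the year-2 values are not stated there and are derived here as the total minus years one and three.

```python
# Costs in $k, taken from the illustrative table above.
costs = {
    "Datadog": {"year1": 120, "year3": 380, "total": 750},
    "Grafana Cloud": {"year1": 45, "year3": 185, "total": 370},
    "Self-hosted SigNoz": {"year1": 80, "year3": 120, "total": 300},
}


def implied_year2(vendor):
    """Year-2 cost implied by the 3-year total."""
    c = costs[vendor]
    return c["total"] - c["year1"] - c["year3"]


def spread(key):
    """Ratio of the most expensive option to the cheapest for one column."""
    values = [c[key] for c in costs.values()]
    return max(values) / min(values)


def cheapest(key):
    """Name of the lowest-cost option for one column."""
    return min(costs, key=lambda vendor: costs[vendor][key])


for vendor in costs:
    print(f"{vendor}: implied year 2 = ~${implied_year2(vendor)}k")
for key in ("year1", "year3", "total"):
    print(f"{key}: cheapest is {cheapest(key)}, spread is {spread(key):.1f}x")
```

The cheapest option differs between the year-1 and year-3 columns, which is the flip that a year-one-only comparison would miss.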
Bottom Line
The most important step in choosing an observability platform is being honest about telemetry growth and cost trajectory, not feature comparison. The features converge; the cost models diverge sharply at scale. Match the pricing model to your growth pattern, validate OpenTelemetry support to preserve optionality, and budget for the operational capacity each option actually requires. Done right, the seven-step process above takes 2-3 months. Done wrong, the regret takes years.