What is the single most important factor when choosing an observability platform?

Total cost over a three-year horizon at projected telemetry volume, not list price. Observability costs grow faster than most teams forecast because telemetry volume grows with code complexity, microservice count, and traffic. A platform that fits today's budget at today's scale may be unaffordable in two years. The most important evaluation discipline is modeling cost at 3x, 5x, and 10x current ingestion volume before committing.

Should I prioritize features or pricing model?

Pricing model, in most cases. Feature differentiation between the major observability platforms has narrowed significantly: most can cover metrics, logs, traces, RUM, and alerting acceptably. Pricing models still vary dramatically. Per-host pricing (Datadog) compounds with infrastructure growth, per-GB-ingested pricing (Grafana Cloud, New Relic) compounds with telemetry verbosity, and flat tiers (Better Stack) hit their quota ceilings as data grows. Match the pricing model to what scales fastest in your environment.

How important is OpenTelemetry support?

Very important if you want optionality. OpenTelemetry-native ingestion means you can switch backends later without re-instrumenting code. Every major platform now claims OTel support, but the quality varies — verify that the platform supports OTLP protocols natively, that core features work without proprietary agents, and that the platform's roadmap publicly commits to OTel as the primary instrumentation path.

When does self-hosting make sense over managed?

Self-hosting wins clearly above roughly 50-100 TB/month of telemetry ingest, or when you already have a platform engineering team operating other infrastructure, or when data residency requirements prevent SaaS adoption. Below those thresholds, the operational cost of self-hosting usually exceeds the licensing cost of a managed platform.

How long should the evaluation process take?

Most teams underestimate this. A serious evaluation includes vendor demos (~2 weeks), proof-of-concept deployment in a non-production environment (~4-6 weeks), real telemetry volume testing (~2 weeks), and procurement (~2-4 weeks). Plan for a 2-3 month evaluation if the decision matters; longer if multiple stakeholder approvals are required.

How to Choose an Observability Platform in 2026: A 7-Step Decision Framework

A practical framework for choosing between Datadog, Grafana Cloud, New Relic, Better Stack, Dynatrace, and the self-hosted alternatives. Seven concrete decisions that determine which platform actually fits your team.

TL;DR: Choosing an observability platform is mostly about matching the pricing model to your growth trajectory, validating OpenTelemetry support, and being honest about your team’s operational capacity. Feature differentiation matters less than vendors imply; cost models matter more. This guide walks through seven concrete decisions that determine the right answer for your team.

Why This Decision Is Harder Than It Looks

The observability platform decision is one of the most consequential infrastructure choices a team makes, for three reasons. First, the platform becomes deeply embedded in operational workflows: dashboards, alert configurations, and runbooks all wrap around it. Second, the cost grows non-linearly with telemetry volume, and telemetry volume grows with the codebase. Third, switching costs are real: a serious migration takes 6-12 months and significant engineering capacity.

This is the decision teams most often regret in retrospect, and the regret is almost always about cost rather than capability. A platform that fit the budget at year one becomes punishingly expensive at year three, after the team has built so much around it that migrating is economically painful.

The framework below is designed to surface these issues upfront, before the contract is signed.

Step 1: Define Your Telemetry Growth Trajectory

Most observability cost models assume linear growth. Real telemetry growth is super-linear because three factors compound:

Service count growth as the system decomposes into more microservices
Per-service telemetry growth as engineers add more logging, more spans, more custom metrics
Traffic growth as the product itself grows

Project telemetry volume at year 1, year 2, and year 3 separately. If you cannot articulate a reasonable estimate, that itself is a finding — you need to measure baseline volume in your current setup before evaluating new platforms.

The output of this step is a number: projected total ingest (in GB or TB/month) at three time horizons. Every cost comparison after this step uses these numbers.
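To make the compounding concrete, here is a minimal sketch of the projection. The function name, the 3,000 GB/month baseline, and the 20% annual rate for each factor are all assumptions chosen for illustration; substitute your own measured baseline and growth estimates.

```python
def project_ingest(baseline_gb, service_growth, per_service_growth, traffic_growth,
                   years=(1, 2, 3)):
    """Project monthly ingest (GB) per year from three compounding annual growth rates."""
    # The three factors multiply, which is why the result is super-linear.
    annual_factor = (1 + service_growth) * (1 + per_service_growth) * (1 + traffic_growth)
    return {year: baseline_gb * annual_factor ** year for year in years}


# Hypothetical inputs: 3,000 GB/month today, each factor growing 20% a year.
projection = project_ingest(3000, 0.20, 0.20, 0.20)
for year, gb in projection.items():
    print(f"year {year}: {gb:,.0f} GB/month")
```

Three factors that each grow 20% a year multiply to roughly 73% annual growth in ingest (1.2 x 1.2 x 1.2 = 1.728), which is why a forecast based on any single factor tends to undershoot.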

Step 2: Model Cost Against Each Pricing Structure

Observability costs follow four structures (three vendor pricing models, plus self-hosting), and each behaves differently as your stack scales:

Pricing model	Examples	Scales with	Worst-case scenario
Per-host	Datadog, Dynatrace	Infrastructure footprint	Auto-scaling explosions, ephemeral workloads
Per-ingested-GB	New Relic, Grafana Cloud	Telemetry verbosity	High-cardinality metrics, verbose logging
Flat tier / seat-based	Better Stack, some SigNoz tiers	Team size + included quotas	Hitting tier ceilings as data grows
Self-hosted infrastructure	LGTM, SigNoz self-hosted	Cluster size and engineering time	Operational burden if telemetry grows faster than platform team

Model the projected year-3 cost under each pricing model for each vendor on your shortlist. The variance is often larger than expected — by year 3, a per-host model might cost 4x what a per-GB model would for the same workload, or vice versa, depending on which dimension grows faster in your environment.
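A rough sketch of that comparison follows. The unit prices and the two scenarios are made-up placeholders, not any vendor's actual rates; the only point is that the cheaper model depends on which dimension dominates.

```python
HOST_PRICE = 30.0  # assumed USD per host per month (placeholder, not a real quote)
GB_PRICE = 0.50    # assumed USD per GB ingested (placeholder, not a real quote)


def per_host_annual(hosts):
    """Annual cost when billing follows the infrastructure footprint."""
    return hosts * HOST_PRICE * 12


def per_gb_annual(gb_per_month):
    """Annual cost when billing follows ingested volume."""
    return gb_per_month * GB_PRICE * 12


# Two hypothetical year-3 shapes for a similar-sized business.
scenarios = {
    "host-heavy": {"hosts": 400, "gb_per_month": 4_000},
    "telemetry-heavy": {"hosts": 80, "gb_per_month": 20_000},
}

results = {
    name: {
        "per_host": per_host_annual(s["hosts"]),
        "per_gb": per_gb_annual(s["gb_per_month"]),
    }
    for name, s in scenarios.items()
}

for name, r in results.items():
    print(f"{name}: per-host ${r['per_host']:,.0f}, per-GB ${r['per_gb']:,.0f}")
```

With these assumed inputs the per-host model is several times more expensive in the host-heavy scenario and several times cheaper in the telemetry-heavy one. Run it with your own Step 1 projections and each vendor's quoted rates.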

Step 3: Verify OpenTelemetry Native Support

OpenTelemetry has become the de facto instrumentation standard. Every major platform now claims OTel support, but the quality varies. The questions to answer:

Does the platform ingest OTLP (HTTP and gRPC) natively, or does it require a vendor-specific collector in front?
Do the platform’s core features (APM, RUM, log correlation) work with OTel-instrumented services without the proprietary agent?
Is OTel publicly stated as the primary instrumentation path in the vendor’s product roadmap, or is it positioned as a secondary import path?

The reason this matters is optionality. If you instrument with OTel and the platform supports it natively, you can switch backends later without re-instrumenting code. If the platform’s best features require the proprietary agent, you have lock-in even on top of OTel-instrumented services.
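One way to picture that optionality: with native OTLP ingestion, switching backends should come down to changing exporter configuration rather than application code. The sketch below builds the standard OpenTelemetry exporter environment variables for two backends. The endpoints, header name, and helper function are hypothetical, and whether a given SDK honors every variable depends on its language and version.

```python
ALLOWED_PROTOCOLS = {"grpc", "http/protobuf", "http/json"}


def otlp_env(endpoint, protocol="http/protobuf", headers=None):
    """Build the OTLP exporter environment variables for one backend."""
    if protocol not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported OTLP protocol: {protocol}")
    env = {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
    }
    if headers:
        # Headers are passed as comma-separated key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{key}={value}" for key, value in headers.items()
        )
    return env


# Hypothetical backends: only the environment differs, not the instrumented code.
vendor_a = otlp_env("https://otlp.vendor-a.example:4318",
                    headers={"x-api-key": "REDACTED"})
vendor_b = otlp_env("https://otlp.vendor-b.example:4317", protocol="grpc",
                    headers={"x-api-key": "REDACTED"})

for name, env in (("vendor A", vendor_a), ("vendor B", vendor_b)):
    print(name, env)
```

If a candidate platform needs more than this kind of configuration change before its core features work, that is the lock-in the questions above are meant to expose.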

Step 4: Audit Cost Control Features

This step separates platforms that take cost management seriously from platforms that monetize cost surprises. The cost control features that matter:

Sampling controls for distributed traces — without sampling, trace ingest scales with traffic times average span count, which compounds fast
Log filtering before ingest — drop noisy logs at the collector rather than storing them and querying around them
Configurable retention by signal type — keep traces for 7 days, metrics for 13 months, logs for 30 days, and pay for each independently
Per-service telemetry budgeting — visibility into which services are generating cost so you can target cleanup
In-product cost dashboards — track spend in near-real-time, not on the monthly invoice

A platform that hides cost-control features behind enterprise tiers is signaling its pricing philosophy. Take that signal seriously.
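The sampling point above is easy to quantify. This is a back-of-the-envelope sketch, assuming a fixed average span size and a 30-day month; the traffic, span count, and byte figures are placeholders rather than measurements.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def monthly_trace_gb(requests_per_sec, spans_per_request, bytes_per_span, sample_rate):
    """Estimate monthly trace ingest in GB for a given head-sampling rate."""
    spans = requests_per_sec * SECONDS_PER_MONTH * spans_per_request * sample_rate
    return spans * bytes_per_span / 1e9


# Hypothetical service: 500 req/s, 20 spans per request, 500 bytes per span.
unsampled = monthly_trace_gb(500, 20, 500, sample_rate=1.0)
sampled = monthly_trace_gb(500, 20, 500, sample_rate=0.10)

print(f"unsampled: {unsampled:,.0f} GB/month")
print(f"10% sampled: {sampled:,.0f} GB/month")
```

Because ingest is the product of traffic and span count, doubling either one doubles the bill unless the sampling rate is adjusted. A platform without sampling controls leaves you with no lever on that product.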

Step 5: Be Honest About Your Operational Capacity

Self-hosted observability is dramatically cheaper than managed at scale — but only if your team can actually operate it. The honest question is: do you have a platform engineering team that operates other production infrastructure, and can it absorb another distributed system (or four)?

Operational capacity	Right answer
No dedicated platform team, observability is one engineer’s part-time job	Managed only (Datadog, Grafana Cloud, New Relic, Better Stack)
Small platform team, observability is one full-time owner	Managed primary; self-hosted single-application option (SigNoz, OpenObserve) acceptable
Mature platform team operating Kubernetes, databases, message queues	Self-hosted LGTM stack viable; managed still simpler if cost permits
Infrastructure-as-product team with deep operational experience	Self-hosted preferred if cost dominates

The mistake teams make is choosing self-hosted because the licensing cost looks attractive, then absorbing the operational burden without budgeting for it. Six months later the platform team is firefighting the observability stack instead of building product infrastructure.
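Budgeting for the operational burden can be as simple as adding the staffing cost to the infrastructure cost before comparing against a managed quote. The sketch below assumes a linear infrastructure cost per TB and a fixed staffing cost that does not grow with volume; both are simplifications, and every number is a placeholder.

```python
def self_hosted_annual(tb_per_month, infra_per_tb_month, staffing_per_year):
    """Infrastructure plus the people who operate it."""
    return tb_per_month * infra_per_tb_month * 12 + staffing_per_year


def managed_annual(tb_per_month, price_per_tb_month):
    """Managed platform billed purely on ingested volume."""
    return tb_per_month * price_per_tb_month * 12


def break_even_tb(infra_per_tb_month, price_per_tb_month, staffing_per_year):
    """Monthly ingest (TB) above which self-hosting becomes cheaper."""
    saving_per_tb_year = (price_per_tb_month - infra_per_tb_month) * 12
    return staffing_per_year / saving_per_tb_year


# Assumed figures: $100/TB infra, $350/TB managed, one engineer at $200k/year.
INFRA, MANAGED, STAFF = 100.0, 350.0, 200_000.0

small_self = self_hosted_annual(10, INFRA, STAFF)
small_managed = managed_annual(10, MANAGED)
threshold = break_even_tb(INFRA, MANAGED, STAFF)

print(f"10 TB/month: self-hosted ${small_self:,.0f} vs managed ${small_managed:,.0f}")
print(f"break-even at about {threshold:.1f} TB/month")
```

Under these assumptions self-hosting loses badly at 10 TB/month because staffing dominates. The break-even volume moves a lot with the staffing figure, which is exactly the line item teams tend to leave out.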

Step 6: Test Real Workloads in Proof-of-Concept

Vendor demos always look good. Your workload is the only thing that actually predicts behavior. The PoC should include:

Real ingestion volume: route a representative slice of production telemetry to the candidate platform for at least two weeks
Real query patterns: have your on-call rotation use the platform during an actual incident, not in a sandbox
Real cost data: at the end of the PoC, the vendor should provide an accurate cost projection based on observed ingest, not a list-price estimate
Real integration depth: connect the candidate platform to your existing alerting, on-call, and runbook systems

Two weeks is the minimum. Six weeks is more realistic if the platform will become deeply embedded in operational workflows. Skipping or compressing this step is the single most common source of post-purchase regret.
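When the PoC only receives a slice of production traffic, the observed ingest has to be scaled up before it is compared with the vendor's projection. A minimal sketch, assuming the slice is representative and volume is roughly flat across the PoC window (the figures are invented):

```python
def project_monthly_gb(observed_gb, poc_days, traffic_fraction, days_per_month=30):
    """Scale PoC ingest to a full month of full production traffic."""
    daily_slice_gb = observed_gb / poc_days
    daily_full_gb = daily_slice_gb / traffic_fraction
    return daily_full_gb * days_per_month


def quote_gap(vendor_quote_gb, projected_gb):
    """Fraction by which the vendor's volume assumption undershoots the projection."""
    return (projected_gb - vendor_quote_gb) / projected_gb


# Hypothetical PoC: 420 GB observed over 14 days from 10% of production traffic.
projected = project_monthly_gb(observed_gb=420, poc_days=14, traffic_fraction=0.10)
gap = quote_gap(vendor_quote_gb=6_000, projected_gb=projected)

print(f"projected full-volume ingest: {projected:,.0f} GB/month")
print(f"vendor quote undershoots by {gap:.0%}")
```

A large gap between the projected figure and the volume the quote is built on is worth resolving before procurement, since overage pricing applies to the difference.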

Step 7: Negotiate Procurement Around What Actually Matters

By the time you reach procurement, you know what the platform costs at your projected scale. The negotiation should focus on:

Multi-year commitment vs. annual flexibility — multi-year discounts are real, but lock you in. Annual is more expensive but lets you switch.
Overage pricing — what happens when you exceed your committed volume? Some vendors charge punitive overage rates; others auto-scale at the same unit price.
Renewal escalators — annual price increases at renewal are common. Negotiate this upfront, not at year two.
Module bundling — if you need APM, RUM, and log management, bundle pricing is meaningfully better than buying SKUs individually.
Migration assistance — if you are coming from another platform, vendors will often subsidize migration costs to win the deal.

Procurement negotiation typically saves 15-30% off list price. Going into the conversation with concrete year-3 cost projections from competing vendors is the single strongest leverage point.
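Overage rates and renewal escalators are easier to negotiate once their effect is priced out. The sketch below assumes a committed annual volume that is paid for even when unused, an escalator that applies every year after the first, and overage billed as a multiple of that year's unit price. The contract terms and volumes are hypothetical.

```python
def contract_cost(committed_gb, unit_price, actual_gb_by_year,
                  overage_multiplier=1.0, escalator=0.0):
    """Total multi-year cost for a committed-volume contract."""
    total = 0.0
    for year_index, actual_gb in enumerate(actual_gb_by_year):
        price = unit_price * (1 + escalator) ** year_index
        overage_gb = max(0, actual_gb - committed_gb)
        # The commitment is billed in full; overage is billed on top of it.
        total += committed_gb * price + overage_gb * price * overage_multiplier
    return total


# Hypothetical usage: 60,000 GB/year committed at $0.50/GB, volume growing past it.
usage = [50_000, 80_000, 120_000]

friendly = contract_cost(60_000, 0.50, usage)
punitive = contract_cost(60_000, 0.50, usage, overage_multiplier=1.5, escalator=0.07)

print(f"same-price overage, no escalator: ${friendly:,.0f}")
print(f"1.5x overage, 7% escalator:       ${punitive:,.0f}")
```

The list price is identical in both cases; the difference comes entirely from the two terms that are usually buried in the contract.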

A Worked Example

Consider a 60-engineer SaaS company in 2026 evaluating Datadog, Grafana Cloud, and self-hosted SigNoz, with ingest of 3 TB/month projected to grow to 15 TB/month over three years:

Vendor	Year 1 cost	Year 3 cost	Total 3-year cost
Datadog (per host + APM + Logs SKUs)	~$120k	~$380k	~$750k
Grafana Cloud (per GB, all signals)	~$45k	~$185k	~$370k
Self-hosted SigNoz (infra + 0.3 FTE platform engineer)	~$80k	~$120k	~$300k

These numbers are illustrative, not authoritative; the point is the shape of the variance. The three-year totals include a year-two cost that the table omits. Over the full three years, the cheapest option costs roughly 2.5x less than the most expensive (about 3x less in year three alone), and the order can flip depending on which dimension grows fastest. Without modeling this explicitly, teams pick based on year-one list pricing, which is the wrong basis for a multi-year decision.
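The table's arithmetic can be checked in a few lines. This sketch uses only the illustrative figures from the table (in thousands of USD) and derives the year-two cost implied by each total; the variable and function names are just for this example.

```python
# Illustrative figures from the table above, in thousands of USD.
costs_kusd = {
    "Datadog": {"year1": 120, "year3": 380, "total": 750},
    "Grafana Cloud": {"year1": 45, "year3": 185, "total": 370},
    "Self-hosted SigNoz": {"year1": 80, "year3": 120, "total": 300},
}

# Year two is not shown in the table but is implied by the three-year total.
implied_year2 = {
    vendor: c["total"] - c["year1"] - c["year3"] for vendor, c in costs_kusd.items()
}


def spread(key):
    """Ratio of the most expensive to the cheapest option for one column."""
    values = [c[key] for c in costs_kusd.values()]
    return max(values) / min(values)


def cheapest(key):
    """Vendor with the lowest cost in one column."""
    return min(costs_kusd, key=lambda vendor: costs_kusd[vendor][key])


for key in ("year1", "year3", "total"):
    print(f"{key}: cheapest is {cheapest(key)}, spread {spread(key):.2f}x")
print("implied year-two costs:", implied_year2)
```

The cheapest option in year one is not the cheapest in year three, which is the ordering flip a year-one comparison hides.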

Bottom Line

The most important step in choosing an observability platform is being honest about telemetry growth and cost trajectory, not feature comparison. The features converge; the cost models diverge sharply at scale. Match the pricing model to your growth pattern, validate OpenTelemetry support to preserve optionality, and budget for the operational capacity each option actually requires. Done right, the seven-step process above takes 2-3 months. Done wrong, the regret takes years.