
Observability for LLM Pipelines: Tracing, Evaluation Metrics, and Per-Request Cost Attribution

May 9, 2026 · 11 min read · By Yogendra Singh

Traditional application observability asks three questions: is it up, is it fast, and are there errors? LLM pipelines need those answers too, but they also need answers to questions that don't arise in conventional systems: is the output quality degrading? Which requests are driving disproportionate cost? Did a prompt change last week cause a regression in a downstream task that wasn't directly tested? Standard APM tools are not designed to answer these questions out of the box, and bolting them on as an afterthought produces gaps that only appear when something goes wrong in production.

What follows is the observability stack I built for a multi-tenant AI reporting pipeline — one that handles thousands of LLM calls per hour across multiple models, tenants, and task types. The tooling is DataDog for metrics and traces, Sentry for error tracking and session replay, and a custom evaluation pipeline for quality metrics. The specific tools matter less than the philosophy: every LLM request should be a first-class observable unit with its own trace, its own cost record, and a quality score wherever feasible.

Structured logging: the foundation everything else builds on

Every LLM call in the system emits two structured JSON log events: one before the call (with request metadata) and one after (with response metadata). The schema is fixed and versioned: `llm_call_id`, `tenant_id`, `task_type`, `model_name`, `prompt_version`, `input_token_count`, `output_token_count`, `latency_ms`, `cache_hit`, `cost_usd`, `error_code` (null on success), and `quality_score` (null when not available). This schema is enforced at the logging call site via a typed Go struct, so a misspelled or wrongly typed field is a compile error, not a runtime surprise.
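To make the schema concrete, here is a minimal sketch of the post-call event. The production code described above is a Go struct; this Python dataclass is only an illustration. The field types, the `SCHEMA_VERSION` constant, and the `to_json` helper are assumptions, since the post names the fields but not their types or how the version is recorded.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

SCHEMA_VERSION = 1  # hypothetical: the post says the schema is versioned, not how


@dataclass(frozen=True)
class LLMCallEvent:
    # Required fields have no defaults, so omitting one raises TypeError
    # when the event is constructed.
    llm_call_id: str
    tenant_id: str
    task_type: str
    model_name: str
    prompt_version: int  # assumed to be an integer that increments
    input_token_count: int
    output_token_count: int
    latency_ms: int
    cache_hit: bool
    cost_usd: float
    # Nullable fields: null on success / null when no score is available.
    error_code: Optional[str] = None
    quality_score: Optional[float] = None

    def to_json(self) -> str:
        """Serialize to one JSON log line, tagged with the schema version."""
        payload = {"schema_version": SCHEMA_VERSION, **asdict(self)}
        return json.dumps(payload, sort_keys=True)


event = LLMCallEvent(
    llm_call_id="call-001",
    tenant_id="tenant-a",
    task_type="report_summary",
    model_name="example-model",
    prompt_version=7,
    input_token_count=1200,
    output_token_count=350,
    latency_ms=1840,
    cache_hit=False,
    cost_usd=0.0042,
)
print(event.to_json())
```

Unlike a compiled language, Python only catches a missing required field when the constructor runs, so a sketch like this would need a test or a type checker to approximate the compile-time guarantee.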

The `prompt_version` field deserves emphasis. When a prompt is updated, the version increments. This means every log event is tagged with the exact prompt version that produced it. If a quality regression appears after a prompt update, we can query logs by prompt version to measure the before/after delta precisely. Without version tracking, prompt regressions are very hard to diagnose because the prompt itself isn't stored in the response — only the output is visible, and the cause is invisible.
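The before/after comparison can be sketched as a simple group-by over log events. This assumes the events are already loaded as dictionaries using the schema field names; the function names are hypothetical, and in practice the same query would more likely run in the log platform itself.

```python
from collections import defaultdict
from statistics import mean


def mean_quality_by_prompt_version(events):
    """Average the non-null quality scores for each prompt version."""
    scores = defaultdict(list)
    for event in events:
        if event.get("quality_score") is not None:
            scores[event["prompt_version"]].append(event["quality_score"])
    return {version: mean(values) for version, values in scores.items()}


def quality_delta(events, old_version, new_version):
    """Mean quality of new_version minus mean quality of old_version."""
    by_version = mean_quality_by_prompt_version(events)
    return by_version[new_version] - by_version[old_version]


events = [
    {"prompt_version": 6, "quality_score": 4.0},
    {"prompt_version": 6, "quality_score": 5.0},
    {"prompt_version": 7, "quality_score": 3.0},
    {"prompt_version": 7, "quality_score": 4.0},
    {"prompt_version": 7, "quality_score": None},  # unscored calls are skipped
]
print(quality_delta(events, old_version=6, new_version=7))
```

A negative delta means the newer prompt version scores lower on average. Filtering by `task_type` first would keep the comparison like-for-like.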

Distributed tracing: LLM calls as spans in the request trace

Each incoming API request generates a DataDog trace with a root span representing the full request lifecycle. LLM calls within that request are child spans with the LLM-specific fields (model, token counts, cost) as span attributes. This means when a report request is slow, the DataDog trace shows exactly which LLM call in the pipeline was slow, what it was asked to do, how many tokens it consumed, and whether it was a cache hit or miss. Without this, 'the report is slow' is a shrug; with it, 'the anomaly narrative generation step took 4.2 seconds and consumed 3,800 output tokens' is an actionable observation.

We propagate trace context across service boundaries using the W3C Trace Context headers (`traceparent` and `tracestate`). The Node.js API gateway starts the trace and passes the trace context to the Go processing service, which passes it on to the Python evaluation microservice. In DataDog, the full cross-service trace is assembled automatically, showing the entire request journey across three language runtimes as a single waterfall. This is non-negotiable for a polyglot microservices architecture: debugging without it means correlating logs manually by timestamp, which is slow and error-prone.
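For reference, the W3C `traceparent` header has the form `version-traceid-parentid-flags`, for example `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`. The sketch below parses an incoming header and builds the header for the next hop. In practice the tracing library does this for you; the helper names are hypothetical, and the parser handles only the four-field version `00` layout.

```python
import re
import secrets

_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return trace_id, parent_id and sampled flag, or None if invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming=None):
    """Build the outgoing header: same trace ID, fresh span ID for this hop."""
    ctx = parse_traceparent(incoming) if incoming else None
    trace_id = ctx["trace_id"] if ctx else secrets.token_hex(16)
    flags = "01" if (ctx is None or ctx["sampled"]) else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Every service keeps the trace ID and replaces the parent ID with its own span ID, which is what lets the backend stitch the hops into a single waterfall.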

Evaluation metrics: measuring quality you can act on

Latency and error rate are necessary but not sufficient for LLM observability. An LLM pipeline can have perfect uptime and consistent latency while producing subtly degraded output that users notice before the dashboard does. Quality metrics require a quality measurement layer, which is the part most teams skip because it's harder to build.

  • Factual consistency score: for report summaries, we extract key numerical claims from the LLM output and verify them against the source data programmatically. A summary claiming a 12% revenue increase when the data shows 8% is a factual error, detectable without human review.
  • Structural validity: for JSON-output tasks, schema validation is a hard quality gate — a response that doesn't parse is a failure, logged as such and never served to the user.
  • Length ratio: outputs that are dramatically shorter or longer than the baseline for their prompt version (more than 2 standard deviations from the baseline mean length) are flagged for human review. Length drift is often a signal of a change in prompt or model behavior.
  • LLM-as-judge: for narrative summaries where correctness is harder to verify mechanically, a smaller, faster model evaluates the summary against a rubric on a 1–5 scale. This adds cost (~$0.001 per evaluation) but produces a trackable quality signal over time.
  • User feedback signal: thumbs up/down on AI-generated summaries is persisted to the database, associated with the specific LLM call ID, and aggregated into a weekly quality report per tenant and per prompt version.
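The three mechanical checks in the list above can be sketched with the standard library. This is a simplified illustration, not the pipeline's actual evaluator: the function names are hypothetical, the factual check only handles percentage claims found by a regular expression, and the 0.5-point tolerance is an assumed value.

```python
import json
import re
from statistics import mean, stdev


def structural_validity(raw_output, required_keys):
    """Hard gate: the output must parse as a JSON object with the required keys."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(k in parsed for k in required_keys)


def length_is_outlier(output_len, baseline_lengths, threshold=2.0):
    """Flag lengths more than `threshold` standard deviations from the baseline mean."""
    mu = mean(baseline_lengths)
    sigma = stdev(baseline_lengths)
    if sigma == 0:
        return output_len != mu
    return abs(output_len - mu) / sigma > threshold


_PERCENT_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s*%")


def unsupported_percent_claims(summary, source_percents, tolerance=0.5):
    """Return percentages stated in the summary that match no source value."""
    claims = [float(value) for value in _PERCENT_RE.findall(summary)]
    return [
        claim for claim in claims
        if not any(abs(claim - source) <= tolerance for source in source_percents)
    ]


summary = "Revenue increased 12% while churn fell 3%."
print(unsupported_percent_claims(summary, source_percents=[8.0, 3.0]))
print(length_is_outlier(130, baseline_lengths=[100, 110, 90, 105, 95]))
print(structural_validity('{"title": "Q3", "body": "..."}', ["title", "body"]))
```

Structural validity is binary and cheap, so it can run on every response. The length and factual checks produce flags for review and do not block a response on their own.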

Per-request cost attribution: from DataDog to the billing report

LLM cost attribution is the observability dimension most teams implement last and regret most. 'We spent $X on OpenAI this month' is not actionable. 'Tenant A's anomaly detection feature accounted for 34% of our total LLM spend, and 60% of that came from 8% of requests that triggered the most expensive fallback model' is actionable. Getting from the first statement to the second requires logging cost at the individual request level and building aggregation on top of it.

We compute `cost_usd` at the call site using the model's current per-token pricing, stored in a configuration table that is updated when providers change their rates. Input and output tokens are priced separately, since output tokens typically cost more than input tokens at the major providers. The computed cost is logged as a DataDog metric with `tenant_id`, `task_type`, and `model_name` as dimensions. A DataDog Notebook runs daily and aggregates these into a cost breakdown that the product team reviews weekly. When a specific tenant or task type shows unexpected cost growth, we have the signal to investigate before it shows up in the monthly bill.
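As a worked example of the arithmetic, the sketch below prices a call from a pricing table and rolls costs up by arbitrary dimensions. The model names and prices are made up for illustration and are not real provider rates; the table is an in-memory stand-in for the configuration table, and `cost_by` is a hypothetical helper.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens; real rates come from the provider.
PRICING = {
    "example-small": {"input": Decimal("0.50"), "output": Decimal("1.50")},
    "example-large": {"input": Decimal("5.00"), "output": Decimal("15.00")},
}
MILLION = Decimal(1_000_000)


def cost_usd(model_name, input_tokens, output_tokens):
    """Price input and output tokens separately, then sum."""
    price = PRICING[model_name]
    return (price["input"] * input_tokens + price["output"] * output_tokens) / MILLION


def cost_by(events, *dims):
    """Total cost grouped by the given dimensions, e.g. tenant_id and task_type."""
    totals = defaultdict(Decimal)
    for event in events:
        key = tuple(event[dim] for dim in dims)
        totals[key] += cost_usd(
            event["model_name"],
            event["input_token_count"],
            event["output_token_count"],
        )
    return dict(totals)


events = [
    {"tenant_id": "tenant-a", "task_type": "anomaly", "model_name": "example-large",
     "input_token_count": 1000, "output_token_count": 2000},
    {"tenant_id": "tenant-a", "task_type": "summary", "model_name": "example-small",
     "input_token_count": 2000, "output_token_count": 1000},
    {"tenant_id": "tenant-b", "task_type": "summary", "model_name": "example-small",
     "input_token_count": 1_000_000, "output_token_count": 0},
]
for key, total in cost_by(events, "tenant_id").items():
    print(key, total)
```

`Decimal` avoids accumulating floating-point error when many small per-call costs are summed. Grouping by `("tenant_id", "task_type", "model_name")` gives the same dimensions as the metric described above.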

Sentry integration: error tracking for non-deterministic failures

LLM APIs fail in ways that conventional APIs don't. They return valid HTTP 200 responses containing content-policy refusals, truncated outputs, or structured-output violations that look like success at the HTTP layer but are failures at the application layer. We capture these as Sentry issues with the full context attached: prompt version, input token count, model, tenant, and a truncated version of the problematic output (we never log full outputs to Sentry due to data privacy constraints — only the first 200 characters).
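A sketch of how an HTTP 200 response might be classified as an application-level failure before it is reported. The failure labels, the refusal phrases, and the `finish_reason` values (`length`, `content_filter`, modeled on what some provider APIs return) are assumptions. The Sentry SDK call is left out so the block runs with the standard library only; the returned context dictionary is what would be attached to the issue.

```python
import json

MAX_LOGGED_CHARS = 200  # only the first 200 characters of output are logged

# Hypothetical phrases; a real list would be tuned to the models in use.
REFUSAL_MARKERS = ("i can't help with", "i cannot assist", "i'm unable to")


def classify_soft_failure(text, finish_reason, expects_json):
    """Return a failure label for a 200 response, or None if it looks fine."""
    if finish_reason == "content_filter":
        return "content_policy_refusal"
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "content_policy_refusal"
    if finish_reason == "length":
        return "truncated_output"
    if expects_json:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            return "structured_output_violation"
    return None


def build_error_context(text, *, prompt_version, input_token_count,
                        model_name, tenant_id):
    """Context for the error report; the output is truncated for privacy."""
    return {
        "prompt_version": prompt_version,
        "input_token_count": input_token_count,
        "model_name": model_name,
        "tenant_id": tenant_id,
        "output_excerpt": text[:MAX_LOGGED_CHARS],
    }


failure = classify_soft_failure("not valid json", "stop", expects_json=True)
if failure is not None:
    context = build_error_context(
        "not valid json", prompt_version=7, input_token_count=1200,
        model_name="example-model", tenant_id="tenant-a",
    )
    print(failure, context)
```

Using the failure label as the grouping key keeps refusals, truncations, and parse failures in separate issues, so each can be tracked by volume.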

Sentry's grouping is particularly useful here: content-policy refusals group together so we can see whether a specific prompt version is triggering the policy filter at an elevated rate. JSON parse failures group by task type so we can see which output format is most brittle. Setting up alert rules on Sentry issue volume by type means we're paged on quality regressions, not just on HTTP errors — a meaningful extension of what 'error rate' means in an LLM system.

The operational playbook: weekly review ritual

Observability infrastructure is only valuable if someone is looking at it. We run a 30-minute weekly LLM pipeline review using a fixed DataDog dashboard. The review covers: cost-per-tenant week-over-week, cache hit rate by task type, p99 latency by model and task type, quality score trends by prompt version, and open Sentry issues by type. Each metric has a 'normal range' documented in the dashboard description. Anything outside the normal range generates a ticket before the meeting ends. This ritual caught a prompt regression causing a 15% drop in factual consistency scores three weeks before any user complained.
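The "normal range" rule is simple enough to automate ahead of the meeting. The sketch below is one way to do it; the metric names and ranges are hypothetical placeholders, not the values from the actual dashboard.

```python
# Hypothetical normal ranges as (low, high), inclusive on both ends.
NORMAL_RANGES = {
    "cache_hit_rate": (0.40, 1.00),
    "p99_latency_ms": (0, 6000),
    "factual_consistency": (0.90, 1.00),
}


def out_of_range(observed, ranges=NORMAL_RANGES):
    """Return one finding per observed metric that falls outside its range."""
    findings = []
    for metric, value in observed.items():
        low, high = ranges[metric]
        if not (low <= value <= high):
            findings.append(
                {"metric": metric, "value": value, "low": low, "high": high}
            )
    return findings


this_week = {
    "cache_hit_rate": 0.55,
    "p99_latency_ms": 4200,
    "factual_consistency": 0.78,
}
for finding in out_of_range(this_week):
    print(f"ticket: {finding['metric']}={finding['value']} "
          f"outside [{finding['low']}, {finding['high']}]")
```

Each finding maps to one ticket, which keeps the review focused on deciding what to do about it.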

Observability for LLM pipelines is not just a debugging tool; it's also a product management tool. Token cost, quality scores, and latency by task type tell you where to invest in improvement and where you're already performing well. Without it, every product decision about the AI layer is a guess.

Where to start if you're building this from scratch

If you're building LLM observability from scratch, the priority order matters. First: structured logging with a fixed schema, including cost. Second: distributed tracing with LLM calls as spans. Third: at least one automated quality signal (structural validity is the cheapest to implement). Fourth: a weekly review ritual that forces someone to look at the dashboard. The LLM-as-judge quality scoring and the full Sentry integration can come later — but structured logs and traces from day one make everything else possible.

Getting this infrastructure right is the difference between an LLM feature you can confidently iterate on and one you're afraid to touch because you can't tell if changes make it better or worse. Setting up observability stacks for AI pipelines — the instrumentation, the dashboards, the alerting, and the review rituals — is something I do as part of contract engagements. If your AI product is flying blind in production, that's a tractable problem.
