LLM Observability with Langfuse: What to Capture, What to Ignore

LLM observability has matured from "log everything to a JSON file" to a proper discipline in 2026. The substrate for sovereign deployments is Langfuse self-hosted. The question that determines whether your observability is useful or noise is what you choose to capture — and what you deliberately don't.

MindMap Engineering

MindMap Digital

When we're parachuted into a stalled agent deployment, which we've now done for 11 customers in the last 18 months, the first thing we ask the team to show us is their observability dashboard. About a third of those teams open Langfuse, walk us through three or four queries, and we have an architectural diagnosis within an hour. The other two thirds open a homegrown JSON-log viewer that captures everything and tells them nothing. The difference between the two cohorts isn't engineering skill; it's discipline about what to capture and what to deliberately leave out.

The substrate we deploy in every sovereign customer engagement is self-hosted Langfuse: it's open source, it runs inside the customer's perimeter, and its data model matches the shape of LLM workloads well. Choosing the tool is the easy part. The harder question is what to capture, what to deliberately leave out, and which four queries belong on a dashboard the engineering team will check weekly. Here's how we set up Langfuse on customer engagements, drawn from the production deployments we've shipped over the past two years.

What to capture by default

For every LLM call: the input prompt (the full prompt sent to the model, not just the user's message), the retrieved context (if RAG), the model identifier and version, the response, the response metadata (token counts, latency, finish reason), the user identifier, the workflow / session identifier, and the trace identifier that ties this call back to the user-facing request.
For every tool call inside an agent: the function name, the arguments, the result, the latency.
For every retrieval call: the query, the top-K results, the scores.

This baseline gives you replayability: you can reconstruct any decision the system made by querying the observability store.
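As a minimal sketch of the per-call baseline above, the record below lists the fields as a plain Python dataclass. The class name `LLMCallRecord`, the field names, and the helper functions are hypothetical and chosen for illustration; they are not Langfuse's schema or SDK.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LLMCallRecord:
    """Hypothetical baseline record for one LLM call (not a Langfuse type)."""
    trace_id: str        # ties the call back to the user-facing request
    session_id: str      # workflow / session identifier
    user_id: str
    model: str           # model identifier and version
    prompt: str          # the full prompt sent to the model
    response: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    finish_reason: str
    retrieved_context: List[str] = field(default_factory=list)  # empty if no RAG


# Identifiers that must be non-empty for the call to be replayable later.
REQUIRED_FIELDS = ("trace_id", "session_id", "user_id", "model", "prompt")


def missing_fields(record: LLMCallRecord) -> List[str]:
    """Return the names of required fields that are empty on this record."""
    return [name for name in REQUIRED_FIELDS if not getattr(record, name)]


def to_log_line(record: LLMCallRecord) -> str:
    """Serialise the record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Rejecting or flagging records with missing identifiers at write time is one way to keep the store replayable; whether to drop, quarantine, or backfill such records is a deployment decision.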

What to capture for evaluation

Beyond the baseline, capture user feedback signals wherever the product exposes them: thumbs-up/thumbs-down ratings, the downstream action taken (whether the user accepted the model's recommendation), and, where applicable, the ground-truth outcome (what the user's bug report was actually about, whether the claim was actually approved). These signals are what turn observability into evaluation: when you can join model outputs to user outcomes, you can quantify quality drift over time and trigger alerts when it crosses thresholds. Most observability deployments don't make this join cleanly, which is why most of them produce dashboards no one looks at after the second week.
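The join described above can be sketched in a few lines. This example assumes two hypothetical inputs, a list of `(trace_id, day)` pairs for model outputs and a `trace_id -> accepted` mapping for feedback, and computes a per-week acceptance rate that an alert could compare against a threshold. The function names and data shapes are illustrative only.

```python
from collections import defaultdict
from datetime import date


def weekly_acceptance(generations, feedback):
    """Join outputs to feedback on trace_id and aggregate by ISO week.

    generations: iterable of (trace_id, day) pairs, day being a datetime.date
    feedback:    dict mapping trace_id -> True (accepted) / False (rejected)
    Returns {(iso_year, iso_week): acceptance_rate} over rated traces only.
    """
    totals = defaultdict(lambda: [0, 0])  # week -> [accepted, rated]
    for trace_id, day in generations:
        if trace_id not in feedback:
            continue  # no signal for this trace, so it cannot be scored
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        totals[week][1] += 1
        if feedback[trace_id]:
            totals[week][0] += 1
    return {week: accepted / rated for week, (accepted, rated) in totals.items()}


def drifted_weeks(rates, threshold):
    """Return the weeks whose acceptance rate fell below the threshold."""
    return sorted(week for week, rate in rates.items() if rate < threshold)
```

Traces without feedback are skipped rather than counted as failures, so the rate reflects only rated traffic; a low rating volume is itself worth tracking alongside the rate.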

What to deliberately not capture

Three categories of data need to be filtered or restricted before they reach the observability store. First, personally identifying information that the workflow doesn't need to retain for compliance purposes (emails, phone numbers, ID numbers) should be redacted at the observability ingest layer; we use Redacto for this in customer deployments. Second, free-text fields with known sensitive content, such as clinical notes or trade secrets, need an access-control model in the observability layer that is at least as strict as the underlying system's. Third, anything covered by retention limits: the EU GDPR right to erasure, for example, means the observability store has to honour deletion requests, which is a non-trivial operational requirement and much easier to meet if you minimise the scope of what's captured.
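To illustrate redaction at the ingest layer, here is a deliberately simple regex-based sketch. It is not Redacto's API and not a complete PII detector: the two patterns, the placeholder tokens, and the function names are assumptions for illustration, and a production filter would need locale-aware rules and ID-number formats.

```python
import re

# Simplified patterns; real deployments need broader, locale-aware coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def redact_payload(value):
    """Recursively redact every string inside a nested dict/list payload."""
    if isinstance(value, str):
        return redact(value)
    if isinstance(value, dict):
        return {key: redact_payload(item) for key, item in value.items()}
    if isinstance(value, list):
        return [redact_payload(item) for item in value]
    return value  # numbers, booleans and None pass through unchanged
```

Running the filter on the whole payload before it is written, rather than on selected fields, means a new free-text field added later is covered by default.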

The Langfuse data model in production terms

Trace: a top-level request.
Generation: a single LLM completion (one model call, one response).
Span: any other measurable unit (a tool call, a retrieval call, an external API call).
Score: a quality metric attached to a trace or generation.

The mapping in our deployments is: a single user request becomes one trace; each LLM call inside that request becomes a generation; each tool or retrieval call becomes a span; quality metrics (faithfulness, relevance, user feedback) attach as scores. This nesting reflects the operational shape of LLM applications and makes the queries the team actually wants to run ("show me all failures in the last hour", "what's the p95 latency for the agent on workflow X") direct rather than cross-joined.
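The nesting can be pictured with plain dataclasses. This is a sketch of the shape described above, not Langfuse's SDK or storage schema: the class and field names are hypothetical, and the query function runs over an in-memory list purely to show why the nested model makes "failures in the last hour" a direct filter.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Score:
    name: str            # e.g. "faithfulness", "relevance", "user_feedback"
    value: float


@dataclass
class Generation:
    model: str
    latency_ms: float
    error: Optional[str] = None


@dataclass
class Span:
    name: str            # tool call, retrieval call or external API call
    latency_ms: float
    error: Optional[str] = None


@dataclass
class Trace:
    trace_id: str
    workflow: str
    started_at: datetime
    generations: List[Generation] = field(default_factory=list)
    spans: List[Span] = field(default_factory=list)
    scores: List[Score] = field(default_factory=list)

    def failed(self) -> bool:
        """A trace counts as failed if any nested generation or span errored."""
        return any(obs.error for obs in [*self.generations, *self.spans])


def failures_since(traces, now, window=timedelta(hours=1)):
    """Return the ids of failed traces that started within the window."""
    cutoff = now - window
    return [t.trace_id for t in traces if t.started_at >= cutoff and t.failed()]
```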

The four queries every team should be running weekly

Query one: cost-per-completion over time, broken down by workflow. Detects cost regressions early.
Query two: latency distribution per workflow at p50 / p95 / p99. Detects performance regressions.
Query three: error rate per workflow, broken down by error type (tool-call failure, validation failure, model refusal, timeout). Detects quality regressions and identifies which subsystem is the cause.
Query four: user-feedback aggregation per workflow over a 7-day window. Detects subjective quality drift before it becomes a customer complaint.

We embed all four queries into a Langfuse dashboard at deployment time, with thresholds that page the on-call engineer when crossed.
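For readers who want to see what the four aggregates compute, here is a rough sketch over exported rows. It assumes each row is a dict with hypothetical keys `workflow`, `cost_usd`, `latency_ms`, `error_type` (None on success) and `feedback` (True, False, or None when unrated), already filtered to the 7-day window. It uses a nearest-rank percentile and is not how Langfuse's dashboards are implemented.

```python
import math
from collections import Counter, defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def weekly_report(rows):
    """Compute the four weekly aggregates per workflow."""
    by_workflow = defaultdict(list)
    for row in rows:
        by_workflow[row["workflow"]].append(row)

    report = {}
    for workflow, items in by_workflow.items():
        latencies = [r["latency_ms"] for r in items]
        errors = Counter(r["error_type"] for r in items if r["error_type"])
        rated = [r["feedback"] for r in items if r["feedback"] is not None]
        report[workflow] = {
            # Query one: average cost per completion
            "cost_per_completion": sum(r["cost_usd"] for r in items) / len(items),
            # Query two: latency distribution
            "latency_ms": {p: percentile(latencies, p) for p in (50, 95, 99)},
            # Query three: error rate with a breakdown by type
            "error_rate": sum(errors.values()) / len(items),
            "errors_by_type": dict(errors),
            # Query four: share of positive feedback among rated rows
            "positive_feedback_rate": sum(rated) / len(rated) if rated else None,
        }
    return report
```

Paging thresholds would then be simple comparisons against these values, with the limits set per workflow.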

Where observability becomes evaluation

Production observability data is the input to a proper evaluation discipline. The best eval sets are built from real production traces — the team picks the cases the model handled badly, labels the correct outputs, and uses the labelled set to gate future model or prompt changes. This loop (capture → label → eval-gate → deploy → capture) is what produces the production-quality LLM systems that actually keep improving over time. Without observability infrastructure that captures the right data with retrievable access, this loop doesn't close. See /agentic-ai for the broader runtime architecture and /glossary#observability for the underlying concept.
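The eval-gate step of the loop above might look something like the sketch below. It assumes the labelled set is a list of `(input, expected_output)` pairs built from production traces and that the candidate prompt or model is wrapped in a callable; exact-match comparison is a simplifying assumption, and real gates often use graded or model-based scoring instead. All names here are hypothetical.

```python
def pass_rate(cases, predict):
    """Fraction of labelled cases where the candidate's output matches the label.

    cases:   list of (input_text, expected_output) pairs from labelled traces
    predict: callable wrapping the candidate model or prompt
    """
    if not cases:
        raise ValueError("the labelled eval set is empty")
    passed = sum(1 for text, expected in cases if predict(text) == expected)
    return passed / len(cases)


def gate(cases, candidate, baseline_rate, tolerance=0.0):
    """Allow a deploy only if the candidate is no worse than the baseline.

    Returns (allowed, candidate_rate) so the caller can log the measured rate.
    """
    rate = pass_rate(cases, candidate)
    return rate >= baseline_rate - tolerance, rate
```

A usage example: if the current production prompt scores 0.75 on the labelled set, a candidate scoring 0.5 is blocked by `gate(cases, candidate, 0.75)`, and the failing cases become the next round of labelling work.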

About the author

MindMap Engineering

MindMap Digital Engineering Practice

MindMap Engineering is the collective practice behind 117 production-deployed AI accelerators across BFSI, healthcare, government, retail and telecom. The pieces published here are written by the engineering leads who shipped the systems they describe — sovereign LLM platforms, RAG pipelines, agentic workflows, IDP systems — at customer sites across three continents. We don't write about architectures we haven't deployed.

Credentials + recognition

✓ 117 production-deployed AI accelerators
✓ 50+ enterprise customers across BFSI, healthcare, government
✓ Deployments live across India, UK, EU, Gulf, North America, Africa
✓ Sovereign deployment as the default architectural pattern
✓ Langfuse + RAGAS + vLLM + Qdrant production experience

Areas of repeated lived expertise

Open-weights LLM serving (Llama, Qwen, Mistral, DeepSeek)
Production RAG architecture
Agentic AI runtime engineering
Document intelligence (IDP) at 94%+ STP
On-premise + air-gapped deployments
