What 94% Straight-Through Processing Actually Requires
Every IDP vendor's demo promises 95%+ straight-through. Every production IDP deployment that's been running for more than 9 months sits around 75–85%. The gap between the demo number and the production number is where the actual engineering lives. Here's what 94% STP requires across classification, extraction, validation and exception design.
In the first week of November 2025 I was in a war room in Hyderabad with the operations director of a tier-1 bank's trade finance team. The bank had deployed an IDP system 14 months earlier, and the vendor's demo had positioned it at 95% straight-through processing (STP). They were running at 71%. Their operations team had ballooned from 8 reviewers to 23 chasing exceptions, and the executive sponsor was being asked uncomfortable questions about ROI. The bank brought us in for a remediation. By March we'd lifted them to 92% STP, brought the operations team down to 12 (the reduction came from exception-design improvements that let reviewers handle 2.5× the volume), and freed roughly $1.4M in annual operating cost.
That gap, from a vendor demo of 95% to a production reality of 71% to a remediated 92%+, is where the engineering of IDP actually lives. Every IDP vendor's sales deck promises 95%+ straight-through, and every production IDP deployment that's been live more than 9 months sits around 75–85% until someone does the engineering work to fix it. The 94%+ STP we ship across DocuMage deployments isn't a feature of any single capability; it's the cumulative effect of four engineering decisions made early and held to throughout the deployment.
The classifier-first routing pattern
Decision one is structural: don't send every document to the LLM. The cheapest classifier you can deploy (a fine-tuned small model or even a heuristic rules layer for the highest-volume document types) routes documents to the right extraction path. Highly structured forms (passport pages, standard invoices in a known template, government ID documents) go to template-based extraction with field coordinates; layout-free documents (correspondence, free-form claims, contracts) go to LLM-augmented extraction. Routing accuracy at the classifier stage is decisive — get it wrong and the downstream extraction is operating on the wrong assumptions about layout. Production classifier accuracy needs to be 98%+ for the architecture to work; if you're below that, work on the classifier before scaling the extraction layer.
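A minimal sketch of the routing decision described above, in Python. The document-type names, the `route` function and the 0.9 confidence floor are illustrative assumptions, not part of any real deployment; sending low-confidence classifications to review rather than guessing a path is also an assumption layered on top of the pattern.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    TEMPLATE = "template"  # coordinate-based extraction for known layouts
    LLM = "llm"            # LLM-augmented extraction for layout-free documents
    REVIEW = "review"      # classifier unsure or type unknown: do not guess


# Hypothetical type names; the structured / layout-free split follows the text.
STRUCTURED = {"passport", "standard_invoice", "government_id"}
LAYOUT_FREE = {"correspondence", "free_form_claim", "contract"}


@dataclass(frozen=True)
class Classification:
    doc_type: str
    confidence: float  # classifier score in [0, 1]


def route(c: Classification, min_confidence: float = 0.9) -> Path:
    """Pick an extraction path from the classifier's output."""
    if c.confidence < min_confidence:
        return Path.REVIEW
    if c.doc_type in STRUCTURED:
        return Path.TEMPLATE
    if c.doc_type in LAYOUT_FREE:
        return Path.LLM
    return Path.REVIEW
```

Keeping the router this small is deliberate: the expensive extraction paths sit behind it, so a misroute is cheap to detect in the router and expensive to detect anywhere later.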
Per-field confidence scoring is non-negotiable
Decision two: every extracted field carries a confidence score. Without per-field confidence, you have no signal for which fields to trust, and the system either over-routes to human review (hurting STP) or under-routes (letting wrong fields flow through). Per-field confidence comes from two sources: the model's own confidence calibration (which is usable for the structured-output models we typically deploy, but not for the LLM-based ones without calibration work), and downstream validation against expected types and ranges. The right architecture combines both — a field gets a final confidence that's the minimum of the model's score and the validation score — and routes by that final number.
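The min-of-two-scores rule can be sketched in a few lines of Python. The `Field` type, the field names and the 0.85 review threshold are hypothetical; the only thing taken from the text is that the final confidence is the minimum of the model score and the validation score, and that routing uses that final number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    value: str
    model_score: float       # calibrated model confidence in [0, 1]
    validation_score: float  # score from type and range checks in [0, 1]

    @property
    def final_confidence(self) -> float:
        # The weaker of the two signals decides how much to trust the field.
        return min(self.model_score, self.validation_score)


def fields_for_review(fields: list[Field], threshold: float = 0.85) -> list[str]:
    """Names of fields whose final confidence falls below the threshold."""
    return [f.name for f in fields if f.final_confidence < threshold]


fields = [
    Field("total_amount", "15.00", model_score=0.99, validation_score=1.0),
    Field("issue_date", "2031-01-10", model_score=0.95, validation_score=0.4),
]
needs_review = fields_for_review(fields)
```

Here `issue_date` is flagged even though the model was confident, because the validation score pulls the final confidence down to 0.4.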
Business-rule validation as a first-class pipeline stage
Decision three: validation is a pipeline stage, not an afterthought. Extracted fields are checked against the business rules that downstream systems will check anyway. Invoice line items sum to the invoice total. Issue dates are before today. ID numbers pass their checksum. Counterparties exist in the registry. If validation fails, the document routes to exception review rather than flowing through to the downstream system that would reject it anyway. This decision alone typically lifts STP by 8–12 points in our deployments, because the upstream catch eliminates the loop where a document flows through, gets rejected downstream, comes back to the operations team to correct, and re-enters processing.
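As a rough illustration, the four checks named above could look like this in Python. The `Invoice` shape, the failure codes and the registry-as-a-set are assumptions made for the sketch, and the Luhn checksum is only a stand-in: real ID schemes each define their own check-digit algorithm.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    line_items: list[Decimal]
    total: Decimal
    issue_date: date
    id_number: str
    counterparty: str


def luhn_ok(number: str) -> bool:
    """Luhn checksum, used here only as a placeholder for the ID's real scheme."""
    if not (number.isascii() and number.isdigit()):
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def validate(inv: Invoice, registry: set[str], today: date) -> list[str]:
    """Return failure codes; an empty list means the document may flow through."""
    failures = []
    if sum(inv.line_items, Decimal("0")) != inv.total:
        failures.append("line_items_do_not_sum_to_total")
    if inv.issue_date >= today:  # the text asks for dates strictly before today
        failures.append("issue_date_not_before_today")
    if not luhn_ok(inv.id_number):
        failures.append("id_checksum_failed")
    if inv.counterparty not in registry:
        failures.append("counterparty_not_in_registry")
    return failures
```

Any non-empty result routes the document to exception review with the failure codes attached, so the reviewer sees why it stopped instead of rediscovering the problem after a downstream rejection.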
Exception design is the 50% of the work nobody scopes
Decision four: exception design gets scoped as real work. The 6% of documents that don't go straight-through is where the value engineering lives. Most IDP programmes optimise extraction accuracy and treat the human-review queue as a regrettable necessity to be minimised. The teams that hit 94% sustainably treat exception design as a first-class workstream with three parts: a structured exception UX that shows the document with the low-confidence fields highlighted, the reasoning trace from extraction, and the reviewer's expected next action; clear escalation paths for documents the reviewer can't resolve; and a learning loop in which the reviewer's corrections feed back into the extractor's eval set. Without this, your reviewer team burns out within 18 months and STP rates collapse.
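One way to picture the review payload and the learning loop is the Python sketch below. Every name in it (`ReviewTask`, `EvalSet`, the field names, the queue name) is hypothetical; it only mirrors the three elements listed above, not any actual DocuMage data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReviewTask:
    doc_id: str
    flagged_fields: dict[str, str]  # low-confidence field name -> extracted value
    reasoning_trace: str            # why extraction produced these values
    expected_action: str            # what the reviewer is asked to do next
    escalate_to: str                # queue for documents the reviewer cannot resolve


@dataclass
class EvalSet:
    cases: list[dict[str, str]] = field(default_factory=list)

    def record_correction(self, task: ReviewTask, name: str, corrected: str) -> None:
        """Store the reviewer's correction as a labelled case for the extractor."""
        self.cases.append({
            "doc_id": task.doc_id,
            "field": name,
            "extracted": task.flagged_fields[name],
            "expected": corrected,
        })


task = ReviewTask(
    doc_id="doc-001",
    flagged_fields={"issue_date": "2031-01-10"},
    reasoning_trace="Date read from header; year digit ambiguous in scan.",
    expected_action="Confirm or correct the issue date.",
    escalate_to="senior_review",
)
eval_set = EvalSet()
eval_set.record_correction(task, "issue_date", "2021-01-10")
```

The point of the second class is that a correction is captured as data at the moment the reviewer makes it, so the same error can be tested against the next extractor version.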
Five failure modes that keep STP capped at 75%
- Stopping at OCR: the team assumes text extraction is enough, then discovers downstream that the structured-field problem is still unsolved.
- Template-only extraction at scale: works on the 60% of documents that are highly structured, fails on the long tail.
- No confidence scoring: the model emits fields with no signal for which ones to trust.
- No business-rule validation: extraction is correct, but the downstream rejection rate is high because format-valid but business-invalid output flows through.
- Treating the reviewer queue as a buffer rather than a learning loop: the operations team patches the same kinds of errors month after month with no improvement in the underlying extractor.
The first-90-days plan for an IDP programme
Days 1–30: pick a single document type and a single workflow. Inventory the production volume, the current STP rate (almost certainly 60–80% if any automation already exists), and the cost of an exception. Don't try to do everything.
Days 31–60: deploy the classifier-first architecture for the chosen document type. Add per-field confidence scoring. Add business-rule validation. Build the exception-review UX.
Days 61–90: production rollout in 10%/25%/100% phases with hypercare. Measure STP weekly.
The target for a properly architected first workflow is 90%+ STP by day 90, drifting toward 94%+ over the following two quarters as the exception-feedback loop tightens the extractor. The MindMap IDP platform (DocuMage) ships with these patterns built in; see /document-intelligence for the architecture.
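The phased rollout and the weekly STP measurement can both be sketched in Python. Hash-based bucketing is an assumption (one common way to make a percentage rollout deterministic), as are the function names; STP is computed here as the share of documents that needed no human touch.

```python
import hashlib


def in_rollout(doc_id: str, percent: int) -> bool:
    """Deterministically place a document in the first `percent` of 100 buckets."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def stp_rate(total_docs: int, touched_by_human: int) -> float:
    """Share of documents processed with no human intervention."""
    if total_docs <= 0:
        raise ValueError("total_docs must be positive")
    return (total_docs - touched_by_human) / total_docs


week_stp = stp_rate(total_docs=1000, touched_by_human=60)
```

Because the bucket depends only on the document ID, a document admitted in the 10% phase stays admitted at 25% and 100%, which keeps week-over-week STP numbers comparable across phases.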
MindMap Engineering
MindMap Engineering is the collective practice behind 117 production-deployed AI accelerators across BFSI, healthcare, government, retail and telecom. The pieces published here are written by the engineering leads who shipped the systems they describe — sovereign LLM platforms, RAG pipelines, agentic workflows, IDP systems — at customer sites across three continents. We don't write about architectures we haven't deployed.