Intelligent Document Processing · OCR

Every document, every layout, every language — extracted and verified

OCR has been a solved problem for printed Latin-script invoices since 2010. The work nobody else does well is the other ninety percent of enterprise documents: handwriting, multilingual scans, novel layouts, three-page faxes, and PDFs that were screenshots of photographs. We combine template-free LLM extraction with classical OCR, business-rule validation, and confidence-driven human review to deliver straight-through processing rates that move the unit economics, not the demo slide.

99.1%
Field accuracy on production traffic
140+
Pre-built extractors
62 langs
Latin, Arabic, Cyrillic, Indic, CJK
68%
Avg straight-through rate
4 hrs
Median cycle time
$0.03
Cost per document
Capabilities

What we deliver

Layout-agnostic extraction

Classical OCR for character recognition, layout analysis for table and form structure, and an LLM extraction pass that understands what a field means rather than where it sits on the page. The result is consistent extraction across novel layouts without rebuilding templates every time a supplier changes their letterhead.

99.1% field accuracy in production

Handwriting and low-quality scans

Multi-model ICR pipeline tuned for fax, photocopy, mobile capture, and handwritten input across English, French, Arabic, and major Indic scripts. We pre-process aggressively — deskew, denoise, super-resolution upscaling — before extraction, and post-process with confidence-weighted ensembling.
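
As an illustration of confidence-weighted ensembling, here is a minimal Python sketch. It assumes each OCR or ICR engine returns a candidate string plus a confidence between 0 and 1. The function name `ensemble_field` and the sample readings are hypothetical, and the production pipeline may weight engines differently.

```python
from collections import defaultdict


def ensemble_field(candidates):
    """Pick one value for a field from several engines' (value, confidence) readings.

    Each distinct value accumulates the confidences of the engines that
    produced it; the value with the largest total wins.
    """
    if not candidates:
        raise ValueError("at least one candidate reading is required")
    totals = defaultdict(float)
    for value, confidence in candidates:
        totals[value.strip()] += confidence
    best_value = max(totals, key=totals.get)
    # Normalise so the score is the winner's share of all confidence mass.
    score = totals[best_value] / sum(totals.values())
    return best_value, score


# Hypothetical readings of one invoice number from three engines.
readings = [
    ("INV-2024-08821", 0.90),
    ("INV-2024-08821", 0.80),
    ("INV-2024-0882I", 0.60),  # one engine misreads the final digit
]
value, score = ensemble_field(readings)
print(value, round(score, 3))
```

Two engines agreeing outweigh a single dissenting engine here, and the normalised score can then feed the review routing described further down the page.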

Pre-built extractor library

Over one hundred forty production-tested extractors out of the box: KYC documents from one hundred ninety-six countries, bank statements from four hundred banks, invoices, contracts, medical records, insurance forms, BOMs, lab reports, shipping manifests. Each one tuned, evaluated, and version-controlled.

140+ ready extractors

Business-rule validation

Extracted fields are validated against your business rules, external databases, and prior documents — three-way invoice match, KYC consistency across documents, contract-clause flagging against your policy library. Validation failures are routed with the rule that fired, not as opaque exceptions.
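
A minimal Python sketch of rule-based validation that reports which rule fired. The invoice values are taken from the demo on this page; the purchase order, the goods receipt, the rule names, and the tolerance are hypothetical, and a real three-way match would compare line items as well.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class RuleFailure:
    rule: str    # name of the rule that fired
    detail: str  # explanation shown to the reviewer


def validate_invoice(invoice, purchase_order, goods_receipt, tolerance=Decimal("0.01")):
    """Return a list of RuleFailure objects; an empty list means all rules passed."""
    failures = []

    # Rule 1: tax must equal the net amount times the stated rate.
    expected_tax = invoice["amount"] * invoice["tax_rate"]
    if abs(expected_tax - invoice["tax"]) > tolerance:
        failures.append(RuleFailure(
            "tax_consistency", f"expected {expected_tax}, got {invoice['tax']}"))

    # Rule 2: invoice, PO, and goods receipt must share one PO reference.
    references = {invoice["po_reference"],
                  purchase_order["po_reference"],
                  goods_receipt["po_reference"]}
    if len(references) != 1:
        failures.append(RuleFailure(
            "three_way_po_reference", f"mismatched references: {sorted(references)}"))

    # Rule 3: the invoice must not bill more than the PO authorised.
    if invoice["amount"] > purchase_order["amount"] + tolerance:
        failures.append(RuleFailure(
            "invoice_exceeds_po",
            f"invoice {invoice['amount']} > PO {purchase_order['amount']}"))

    return failures


invoice = {"amount": Decimal("24500.00"), "tax_rate": Decimal("0.20"),
           "tax": Decimal("4900.00"), "po_reference": "PO-2024-1123"}
purchase_order = {"po_reference": "PO-2024-1123", "amount": Decimal("24000.00")}
goods_receipt = {"po_reference": "PO-2024-1123"}

failures = validate_invoice(invoice, purchase_order, goods_receipt)
for failure in failures:
    print(failure.rule, "->", failure.detail)
```

Returning named failures rather than raising a generic exception is what lets a reviewer see why a document was stopped.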

Confidence-driven review

A field-level confidence score routes each extraction either to straight-through processing, to a fast reviewer queue, or to a senior reviewer for high-risk documents. The thresholds tune themselves to your acceptance criteria over time, lifting STP rates without lifting error rates.
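
The routing decision can be sketched in a few lines of Python. This is an assumption-laden illustration: the threshold value, the `high_risk` flag, and the choice to gate a document on its weakest field are all hypothetical, since the page does not specify how the three queues are selected.

```python
from enum import Enum


class Route(Enum):
    STRAIGHT_THROUGH = "straight_through"
    FAST_REVIEW = "fast_review"
    SENIOR_REVIEW = "senior_review"


def route_document(field_confidences, high_risk, stp_threshold=0.95):
    """Route a document based on its least confident field.

    Assumption: a document is only as trustworthy as its weakest field,
    and high-risk documents that miss the threshold go to a senior reviewer.
    """
    if not field_confidences:
        raise ValueError("no fields were extracted")
    weakest = min(field_confidences.values())
    if weakest >= stp_threshold:
        return Route.STRAIGHT_THROUGH
    if high_risk:
        return Route.SENIOR_REVIEW
    return Route.FAST_REVIEW


# Confidence values from the invoice demo on this page.
confidences = {"vendor": 0.99, "invoice_number": 0.98, "date": 0.97,
               "amount": 0.99, "po_reference": 0.96, "tax": 0.98}

print(route_document(confidences, high_risk=False))
print(route_document(confidences, high_risk=False, stp_threshold=0.97))
```

With the default threshold the demo invoice goes straight through; raising the threshold to 0.97 sends it to review because of the PO reference field.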

Integration depth

Pre-built connectors to SAP, Oracle, Salesforce, the major core banking platforms, the major hospital information systems, SharePoint, and any REST endpoint. Extracted data lands in the system of record with the source document linked for audit, not in a CSV someone has to upload.

Live Demo


DocGenie: extracting invoice fields
Vendor · TechSupplies Ltd · 99% confidence
Invoice # · INV-2024-08821 · 98% confidence
Date · 15 Mar 2024 · 97% confidence
Amount · £24,500.00 · 99% confidence
PO Reference · PO-2024-1123 · 96% confidence
Tax · £4,900.00 (20%) · 98% confidence
Reference Architecture

How a query actually flows.

A real trace through the sovereign stack. Six stages, ~1.4 seconds end-to-end, zero packets leaving your perimeter.

Query trace · live · trace_id 0x8c41a2b9 · usr_4821 · sovereign, on-prem · 17:42:09 IST · 200 OK
01 · User submit · "Q3 underwriting flags" · 42 ms
02 · Embed · bge-large-en, 1024d · 180 ms
03 · Vector search · pgvector, k=32 · 90 ms
04 · Rerank and guardrail · PII, safety, top-8 · 140 ms
05 · Sovereign LLM · Llama 3.1, 70B, local · 940 ms
06 · Compose and cite · 8 docs, markdown · 28 ms
Waterfall · last query · total 1.42 s · SLA < 2 s
Sample response · 8 docs cited · 99% confidence
Q: "Summarise Q3 underwriting flags"
A: 3 anomalies detected in Q3 underwriting [1]: velocity spikes in segment-NA [4], policy concentration above threshold [7], and 2 dormant accounts re-activated [11].
[1] q3_uw_summary.pdf
[4] region_na_h2.xlsx
[7] concentration_log.csv
[11] dormant_audit.pdf
Live traces · last 90 s · 12 ok · 0 failed · 0 egress
17:42:09 · 0x8c41a2b9 · usr_4821 · rag.query · 8 docs, llama-70b · 1.42 s · OK
17:42:04 · 0x8c419f44 · svc_kyc · llm.classify · doc=invoice, 99% · 0.81 s · OK
17:41:58 · 0x8c419b10 · usr_2110 · agent.run · fraud_check, 12 rules · 2.04 s · OK
17:41:51 · 0x8c41960c · usr_4821 · rag.query · 6 docs, llama-70b · 1.11 s · OK
17:41:46 · 0x8c4192e8 · svc_ocr · llm.extract · 12 fields, 98.6% · 0.94 s · OK
17:41:39 · 0x8c418f10 · usr_8801 · agent.run · underwrite, pass · 1.66 s · OK
Zero API egress · 0 bytes out · All stages inside perimeter · Every trace written to your audit store
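
The latency figures in the trace can be checked with a little arithmetic. The sketch below sums the six stage timings shown above and compares them with the stated SLA; the dictionary keys are illustrative names, not identifiers from the platform.

```python
# Stage timings in milliseconds, copied from the query trace above.
stages_ms = {
    "user_submit": 42,
    "embed": 180,
    "vector_search": 90,
    "rerank_guardrail": 140,
    "llm_inference": 940,
    "compose_cite": 28,
}
sla_ms = 2000  # "sla < 2s" from the waterfall header

total_ms = sum(stages_ms.values())
slowest = max(stages_ms, key=stages_ms.get)
share = stages_ms[slowest] / total_ms

print(f"total: {total_ms} ms, headroom: {sla_ms - total_ms} ms")  # total: 1420 ms, headroom: 580 ms
print(f"slowest stage: {slowest} ({share:.1%} of total)")         # slowest stage: llm_inference (66.2% of total)
```

Local LLM inference accounts for roughly two thirds of the end-to-end time in this trace, so it is the stage where serving hardware matters most.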
Methodology

How we deliver

01

Document discovery

We catalogue every document type in scope, sample volume, current process steps, error modes, and downstream system. You get an honest assessment of which documents are good candidates and which are not — some are not worth automating yet, and we will say so.

02

Pipeline design and ground truth

A senior solution architect designs the classification, extraction, validation, and routing pipeline. In parallel, your domain experts label a ground-truth set of two hundred to a thousand documents per type that will gate every release. Without ground truth, accuracy claims are theatre.
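
A release gate on ground truth could look something like the following Python sketch. It assumes exact string matching per field and a single accuracy target; the function names and the sample documents are hypothetical, and a real evaluation would likely normalise dates and amounts before comparing.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of labelled fields whose extracted value matches the label exactly.

    Both arguments map document id -> {field name: value}. A field missing
    from the predictions counts as an error.
    """
    correct = 0
    total = 0
    for doc_id, expected_fields in ground_truth.items():
        extracted = predictions.get(doc_id, {})
        for name, expected in expected_fields.items():
            total += 1
            if extracted.get(name) == expected:
                correct += 1
    if total == 0:
        raise ValueError("ground truth contains no labelled fields")
    return correct / total


def release_allowed(predictions, ground_truth, target):
    """Gate a release: only pass when accuracy meets the agreed target."""
    return field_accuracy(predictions, ground_truth) >= target


# Hypothetical two-document ground-truth set.
ground_truth = {
    "doc-1": {"vendor": "TechSupplies Ltd", "invoice_number": "INV-2024-08821"},
    "doc-2": {"vendor": "Acme GmbH", "invoice_number": "A-1001"},
}
predictions = {
    "doc-1": {"vendor": "TechSupplies Ltd", "invoice_number": "INV-2024-08821"},
    "doc-2": {"vendor": "Acme GmbH", "invoice_number": "A-1O01"},  # letter O misread
}

accuracy = field_accuracy(predictions, ground_truth)
print(accuracy, release_allowed(predictions, ground_truth, target=0.99))
```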

03

Build and tune

We configure the extractors from the library, train classifiers on your taxonomy, define the validation rules, and integrate with downstream systems. A typical rollout of four to six document types takes six to nine weeks. Accuracy is measured continuously against ground truth.

04

Shadow and cutover

Run the pipeline in shadow mode against live document traffic for two to four weeks, surfacing edge cases and tuning thresholds without affecting business outcomes. Cut over only when you have hit your agreed accuracy and STP targets.
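
One way to tune a threshold from shadow-mode data is to pick the lowest confidence cut-off at which the straight-through documents still meet the accuracy target. The Python sketch below assumes each shadow result is a (confidence, was_correct) pair checked against the existing manual process; the data and the function name are hypothetical, and this is not necessarily how the platform tunes its thresholds.

```python
def pick_threshold(shadow_results, target_accuracy):
    """Return (threshold, stp_rate) maximising straight-through volume.

    Documents with confidence >= threshold would go straight through.
    Returns None when no threshold meets the accuracy target.
    """
    ranked = sorted(shadow_results, key=lambda r: r[0], reverse=True)
    best = None
    correct = 0
    for count, (confidence, was_correct) in enumerate(ranked, start=1):
        correct += was_correct
        # Documents with equal confidence must fall on the same side of the cut.
        if count < len(ranked) and ranked[count][0] == confidence:
            continue
        if correct / count >= target_accuracy:
            best = (confidence, count / len(ranked))
    return best


# Hypothetical shadow-mode results: (confidence, matched the manual process?)
shadow = [(0.99, True), (0.97, True), (0.95, True),
          (0.90, False), (0.85, True), (0.60, False)]

print(pick_threshold(shadow, target_accuracy=0.9))  # (0.95, 0.5)
```

Lowering the target trades accuracy for volume: with a target of 0.8 the same data would admit a threshold of 0.85 and send five of the six documents straight through.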

05

Operate and expand

Production operations with monthly accuracy reports, quarterly model retraining cycles, and an expansion backlog for new document types. The platform compounds — adding the second document type costs a fraction of the first.

Technology

The stack we build on

OCR and vision: Tesseract 5, PaddleOCR, Azure Document Intelligence, Google Document AI, Amazon Textract, custom CV models
Extraction LLMs: GPT-4o Vision, Claude 3.5 Sonnet Vision, Qwen-VL, Phi-3 Vision, domain fine-tunes, table-reasoning models
Integrations: SAP S/4 and ECC, Oracle EBS / Fusion / PeopleSoft / JDE, QuickBooks / Tally, Salesforce, Temenos / Finacle / Flexcube, any REST or SFTP
Compliance and audit: SOC 2 Type II, ISO 27001, GDPR Art. 25, DPDP Act, HIPAA Safe Harbor, immutable audit log
"We had been doing KYC document review with two hundred contractors in two locations. MindMap's IDP pipeline now handles ninety-four percent of documents straight-through with audit-grade logging. Onboarding moved from five days to four hours and the regulator audited the system without a single finding."
Director of Operations, Tier-1 Sub-Saharan African Bank
Engagement Options

How we work together

Managed SaaS

Hosted by MindMap on ISO 27001 and SOC 2 Type II certified infrastructure with multi-tenant data isolation, regional residency options, and a documented sub-processor list. Up and running in under two weeks for standard document types.

Private cloud

Deployed in your AWS, Azure, or GCP tenant. You retain full data residency control and key management; we operate the pipeline. Includes integration with your identity provider, your SIEM, and your existing observability stack.

On-prem and air-gapped

Full deployment inside your data centre with no outbound internet, suitable for central banks, defence, and regulators with strict data-sovereignty mandates. Includes the OCR engines, LLM serving, and management plane, with a documented upgrade and patch path.

FAQ

Common questions

What accuracy can you actually deliver on low-quality documents?

On clean, printed documents in a supported language we deliver ninety-nine percent plus field accuracy. On faxes, photocopies, and mobile-captured images we deliver ninety-five to ninety-eight percent depending on quality, using aggressive pre-processing and a confidence-weighted ensemble of OCR engines and vision LLMs. On handwriting we deliver ninety-two to ninety-six percent in English and major Indic scripts. We measure all of this against ground truth — not against vendor demos.

Do you support handwritten documents?

Yes, with realistic limits. We handle handwritten English, French, Arabic, Hindi, and a growing list of Indic scripts. Accuracy varies with handwriting quality but is typically high enough to drive a confidence-routed pipeline where seventy to eighty percent of handwritten input is straight-through and the rest goes to a fast reviewer queue. We will share representative results on samples of your data before contracting.

How do you handle a document type you have never seen?

Our LLM extraction layer is template-free — it understands document structure semantically and can extract fields from novel layouts on day one. For high-volume novel documents we then build a tuned extractor over a few weeks against a ground-truth set, which lifts accuracy and reduces inference cost. For low-volume novel documents we run them through the general extractor with human review on every page. The economics drive the choice.

Can this run fully on-premise without internet access?

Yes. The full pipeline — OCR engines, vision LLMs, classifier, validation engine, management plane — runs inside your data centre with no outbound network. We have deployed this pattern inside central banks and defence agencies. The trade-off is hardware investment up front and a longer release cadence; the benefit is absolute data sovereignty.

How do you integrate with our downstream systems?

Pre-built connectors for SAP, Oracle, the major core banking platforms, Salesforce, the major hospital information systems, and SharePoint. For anything else we integrate via REST, SOAP, SFTP, or message bus. Extracted data lands in the system of record with a link back to the source document image so an auditor can trace any data point to its origin.

What does a human-in-the-loop workflow look like?

Each extracted field carries a confidence score. Above your threshold the document is straight-through. Below it the document routes to a reviewer queue with the suspect fields highlighted, the source image side-by-side, and the validation rules that triggered review made explicit. Reviewers correct, the corrections feed back into the model training set, and accuracy compounds. We track reviewer productivity and inter-rater agreement so the workflow improves over time.
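
Inter-rater agreement can be measured in several ways; the page does not say which metric is used. As one common option, the Python sketch below computes Cohen's kappa for two reviewers who judged the same fields. The reviewer decisions are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each rater's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical decisions by two reviewers on the same eight fields.
reviewer_a = ["accept", "accept", "correct", "accept",
              "correct", "accept", "accept", "correct"]
reviewer_b = ["accept", "accept", "correct", "correct",
              "correct", "accept", "accept", "accept"]

kappa = cohens_kappa(reviewer_a, reviewer_b)
print(round(kappa, 3))
```

The two reviewers agree on six of eight fields, but because "accept" is the most common decision for both, part of that agreement is expected by chance, which is why kappa comes out well below 0.75.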

Ready to explore OCR / IDP?

Speak to our engineering team. No sales pitch — just a technical conversation.
