Medical Records Digitisation at an Indian Hospital Group — 8M Legacy Records Searchable in 14 Languages
Medical Records Parser + DocuMage ingesting eight million paper-and-PDF legacy records into a structured longitudinal patient record.
The challenge
The hospital group, an Indian multi-specialty chain of 28 hospitals handling more than 4 million outpatient visits and 220,000 inpatient admissions per year, was operating with a fragmented medical-records estate. The group had grown by acquisition over a decade and had inherited three different commercial EHRs, two in-house builds and, in the older acquired hospitals, a substantial paper-record archive that had never been digitised. As a result, clinicians frequently treated patients whose prior care had been delivered at another facility within the group, without access to the prior clinical record.
The clinical-quality and patient-safety implications were material. Duplicate diagnostic testing, missed allergy and contraindication signals, and inconsistent management of chronic conditions across encounters were all traceable to the records fragmentation. The group's chief medical officer had set a target of delivering a unified, searchable, structured longitudinal patient record across the group, with the legacy paper-and-PDF archive ingested and the cross-EHR data unified.
The constraints were stringent. The Indian regulatory environment required all patient data to remain in India. Clinical documentation across the group was written in 14 languages: primarily English, Hindi and Tamil, with significant minorities in Bengali, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, Odia and Assamese, plus Sanskrit-based clinical vocabulary. The clinical content was a mix of structured EHR data, dictated-and-transcribed notes, handwritten chart notes (in the older hospitals) and image-based content (scanned lab and imaging reports).
The approach
MindMap deployed Medical Records Parser (Mp) as the structured-extraction engine, DocuMage for the OCR and document-classification layer, Clinical Pathway Engine (Cp) as the clinical-context layer, and Data Lake Architect (Dl) for the underlying longitudinal-record data infrastructure.
Phase one was the data-infrastructure build. The group's three commercial EHRs and two in-house builds were integrated through HL7 FHIR-based adapters (with custom adapters for the in-house builds that did not support FHIR cleanly). The legacy paper-and-PDF archive was ingested through a high-volume scanning operation followed by the platform's OCR-and-extraction pipeline. The unified longitudinal record schema was designed by the group's clinical-informatics team in collaboration with our embedded clinical-informaticist.
Phase two was the multi-language clinical extraction. DocuMage's OCR layer handles the 14 Indian languages and the script variants (the Indic scripts plus Latin). Medical Records Parser's structured extraction handles the clinical-content layer — diagnosis, procedure, medication, allergy, lab values, imaging findings — across the language mix, with a unified clinical-ontology layer that maps language-specific clinical phrasing to SNOMED CT and the Indian clinical-coding equivalents the group uses.
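The ontology-mapping step can be pictured as a lookup from a (language, normalised phrase) pair to a SNOMED CT concept. The sketch below is illustrative only: the phrase table, the `normalise` rule and the `map_phrase` helper are assumptions rather than the Medical Records Parser implementation, and the concept identifiers should be checked against the SNOMED CT release the group uses.

```python
import unicodedata
from typing import NamedTuple, Optional

SNOMED_SYSTEM = "http://snomed.info/sct"


class Concept(NamedTuple):
    system: str
    code: str
    display: str


DIABETES = Concept(SNOMED_SYSTEM, "73211009", "Diabetes mellitus")
HYPERTENSION = Concept(SNOMED_SYSTEM, "38341003", "Hypertensive disorder")


def normalise(text: str) -> str:
    """Unicode-normalise, casefold and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


# Hypothetical phrase table; a production table would be far larger and
# curated by clinical informaticists.
_RAW_PHRASES = [
    ("en", "diabetes mellitus", DIABETES),
    ("hi", "मधुमेह", DIABETES),
    ("ta", "நீரிழிவு", DIABETES),
    ("en", "high blood pressure", HYPERTENSION),
    ("hi", "उच्च रक्तचाप", HYPERTENSION),
]
PHRASES = {(lang, normalise(p)): c for lang, p, c in _RAW_PHRASES}


def map_phrase(lang: str, phrase: str) -> Optional[Concept]:
    """Return the mapped concept, or None when the phrase is unknown."""
    return PHRASES.get((lang, normalise(phrase)))
```

Returning `None` for unknown phrases, rather than guessing a nearest match, keeps unmapped content visible for human review.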
Phase three was the legacy archive ingestion. The paper archive — approximately 8 million records — was processed at high volume across a 14-month ingestion campaign. The ingestion was prioritised: active-patient records first (so that current clinical care benefited immediately), recent-archive records next, and the deep historical archive last.
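The three-tier prioritisation can be expressed as a simple sort key. This is a minimal sketch under stated assumptions: the `ArchiveRecord` fields, the five-year cutoff for "recent" and the tie-break on most recent encounter are all hypothetical and not taken from the engagement.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import Iterable, List


class Tier(IntEnum):
    ACTIVE_PATIENT = 0    # ingested first
    RECENT_ARCHIVE = 1
    DEEP_HISTORICAL = 2   # ingested last


@dataclass(frozen=True)
class ArchiveRecord:
    record_id: str
    last_encounter: date
    patient_is_active: bool


RECENT_CUTOFF_DAYS = 5 * 365  # assumed definition of "recent archive"


def tier_for(record: ArchiveRecord, today: date) -> Tier:
    if record.patient_is_active:
        return Tier.ACTIVE_PATIENT
    if (today - record.last_encounter).days <= RECENT_CUTOFF_DAYS:
        return Tier.RECENT_ARCHIVE
    return Tier.DEEP_HISTORICAL


def ingestion_order(records: Iterable[ArchiveRecord], today: date) -> List[ArchiveRecord]:
    """Sort by tier, then by most recent encounter within each tier."""
    return sorted(
        records,
        key=lambda r: (tier_for(r, today), -r.last_encounter.toordinal()),
    )
```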
Phase four was the clinical-workflow integration. The unified longitudinal record was integrated into each EHR's clinical workflow so clinicians could access the cross-encounter record from within their familiar EHR interface, with deep-links into the historical-record detail where the clinician needed the full source content.
The pre-built building blocks
Rather than commission a ground-up build, the engagement leaned on MindMap's pre-built accelerator library — production-tested components that compress what would otherwise be a six-to-nine-month build into weeks.
Medical Records Parser: 14-language clinical-content structured extraction
DocuMage: Indic-script OCR and document classification
Clinical Pathway Engine: Clinical-context and ontology layer
Data Lake Architect: Longitudinal-record data infrastructure
The architecture
The platform runs on the group's private cloud tenant inside Indian regional infrastructure, with full Indian data-residency and the group's HIPAA-equivalent data-protection posture maintained throughout.
DocuMage's OCR layer is multi-stage. For the Indic-script content, a fine-tuned Indic-script OCR model (built on Tesseract 5 with custom training data for the medical-handwriting domain) handles the bulk volume; for the higher-complexity content (older handwritten paper records and dictation transcriptions), an LLM-based OCR layer takes over. The extraction layer is multilingual end-to-end: the model handles documents written in any of the 14 supported languages without requiring upstream language detection.
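One way to picture the hand-off between the two OCR stages is a confidence-based escalation. The case study does not state how pages are routed, so the rule below (try the bulk engine first, escalate when its confidence falls under a floor) is an assumption, as are the `Page` and `OcrResult` types, the `run_ocr` helper and the 0.85 threshold. The engines are passed in as plain callables so the sketch stays independent of any real OCR library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Page:
    page_id: str
    image: bytes


@dataclass(frozen=True)
class OcrResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the engine


OcrEngine = Callable[[Page], OcrResult]

CONFIDENCE_FLOOR = 0.85  # assumed escalation threshold


def run_ocr(
    page: Page,
    bulk_engine: OcrEngine,
    llm_engine: OcrEngine,
    floor: float = CONFIDENCE_FLOOR,
) -> Tuple[str, OcrResult]:
    """Return (engine_name, result), escalating only low-confidence pages."""
    first = bulk_engine(page)
    if first.confidence >= floor:
        return "bulk", first
    second = llm_engine(page)
    # Keep the bulk result if the escalation did not actually improve on it.
    if second.confidence >= first.confidence:
        return "llm", second
    return "bulk", first
```

Escalating only on low confidence keeps the more expensive LLM-based stage off the bulk of the volume.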
Medical Records Parser's clinical-extraction layer uses a fine-tuned Llama 3.1 70B variant trained on the group's de-identified clinical-document corpus (approximately 240,000 documents across the language mix). The structured-extraction output is a unified clinical-fact schema with provenance back to the source document, the extraction confidence and the language of the source content.
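A clinical-fact record of the kind described might look like the following. The field names, the `Provenance` shape and the validation rules are hypothetical; only the ingredients (fact type, source-document provenance, extraction confidence, source language) come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

FACT_TYPES = {
    "diagnosis", "procedure", "medication",
    "allergy", "lab_value", "imaging_finding",
}


@dataclass(frozen=True)
class Provenance:
    document_id: str   # identifier of the source document
    page: int
    char_start: int    # span of the supporting text on that page
    char_end: int


@dataclass(frozen=True)
class ClinicalFact:
    patient_id: str
    fact_type: str
    text: str                    # phrase as written in the source
    code_system: Optional[str]   # e.g. the SNOMED CT system URI, if mapped
    code: Optional[str]
    confidence: float            # extraction confidence, 0.0 to 1.0
    language: str                # language tag of the source content
    provenance: Provenance

    def __post_init__(self) -> None:
        if self.fact_type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {self.fact_type}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


def to_json(fact: ClinicalFact) -> str:
    """Serialise a fact, keeping Indic-script text readable."""
    return json.dumps(asdict(fact), ensure_ascii=False, sort_keys=True)
```

Carrying provenance on every fact is what allows a clinician to follow a structured value back to the scanned source page.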
The unified longitudinal record is held in a combination of structured (PostgreSQL) and unstructured (S3-equivalent object store) storage, with a full-text and semantic search index across both. Search uses a hybrid approach: classical BM25 for the structured-content queries (lab values, diagnoses, medications) and embedding-based semantic search for the unstructured-content queries (clinical narrative, prior-encounter context).
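The two retrieval modes can be sketched in a few lines of standard-library Python. This is an illustration of the techniques named above, not the platform's search code: the whitespace tokeniser, the `k1` and `b` defaults, and the assumption that embeddings arrive as precomputed vectors are all choices made for the example.

```python
import math
from collections import Counter
from typing import List, Sequence


def bm25_scores(query: str, docs: Sequence[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25."""
    if not docs:
        return []
    tokenised = [d.lower().split() for d in docs]
    n_docs = len(tokenised)
    avgdl = (sum(len(t) for t in tokenised) / n_docs) or 1.0
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * c for a, c in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(c * c for c in v))
    return dot / (nu * nv) if nu and nv else 0.0


def semantic_scores(query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]) -> List[float]:
    """Score precomputed document embeddings against a query embedding."""
    return [cosine(query_vec, v) for v in doc_vecs]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring documents."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

A query router would call `bm25_scores` for lab-value, diagnosis and medication lookups and `semantic_scores` for narrative queries, then pass either result to `top_k`.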
Integration with the EHRs uses HL7 FHIR for the commercial EHRs and custom adapters for the in-house builds. The platform writes the cross-EHR longitudinal-record view back to each EHR via the EHR's standard clinical-document import APIs, so clinicians see the unified record inside their familiar workflow.
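For the FHIR-capable EHRs, the write-back payload could take the form of a FHIR R4 DocumentReference carrying the unified-record summary as an attachment. The sketch below only builds the resource; the `build_document_reference` helper, the description text and the plain-text summary format are assumptions, and each EHR's actual import endpoint and required profile are not described in the case study.

```python
import base64
import json
from typing import Any, Dict


def build_document_reference(patient_id: str, summary_text: str, language: str = "en") -> Dict[str, Any]:
    """Build a minimal FHIR R4 DocumentReference with an inline attachment."""
    # FHIR attachments carry inline content as base64-encoded bytes.
    data = base64.b64encode(summary_text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        "subject": {"reference": f"Patient/{patient_id}"},
        "description": "Unified longitudinal record summary",  # hypothetical label
        "content": [
            {
                "attachment": {
                    "contentType": "text/plain; charset=utf-8",
                    "language": language,
                    "data": data,
                }
            }
        ],
    }


if __name__ == "__main__":
    resource = build_document_reference("12345", "Allergy: penicillin (documented)")
    print(json.dumps(resource, indent=2))
```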
The numbers behind the story
Approximately 8 million legacy records have been digitised and integrated into the unified longitudinal record. Combined with the cross-EHR integration of the live record-flow, the platform now provides a unified clinical view across the group's full patient population.
Field-level extraction accuracy on the legacy records is 96.8%, as measured by the group's clinical-informatics quality-assurance sampling. Extraction quality is broadly consistent across the 14 supported languages, with the smaller regional Indian languages showing only modest accuracy degradation relative to English and Hindi.
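Field-level accuracy of this kind is typically the share of sampled fields where the extracted value matches the reviewer's value. The sketch below assumes a simple case-insensitive exact match and a hypothetical `SampledField` record; the group's actual QA protocol and matching rules are not described in the case study.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class SampledField(NamedTuple):
    language: str
    extracted: str   # value produced by the extraction pipeline
    reviewed: str    # value recorded by the human reviewer


def field_accuracy(sample: Iterable[SampledField]) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, accuracy per language) for a QA sample."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for field in sample:
        totals[field.language] += 1
        if field.extracted.strip().casefold() == field.reviewed.strip().casefold():
            correct[field.language] += 1
    per_language = {lang: correct[lang] / totals[lang] for lang in totals}
    n = sum(totals.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_language
```

Reporting the per-language breakdown alongside the overall figure is what makes any degradation on the smaller languages visible.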
Record-lookup latency for clinicians is under 2 seconds at the 95th percentile (p95), including the cross-encounter aggregation. The clinical-workflow integration has driven adoption: clinicians now use the unified longitudinal record in more than 80% of in-scope encounters, a level reached within twelve months of in-hospital go-live.
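For readers less familiar with percentile latency targets, the p95 figure is the value below which 95% of lookups complete. A minimal nearest-rank calculation is sketched here; the `percentile` helper and the choice of the nearest-rank method are assumptions, since the case study does not say how the figure was computed.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(n * pct / 100), avoiding floating-point rounding.
    rank = (len(ordered) * pct + 99) // 100
    return ordered[rank - 1]


def meets_latency_target(latencies_ms: Sequence[float], target_ms: float = 2000.0) -> bool:
    """True when the p95 lookup latency is under the target."""
    return percentile(latencies_ms, 95) < target_ms
```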
Clinical-quality outcomes have followed. The group's clinical-quality team has measured material reductions in duplicate diagnostic testing (the prior record now shows the recent test result), in allergy-and-contraindication misses (the prior record now surfaces the documented allergy), and in chronic-condition management gaps (the cross-encounter view surfaces gaps the single-encounter view did not).
An unexpected outcome: the structured longitudinal record has become the foundation for the group's research arm. The unified record now supports the group's first cross-hospital clinical-research studies, with the structured clinical facts providing a research-grade longitudinal cohort that the previous fragmented records estate did not support.
“Our clinicians were treating patients without access to the patient's prior care record across our own hospitals. That is a patient-safety problem we knew we had to solve. MindMap delivered the unified longitudinal record across the group in eighteen months, with the legacy paper archive ingested and our fourteen-language clinical content searchable. Our clinical-quality team has measured real safety-and-quality improvements.”
Chief Medical Officer, Indian Multi-Specialty Hospital Group
Why MindMap was chosen
The group had previously evaluated two global medical-records vendors and one regional EHR vendor for the unified-record use case. The global vendors lacked the multi-language depth that the Indian context required; the regional EHR vendor proposed a wholesale EHR migration that the group's CIO considered operationally unacceptable.
MindMap's accelerator-composition approach, which augments the group's existing EHR estate with a unified-record platform rather than replacing it, combined with the 14-language clinical-extraction capability, was the structural differentiator. We could demonstrate the multi-language extraction on the group's own sample records during the bid.
Our embedded clinical-informatics expertise on the delivery team (three Indian-trained clinical-informaticists and a former hospital CMIO) was a further factor. The group's CMO felt that the team understood the clinical-workflow and clinical-quality realities of Indian hospital care, not just the AI technology.