Document extraction pipeline that converts unstructured inputs into validated structured data.
Smart OCR
OCR + LLM pipeline for extracting structured information from unstructured documents with validation built in.
Automate document extraction without leaving quality control to manual review alone.
Casefile Summary
Internal operations that need document data without repeating manual extraction and review work.
High reported accuracy was not enough unless downstream automation could trust the extracted schema.
To turn document handling into a repeatable backend workflow instead of a partially manual validation task.
Problem
Documents arrived in inconsistent formats and still needed structured data on the other side. Raw OCR alone did not make the workflow safe enough for automation.
Architecture
The workflow keeps OCR, language extraction and schema validation as separate steps so failures stay visible and recoverable.
Document -> OCR layer -> LLM extraction -> Schema validation
   |            |              |                   |
   v            v              v                   v
 source     raw text   structured draft     trusted payload
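The stage separation above can be sketched as a small runner that records which stage a failure came from. This is a minimal illustration, not the repository's implementation: `run_pipeline`, `StageResult` and the three stand-in stage functions are hypothetical names, and the "OCR" and "LLM" steps are replaced by trivial text handling so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    ok: bool
    stage: str                      # "ocr", "extraction", "validation" or "done"
    payload: Optional[dict] = None
    error: Optional[str] = None


def run_pipeline(
    content: bytes,
    ocr: Callable[[bytes], str],
    extract: Callable[[str], dict],
    validate: Callable[[dict], dict],
) -> StageResult:
    """Run the three stages in order and report which one failed, if any."""
    stage = "ocr"
    try:
        text = ocr(content)
        stage = "extraction"
        draft = extract(text)
        stage = "validation"
        payload = validate(draft)
    except Exception as exc:  # any stage error is attributed to the current stage
        return StageResult(ok=False, stage=stage, error=str(exc))
    return StageResult(ok=True, stage="done", payload=payload)


# Stand-in stages (assumptions for illustration only).
def fake_ocr(content: bytes) -> str:
    text = content.decode("utf-8", errors="ignore").strip()
    if not text:
        raise ValueError("OCR produced no text")
    return text


def fake_extract(text: str) -> dict:
    # Stand-in for the model call: read "key: value" lines into a draft.
    draft = {}
    for line in text.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            draft[key.strip().lower()] = value.strip()
    if not draft:
        raise ValueError("model returned no fields")
    return draft


def fake_validate(draft: dict) -> dict:
    missing = [name for name in ("invoice_id", "total") if name not in draft]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return {"invoice_id": draft["invoice_id"], "total": float(draft["total"])}


result = run_pipeline(b"invoice_id: A-1\ntotal: 12.50", fake_ocr, fake_extract, fake_validate)
```

Because each failure carries its stage name, an empty scan and a malformed model answer show up as different problems instead of one generic extraction error.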
Key Decisions
- Separate OCR failure from model failure.
- Validate extracted payloads with a typed schema before automation continues.
- Treat accuracy as useful only when the next system can trust the result.
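As a rough sketch of what a typed schema check can look like, the example below coerces a model-produced dictionary into a frozen record and rejects anything that does not fit. The public stack lists Pydantic; this version uses only the standard library so it stays self-contained, and the `InvoiceRecord` fields and `validate_draft` helper are hypothetical, not taken from the repository.

```python
import math
from dataclasses import dataclass, fields
from datetime import date


@dataclass(frozen=True)
class InvoiceRecord:
    # Hypothetical schema: field names are illustrative only.
    invoice_id: str
    issued_on: date
    total: float


def validate_draft(draft: dict) -> InvoiceRecord:
    """Return a typed record, or raise ValueError if the draft does not fit."""
    expected = {f.name for f in fields(InvoiceRecord)}
    present = set(draft)
    missing, extra = expected - present, present - expected
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} unexpected={sorted(extra)}")

    invoice_id = str(draft["invoice_id"]).strip()
    if not invoice_id:
        raise ValueError("invoice_id must not be empty")

    try:
        issued_on = date.fromisoformat(str(draft["issued_on"]))
        total = float(draft["total"])
    except (TypeError, ValueError) as exc:
        raise ValueError(f"bad field value: {exc}") from exc

    if not math.isfinite(total) or total < 0:
        raise ValueError("total must be a finite, non-negative number")

    return InvoiceRecord(invoice_id=invoice_id, issued_on=issued_on, total=total)


record = validate_draft({"invoice_id": " A-1 ", "issued_on": "2024-03-01", "total": "12.50"})
```

The point of the check is that downstream code receives either a fully typed record or an explicit error, never a partially filled dictionary.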
Operational Flow
Documents enter the pipeline
Source files arrive with inconsistent structure, formatting and text quality.
OCR and model layers run separately
OCR produces text, then the model turns that text into a structured candidate output.
Schema checks gate the workflow
Typed validation catches malformed or incomplete results before they become business data.
Trusted payloads move downstream
Only validated extractions continue into the systems that consume document data.
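The gating step in this flow could be sketched as a function that splits candidate outputs into trusted payloads and held items with a recorded reason. The names `gate` and `require_fields`, the field names, and the idea of a separate held list are assumptions made for illustration; the source only states that validated extractions continue downstream.

```python
from typing import Callable, Iterable, List, Tuple


def gate(
    drafts: Iterable[dict],
    validate: Callable[[dict], dict],
) -> Tuple[List[dict], List[dict]]:
    """Split drafts into trusted payloads and held items with a reason."""
    trusted, held = [], []
    for draft in drafts:
        try:
            trusted.append(validate(draft))
        except ValueError as exc:
            # Keep the rejected draft and the reason so the failure stays visible.
            held.append({"draft": draft, "reason": str(exc)})
    return trusted, held


def require_fields(draft: dict) -> dict:
    # Hypothetical rule: both fields must be present and non-empty.
    for name in ("document_id", "amount"):
        if draft.get(name) in (None, ""):
            raise ValueError(f"missing or empty field: {name}")
    return draft


trusted, held = gate(
    [{"document_id": "D1", "amount": 10}, {"document_id": "", "amount": 5}],
    require_fields,
)
```

Only the `trusted` list would be handed to consuming systems; the `held` list gives a reviewer the rejected draft together with the rule it broke.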
Repository Snippet
From the public API router. This is the actual shape of the extraction surface.
@router.post("/extract", response_model=ExtractionResponse)
async def extract_document(file: UploadFile = File(...)) -> ExtractionResponse:
    content = await file.read()
    if not content:
        raise HTTPException(status_code=400, detail="Uploaded file is empty")
    return pipeline.run(content=content, filename=file.filename)
Decision Record
Manual review reduced risk, but it also blocked scale and made extraction throughput expensive.
Combine OCR and LLM extraction with schema validation instead of trusting raw text extraction alone.
More validation logic to maintain, but much lower chance of silent bad data entering downstream workflows.
The pipeline stayed automation-first while keeping failures visible enough to correct.
Result
The system reached more than 98 percent accuracy on the target workflow while keeping the process automation-first.
Production Signals
- Structured validation reduced silent extraction failures.
- OCR and model output were treated as separate failure surfaces.
- Accuracy mattered only if downstream automation could trust the result.
- Document handling moved closer to a real backend workflow and away from a review-heavy manual step.
Public Repository Evidence
This workflow also exists as a public technical artifact: smart-document-extractor.
- Verified public stack: Python, FastAPI, Pydantic, Docker.
- Verified public endpoints include GET /api/health and POST /api/extract.
- The public README documents the same OCR -> LLM -> validation pipeline described here.
Operational Readout
The public repository is useful evidence because it shows the extraction flow as an actual API surface, not just as portfolio copy.