Case 003

Smart OCR

OCR + LLM pipeline for extracting structured information from unstructured documents with validation built in.

Mission

Automate document extraction without leaving quality control to manual review alone.

Current Stack
Python, OpenAI, OCR, Pydantic

Casefile Summary

System Role

Document extraction pipeline that converts unstructured inputs into validated structured data.

Primary Users

Internal operations teams that need document data without repeating manual extraction and review work.

Operational Constraint

High reported accuracy was not enough unless downstream automation could trust the extracted schema.

Why It Exists

To turn document handling into a repeatable backend workflow instead of a partially manual validation task.

Problem

Documents arrived in inconsistent formats, but downstream systems still needed structured data from them. Raw OCR alone did not make the workflow safe enough for automation.

Architecture

The workflow keeps OCR, LLM extraction and schema validation as separate steps so failures stay visible and recoverable.

Document -> OCR layer -> LLM extraction -> Schema validation
    |             |              |                   |
    v             v              v                   v
 source       raw text      structured draft     trusted payload
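
The separation in the diagram above can be sketched as three injected stages, each with its own error type, so a caller can tell which layer failed. This is a minimal illustration under stated assumptions, not the repository's code: `OCRError`, `ExtractionError`, `SchemaError`, `run_pipeline` and the stub stages are hypothetical names.

```python
from typing import Callable


class OCRError(Exception):
    """The OCR layer produced no usable text."""


class ExtractionError(Exception):
    """The model layer did not return a structured draft."""


class SchemaError(Exception):
    """The structured draft failed schema validation."""


def run_pipeline(
    document: bytes,
    ocr: Callable[[bytes], str],
    extract: Callable[[str], dict],
    validate: Callable[[dict], dict],
) -> dict:
    """Run the three stages in order, tagging each failure with its stage."""
    raw_text = ocr(document)
    if not raw_text.strip():
        raise OCRError("OCR returned empty text")

    try:
        draft = extract(raw_text)
    except ValueError as exc:  # e.g. the model replied with unparseable text
        raise ExtractionError(str(exc)) from exc

    try:
        return validate(draft)
    except (KeyError, TypeError, ValueError) as exc:
        raise SchemaError(str(exc)) from exc


# Hypothetical stand-ins for the real OCR, model and schema layers.
def stub_ocr(document: bytes) -> str:
    return document.decode("utf-8", errors="ignore")


def stub_extract(text: str) -> dict:
    key, _, value = text.partition(":")
    if not value:
        raise ValueError("no 'key: value' pair found")
    return {key.strip(): value.strip()}


def stub_validate(draft: dict) -> dict:
    if "total" not in draft:
        raise KeyError("total")
    return draft
```

Because each stage raises a distinct exception, a retry policy or alert can target the OCR layer, the model layer or the schema independently instead of treating every failure as one generic extraction error.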

Key Decisions

  • Separate OCR failure from model failure.
  • Validate extracted payloads with a typed schema before automation continues.
  • Treat accuracy as useful only when the next system can trust the result.

System Diagram

flowchart LR
    A["Uploaded PDF / Image"] --> B["OCR Engine"]
    B --> C["Prompt Builder"]
    C --> D["LLM Engine"]
    D --> E["Pydantic Validation"]
    E --> F["Trusted JSON Output"]

Operational Flow

01 Intake

Documents enter the pipeline

Source files arrive with inconsistent structure, formatting and text quality.

02 Extraction

OCR and model layers run separately

OCR produces text, then the model turns that text into a structured candidate output.
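
The hand-off from OCR text to a structured candidate could look roughly like the sketch below. The model call is left as an injected callable (`complete`), so no particular client API is assumed. `build_prompt`, `parse_candidate`, `extract` and the field names are hypothetical and do not come from the repository.

```python
import json
from typing import Callable

FIELDS = ("invoice_number", "issue_date", "total")  # hypothetical target fields


def build_prompt(raw_text: str) -> str:
    """Wrap OCR text in an instruction asking for a JSON object only."""
    field_list = ", ".join(FIELDS)
    return (
        "Extract the following fields from the document text and reply with "
        f"a single JSON object using exactly these keys: {field_list}.\n"
        "Use null for any field that is not present.\n\n"
        f"Document text:\n{raw_text}"
    )


def parse_candidate(reply: str) -> dict:
    """Turn the model reply into a dict; raise ValueError if it is not a JSON object."""
    candidate = json.loads(reply)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(candidate, dict):
        raise ValueError("model reply is not a JSON object")
    return candidate


def extract(raw_text: str, complete: Callable[[str], str]) -> dict:
    """OCR text in, unvalidated structured draft out."""
    return parse_candidate(complete(build_prompt(raw_text)))
```

The result is still only a draft at this point: it is well-formed JSON, but nothing has yet checked that the fields are present or correctly typed.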

03 Validation

Schema checks gate the workflow

Typed validation catches malformed or incomplete results before they become business data.
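
A typed gate of this kind might be sketched with Pydantic as follows. The `InvoiceDraft` schema and the `validate_draft` helper are hypothetical, since the casefile does not show the real fields. The sketch only illustrates how a missing or mistyped field is caught before the draft is treated as business data.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class InvoiceDraft(BaseModel):
    """Hypothetical schema; the real fields are not shown in this casefile."""

    invoice_number: str
    total: float
    currency: Optional[str] = None


def validate_draft(draft: dict):
    """Return (payload, errors): a model instance on success, else a list of error dicts."""
    try:
        return InvoiceDraft(**draft), []
    except ValidationError as exc:
        return None, exc.errors()


# A complete draft passes; by default Pydantic coerces the numeric string to a float.
payload, errors = validate_draft({"invoice_number": "A-17", "total": "120.50"})

# A draft without "total" is rejected, and the error names the offending field.
rejected, problems = validate_draft({"invoice_number": "A-17"})
```

Returning the error list instead of raising keeps the failure inspectable, which fits the goal of making bad extractions visible and not silent.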

04 Automation

Trusted payloads move downstream

Only validated extractions continue into the systems that consume document data.
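
One way to picture this gate is a small router that forwards validated payloads and parks rejected drafts for correction. The `Gate` class and its in-memory lists are hypothetical stand-ins: the casefile does not describe how rejected documents are actually stored or reviewed.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Gate:
    """Route each draft to the downstream queue or to a review queue."""

    # Returns the trusted payload, or None when the draft is rejected.
    validate: Callable[[dict], Optional[dict]]
    downstream: list = field(default_factory=list)
    review: list = field(default_factory=list)

    def submit(self, draft: dict) -> bool:
        """Return True if the draft was forwarded downstream."""
        payload = self.validate(draft)
        if payload is None:
            self.review.append(draft)  # keep the failure visible for correction
            return False
        self.downstream.append(payload)
        return True


# Hypothetical validator: accept only drafts that carry a "total" field.
gate = Gate(validate=lambda d: d if "total" in d else None)
gate.submit({"invoice_number": "A-17", "total": 120.5})
gate.submit({"invoice_number": "A-18"})
```

After these two calls, the first draft sits in `gate.downstream` and the second in `gate.review`, so nothing unvalidated reaches the consuming systems.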

Repository Snippet

From the public API router. This is the actual shape of the extraction surface.

@router.post("/extract", response_model=ExtractionResponse)
async def extract_document(file: UploadFile = File(...)) -> ExtractionResponse:
    content = await file.read()
    if not content:
        raise HTTPException(status_code=400, detail="Uploaded file is empty")
    return pipeline.run(content=content, filename=file.filename)

Decision Record

Context

Manual review reduced risk, but it also blocked scale and made extraction throughput expensive.

Decision

Combine OCR and LLM extraction with schema validation instead of trusting raw text extraction alone.

Tradeoff

More validation logic to maintain, but much lower chance of silent bad data entering downstream workflows.

Result

The pipeline stayed automation-first while keeping failures visible enough to correct.

Result

The system reached more than 98 percent accuracy on the target workflow while keeping the process automation-first.

Production Signals

  • Structured validation reduced silent extraction failures.
  • OCR and model output were treated as separate failure surfaces.
  • Accuracy mattered only if downstream automation could trust the result.
  • Document handling moved closer to a real backend workflow than to a review-heavy manual step.

Public Repository Evidence

This workflow also exists as a public technical artifact: smart-document-extractor.

  • Verified public stack: Python, FastAPI, Pydantic, Docker.
  • Verified public endpoints include GET /api/health and POST /api/extract.
  • The public README documents the same OCR -> LLM -> validation pipeline described here.

Operational Readout

The public repository is useful evidence because it shows the extraction flow as an actual API surface, not just as portfolio copy.
