The gap between character recognition and document understanding is architectural, not a question of model accuracy. This piece walks through the mechanisms — vision encoders, layout-aware token grounding, entity and reference resolution, decomposed confidence scoring — that close it.
OCR is a character-recognition system. It maps pixel patterns to Unicode strings and emits a linearized text stream. That is a useful primitive, and for two decades it has been the foundation of most contract digitization pipelines. But character recognition is not document understanding, and the distance between the two is not a question of better OCR — it is a question of whether the system has any structural representation of the document at all.
A contract is a hierarchical, cross-referenced, context-dependent artifact. Defined terms in section 1 modify the meaning of clauses two hundred pages later. Tables encode semantic relationships between column headers and cell values. Exhibits are incorporated by reference and stored as separate attachments. Handwritten initials on amendment pages carry legal weight. None of this structure survives a text-only extraction pass.
The interesting question is not whether multi-modal systems are "better" at OCR. They are solving a different problem — joint reasoning over visual layout and textual content — and emitting a fundamentally different kind of output.
In this context, a multi-modal model is a vision-language model (VLM) that processes rendered page images and textual content through a shared embedding space. The architectural pattern is consistent across current generations of these models: a vision encoder produces visual tokens, a learned projection maps those tokens into the language model's embedding space, and a decoder generates structured output conditioned on both modalities simultaneously.
The vision encoder — typically a ViT or ConvNeXt backbone — tokenizes each page into a grid of patches, commonly 14×14 or 16×16 pixels. Each patch carries a 2D positional embedding that preserves its location on the page. These visual tokens are projected into the language model's embedding space through a learned adapter, often a cross-attention layer or a simple MLP projection.
The practical consequence is what makes this approach structurally different from OCR-plus-NLP. When the model attends to the token representing "Net 30" in a payment table, it can simultaneously attend to the visual tokens representing the column header three rows above, the row label to its left, and the table border that defines the cell's scope. That spatial context is unavailable to any pipeline that has already flattened the document into a text stream.
The end-to-end pipeline runs in four stages: page rendering and visual tokenization, model-side extraction into a constrained schema, deterministic post-processing for entity and reference resolution, and confidence scoring across multiple signals. Each stage is worth describing in isolation, because they have different reliability properties and different failure modes.
Documents (Word, PDF, Spreadsheets, etc.) are rendered to images at extraction-grade resolution (typically 150–200 DPI for clause-heavy pages, higher for handwritten or low-contrast content). Pages are encoded independently, with page-index embeddings concatenated so that the model can later reason about page ordering. Bounding-box coordinates for every patch are retained in a side channel, which is what makes source-citation and human review tractable later in the pipeline.
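As a rough illustration of the side channel described above, the sketch below records a pixel bounding box for every patch on a rendered page. The names `PatchRef` and `patch_side_channel` are hypothetical, the 16-pixel patch size is one of the common values mentioned earlier, and any resizing or pooling a particular encoder applies is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchRef:
    page: int                         # page number in the source document
    index: int                        # row-major patch index within the page
    bbox: tuple[int, int, int, int]   # (x0, y0, x1, y1) in page pixels


def patch_side_channel(page: int, width_px: int, height_px: int,
                       patch: int = 16) -> list[PatchRef]:
    """Return one PatchRef per patch, clipping edge patches to the page."""
    cols = -(-width_px // patch)      # ceiling division
    rows = -(-height_px // patch)
    refs = []
    for r in range(rows):
        for c in range(cols):
            x0, y0 = c * patch, r * patch
            x1 = min(x0 + patch, width_px)
            y1 = min(y0 + patch, height_px)
            refs.append(PatchRef(page, r * cols + c, (x0, y0, x1, y1)))
    return refs


# US Letter (8.5 x 11 in) rendered at 150 DPI is 1275 x 1650 pixels.
refs = patch_side_channel(page=14, width_px=1275, height_px=1650)
```

Keeping this table outside the model means a citation can be resolved to pixels later without asking the model to reproduce coordinates.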
The VLM is prompted with an extraction schema — a JSON schema that declares the fields to extract, their types, and their cardinality. The model emits a constrained JSON object that conforms to that schema. Constrained decoding (using grammar-based or logit-mask techniques) guarantees structural validity at generation time; the values themselves are still subject to the model's epistemic uncertainty, which is what confidence scoring exists to quantify.
Span citations are emitted alongside each value: every extracted field carries a reference to the page, bounding box, and surface form from which it was extracted. This is non-negotiable for legal review, because a value with no provenance is a value a reviewer cannot verify.
Graph construction happens as a deterministic post-processing step over the model's output. This separation matters: the probabilistic component is bounded to the extraction itself, while reference resolution runs through auditable rule-based logic that a reviewer can step through.
Entity resolution clusters surface forms that refer to the same entity — "the Company," "Acme Corp.," "Acme" — using a combination of string similarity, definition-section anchoring, and contextual embeddings. Reference resolution then walks the document and binds each defined-term invocation to its canonical definition node, each "as set forth in Section X.Y" to the target clause, and each exhibit reference to its attachment. The output is a typed graph: nodes for parties, defined terms, clauses, obligations, dates, and monetary values; edges for defines, references, governs, obligates, and triggers.
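The reference-resolution step can be sketched as a small rule-based pass. This is a minimal illustration, assuming clauses are already keyed by section number; the regular expression, the `resolve_section_refs` name, and the tuple-based edge format are all hypothetical, and a real pipeline would also handle exhibits, schedules, and defined terms.

```python
import re

# Matches "Section 12.4", "Section 8.2.1", and similar numeric references.
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)")


def resolve_section_refs(clauses: dict[str, str]):
    """Bind each 'Section X.Y' mention to its target clause.

    Returns (edges, unresolved): edges are (source, "references", target)
    triples; unresolved lists (source, target) pairs with no matching clause.
    """
    edges, unresolved = [], []
    for source, text in clauses.items():
        for match in SECTION_REF.finditer(text):
            target = match.group(1)
            if target in clauses:
                edges.append((source, "references", target))
            else:
                unresolved.append((source, target))
    return edges, unresolved


clauses = {
    "8.2": "Either party may terminate, subject to the limitations in Section 12.4.",
    "12.4": "Liability is capped as set forth in Section 12.5.",
}
edges, unresolved = resolve_section_refs(clauses)
```

Returning unresolved references explicitly, rather than dropping them, gives a reviewer a concrete list of dangling citations to check.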
A composite confidence score is computed per field from multiple independent signals, each of which fails differently. The decomposition is described in detail later in this piece.
Every extracted field carries four pieces of metadata that are non-negotiable for legal review: a normalized value, the original surface form, full provenance back to source pixels, and a decomposed confidence score. A representative field looks like this:
```json
{
  "field": "termination_for_convenience.notice_period",
  "value": "P90D",
  "surface_form": "ninety (90) days' prior written notice",
  "provenance": {
    "document_id": "msa-2024-acme-v3.pdf",
    "page": 14,
    "bbox": [142, 318, 487, 341],
    "clause_path": "§8.2(b)"
  },
  "confidence": {
    "extraction": 0.96,
    "schema_validation": 1.0,
    "cross_reference_consistency": 0.92
  },
  "alternatives": []
}
```

Three things matter in this structure. Values are normalized, so "ninety (90) days" becomes the ISO 8601 duration P90D, while the original surface form is preserved verbatim for audit. Provenance includes the exact bounding box, which is what lets reviewer UIs jump-cite a value back to its source pixels. Confidence is decomposed across independent signals rather than collapsed into a single number.
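The normalization from surface form to ISO 8601 duration might look like the following sketch. It assumes the common drafting convention of a parenthesized numeral followed by a unit; the `normalize_duration` helper and its unit table are hypothetical, and anything that does not fit the pattern returns `None` rather than a guess.

```python
import re
from typing import Optional

_UNIT_CODES = {"day": "D", "week": "W", "month": "M", "year": "Y"}
_NUMERAL_AND_UNIT = re.compile(r"\((\d+)\)\s*(day|week|month|year)s?", re.IGNORECASE)


def normalize_duration(surface_form: str) -> Optional[str]:
    """Convert e.g. "ninety (90) days' notice" to an ISO 8601 duration."""
    match = _NUMERAL_AND_UNIT.search(surface_form)
    if match is None:
        return None  # leave unparseable text for review instead of guessing
    count = int(match.group(1))
    unit = match.group(2).lower()
    return f"P{count}{_UNIT_CODES[unit]}"


value = normalize_duration("ninety (90) days' prior written notice")
```

The surface form is passed in unchanged and never overwritten, so the verbatim text remains available for audit alongside the normalized value.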
A standard commercial MSA with exhibits frequently runs 150–400 pages. Even with efficient vision encoders, the full document does not fit in a single context window: a 200-page document rendered at extraction-grade resolution produces on the order of 100K–200K visual tokens before any text tokens are added. The extraction pipeline handles this through hierarchical processing rather than naïve chunking.
The first pass runs a lightweight layout-classification model over every page to produce a document map: section boundaries, table locations, signature pages, exhibit boundaries. This map is cheap to compute and is the routing substrate for everything that follows.
The second pass routes pages to specialized extractors based on the map. Table-heavy pages go through a table-extraction prompt with the relevant column-header context preserved. Narrative clause pages go through a clause-extraction prompt with the surrounding section context. Signature pages go through a different prompt entirely, because the extraction targets — names, titles, dates, initials — are spatially organized rather than sequentially organized.
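The routing step can be sketched as a lookup from page type to extractor. The page-type labels, extractor names, and the `manual_triage` fallback for unrecognized page types are all assumptions made for illustration; the source does not specify how unclassified pages are handled.

```python
from collections import defaultdict

# Hypothetical mapping from layout class to extraction prompt.
ROUTES = {
    "table": "table_extraction",
    "clause": "clause_extraction",
    "signature": "signature_extraction",
}


def route_pages(document_map: dict[int, str]) -> dict[str, list[int]]:
    """Group page numbers by the extractor that should process them."""
    batches = defaultdict(list)
    for page, page_type in sorted(document_map.items()):
        extractor = ROUTES.get(page_type, "manual_triage")  # assumed fallback
        batches[extractor].append(page)
    return dict(batches)


batches = route_pages({3: "clause", 5: "clause", 9: "cover",
                       187: "table", 212: "signature"})
```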
The third pass operates on the document map and the accumulated extraction graph rather than the raw pages. This is where cross-page references get resolved: "as set forth in Schedule 2" gets bound to the actual table extracted from page 187, and "subject to the limitations in Section 12.4" gets bound to the cap-on-liability clause. Running this as a separate pass over structured data is what keeps cross-page resolution tractable; attempting to resolve these references during per-page extraction would require keeping the entire document in context, which is the constraint the hierarchical approach exists to avoid.
Confidence in a well-designed extraction pipeline is not a single number. It is a composite signal aggregated from several independent sources, each of which catches a different class of error.
The geometric mean of token probabilities across an extracted value span (equivalently, the exponential of the mean token logprob) gives a baseline measure of how concentrated the model's output distribution was. This catches cases where the model was selecting between plausible alternatives, for example ambiguous date formats or numeric values where multiple OCR-plausible readings exist. On its own it is noisy, but as one component of a composite it carries useful signal.
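A minimal sketch of this signal, assuming the model API exposes a natural-log probability for each token in the value span. The geometric mean of the per-token probabilities is computed as the exponential of the mean logprob; the function name `span_confidence` is hypothetical.

```python
import math


def span_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of token probabilities over a value span."""
    if not token_logprobs:
        raise ValueError("span has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# One confident token and one uncertain token pull the span score down.
confident = span_confidence([math.log(0.99), math.log(0.98)])
ambiguous = span_confidence([math.log(0.9), math.log(0.4)])
```

Using the geometric rather than arithmetic mean keeps a single low-probability token from being averaged away in a long span.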
Schema validation is binary but high-signal. If the extracted value for a governing_law field doesn't match a known jurisdiction string, or a notice_period field can't be parsed as an ISO 8601 duration, the validator catches the failure before the value reaches a reviewer. This eliminates a class of errors that token-level confidence misses entirely — a model can be highly confident about a value that is nonetheless structurally invalid.
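The two checks named above can be sketched as follows. The jurisdiction list is a three-entry placeholder, the duration pattern is a simplified subset of ISO 8601 (integer components only), and `validate_field` is a hypothetical name rather than a real library call.

```python
import re

# Simplified ISO 8601 duration: integer components, at least one present.
ISO_DURATION = re.compile(
    r"^P(?!$)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?(T(?=\d)(\d+H)?(\d+M)?(\d+S)?)?$"
)

# Illustrative placeholder; a real list would be maintained separately.
KNOWN_JURISDICTIONS = {"Delaware", "New York", "England and Wales"}


def validate_field(field: str, value: str) -> float:
    """Return 1.0 if the value is structurally valid for the field, else 0.0."""
    if field == "governing_law":
        return 1.0 if value in KNOWN_JURISDICTIONS else 0.0
    if field.endswith("notice_period"):
        return 1.0 if ISO_DURATION.match(value) else 0.0
    raise KeyError(f"no validator registered for {field!r}")


score = validate_field("termination_for_convenience.notice_period", "P90D")
```

Raising on an unregistered field, instead of silently returning 1.0, keeps a missing validator from being mistaken for a passing check.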
The strongest single signal of correctness is agreement across independent extractions of the same field. The effective date often appears in the preamble, in a signature block, and in a recitals section. The governing law appears in the governing-law clause and frequently in the dispute-resolution clause. When the same field is extractable from multiple locations, disagreement is a strong negative signal and agreement is a strong positive one. The pipeline runs this check after entity resolution and folds the result into the composite score.
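One simple way to score this agreement is the share of locations that support the most common value. The sketch below assumes values have already been normalized so that equal values compare equal; the function name and the choice to return `None` when fewer than two locations exist are illustrative assumptions.

```python
from collections import Counter
from typing import Optional


def cross_reference_consistency(values: list[str]) -> tuple[Optional[str], Optional[float]]:
    """Return (majority value, fraction of locations agreeing with it).

    With fewer than two locations there is nothing to corroborate,
    so the score is None rather than a misleading 1.0.
    """
    if not values:
        return None, None
    if len(values) < 2:
        return values[0], None
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Effective date as read from the preamble, signature block, and recitals.
value, agreement = cross_reference_consistency(
    ["2025-01-01", "2025-01-01", "2025-01-10"]
)
```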
For high-stakes fields — indemnification caps, change-of-control triggers, termination provisions, limitation-of-liability clauses — ensemble agreement runs the same extraction through multiple sampling temperatures or model configurations and measures the rate of agreement across runs. This is expensive and reserved for fields where the cost asymmetry of an undetected error justifies the additional inference, but it catches the subset of errors that single-pass extraction misses by definition.
The composite confidence score is what drives review routing. High-confidence fields flow to the system of record; medium-confidence fields go to spot-check review; low-confidence fields route to full human review with the model's top-k alternatives surfaced. Thresholds are calibrated per field type against a held-out labeled set, because the cost asymmetry of a wrong indemnification cap is not the cost asymmetry of a wrong notice address.
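The routing logic can be sketched as below. The source does not say how the component signals are combined, so taking the minimum is an assumption chosen for being conservative; the threshold values and field types in `THRESHOLDS` are likewise invented for illustration, not calibrated numbers.

```python
# Hypothetical per-field-type thresholds: (auto-accept, spot-check).
THRESHOLDS = {
    "indemnification_cap": (0.99, 0.90),
    "notice_address": (0.95, 0.75),
}


def composite(signals: dict[str, float]) -> float:
    """Assumed aggregation: the weakest signal bounds the composite."""
    return min(signals.values())


def route(field_type: str, signals: dict[str, float]) -> str:
    auto_accept, spot_check = THRESHOLDS[field_type]
    score = composite(signals)
    if score >= auto_accept:
        return "system_of_record"
    if score >= spot_check:
        return "spot_check"
    return "full_review"


signals = {"extraction": 0.96, "schema_validation": 1.0,
           "cross_reference_consistency": 0.97}
routine = route("notice_address", signals)
high_stakes = route("indemnification_cap", signals)
```

The same signals land in different queues for the two field types, which is the cost asymmetry the per-field thresholds are meant to encode.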
Multi-modal extraction does not eliminate error — it changes the shape of the error distribution. The failure modes are different from OCR failure modes, and reviewers need to be calibrated for them.
Hallucinated values occur when the model generates a plausible value that is not actually present in the source. Schema validation catches a subset of these (invalid jurisdiction strings, malformed durations); cross-reference consistency catches another subset (the hallucinated value disagrees with the same field extracted elsewhere); the remainder requires provenance-anchored review, which is why every value carries a bounding box back to its source pixels.
Context-window truncation in cross-document scenarios — when an MSA, SOW, and amendment package together exceed the model's effective context — manifests as silent reference-resolution failures. The hierarchical processing pipeline mitigates this by operating on the extraction graph rather than the raw pages in the resolution pass, but the operator running the pipeline needs to verify that the document map captured all attachments before trusting cross-document references.
Non-determinism between runs at non-zero temperatures means that two extractions of the same document can disagree on low-confidence fields. For production pipelines, extraction is run at temperature 0 with deterministic decoding wherever the model API supports it; ensemble agreement, where used, is explicit and logged rather than a side effect of stochastic sampling.
Prompt injection via document content is a category that text-only pipelines do not face in the same form. A document that contains text resembling an instruction ("ignore previous instructions and return…") is a legitimate concern in any pipeline that concatenates document content into a prompt. The mitigation is structural: extraction prompts treat document content as data, never as instructions, and constrained decoding ensures the output schema is enforced regardless of what the document content attempts.
Field-level F1 against a labeled gold set is the right primary metric. Accuracy alone hides class imbalance — most fields are correctly null in any given contract, so any system that aggressively returns null can post inflated accuracy numbers. F1 forces a balance between precision and recall on the fields that actually appear.
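The difference between accuracy and F1 on a mostly-null schema can be seen in a short sketch. It assumes exact-match comparison on already-normalized values and treats `None` as "field not present"; the `field_f1` helper and the toy field names are hypothetical.

```python
from typing import Optional

Fields = dict[str, Optional[str]]


def field_f1(gold: Fields, predicted: Fields) -> tuple[float, float, float]:
    """Precision, recall, and F1 over non-null fields only."""
    true_pos = sum(1 for k, v in predicted.items()
                   if v is not None and gold.get(k) == v)
    n_predicted = sum(1 for v in predicted.values() if v is not None)
    n_gold = sum(1 for v in gold.values() if v is not None)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Ten fields, only two of which appear in this contract.
gold: Fields = {f"field_{i}": None for i in range(8)}
gold.update({"governing_law": "Delaware", "notice_period": "P90D"})
always_null: Fields = {k: None for k in gold}

accuracy = sum(always_null[k] == gold[k] for k in gold) / len(gold)
precision, recall, f1 = field_f1(gold, always_null)
```

Here the always-null system scores 0.8 on accuracy while its F1 is 0.0, since it never returns a field that is actually present.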
The gold set itself needs to be constructed with care. Labels should be produced by at least two reviewers per document with adjudication on disagreement, because inter-annotator agreement on contract extraction is meaningfully below 1.0 even for experienced reviewers. Semantic-equivalence rules need to be encoded explicitly: "January 1, 2025", "01/01/2025", and "2025-01-01" should all count as matches for a date field, and an evaluation harness that compares them as raw strings will under-report the system's true accuracy.
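A semantic-equivalence rule for dates can be sketched with the standard library. The format list is an assumption: it reads slash-separated dates month-first (US convention) and relies on English month names, so a real rule set would need to make those choices per corpus. The helper names are hypothetical.

```python
from datetime import date, datetime
from typing import Optional

# Assumed formats; "%m/%d/%Y" reads 01/02/2025 as January 2.
_DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%m/%d/%Y")


def canonical_date(text: str) -> Optional[date]:
    """Parse a date in any accepted format, or return None."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def dates_match(a: str, b: str) -> bool:
    """True only if both strings parse and denote the same calendar date."""
    parsed_a, parsed_b = canonical_date(a), canonical_date(b)
    return parsed_a is not None and parsed_a == parsed_b


same = dates_match("January 1, 2025", "01/01/2025")
```

Two unparseable strings deliberately do not match each other, so a garbled extraction cannot score as correct against a garbled label.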
The gold set must also be quarantined from the prompts and few-shot examples used in production. Leakage between the evaluation set and the prompt content invalidates the metric, and this is easier to do accidentally than most teams realize — example contracts pulled from the same corpus, schema fields named after specific clauses that appear in the gold set, system prompts that paraphrase patterns from labeled documents. A disciplined evaluation pipeline keeps the gold set in a separate repository with access controls and tracks every prompt revision against the evaluation results so regressions are caught at the prompt level rather than at the model level.
Per-field thresholding is calibrated against the same gold set. For each field, the threshold for high-confidence routing is set such that the precision on above-threshold extractions meets the operational target — typically > 0.99 for fields flowing directly to the system of record. Below-threshold fields route to review at progressively heavier touch points. Re-calibration is a scheduled activity, not a one-time setup, because both the document distribution and the model behavior drift over time.
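Threshold selection for a single field can be sketched as a scan over gold-set results sorted by confidence. This assumes each gold-set extraction is reduced to a (composite score, was it correct) pair; `calibrate_threshold` is a hypothetical name, and the sketch returns the lowest score whose above-threshold precision meets the target, or `None` if no threshold does.

```python
from typing import Optional


def calibrate_threshold(scored: list[tuple[float, bool]],
                        target_precision: float = 0.99) -> Optional[float]:
    """Lowest threshold t such that precision over scores >= t meets the target."""
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    best = None
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        # Only evaluate once all extractions sharing this score are counted.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target_precision:
            best = score
    return best


gold_results = [(0.99, True), (0.97, True), (0.95, True),
                (0.90, False), (0.85, True)]
threshold = calibrate_threshold(gold_results, target_precision=1.0)
```

Rerunning this against a refreshed gold set is one concrete form the scheduled re-calibration can take.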
The substantive shift is not "AI replaces OCR." It is that the extraction system now has a structural representation of the document — visual, textual, and relational — that survives end-to-end through the pipeline. Provenance is preserved at the pixel level. Confidence is decomposed across independent signals. Cross-references are resolved as a separate, auditable pass. Failure modes are different from OCR failure modes and require different review calibration, but they are tractable failure modes because every output is anchored to its source.
For teams evaluating this transition, the concrete starting point is a calibration pilot: 200 contracts of a single agreement type, labeled to gold-set standards, run through both the current pipeline and a multi-modal pipeline, with field-level F1 measured separately for high-stakes and routine fields. The decision point is not "is the new pipeline better on average" — averages obscure exactly the asymmetries that matter — but "does it meet the operational precision target on the fields where errors are most costly, while maintaining recall on the fields where omissions are most costly." That measurement is the basis of every downstream decision about deployment, review workflow, and threshold calibration.