Document Processing May 2026 12 min read Deepak Patil

From Documents to Insight: How to Extract Structured Data (Word, PDF, Spreadsheets, etc.)

A technical walkthrough of the full extraction pipeline — partitioning, schema-constrained extraction, structure-aware chunking, and confidence-gated routing — with a worked example showing how a single contract clause flows from raw document bytes to a structured field.

Enterprise knowledge lives in PDFs. Contracts, financial reports, technical manuals, regulatory filings, invoices — the documents that drive decisions are almost universally stored in formats that resist programmatic access. The text is there. The structure is there. The intelligence is there. None of it is queryable, comparable, or monitorable until it has been extracted into structured form.

The extraction problem sounds straightforward: read the document, pull out the relevant data, store it somewhere useful. In practice it is one of the harder engineering problems in enterprise AI — and the gap between a naive implementation and a production-ready pipeline is wider than most teams expect when they start.

This article walks through that pipeline end to end: why pointing a capable model at a PDF and trusting it to produce structured output doesn't work, what each stage of a production pipeline actually does, and — most concretely — what a single contract clause looks like as it flows through the pipeline from PDF bytes to a normalized, provenance-anchored, confidence-scored field.

The goal is not to read documents. The goal is to turn documents into structured data that can be queried, compared, monitored, and acted on at portfolio scale. Those are different problems, and the second one is significantly harder.

The Document Parsing Problem

Modern vision-language models are genuinely capable. They handle large context windows. They produce reasonable structured output. They read text inside table cells correctly most of the time. So when teams start building document extraction pipelines, the natural instinct is to point a capable model at a stack of PDFs with a well-written prompt and see how far that gets them.

The answer, consistently, is: not far enough for production. The Unstructured team's SCORE-Bench evaluation (December 2025) is the clearest public articulation of why. SCORE-Bench is a benchmark of real-world enterprise documents — financial reports with nested tables, scanned forms, technical documentation with multi-column layouts — annotated by domain experts and designed specifically to expose the gap between systems that handle clean academic PDFs and systems that handle production document portfolios.

The hallucination–recall tradeoff

The central tension SCORE-Bench surfaces is what they call the hallucination vs. recall tradeoff: the balance between finding all the text a document contains (Percent Tokens Found) and inventing text that isn't there (Percent Tokens Added). These are independent failure modes. A system can have excellent recall by aggressively extracting every plausible token, at the cost of hallucinating spurious ones. A system can have excellent hallucination control by extracting conservatively, at the cost of missing real content.
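To make the two quantities concrete, here is a minimal sketch that scores an extraction against its source by token overlap. It is an illustration only: the function name, the whitespace tokenization, and the choice of denominators are assumptions, not SCORE-Bench's actual scoring code.

```python
from collections import Counter


def token_overlap_metrics(source: str, extracted: str) -> tuple[float, float]:
    """Return (share of source tokens found, share of extracted tokens added)."""
    src = Counter(source.lower().split())
    out = Counter(extracted.lower().split())
    matched = sum((src & out).values())  # multiset intersection
    found = matched / max(sum(src.values()), 1)
    added = (sum(out.values()) - matched) / max(sum(out.values()), 1)
    return found, added


source = "net revenue was 4.2 million in fiscal 2024"
conservative = "net revenue was 4.2 million"
aggressive = "net revenue was 4.2 million in fiscal 2024 up 12 percent"

print(token_overlap_metrics(source, conservative))  # found 5/8, added 0/5
print(token_overlap_metrics(source, aggressive))    # found 8/8, added 3/11
```

The conservative extraction invents nothing but misses three tokens; the aggressive one finds everything and adds three tokens that were never in the source. Neither number alone describes the system.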

The benchmark results show that pipelines built around capable models can land at very different points on this curve depending on how they're constructed. The lowest hallucination rate across all evaluated pipelines (Percent Tokens Added of 0.036) belongs to a purpose-built pipeline using a frontier model as its VLM backend. The same family of pipelines also achieves the second-highest recall (Percent Tokens Found of 0.924), so the low hallucination rate is not bought with poor coverage. Approaches that reach higher recall carry hallucination rates three to four times higher, a tradeoff that is unacceptable for downstream applications where false content is more damaging than missing content.

The takeaway isn't that one model is better than another. It's that the operating point on the recall/hallucination curve is determined by the pipeline architecture wrapped around the model — not by the model alone.

Where the gap is widest: tables and structure

Table extraction is where the gap between raw model calls and purpose-built pipelines is most consequential. SCORE-Bench separates this into two metrics: Cell Content Accuracy (reading the text inside the cell correctly) and Cell Level Index Accuracy (understanding which row and column that text belongs to). The distinction matters because the failure mode is subtle: a model can read a cell's text correctly while placing it under the wrong column header. The number is right; its meaning is wrong.

For financial data, pricing tables, or specification sheets, this is a data integrity failure that is hard to detect downstream. A liability cap of $5,000,000 bound to the wrong jurisdiction column, a payment term attributed to the wrong service line, a renewal date tied to the wrong contract instance — all of these look like clean extractions until the downstream system acts on them.
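The difference between the two table metrics is easy to see on a toy table. The sketch below is a simplified illustration under assumed definitions (exact text match, tables as position-to-text maps); it is not the benchmark's scoring code, and the table values are made up.

```python
# A table is modeled as {(row, col): text}. Hypothetical data for illustration.
Table = dict[tuple[int, int], str]


def cell_content_accuracy(truth: Table, pred: Table) -> float:
    """Share of ground-truth cell texts that appear anywhere in the prediction."""
    remaining = list(pred.values())
    hits = 0
    for text in truth.values():
        if text in remaining:
            remaining.remove(text)
            hits += 1
    return hits / len(truth)


def cell_index_accuracy(truth: Table, pred: Table) -> float:
    """Share of ground-truth cells whose text sits at the same (row, col)."""
    return sum(pred.get(pos) == text for pos, text in truth.items()) / len(truth)


truth = {
    (0, 0): "Jurisdiction", (0, 1): "US", (0, 2): "EU",
    (1, 0): "Liability cap", (1, 1): "$5,000,000", (1, 2): "$2,000,000",
}
# Every text is read correctly, but the two cap values land in swapped columns.
pred = dict(truth)
pred[(1, 1)], pred[(1, 2)] = truth[(1, 2)], truth[(1, 1)]

print(cell_content_accuracy(truth, pred))  # 1.0
print(cell_index_accuracy(truth, pred))    # 4 of 6 cells in the right place
```

A content-only check would pass this extraction, which is why the index metric has to be tracked separately.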

| Failure Mode | Root Cause | Downstream Impact | Severity |
|---|---|---|---|
| Hallucinated content | Model invents tokens not in source | Corrupted vector store; confident wrong answers | High |
| Missing coverage | Model skips sections or truncates | Retrieval gaps; incomplete answers | High |
| Cell-index misalignment | Cell content read correctly but placed in wrong row/column | Numerically correct values bound to wrong headers | High |
| Element sequencing errors | Headers, paragraphs, figures emitted in wrong order | Broken context in downstream chunks | Medium |
| Inconsistent output format | Model deviates from schema between documents | Pipeline failures; manual post-processing required | Medium |
| Mid-sentence chunk splits | Character-based chunking ignores structure | Diluted embeddings; poor retrieval precision | Medium |

Hallucinated content doesn't announce itself. It looks like extracted text. It flows into your vector store, gets retrieved, and feeds into your LLM as if it were real. The downstream effect is answers that are confidently wrong in ways that are hard to catch after the fact.

The Architecture of a Production Pipeline

Closing the gap between a raw model call and production-grade extraction isn't a question of finding a better model. It's a question of building the right scaffolding around the model. A production pipeline has five distinct layers, each addressing a specific failure mode from the benchmark results. The architecture is sequential at the document level but operates on different representations of the document at each stage.

Pipeline architecture

Each layer takes a more refined representation of the document than the last. Partitioning operates on rendered pages. Prompted extraction operates on typed elements. Post-processing operates on the model's structured output. Chunking operates on extracted, normalized content. Confidence scoring operates on the fully extracted field set. The probabilistic component — the model call — is bounded to stage 2, and everything around it is deterministic, inspectable, and testable.

Stage 1 — Document partitioning

Partitioning converts a raw PDF into a structured set of typed document elements: paragraphs, titles, tables, list items, figures, headers, footers. Each element carries a type label, a bounding box, a page reference, and a position in the document's logical order. This is the step that creates a structured representation for everything downstream to operate on.

Addresses: element sequencing errors, table misalignment, structural context loss. A model working on partitioned elements knows what type of content it is processing and where it sits in the document hierarchy — information that has been irretrievably lost by the time a text-only pipeline reaches the extraction step.
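As a rough sketch of what a typed element might look like in code, assuming the four attributes described above. The class name, field names, and the bounding-box coordinate convention are assumptions for illustration, not a specific library's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: str
    type: str    # "Title", "ListItem", "Table", ...
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) in page units
    order: int   # position in the document's logical reading order


def reading_order(elements: list[Element]) -> list[Element]:
    """Sort by logical order rather than by raw position on the page."""
    return sorted(elements, key=lambda e: e.order)


def of_type(elements: list[Element], *types: str) -> list[Element]:
    """Select elements of the given types, kept in reading order."""
    return [e for e in reading_order(elements) if e.type in types]
```

Downstream stages can then ask for "all tables" or "everything under this title" instead of searching a flat string.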

Stage 2 — Schema-constrained extraction

The VLM is prompted with an extraction schema — a JSON schema declaring the fields to extract, their types, and their cardinality. Constrained decoding (grammar-based or logit-mask techniques) guarantees structural validity at generation time. The values themselves are still subject to the model's epistemic uncertainty, which is what confidence scoring exists to quantify.

Addresses: the recall/hallucination tradeoff, table column misattribution, section boundary confusion. Schema constraints close the format inconsistency gap; structured prompting on typed elements closes the element alignment gap.
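A minimal sketch of what such a schema and a structural check could look like. In a real pipeline the schema would typically be handed to a constrained-decoding backend or a full JSON Schema validator; the hand-rolled check below covers only `required`, `type`, and `enum`, and the enum values other than `either_party` are hypothetical.

```python
SCHEMA = {
    "type": "object",
    "required": ["available_to", "notice_period"],
    "properties": {
        "available_to": {
            "type": "string",
            "enum": ["either_party", "customer_only", "provider_only"],  # partly hypothetical
        },
        "notice_period": {"type": "string"},
        "notice_delivery_clause": {"type": "string"},
    },
}

_PY_TYPES = {"string": str, "object": dict, "array": list, "number": (int, float)}


def schema_errors(value: dict, schema: dict) -> list[str]:
    """Return human-readable violations; an empty list means the value conforms."""
    errors = []
    for key in schema.get("required", []):
        if key not in value:
            errors.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key not in value:
            continue
        if not isinstance(value[key], _PY_TYPES[rule["type"]]):
            errors.append(f"{key}: expected {rule['type']}")
        elif "enum" in rule and value[key] not in rule["enum"]:
            errors.append(f"{key}: {value[key]!r} not in enum")
    return errors


print(schema_errors({"available_to": "anyone"}, SCHEMA))
```

Note what the check does not cover: a structurally valid value can still be the wrong value, which is the gap confidence scoring is meant to address.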

Stage 3 — Post-processing and normalization

Even well-prompted models produce edge cases. Post-processing normalizes output to canonical forms: "ninety (90) days", "90 days", and "P90D" all resolve to the same ISO 8601 duration, P90D. Schema validation runs against expected types and known value sets (jurisdictions, currencies, party roles). Cross-reference consistency checks compare fields extracted from multiple document locations.

Addresses: silent format drift across document types, downstream pipeline failures from malformed values, undetected disagreement between redundantly extracted fields.
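The duration normalization mentioned above could be sketched roughly as follows. The function name and the set of supported phrasings are assumptions; the sketch relies on the numeral being present and does not parse spelled-out numbers on their own.

```python
import re
from typing import Optional

_UNITS = {"day": "D", "week": "W", "month": "M", "year": "Y"}
_ISO = re.compile(r"^P\d+[DWMY]$")
_SURFACE = re.compile(
    r"(\d+)\)?\s*(?:calendar\s+)?(day|week|month|year)s?\b", re.IGNORECASE
)


def normalize_duration(surface: str) -> Optional[str]:
    """Map a surface form such as 'ninety (90) days' to an ISO 8601 duration."""
    text = surface.strip()
    if _ISO.match(text.upper()):
        return text.upper()  # already canonical
    match = _SURFACE.search(text)
    if match is None:
        return None  # leave unparseable values for review rather than guess
    number, unit = match.groups()
    return f"P{int(number)}{_UNITS[unit.lower()]}"


for form in ["ninety (90) days", "90 days", "P90D"]:
    print(form, "->", normalize_duration(form))  # each maps to P90D
```

Returning `None` instead of a best guess keeps the failure visible: an unnormalized value can be routed to review, while a wrong normalization would pass silently.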

Stage 4 — Structure-aware chunking

With partitioned, extracted content, chunking can respect the document's logical structure rather than imposing arbitrary character boundaries. Chunk boundaries follow section breaks, clause boundaries, and table edges. By-title chunking ensures chunks never cross section headers. By-similarity chunking groups topically related elements when structure alone doesn't define clean boundaries.

Addresses: diluted embeddings from mixed-topic chunks, mid-sentence splits, retrieval precision degradation from coarse vector representations. The four chunking approaches are covered in detail in the chunking section below.

Stage 5 — Confidence scoring and review routing

Not all extracted fields carry equal certainty. A production pipeline assigns a composite confidence score to each field — aggregated from token logprobs, schema validation results, and cross-reference consistency checks — and uses thresholds to route the field. High-confidence fields flow to the system of record. Medium-confidence fields go to spot-check review. Low-confidence fields route to full human review with the model's top-k alternatives surfaced.

Addresses: silent extraction failures that look correct but aren't, downstream decisions made on unreliable data, audit-trail requirements for regulated industries. Thresholds are calibrated per field type because the cost asymmetry of a wrong indemnification cap is not the cost asymmetry of a wrong notice address.
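The three-way routing can be sketched as a pair of threshold comparisons per field class. The thresholds for `termination_provision` match the worked example later in this article; the other field classes, their values, and the two review route names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    high_confidence_min: float
    review_required_below: float


# Hypothetical calibration table; real values would come from labeled review data.
THRESHOLDS = {
    "termination_provision": Thresholds(0.92, 0.80),
    "liability_cap": Thresholds(0.97, 0.90),
    "notice_address": Thresholds(0.85, 0.60),
}


def route(field_class: str, composite: float) -> str:
    """Pick a destination for one extracted field from its composite score."""
    t = THRESHOLDS[field_class]
    if composite >= t.high_confidence_min:
        return "system_of_record"
    if composite < t.review_required_below:
        return "full_human_review"
    return "spot_check_review"


print(route("termination_provision", 0.95))  # system_of_record
print(route("liability_cap", 0.95))          # spot_check_review
```

The same score of 0.95 is accepted for one field class and sent to review for another, which is the cost asymmetry expressed in code.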

A Worked Example: One Clause Through the Pipeline

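The abstractions above are easier to read than to apply. Here is what they look like in practice: a single clause from a Master Services Agreement traced through the extraction stages of the pipeline (stages 1, 2, 3, and 5). Stage 4, chunking, serves retrieval rather than field extraction and is covered in its own section below.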

Input — raw text from the PDF

The starting point is the rendered page. After OCR or text-layer extraction, the relevant section of the document looks like an undifferentiated text stream. Visual structure — the section number, the indentation, the bold heading — is preserved only in the rendered image, not in the text dump:

page-14.txt (raw extraction)

8.2 Termination for Convenience

(a) Either party may terminate this Agreement for convenience
upon providing the other party with not less than ninety (90)
days' prior written notice. Such notice shall be delivered in
accordance with Section 14.3 (Notices) and shall specify the
effective date of termination.

(b) Upon termination for convenience, Customer shall pay
Provider all fees accrued through the effective date of
termination, including any non-cancellable third-party costs
incurred by Provider prior to receipt of the termination notice.

This is what a text-only pipeline has to work with. The section number "8.2" is just two characters in the stream; nothing tells the downstream system that this is a section heading rather than, say, a paragraph reference. The clause hierarchy — (a) and (b) as sub-clauses of 8.2 — is a typographical convention that has been flattened. The reference to "Section 14.3 (Notices)" is just a string; no link to the actual notices clause exists. This is the representation a naive prompt-driven pipeline tries to extract from.

Stage 1 output — partitioned elements

Partitioning converts the same page into a typed element graph. Each element carries a type, a position, and the structural information that was visible in the rendered page but invisible in the text dump:

partitioned.json

[
  {
    "element_id": "elem_0142",
    "type": "Title",
    "level": 2,
    "text": "Termination for Convenience",
    "section_number": "8.2",
    "page": 14,
    "bbox": [72, 188, 412, 210]
  },
  {
    "element_id": "elem_0143",
    "type": "ListItem",
    "marker": "(a)",
    "parent_section": "8.2",
    "text": "Either party may terminate this Agreement for convenience upon providing the other party with not less than ninety (90) days' prior written notice. Such notice shall be delivered in accordance with Section 14.3 (Notices) and shall specify the effective date of termination.",
    "page": 14,
    "bbox": [72, 220, 540, 312],
    "internal_references": ["§14.3"]
  },
  {
    "element_id": "elem_0144",
    "type": "ListItem",
    "marker": "(b)",
    "parent_section": "8.2",
    "text": "Upon termination for convenience, Customer shall pay Provider all fees accrued through the effective date of termination, including any non-cancellable third-party costs incurred by Provider prior to receipt of the termination notice.",
    "page": 14,
    "bbox": [72, 322, 540, 398]
  }
]

The downstream stages now have access to information that was lost in the raw text. The clause is identified as termination for convenience, not as an arbitrary paragraph. Sub-clauses (a) and (b) are explicitly nested under section 8.2. Internal references have been detected and surfaced as structured fields rather than left as inline strings. Every element carries a bounding box, which is the substrate for provenance later in the pipeline.
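As a small illustration of how a downstream stage might use that structure, the sketch below rebuilds the clause hierarchy from a trimmed copy of the elements above (text and bounding boxes omitted for brevity). The function name and outline shape are assumptions.

```python
import json

partitioned = json.loads("""
[
  {"element_id": "elem_0142", "type": "Title", "section_number": "8.2",
   "text": "Termination for Convenience"},
  {"element_id": "elem_0143", "type": "ListItem", "marker": "(a)", "parent_section": "8.2"},
  {"element_id": "elem_0144", "type": "ListItem", "marker": "(b)", "parent_section": "8.2"}
]
""")


def build_outline(elements: list[dict]) -> dict[str, dict]:
    """Map section number -> {"title": ..., "children": [(marker, element_id), ...]}."""
    outline: dict[str, dict] = {}
    for el in elements:
        if el["type"] == "Title":
            outline[el["section_number"]] = {"title": el["text"], "children": []}
    for el in elements:
        if el["type"] == "ListItem" and el.get("parent_section") in outline:
            outline[el["parent_section"]]["children"].append(
                (el["marker"], el["element_id"])
            )
    return outline


print(build_outline(partitioned)["8.2"])
```

With the outline in hand, "extract the notice period from §8.2(a)" becomes a lookup rather than a text search.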

Stage 2 output — schema-constrained extraction

With typed elements as input, the VLM extracts against an explicit schema. The prompt declares what fields to extract (notice period, terminating party, fee obligations), their types (duration, enum, monetary), and the element scope to extract them from. The model emits structured output that conforms to the schema by construction:

extraction.json (raw)

{
  "termination_for_convenience": {
    "available_to": "either_party",
    "notice_period": {
      "value": "ninety (90) days",
      "type": "duration"
    },
    "notice_delivery_clause": "Section 14.3",
    "post_termination_obligations": {
      "customer_pays": [
        "accrued fees through effective date",
        "non-cancellable third-party costs"
      ]
    },
    "source_elements": ["elem_0143", "elem_0144"]
  }
}

The model has done what it's good at: reading the clause text and emitting structured fields. What it has not yet done is normalize values, link cross-references, or attach confidence — those are deterministic downstream steps.

Stage 3 output — normalized and validated

Post-processing normalizes the values to canonical types and resolves the internal reference. "Ninety (90) days" becomes an ISO 8601 duration. The "Section 14.3" reference is resolved against the document's section index and bound to the actual notices clause element. Schema validation confirms that every field matches its declared type:

extraction.json (normalized)

{
  "field_path": "termination_for_convenience",
  "values": {
    "available_to": "either_party",
    "notice_period_iso": "P90D",
    "notice_period_surface": "ninety (90) days",
    "notice_delivery": {
      "reference_text": "Section 14.3",
      "resolved_element": "elem_0287",
      "resolved_section": "14.3 Notices"
    },
    "customer_obligations_on_termination": [
      "accrued_fees_through_effective_date",
      "non_cancellable_third_party_costs"
    ]
  },
  "provenance": {
    "document_id": "msa-2024-acme-v3.pdf",
    "primary_clause": "§8.2(a)",
    "page": 14,
    "bbox": [72, 220, 540, 312]
  }
}
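The reference-resolution step that produced `notice_delivery` in the record above can be sketched as a lookup against a section index. The index contents, the regular expression, and the function name are assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical section index built during partitioning:
# section number -> (title element id, section title).
SECTION_INDEX = {
    "8.2": ("elem_0142", "Termination for Convenience"),
    "14.3": ("elem_0287", "Notices"),
}

_SECTION_REF = re.compile(r"(?:Section|§)\s*(\d+(?:\.\d+)*)")


def resolve_reference(reference_text: str) -> Optional[dict]:
    """Bind a textual section reference to the element it points at."""
    match = _SECTION_REF.search(reference_text)
    if match is None or match.group(1) not in SECTION_INDEX:
        return None  # unresolved references should be flagged, not guessed
    number = match.group(1)
    element_id, title = SECTION_INDEX[number]
    return {
        "reference_text": reference_text,
        "resolved_element": element_id,
        "resolved_section": f"{number} {title}",
    }


print(resolve_reference("Section 14.3"))
```

Because the lookup is deterministic, a reference that fails to resolve is itself a useful signal for the confidence stage.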

Stage 5 output — confidence-scored and routed

The composite confidence score is assembled from three independent signals: the model's token-level logprobs over the extracted span (converted to a 0-to-1 probability score), the schema validation result, and a cross-reference consistency check (the notice period stated in §8.2 also appears in the termination summary in §1, where it can be cross-validated). The routing decision follows from comparing the composite against per-field calibrated thresholds:

extraction.json (final, with confidence)

{
  "field_path": "termination_for_convenience",
  "values": { /* ...as above... */ },
  "provenance": { /* ...as above... */ },
  "confidence": {
    "extraction_logprob": 0.94,
    "schema_validation": 1.00,
    "cross_reference_consistency": 0.97,
    "composite": 0.95
  },
  "thresholds": {
    "field_class": "termination_provision",
    "high_confidence_min": 0.92,
    "review_required_below": 0.80
  },
  "routing": "system_of_record",
  "alternatives": []
}

The final record carries everything a downstream consumer needs: the normalized values, the original surface forms, full provenance to the source pixels, decomposed confidence signals, and an explicit routing decision. A reviewer who wants to audit this extraction can jump from the composite score directly to the bounding box on page 14 and verify the source text — every step in the pipeline is inspectable, and the probabilistic component is bounded to a single, auditable stage.
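One way the composite might be assembled from the decomposed signals is sketched below. The article does not specify the aggregation formula, so the weighted average, the weights, and the hard gate on schema failure are all assumptions; the weights were picked only so the example lands near the 0.95 shown in the record.

```python
import math

# Hypothetical weights; the actual aggregation formula is not given in the article.
WEIGHTS = {
    "extraction_logprob": 0.7,
    "schema_validation": 0.1,
    "cross_reference_consistency": 0.2,
}


def span_probability(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp of the mean log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def composite_confidence(signals: dict[str, float]) -> float:
    if signals["schema_validation"] < 1.0:
        return 0.0  # assumption: a schema failure acts as a hard gate
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


signals = {
    "extraction_logprob": 0.94,
    "schema_validation": 1.00,
    "cross_reference_consistency": 0.97,
}
print(round(composite_confidence(signals), 2))  # 0.95
```

Keeping the individual signals in the record, rather than only the composite, lets a reviewer see which check pulled the score down.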

This is what "production-ready" means in practice. Not better prompts, not a larger model — but a pipeline where every extracted field carries normalization, provenance, decomposed confidence, and a routing decision, produced through stages that are deterministic except for the bounded model call at stage 2.

Chunking: The Step That Determines Retrieval Quality

The pipeline above produces structured extraction. For applications that also need retrieval — RAG over the document portfolio, semantic search, question-answering against the underlying text — there's a parallel decision about how the extracted content gets chunked for embedding and indexing. Chunking strategy is one of the highest-leverage decisions in a RAG pipeline, and poor chunking will degrade retrieval quality regardless of how capable the downstream model is.

Why chunking matters mechanically

Embedding models compress text into a fixed-dimension vector regardless of input length. A short focused passage and a long mixed-topic passage both produce a single vector of the same dimensionality. The compression is lossy, and it gets lossier as the input gets more semantically diverse. Large chunks that mix multiple topics produce coarse representations where no single topic dominates the vector — and similarity search against such vectors returns coarse, low-precision results.

The retrieved chunks also feed directly into the prompt as context for the generating model. Filling a context window with retrieved chunks creates the needle-in-a-haystack problem: the relevant content is present, but buried in surrounding context the model has to wade through to find it. Smaller, focused chunks both retrieve more precisely and present more cleanly downstream. A starting point of around 250 tokens — roughly 1,000 characters — is a sensible baseline for most enterprise document types.

The four chunking approaches

Character splitting (Baseline)

Divides text into fixed N-character segments with optional overlap between consecutive chunks. The simplest possible approach, and the most problematic.

Strengths:
- Trivial to implement
- Works on any text format

Limitations:
- Ignores document structure entirely
- Can split sentences, and even words, in the middle
- Mixes unrelated topics in single chunks
- Overlap doesn't fix structural breaks

Best for: Prototyping only; not suitable for production enterprise document pipelines
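A minimal sketch of character splitting makes the structural problem visible. The function name and parameters are illustrative.

```python
def split_by_characters(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share `overlap` chars."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


clause = "Either party may terminate this Agreement for convenience."
chunks = split_by_characters(clause, size=20, overlap=5)
print(chunks[0])  # "Either party may ter"
print(chunks[1])  # "y terminate this Agr"
```

The first chunk ends in the middle of "terminate" and the second starts in the middle of "may". Overlap repeats a few characters across the boundary but does nothing to move the boundary to a sensible place.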

Recursive / sentence-level chunking (Improved)

Splits text using an ordered list of separators (paragraph breaks, newlines, sentences, spaces) applied recursively until chunks reach the target size.

Strengths:
- Reduces mid-sentence splits
- Respects paragraph boundaries
- Works across plain text formats

Limitations:
- Still ignores tables, lists, headers
- Requires different separator sets per format
- Fails on image-based PDFs
- No semantic awareness

Best for: Plain text documents with consistent formatting; not suitable for mixed-format enterprise PDFs
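The recursive approach can be sketched as follows. This is a simplified illustration, not any particular library's splitter: the separator list is an assumption, and the separator at a chunk boundary is discarded, so a chunk that ends at ". " loses its final period.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, rest)
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = ("Either party may terminate. Notice must be written.\n\n"
        "Customer pays accrued fees.")
print(recursive_split(text, max_chars=40))
```

On this input the paragraph break is tried first, and only the oversized first paragraph is split again at the sentence boundary. No word is cut, but the splitter still has no idea whether it is inside a table or a list.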

Structure-aware smart chunking (Production)

Operates on document elements produced during partitioning (paragraphs, titles, tables, list items) rather than raw text. Chunk boundaries follow the document's logical structure.

Strengths:
- Preserves semantic boundaries
- Handles tables, lists, headers natively
- Universal across document types
- Supports by-title, by-page, by-similarity strategies

Limitations:
- Requires a partitioning step before chunking
- More complex pipeline setup

Best for: Enterprise document portfolios with mixed formats, complex layouts, and tables
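A by-title strategy over partitioned elements might look roughly like this. The element dictionaries, ids, and the 1,000-character default are illustrative assumptions; the sketch never splits a single element, so one oversized element still becomes one oversized chunk.

```python
def chunk_by_title(elements: list[dict], max_chars: int = 1000) -> list[dict]:
    """Group elements into chunks that never cross a Title element."""
    chunks: list[dict] = []
    section = None
    for el in elements:
        is_title = el["type"] == "Title"
        if is_title:
            section = el["text"]
        size = len(el["text"])
        needs_new = (
            not chunks
            or is_title
            or chunks[-1]["chars"] + size > max_chars
        )
        if needs_new:
            chunks.append({"section": section, "element_ids": [], "chars": 0})
        chunks[-1]["element_ids"].append(el["element_id"])
        chunks[-1]["chars"] += size
    return chunks


# Hypothetical elements; ids and texts are placeholders.
elements = [
    {"element_id": "t1", "type": "Title", "text": "Termination for Convenience"},
    {"element_id": "a", "type": "ListItem", "text": "a" * 100},
    {"element_id": "b", "type": "ListItem", "text": "b" * 100},
    {"element_id": "t2", "type": "Title", "text": "Governing Law"},
    {"element_id": "p", "type": "NarrativeText", "text": "c" * 50},
]
for chunk in chunk_by_title(elements, max_chars=150):
    print(chunk["section"], chunk["element_ids"])
```

When a section overflows the size limit, the overflow chunk keeps the section label, so every chunk still knows which heading it belongs to.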

Contextual chunking (Advanced)

Prepends a short, chunk-specific explanatory context to each chunk before embeddings are generated, giving each segment additional information about where it sits within the broader document.

Strengths:
- Significantly improves retrieval accuracy
- Each chunk carries its own context
- No manual context writing required

Limitations:
- Higher processing cost per chunk
- Requires a capable model for context generation

Best for: High-value document portfolios where retrieval precision is critical, such as contracts, financial reports, and regulatory filings
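The mechanics of contextual chunking reduce to building the string that gets embedded. In the sketch below the context generator is a metadata template standing in for the model call described above; the function names and chunk fields are assumptions.

```python
from typing import Callable


def template_context(chunk: dict) -> str:
    """Stand-in for a model call: builds context from metadata only (an assumption)."""
    return (
        f"From {chunk['document_title']}, section {chunk['section_number']} "
        f"({chunk['section_title']}), page {chunk['page']}."
    )


def contextualize(chunk: dict,
                  make_context: Callable[[dict], str] = template_context) -> str:
    """Return the text to embed: generated context, a blank line, then the chunk."""
    return f"{make_context(chunk)}\n\n{chunk['text']}"


chunk = {
    "document_title": "Master Services Agreement",
    "section_number": "8.2",
    "section_title": "Termination for Convenience",
    "page": 14,
    "text": "(a) Either party may terminate this Agreement for convenience ...",
}
print(contextualize(chunk))
```

Swapping `template_context` for a function that calls a model changes the quality of the context, not the shape of the pipeline. The original chunk text should still be stored separately so that retrieval returns the source wording rather than the generated preamble.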

The key insight from structure-aware chunking is that you are not splitting text — you are splitting a document that has already been understood. The partitioning step does the structural analysis; the chunking step respects it. This is the difference between cutting a document at arbitrary character counts and cutting it at the boundaries the document itself defines.

What This Means for Contract Documents

Contract PDFs are among the most demanding document types for extraction pipelines. They combine dense legal prose with structured data (dates, parties, monetary values), semi-structured content (clause hierarchies, cross-references), and tabular data (pricing schedules, SLA tiers, jurisdiction matrices). They are frequently scanned rather than digitally native. They vary enormously in formatting across counterparties, time periods, and document types.

The failure modes that matter most in contract extraction are not the ones that produce obviously wrong output. They are the ones that produce plausible-looking output that is structurally wrong — a liability cap attributed to the wrong party, a renewal date extracted from the wrong clause, a governing law field populated with the wrong jurisdiction because the model confused a recital with a substantive provision.

Clause-level partitioning

Contract extraction requires partitioning at the clause level, not just the paragraph level. A liability clause, an indemnification clause, and a force majeure clause may all appear in the same section — and must be distinguished for accurate field extraction.

Cross-reference resolution

Contracts routinely define terms in one clause and use them in another. Effective extraction requires resolving these cross-references — understanding that "the Effective Date" in clause 7.1 refers to the date defined in clause 1.1.

Amendment-aware processing

A contract portfolio is not a static set of base agreements. Amendments, addenda, and side letters modify the terms of base agreements. Extraction pipelines must process the full document set for each agreement and reconcile conflicting provisions.
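A deliberately simple sketch of reconciliation is shown below: documents are applied oldest first and a later document overrides an earlier one field by field. "Latest effective date wins" is an assumption made for illustration; real agreements often carry order-of-precedence clauses that a production pipeline would have to honor. All document ids, dates, and values are hypothetical.

```python
from datetime import date


def effective_terms(documents: list[dict]) -> dict[str, dict]:
    """Apply documents oldest-first; later documents override earlier fields."""
    merged: dict[str, dict] = {}
    for doc in sorted(documents, key=lambda d: d["effective_date"]):
        for name, value in doc["fields"].items():
            previous = merged.get(name)
            merged[name] = {
                "value": value,
                "source": doc["document_id"],
                "supersedes": previous["source"] if previous else None,
            }
    return merged


documents = [
    {"document_id": "amendment-1", "effective_date": date(2025, 3, 1),
     "fields": {"notice_period": "P60D"}},
    {"document_id": "base-msa", "effective_date": date(2024, 1, 15),
     "fields": {"notice_period": "P90D", "governing_law": "New York"}},
]
terms = effective_terms(documents)
print(terms["notice_period"])
```

Recording which document a value superseded keeps the override auditable instead of silently replacing the base term.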

Confidence-gated field extraction

High-consequence fields — liability caps, payment terms, termination rights, governing law — require higher confidence thresholds before being passed downstream. A 60% confidence extraction of a liability cap is not useful data; it is a liability.

The Bottom Line

Extracting structured intelligence from PDF documents is a solved problem in the sense that the tools and techniques exist. It is not a solved problem in the sense that pointing a capable model at a document with a simple prompt produces production-quality output. The benchmark data is clear on this: the gap between a raw model call and a purpose-built pipeline is real, it is consistent across model families, and it shows up most consequentially in exactly the places that matter most for enterprise use — tables, document structure, and the balance between recall and hallucination.

The pipeline that closes that gap has five layers: partitioning, schema-constrained extraction, post-processing and normalization, structure-aware chunking, and confidence-gated review routing. Each layer operates on a more refined representation of the document than the last, and each addresses a specific failure mode. The probabilistic component is bounded to a single stage; everything around it is deterministic and inspectable. The output is not just extracted fields but extracted fields with normalization, provenance, confidence, and routing decisions attached — which is the difference between data that can be used and data that can't.

The organizations that treat document extraction as an infrastructure investment — not a one-time prompt engineering exercise — are the ones that end up with contract data they can actually use. The difference is not in the model. It is in the pipeline built around it.
