Document Processing · May 2026 · 12 min read · Deepak Patil

From Documents to Insight: How to Extract Structured Data (Word, PDF, Spreadsheets, etc.)

A technical walkthrough of the full extraction pipeline — partitioning, schema-constrained extraction, structure-aware chunking, and confidence-gated routing — with a worked example showing how a single contract clause flows from raw document bytes to a structured field.

Enterprise knowledge lives in PDFs. Contracts, financial reports, technical manuals, regulatory filings, invoices — the documents that drive decisions are almost universally stored in formats that resist programmatic access. The text is there. The structure is there. The intelligence is there. None of it is queryable, comparable, or monitorable until it has been extracted into structured form.

The extraction problem sounds straightforward: read the document, pull out the relevant data, store it somewhere useful. In practice it is one of the harder engineering problems in enterprise AI — and the gap between a naive implementation and a production-ready pipeline is wider than most teams expect when they start.

This article walks through that pipeline end to end: why pointing a capable model at a PDF and trusting it to produce structured output doesn't work, what each stage of a production pipeline actually does, and — most concretely — what a single contract clause looks like as it flows through the pipeline from PDF bytes to a normalized, provenance-anchored, confidence-scored field.

The goal is not to read documents. The goal is to turn documents into structured data that can be queried, compared, monitored, and acted on at portfolio scale. Those are different problems, and the second one is significantly harder.

The Document Parsing Problem

Modern vision-language models are genuinely capable. They handle large context windows. They produce reasonable structured output. They read text inside table cells correctly most of the time. So when teams start building document extraction pipelines, the natural instinct is to point a capable model at a stack of PDFs with a well-written prompt and see how far that gets them.

The answer, consistently, is: not far enough for production. The Unstructured team's SCORE-Bench evaluation (December 2025) is the clearest public articulation of why. SCORE-Bench is a benchmark of real-world enterprise documents — financial reports with nested tables, scanned forms, technical documentation with multi-column layouts — annotated by domain experts and designed specifically to expose the gap between systems that handle clean academic PDFs and systems that handle production document portfolios.

The hallucination–recall tradeoff

The central tension SCORE-Bench surfaces is what they call the hallucination vs. recall tradeoff: the balance between finding all the text a document contains (Percent Tokens Found) and inventing text that isn't there (Percent Tokens Added). These are independent failure modes. A system can have excellent recall by aggressively extracting every plausible token, at the cost of hallucinating spurious ones. A system can have excellent hallucination control by extracting conservatively, at the cost of missing real content.
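To make the two metrics concrete, here is a minimal sketch that scores an extraction against a reference transcript. It assumes a simple bag-of-tokens definition (whitespace tokens, case-insensitive, counted with multiplicity); SCORE-Bench's exact definitions may differ, and the function name `token_metrics` is hypothetical.

```python
from collections import Counter


def token_metrics(reference: str, extracted: str) -> dict[str, float]:
    """Bag-of-tokens approximation of found/added rates (assumed definitions)."""
    ref = Counter(reference.lower().split())
    out = Counter(extracted.lower().split())
    overlap = sum((ref & out).values())  # tokens present in both, with multiplicity
    ref_total = sum(ref.values())
    out_total = sum(out.values())
    return {
        # share of reference tokens that the extraction recovered
        "tokens_found": overlap / ref_total if ref_total else 1.0,
        # share of extracted tokens that have no counterpart in the reference
        "tokens_added": (out_total - overlap) / out_total if out_total else 0.0,
    }


reference = "either party may terminate upon ninety days notice"
conservative = "either party may terminate"
aggressive = "either party may terminate upon ninety days written notice to provider"

conservative_scores = token_metrics(reference, conservative)
aggressive_scores = token_metrics(reference, aggressive)
print(conservative_scores)
print(aggressive_scores)
```

The conservative extraction adds nothing but finds only half the reference tokens; the aggressive one finds all of them but adds three tokens that are not in the source. Neither number alone describes the system.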

The benchmark results show that pipelines built around capable models can land at very different points on this curve depending on how they're constructed. The lowest hallucination rate across all evaluated pipelines (Percent Tokens Added of 0.036) belongs to a purpose-built pipeline using a frontier model as its VLM backend. The second-highest recall (Percent Tokens Found of 0.924) is achieved by the same family of pipelines, simultaneously. Other approaches reach higher recall but carry hallucination rates three to four times higher — a tradeoff that is unacceptable for downstream applications where false content is more damaging than missing content.

The takeaway isn't that one model is better than another. It's that the operating point on the recall/hallucination curve is determined by the pipeline architecture wrapped around the model — not by the model alone.

Where the gap is widest: tables and structure

Table extraction is where the gap between raw model calls and purpose-built pipelines is most consequential. SCORE-Bench separates this into two metrics: Cell Content Accuracy (reading the text inside the cell correctly) and Cell Level Index Accuracy (understanding which row and column that text belongs to). The distinction matters because the failure mode is subtle: a model can read a cell's text correctly while placing it under the wrong column header. The number is right; its meaning is wrong.

For financial data, pricing tables, or specification sheets, this is a data integrity failure that is hard to detect downstream. A liability cap of $5,000,000 bound to the wrong jurisdiction column, a payment term attributed to the wrong service line, a renewal date tied to the wrong contract instance — all of these look like clean extractions until the downstream system acts on them.
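The distinction between the two table metrics can be illustrated with a small sketch. The scoring below is a simplified assumption, not SCORE-Bench's actual implementation: content accuracy asks whether each cell's text appears anywhere in the predicted table, and index accuracy asks whether it appears at the right (row, column) position. The names `Cell` and `cell_scores` are hypothetical.

```python
from collections import Counter

Cell = tuple[int, int, str]  # (row, column, text)


def cell_scores(truth: list[Cell], predicted: list[Cell]) -> dict[str, float]:
    """Score text recovery and positional correctness separately."""
    truth_texts = Counter(text for _, _, text in truth)
    predicted_texts = Counter(text for _, _, text in predicted)
    content_hits = sum((truth_texts & predicted_texts).values())

    truth_positions = {(row, col): text for row, col, text in truth}
    index_hits = sum(
        1 for row, col, text in predicted if truth_positions.get((row, col)) == text
    )
    total = len(truth)
    return {"content": content_hits / total, "index": index_hits / total}


# Two liability caps in adjacent columns; the prediction swaps the columns.
truth = [(1, 0, "$5,000,000"), (1, 1, "$1,000,000")]
swapped = [(1, 0, "$1,000,000"), (1, 1, "$5,000,000")]

scores = cell_scores(truth, swapped)
print(scores)
```

Every value was read correctly, so content accuracy is perfect, yet index accuracy is zero: each number is bound to the wrong header.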

| Failure Mode | Root Cause | Downstream Impact | Severity |
| Hallucinated content | Model invents tokens not in source | Corrupted vector store; confident wrong answers | High |
| Missing coverage | Model skips sections or truncates | Retrieval gaps; incomplete answers | High |
| Cell-index misalignment | Cell content read correctly but placed in wrong row/column | Numerically correct values bound to wrong headers | High |
| Element sequencing errors | Headers, paragraphs, figures emitted in wrong order | Broken context in downstream chunks | Medium |
| Inconsistent output format | Model deviates from schema between documents | Pipeline failures; manual post-processing required | Medium |
| Mid-sentence chunk splits | Character-based chunking ignores structure | Diluted embeddings; poor retrieval precision | Medium |

Hallucinated content doesn't announce itself. It looks like extracted text. It flows into your vector store, gets retrieved, and feeds into your LLM as if it were real. The downstream effect is answers that are confidently wrong in ways that are hard to catch after the fact.

The Architecture of a Production Pipeline

Closing the gap between a raw model call and production-grade extraction isn't a question of finding a better model. It's a question of building the right scaffolding around the model. A production pipeline has five distinct layers, each addressing a specific failure mode from the benchmark results. The architecture is sequential at the document level but operates on different representations of the document at each stage.

Pipeline architecture
Figure: Production document extraction pipeline architecture. Raw PDFs flow through five stages: (1) Partitioning, pages to typed elements; (2) Prompted extraction, schema-constrained VLM; (3) Post-processing, normalize and validate; (4) Structure-aware chunking, by title, section, or similarity; (5) Confidence scoring and routing, per-field composite score. The output splits between the system of record (high-confidence fields) and a human review queue (low-confidence fields).

Each layer takes a more refined representation of the document than the last. Partitioning operates on rendered pages. Prompted extraction operates on typed elements. Post-processing operates on the model's structured output. Chunking operates on extracted, normalized content. Confidence scoring operates on the fully extracted field set. The probabilistic component — the model call — is bounded to stage 2, and everything around it is deterministic, inspectable, and testable.

Stage 1 — Document partitioning

Partitioning converts a raw PDF into a structured set of typed document elements: paragraphs, titles, tables, list items, figures, headers, footers. Each element carries a type label, a bounding box, a page reference, and a position in the document's logical order. This is the step that creates a structured representation for everything downstream to operate on.

Addresses: element sequencing errors, table misalignment, structural context loss. A model working on partitioned elements knows what type of content it is processing and where it sits in the document hierarchy — information that has been irretrievably lost by the time a text-only pipeline reaches the extraction step.
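As a rough illustration of what a typed element might look like in code, here is a minimal sketch. The `Element` class and `reading_order` function are hypothetical, the bounding box is assumed to be (x0, y0, x1, y1) with the origin at the top-left of the page, and the ordering rule only holds for single-column layouts; real partitioners need layout analysis for multi-column pages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    element_id: str
    type: str  # e.g. "Title", "ListItem", "Table"
    text: str
    page: int
    bbox: tuple[float, float, float, float]  # assumed (x0, y0, x1, y1), top-left origin


def reading_order(elements: list[Element]) -> list[Element]:
    """Naive single-column order: by page, then top edge, then left edge."""
    return sorted(elements, key=lambda e: (e.page, e.bbox[1], e.bbox[0]))


# Elements as a detector might emit them, out of order.
detected = [
    Element("e3", "ListItem", "(b) Upon termination ...", 14, (72, 322, 540, 398)),
    Element("e1", "Title", "Termination for Convenience", 14, (72, 188, 412, 210)),
    Element("e2", "ListItem", "(a) Either party may ...", 14, (72, 220, 540, 312)),
]

ordered_ids = [e.element_id for e in reading_order(detected)]
print(ordered_ids)
```

Because type, page, and position travel with each element, later stages can reason about headings and sub-clauses instead of re-deriving them from a flat string.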

Stage 2 — Schema-constrained extraction

The VLM is prompted with an extraction schema — a JSON schema declaring the fields to extract, their types, and their cardinality. Constrained decoding (grammar-based or logit-mask techniques) guarantees structural validity at generation time. The values themselves are still subject to the model's epistemic uncertainty, which is what confidence scoring exists to quantify.

Addresses: the recall/hallucination tradeoff, table column misattribution, section boundary confusion. Schema constraints close the format inconsistency gap; structured prompting on typed elements closes the element alignment gap.
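The sketch below shows the kind of declaration an extraction schema makes: field names, types, allowed values, and whether a field is required. Note what it is not: it is a plain post-hoc check written with the standard library, swapped in for illustration, whereas constrained decoding enforces the same rules during generation. The `SCHEMA` layout and the `schema_violations` helper are hypothetical.

```python
# Hypothetical schema: field name -> expected type, optional enum, required flag.
SCHEMA = {
    "available_to": {"type": str, "enum": {"either_party", "customer", "provider"}},
    "notice_period": {"type": str},
    "notice_delivery_clause": {"type": str, "required": False},
}


def schema_violations(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record conforms."""
    problems = []
    for name, rule in schema.items():
        if name not in record:
            if rule.get("required", True):
                problems.append(f"missing field: {name}")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            problems.append(f"wrong type for {name}: {type(value).__name__}")
        elif "enum" in rule and value not in rule["enum"]:
            problems.append(f"unexpected value for {name}: {value!r}")
    # Fields the schema never declared are also a form of format drift.
    for name in sorted(record.keys() - schema.keys()):
        problems.append(f"undeclared field: {name}")
    return problems


good = {"available_to": "either_party", "notice_period": "ninety (90) days"}
bad = {"available_to": "anyone", "extra": 1}

print(schema_violations(good, SCHEMA))
print(schema_violations(bad, SCHEMA))
```

A conforming record only proves the shape is right. Whether "ninety (90) days" is the correct notice period is a separate question, which is why confidence scoring exists.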

Stage 3 — Post-processing and normalization

Even well-prompted models produce edge cases. Post-processing normalizes output to canonical forms — "ninety (90) days" and "90 days" and "P90D" are unified to the same ISO 8601 duration. Schema validation runs against expected types and known value sets (jurisdictions, currencies, party roles). Cross-reference consistency checks compare fields extracted from multiple document locations.

Addresses: silent format drift across document types, downstream pipeline failures from malformed values, undetected disagreement between redundantly-extracted fields.
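A minimal sketch of the duration normalization described above, assuming the surface form always contains a digit count of days or is already an ISO 8601 day duration. The function name `to_iso_days` is hypothetical, and spelled-out numbers without digits are deliberately left unresolved rather than guessed.

```python
import re
from typing import Optional

_ISO_DAYS = re.compile(r"^P(\d+)D$")
# Matches "90 days", "90) days", "90 calendar days"; the digits are the source of truth.
_DIGITS_DAYS = re.compile(r"(\d+)\)?\s*(?:calendar\s+)?days?\b", re.IGNORECASE)


def to_iso_days(surface: str) -> Optional[str]:
    """Normalize a day-count phrase to an ISO 8601 duration, or None if unrecognized."""
    text = surface.strip()
    if _ISO_DAYS.match(text):
        return text  # already canonical
    match = _DIGITS_DAYS.search(text)
    return f"P{int(match.group(1))}D" if match else None


for surface in ["ninety (90) days", "90 days", "P90D", "thirty days"]:
    print(surface, "->", to_iso_days(surface))
```

Returning `None` for "thirty days" is a design choice: a value the normalizer cannot parse is better surfaced as a validation failure than silently coerced.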

Stage 4 — Structure-aware chunking

With partitioned, extracted content, chunking can respect the document's logical structure rather than imposing arbitrary character boundaries. Chunk boundaries follow section breaks, clause boundaries, and table edges. By-title chunking ensures chunks never cross section headers. By-similarity chunking groups topically related elements when structure alone doesn't define clean boundaries.

Addresses: diluted embeddings from mixed-topic chunks, mid-sentence splits, retrieval precision degradation from coarse vector representations. The four chunking approaches are covered in detail in the chunking section later in this article.
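By-title chunking can be sketched in a few lines once elements are typed. This is an illustrative version under stated assumptions: elements are plain dictionaries with `type` and `text` keys, size is measured in characters, and the second section in the sample data ("8.3 Termination for Cause") is invented for the example. The function name `chunk_by_title` is hypothetical.

```python
def chunk_by_title(elements: list[dict], max_chars: int = 1000) -> list[list[dict]]:
    """Group elements into chunks that never cross a Title and stay under max_chars."""
    chunks: list[list[dict]] = []
    current: list[dict] = []
    size = 0
    for element in elements:
        starts_section = element["type"] == "Title"
        too_big = size + len(element["text"]) > max_chars
        # Close the running chunk at every section header or when the budget is spent.
        if current and (starts_section or too_big):
            chunks.append(current)
            current, size = [], 0
        current.append(element)
        size += len(element["text"])
    if current:
        chunks.append(current)
    return chunks


elements = [
    {"type": "Title", "text": "8.2 Termination for Convenience"},
    {"type": "ListItem", "text": "(a) Either party may terminate this Agreement ..."},
    {"type": "ListItem", "text": "(b) Upon termination for convenience, Customer shall pay ..."},
    {"type": "Title", "text": "8.3 Termination for Cause"},  # invented sample section
    {"type": "NarrativeText", "text": "Either party may terminate for material breach."},
]

chunk_sizes = [len(chunk) for chunk in chunk_by_title(elements)]
print(chunk_sizes)
```

Each chunk begins with its own heading, so the embedding for a clause always includes the section it belongs to.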

Stage 5 — Confidence scoring and review routing

Not all extracted fields carry equal certainty. A production pipeline assigns a composite confidence score to each field — aggregated from token logprobs, schema validation results, and cross-reference consistency checks — and uses thresholds to route the field. High-confidence fields flow to the system of record. Medium-confidence fields go to spot-check review. Low-confidence fields route to full human review with the model's top-k alternatives surfaced.

Addresses: silent extraction failures that look correct but aren't, downstream decisions made on unreliable data, audit-trail requirements for regulated industries. Thresholds are calibrated per field type because the cost asymmetry of a wrong indemnification cap is not the cost asymmetry of a wrong notice address.
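The routing logic can be sketched as follows. The article does not specify how the three signals are aggregated, so the weighted mean here is an assumption, as are the signal names, the equal weights, and the `spot_check` and `full_review` labels. Thresholds are passed in per call because they are calibrated per field type.

```python
def composite_confidence(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-signal scores (assumed aggregation rule)."""
    total = sum(weights.values())
    return sum(signals[name] * weight for name, weight in weights.items()) / total


def route(score: float, high_min: float, review_below: float) -> str:
    """Map a composite score to one of three destinations."""
    if score >= high_min:
        return "system_of_record"
    if score < review_below:
        return "full_review"
    return "spot_check"


# Illustrative signals on a 0 to 1 scale, with hypothetical equal weights.
signals = {"logprob": 0.90, "schema": 1.00, "cross_reference": 0.80}
weights = {"logprob": 1.0, "schema": 1.0, "cross_reference": 1.0}

score = composite_confidence(signals, weights)
decision = route(score, high_min=0.92, review_below=0.80)
print(round(score, 2), decision)
```

Keeping the individual signals alongside the composite lets a reviewer see why a field was routed, not just where it went.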

A Worked Example: One Clause Through the Pipeline

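The abstractions above are easier to read than to apply. Here is what they look like in practice: a single clause from a Master Services Agreement traced through the extraction stages of the pipeline (1, 2, 3, and 5). Stage 4, chunking, serves retrieval rather than field extraction and is covered in its own section afterward.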

Input — raw text from the PDF

The starting point is the rendered page. After OCR or text-layer extraction, the relevant section of the document looks like an undifferentiated text stream. Visual structure — the section number, the indentation, the bold heading — is preserved only in the rendered image, not in the text dump:

page-14.txt (raw extraction)
8.2 Termination for Convenience

(a) Either party may terminate this Agreement for convenience
upon providing the other party with not less than ninety (90)
days' prior written notice. Such notice shall be delivered in
accordance with Section 14.3 (Notices) and shall specify the
effective date of termination.

(b) Upon termination for convenience, Customer shall pay
Provider all fees accrued through the effective date of
termination, including any non-cancellable third-party costs
incurred by Provider prior to receipt of the termination notice.

This is what a text-only pipeline has to work with. The section number "8.2" is just two characters in the stream; nothing tells the downstream system that this is a section heading rather than, say, a paragraph reference. The clause hierarchy — (a) and (b) as sub-clauses of 8.2 — is a typographical convention that has been flattened. The reference to "Section 14.3 (Notices)" is just a string; no link to the actual notices clause exists. This is the representation a naive prompt-driven pipeline tries to extract from.

Stage 1 output — partitioned elements

Partitioning converts the same page into a typed element graph. Each element carries a type, a position, and the structural information that was visible in the rendered page but invisible in the text dump:

partitioned.json
[
  {
    "element_id": "elem_0142",
    "type": "Title",
    "level": 2,
    "text": "Termination for Convenience",
    "section_number": "8.2",
    "page": 14,
    "bbox": [72, 188, 412, 210]
  },
  {
    "element_id": "elem_0143",
    "type": "ListItem",
    "marker": "(a)",
    "parent_section": "8.2",
    "text": "Either party may terminate this Agreement for convenience upon providing the other party with not less than ninety (90) days' prior written notice. Such notice shall be delivered in accordance with Section 14.3 (Notices) and shall specify the effective date of termination.",
    "page": 14,
    "bbox": [72, 220, 540, 312],
    "internal_references": ["§14.3"]
  },
  {
    "element_id": "elem_0144",
    "type": "ListItem",
    "marker": "(b)",
    "parent_section": "8.2",
    "text": "Upon termination for convenience, Customer shall pay Provider all fees accrued through the effective date of termination, including any non-cancellable third-party costs incurred by Provider prior to receipt of the termination notice.",
    "page": 14,
    "bbox": [72, 322, 540, 398]
  }
]

The downstream stages now have access to information that was lost in the raw text. The clause is identified as termination for convenience, not as an arbitrary paragraph. Sub-clauses (a) and (b) are explicitly nested under section 8.2. Internal references have been detected and surfaced as structured fields rather than left as inline strings. Every element carries a bounding box, which is the substrate for provenance later in the pipeline.
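One way a downstream stage might consume these elements is to rebuild the clause hierarchy from the `section_number`, `parent_section`, and `marker` fields shown in the listing above. The function name `build_section_tree` and the shape of the returned dictionary are hypothetical, and the element texts are shortened here.

```python
def build_section_tree(elements: list[dict]) -> dict[str, dict]:
    """Index sections by number and attach their sub-clauses by marker."""
    tree: dict[str, dict] = {}
    for element in elements:
        if element["type"] == "Title":
            tree[element["section_number"]] = {"title": element["text"], "clauses": {}}
    for element in elements:
        parent = element.get("parent_section")
        if element["type"] == "ListItem" and parent in tree:
            tree[parent]["clauses"][element["marker"]] = element["element_id"]
    return tree


elements = [
    {"element_id": "elem_0142", "type": "Title", "section_number": "8.2",
     "text": "Termination for Convenience"},
    {"element_id": "elem_0143", "type": "ListItem", "marker": "(a)",
     "parent_section": "8.2", "text": "Either party may terminate ..."},
    {"element_id": "elem_0144", "type": "ListItem", "marker": "(b)",
     "parent_section": "8.2", "text": "Upon termination for convenience ..."},
]

tree = build_section_tree(elements)
print(tree)
```

With this index, a reference such as §8.2(a) becomes a dictionary lookup rather than a string search.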

Stage 2 output — schema-constrained extraction

With typed elements as input, the VLM extracts against an explicit schema. The prompt declares what fields to extract (notice period, terminating party, fee obligations), their types (duration, enum, monetary), and the element scope to extract them from. The model emits structured output that conforms to the schema by construction:

extraction.json (raw)
{
  "termination_for_convenience": {
    "available_to": "either_party",
    "notice_period": {
      "value": "ninety (90) days",
      "type": "duration"
    },
    "notice_delivery_clause": "Section 14.3",
    "post_termination_obligations": {
      "customer_pays": [
        "accrued fees through effective date",
        "non-cancellable third-party costs"
      ]
    },
    "source_elements": ["elem_0143", "elem_0144"]
  }
}

The model has done what it's good at: reading the clause text and emitting structured fields. What it has not yet done is normalize values, link cross-references, or attach confidence — those are deterministic downstream steps.

Stage 3 output — normalized and validated

Post-processing normalizes the values to canonical types and resolves the internal reference. "Ninety (90) days" becomes an ISO 8601 duration. The "Section 14.3" reference is resolved against the document's section index and bound to the actual notices clause element. Schema validation confirms that every field matches its declared type:

extraction.json (normalized)
{
  "field_path": "termination_for_convenience",
  "values": {
    "available_to": "either_party",
    "notice_period_iso": "P90D",
    "notice_period_surface": "ninety (90) days",
    "notice_delivery": {
      "reference_text": "Section 14.3",
      "resolved_element": "elem_0287",
      "resolved_section": "14.3 Notices"
    },
    "customer_obligations_on_termination": [
      "accrued_fees_through_effective_date",
      "non_cancellable_third_party_costs"
    ]
  },
  "provenance": {
    "document_id": "msa-2024-acme-v3.pdf",
    "primary_clause": "§8.2(a)",
    "page": 14,
    "bbox": [72, 220, 540, 312]
  }
}
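The reference-resolution step can be sketched as a lookup against a section index. The index structure, the regular expression, and the function name `resolve_reference` are assumptions for illustration; only the identifiers `elem_0287` and "14.3 Notices" come from the listing above.

```python
import re
from typing import Optional

# Matches "Section 14.3", "Section 8", "Section 2.1.4".
SECTION_REF = re.compile(r"Section\s+(\d+(?:\.\d+)*)")


def resolve_reference(reference_text: str, section_index: dict[str, dict]) -> Optional[dict]:
    """Bind an inline section reference to the element that defines that section."""
    match = SECTION_REF.search(reference_text)
    if not match:
        return None
    target = section_index.get(match.group(1))
    if target is None:
        return None  # dangling reference: flag it rather than guess
    return {
        "reference_text": match.group(0),
        "resolved_element": target["element_id"],
        "resolved_section": f"{match.group(1)} {target['title']}",
    }


# Hypothetical index built from the partitioned Title elements.
section_index = {"14.3": {"element_id": "elem_0287", "title": "Notices"}}

resolved = resolve_reference("in accordance with Section 14.3 (Notices)", section_index)
print(resolved)
```

A reference that fails to resolve is itself a useful signal: it can lower the field's cross-reference consistency score instead of passing through unnoticed.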

Stage 5 output — confidence-scored and routed

The composite confidence score is assembled from three independent signals: the model's token-level logprob over the extracted span (reported here on a 0 to 1 scale rather than as a raw log value), the schema validation result, and a cross-reference consistency check (the notice period mentioned in §8.2 also appears in the termination summary in §1, where it can be cross-validated). The routing decision falls out of threshold comparisons against per-field calibration:

extraction.json (final, with confidence)
{
  "field_path": "termination_for_convenience",
  "values": { /* ...as above... */ },
  "provenance": { /* ...as above... */ },
  "confidence": {
    "extraction_logprob": 0.94,
    "schema_validation": 1.00,
    "cross_reference_consistency": 0.97,
    "composite": 0.95
  },
  "thresholds": {
    "field_class": "termination_provision",
    "high_confidence_min": 0.92,
    "review_required_below": 0.80
  },
  "routing": "system_of_record",
  "alternatives": []
}

The final record carries everything a downstream consumer needs: the normalized values, the original surface forms, full provenance to the source pixels, decomposed confidence signals, and an explicit routing decision. A reviewer who wants to audit this extraction can jump from the composite score directly to the bounding box on page 14 and verify the source text — every step in the pipeline is inspectable, and the probabilistic component is bounded to a single, auditable stage.

This is what "production-ready" means in practice. Not better prompts, not a larger model — but a pipeline where every extracted field carries normalization, provenance, decomposed confidence, and a routing decision, produced through stages that are deterministic except for the bounded model call at stage 2.

Chunking: The Step That Determines Retrieval Quality

The pipeline above produces structured extraction. For applications that also need retrieval — RAG over the document portfolio, semantic search, question-answering against the underlying text — there's a parallel decision about how the extracted content gets chunked for embedding and indexing. Chunking strategy is one of the highest-leverage decisions in a RAG pipeline, and poor chunking will degrade retrieval quality regardless of how capable the downstream model is.

Why chunking matters mechanically

Embedding models compress text into a fixed-dimension vector regardless of input length. A short focused passage and a long mixed-topic passage both produce a single vector of the same dimensionality. The compression is lossy, and it gets lossier as the input gets more semantically diverse. Large chunks that mix multiple topics produce coarse representations where no single topic dominates the vector — and similarity search against such vectors returns coarse, low-precision results.

The retrieved chunks also feed directly into the prompt as context for the generating model. Filling a context window with retrieved chunks creates the needle-in-a-haystack problem: the relevant content is present, but buried in surrounding context the model has to wade through to find it. Smaller, focused chunks both retrieve more precisely and present more cleanly downstream. A starting point of around 250 tokens — roughly 1,000 characters — is a sensible baseline for most enterprise document types.

The four chunking approaches

Character splitting
Baseline

Divides text into fixed N-character segments with optional overlap between consecutive chunks. The simplest possible approach — and the most problematic.

Strengths
Trivial to implement
Works on any text format
Limitations
Ignores document structure entirely
Splits mid-sentence and mid-word
Mixes unrelated topics in single chunks
Overlap doesn't fix structural breaks
Best for: Prototyping only. Not suitable for production enterprise document pipelines

Recursive / sentence-level chunking
Improved

Splits text using an ordered list of separators — paragraph breaks, newlines, sentences, spaces — applied recursively until chunks reach the target size.

Strengths
Reduces mid-sentence splits
Respects paragraph boundaries
Works across plain text formats
Limitations
Still ignores tables, lists, headers
Requires different separator sets per format
Fails on image-based PDFs
No semantic awareness
Best for: Plain text documents with consistent formatting; not suitable for mixed-format enterprise PDFs

Structure-aware smart chunking
Production

Operates on document elements produced during partitioning — paragraphs, titles, tables, list items — rather than raw text. Chunk boundaries follow the document's logical structure.

Strengths
Preserves semantic boundaries
Handles tables, lists, headers natively
Universal across document types
Supports by-title, by-page, by-similarity strategies
Limitations
Requires a partitioning step before chunking
More complex pipeline setup
Best for: Enterprise document portfolios with mixed formats, complex layouts, and tables

Contextual chunking
Advanced

Prepends a short, chunk-specific explanatory context to each chunk before embeddings are generated — giving each segment additional information about where it sits within the broader document.

Strengths
Significantly improves retrieval accuracy
Each chunk carries its own context
No manual context writing required
Limitations
Higher processing cost per chunk
Requires a capable model for context generation
Best for: High-value document portfolios where retrieval precision is critical — contracts, financial reports, regulatory filings

The key insight from structure-aware chunking is that you are not splitting text — you are splitting a document that has already been understood. The partitioning step does the structural analysis; the chunking step respects it. This is the difference between cutting a document at arbitrary character counts and cutting it at the boundaries the document itself defines.
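For the contextual approach listed above, the sketch below shows the basic mechanic of prepending location context to a chunk before it is embedded. It substitutes a fixed template built from structural metadata for the model-generated context the approach normally uses, so treat it as a simplified stand-in. The function name `contextualize`, the bracketed prefix format, and the parent section title "8 Term and Termination" are all hypothetical.

```python
def contextualize(chunk_text: str, document_title: str, section_path: list[str]) -> str:
    """Prefix a chunk with where it sits in the document before embedding."""
    location = " > ".join(section_path)
    return f"[{document_title} | {location}]\n{chunk_text}"


chunk = "Either party may terminate this Agreement for convenience ..."
contextual_chunk = contextualize(
    chunk,
    document_title="Master Services Agreement",
    section_path=["8 Term and Termination", "8.2 Termination for Convenience"],
)
print(contextual_chunk)
```

The section path is only available because partitioning ran first, which is one reason contextual chunking pairs naturally with a structure-aware pipeline.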

What This Means for Contract Documents

Contract PDFs are among the most demanding document types for extraction pipelines. They combine dense legal prose with structured data (dates, parties, monetary values), semi-structured content (clause hierarchies, cross-references), and tabular data (pricing schedules, SLA tiers, jurisdiction matrices). They are frequently scanned rather than digitally-native. They vary enormously in formatting across counterparties, time periods, and document types.

The failure modes that matter most in contract extraction are not the ones that produce obviously wrong output. They are the ones that produce plausible-looking output that is structurally wrong — a liability cap attributed to the wrong party, a renewal date extracted from the wrong clause, a governing law field populated with the wrong jurisdiction because the model confused a recital with a substantive provision.

Clause-level partitioning
Contract extraction requires partitioning at the clause level, not just the paragraph level. A liability clause, an indemnification clause, and a force majeure clause may all appear in the same section — and must be distinguished for accurate field extraction.
Cross-reference resolution
Contracts routinely define terms in one clause and use them in another. Effective extraction requires resolving these cross-references — understanding that "the Effective Date" in clause 7.1 refers to the date defined in clause 1.1.
Amendment-aware processing
A contract portfolio is not a static set of base agreements. Amendments, addenda, and side letters modify the terms of base agreements. Extraction pipelines must process the full document set for each agreement and reconcile conflicting provisions.
Confidence-gated field extraction
High-consequence fields — liability caps, payment terms, termination rights, governing law — require higher confidence thresholds before being passed downstream. A 60% confidence extraction of a liability cap is not useful data; it is a liability.
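Amendment-aware processing can be sketched as a field-by-field merge over the documents that make up one agreement. The rule used here, that a later-dated document overrides an earlier one, is a simplifying assumption: real contracts may contain order-of-precedence clauses that change it, and conflicts on high-consequence fields may still warrant review. The document identifiers, field names, and the `effective_terms` function are hypothetical.

```python
from datetime import date


def effective_terms(documents: list[dict]) -> dict[str, dict]:
    """Later-dated documents override earlier ones, field by field (assumed rule)."""
    merged: dict[str, dict] = {}
    for doc in sorted(documents, key=lambda d: d["effective_date"]):
        for field, value in doc["fields"].items():
            # Record which document supplied the value so provenance survives the merge.
            merged[field] = {"value": value, "source": doc["document_id"]}
    return merged


documents = [
    {
        "document_id": "amendment-1",
        "effective_date": date(2025, 3, 1),
        "fields": {"notice_period_iso": "P60D"},
    },
    {
        "document_id": "msa-base",
        "effective_date": date(2024, 1, 15),
        "fields": {"notice_period_iso": "P90D", "available_to": "either_party"},
    },
]

terms = effective_terms(documents)
print(terms)
```

Tracking the source document per field means a query for the current notice period returns both the value and the amendment that set it.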

The Bottom Line

Extracting structured intelligence from PDF documents is a solved problem in the sense that the tools and techniques exist. It is not a solved problem in the sense that pointing a capable model at a document with a simple prompt produces production-quality output. The benchmark data is clear on this: the gap between a raw model call and a purpose-built pipeline is real, it is consistent across model families, and it shows up most consequentially in exactly the places that matter most for enterprise use — tables, document structure, and the balance between recall and hallucination.

The pipeline that closes that gap has five layers: partitioning, schema-constrained extraction, post-processing and normalization, structure-aware chunking, and confidence-gated review routing. Each layer operates on a more refined representation of the document than the last, and each addresses a specific failure mode. The probabilistic component is bounded to a single stage; everything around it is deterministic and inspectable. The output is not just extracted fields but extracted fields with normalization, provenance, confidence, and routing decisions attached — which is the difference between data that can be used and data that can't.

The organizations that treat document extraction as an infrastructure investment — not a one-time prompt engineering exercise — are the ones that end up with contract data they can actually use. The difference is not in the model. It is in the pipeline built around it.

Document Parsing, RAG, Chunking, PDF Extraction, Contract Intelligence, Vision-Language Models, Confidence Scoring