Contract Intelligence May 2026 18 min read

Clause Lineage & Linkage:
Resolving Relationships and Provenance in Documents

How do you track commitments across MSAs, statements of work, and amending agreements—and prove, at audit time, exactly which pixel in which PDF produced each obligation? This article walks the full technical stack: extraction grounding, DAG construction, edge-type semantics, clause resolution traversal, storage schema, and OpenLineage event emission.

When enterprises negotiate contracts, they rarely sign a single isolated file. A commercial relationship evolves into a network: a Master Services Agreement anchors three subsequent amendments, five Statements of Work, and several Service Level Agreements and data processing addenda. Each document can override, inherit, or extend terms from its parents.

Traditional CLM systems treat this portfolio as flat metadata records. That works until a compliance team asks: "What is the effective limitation of liability for Acme Corp right now, accounting for all amendments?" Answering that correctly requires modeling the portfolio as a Directed Acyclic Graph (DAG) and implementing deterministic data lineage at clause resolution time.

The Mesh Viewpoint: Contracts are not text documents—they are structured data programs. The relationships between documents and clauses form a compiler's dependency graph. Lineage tracking enforces the integrity of that program's execution. Every extracted obligation must resolve to a single, verifiable byte offset in a source file.

01 / Visual Grounding & Schema-Constrained Extraction

Before linkage can happen, structured data must be extracted from unstructured sources—PDF, DOCX, scanned images. Flat text parsers discard layout, table borders, and margin annotations, destroying semantic signals that determine clause scope and hierarchy.

Extraction Pipeline

① INGEST · PDF / DOCX · sha256 hash
② RENDER · page rasterize · 2D coord index
③ VLM EXTRACT · text + visual patch encoding
④ CONSTRAIN · logit grammar · JSON schema
⑤ CLAUSE NODE (output) · text + bbox + sha256
Every ClauseNode carries: clauseId · pageNumber · boundingBox [x0,y0,x1,y1] · sha256 · confidence
LAYOUT
Spatial Layout Indexing
2D coordinates are preserved alongside token text. Tables, headings, and margin callouts are indexed by page-coordinate regions. The coordinate tuple becomes the permanent address for any fact extracted from that region.
GRAMMAR
Schema-Constrained Decoding
LLM extraction is forced to conform to a JSON schema during decoding via logit-level grammar masks (e.g., clause_schema_v4.gbnf). This rules out hallucinated keys, malformed types, and syntactically invalid JSON before output enters the pipeline. It constrains structure only; the extracted values still need downstream validation.
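A minimal sketch of the record that comes out of this stage, assuming the field list given for ClauseNode above. The validation rules (ordered coordinates, a 64-character hex digest, confidence in [0, 1]) and the `clause_node_from_json` helper are illustrative assumptions, not the contents of clause_schema_v4.gbnf.

```python
import json
import re
from dataclasses import dataclass

_SHA256_HEX = re.compile(r"^[0-9a-f]{64}$")


@dataclass(frozen=True)
class ClauseNode:
    clauseId: str
    pageNumber: int
    boundingBox: tuple  # (x0, y0, x1, y1) in page coordinates
    sha256: str         # digest of the source file the clause came from
    confidence: float
    text: str = ""

    def __post_init__(self):
        # Unpacking also rejects boxes that do not have exactly four values
        x0, y0, x1, y1 = self.boundingBox
        if not (x0 < x1 and y0 < y1):
            raise ValueError("boundingBox must satisfy x0 < x1 and y0 < y1")
        if self.pageNumber < 1:
            raise ValueError("pageNumber is 1-based")
        if not _SHA256_HEX.match(self.sha256):
            raise ValueError("sha256 must be 64 lowercase hex characters")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be within [0, 1]")


def clause_node_from_json(raw: str) -> ClauseNode:
    """Parse decoder output into a validated ClauseNode (hypothetical helper)."""
    data = json.loads(raw)
    data["boundingBox"] = tuple(data["boundingBox"])
    return ClauseNode(**data)
```

Grammar-constrained decoding guarantees the shape of the JSON; checks like these catch values that are well-formed but impossible, such as an inverted bounding box.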

02 / Building the Lineage Graph: Nodes, Edges, and Types

With clause nodes extracted and anchored, the pipeline constructs the lineage graph. This is a DAG where nodes are ClauseNodes or DocumentNodes, and typed edges encode the legal relationship between them.

Inter-Document DAG

// Inter-Document Linkage — Document-Level DAG
[ doc_msa_acme_v1 ] ← root instrument
│
├── CHILD_OF ──▶ [ doc_sow_acme_001 ]   inherits: liability, IP ownership, dispute resolution
│
├── CHILD_OF ──▶ [ doc_sow_acme_002 ]
│        │
│        └── AMENDS ──▶ [ doc_amd_sow002_delivery ]   overrides: delivery date clause § 3.1
│
└── SUPERSEDED_BY ──▶ [ doc_msa_acme_v2 ]   mutates: liability cap, auto-renewal clause
         │
         └── SUPERSEDED_BY ──▶ [ doc_msa_acme_v3 ] ← effective root

Edge Types and Their Resolution Behavior

The lineage graph is only as precise as its edge vocabulary. Each edge type carries distinct traversal semantics—the resolver's behavior at query time depends entirely on which edge type it is walking.

CHILD_OF
SOW #1 (child doc) → MSA v3 (parent doc)
Resolution: walk to parent if clause absent in child. SOW inherits terms without overriding them.
AMENDS
Amd §4.2 (amendment clause) → MSA §4.2 (original clause)
Resolution: amendment clause is the effective text. Original is archived; never returned as active law.
SUPERSEDED_BY
MSA v1 (old version) → MSA v3 (new version)
Resolution: traverse to newest non-superseded node. Old version becomes read-only historical record.
TERMINATES
Term. Notice (termination clause) → Obligation (target clause)
Resolution: node marked INACTIVE. Excluded from all active obligation queries.

How Edges Are Built: Reference Resolution Pass

EXPLICIT
Citation Pattern Matching
Regex + NER detects phrases like "pursuant to the MSA dated…" or "Article 4.2 is hereby amended to read…". These produce high-confidence edges (p > 0.95) with the cited document and section as the edge target.
SEMANTIC
Embedding Similarity
For clauses without explicit citations, a dense vector (e.g., from text-embedding-3-large) is generated for each clause and compared only against clauses of the same type. Cosine similarity above a configured threshold triggers a candidate AMENDS edge, which is written only after LLM classification confirms it.
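The explicit pass can be sketched with the standard library alone. The two patterns below cover only the example phrases quoted above and are illustrative assumptions; a production rule set would be much larger and would be paired with the NER step, which is omitted here.

```python
import re

# Matches phrases such as "Article 4.2 is hereby amended"
AMENDMENT_PATTERN = re.compile(
    r"(?P<section>(?:Article|Section)\s+\d+(?:\.\d+)*)\s+is\s+hereby\s+amended"
)

# Matches phrases such as "pursuant to the MSA dated January 15, 2026"
REFERENCE_PATTERN = re.compile(
    r"pursuant\s+to\s+the\s+(?P<doc>[A-Z]{2,5})\s+dated\s+"
    r"(?P<date>[A-Z][a-z]+\s+\d{1,2},\s+\d{4})"
)


def find_explicit_citations(text: str) -> list:
    """Return one dict per explicit citation phrase found in the clause text."""
    hits = []
    for m in AMENDMENT_PATTERN.finditer(text):
        hits.append({"kind": "amendment", "section": m.group("section"), "span": m.group(0)})
    for m in REFERENCE_PATTERN.finditer(text):
        hits.append({"kind": "reference", "doc": m.group("doc"),
                     "date": m.group("date"), "span": m.group(0)})
    return hits
```

Each hit keeps the matched `span`, so the edge that is eventually written can point back at the exact phrase that justified it.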

03 / Effective-Clause Resolution: Graph Traversal

With the graph constructed, the core query is: "What is the current effective text of clause X, as of today?" This requires a deterministic traversal that respects edge types and temporal ordering.

Resolution Algorithm Flow

QUERY: clause_id + as_of
→ graph.get_node(clause_id): fetch ClauseNode from store
→ status == TERMINATED? YES: ObligationTerminated error raised. NO: continue.
→ get_inbound_edges(AMENDS | SUPERSEDED_BY) · filter: effective_from ≤ as_of · sort: desc
→ Inbound edges found? NO: RETURN this node. YES: DFS recurse into source of newest edge.

The returned node carries a resolution_path—an ordered list of every edge traversed to reach the effective clause. This path is the audit trail inspectable by compliance teams to explain exactly why a given clause version was returned.
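The flow can be sketched against a small in-memory graph. Everything here is a stand-in for the real store: the `InMemoryGraph`, `Clause`, and `Edge` types and the `ObligationTerminatedError` name are hypothetical, the graph is passed as an explicit argument, and every edge is assumed to be stored with the newer clause as its source, which is what "recurse into source of newest edge" implies. Because only the newest edge is followed at each step, the recursion is written as a loop.

```python
from dataclasses import dataclass, field, replace
from datetime import date


class ObligationTerminatedError(Exception):
    """Raised when the traversal reaches a clause marked TERMINATED."""


@dataclass
class Clause:
    clause_id: str
    text: str
    status: str = "ACTIVE"
    resolution_path: list = field(default_factory=list)


@dataclass
class Edge:
    source_id: str      # newer clause (assumption: always the edge source)
    target_id: str      # older clause being replaced
    edge_type: str      # "AMENDS" or "SUPERSEDED_BY"
    effective_from: date


class InMemoryGraph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def get_node(self, clause_id):
        return self.nodes[clause_id]

    def get_inbound_edges(self, clause_id, as_of):
        hits = [e for e in self.edges
                if e.target_id == clause_id
                and e.edge_type in ("AMENDS", "SUPERSEDED_BY")
                and e.effective_from <= as_of]
        # Newest first, matching the "sort: desc" step in the flow
        return sorted(hits, key=lambda e: e.effective_from, reverse=True)


def resolve_effective_clause(clause_id, as_of, graph):
    path = [{"clauseId": clause_id, "edgeType": "ROOT"}]
    visited = {clause_id}
    current = graph.get_node(clause_id)
    while True:
        if current.status == "TERMINATED":
            raise ObligationTerminatedError(current.clause_id)
        inbound = graph.get_inbound_edges(current.clause_id, as_of)
        if not inbound:
            break
        newest = inbound[0]
        if newest.source_id in visited:
            # Defensive guard: a well-formed DAG never reaches this branch
            raise ValueError(f"cycle detected at {newest.source_id}")
        visited.add(newest.source_id)
        path.append({"clauseId": newest.source_id, "edgeType": newest.edge_type})
        current = graph.get_node(newest.source_id)
    # Effective clause first, original last
    return replace(current, resolution_path=list(reversed(path)))
```

Querying with an `as_of` date earlier than the amendment's `effective_from` returns the original clause, which is how the same graph answers historical questions.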

Inheritance Walk for Missing Clauses

When a SOW has no local version of a clause type (e.g., no indemnification clause), the resolver walks CHILD_OF edges upward to the parent MSA. This is prototype-chain resolution: the closest ancestor that has the attribute wins.

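resolve_inherited_clause.py · Python
from datetime import datetime

def resolve_with_inheritance(doc_id: str, clause_type: str, as_of: datetime):
    # Assumes graph.get_document(doc_id) returns the DocumentNode for this id
    doc = graph.get_document(doc_id)
    local = doc.get_clause_by_type(clause_type)
    if local:
        return resolve_effective_clause(local.clause_id, as_of)

    # Walk CHILD_OF edges depth-first toward the root; the nearest ancestor wins
    for parent_doc in graph.get_parents(doc_id, edge_type="CHILD_OF"):
        result = resolve_with_inheritance(parent_doc.doc_id, clause_type, as_of)
        if result:
            # Flag as inherited, keeping the ancestor that actually holds the
            # clause rather than overwriting it as the recursion unwinds
            if getattr(result, "inherited_from", None) is None:
                result.inherited_from = parent_doc.doc_id
            return result
    return None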

04 / Obligation Extraction and Risk Classification

With structural lineage established, semantic intelligence is layered on top: classifying what each clause obligates, and how that obligation deviates from the company's playbook.

Obligations are classified across four domains. Each obligation record links by clauseId to its parent ClauseNode, preserving the full chain back to the source pixel.

FINANCIAL
Financial Obligations
Payment deadlines, invoice triggers, milestone disbursement, tax responsibilities, late-payment penalties. Numeric fields are extracted as typed values, not strings.
OPERATIONAL
Operational Obligations
Delivery dates, SLA thresholds (uptime, MTTR), reporting cycles. Dates are normalized to ISO 8601; relative deadlines are flagged as unresolved until their trigger event is defined.
REGULATORY
Regulatory & Compliance
GDPR/DPA clauses, data residency, export controls, SOC 2 audit rights, breach notification timelines. Linked to specific regulation identifiers from a compliance taxonomy.
RESTRICTIVE
Restrictive Covenants
Non-competes, non-solicitation, exclusivity zones, IP assignment clauses. Geographic and temporal scope parameters are extracted as structured fields.

Risk Classification: Playbook Deviation Scoring

Low Risk
Standard Deviation
Minor wording variance; standard governing law (Delaware); Net 30 payment terms; liability cap at 1× annual contract value.
Medium Risk
Playbook Out-of-Bounds
Unfavorable governing law; Net 60+ payment terms; liability cap at 1.5× trailing fees; GDPR DPA absent but data transfer implied.
High Risk
Critical Violations
Uncapped IP indemnification; missing Limitation of Liability; unilateral termination with zero notice period.

05 / Clause Lineage Tracking with OpenLineage

The graph and resolution algorithm answer what the effective clause is. OpenLineage answers how we got there—recording every transformation step as an immutable, inspectable audit stream. Each pipeline stage emits a typed event with custom facets carrying provenance metadata.

Pipeline Stage → OpenLineage Event Map

① INGEST · START event · Input: PDF path · Facet: sha256 + file metadata
② LAYOUT · COMPLETE event · Facet: page count, table regions, bbox index size
③ EXTRACT · COMPLETE event · clauseProvenance facet per clause: page + bbox + conf (pixel anchor)
④ RESOLVE · COMPLETE event · Facet: resolution path array, edge chain
⑤ RISK SCORE · COMPLETE event · Facet: risk vector, playbook version, score components
All events are posted asynchronously to the OpenLineage backend (Marquez); the runId ties events to a single pipeline execution.
openlineage_event.json · OpenLineage Spec 2.0
{
  "eventTime": "2026-05-25T14:55:00Z",
  "eventType": "COMPLETE",
  "run": {
    "runId": "4892c90c-60e1-4569-b570-98df872a08cc",
    "facets": {
      "modelContext": { "model": "gemini-1.5-pro", "constraintGrammar": "clause_schema_v4.gbnf" },
      "lineageResolutionPath": {
        "path": [
          { "clauseId": "cl_4.2_amd1",              "edgeType": "AMENDS" },
          { "clauseId": "cl_4.2_limitation_of_liability", "edgeType": "ROOT"   }
        ]
      }
    }
  },
  "inputs": [{ "namespace": "contract-lake", "name": "MSA_Acme_Corp_v3.pdf",
    "facets": { "documentMetadata": { "sha256": "8f309a96e5dbe4889c20a9a1488e4d0c" } }
  }],
  "outputs": [{ "namespace": "obligation-registry", "name": "obligation_nodes",
    "facets": { "clauseProvenance": {
      "clauseId":    "cl_4.2_amd1",
      "pageNumber":  14,
      "boundingBox": [120, 450, 480, 590],
      "confidence":  0.98
    }}
  }]
}
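The OpenLineage run event schema also lists `job`, `producer`, and `schemaURL` as required fields, and each facet carries a `_producer` and `_schemaURL` pair; the example above does not show them. The sketch below builds the EXTRACT event as a plain dictionary with those fields filled in. The producer URL, the facet schema base, and the job namespace and name are hypothetical, and the spec URL should be checked against the OpenLineage release in use.

```python
from datetime import datetime, timezone
from typing import Optional

# Hypothetical identifiers: replace with values from your own deployment
PRODUCER = "https://example.com/contract-pipeline/v1"
FACET_SCHEMA_BASE = "https://example.com/schemas/facets"
# Assumed spec URL; verify against the OpenLineage version you target
RUN_EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def facet(name: str, payload: dict) -> dict:
    """Wrap a custom facet payload with its _producer and _schemaURL fields."""
    return {
        "_producer": PRODUCER,
        "_schemaURL": f"{FACET_SCHEMA_BASE}/{name}.json",
        **payload,
    }


def build_extract_event(run_id: str, source_name: str, sha256: str,
                        provenance: dict,
                        event_time: Optional[datetime] = None) -> dict:
    """Build a COMPLETE event for the EXTRACT stage as a JSON-serializable dict."""
    event_time = event_time or datetime.now(timezone.utc)
    return {
        "eventType": "COMPLETE",
        "eventTime": event_time.isoformat(),
        "producer": PRODUCER,
        "schemaURL": RUN_EVENT_SCHEMA,
        "run": {"runId": run_id, "facets": {}},
        # Hypothetical job identity for the extraction stage
        "job": {"namespace": "contract-pipeline", "name": "extract_clauses"},
        "inputs": [{
            "namespace": "contract-lake",
            "name": source_name,
            "facets": {"documentMetadata": facet("documentMetadata", {"sha256": sha256})},
        }],
        "outputs": [{
            "namespace": "obligation-registry",
            "name": "obligation_nodes",
            "facets": {"clauseProvenance": facet("clauseProvenance", provenance)},
        }],
    }
```

The resulting dictionary can be serialized with `json.dumps` and posted to whichever OpenLineage-compatible backend the pipeline uses.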

06 / Closing the Loop: Dashboard Click to Source Pixel

The full value of the architecture is realized at query time. When a compliance analyst clicks an obligation and asks "Show me the source," the following sequence executes end-to-end:

① Dashboard query: obligation_id = "limitation-of-liability / Acme Corp". Obligation Registry lookup → resolves to clauseId: cl_4.2_limitation_of_liability
② Graph traversal: resolve_effective_clause(clause_id, as_of=today). DFS walks AMENDS edge → effective node: cl_4.2_amd1 · resolution_path recorded
③ Fetch OpenLineage event: runId from clause_node.extracted_at. Read clauseProvenance facet → source_file, sha256, page=14, bbox=[120,450,480,590]
④ Integrity check: recompute sha256 of stored PDF. Must match stored hash → mismatch raises ProvenanceIntegrityException · pipeline halted
⑤ Render highlighted PDF: fetch page 14 raster. Draw highlight rect at [120,450,480,590] · return image + text + resolution_path to dashboard
Every claim is anchored to an immutable, hash-verified byte range in a source file. No trust in AI output is required: the PDF pixels are the ground truth.
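The integrity check is the simplest step to make concrete. A minimal sketch using only `hashlib`; the function names are illustrative, and the exception class is defined locally to match the name used in the sequence above.

```python
import hashlib


class ProvenanceIntegrityException(Exception):
    """Raised when a stored file no longer matches its recorded digest."""


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_source(path: str, expected_sha256: str) -> None:
    """Recompute the digest and halt if it differs from the recorded one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ProvenanceIntegrityException(
            f"{path}: expected {expected_sha256}, got {actual}"
        )
```

Running the check before rendering means a highlight is never drawn on a file that changed after extraction.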

07 / How LLMs Actually Handle Lineage—and Where They Break

The prior sections describe a deterministic pipeline. This section addresses a direct question: what does it look like when an LLM actually performs the reference detection and edge classification steps? We walk through the real prompt structures, the tool-calling payloads, the embedding pass, and the exact failure modes that make a deterministic graph layer non-negotiable beneath them.

Why a Single "Parse This" Prompt Fails

The instinct is to write one prompt that does everything: detect the cross-reference, extract the target, and infer the relationship type. That structure is wrong for three concrete reasons. First, it asks the model to simultaneously perform extraction (find a span) and reasoning (judge legal intent) in a single pass—two tasks with different reliability profiles. Second, the relation_hint field in a single-prompt design asks the model to classify the relationship before it has seen the target clause, which means it is guessing from one side of the relationship. Third, when that combined output flows toward a graph write, a wrong relation inference arrives in the same JSON object as a high-confidence span extraction, and there is no mechanism to decompose the error.

The correct structure is three separate prompts with a deterministic resolution step in between. Each prompt has a single falsifiable output that can be tested independently.

PROMPT A · Citation Span Extraction · verbatim spans only, no relation inference
→ PORTFOLIO LOOKUP · Deterministic · resolve span → clauseId, doc registry match, no LLM involved
→ PROMPT B · Pairwise Edge Classification · both clause texts in context
→ PROMPT C · Obligation Extraction · separate task, separate call

Prompt A — Citation Span Extraction Only

This prompt has one job: find verbatim phrases in the clause that point to an external document or section. It does not infer what the relationship means. The negative-space instructions are as important as the positive ones—without them, the model flags definition references, governing law citations, and self-references as lineage candidates.

prompt_a_citation_span.txt · System + User Prompt
## SYSTEM
Extract cross-document citation spans from the clause below.

A citation span is a verbatim phrase that explicitly names a different contract
document or a numbered section within a different document that this clause
depends on or references.

Do NOT extract:
  - References to defined terms within the same agreement
    (e.g. "as defined in Section 1.2 of this Agreement")
  - References to applicable law or regulation
    (e.g. "pursuant to GDPR Article 17", "under applicable export control laws")
  - Self-references to other sections within the same document
  - Generic incorporation phrases without a named external document
    (e.g. "incorporated herein by reference" with no document named)

Rules:
  - Return the span character-exact — do not paraphrase
  - If a document name AND a section are both present, include both in the span
  - If no qualifying citation exists, return {"citations": []}
  - Return ONLY valid JSON. No explanation, no commentary.

Output schema:
{
  "citations": [
    {
      "span":           string,        // verbatim, character-exact substring
      "doc_name_hint":  string | null, // e.g. "Master Services Agreement"
      "doc_date_hint":  string | null, // e.g. "January 15, 2026" if present
      "section_hint":   string | null  // e.g. "Article 4.2" if present
    }
  ]
}

## USER
Clause (source document: Amendment_1_to_MSA.pdf · section: § 4.2 · executed: 2026-03-01):
"""
Article 4.2 of this Amendment hereby supersedes and replaces Article 4.2
of the Master Services Agreement dated January 15, 2026 in its entirety.
The liability cap shall not exceed two times (2x) the fees paid in the
twelve (12) months preceding the claim.
"""
prompt_a_response.json · LLM Output
{
  "citations": [
    {
      "span":          "Article 4.2 of the Master Services Agreement dated January 15, 2026",
      "doc_name_hint": "Master Services Agreement",
      "doc_date_hint": "January 15, 2026",
      "section_hint":  "Article 4.2"
    }
  ]
}

The span field is validated mechanically: the pipeline confirms it is a literal substring of the source clause text before proceeding. If it is not, the output is rejected as hallucinated—the model invented text that does not exist in the clause. This check costs one string operation and catches a real class of model errors.
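One practical wrinkle: if the stored clause text keeps hard line breaks from extraction, a span the model returns on a single line will fail a raw substring test even when every word is verbatim. The sketch below collapses whitespace on both sides before comparing. That normalization is an assumption about how the stored text looks and a deliberate relaxation of "character-exact"; `span_is_verbatim` is a hypothetical helper name.

```python
import re


def _normalize(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def span_is_verbatim(span: str, clause_text: str) -> bool:
    """True when the span appears in the clause text, ignoring whitespace layout."""
    normalized_span = _normalize(span)
    return bool(normalized_span) and normalized_span in _normalize(clause_text)
```

Applied to the example above, the extracted span crosses a line break in the clause, so a plain `in` test rejects it while the normalized comparison accepts it. A paraphrase such as "Article 4.2 of the MSA" is still rejected.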

Deterministic Portfolio Resolution (No LLM)

Prompt A's output feeds a deterministic lookup, not another LLM call. The pipeline matches doc_name_hint and doc_date_hint against the document registry to resolve a specific doc_id, then queries the extracted clause index for a clause matching section_hint within that document. This step produces the target_clause_id needed for the graph edge—without any model inference involved.

portfolio_resolution.py · Python
import logging

log = logging.getLogger(__name__)

def resolve_citation_to_clause(citation: dict, source_clause, doc_registry, clause_store) -> str | None:
    """
    Resolve a Prompt A citation output to a concrete clauseId.
    Entirely deterministic: no LLM involved.
    source_clause is the ClauseNode the citation was extracted from.
    """
    # Step 1: validate span exists verbatim in source text (hallucination check)
    if citation["span"] not in source_clause.effective_text:
        log.warning("Span not found in source text, likely hallucinated: %s", citation["span"])
        return None

    # Step 2: match doc_name_hint + doc_date_hint against registry
    candidate_docs = doc_registry.fuzzy_match(
        name=citation["doc_name_hint"],
        date=citation["doc_date_hint"]   # date disambiguates MSA v1 vs v3
    )
    if not candidate_docs:
        return None  # document not in portfolio: flag for review

    target_doc = candidate_docs[0]  # highest-confidence match

    # Step 3: resolve section_hint to a clauseId within the target document
    if citation["section_hint"]:
        clause = clause_store.get_by_section(target_doc.doc_id, citation["section_hint"])
        return clause.clause_id if clause else None

    # No section hint: return doc-level node for inheritance edge
    return target_doc.root_clause_id

Prompt B — Pairwise Edge Classification

Only after portfolio resolution produces a concrete target_clause_id does Prompt B run. This call receives the full text of both clauses—source and target—so the model can actually read both sides of the relationship before classifying it. Asking for relationship type without the target clause in context (as the original single-prompt design did) is the equivalent of asking a lawyer to characterize an amendment without showing them the original contract.

prompt_b_edge_classification.txt · System + User Prompt
## SYSTEM
Classify the legal relationship between two clauses from the same contract
portfolio. You will be given the full text of both clauses.

Relationship definitions:
  AMENDS      – Clause A replaces or modifies specific language in Clause B.
                Partial amendment (changes one term, preserves others) counts.
  SUPPLEMENTS – Clause A adds obligations that do not conflict with Clause B.
                Both remain fully in force after the newer document.
  CHILD_OF    – Clause A is in a document that inherits from Clause B's document
                with no override of this specific clause text.
  TERMINATES  – Clause A explicitly ends or cancels the obligation in Clause B.
  UNRELATED   – Same clause type (e.g. both are payment terms) but govern
                different subjects or counterparties. Do not infer a relationship
                from surface similarity alone.

Output ONLY valid JSON. The "reasoning" field is one sentence max and is
retained for the audit log — it does not affect the graph write.

{
  "edge_type":  "AMENDS" | "SUPPLEMENTS" | "CHILD_OF" | "TERMINATES" | "UNRELATED",
  "confidence": number,  // 0.0–1.0. Be conservative — prefer UNRELATED over
                         // a low-confidence AMENDS that could corrupt the graph.
  "reasoning":  string
}

## USER
Clause A — source (newer):
  Document: SOW_002_Amendment_1.pdf · § 3.1 · executed: 2026-03-15
"""
The Delivery Date defined in Section 3.1 of SOW #2 is hereby revised to
April 30, 2026. All other terms of Section 3.1 remain in full force.
"""

Clause B — target (older):
  Document: SOW_002.pdf · § 3.1 · executed: 2025-11-01
"""
Vendor shall deliver the completed integration no later than March 15, 2026,
subject to the change order process defined in Article 7.
"""
prompt_b_response.json · LLM Output
{
  "edge_type":  "AMENDS",
  "confidence": 0.96,
  "reasoning": "Clause A explicitly revises the delivery date in Clause B while preserving all other § 3.1 terms — a partial amendment pattern."
}

The confidence calibration instruction matters: "prefer UNRELATED over a low-confidence AMENDS." An incorrectly written AMENDS edge causes the graph traversal to return the wrong effective clause. An incorrectly written UNRELATED means a real amendment is missed — bad, but recoverable when a human reviews the queue. The asymmetry in damage severity should be reflected in the prompt's explicit bias instruction, not just the confidence gate threshold.
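Prompt B's reply is still free text until it has been parsed, so it is worth checking mechanically before anything downstream reads it. A minimal sketch, assuming the reply arrives as a raw JSON string; `parse_edge_classification` is a hypothetical helper, and any failure is treated as grounds to discard the reply rather than repair it.

```python
import json

EDGE_TYPES = {"AMENDS", "SUPPLEMENTS", "CHILD_OF", "TERMINATES", "UNRELATED"}


def parse_edge_classification(raw: str) -> dict:
    """Parse and validate a Prompt B reply; raise ValueError on any violation."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    edge_type = data.get("edge_type")
    if edge_type not in EDGE_TYPES:
        raise ValueError(f"unknown edge_type: {edge_type!r}")

    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within [0, 1]")

    reasoning = data.get("reasoning", "")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")

    return {"edge_type": edge_type, "confidence": float(confidence), "reasoning": reasoning}
```

A reply classified as UNRELATED passes validation but produces no edge, which keeps the conservative bias in the prompt visible in the audit log.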

Prompt C — Obligation Extraction

Obligation extraction is a completely separate LLM call and should never be combined with reference detection. It operates on the resolved effective clause — the output of the graph traversal — not the raw extracted text. This sequencing matters: you want to classify the obligation that is currently in force, not a version that may have been superseded.

prompt_c_obligation_extraction.txt · System + User Prompt
## SYSTEM
Extract structured obligation data from the effective clause text below.
This clause has already been resolved through the lineage graph — it is
the currently active version. Do not reference other versions.

Classify the obligation domain and extract all typed parameter fields.
Dates must be ISO 8601. Monetary amounts must include currency.
For relative deadlines (e.g. "within 30 days of Go-Live"), set
trigger_event to the condition name and absolute_date to null.

Output ONLY valid JSON:
{
  "obligation_domain": "FINANCIAL" | "OPERATIONAL" | "REGULATORY" | "RESTRICTIVE",
  "obligation_type":   string,          // specific type within domain
  "party_obligated":   "VENDOR" | "CLIENT" | "BOTH" | "UNCLEAR",
  "parameters": {
    "amount":          string | null,   // e.g. "2x trailing 12-month fees"
    "absolute_date":   string | null,   // ISO 8601 if determinable
    "trigger_event":   string | null,   // condition for relative deadlines
    "jurisdiction":    string | null,
    "notice_period":   string | null
  },
  "risk_flags": [string]                // e.g. ["UNCAPPED", "MISSING_NOTICE_PERIOD"]
}

## USER
Effective clause (resolved as of 2026-05-25):
  clauseId: cl_4.2_amd1 · source: Amendment_1_to_MSA.pdf · § 4.2
  Resolution path: cl_4.2_amd1 → AMENDS → cl_4.2_limitation_of_liability (original)
"""
The liability of either party under this Agreement shall not exceed two
times (2x) the fees paid by Client in the twelve (12) months immediately
preceding the claim giving rise to liability.
"""
prompt_c_response.json · LLM Output
{
  "obligation_domain": "FINANCIAL",
  "obligation_type":   "LIMITATION_OF_LIABILITY",
  "party_obligated":   "BOTH",
  "parameters": {
    "amount":        "2x fees paid in preceding 12 months",
    "absolute_date": null,
    "trigger_event": "claim giving rise to liability",
    "jurisdiction":  null,
    "notice_period": null
  },
  "risk_flags": []
}

Note what Prompt C receives in the user message: the resolution path. The model is told that this clause amends an earlier version. That context stops it from flagging the obligation as potentially incomplete because it "only" covers one scenario: it knows this is the final, authoritative text for this clause type in this portfolio.

Step 4: Tool-Calling for Graph Writes

The naive approach is to post-process the LLM's JSON output in application code and write edges imperatively. The better pattern is to let the model itself call a typed graph-write tool, making the write operation part of the model's structured output contract. This is supported natively in the OpenAI, Anthropic, and Gemini APIs via function/tool calling.

The model is given a tool definition at inference time. When it determines a lineage edge exists, it emits a tool_use block rather than a text response. The application layer validates the arguments and commits to the graph—or rejects and queues for review.

tool_definition.json · Anthropic Tool Spec (OpenAI function calling uses "parameters" in place of "input_schema")
{
  "name": "add_lineage_edge",
  "description": "Record a directional lineage edge between two clause nodes in the contract graph. Call this once per detected relationship.",
  "input_schema": {
    "type": "object",
    "properties": {
      "source_clause_id": { "type": "string", "description": "clauseId of the newer/overriding clause" },
      "target_clause_id": { "type": "string", "description": "clauseId of the older/original clause" },
      "edge_type":        { "type": "string", "enum": ["AMENDS","SUPPLEMENTS","CHILD_OF","TERMINATES"] },
      "confidence":       { "type": "number", "minimum": 0, "maximum": 1 },
      "citation_text":    { "type": "string", "description": "Verbatim phrase that establishes the relationship, or empty string if inferred semantically" },
      "effective_from":   { "type": "string", "format": "date" }
    },
    "required": ["source_clause_id", "target_clause_id", "edge_type", "confidence"]
  }
}

The model's tool call output for the amendment example above looks like this:

tool_use_block.json · Model Response (tool_use content block)
{
  "type":  "tool_use",
  "name":  "add_lineage_edge",
  "input": {
    "source_clause_id": "cl_3.1_amd1_delivery_date",
    "target_clause_id": "cl_3.1_sow002_delivery_date",
    "edge_type":        "AMENDS",
    "confidence":       0.96,
    "citation_text":    "The Delivery Date defined in Section 3.1 of SOW #2 is hereby revised",
    "effective_from":   "2026-03-15"
  }
}
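The argument validation that the application layer performs is not spelled out above. One possible shape is sketched below: schema-level checks that mirror the tool definition, plus a reachability test so that a committed edge cannot close a loop in the DAG. The adjacency-dictionary representation and both function names are hypothetical.

```python
from datetime import date

ALLOWED_EDGE_TYPES = {"AMENDS", "SUPPLEMENTS", "CHILD_OF", "TERMINATES"}
REQUIRED_FIELDS = ("source_clause_id", "target_clause_id", "edge_type", "confidence")


def validate_tool_input(tool_input: dict) -> list:
    """Return a list of problems; an empty list means the arguments are usable."""
    errors = [f"missing required field: {k}" for k in REQUIRED_FIELDS if k not in tool_input]
    if errors:
        return errors

    if tool_input["edge_type"] not in ALLOWED_EDGE_TYPES:
        errors.append(f"edge_type not allowed: {tool_input['edge_type']!r}")

    conf = tool_input["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append("confidence must be a number within [0, 1]")

    if tool_input["source_clause_id"] == tool_input["target_clause_id"]:
        errors.append("source and target must differ")

    if "effective_from" in tool_input:
        try:
            date.fromisoformat(tool_input["effective_from"])
        except (TypeError, ValueError):
            errors.append("effective_from must be an ISO 8601 date")
    return errors


def would_create_cycle(adjacency: dict, source: str, target: str) -> bool:
    """True if target already reaches source, so adding source -> target closes a loop."""
    stack, seen = [target], set()
    while stack:
        node = stack.pop()
        if node == source:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacency.get(node, ()))
    return False
```

Here `adjacency` maps each clause id to the set of ids it already points to. Both checks would run before any confidence thresholding, since a malformed or cycle-forming edge should not be written at any confidence.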

Step 5: The Confidence Gate — Code, Not Policy

The confidence gate is the most important single component in the pipeline. It sits between the LLM's tool call and the graph write. Its job is to reject low-confidence edges before they corrupt the DAG, and route them to a human review queue instead. This is not a configuration option—it is implemented as a hard code path.

confidence_gate.py · Python
from enum import Enum

class GateOutcome(Enum):
    COMMIT       = "commit"         # write to graph immediately
    REVIEW_QUEUE = "review_queue"   # paralegal review before graph write
    REJECT       = "reject"         # discard — too low to queue

# Thresholds differ by edge type — TERMINATES is irreversible, so threshold is higher
CONFIDENCE_THRESHOLDS = {
    "AMENDS":      { "commit": 0.88, "review": 0.60 },
    "SUPPLEMENTS": { "commit": 0.85, "review": 0.55 },
    "CHILD_OF":    { "commit": 0.80, "review": 0.50 },
    "TERMINATES":  { "commit": 0.95, "review": 0.75 },  # strictest
}

def gate_edge(edge_type: str, confidence: float, derivation: str) -> GateOutcome:
    thresholds = CONFIDENCE_THRESHOLDS[edge_type]

    # Explicit citations get a 0.05 bonus — citation_text is hard evidence
    if derivation == "EXPLICIT_CITATION":
        confidence = min(1.0, confidence + 0.05)

    if confidence >= thresholds["commit"]:
        return GateOutcome.COMMIT
    elif confidence >= thresholds["review"]:
        return GateOutcome.REVIEW_QUEUE
    else:
        return GateOutcome.REJECT

def process_tool_call(tool_input: dict, graph, review_queue):
    outcome = gate_edge(
        tool_input["edge_type"],
        tool_input["confidence"],
        "EXPLICIT_CITATION" if tool_input.get("citation_text") else "SEMANTIC_SIMILARITY"
    )

    if outcome == GateOutcome.COMMIT:
        graph.add_edge(**tool_input)
        return { "status": "committed", "edge_id": graph.last_edge_id }

    elif outcome == GateOutcome.REVIEW_QUEUE:
        review_queue.enqueue({
            "edge_candidate": tool_input,
            "reason": f"confidence {tool_input['confidence']:.2f} below commit threshold",
            "priority": "HIGH" if tool_input["edge_type"] == "TERMINATES" else "NORMAL"
        })
        return { "status": "queued_for_review" }

    else:
        return { "status": "rejected", "reason": "confidence below minimum threshold" }

Step 6: Embedding Pass for Implicit References

Not all lineage relationships are stated explicitly. An amendment may change payment terms without citing the original clause by name. The embedding pass surfaces these candidates using cosine similarity between clause vectors, scoped to matching clause types to reduce false positives.

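embedding_candidate_search.py · Python
from __future__ import annotations  # ClauseNode and ClauseStore are defined elsewhere in the pipeline

import numpy as np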

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_implicit_candidates(
    new_clause: ClauseNode,
    clause_store: ClauseStore,
    similarity_threshold: float = 0.82,
    max_candidates: int = 5
) -> list[dict]:
    """
    For a newly extracted clause with no explicit citation, find existing
    clauses in the graph that it may implicitly amend or supplement.
    Scoped to matching clause_type to reduce cross-domain false positives.
    """
    candidates = []

    # Only compare against clauses of the same legal type
    same_type_clauses = clause_store.get_by_type(new_clause.clause_type)

    for existing in same_type_clauses:
        # Skip clauses from the same document — intra-doc hierarchy handled separately
        if existing.document_id == new_clause.document_id:
            continue

        sim = cosine_similarity(new_clause.embedding, existing.embedding)

        if sim >= similarity_threshold:
            candidates.append({
                "target_clause_id": existing.clause_id,
                "similarity":       sim,
                "target_doc_date":  existing.document_date,
                "needs_llm_confirm": True   # always confirm before edge write
            })

    # Return top-N by similarity, but only candidates older than new clause
    candidates = [c for c in candidates
                  if c["target_doc_date"] < new_clause.document_date]
    candidates.sort(key=lambda x: x["similarity"], reverse=True)
    return candidates[:max_candidates]

Each candidate from this pass is then fed into the pairwise edge classification prompt above. The embedding score becomes a prior—a candidate with similarity 0.93 that the LLM also classifies as AMENDS at 0.91 confidence gets committed. A candidate with similarity 0.84 that the LLM rates 0.61 gets queued. This two-stage filter—embedding similarity then LLM confirmation—prevents single-model failures from writing bad edges.

Step 7: Graph-Augmented RAG for Query Time

The counterpart to edge-writing is query-time retrieval. The naive approach—embedding the user's question and doing nearest-neighbor search across all clause vectors—returns semantically similar text but has no concept of which version of a clause is currently effective. A clause that was amended two years ago is still in the vector index, and its embedding may score higher than the amendment that replaced it.

Graph-augmented RAG fixes this by using the lineage graph as the retrieval layer, not the vector index. The query first resolves the effective clause through the DAG, then feeds only the resolved text to the LLM context window.

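graph_rag_query.py · Python
from __future__ import annotations  # LineageGraph is defined elsewhere in the pipeline

from datetime import datetime

def answer_legal_query(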
    natural_language_query: str,
    counterparty_id: str,
    as_of: datetime,
    llm_client,
    graph: LineageGraph
) -> dict:
    """
    Graph-augmented RAG: resolve effective clauses through the DAG first,
    then build a minimal context window for the LLM. Never retrieves
    superseded or terminated clauses.
    """

    # 1. Classify which clause types the query is about
    #    (a small fast classifier call — not the full graph query)
    relevant_types = classify_query_intent(natural_language_query)
    # e.g. → ["LIMITATION_OF_LIABILITY", "INDEMNIFICATION"]

    # 2. Fetch all documents for this counterparty
    doc_ids = graph.get_documents_by_counterparty(counterparty_id)

    # 3. For each relevant clause type, resolve the effective clause
    #    through the lineage graph — NOT the vector index
    resolved_clauses = []
    for clause_type in relevant_types:
        for doc_id in doc_ids:
            effective = resolve_with_inheritance(doc_id, clause_type, as_of)
            if effective:
                resolved_clauses.append({
                    "clause_type":     clause_type,
                    "effective_text":  effective.effective_text,
                    "source_doc":      effective.source_file,
                    "resolution_path": effective.resolution_path,
                    "page":            effective.page_number,
                    "inherited_from":  getattr(effective, "inherited_from", None)
                })

    # 4. Build a minimal, provenance-annotated context for the LLM
    context_blocks = []
    for c in resolved_clauses:
        path_summary = " → ".join([e["clauseId"] for e in c["resolution_path"]])
        context_blocks.append(f"""
[{c['clause_type']}]
Source: {c['source_doc']}, page {c['page']}
Resolution path: {path_summary}
Effective text:
{c['effective_text']}
""")

    context = "\n---\n".join(context_blocks)

    # 5. Single LLM call with resolved, provenance-annotated context
    response = llm_client.complete(
        system="You are a legal analyst. Answer the query using ONLY the clause text provided. "
               "Cite the source document and page number for every claim. "
               "Do not infer or extrapolate beyond the provided text.",
        user=f"Clauses (as of {as_of.date()}, already resolved for amendments):\n\n{context}\n\nQuery: {natural_language_query}"
    )

    return {
        "answer":          response.text,
        "resolved_clauses": resolved_clauses,   # full provenance returned to caller
        "as_of":           as_of.isoformat()
    }

The critical line is step 3: resolve_with_inheritance(doc_id, clause_type, as_of). The LLM never sees superseded clause text. It receives only the output of the deterministic graph traversal, that is, the version that is legally effective on the query date. The LLM's job at step 5 is synthesis and plain-language explanation, not resolution. These are distinct tasks, and conflating them is a common root cause of legal AI errors.
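Step 1 of the function above calls classify_query_intent, which is not defined in this section. The sketch below is a keyword-based stand-in so the flow can be exercised end to end. The clause-type labels follow the example comment in the code; the `INTENT_KEYWORDS` map and its trigger patterns are assumptions for illustration, and a production system would more likely use the small classifier model the comment refers to.

```python
import re

# Hypothetical keyword map; the trigger patterns are illustrative only.
INTENT_KEYWORDS = {
    "LIMITATION_OF_LIABILITY": [r"\bliabilit(y|ies)\b", r"\bcap\b"],
    "INDEMNIFICATION": [r"\bindemnif\w*", r"\bhold harmless\b"],
}


def classify_query_intent(query: str) -> list[str]:
    """Return the clause types whose trigger patterns appear in the query."""
    text = query.lower()
    return [
        clause_type
        for clause_type, patterns in INTENT_KEYWORDS.items()
        if any(re.search(p, text) for p in patterns)
    ]


print(classify_query_intent("What is Acme Corp's liability cap?"))
```

An empty result is a useful signal in its own right: the caller can decline to answer rather than fall back to unfiltered vector search.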

What Breaks Without the Graph Layer

To make the failure modes concrete: consider a portfolio where an MSA liability cap was raised from 1× to 2× annual fees by Amendment #1 in March 2026. Without graph-augmented retrieval, a naive RAG query against the full vector index may score the original MSA clause higher than the amendment—because the original clause contains more canonical liability language and appears in more chunked fragments. The LLM gets stale text, states the wrong cap, and the error is not visible in the output.

// Naive RAG vs. Graph-Augmented RAG — What the LLM Actually Receives
NAIVE RAG (vector similarity only)
  Query: "What is Acme Corp's liability cap?"
  Vector search top-3 results:
    score=0.94 → MSA v3 § 4.2 (original: 1× fees)       ← STALE, superseded March 2026
    score=0.91 → MSA v2 § 4.2 (original: 0.5× fees)     ← STALE, superseded Nov 2025
    score=0.87 → Amendment #1 § 4.2 (current: 2× fees)  ← CORRECT but ranked 3rd
  LLM receives all three.
  Likely returns: "liability cap is 1× annual fees"     ← WRONG

GRAPH-AUGMENTED RAG (lineage traversal first)
  resolve_effective_clause("cl_4.2_limitation_of_liability", as_of=today)
    → DFS walks AMENDS edge from Amendment #1
    → Returns: cl_4.2_amd1 (current: 2× fees, page 3, bbox=[...])
  LLM receives ONLY the resolved effective clause.
  Returns: "liability cap is 2× fees paid in preceding 12 months"  ← CORRECT
    + source: Amendment_1_to_MSA.pdf, page 3
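The same comparison can be reproduced as a toy script. The scores and cap values are taken from the example above; the clause identifiers, the `SUPERSEDED_BY` map, and the assumption that MSA v3 replaced MSA v2 are hypothetical. The `resolve_current` helper is a heavily simplified stand-in for the article's traversal: it ignores the `as_of` date and edge types and simply follows supersession edges to the end of the chain.

```python
# Vector hits from the comparison above: (clause id, similarity, cap text).
HITS = [
    ("msa_v3_4.2", 0.94, "1× fees"),
    ("msa_v2_4.2", 0.91, "0.5× fees"),
    ("amd1_4.2", 0.87, "2× fees"),
]

# Assumed supersession edges: old clause -> clause that replaced it.
SUPERSEDED_BY = {
    "msa_v2_4.2": "msa_v3_4.2",
    "msa_v3_4.2": "amd1_4.2",
}


def resolve_current(clause_id: str) -> str:
    """Follow supersession edges until reaching a clause nothing replaces."""
    seen = set()
    while clause_id in SUPERSEDED_BY:
        if clause_id in seen:
            raise ValueError(f"cycle in supersession edges at {clause_id}")
        seen.add(clause_id)
        clause_id = SUPERSEDED_BY[clause_id]
    return clause_id


def naive_top_hit(hits):
    """Vector-only retrieval: the highest similarity wins, stale or not."""
    return max(hits, key=lambda h: h[1])


def graph_filtered(hits):
    """Map every hit to its current version and drop the duplicates."""
    caps = {cid: cap for cid, _, cap in hits}
    current_ids = {resolve_current(cid) for cid, _, _ in hits}
    return [(cid, caps.get(cid)) for cid in sorted(current_ids)]


print(naive_top_hit(HITS))   # the stale MSA v3 clause ranks first
print(graph_filtered(HITS))  # only the Amendment #1 clause remains
```

All three hits collapse to the same current clause, so the context window shrinks from three conflicting passages to one.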

The design principle: LLMs are good at reading a small, correct context window. They are unreliable at building or traversing a document graph, resolving temporal state, or producing verifiable audit trails. Partition the work accordingly: the DAG owns resolution, the LLM owns synthesis. The failure modes described in this section all trace back to assigning one of these jobs to the wrong layer.

08 / The Bottom Line

A document processing system that stops at OCR or flat extraction has solved the easy problem. The hard problem is the graph: knowing which version of a clause is currently effective across a multi-document portfolio with a history of amendments, and being able to prove at audit time exactly where that clause came from.

The architecture described here—VLM grounding for provenance, DAG modeling for clause relationships, typed edge semantics for resolution, recursive traversal for effective-clause queries, relational storage with graph indexes, and OpenLineage for audit-stream emission—forms a complete, defensible lineage stack.

Auditing Guarantee: Without provenance tracking, every AI-extracted obligation is legally opaque. With it, each obligation resolves to a specific byte range in a hash-verified source file, with a complete chain-of-custody record of every transformation step. That is the difference between automation and evidence.
