Clause Lineage & Linkage: Resolving Parent-Child Relationships and Provenance

When enterprises negotiate contracts, they rarely sign a single isolated file. A commercial relationship evolves into a network: a Master Services Agreement anchors three subsequent amendments, five Statements of Work, and several Service Level Agreements and data processing addenda. Each document can override, inherit, or extend terms from its parents.

Traditional CLM systems treat this portfolio as flat metadata records. That works until a compliance team asks: "What is the effective limitation of liability for Acme Corp right now, accounting for all amendments?" Answering that correctly requires modeling the portfolio as a Directed Acyclic Graph (DAG) and implementing deterministic data lineage at clause resolution time.

The Mesh Viewpoint: Contracts are not text documents—they are structured data programs. The relationships between documents and clauses form a compiler's dependency graph. Lineage tracking enforces the integrity of that program's execution. Every extracted obligation must resolve to a single, verifiable byte offset in a source file.

01 / Visual Grounding & Schema-Constrained Extraction

Before linkage can happen, structured data must be extracted from unstructured sources—PDF, DOCX, scanned images. Flat text parsers discard layout, table borders, and margin annotations, destroying semantic signals that determine clause scope and hierarchy.

Extraction Pipeline

LAYOUT

Spatial Layout Indexing

2D coordinates are preserved alongside token text. Tables, headings, and margin callouts are indexed by page-coordinate regions. The coordinate tuple becomes the permanent address for any fact extracted from that region.
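As a rough illustration of region-based addressing, the sketch below stores a page number and a rectangle and uses them to recover the tokens that belong to a region. The `RegionAnchor` and `Token` types, and the `(x0, y0, x1, y1)` ordering of the rectangle, are assumptions made for the example rather than the article's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionAnchor:
    """Page-coordinate address of a layout region (hypothetical structure)."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Token:
    text: str
    page: int
    x: float  # centre of the token, in page coordinates
    y: float


def tokens_in_region(tokens: list[Token], region: RegionAnchor) -> list[Token]:
    """Return tokens whose centre falls inside the region, top-to-bottom then left-to-right."""
    hits = [t for t in tokens if t.page == region.page and region.contains(t.x, t.y)]
    return sorted(hits, key=lambda t: (t.y, t.x))


tokens = [
    Token("Liability", 14, 130, 460),
    Token("cap", 14, 200, 460),
    Token("Footer", 14, 300, 780),     # same page, outside the region
    Token("Liability", 15, 130, 460),  # same coordinates, different page
]
region = RegionAnchor(page=14, x0=120, y0=450, x1=480, y1=590)
print([t.text for t in tokens_in_region(tokens, region)])  # ['Liability', 'cap']
```

Because the anchor is immutable and independent of the token text, it can be stored alongside any extracted fact and re-resolved later against the same page.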

GRAMMAR

Schema-Constrained Decoding

LLM extraction is forced to conform to a JSON schema during decoding via logit-level grammar masks (e.g., clause_schema_v4.gbnf). This rules out hallucinated keys, malformed types, and syntactically invalid JSON before output enters the pipeline. It constrains structure only; a well-formed value can still be factually wrong.

02 / Building the Lineage Graph: Nodes, Edges, and Types

With clause nodes extracted and anchored, the pipeline constructs the lineage graph. This is a DAG where nodes are ClauseNodes or DocumentNodes, and typed edges encode the legal relationship between them.

Inter-Document DAG

// Inter-Document Linkage — Document-Level DAG

[ doc_msa_acme_v1 ]  ← root instrument
│
├── CHILD_OF ──▶ [ doc_sow_acme_001 ]
│       inherits: liability, IP ownership, dispute resolution
│
├── CHILD_OF ──▶ [ doc_sow_acme_002 ]
│       │
│       └── AMENDS ──▶ [ doc_amd_sow002_delivery ]
│               overrides: delivery date clause § 3.1
│
└── SUPERSEDED_BY ──▶ [ doc_msa_acme_v2 ]
        │   mutates: liability cap, auto-renewal clause
        │
        └── SUPERSEDED_BY ──▶ [ doc_msa_acme_v3 ]  ← effective root

Edge Types and Their Resolution Behavior

The lineage graph is only as precise as its edge vocabulary. Each edge type carries distinct traversal semantics—the resolver's behavior at query time depends entirely on which edge type it is walking.
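One way to make that dependency concrete is a lookup from edge type to resolver action. The edge names below are the ones used in this article; the action labels (`replace`, `merge`, `inherit`, `stop`) and the `action_for` helper are hypothetical, chosen only to paraphrase each edge's meaning.

```python
from enum import Enum


class EdgeType(Enum):
    AMENDS = "AMENDS"
    SUPPLEMENTS = "SUPPLEMENTS"
    CHILD_OF = "CHILD_OF"
    SUPERSEDED_BY = "SUPERSEDED_BY"
    TERMINATES = "TERMINATES"


# Assumed resolver actions per edge type (illustrative, not a specification).
TRAVERSAL_ACTION = {
    EdgeType.AMENDS: "replace",         # newer clause text wins over the target
    EdgeType.SUPERSEDED_BY: "replace",  # newer document version wins
    EdgeType.SUPPLEMENTS: "merge",      # both clauses stay in force
    EdgeType.CHILD_OF: "inherit",       # fall back to the parent when no local clause exists
    EdgeType.TERMINATES: "stop",        # the obligation has ended; nothing is effective
}


def action_for(edge_type: str) -> str:
    """Map an edge label to the resolver action; unknown labels raise ValueError."""
    return TRAVERSAL_ACTION[EdgeType(edge_type)]


print(action_for("AMENDS"))    # replace
print(action_for("CHILD_OF"))  # inherit
```

Keeping the mapping in one table means a new edge type cannot be walked until someone has decided what walking it means.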

How Edges Are Built: Reference Resolution Pass

EXPLICIT

Citation Pattern Matching

Regex + NER detects phrases like "pursuant to the MSA dated…" or "Article 4.2 is hereby amended to read…". These produce high-confidence edges (p > 0.95) with the cited document and section as the edge target.
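The regex half of that pass might look like the sketch below. The pattern and the `find_explicit_citations` helper are assumptions written for this example; they cover only the "Article/Section N of the <Document> dated <Month D, YYYY>" shape and leave out the NER step entirely.

```python
import re

# Hypothetical pattern: "<Article|Section> N.N of the <Document Name> dated <Month D, YYYY>"
SECTION_REF = re.compile(
    r"(?P<section>(?:Article|Section)\s+\d+(?:\.\d+)*)\s+of\s+the\s+"
    r"(?P<doc>[A-Z][A-Za-z ]+?)\s+dated\s+"
    r"(?P<date>[A-Z][a-z]+\s+\d{1,2},\s+\d{4})"
)
# Verbs that signal an override rather than a plain reference (illustrative list).
AMEND_VERB = re.compile(r"\b(?:amended|supersedes|superseded|replaces|revised)\b", re.I)


def find_explicit_citations(text: str) -> list[dict]:
    """Return explicit cross-document citations found in a clause."""
    normalized = " ".join(text.split())  # collapse line wraps before matching
    has_verb = bool(AMEND_VERB.search(normalized))
    return [
        {
            "span": m.group(0),
            "section": m.group("section"),
            "doc_name": m.group("doc"),
            "doc_date": m.group("date"),
            "amending_verb": has_verb,
        }
        for m in SECTION_REF.finditer(normalized)
    ]


clause = (
    "Article 4.2 of this Amendment hereby supersedes and replaces Article 4.2\n"
    "of the Master Services Agreement dated January 15, 2026 in its entirety."
)
for hit in find_explicit_citations(clause):
    print(hit["section"], "|", hit["doc_name"], "|", hit["doc_date"])
```

Note that "Article 4.2 of this Amendment" is skipped because the pattern requires "of the" followed by a dated document name, which keeps self-references out of the candidate set.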

SEMANTIC

Embedding Similarity

For clauses without explicit citations, a dense vector (e.g., from text-embedding-3-large) is generated for each clause and compared only against clauses of the same type. Cosine similarity above a threshold produces a candidate AMENDS edge, which is written only after LLM classification confirms it.

03 / Effective-Clause Resolution: Graph Traversal

With the graph constructed, the core query is: "What is the current effective text of clause X, as of today?" This requires a deterministic traversal that respects edge types and temporal ordering.

Resolution Algorithm Flow

The returned node carries a resolution_path—an ordered list of every edge traversed to reach the effective clause. This path is the audit trail inspectable by compliance teams to explain exactly why a given clause version was returned.
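A minimal sketch of such a traversal follows. It assumes an in-memory edge list and a simple rule (the latest in-force AMENDS or TERMINATES edge pointing at the current clause wins). The `Edge` and `Resolution` types and the three-argument signature are illustrative, not the pipeline's real interface, which takes the graph implicitly.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Edge:
    source: str          # newer clause
    target: str          # older clause it modifies
    edge_type: str       # "AMENDS" or "TERMINATES" in this sketch
    effective_from: date


@dataclass
class Resolution:
    clause_id: Optional[str]  # None when the clause has been terminated
    resolution_path: list = field(default_factory=list)


def resolve_effective_clause(root_id: str, edges: list, as_of: date) -> Resolution:
    """Follow in-force AMENDS/TERMINATES edges from the root to the effective clause."""
    current = root_id
    path = [{"clauseId": root_id, "edgeType": "ROOT"}]
    visited = {root_id}
    while True:
        in_force = [
            e for e in edges
            if e.target == current
            and e.edge_type in ("AMENDS", "TERMINATES")
            and e.effective_from <= as_of
        ]
        if not in_force:
            return Resolution(current, path)
        latest = max(in_force, key=lambda e: e.effective_from)
        if latest.source in visited:
            raise ValueError(f"cycle detected at {latest.source}")
        # Newest edge first, so the path reads effective clause -> ... -> root.
        path.insert(0, {"clauseId": latest.source, "edgeType": latest.edge_type})
        if latest.edge_type == "TERMINATES":
            return Resolution(None, path)
        visited.add(latest.source)
        current = latest.source


edges = [Edge("cl_4.2_amd1", "cl_4.2_limitation_of_liability", "AMENDS", date(2026, 3, 1))]
result = resolve_effective_clause("cl_4.2_limitation_of_liability", edges, date(2026, 5, 25))
print(result.clause_id)        # cl_4.2_amd1
print(result.resolution_path)
```

Running the same query with an `as_of` before the amendment date returns the root clause, which is what makes the traversal usable for point-in-time audits.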

Inheritance Walk for Missing Clauses

When a SOW has no local version of a clause type (e.g., no indemnification clause), the resolver walks CHILD_OF edges upward to the parent MSA. This is prototype-chain resolution: the closest ancestor that has the attribute wins.

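resolve_inherited_clause.py · Python

from datetime import datetime

def resolve_with_inheritance(doc_id: str, clause_type: str, as_of: datetime):
    # `graph` is the shared lineage graph; get_document is an assumed accessor
    doc = graph.get_document(doc_id)
    local = doc.get_clause_by_type(clause_type)
    if local:
        return resolve_effective_clause(local.clause_id, as_of)

    # No local clause: recurse up CHILD_OF edges (depth-first) toward the root
    for parent_doc in graph.get_parents(doc_id, edge_type="CHILD_OF"):
        result = resolve_with_inheritance(parent_doc.doc_id, clause_type, as_of)
        if result:
            # Keep the ancestor that actually holds the clause; do not
            # overwrite it while unwinding through intermediate documents
            if getattr(result, "inherited_from", None) is None:
                result.inherited_from = parent_doc.doc_id
            return result
    return None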

04 / Obligation Extraction and Risk Classification

With structural lineage established, semantic intelligence is layered on top: classifying what each clause obligates, and how that obligation deviates from the company's playbook.

Obligations are classified across four domains. Each obligation record links by clauseId to its parent ClauseNode, preserving the full chain back to the source pixel.

FINANCIAL

Financial Obligations

Payment deadlines, invoice triggers, milestone disbursement, tax responsibilities, late-payment penalties. Numeric fields are extracted as typed values, not strings.

OPERATIONAL

Operational Obligations

Delivery dates, SLA thresholds (uptime, MTTR), reporting cycles. Dates are normalized to ISO 8601; relative deadlines are flagged as unresolved until their trigger event is defined.

REGULATORY

Regulatory & Compliance

GDPR/DPA clauses, data residency, export controls, SOC 2 audit rights, breach notification timelines. Linked to specific regulation identifiers from a compliance taxonomy.

RESTRICTIVE

Restrictive Covenants

Non-competes, non-solicitation, exclusivity zones, IP assignment clauses. Geographic and temporal scope parameters are extracted as structured fields.

Risk Classification: Playbook Deviation Scoring

Low Risk

Standard Deviation

Minor wording variance; standard governing law (Delaware); Net 30 payment terms; liability cap at 1× annual contract value.

Medium Risk

Playbook Out-of-Bounds

Unfavorable governing law; Net 60+ payment terms; liability cap at 1.5× trailing fees; GDPR DPA absent but data transfer implied.

High Risk

Critical Violations

Uncapped IP indemnification; missing Limitation of Liability; unilateral termination with zero notice period.
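The three tiers can be approximated with a small rule table. The sketch below is only that: the `ClauseFacts` fields, the flag names, and the exact cut-offs (any governing law other than Delaware, any cap above 1×) are assumptions filled in around the examples listed above, and a real playbook would define them per counterparty role.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClauseFacts:
    """Typed facts already extracted for one contract (hypothetical shape)."""
    governing_law: Optional[str] = None
    payment_net_days: Optional[int] = None
    liability_cap_multiple: Optional[float] = None
    has_limitation_of_liability: bool = True
    ip_indemnity_capped: bool = True
    termination_notice_days: Optional[int] = None


def classify_risk(f: ClauseFacts) -> tuple[str, list[str]]:
    """Return (tier, flags). Any high flag forces HIGH; otherwise any medium flag forces MEDIUM."""
    high, medium = [], []
    if not f.has_limitation_of_liability:
        high.append("MISSING_LIMITATION_OF_LIABILITY")
    if not f.ip_indemnity_capped:
        high.append("UNCAPPED_IP_INDEMNIFICATION")
    if f.termination_notice_days == 0:
        high.append("ZERO_NOTICE_TERMINATION")
    # Assumption: anything other than the playbook's standard law is out of bounds.
    if f.governing_law is not None and f.governing_law != "Delaware":
        medium.append("NON_STANDARD_GOVERNING_LAW")
    if f.payment_net_days is not None and f.payment_net_days >= 60:
        medium.append("NET_60_PLUS")
    # Assumption: the playbook cap is 1x, so any higher multiple is a deviation.
    if f.liability_cap_multiple is not None and f.liability_cap_multiple > 1.0:
        medium.append("CAP_ABOVE_PLAYBOOK")
    if high:
        return "HIGH", high + medium
    if medium:
        return "MEDIUM", medium
    return "LOW", []


print(classify_risk(ClauseFacts("Delaware", 30, 1.0)))
print(classify_risk(ClauseFacts("Delaware", 60, 1.5)))
print(classify_risk(ClauseFacts(ip_indemnity_capped=False)))
```

Returning the flags alongside the tier keeps the score explainable: a reviewer sees which playbook rule fired, not just a colour.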

05 / Clause Lineage Tracking with OpenLineage

The graph and resolution algorithm answer what the effective clause is. OpenLineage answers how we got there—recording every transformation step as an immutable, inspectable audit stream. Each pipeline stage emits a typed event with custom facets carrying provenance metadata.

Pipeline Stage → OpenLineage Event Map

openlineage_event.json · OpenLineage Spec 2.0

{
  "eventTime": "2026-05-25T14:55:00Z",
  "eventType": "COMPLETE",
  "run": {
    "runId": "4892c90c-60e1-4569-b570-98df872a08cc",
    "facets": {
      "modelContext": { "model": "gemini-1.5-pro", "constraintGrammar": "clause_schema_v4.gbnf" },
      "lineageResolutionPath": {
        "path": [
          { "clauseId": "cl_4.2_amd1",              "edgeType": "AMENDS" },
          { "clauseId": "cl_4.2_limitation_of_liability", "edgeType": "ROOT"   }
        ]
      }
    }
  },
  "inputs": [{ "namespace": "contract-lake", "name": "MSA_Acme_Corp_v3.pdf",
    "facets": { "documentMetadata": { "sha256": "8f309a96e5dbe4889c20a9a1488e4d0c" } }
  }],
  "outputs": [{ "namespace": "obligation-registry", "name": "obligation_nodes",
    "facets": { "clauseProvenance": {
      "clauseId":    "cl_4.2_amd1",
      "pageNumber":  14,
      "boundingBox": [120, 450, 480, 590],
      "confidence":  0.98
    }}
  }]
}

06 / Closing the Loop: Dashboard Click to Source Pixel

The full value of the architecture is realized at query time. When a compliance analyst clicks an obligation and asks "Show me the source," the following sequence executes end-to-end:
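One plausible shape for that sequence, sketched under assumptions: the obligation record carries a `clauseId`, the clause's provenance record carries the file hash, page, and bounding box shown in the OpenLineage event, and the viewer is handed a location only after the file's hash has been re-verified. `ClauseProvenance` and `locate_source` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ClauseProvenance:
    clause_id: str
    source_file: str
    sha256: str
    page_number: int
    bounding_box: tuple


def locate_source(obligation: dict, provenance_by_clause: dict, file_bytes: bytes) -> dict:
    """Resolve an obligation to a page region, refusing if the file no longer matches its hash."""
    prov = provenance_by_clause[obligation["clauseId"]]
    actual = hashlib.sha256(file_bytes).hexdigest()
    if actual != prov.sha256:
        raise ValueError(f"hash mismatch for {prov.source_file}: source file has changed")
    return {"file": prov.source_file, "page": prov.page_number, "bbox": prov.bounding_box}


pdf_bytes = b"%PDF-1.7 placeholder bytes for the example"
store = {
    "cl_4.2_amd1": ClauseProvenance(
        clause_id="cl_4.2_amd1",
        source_file="Amendment_1_to_MSA.pdf",
        sha256=hashlib.sha256(pdf_bytes).hexdigest(),
        page_number=14,
        bounding_box=(120, 450, 480, 590),
    )
}
print(locate_source({"clauseId": "cl_4.2_amd1"}, store, pdf_bytes))
```

The hash check is what turns "here is where we think it came from" into evidence: if the stored file has been altered since extraction, the lookup fails loudly instead of highlighting the wrong region.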

07 / How LLMs Actually Handle Lineage—and Where They Break

The prior sections describe a deterministic pipeline. This section addresses a direct question: what does it look like when an LLM actually performs the reference detection and edge classification steps? We walk through the real prompt structures, the tool-calling payloads, the embedding pass, and the exact failure modes that make a deterministic graph layer non-negotiable beneath them.

Why a Single "Parse This" Prompt Fails

The instinct is to write one prompt that does everything: detect the cross-reference, extract the target, and infer the relationship type. That structure is wrong for three concrete reasons. First, it asks the model to simultaneously perform extraction (find a span) and reasoning (judge legal intent) in a single pass—two tasks with different reliability profiles. Second, the relation_hint field in a single-prompt design asks the model to classify the relationship before it has seen the target clause, which means it is guessing from one side of the relationship. Third, when that combined output flows toward a graph write, a wrong relation inference arrives in the same JSON object as a high-confidence span extraction, and there is no mechanism to decompose the error.

The correct structure is three separate prompts with a deterministic resolution step in between. Each prompt has a single falsifiable output that can be tested independently.

Prompt A — Citation Span Extraction Only

This prompt has one job: find verbatim phrases in the clause that point to an external document or section. It does not infer what the relationship means. The negative-space instructions are as important as the positive ones—without them, the model flags definition references, governing law citations, and self-references as lineage candidates.

prompt_a_citation_span.txt · System + User Prompt

## SYSTEM
Extract cross-document citation spans from the clause below.

A citation span is a verbatim phrase that explicitly names a different contract
document or a numbered section within a different document that this clause
depends on or references.

Do NOT extract:
  - References to defined terms within the same agreement
    (e.g. "as defined in Section 1.2 of this Agreement")
  - References to applicable law or regulation
    (e.g. "pursuant to GDPR Article 17", "under applicable export control laws")
  - Self-references to other sections within the same document
  - Generic incorporation phrases without a named external document
    (e.g. "incorporated herein by reference" with no document named)

Rules:
  - Return the span character-exact — do not paraphrase
  - If a document name AND a section are both present, include both in the span
  - If no qualifying citation exists, return {"citations": []}
  - Return ONLY valid JSON. No explanation, no commentary.

Output schema:
{
  "citations": [
    {
      "span":           string,        // verbatim, character-exact substring
      "doc_name_hint":  string | null, // e.g. "Master Services Agreement"
      "doc_date_hint":  string | null, // e.g. "January 15, 2026" if present
      "section_hint":   string | null  // e.g. "Article 4.2" if present
    }
  ]
}

## USER
Clause (source document: Amendment_1_to_MSA.pdf · section: § 4.2 · executed: 2026-03-01):
"""
Article 4.2 of this Amendment hereby supersedes and replaces Article 4.2
of the Master Services Agreement dated January 15, 2026 in its entirety.
The liability cap shall not exceed two times (2x) the fees paid in the
twelve (12) months preceding the claim.
"""

prompt_a_response.json · LLM Output

{
  "citations": [
    {
      "span":          "Article 4.2 of the Master Services Agreement dated January 15, 2026",
      "doc_name_hint": "Master Services Agreement",
      "doc_date_hint": "January 15, 2026",
      "section_hint":  "Article 4.2"
    }
  ]
}

The span field is validated mechanically: the pipeline confirms it is a literal substring of the source clause text before proceeding. If it is not, the output is rejected as hallucinated—the model invented text that does not exist in the clause. This check costs one string operation and catches a real class of model errors.
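One practical wrinkle: in the example above, the clause wraps after "Article 4.2", so the returned span differs from the source by a single newline and a strictly literal check would reject it. The sketch below shows both variants. Collapsing whitespace before comparing is an assumption about what "verbatim" should tolerate, and `span_is_verbatim` is a hypothetical helper.

```python
import re


def normalize_ws(text: str) -> str:
    """Collapse every run of whitespace (including line wraps) to a single space."""
    return re.sub(r"\s+", " ", text).strip()


def span_is_verbatim(span: str, clause_text: str, normalize: bool = True) -> bool:
    """True if the span occurs in the clause, optionally ignoring whitespace differences."""
    if normalize:
        return normalize_ws(span) in normalize_ws(clause_text)
    return span in clause_text


clause = (
    "Article 4.2 of this Amendment hereby supersedes and replaces Article 4.2\n"
    "of the Master Services Agreement dated January 15, 2026 in its entirety."
)
span = "Article 4.2 of the Master Services Agreement dated January 15, 2026"

print(span_is_verbatim(span, clause))                   # True
print(span_is_verbatim(span, clause, normalize=False))  # False: the newline breaks the match
print(span_is_verbatim("Section 4.2 of the MSA", clause))  # False: paraphrased span
```

Whichever variant is chosen, the same normalization has to be applied when the span is later mapped back to character offsets, otherwise the provenance anchor drifts.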

Deterministic Portfolio Resolution (No LLM)

Prompt A's output feeds a deterministic lookup, not another LLM call. The pipeline matches doc_name_hint and doc_date_hint against the document registry to resolve a specific doc_id, then queries the extracted clause index for a clause matching section_hint within that document. This step produces the target_clause_id needed for the graph edge—without any model inference involved.

portfolio_resolution.py · Python

import logging

log = logging.getLogger(__name__)

def resolve_citation_to_clause(citation: dict, source_text: str,
                               doc_registry, clause_store) -> str | None:
    """
    Resolve a Prompt A citation output to a concrete clauseId.
    Entirely deterministic: no LLM involved.
    source_text is the clause text that was sent to Prompt A.
    """
    # Step 1: validate span exists verbatim in source text (hallucination check)
    if citation["span"] not in source_text:
        log.warning("Span not found in source text, likely hallucinated: %s", citation["span"])
        return None

    # Step 2: match doc_name_hint + doc_date_hint against registry
    candidate_docs = doc_registry.fuzzy_match(
        name=citation["doc_name_hint"],
        date=citation["doc_date_hint"]   # date disambiguates MSA v1 vs v3
    )
    if not candidate_docs:
        return None  # document not in portfolio — flag for review

    target_doc = candidate_docs[0]  # highest-confidence match

    # Step 3: resolve section_hint to a clauseId within the target document
    if citation["section_hint"]:
        clause = clause_store.get_by_section(target_doc.doc_id, citation["section_hint"])
        return clause.clause_id if clause else None

    # No section hint — return doc-level node for inheritance edge
    return target_doc.root_clause_id

Prompt B — Pairwise Edge Classification

Only after portfolio resolution produces a concrete target_clause_id does Prompt B run. This call receives the full text of both clauses—source and target—so the model can actually read both sides of the relationship before classifying it. Asking for relationship type without the target clause in context (as the original single-prompt design did) is the equivalent of asking a lawyer to characterize an amendment without showing them the original contract.

prompt_b_edge_classification.txt · System + User Prompt

## SYSTEM
Classify the legal relationship between two clauses from the same contract
portfolio. You will be given the full text of both clauses.

Relationship definitions:
  AMENDS      – Clause A replaces or modifies specific language in Clause B.
                Partial amendment (changes one term, preserves others) counts.
  SUPPLEMENTS – Clause A adds obligations that do not conflict with Clause B.
                Both remain fully in force after the newer document.
  CHILD_OF    – Clause A is in a document that inherits from Clause B's document
                with no override of this specific clause text.
  TERMINATES  – Clause A explicitly ends or cancels the obligation in Clause B.
  UNRELATED   – Same clause type (e.g. both are payment terms) but govern
                different subjects or counterparties. Do not infer a relationship
                from surface similarity alone.

Output ONLY valid JSON. The "reasoning" field is one sentence max and is
retained for the audit log — it does not affect the graph write.

{
  "edge_type":  "AMENDS" | "SUPPLEMENTS" | "CHILD_OF" | "TERMINATES" | "UNRELATED",
  "confidence": number,  // 0.0–1.0. Be conservative — prefer UNRELATED over
                         // a low-confidence AMENDS that could corrupt the graph.
  "reasoning":  string
}

## USER
Clause A — source (newer):
  Document: SOW_002_Amendment_1.pdf · § 3.1 · executed: 2026-03-15
"""
The Delivery Date defined in Section 3.1 of SOW #2 is hereby revised to
April 30, 2026. All other terms of Section 3.1 remain in full force.
"""

Clause B — target (older):
  Document: SOW_002.pdf · § 3.1 · executed: 2025-11-01
"""
Vendor shall deliver the completed integration no later than March 15, 2026,
subject to the change order process defined in Article 7.
"""

prompt_b_response.json · LLM Output

{
  "edge_type":  "AMENDS",
  "confidence": 0.96,
  "reasoning": "Clause A explicitly revises the delivery date in Clause B while preserving all other § 3.1 terms — a partial amendment pattern."
}

The confidence calibration instruction matters: "prefer UNRELATED over a low-confidence AMENDS." An incorrectly written AMENDS edge causes the graph traversal to return the wrong effective clause. An incorrectly written UNRELATED means a real amendment is missed — bad, but recoverable when a human reviews the queue. The asymmetry in damage severity should be reflected in the prompt's explicit bias instruction, not just the confidence gate threshold.

Prompt C — Obligation Extraction

Obligation extraction is a completely separate LLM call and should never be combined with reference detection. It operates on the resolved effective clause — the output of the graph traversal — not the raw extracted text. This sequencing matters: you want to classify the obligation that is currently in force, not a version that may have been superseded.

prompt_c_obligation_extraction.txt · System + User Prompt

## SYSTEM
Extract structured obligation data from the effective clause text below.
This clause has already been resolved through the lineage graph — it is
the currently active version. Do not reference other versions.

Classify the obligation domain and extract all typed parameter fields.
Dates must be ISO 8601. Monetary amounts must include currency.
For relative deadlines (e.g. "within 30 days of Go-Live"), set
trigger_event to the condition name and absolute_date to null.

Output ONLY valid JSON:
{
  "obligation_domain": "FINANCIAL" | "OPERATIONAL" | "REGULATORY" | "RESTRICTIVE",
  "obligation_type":   string,          // specific type within domain
  "party_obligated":   "VENDOR" | "CLIENT" | "BOTH" | "UNCLEAR",
  "parameters": {
    "amount":          string | null,   // e.g. "2x trailing 12-month fees"
    "absolute_date":   string | null,   // ISO 8601 if determinable
    "trigger_event":   string | null,   // condition for relative deadlines
    "jurisdiction":    string | null,
    "notice_period":   string | null
  },
  "risk_flags": [string]                // e.g. ["UNCAPPED", "MISSING_NOTICE_PERIOD"]
}

## USER
Effective clause (resolved as of 2026-05-25):
  clauseId: cl_4.2_amd1 · source: Amendment_1_to_MSA.pdf · § 4.2
  Resolution path: cl_4.2_amd1 → AMENDS → cl_4.2_msa_v3 (original)
"""
The liability of either party under this Agreement shall not exceed two
times (2x) the fees paid by Client in the twelve (12) months immediately
preceding the claim giving rise to liability.
"""

prompt_c_response.json · LLM Output

{
  "obligation_domain": "FINANCIAL",
  "obligation_type":   "LIMITATION_OF_LIABILITY",
  "party_obligated":   "BOTH",
  "parameters": {
    "amount":        "2x fees paid in preceding 12 months",
    "absolute_date": null,
    "trigger_event": "claim giving rise to liability",
    "jurisdiction":  null,
    "notice_period": null
  },
  "risk_flags": []
}

Note what Prompt C receives in the user message: the resolution path. The model is told this clause superseded an earlier version. This prevents it from flagging the obligation as potentially incomplete because it "only" covers one scenario — it has the context that this is the final, authoritative text for this clause type in this portfolio.

Step 4: Tool-Calling for Graph Writes

The naive approach is to post-process the LLM's JSON output in application code and write edges imperatively. The better pattern is to let the model itself call a typed graph-write tool, making the write operation part of the model's structured output contract. This is supported natively in the OpenAI, Anthropic, and Gemini APIs via function/tool calling.

The model is given a tool definition at inference time. When it determines a lineage edge exists, it emits a tool_use block rather than a text response. The application layer validates the arguments and commits to the graph—or rejects and queues for review.

tool_definition.json · Anthropic tool spec (OpenAI function calling takes the same JSON Schema under parameters instead of input_schema)

{
  "name": "add_lineage_edge",
  "description": "Record a directional lineage edge between two clause nodes in the contract graph. Call this once per detected relationship.",
  "input_schema": {
    "type": "object",
    "properties": {
      "source_clause_id": { "type": "string", "description": "clauseId of the newer/overriding clause" },
      "target_clause_id": { "type": "string", "description": "clauseId of the older/original clause" },
      "edge_type":        { "type": "string", "enum": ["AMENDS","SUPPLEMENTS","CHILD_OF","TERMINATES"] },
      "confidence":       { "type": "number", "minimum": 0, "maximum": 1 },
      "citation_text":    { "type": "string", "description": "Verbatim phrase that establishes the relationship, or empty string if inferred semantically" },
      "effective_from":   { "type": "string", "format": "date" }
    },
    "required": ["source_clause_id", "target_clause_id", "edge_type", "confidence"]
  }
}

The model's tool call output for the amendment example above looks like this:

tool_use_block.json · Model Response (tool_use content block)

{
  "type":  "tool_use",
  "name":  "add_lineage_edge",
  "input": {
    "source_clause_id": "cl_3.1_amd1_delivery_date",
    "target_clause_id": "cl_3.1_sow002_delivery_date",
    "edge_type":        "AMENDS",
    "confidence":       0.96,
    "citation_text":    "The Delivery Date defined in Section 3.1 of SOW #2 is hereby revised",
    "effective_from":   "2026-03-15"
  }
}
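Before a tool call like this reaches the confidence gate, the application layer still has to check the arguments itself, since a declared schema does not guarantee the model's output obeys it. A minimal sketch of that validation follows. The `validate_edge_args` helper and the `known_clause_ids` set are assumptions for illustration; the field names and allowed edge types come from the tool definition above.

```python
from datetime import date

ALLOWED_EDGE_TYPES = {"AMENDS", "SUPPLEMENTS", "CHILD_OF", "TERMINATES"}
REQUIRED_FIELDS = ("source_clause_id", "target_clause_id", "edge_type", "confidence")


def validate_edge_args(args: dict, known_clause_ids: set) -> list:
    """Return a list of validation errors; an empty list means the call may proceed to the gate."""
    errors = [f"missing required field: {k}" for k in REQUIRED_FIELDS if k not in args]
    if errors:
        return errors
    if args["edge_type"] not in ALLOWED_EDGE_TYPES:
        errors.append(f"unknown edge_type: {args['edge_type']}")
    conf = args["confidence"]
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0 <= conf <= 1:
        errors.append("confidence must be a number between 0 and 1")
    for key in ("source_clause_id", "target_clause_id"):
        if args[key] not in known_clause_ids:
            errors.append(f"{key} not found in clause store: {args[key]}")
    if args["source_clause_id"] == args["target_clause_id"]:
        errors.append("edge would point from a clause to itself")
    if "effective_from" in args:
        try:
            date.fromisoformat(args["effective_from"])
        except (TypeError, ValueError):
            errors.append("effective_from is not an ISO 8601 date")
    return errors


call = {
    "source_clause_id": "cl_3.1_amd1_delivery_date",
    "target_clause_id": "cl_3.1_sow002_delivery_date",
    "edge_type": "AMENDS",
    "confidence": 0.96,
    "effective_from": "2026-03-15",
}
known = {"cl_3.1_amd1_delivery_date", "cl_3.1_sow002_delivery_date"}
print(validate_edge_args(call, known))  # []
```

Checking clause IDs against the store matters most: an edge that references an ID the model invented would otherwise be written as a dangling pointer.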

Step 5: The Confidence Gate — Code, Not Policy

The confidence gate is the most important single component in the pipeline. It sits between the LLM's tool call and the graph write. Its job is to reject low-confidence edges before they corrupt the DAG, and route them to a human review queue instead. This is not a configuration option—it is implemented as a hard code path.

confidence_gate.py · Python

from dataclasses import dataclass
from enum import Enum

class GateOutcome(Enum):
    COMMIT       = "commit"         # write to graph immediately
    REVIEW_QUEUE = "review_queue"   # paralegal review before graph write
    REJECT       = "reject"         # discard — too low to queue

# Thresholds differ by edge type — TERMINATES is irreversible, so threshold is higher
CONFIDENCE_THRESHOLDS = {
    "AMENDS":      { "commit": 0.88, "review": 0.60 },
    "SUPPLEMENTS": { "commit": 0.85, "review": 0.55 },
    "CHILD_OF":    { "commit": 0.80, "review": 0.50 },
    "TERMINATES":  { "commit": 0.95, "review": 0.75 },  # strictest
}

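def gate_edge(edge_type: str, confidence: float, derivation: str) -> GateOutcome:
    # Edge types without thresholds (e.g. UNRELATED) never reach the graph
    thresholds = CONFIDENCE_THRESHOLDS.get(edge_type)
    if thresholds is None:
        return GateOutcome.REJECT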

    # Explicit citations get a 0.05 bonus — citation_text is hard evidence
    if derivation == "EXPLICIT_CITATION":
        confidence = min(1.0, confidence + 0.05)

    if confidence >= thresholds["commit"]:
        return GateOutcome.COMMIT
    elif confidence >= thresholds["review"]:
        return GateOutcome.REVIEW_QUEUE
    else:
        return GateOutcome.REJECT

def process_tool_call(tool_input: dict, graph, review_queue):
    outcome = gate_edge(
        tool_input["edge_type"],
        tool_input["confidence"],
        "EXPLICIT_CITATION" if tool_input.get("citation_text") else "SEMANTIC_SIMILARITY"
    )

    if outcome == GateOutcome.COMMIT:
        graph.add_edge(**tool_input)
        return { "status": "committed", "edge_id": graph.last_edge_id }

    elif outcome == GateOutcome.REVIEW_QUEUE:
        review_queue.enqueue({
            "edge_candidate": tool_input,
            "reason": f"confidence {tool_input['confidence']:.2f} below commit threshold",
            "priority": "HIGH" if tool_input["edge_type"] == "TERMINATES" else "NORMAL"
        })
        return { "status": "queued_for_review" }

    else:
        return { "status": "rejected", "reason": "confidence below minimum threshold" }

Step 6: Embedding Pass for Implicit References

Not all lineage relationships are stated explicitly. An amendment may change payment terms without citing the original clause by name. The embedding pass surfaces these candidates using cosine similarity between clause vectors, scoped to matching clause types to reduce false positives.

embedding_candidate_search.py · Python

from __future__ import annotations  # ClauseNode / ClauseStore are defined elsewhere

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def find_implicit_candidates(
    new_clause: ClauseNode,
    clause_store: ClauseStore,
    similarity_threshold: float = 0.82,
    max_candidates: int = 5
) -> list[dict]:
    """
    For a newly extracted clause with no explicit citation, find existing
    clauses in the graph that it may implicitly amend or supplement.
    Scoped to matching clause_type to reduce cross-domain false positives.
    """
    candidates = []

    # Only compare against clauses of the same legal type
    same_type_clauses = clause_store.get_by_type(new_clause.clause_type)

    for existing in same_type_clauses:
        # Skip clauses from the same document — intra-doc hierarchy handled separately
        if existing.document_id == new_clause.document_id:
            continue

        sim = cosine_similarity(new_clause.embedding, existing.embedding)

        if sim >= similarity_threshold:
            candidates.append({
                "target_clause_id": existing.clause_id,
                "similarity":       sim,
                "target_doc_date":  existing.document_date,
                "needs_llm_confirm": True   # always confirm before edge write
            })

    # Return top-N by similarity, but only candidates older than new clause
    candidates = [c for c in candidates
                  if c["target_doc_date"] < new_clause.document_date]
    candidates.sort(key=lambda x: x["similarity"], reverse=True)
    return candidates[:max_candidates]

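Each candidate from this pass is then fed into the pairwise edge classification prompt above. The embedding score only nominates candidates; the commit decision rests on the LLM's confidence passing the gate. A candidate with similarity 0.93 that the LLM also classifies as AMENDS at 0.91 confidence is committed, because 0.91 clears the 0.88 commit threshold. A candidate with similarity 0.84 that the LLM rates 0.61 is queued for review, because 0.61 sits between the 0.60 review threshold and the commit threshold. This two-stage filter, embedding similarity followed by LLM confirmation, keeps a failure in either model alone from writing a bad edge.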

Step 7: Graph-Augmented RAG for Query Time

The counterpart to edge-writing is query-time retrieval. The naive approach—embedding the user's question and doing nearest-neighbor search across all clause vectors—returns semantically similar text but has no concept of which version of a clause is currently effective. A clause that was amended two years ago is still in the vector index, and its embedding may score higher than the amendment that replaced it.

Graph-augmented RAG fixes this by using the lineage graph as the retrieval layer, not the vector index. The query first resolves the effective clause through the DAG, then feeds only the resolved text to the LLM context window.

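graph_rag_query.py · Python

from __future__ import annotations  # LineageGraph is defined elsewhere
from datetime import datetime

def answer_legal_query(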
    natural_language_query: str,
    counterparty_id: str,
    as_of: datetime,
    llm_client,
    graph: LineageGraph
) -> dict:
    """
    Graph-augmented RAG: resolve effective clauses through the DAG first,
    then build a minimal context window for the LLM. Never retrieves
    superseded or terminated clauses.
    """

    # 1. Classify which clause types the query is about
    #    (a small fast classifier call — not the full graph query)
    relevant_types = classify_query_intent(natural_language_query)
    # e.g. → ["LIMITATION_OF_LIABILITY", "INDEMNIFICATION"]

    # 2. Fetch all documents for this counterparty
    doc_ids = graph.get_documents_by_counterparty(counterparty_id)

    # 3. For each relevant clause type, resolve the effective clause
    #    through the lineage graph — NOT the vector index
    resolved_clauses = []
    for clause_type in relevant_types:
        for doc_id in doc_ids:
            effective = resolve_with_inheritance(doc_id, clause_type, as_of)
            if effective:
                resolved_clauses.append({
                    "clause_type":     clause_type,
                    "effective_text":  effective.effective_text,
                    "source_doc":      effective.source_file,
                    "resolution_path": effective.resolution_path,
                    "page":            effective.page_number,
                    "inherited_from":  getattr(effective, "inherited_from", None)
                })

    # 4. Build a minimal, provenance-annotated context for the LLM
    context_blocks = []
    for c in resolved_clauses:
        path_summary = " → ".join([e["clauseId"] for e in c["resolution_path"]])
        context_blocks.append(f"""
[{c['clause_type']}]
Source: {c['source_doc']}, page {c['page']}
Resolution path: {path_summary}
Effective text:
{c['effective_text']}
""")

    context = "\n---\n".join(context_blocks)

    # 5. Single LLM call with resolved, provenance-annotated context
    response = llm_client.complete(
        system="You are a legal analyst. Answer the query using ONLY the clause text provided. "
               "Cite the source document and page number for every claim. "
               "Do not infer or extrapolate beyond the provided text.",
        user=f"Clauses (as of {as_of.date()}, already resolved for amendments):\n\n{context}\n\nQuery: {natural_language_query}"
    )

    return {
        "answer":          response.text,
        "resolved_clauses": resolved_clauses,   # full provenance returned to caller
        "as_of":           as_of.isoformat()
    }

The critical line is step 3: resolve_with_inheritance(doc_id, clause_type, as_of). The LLM never sees superseded clause text. It receives only the output of the deterministic graph traversal—the version that is legally effective on the query date. The LLM's job at step 5 is synthesis and plain-language explanation, not resolution. These are distinct tasks and conflating them is the root cause of most legal AI errors.

What Breaks Without the Graph Layer

To make the failure modes concrete: consider a portfolio where an MSA liability cap was raised from 1× to 2× annual fees by Amendment #1 in March 2026. Without graph-augmented retrieval, a naive RAG query against the full vector index may score the original MSA clause higher than the amendment—because the original clause contains more canonical liability language and appears in more chunked fragments. The LLM gets stale text, states the wrong cap, and the error is not visible in the output.

// Naive RAG vs. Graph-Augmented RAG — What the LLM Actually Receives

NAIVE RAG (vector similarity only)
Query: "What is Acme Corp's liability cap?"
Vector search top-3 results:
  score=0.94 → MSA v3 § 4.2 (original: 1× fees)       ← STALE (superseded March 2026)
  score=0.91 → MSA v2 § 4.2 (original: 0.5× fees)     ← STALE (superseded Nov 2025)
  score=0.87 → Amendment #1 § 4.2 (current: 2× fees)  ← CORRECT but ranked 3rd
LLM receives all three.
Likely returns: "liability cap is 1× annual fees"  ← WRONG

GRAPH-AUGMENTED RAG (lineage traversal first)
resolve_effective_clause("cl_4.2_limitation_of_liability", as_of=today)
  → DFS walks AMENDS edge from Amendment #1
  → Returns: cl_4.2_amd1 (current: 2× fees, page 3, bbox=[...])
LLM receives ONLY the resolved effective clause.
Returns: "liability cap is 2× fees paid in preceding 12 months"  ← CORRECT
  + source: Amendment_1_to_MSA.pdf, page 3

The design principle: LLMs are good at reading a small, correct context window. They are unreliable at building or traversing a document graph, resolving temporal state, or producing verifiable audit trails. Partition the work accordingly: the DAG owns resolution, the LLM owns synthesis. The failure modes described in this section all trace back to assigning one of these jobs to the wrong layer.

08 / The Bottom Line

A document processing system that stops at OCR or flat extraction has solved the easy problem. The hard problem is the graph: knowing which version of a clause is currently effective across a multi-document portfolio with a history of amendments, and being able to prove at audit time exactly where that clause came from.

The architecture described here—VLM grounding for provenance, DAG modeling for clause relationships, typed edge semantics for resolution, recursive traversal for effective-clause queries, relational storage with graph indexes, and OpenLineage for audit-stream emission—forms a complete, defensible lineage stack.

Auditing Guarantee: Without provenance tracking, every AI-extracted obligation is legally opaque. With it, each obligation resolves to a specific byte range in a hash-verified source file, with a complete chain-of-custody record of every transformation step. That is the difference between automation and evidence.
