| Internet-Draft | Substrate Provenance Grammar | May 2026 |
| Morrison | Expires 29 November 2026 | [Page] |
This memo describes a wire-level annotation grammar by which a large-language-model output may carry, at emission and at the granularity of an individual assertion, a provenance label drawn from a closed enumerated vocabulary of substrate-class identifiers. The memo defines the closed vocabulary, the per-assertion attachment form, the admissibility discipline a relying party MAY apply to the labels, and two terminal output states (UNVERIFIED-INFERENCE and DECAYED-TO-UNCERTAINTY) equal-rank with assertion and denial. The memo does not say what an inference system MUST do. It defines the wire grammar by which a relying party may inspect what the inference system DID with respect to the substrates it consulted. The memo is Informational.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 29 November 2026.¶
Copyright (c) 2026 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.¶
When a large-language-model output is consumed by another agent, by a downstream automation, or by a relying party with action authority, the consuming entity has no wire-level mechanism to distinguish three cases: (i) the model emitted the assertion from training-corpus-resident patterns without consulting any external substrate; (ii) the model emitted the assertion after consulting an external substrate whose state corroborated the assertion within an admissibility window; (iii) the model emitted the assertion after consulting an external substrate whose state did not corroborate the assertion, and the model proceeded anyway.¶
Existing approaches to this problem operate at the prose layer: post-hoc citation insertion by a separate retrieval orchestrator, natural-language hedge phrasing ("I believe", "it appears", "according to"), per-paragraph confidence scores rendered as adjectives, or refusal to answer. All four are parsed from the surface form rather than carried as a distinct output element. All four can be defeated by a model trained to substitute hedge phrasing for substrate consultation.¶
This memo defines a wire-level grammar by which the inference system declares, at the granularity of an individual assertion within its output, which substrate-class (if any) corroborated the assertion at emission. The grammar is closed-vocabulary, finite, and version-anchored. The consuming entity parses the annotation without interpreting prose. Two terminal annotations, UNVERIFIED-INFERENCE and DECAYED-TO-UNCERTAINTY, are equal-rank with assertion and denial. Each is a distinct output state, not a confidence score and not a hedge phrase.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The following terms are defined for the purposes of this document:¶
The vocabulary defined by this memo is closed, finite, and version-anchored. An implementation parsing a provenance annotation MUST recognise the annotation if and only if the substrate-class identifier appears in the version of the vocabulary the implementation has loaded. Unknown substrate-class identifiers MUST NOT be silently interpreted; an implementation that encounters one MUST treat the annotation as if it were UNVERIFIED-INFERENCE (Section 5).¶
The version-anchor scheme used by this memo is a dotted
major.minor pair appearing as a leading element of the
substrate-class identifier. The vocabulary defined in this
revision of the memo carries the version anchor 1.0.
Future revisions of this memo MAY add substrate-classes; addition
is a minor-version bump. Future revisions MUST NOT remove
substrate-classes without a major-version bump.¶
The version-1.0 closed substrate-class vocabulary:¶
substrate.git.log: git log on a repository observable
to the relying party, with the assertion's content appearing
within a commit reachable from a named reference.
Compute-location: the relying party's local working tree or a
trusted mirror.¶
substrate.grep: grep over a file-set observable to
the relying party, with the assertion's content appearing in a
named region of a named file. Compute-location: the relying
party's local file system.¶
substrate.code.read
substrate.fs.mtime
substrate.mcp.brief
substrate.do.sse-count
substrate.unix.peercred
unverified-inference
The eight identifiers above constitute the entirety of the version-1.0 closed vocabulary.¶
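The following sketch is illustrative and not normative. It shows one way a relying party's parser might load the version-1.0 vocabulary and apply the rule that an unrecognised identifier is treated as unverified-inference. The names VOCABULARY_1_0 and resolve_substrate_class are hypothetical, and exact, case-sensitive string comparison is an assumption; this memo does not state a comparison rule.

```python
from typing import FrozenSet

# The eight version-1.0 identifiers listed in this section.
VOCABULARY_1_0: FrozenSet[str] = frozenset({
    "substrate.git.log",
    "substrate.grep",
    "substrate.code.read",
    "substrate.fs.mtime",
    "substrate.mcp.brief",
    "substrate.do.sse-count",
    "substrate.unix.peercred",
    "unverified-inference",
})

UNVERIFIED_INFERENCE = "unverified-inference"


def resolve_substrate_class(
    identifier: str,
    vocabulary: FrozenSet[str] = VOCABULARY_1_0,
) -> str:
    """Return the identifier if the loaded vocabulary contains it.

    Any other identifier is mapped to unverified-inference rather than
    being silently interpreted. Comparison is exact and case-sensitive,
    which is an assumption of this sketch.
    """
    if identifier in vocabulary:
        return identifier
    return UNVERIFIED_INFERENCE
```

A parser built this way fails closed: an identifier introduced by a later vocabulary version is never admitted as corroboration until the relying party loads that version.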
Two annotation values are terminal: they signal a distinct output state of the inference system, equal-rank with assertion and denial, rather than corroboration by any substrate.¶
unverified-inference
decayed-to-uncertainty
Both terminal annotations are first-class output tokens. A relying party parsing the wire output observes them in the same structural slot in which substrate-class identifiers appear; the parser applies the terminal-annotation disposition without inspecting prose. The two terminal annotations are distinct from refusal to answer, from explicit denial, and from the absence of annotation.¶
A provenance annotation is attached to an individual assertion in the inference system's output. This memo describes the abstract attachment relationship; the concrete wire encoding is the implementation's choice and is not normative herein.¶
The annotation form is the tuple:¶
(assertion-span, substrate-class-identifier, observation-id?, ts?)¶
where:¶
Two concrete encodings are illustrative and not normative:¶
JSON-structured-output encoding per [RFC8259]:¶
{
  "assertion": "CHANGELOG.md contains an entry dated 2026-05-25",
  "provenance": {
    "substrate_class": "substrate.code.read",
    "observation_id": "sha256:e3b0c4...",
    "ts": "2026-05-28T11:40:00Z"
  }
}
¶
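As a non-normative sketch, a relying party might decode the JSON encoding above into the abstract annotation tuple as follows. The Annotation class and annotation_from_json function are hypothetical names introduced for illustration; the key names are taken from the example above, and the elided observation-id value is carried through unchanged.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """The abstract tuple; observation_id and ts are optional."""
    assertion_span: str
    substrate_class: str
    observation_id: Optional[str] = None
    ts: Optional[str] = None


def annotation_from_json(text: str) -> Annotation:
    """Decode one JSON-encoded assertion into an Annotation."""
    obj = json.loads(text)
    prov = obj["provenance"]
    return Annotation(
        assertion_span=obj["assertion"],
        substrate_class=prov["substrate_class"],
        observation_id=prov.get("observation_id"),  # optional field
        ts=prov.get("ts"),                          # optional field
    )


SAMPLE = """
{
  "assertion": "CHANGELOG.md contains an entry dated 2026-05-25",
  "provenance": {
    "substrate_class": "substrate.code.read",
    "observation_id": "sha256:e3b0c4...",
    "ts": "2026-05-28T11:40:00Z"
  }
}
"""

annotation = annotation_from_json(SAMPLE)
```

The decoder reads only structured fields; it never inspects the assertion text to infer provenance.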
In-line bracketed annotation, for free-text outputs:¶
The file CHANGELOG.md contains an entry dated 2026-05-25. [substrate.code.read; observation-id=sha256:e3b0c4...; ts=2026-05-28T11:40:00Z]¶
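The in-line form can be extracted without interpreting the surrounding prose. The sketch below is illustrative and not normative: the function name parse_inline, the regular expression, and the assumption that identifiers consist of lowercase letters, digits, dots, and hyphens are all introduced here and are not defined by this memo. The sketch extracts the bracketed annotations only; it does not attempt to recover which span of text each annotation attaches to.

```python
import re
from typing import Dict, List, Tuple

# "[identifier; key=value; key=value]" with the key=value part optional.
# The identifier character set is an assumption of this sketch.
_BRACKET = re.compile(r"\[\s*([a-z0-9.\-]+)\s*((?:;[^\]]*)?)\]")


def parse_inline(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Return (substrate-class identifier, fields) for each bracket found."""
    results = []
    for match in _BRACKET.finditer(text):
        substrate_class = match.group(1)
        fields: Dict[str, str] = {}
        for part in match.group(2).split(";"):
            key, sep, value = part.partition("=")
            if sep:  # skip empty fragments that carry no key=value pair
                fields[key.strip()] = value.strip()
        results.append((substrate_class, fields))
    return results


SAMPLE = (
    "The file CHANGELOG.md contains an entry dated 2026-05-25. "
    "[substrate.code.read; observation-id=sha256:e3b0c4...; "
    "ts=2026-05-28T11:40:00Z]"
)

parsed = parse_inline(SAMPLE)
```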
A relying party MAY apply a cardinality-thresholded admissibility discipline to inference system output annotated under this grammar. The discipline is parameterised by:¶
Reference values:¶
The relying party applies the discipline by counting, for each
assertion in the inference system's output, the distinct
substrate-class identifiers appearing in the assertion's provenance
annotations whose ts field is within W. Assertions not
meeting the cardinality floor are not admitted; assertions
annotated with unverified-inference or
decayed-to-uncertainty are not admitted by virtue of the
terminal annotation itself.¶
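The counting step above can be sketched as follows; the sketch is illustrative and not normative. The function name is_admissible is hypothetical. Three points are assumptions of the sketch rather than statements of this memo: W is interpreted as a duration measured back from the relying party's evaluation time, annotations lacking a ts field do not count toward the floor, and identifiers have already been checked against the loaded vocabulary.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

TERMINAL = frozenset({"unverified-inference", "decayed-to-uncertainty"})


def is_admissible(
    annotations: Iterable[Tuple[str, Optional[str]]],
    k: int,
    window: timedelta,
    now: datetime,
) -> bool:
    """Apply the cardinality floor k and window W to one assertion.

    annotations holds (substrate-class identifier, ts) pairs attached to
    a single assertion; now must be a timezone-aware datetime.
    """
    fresh_classes = set()
    for substrate_class, ts in annotations:
        if substrate_class in TERMINAL:
            return False  # not admitted by virtue of the terminal annotation
        if ts is None:
            continue  # assumption: no ts means not countable within W
        # Replace a trailing "Z" so older Python versions can parse it.
        observed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
        if timedelta(0) <= now - observed <= window:
            fresh_classes.add(substrate_class)
    # Distinct substrate-classes are counted, not individual annotations.
    return len(fresh_classes) >= k


now = datetime(2026, 5, 28, 12, 0, tzinfo=timezone.utc)
annotations = [
    ("substrate.code.read", "2026-05-28T11:40:00Z"),
    ("substrate.grep", "2026-05-28T11:50:00Z"),
]
admitted = is_admissible(annotations, k=2, window=timedelta(hours=1), now=now)
```

Because the set holds identifiers rather than annotations, two observations from the same substrate-class contribute one unit toward the floor.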
This memo does not specify what the relying party MUST do with an inadmissible assertion. Common dispositions include: discarding the assertion silently, surfacing it to a human reviewer, requesting re-emission from the inference system, or substituting an explicit refusal in the relying party's own output to the next consuming entity. Each disposition is the relying party's own policy choice and is not constrained by this memo.¶
A reader may ask why the substrate-class vocabulary is closed rather than open-extensible.¶
An open-extensible vocabulary would permit any inference system to introduce a new substrate-class identifier and emit corroboration annotations under it. A relying party encountering an unrecognised identifier would face a choice: trust the new identifier on its face, refuse the assertion, or treat the identifier as equivalent to a fallback known identifier. Each choice is inferior to the closed-vocabulary posture of this memo:¶
The closed-vocabulary posture treats unrecognised identifiers
as unverified-inference (Section 5). This
preserves the admissibility discipline under vocabulary drift
while leaving the relying party free to upgrade its parser to a
newer vocabulary version.¶
The contribution of this memo is the joint articulation of a closed substrate-class vocabulary, per-assertion attachment, and two terminal annotations as a single wire-level grammar. Each adjacent prior-art family is distinct from this grammar in at least one of the three components.¶
Retrieval-Augmented Generation performs retrieval against an external corpus and conditions generation on the retrieved context. RAG is an inference-system architecture; this memo describes an output-side annotation grammar. A RAG-architected inference system MAY emit annotations under this grammar; a non-RAG inference system MAY emit annotations under this grammar. The grammar is orthogonal to the architecture.¶
Constitutional AI and self-critique architectures apply a second model pass to evaluate the first pass's output against a specification. The output of such a system is not annotated at the granularity of an individual assertion against an external substrate-class; it is annotated, if at all, with a critique-pass verdict against a specification authored by the model vendor. Annotation against a vendor-authored specification differs in kind from annotation against a substrate observable to a relying party.¶
Multi-agent debate and related multi-pass deliberation architectures produce a single consensus output from multiple agent passes. The output's provenance is the agents' agreement process, not a substrate observable to a relying party.¶
Hidden-state probing inspects the inference system's internal activations to estimate the system's own confidence in its output. Confidence is a property of the inference system; substrate-class corroboration is a property of the relying party's own observation surface. The two are categorically different information sources.¶
Cryptographically-anchored append-only logs (Certificate Transparency [RFC6962], trusted timestamping per [RFC3161]) are candidate corroborating substrates under this grammar; each is a substrate-class a future revision of the vocabulary MAY add. A chained log considered in isolation is not a per-assertion annotation grammar.¶
This memo requires no IANA actions in its present revision. A future revision may request establishment of an IANA registry for substrate-class identifiers, governed by the closed-vocabulary discipline of Section 4 and Section 8.¶
The grammar specified by this memo surfaces three classes of attack absent from prose-only or hedge-phrasing approaches. The mitigations described below are operational rather than wire-level; this memo defines the grammar only, and an implementation's operational posture is its own.¶
An inference system may emit a provenance annotation citing a substrate-class corroboration that did not occur. The grammar specified by this memo provides no cryptographic binding between the annotation and any observed substrate state. A relying party MUST NOT treat the annotation as evidence of corroboration; the annotation is a declaration of the inference system's claim about its own behaviour, which the relying party MAY independently verify by re-querying the named substrate with the optional observation-id. Cryptographic binding of annotations to observed substrate state is out of scope for this memo and is the subject of separate work.¶
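Independent verification can be sketched under a strong assumption: that an observation-id of the form "sha256:" followed by a hexadecimal digest denotes the SHA-256 hash of the content the inference system claims to have read. This memo does not define that convention, and the function name verify_sha256_observation is hypothetical. The sketch is illustrative and not normative.

```python
import hashlib


def verify_sha256_observation(content: bytes, observation_id: str) -> bool:
    """Check re-read substrate content against a claimed observation-id.

    Assumes (not specified by the memo) that the observation-id has the
    form "sha256:<hex digest of the content>".
    """
    scheme, sep, digest = observation_id.partition(":")
    if not sep or scheme != "sha256":
        return False  # unknown scheme: nothing can be verified
    actual = hashlib.sha256(content).hexdigest()
    return actual == digest.lower()
```

A relying party would obtain the content by re-querying the named substrate itself, never by accepting content supplied alongside the annotation.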
An inference system implemented against a newer version of
the vocabulary may emit substrate-class identifiers not present
in a relying party's older vocabulary. Per Section 4,
the relying party treats unrecognised identifiers as
unverified-inference. This is a fail-closed posture and
is correct. A relying party operating at a substantially older
vocabulary version SHOULD upgrade its parser to the current
published version of this memo.¶
An adversary controlling a substrate identified in the vocabulary may engineer the substrate's state to corroborate assertions of the adversary's choice. The admissibility discipline of Section 7, with k = 2 or k = 3, mitigates this by requiring corroboration from substrate-class-distinct sources before admission. An adversary controlling all k substrate-classes can defeat the discipline; selection of independent substrate-classes is the relying party's operational responsibility and is not specified by this memo.¶
Provenance annotations expose to consumers of the inference system's output the categories of substrate the inference system consulted. In typical deployments, the substrate-class identifier is a category, not a record; the per-record observation-id, if emitted, may carry the privacy properties of the underlying substrate (a filesystem path, a content hash, a peer credential identifier).¶
A relying party emitting annotations under this grammar to a further downstream consumer SHOULD apply standard identity-binding hygiene to substrate observables: pseudonymous tier observations need not carry strong identifiers; identity-bound tier observations carry the identity-binding strength of the underlying substrate.¶
The grammar specified by this memo does not require, and does not recommend, attachment of identifiers tying the inference system's output to a particular human end-user. Such attachments, if made, are outside the scope of this memo.¶
This memo articulates a grammar layered above the substrate-observation primitive of related work in the morrison-* family on the IETF Datatracker. Its development is the joint product of deployed agentic-system experience and structural analysis of adjacent prior art.¶
The applicant of any patent rights that may be construed to read on the grammar specified by this memo will file an IPR disclosure under the IETF's standard procedures. The applicant's intended licensing posture for any such rights is royalty-free with defensive-termination, consistent with the applicant's published IPR disclosures on companion memos in the morrison-* family.¶