Schemas
Declare a document-structure schema inline on a kind or in a proto.md file, validate headings and front matter, and tighten rule config per section.
A schema describes what a Markdown document’s front matter, filename, and heading tree must look like. Schemas are the engine behind MDS020 required-structure; they are the canonical place to lock down the shape of a recurring document type (plan, RFC, runbook, rule README).
mdsmith reads schemas from three sources:
- Inline: a schema: block on a kind body in .mdsmith.yml. Uses the new matcher engine (regex:, repeat:, \#(digits), \#(fmvar(...))).
- Named YAML: a schema: <name> reference to a .mdsmith/schemas/<name>.yaml file (or an inline schemas: registry entry of the same name). Same matcher engine as the inline source, shared across kinds. See File-based YAML schemas.
- proto.md: a proto.md referenced by rules.required-structure.schema:. MDS020 still validates this source through its legacy parser today; the schema-package parser shipped with plan 156 lifts proto.md into the same matcher shape but is exercised only by tests until the cutover follow-up wires MDS020 through.
A kind may use only one source; setting two is a config error.
# Inline schemas on kinds
Inline schemas keep the structure declaration next to the kind’s other rule settings. They are best for small schemas (one or two screens) that do not need templated body content.

    kinds:
      rfc:
        schema:
          filename: "RFC-[0-9][0-9][0-9][0-9].md"
          frontmatter:
            id: '=~"^RFC-[0-9]{4}$"'
            status: '"draft" | "ratified" | "deprecated"'
            authors: '[...string] & [_, ...string]'
          closed: true
          sections:
            - heading: null
            - heading: "Overview"
            - heading: "Decision"
            - heading:
                regex: '.+'
                repeat: { min: 0 }
            - heading: "References"

The frontmatter: mapping reuses CUE expressions per
key: regex, disjunction, list, and any other CUE form
is accepted. Trailing ? on a key marks it optional.
Named shortcuts —
date, datetime, time, email, url, filename,
nonEmpty — substitute for their canonical CUE so a
schema can write created: date instead of repeating
the ISO regex; see
Schema field types
for the registered names and how they are matched.
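As a small illustration of the optional-key marker and the named shortcuts described above, the fragment below is a sketch, not a canonical schema. The kind name note and the fields title, created, reviewer, and homepage are hypothetical; only the trailing ? and the shortcut names come from the text. Quoting the keys that end in ? is a cautious choice for YAML readers and may not be required.

```yaml
kinds:
  note:                      # hypothetical kind name
    schema:
      frontmatter:
        title: nonEmpty      # shortcut instead of a hand-written CUE expression
        created: date        # shortcut for the ISO date regex
        "reviewer?": email   # trailing ? marks the key optional
        "homepage?": url     # optional as well
```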
filename: is a glob the document basename must
match. It sits at the top of the schema block (no
require: wrapper).
closed: true makes the scope strict — unlisted
headings produce a diagnostic. closed: false (the
default) tolerates unlisted headings between listed
sections. closed: is only meaningful when the
schema declares sections:; setting it on a
frontmatter-only kind is a parse error.
# The heading: discriminator
Every section-array entry sets heading:. The value
takes one of three shapes:
- null: the preamble, meaning content from line 1 up to the first heading. Only valid as the first entry in a section list. Carries closed: / rules: / content: for that range; rejects sections:.
- string: sugar for a literal match. The string is regex-escaped and used as the matcher’s pattern, so heading: "(WIP)" matches a heading whose text is exactly (WIP) with the parens taken literally. Cardinality is one.
- mapping: the full form, { regex, repeat?, sequential? }. regex: is required; the body is a Go RE2 pattern that accepts two interpolation references, \#(digits) and \#(fmvar(name)). repeat: bounds the run; sequential: (with digits) asserts ordering. See the section-schema reference for the full grammar.
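The three shapes can sit side by side in one section list. The fragment below is a minimal sketch; the Appendix pattern is a hypothetical example, not something the schema package ships.

```yaml
sections:
  - heading: null            # preamble: only valid as the first entry
  - heading: "(WIP)"         # string: matches only the heading text "(WIP)"
  - heading:                 # mapping: the full matcher form
      regex: 'Appendix .+'   # hypothetical pattern
      repeat: { min: 0 }     # zero or more such headings
```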
# The matcher mapping

    sections:
      - heading:
          regex: 'Step \#(digits)'
          repeat: { min: 1, max: 5 }
          sequential: true
        sections: [...]
        content: [...]
      - heading:
          regex: '\#(fmvar(id)): \#(fmvar(name))'
      - heading:
          regex: '.+'
          repeat: { min: 0 }

regex: is whole-string anchored against the
heading’s rendered plain text (inline emphasis stripped,
link wrappers unwrapped, code-span backticks dropped).
Backslashes pass through to RE2; interpolation uses
\#(expr). Two helpers are in scope:
- digits: expands to the named capture (?P<n>[0-9]+). One per pattern. With sequential: true the validator asserts the captured numbers are strictly increasing without gaps.
- fmvar(name): looks up frontmatter field name, regex-escapes its value, and substitutes it.
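To make the two helpers concrete, the sketch below pairs each matcher with the heading texts it should accept. The front-matter values are hypothetical, and the accepted and rejected sequences follow from the description above rather than from a recorded mdsmith run.

```yaml
# Front matter of the document being checked (hypothetical values):
#   id: RFC-0042
#   name: "Retry (v2)"
sections:
  - heading:
      # fmvar values are regex-escaped, so the parentheses in the name
      # are taken literally. Expected heading text: "RFC-0042: Retry (v2)"
      regex: '\#(fmvar(id)): \#(fmvar(name))'
  - heading:
      # digits captures the number; sequential: true requires the
      # captured numbers to increase without gaps.
      #   Step 1, Step 2, Step 3  -> accepted
      #   Step 1, Step 3          -> diagnostic (gap)
      regex: 'Step \#(digits)'
      repeat: { min: 1 }
      sequential: true
```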
repeat: bounds how many consecutive matching
headings the matcher claims. Omitting repeat: means
exactly one; { min: 0 } is zero-or-more;
{ min: 0, max: 1 } is optional; { min: 1 } is
one-or-more; bounded forms enforce both bounds.
repeat: { max: 0 } and any repeat: whose min exceeds
its max are both parse errors.
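The cardinalities listed above can be read off a single section list. This is a sketch with hypothetical heading names; the comments restate the rule each entry exercises.

```yaml
sections:
  - heading: "Summary"                 # no repeat: exactly one
  - heading:
      regex: 'Background'
      repeat: { min: 0, max: 1 }       # optional
  - heading:
      regex: 'Option [0-9]+'
      repeat: { min: 1 }               # one or more
  - heading:
      regex: 'Appendix [A-Z]'
      repeat: { min: 1, max: 3 }       # between one and three
```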
The wildcard-slot shape — regex: '.+' with
repeat: { min: 0 } — is positional: it absorbs
zero or more unlisted sections at its slot. A
heading whose text matches a later listed entry is
claimed for that entry, not absorbed by the slot.
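A short sketch of the positional behaviour, using hypothetical heading names. The comment describes the outcome implied by the rule above, not captured tool output.

```yaml
sections:
  - heading: "Overview"
  - heading:              # wildcard slot: absorbs unlisted sections here
      regex: '.+'
      repeat: { min: 0 }
  - heading: "References"
# A document whose H2 sequence is
#   Overview, Background, Alternatives, References
# fits this list: Background and Alternatives fall into the slot, while
# References is claimed by its own entry even though '.+' also matches it.
```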
# Nested sections
Levels come from depth. Root sections: entries are
H2; nested sections: lists are H3, H4, …. A runbook
that wants Diagnosis → Step → Check / Expected
expresses that as:

    sections:
      - heading:
          regex: 'Symptoms|Indicators'
      - heading: "Diagnosis"
        sections:
          - heading: "Step"
            sections:
              - heading: "Check"
              - heading: "Expected"
              - heading:
                  regex: 'If different'
                  repeat: { min: 0, max: 1 }
      - heading:
          regex: 'References'
          repeat: { min: 0, max: 1 }

A scope that accepts alternate heading texts encodes
the disjunction in its regex: regex: 'A|B' matches
a heading whose text is A or B.
# Section content
sections: constrains nested headings. To pin down
what AST nodes must appear inside a section’s body — a
required YAML code block, a settings table with
specific columns — add a content: list alongside the
scope’s existing fields.

    sections:
      - heading: "Examples"
        closed: true
        content:
          - kind: code-block
            lang: yaml
          - kind: unlisted

Each entry sets kind: and a small set of optional
kind-specific fields:

- kind: code-block. lang: constrains the fenced block’s info string (exact match).
- kind: table. columns: is the exact header row the GFM table must carry.
- kind: list. ordered: (true/false), min-items:, max-items: bound the list’s shape.
- kind: paragraph. No extra keys.
- kind: unlisted. A positional slot. Tolerates any non-matching nodes at that position even under closed: true.
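The remaining entry kinds combine in the same way. The fragment below is a sketch: the Settings heading and the column names are hypothetical, and writing columns: as a YAML list of header cells is an assumption about the value shape, which the section-schema reference defines in full.

```yaml
sections:
  - heading: "Settings"
    closed: true
    content:
      - kind: paragraph
      - kind: table
        columns: [Setting, Type, Default]   # hypothetical header row; list form is assumed
      - kind: list
        ordered: false
        min-items: 1
        max-items: 10
      - kind: unlisted                      # tolerate anything else at this position
```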
Entries match in declared order. A node that matches a later listed entry is claimed out-of-order with a diagnostic, the same rule the heading-tree walker uses. Sub-shape mismatches (wrong code-block language, wrong table columns, list violating ordered/min/max) emit their own diagnostics but still consume the slot. Missing required entries anchor at the section’s heading line.
content: is rejected on a slot scope (the
wildcard-slot shape has no fixed identity to
constrain). Set content: only on entries that
match named sections.
# Per-scope rule overrides
Any scope may carry a rules: block. The override
sits on top of the rule’s defaults — keys it sets
replace the defaults wholesale; keys it omits keep
their default value. It applies only inside that
scope’s heading range, so one section can be stricter
than the rest of the document without glob overrides.

    sections:
      - heading: "Decision"
        rules:
          paragraph-readability:
            max-index: 12.0
          max-section-length:
            max-words: 200

The override has two limitations: it is not a config-style deep merge (nested maps and list append modes behave like a plain ApplySettings call), and it stacks on rule defaults rather than the rule’s full per-file config.
If a scope’s rules: block names a rule that does
not exist or supplies settings the rule rejects, the
override surfaces as an MDS020 diagnostic at the
scope’s heading line.
# Content constraints
Five rules ship per-scope prose constraints. Each is
default-disabled and reuses the standard rules:
surface — there is no separate schema vocabulary for
“max words” or “forbidden text”.
| Rule | Setting | Effect |
|---|---|---|
| MDS036 max-section-length | max-words, min-words, max-paragraphs | Cap word counts and paragraph counts in addition to today’s line cap. |
| MDS055 forbidden-paragraph-starts | starts: [str, ...] | Flag paragraphs that begin with any listed prefix. |
| MDS056 forbidden-text | contains: [str, ...] | Flag paragraphs whose text contains any listed substring. |
| MDS057 required-text-patterns | patterns: [{pattern, message, skip-indices}] | Flag a section whose body does not match every configured regex. |
| MDS058 required-mentions | mentions: [str, ...] | Flag a section that does not contain every listed substring. |
Set them under top-level rules: for the whole
document, or under a scope’s rules: block for one
section — scoped to a Diagnosis section here:

    kinds:
      runbook:
        schema:
          sections:
            - heading: "Diagnosis"
              rules:
                forbidden-text:
                  contains: ["should", "may"]
                required-mentions:
                  mentions: ["forward reference"]

MDS057 and MDS058 anchor diagnostics at the section’s heading line. MDS055 and MDS056 anchor at the offending paragraph’s line. In both cases the per-scope filter keeps only diagnostics inside the section’s range, so the same rule code works for document-wide and per-section enforcement.
skip-indices: on MDS057 parses but is inert until
section-content children: ships in a later plan.
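The other three rules in the table attach the same way. The sketch below uses the setting names from the table; every value (word limits, prefixes, the regex, and the message) is a hypothetical example.

```yaml
kinds:
  runbook:
    schema:
      sections:
        - heading: "Diagnosis"
          rules:
            max-section-length:
              min-words: 20
              max-words: 300
              max-paragraphs: 6
            forbidden-paragraph-starts:
              starts: ["Obviously", "Simply"]
            required-text-patterns:
              patterns:
                - pattern: 'exit code [0-9]+'
                  message: "state the expected exit code"
```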
# Cross-references, acronyms, and index
Three top-level schema blocks add document-wide checks and a JSON side-output:

    kinds:
      runbook:
        schema:
          cross-references:
            - pattern: "\\bStep (\\d+)\\b"
              must-match: "Step {n}"
              skip-lines-matching: "^> "
          acronyms:
            known-safe: [API, HTTP, TLS, JSON]
            scope: ["Check", "Expected"]
          index:
            output: ".runbook-index.json"
            include: [step-map, cross-ref-graph, word-counts, headings]

cross-references: checks that every match of
pattern: in the body resolves to a heading slug
after filling captures into must-match: ({n} for
the first capture, {1} / {2} for numbered groups,
or a named group). The skip-lines-matching: regex
exempts blockquoted or historical lines.
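For patterns with more than one capture, the numbered placeholders apply. The sketch below assumes a runbook that numbers sub-steps as Step 2.1; that numbering scheme is hypothetical, while the {1} / {2} placeholders come from the description above.

```yaml
kinds:
  runbook:
    schema:
      cross-references:
        # Two numbered groups: a body mention of "Step 2.1" is filled into
        # the template as "Step 2.1" and must then resolve to a heading.
        - pattern: "\\bStep (\\d+)\\.(\\d+)\\b"
          must-match: "Step {1}.{2}"
          skip-lines-matching: "^> "   # skip blockquoted lines
```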
acronyms: flags first-use all-caps tokens
(length 2-6) that lack a parenthesised expansion.
known-safe: is the allowlist. scope: restricts
the check to sections whose heading text matches one
of the listed names; omitting scope: applies the
check document-wide.
index: asks mdsmith fix to write a JSON side-output
next to the source file describing requested
sub-objects (step-map, cross-ref-graph,
word-counts, headings); mdsmith check never
writes it. Output paths resolve relative to the source
file; absolute paths and .. traversal are rejected.
See
MDS020 required-structure
for the JSON shape per include entry.
# File-based YAML schemas
A named YAML schema moves an inline schema:
block into a .mdsmith/schemas/<name>.yaml file (or
a top-level schemas: registry entry), and a kind
references it by name. The body shares the inline
matcher engine, so every regex:, repeat:, nested
sections:, and content: shape carries across. The
gain is reuse: one schema drives several kinds.

    # .mdsmith/schemas/rfc-v1.yaml, referenced by
    # `schema: rfc-v1` on one or more kinds
    filename: "RFC-[0-9][0-9][0-9][0-9].md"
    sections:
      - heading: "Overview"
      - heading: "Decision"

The schema: key is polymorphic: a scalar names a
registry entry, a mapping is an inline body. The
resolver substitutes a named reference for the
schema’s body before the kind validates, so a
schema: rfc-v1 kind matches a kind with the same
body inline. A named schema: is mutually exclusive
with the other sources, the same as an inline block.
The schema files reference covers the directory layout, the basename rule, the undeclared-name error, and the registry-vs-file collision in full.
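On the config side, a reference could look like the sketch below. It assumes the top-level schemas: registry is a mapping from schema name to schema body, which the schema files reference defines authoritatively; the second kind name is hypothetical.

```yaml
# .mdsmith.yml
schemas:
  rfc-v1:                      # registry entry; assumed to take the same body as a schema file
    filename: "RFC-[0-9][0-9][0-9][0-9].md"
    sections:
      - heading: "Overview"
      - heading: "Decision"
kinds:
  rfc:
    schema: rfc-v1             # scalar: names the registry entry
  rfc-amendment:               # hypothetical second kind sharing the schema
    schema: rfc-v1
```

Dropping the schemas: block and keeping the same body in .mdsmith/schemas/rfc-v1.yaml should resolve the same way, since both are named sources.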
# File-based schemas (proto.md)
A proto.md schema is a Markdown file whose headings
describe required structure and whose front matter
holds CUE constraints. This form is best for larger
schemas, schemas that want to template a body, or
schemas reused across kinds via <?include?>.

    ---
    id: '=~"^MDS[0-9]{3}$"'
    name: 'string & != ""'
    status: '"ready" | "not-ready"'
    ---
    <?require
    filename: "MDS*-*.md"
    ?>
    # {id}: {name}
    ## Settings
    ## Examples
    ### Good
    ### Bad

The # ? (or # {field}: {field} form) acts as the
title placeholder. ## ... rows mark wildcard slots.
Front-matter keys map directly to CUE expressions.
<?require?> declares the filename pattern.
MDS020’s file-schema check routes through its
legacy parser: {field} in a proto.md heading row
is a wildcard matching a non-empty run, not a
fmvar(...) substitution of the frontmatter value.
{field} in a proto.md body is fully wired:
MDS020 resolves the placeholder against the
document’s front matter and flags any mismatch.
mdsmith fix rewrites stale body lines to the
current front-matter value for files that match a
single file-based schema source. Composed or
multi-source schemas do not get Fix body rewrites.
A <?content?> directive row in a section body
declares one content entry, so a proto-based kind
validates and extracts body content like the
equivalent inline content: list. The
section-schema reference
documents the directive.
# Choosing a source
| Need | Inline | Named YAML | proto.md |
|---|---|---|---|
| Short schema with no templated body | yes | works | works |
| Same schema shared across kinds | no | yes | yes |
| Schema reused via <?include?> | no | no | yes |
| Frontmatter-body {field} sync | no | no | yes |
| Nested section tree | yes | yes | via heading levels |
| Section content entries | yes | yes | via <?content?> |
| Per-scope rule overrides | yes | yes | no |
| Stays next to other kind rule settings | yes | no | indirect |
A project can mix sources across kinds — some kinds
use inline schemas, others a named YAML schema or a
proto.md — but a single kind must pick one. A named
YAML schema carries the inline matcher engine, trading
the inline source’s adjacency for reuse across kinds.
# Schema inheritance with extends
A kind can build on another kind’s schema by setting
extends: <parent-name> next to its schema: block.
Frontmatter constraints unify under CUE refinement: a
child that re-declares a parent key joins both with &,
so the effective constraint is the intersection. A
child’s sections: list wholly replaces the parent’s,
so heading templates compose by sequence rather than by
constraint. Filename and other document-wide blocks
inherit when the child does not set them.
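A minimal sketch of the kind-level form, with hypothetical kind names and a hypothetical Ratification section. The comments restate the refinement and replacement rules from the paragraph above.

```yaml
kinds:
  rfc:
    schema:
      filename: "RFC-[0-9][0-9][0-9][0-9].md"
      frontmatter:
        status: '"draft" | "ratified" | "deprecated"'
      sections:
        - heading: "Overview"
        - heading: "Decision"
  ratified-rfc:                # hypothetical child kind
    extends: rfc               # sits next to the schema: block
    schema:
      frontmatter:
        # joined with the parent's constraint using &, so the effective
        # constraint is ("draft" | "ratified" | "deprecated") & "ratified"
        status: '"ratified"'
      sections:                # replaces the parent's list wholesale
        - heading: "Overview"
        - heading: "Decision"
        - heading: "Ratification"
      # filename: is not set here, so the parent's pattern is inherited
```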
A proto.md file schema declares the same relationship
via an extends: <path> key in its front matter; the
path is resolved relative to the schema file with the
same ..-traversal and absolute-path guards used by
<?include?>.
See Schema inheritance with extends
for the worked RFC example, conflict semantics, and the
mdsmith kinds show audit surface.
# Composition across kinds
A file resolved by multiple kinds that each declare a
required-structure schema gets the composition of all
of them — not just the last one. The merge layer
accumulates each kind’s schema: or inline-schema:
into a schema-sources list, and MDS020 loads every
source and composes them at check time.
The composition rules are:
- Frontmatter keys union across schemas. A key required by any input is required. Two schemas constraining the same key get the intersection of their CUE expressions (joined with &).
- Sections merge by literal heading text. Scopes that share the same heading combine their child lists recursively. Scopes that differ, including wildcard slots ({unlisted: true}), the preamble (null), and the bare ? wildcard, append in input order.
- closed: is OR-ed across inputs. Any scope that was strict in any input is strict in the composed scope.
- require.filename picks the first non-empty pattern. Conflicting patterns are a config error.
# Worked example: directive-rule-readme + rule-readme
The four directive READMEs in this repository
(MDS019-catalog, MDS021-include, MDS038-toc,
MDS039-build) resolve to both rule-readme and
directive-rule-readme. The first kind contributes
the common rule-README structure (Config,
Examples, Meta-Information); the second only adds
a required Pattern section.

    kinds:
      rule-readme:
        rules:
          required-structure:
            schema: internal/rules/proto.md
      directive-rule-readme:
        rules:
          required-structure:
            schema: internal/rules/directive-proto.md
    kind-assignment:
      - glob: ["internal/rules/MDS*/README.md"]
        kinds: [rule-readme]
      - glob: ["internal/rules/MDS019-catalog/README.md", …]
        kinds: [directive-rule-readme]

directive-proto.md declares only what’s specific to
directive rules:

    ---
    nature: '"directive"'
    ---
    # {id}: {name}
    ## ...
    ## Pattern
    ### Without the directive
    ### With the directive
    ## ...

The composed schema requires the union of both
sections lists. rule-readme’s nature is
"directive" | "generator" | "content" | "style" | "structure"; directive-rule-readme’s narrower
"directive" intersects to require exactly
"directive" on every file resolving to both kinds.
# Picking an input order
The composed section list concatenates each schema’s
sections (same-heading scopes merged). Order matters
for the “last required section”: when the later
schema’s required sections must precede the earlier
schema’s, reorder the kinds in kind-assignment or
rewrite the document to match composed order. The
directive READMEs put Pattern after
Meta-Information so the composed ordering matches
the document layout.
# Extracting data
A schema doubles as an extraction contract. Once
mdsmith check confirms a file conforms,

    mdsmith extract <kind> --format json|yaml|msgpack <file>

emits a data tree whose nesting mirrors the schema
hierarchy — no annotations required. Literal headings
key by slug, repeating sections become arrays, and
code-block / list / table / paragraph content
entries project their bodies. Set bind: <name> on a
scope or content entry to override the default key.
See mdsmith extract
for the full projection rules and exit codes.
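As an illustration of bind:, the sketch below renames one scope and one content entry. The key names diagnosis_steps and command are hypothetical, and the exact projected tree is described in the mdsmith extract reference rather than here.

```yaml
kinds:
  runbook:
    schema:
      sections:
        - heading: "Diagnosis"
          bind: diagnosis_steps      # hypothetical key replacing the slug-based default
          content:
            - kind: code-block
              lang: shell
              bind: command          # hypothetical key for this entry's body
```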
# Diagnostics
Schema diagnostics surface through
MDS020 required-structure.
The message text is the same regardless of source, so
this is the place to look up what “missing required section”, “unexpected section”, “heading level mismatch”, and “out of order” mean.
# See also
- Section schema reference: the entry-shape grammar in full.
- Schema files: one file per named schema under .mdsmith/schemas/.
- File kinds: how kinds attach schemas (and other rule config) to file groups.
- Enforcing document structure with schemas: the file-based reference.
- Placeholder grammar: opt-in tokens for template-friendly source.