mdsmith v0.52.0

    Schemas

    Declare a document-structure schema inline on a kind or in a proto.md file, validate headings and front matter, and tighten rule config per section.

    A schema describes what a Markdown document’s front matter, filename, and heading tree must look like. Schemas are the engine behind MDS020 required-structure; they are the canonical place to lock down the shape of a recurring document type (plan, RFC, runbook, rule README).

    mdsmith reads schemas from three sources:

    • Inline: a schema: block on a kind body in .mdsmith.yml. Uses the new matcher engine (regex:, repeat:, \#(digits), \#(fmvar(...))).
    • Named YAML: a schema: <name> reference to a .mdsmith/schemas/<name>.yaml file (or an inline schemas: registry entry of the same name). Same matcher engine as the inline source, shared across kinds. See File-based YAML schemas.
    • proto.md: a proto.md referenced by rules.required-structure.schema:. MDS020 still validates this source through its legacy parser today. The schema-package parser shipped with plan 156 lifts proto.md into the same matcher shape, but only tests exercise it until the cutover follow-up wires MDS020 through.

    A kind may use only one source; setting two is a config error.
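
    As a rough sketch of the one-source-per-kind rule, the fragment below gives three kinds one source each. The kind names (rfc, plan, rule-readme), the schema name plan-v1, and the proto.md path are illustrative assumptions, not names mdsmith requires.

    ```yaml
    kinds:
      # Inline source: the schema body lives on the kind itself.
      rfc:
        schema:
          sections:
            - heading: "Overview"
      # Named YAML source: a scalar names .mdsmith/schemas/plan-v1.yaml
      # (or a schemas: registry entry called plan-v1).
      plan:
        schema: plan-v1
      # proto.md source: referenced through the required-structure rule.
      rule-readme:
        rules:
          required-structure:
            schema: internal/rules/proto.md
    ```

    Setting two of these on the same kind, for example both schema: and rules.required-structure.schema:, would be the config error described above.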

    # Inline schemas on kinds

    Inline schemas keep the structure declaration next to the kind’s other rule settings. They are best for small schemas (one or two screens) that do not need templated body content.

    kinds:
      rfc:
        schema:
          filename: "RFC-[0-9][0-9][0-9][0-9].md"
          frontmatter:
            id: '=~"^RFC-[0-9]{4}$"'
            status: '"draft" | "ratified" | "deprecated"'
            authors: '[...string] & [_, ...string]'
          closed: true
          sections:
            - heading: null
            - heading: "Overview"
            - heading: "Decision"
            - heading:
                regex: '.+'
                repeat: { min: 0 }
            - heading: "References"

    The frontmatter: mapping reuses CUE expressions per key: regex, disjunction, list, and any other CUE form is accepted. Trailing ? on a key marks it optional. Named shortcuts — date, datetime, time, email, url, filename, nonEmpty — substitute for their canonical CUE so a schema can write created: date instead of repeating the ISO regex; see Schema field types for the registered names and how they are matched.
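
    A minimal sketch of the optional-key marker and the named shortcuts, assuming a hypothetical plan kind. The field names are invented for illustration; the key with the trailing ? is quoted only as a YAML precaution.

    ```yaml
    kinds:
      plan:                              # hypothetical kind name
        schema:
          frontmatter:
            title: nonEmpty              # named shortcut
            created: date                # named shortcut, replaces the ISO date regex
            "reviewer?": email           # trailing ? marks the key optional
            status: '"draft" | "done"'   # plain CUE disjunction
    ```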

    filename: is a glob the document basename must match. It sits at the top of the schema block (no require: wrapper).

    closed: true makes the scope strict — unlisted headings produce a diagnostic. closed: false (the default) tolerates unlisted headings between listed sections. closed: is only meaningful when the schema declares sections:; setting it on a frontmatter-only kind is a parse error.

    # The heading: discriminator

    Every section-array entry sets heading:. The value takes one of three shapes:

    • null — the preamble: content from line 1 up to the first heading. Only valid as the first entry in a section list. Carries closed: / rules: / content: for that range; rejects sections:.
    • string — sugar for a literal match. The string is regex-escaped and used as the matcher’s pattern, so heading: "(WIP)" matches a heading whose text is exactly (WIP) with the parens taken literally. Cardinality is one.
    • mapping — the full form: { regex, repeat?, sequential? }. regex: is required; the body is a Go RE2 pattern that accepts two interpolation references — \#(digits) and \#(fmvar(name)). repeat: bounds the run; sequential: (with digits) asserts ordering. See the section-schema reference for the full grammar.
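
    The three shapes can sit side by side in one section list. The sketch below is illustrative only: the heading texts are made up, and the preamble entry carries a content: list simply to show that it may.

    ```yaml
    sections:
      # null: the preamble, from line 1 up to the first heading.
      - heading: null
        content:
          - kind: paragraph
      # string: literal match; the parentheses are not regex syntax here.
      - heading: "(WIP)"
      # mapping: full matcher form with an explicit regex and bounds.
      - heading:
          regex: 'Appendix \#(digits)'
          repeat: { min: 0 }
          sequential: true
    ```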

    # The matcher mapping

    sections:
      - heading:
          regex: 'Step \#(digits)'
          repeat: { min: 1, max: 5 }
          sequential: true
        sections: [...]
        content: [...]
      - heading:
          regex: '\#(fmvar(id)): \#(fmvar(name))'
      - heading:
          regex: '.+'
          repeat: { min: 0 }

    regex: is whole-string anchored against the heading’s rendered plain text (inline emphasis stripped, link wrappers unwrapped, code-span backticks dropped). Backslashes pass through to RE2; interpolation uses \#(expr). Two helpers are in scope:

    • digits — expands to the named capture (?P<n>[0-9]+). One per pattern. With sequential: true the validator asserts the captured numbers are strictly increasing without gaps.
    • fmvar(name) — looks up frontmatter field name, regex-escapes its value, and substitutes it.

    repeat: bounds how many consecutive matching headings the matcher claims. Omitting repeat: means exactly one; { min: 0 } is zero-or-more; { min: 0, max: 1 } is optional; { min: 1 } is one-or-more; bounded forms enforce both bounds. Both repeat: { max: 0 } and any repeat: whose min exceeds its max are parse errors.
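
    For quick reference, the cardinality forms listed above look like this in a section list. The heading texts and regexes are hypothetical; only the repeat: shapes come from the rules described here.

    ```yaml
    sections:
      - heading: "Summary"            # no repeat: exactly one
      - heading:
          regex: 'Note .+'
          repeat: { min: 0 }          # zero or more
      - heading:
          regex: 'Caveats'
          repeat: { min: 0, max: 1 }  # optional
      - heading:
          regex: 'Step \#(digits)'
          repeat: { min: 1 }          # one or more
      - heading:
          regex: 'Example .+'
          repeat: { min: 2, max: 4 }  # bounded: between two and four
    ```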

    The wildcard-slot shape — regex: '.+' with repeat: { min: 0 } — is positional: it absorbs zero or more unlisted sections at its slot. A heading whose text matches a later listed entry is claimed for that entry, not absorbed by the slot.

    # Nested sections

    Levels come from depth. Root sections: entries are H2; nested sections: lists are H3, H4, …. A runbook that wants Diagnosis → Step → Check / Expected expresses that as:

    sections:
      - heading:
          regex: 'Symptoms|Indicators'
      - heading: "Diagnosis"
        sections:
          - heading: "Step"
            sections:
              - heading: "Check"
              - heading: "Expected"
              - heading:
                  regex: 'If different'
                  repeat: { min: 0, max: 1 }
      - heading:
          regex: 'References'
          repeat: { min: 0, max: 1 }

    A scope that accepts alternate heading texts encodes the disjunction in its regex: regex: 'A|B' matches a heading whose text is A or B.

    # Section content

    sections: constrains nested headings. To pin down what AST nodes must appear inside a section’s body — a required YAML code block, a settings table with specific columns — add a content: list alongside the scope’s existing fields.

    sections:
      - heading: "Examples"
        closed: true
        content:
          - kind: code-block
            lang: yaml
          - kind: unlisted

    Each entry sets kind: and a small set of optional kind-specific fields:

    • kind: code-block takes lang:, which constrains the fenced block’s info string (exact match).
    • kind: table takes columns:, the exact header row the GFM table must carry.
    • kind: list takes ordered: (true / false), min-items:, and max-items:, which bound the list’s shape.
    • kind: paragraph takes no extra keys.
    • kind: unlisted is a positional slot. It tolerates any non-matching nodes at that position, even under closed: true.

    Entries match in declared order. A node that matches a later listed entry is claimed out-of-order with a diagnostic, the same rule the heading-tree walker uses. Sub-shape mismatches (wrong code-block language, wrong table columns, list violating ordered/min/max) emit their own diagnostics but still consume the slot. Missing required entries anchor at the section’s heading line.

    content: is rejected on a slot scope (the wildcard-slot shape has no fixed identity to constrain). Set content: only on entries that match named sections.
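
    A sketch combining several content kinds on one named section. The heading text and column names are hypothetical, and writing columns: as a YAML list of header cells is an assumption; the section-schema reference is the authority on its exact form.

    ```yaml
    sections:
      - heading: "Settings"
        closed: true
        content:
          - kind: paragraph
          - kind: table
            columns: [Setting, Type, Default]   # assumed list form of the header row
          - kind: list
            ordered: false
            min-items: 1
            max-items: 10
    ```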

    # Per-scope rule overrides

    Any scope may carry a rules: block. The override sits on top of the rule’s defaults — keys it sets replace the defaults wholesale; keys it omits keep their default value. It applies only inside that scope’s heading range, so one section can be stricter than the rest of the document without glob overrides.

    sections:
      - heading: "Decision"
        rules:
          paragraph-readability:
            max-index: 12.0
          max-section-length:
            max-words: 200

    The override has two limitations: it is not a config-style deep merge (nested maps and list append modes behave like a plain ApplySettings call), and it stacks on rule defaults rather than the rule’s full per-file config.

    If a scope’s rules: block names a rule that does not exist or supplies settings the rule rejects, the override surfaces as an MDS020 diagnostic at the scope’s heading line.

    # Content constraints

    Five rules ship per-scope prose constraints. Each is default-disabled and reuses the standard rules: surface — there is no separate schema vocabulary for “max words” or “forbidden text”.

    | Rule | Setting | Effect |
    |------|---------|--------|
    | MDS036 max-section-length | max-words, min-words, max-paragraphs | Cap word counts and paragraph counts in addition to today’s line cap. |
    | MDS055 forbidden-paragraph-starts | starts: [str, ...] | Flag paragraphs that begin with any listed prefix. |
    | MDS056 forbidden-text | contains: [str, ...] | Flag paragraphs whose text contains any listed substring. |
    | MDS057 required-text-patterns | patterns: [{pattern, message, skip-indices}] | Flag a section whose body does not match every configured regex. |
    | MDS058 required-mentions | mentions: [str, ...] | Flag a section that does not contain every listed substring. |

    Set them under top-level rules: for the whole document, or under a scope’s rules: block for one section — scoped to a Diagnosis section here:

    kinds:
      runbook:
        schema:
          sections:
            - heading: "Diagnosis"
              rules:
                forbidden-text:
                  contains: ["should", "may"]
                required-mentions:
                  mentions: ["forward reference"]

    MDS057 and MDS058 anchor diagnostics at the section’s heading line. MDS055 and MDS056 anchor at the offending paragraph’s line. In both cases the per-scope filter keeps only diagnostics inside the section’s range, so the same rule code works for document-wide and per-section enforcement.
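
    For the document-wide case, the same settings would sit under the top-level rules: key. This is a sketch under assumptions: the limits, prefixes, and pattern are invented, and it assumes the rule names follow the same naming as the scoped example above.

    ```yaml
    rules:
      max-section-length:
        max-words: 400
        max-paragraphs: 8
      forbidden-paragraph-starts:
        starts: ["Note that", "Basically"]
      required-text-patterns:
        patterns:
          - pattern: 'Owner: .+'          # hypothetical regex
            message: "state the owner"    # hypothetical diagnostic text
    ```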

    skip-indices: on MDS057 parses but is inert until section-content children: ships in a later plan.

    # Cross-references, acronyms, and index

    Three top-level schema blocks add document-wide checks and a JSON side-output:

    kinds:
      runbook:
        schema:
          cross-references:
            - pattern: "\\bStep (\\d+)\\b"
              must-match: "Step {n}"
              skip-lines-matching: "^> "
          acronyms:
            known-safe: [API, HTTP, TLS, JSON]
            scope: ["Check", "Expected"]
          index:
            output: ".runbook-index.json"
            include: [step-map, cross-ref-graph, word-counts, headings]

    cross-references: checks that every match of pattern: in the body resolves to a heading slug after filling captures into must-match: ({n} for the first capture, {1} / {2} for numbered groups, or a named group). The skip-lines-matching: regex exempts blockquoted or historical lines.

    acronyms: flags first-use all-caps tokens (length 2-6) that lack a parenthesised expansion. known-safe: is the allowlist. scope: restricts the check to sections whose heading text matches one of the listed names; omitting scope: applies the check document-wide.

    index: asks mdsmith fix to write a JSON side-output next to the source file describing the requested sub-objects (step-map, cross-ref-graph, word-counts, headings); mdsmith check never writes it. Output paths resolve relative to the source file; absolute paths and .. traversal are rejected. See MDS020 required-structure for the JSON shape per include entry.

    # File-based YAML schemas

    A named YAML schema moves an inline schema: block into a .mdsmith/schemas/<name>.yaml file (or a top-level schemas: registry entry), and a kind references it by name. The body shares the inline matcher engine, so every regex:, repeat:, nested sections:, and content: shape carries across. The gain is reuse: one schema drives several kinds.

    # .mdsmith/schemas/rfc-v1.yaml — referenced by
    # `schema: rfc-v1` on one or more kinds
    filename: "RFC-[0-9][0-9][0-9][0-9].md"
    sections:
      - heading: "Overview"
      - heading: "Decision"

    The schema: key is polymorphic: a scalar names a registry entry, a mapping is an inline body. The resolver substitutes a named reference for the schema’s body before the kind validates, so a schema: rfc-v1 kind matches a kind with the same body inline. A named schema: is mutually exclusive with the other sources, the same as an inline block.
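
    The polymorphic schema: key might be used as follows. The kind names and the adr-v1 registry entry are hypothetical, and the layout of the schemas: registry (name mapped to a schema body) is an assumption drawn from the description above.

    ```yaml
    # .mdsmith.yml
    schemas:
      adr-v1:                 # hypothetical inline registry entry
        sections:
          - heading: "Context"
          - heading: "Decision"

    kinds:
      rfc:
        schema: rfc-v1        # scalar: resolved from .mdsmith/schemas/rfc-v1.yaml
      design-note:
        schema: rfc-v1        # the same named schema drives a second kind
      adr:
        schema: adr-v1        # scalar: resolved from the schemas: registry above
    ```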

    The schema files reference covers the directory layout, the basename rule, the undeclared-name error, and the registry-vs-file collision in full.

    # File-based schemas (proto.md)

    A proto.md schema is a Markdown file whose headings describe required structure and whose front matter holds CUE constraints. This form is best for larger schemas, schemas that want to template a body, or schemas reused across kinds via <?include?>.

    ---
    id: '=~"^MDS[0-9]{3}$"'
    name: 'string & != ""'
    status: '"ready" | "not-ready"'
    ---
    <?require
    filename: "MDS*-*.md"
    ?>
    
    # {id}: {name}
    
    ## Settings
    
    ## Examples
    
    ### Good
    
    ### Bad

    The # ? (or # {field}: {field} form) acts as the title placeholder. ## ... rows mark wildcard slots. Front-matter keys map directly to CUE expressions. <?require?> declares the filename pattern.

    MDS020’s file-schema check routes through its legacy parser: {field} in a proto.md heading row is a wildcard matching a non-empty run, not a fmvar(...) substitution of the frontmatter value.

    {field} in a proto.md body is fully wired: MDS020 resolves the placeholder against the document’s front matter and flags any mismatch. mdsmith fix rewrites stale body lines to the current front-matter value for files that match a single file-based schema source. Composed or multi-source schemas do not get Fix body rewrites.

    A <?content?> directive row in a section body declares one content entry, so a proto-based kind validates and extracts body content like the equivalent inline content: list. The section-schema reference documents the directive.

    # Choosing a source

    | Need | Inline | Named YAML | proto.md |
    |------|--------|------------|----------|
    | Short schema with no templated body | yes | works | works |
    | Same schema shared across kinds | no | yes | yes |
    | Schema reused via <?include?> | no | no | yes |
    | Frontmatter-body {field} sync | no | no | yes |
    | Nested section tree | yes | yes | via heading levels |
    | Section content entries | yes | yes | via <?content?> |
    | Per-scope rule overrides | yes | yes | no |
    | Stays next to other kind rule settings | yes | no | indirect |

    A project can mix sources across kinds — some kinds use inline schemas, others a named YAML schema or a proto.md — but a single kind must pick one. A named YAML schema carries the inline matcher engine, trading the inline source’s adjacency for reuse across kinds.

    # Schema inheritance with extends

    A kind can build on another kind’s schema by setting extends: <parent-name> next to its schema: block. Frontmatter constraints unify under CUE refinement: a child that re-declares a parent key joins both with &, so the effective constraint is the intersection. A child’s sections: list wholly replaces the parent’s, so heading templates compose by sequence rather than by constraint. Filename and other document-wide blocks inherit when the child does not set them.
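
    A hedged sketch of the kind-level form, using a hypothetical security-rfc child of an rfc kind. The comments restate the inheritance rules from the paragraph above; the linked reference remains the authority on conflict semantics.

    ```yaml
    kinds:
      rfc:
        schema:
          filename: "RFC-[0-9][0-9][0-9][0-9].md"
          frontmatter:
            status: '"draft" | "ratified" | "deprecated"'
          sections:
            - heading: "Overview"
            - heading: "Decision"
      security-rfc:            # hypothetical child kind
        extends: rfc
        schema:
          frontmatter:
            # joined with the parent's constraint using &, so only
            # "draft" and "ratified" remain valid
            status: '"draft" | "ratified"'
          sections:            # replaces the parent's list wholesale
            - heading: "Overview"
            - heading: "Threat model"
            - heading: "Decision"
          # filename: is not set here, so the parent's glob is inherited
    ```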

    A proto.md file schema declares the same relationship via an extends: <path> key in its front matter; the path is resolved relative to the schema file with the same ..-traversal and absolute-path guards used by <?include?>.

    See Schema inheritance with extends for the worked RFC example, conflict semantics, and the mdsmith kinds show audit surface.

    # Composition across kinds

    A file resolved by multiple kinds that each declare a required-structure schema gets the composition of all of them — not just the last one. The merge layer accumulates each kind’s schema: or inline-schema: into a schema-sources list, and MDS020 loads every source and composes them at check time.

    The composition rules are:

    • Frontmatter keys union across schemas. A key required by any input is required. Two schemas constraining the same key get the intersection of their CUE expressions (joined with &).
    • Sections merge by literal heading text. Scopes that share the same heading combine their child lists recursively. Scopes that differ — including wildcard slots ({unlisted: true}), the preamble (null), and the bare ? wildcard — append in input order.
    • closed: is OR-ed across inputs. Any scope that was strict in any input is strict in the composed scope.
    • require.filename picks the first non-empty pattern. Conflicting patterns are a config error.
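
    To make the rules concrete, consider two hypothetical kinds that both resolve for the same file. The kind names, glob, and headings are assumptions, and the trailing comments describe what the rules above imply rather than captured tool output.

    ```yaml
    kinds:
      rfc-base:
        schema:
          frontmatter:
            status: '"draft" | "ratified" | "deprecated"'
          sections:
            - heading: "Overview"
            - heading: "Decision"
      rfc-ratified:
        schema:
          frontmatter:
            status: '"ratified"'
          closed: true
          sections:
            - heading: "Decision"
            - heading: "References"

    kind-assignment:
      - glob: ["docs/rfcs/ratified/*.md"]
        kinds: [rfc-base]
      - glob: ["docs/rfcs/ratified/*.md"]
        kinds: [rfc-ratified]

    # Expected composition, per the rules above:
    #   status:   ("draft" | "ratified" | "deprecated") & "ratified"
    #   sections: Overview, Decision (merged by heading text), References (appended)
    #   closed:   true at the root scope, because one input set it
    ```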

    # Worked example: directive-rule-readme + rule-readme

    The four directive READMEs in this repository (MDS019-catalog, MDS021-include, MDS038-toc, MDS039-build) resolve to both rule-readme and directive-rule-readme. The first kind contributes the common rule-README structure (Config, Examples, Meta-Information); the second only adds a required Pattern section.

    kinds:
      rule-readme:
        rules:
          required-structure:
            schema: internal/rules/proto.md
      directive-rule-readme:
        rules:
          required-structure:
            schema: internal/rules/directive-proto.md
    
    kind-assignment:
      - glob: ["internal/rules/MDS*/README.md"]
        kinds: [rule-readme]
      - glob: ["internal/rules/MDS019-catalog/README.md", …]
        kinds: [directive-rule-readme]

    directive-proto.md declares only what’s specific to directive rules:

    ---
    nature: '"directive"'
    ---
    # {id}: {name}
    
    ## ...
    
    ## Pattern
    
    ### Without the directive
    ### With the directive
    
    ## ...

    The composed schema requires the union of both sections lists. rule-readme’s nature is "directive" | "generator" | "content" | "style" | "structure"; directive-rule-readme’s narrower "directive" intersects to require exactly "directive" on every file resolving to both kinds.

    # Picking an input order

    The composed section list concatenates each schema’s sections (same-heading scopes merged). Order matters for the “last required section”: when the later schema’s required sections must precede the earlier schema’s, reorder the kinds in kind-assignment or rewrite the document to match composed order. The directive READMEs put Pattern after Meta-Information so the composed ordering matches the document layout.

    # Extracting data

    A schema doubles as an extraction contract. Once mdsmith check confirms a file conforms,

    mdsmith extract <kind> --format json|yaml|msgpack <file>

    emits a data tree whose nesting mirrors the schema hierarchy — no annotations required. Literal headings key by slug, repeating sections become arrays, and code-block / list / table / paragraph content entries project their bodies. Set bind: <name> on a scope or content entry to override the default key. See mdsmith extract for the full projection rules and exit codes.
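
    As a small sketch of bind:, the fragment below renames the keys for one scope and one content entry. The names outcome and config are hypothetical; without bind:, the literal heading would key by its slug.

    ```yaml
    sections:
      - heading: "Decision"
        bind: outcome           # hypothetical key replacing the default slug key
        content:
          - kind: code-block
            lang: yaml
            bind: config        # hypothetical key for this content entry
    ```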

    # Diagnostics

    Schema diagnostics surface through MDS020 required-structure. The message text is the same regardless of source, so this is the place to look up what missing required section, unexpected section, heading level mismatch, and out of order mean.

    # See also