Extract Markdown as data
When a Markdown file’s payload is prose, put it in the body under H2 sections — not in YAML frontmatter. mdsmith extract projects body structure into a JSON tree the same way it projects frontmatter, so the file stays editable as Markdown.
mdsmith extract projects a schema-conformant
Markdown file into a JSON / YAML / msgpack tree. Two
parts of the file feed the tree:
- Frontmatter: decoded YAML, written under a frontmatter key.
- Body sections: H1 / H2 / H3 headings and the content under them, projected as siblings.
This guide is about when to use which. Two modes share
the work: declared schema entries constrain and
rename the slice you name, while
projection: blocks
captures everything else — a section’s whole body,
or, at the schema root, the whole document.
# The principle
Frontmatter is for the file’s metadata: title,
kind, status, dates, cross-references, the
fields a non-prose tool (a workflow, a release
script, a status badge) reads alongside the prose.
Body sections are for the file’s payload: the prose, paragraphs, lists, and code blocks the file exists to hold. If the value is a sentence or two of copy, it belongs under a heading — not in a folded YAML scalar.
The trap is to reach for frontmatter for everything
because it’s structured. A 60-character tagline in
frontmatter and the same 60 characters in a
## Tagline body section project the same string;
only the key moves, from frontmatter.tagline to
tagline.text. The body version is shorter to edit,
diffs cleanly when wrapped, and is lintable as
Markdown.
# Worked example
A product-copy file at docs/copy/product.md with a
tagline, a lead, and one per-surface description.
# Frontmatter-heavy (the trap)
---
title: Product copy
tagline: >-
  Mark down your ideas; smith them into shipping
  docs.
lead: >-
  A lint-and-fix tool that keeps your Markdown
  consistent across every surface — READMEs, docs
  site, editor extensions.
vscode-description: >-
  Inline diagnostics, fix-on-save, and instant
  navigation for Markdown in VS Code.
---
# Product copy
This file holds the tagline, lead, and VS Code
description. Edit a field above and re-run the
sync.

Three folded scalars (>-); the body is bookkeeping;
line breaks inside the scalars are cosmetic
(folded-strip collapses them to spaces); a leading
punctuation character in any value would force
double-quotes.
# Body-structured (the principle)
---
title: Product copy
---
# Product copy
## Tagline
Mark down your ideas; smith them into shipping
docs.
## Lead
A lint-and-fix tool that keeps your Markdown
consistent across every surface — READMEs, docs
site, editor extensions.
## VS Code
Inline diagnostics, fix-on-save, and instant
navigation for Markdown in VS Code.

With a matching schema and kind assignment in
.mdsmith.yml:
kinds:
  product-copy:
    schema:
      sections:
        - heading: { regex: '^Tagline$' }
          content:
            - { kind: paragraph }
        - heading: { regex: '^Lead$' }
          content:
            - { kind: paragraph }
        - heading: { regex: '^VS Code$' }
          bind: vscode-description
          content:
            - { kind: paragraph }
kind-assignment:
  - glob: ["docs/copy/product.md"]
    kinds: [product-copy]

Each content: entry declares the paragraph its
section projects. A section without one projects as
an empty object: the schema, not the body, decides
what extract emits.
mdsmith extract product-copy --format json docs/copy/product.md
emits:
{
"frontmatter": { "title": "Product copy" },
"title": "Product copy",
"lead": { "text": "A lint-and-fix tool that keeps your Markdown consistent across every surface — READMEs, docs site, editor extensions." },
"tagline": { "text": "Mark down your ideas; smith them into shipping docs." },
"vscode-description": { "text": "Inline diagnostics, fix-on-save, and instant navigation for Markdown in VS Code." }
}

The H1 # Product copy projects as the top-level
title string. Keys come out sorted, not in document
order. The consumer reads the same strings the
frontmatter version held, and the body version is the
editable artifact.
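On the consumer side, reading the projection could look like the sketch below. It parses a trimmed, inlined copy of the output above rather than invoking mdsmith; `load_copy` is a hypothetical helper name, and a real script would read the JSON that the extract command writes instead.

```python
import json

# Trimmed, inlined copy of the extract output shown above (the lead is
# omitted for brevity). A real script would read this JSON from the
# extract command's output instead of a string literal.
EXTRACT_OUTPUT = """
{
  "frontmatter": { "title": "Product copy" },
  "title": "Product copy",
  "tagline": { "text": "Mark down your ideas; smith them into shipping docs." },
  "vscode-description": { "text": "Inline diagnostics, fix-on-save, and instant navigation for Markdown in VS Code." }
}
"""


def load_copy(raw: str) -> dict[str, str]:
    """Map each body-section key to its paragraph text (hypothetical helper)."""
    tree = json.loads(raw)
    return {
        key: value["text"]
        for key, value in tree.items()
        # Skip the frontmatter object and the top-level title string:
        # only section objects that carry a "text" leaf are copy.
        if isinstance(value, dict) and "text" in value
    }


copy = load_copy(EXTRACT_OUTPUT)
print(copy["tagline"])
```

The filter keys off the presence of a text leaf, so a section declared without a content: entry (an empty object) simply drops out of the result.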
# Projecting inline structure
A paragraph projects as plain text by default. When
the consumer needs the structure inside the
paragraph — which fragment is emphasised, which token
is code, which span is a link — set
projection: inline on the content entry. The
paragraph then projects under an inline key as a
typed, recursive span list instead of a flat string.
The canonical case is a website headline whose hero template renders one emphasised word from the data:
---
title: Product copy
---
# Product copy
## Headline
Mark*down*, smithed.

kinds:
  product-copy:
    schema:
      sections:
        - heading: { regex: '^Headline$' }
          content:
            - { kind: paragraph, projection: inline, required: true }

mdsmith extract product-copy --format json docs/copy/product.md
emits the headline as a span list: text, then the
level-1 emphasis span with its own children, then
the trailing text:
{
"frontmatter": { "title": "Product copy" },
"headline": {
"inline": [
{ "span": "text", "value": "Mark" },
{
"span": "emphasis", "level": 1,
"children": [{ "span": "text", "value": "down" }]
},
{ "span": "text", "value": ", smithed." }
]
}
}

Nesting composes through the same shape: a paragraph
such as run **`mdsmith fix`** daily projects the strong
span with the code span nested in its children. The
consumer walks one uniform tree, with no
flat-versus-recursive mode switch.
Leaf spans (text, code, autolink) carry a value;
container spans (emphasis, strong, link) carry
children. A wrapped line emits a break span between
the surrounding text spans (hard: true for a backslash
or double-space break). An image, inline raw HTML, or
any node outside that set is a hard error — the same
exit code as a non-conformant file. The full mapping
table is in the extract reference.
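A hero template could walk that span list with one recursive function. The sketch below is an illustration, not part of mdsmith: `render_spans` and the HTML tags it emits are assumptions, and it covers only the span kinds whose fields appear in this guide (text, code, emphasis, strong). Link, autolink, and break spans are rejected because their extra fields are documented in the extract reference, not here.

```python
from html import escape


def render_spans(spans: list[dict]) -> str:
    """Render a span list to HTML; the tag choices are this sketch's own."""
    out = []
    for span in spans:
        kind = span["span"]
        if kind == "text":
            # Leaf span: carries a value.
            out.append(escape(span["value"]))
        elif kind == "code":
            out.append(f"<code>{escape(span['value'])}</code>")
        elif kind == "emphasis":
            # Container span: recurse into children.
            out.append(f"<em>{render_spans(span['children'])}</em>")
        elif kind == "strong":
            out.append(f"<strong>{render_spans(span['children'])}</strong>")
        else:
            # Span kinds not covered by this sketch fail loudly.
            raise ValueError(f"unhandled span kind: {kind}")
    return "".join(out)


headline = [
    {"span": "text", "value": "Mark"},
    {"span": "emphasis", "level": 1,
     "children": [{"span": "text", "value": "down"}]},
    {"span": "text", "value": ", smithed."},
]
print(render_spans(headline))  # Mark<em>down</em>, smithed.
```

Because leaves and containers share one shape, the nested strong-around-code case needs no extra branch: the recursion handles it.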
# Projecting list structure
A list projects as an array of own-text strings by
default. That flat shape loses nesting and strips a task
checkbox to a literal [x] prefix. When the consumer
needs the structure — which items are checked, which
nest children — set projection: tree on the list
entry. Each item then projects as an object with its own
text, a checked bool on task items, and a recursive
children array on items that nest a sub-list.
The canonical case is a sprint checklist whose
status tool reads checked and walks children:
---
title: Sprint tasks
---
# Sprint tasks
## Tasks
- [x] done item
- [ ] open item with **bold**
  - nested child
- plain item

kinds:
  checklist:
    schema:
      sections:
        - heading: { regex: '^Tasks$' }
          content:
            - { kind: list, projection: tree }
kind-assignment:
  - glob: ["tasks.md"]
    kinds: [checklist]

mdsmith extract checklist --format json tasks.md emits:
{
"frontmatter": { "title": "Sprint tasks" },
"tasks": {
"items": [
{ "checked": true, "text": "done item" },
{
"checked": false, "text": "open item with bold",
"children": [{ "text": "nested child" }]
},
{ "text": "plain item" }
]
}
}

The [x] / [ ] marker becomes the checked bool and
never leaks into text; **bold** flattens to its
text; nested child rides inside its parent’s
children, not concatenated into the parent string. A
plain item is just {text}. Array order is item order;
YAML and msgpack emit the same tree.
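A status tool reading this tree might count checked and open tasks with a short recursive walk. This is a minimal sketch over an inlined copy of the items array above; `count_tasks` is a hypothetical name, and plain items (those without a checked key) are deliberately left out of both counts.

```python
def count_tasks(items: list[dict]) -> tuple[int, int]:
    """Return (checked, open) counts across the tree, ignoring plain items."""
    done = todo = 0
    for item in items:
        # Only task items carry a "checked" bool; plain items have none.
        if "checked" in item:
            if item["checked"]:
                done += 1
            else:
                todo += 1
        # Items that nest a sub-list carry a recursive "children" array.
        child_done, child_todo = count_tasks(item.get("children", []))
        done += child_done
        todo += child_todo
    return done, todo


tasks = [
    {"checked": True, "text": "done item"},
    {
        "checked": False, "text": "open item with bold",
        "children": [{"text": "nested child"}],
    },
    {"text": "plain item"},
]
print(count_tasks(tasks))  # (1, 1)
```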
# Projecting a table as positional rows
A kind: table content entry projects under a rows key
as an array of record objects by default. Set
projection: rows when the consumer needs column order
preserved, must tolerate duplicate headers, or works
with positional data (a chart script, a CSV writer).
The rows projection injects two sibling keys into the
section object — columns (the header array) and rows
(string arrays, one per body row). Short rows are padded
with empty strings to the header width.
A benchmark table in a performance section:
---
title: Benchmark results
---
# Benchmark results
## Latency
| Operation | p50 ms | p99 ms |
| --------- | ------ | ------ |
| check | 12 | 45 |
| fix | 18 | 70 |

kinds:
  benchmark:
    schema:
      sections:
        - heading: { regex: '^Latency$' }
          content:
            - { kind: table, projection: rows }
kind-assignment:
  - glob: ["docs/benchmarks.md"]
    kinds: [benchmark]

mdsmith extract benchmark --format json docs/benchmarks.md
emits:
{
"frontmatter": { "title": "Benchmark results" },
"latency": {
"columns": ["Operation", "p50 ms", "p99 ms"],
"rows": [["check", "12", "45"], ["fix", "18", "70"]]
}
}

A chart script reads latency.rows[0][1] by index.
The default records projection emits the same
table as an array of objects keyed by column header.
The full projection matrix and duplicate-header
semantics are in the extract reference.
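The positional shape is easy to consume by index. The sketch below works on an inlined copy of the latency object above; `column` and `to_records` are hypothetical helpers. Cells arrive as strings, so the numeric conversion is the consumer's job, and `to_records` is only safe when headers are unique, since a dict keeps one value per key.

```python
latency = {
    "columns": ["Operation", "p50 ms", "p99 ms"],
    "rows": [["check", "12", "45"], ["fix", "18", "70"]],
}


def column(section: dict, name: str) -> list[float]:
    """Pull one column by header name and convert its cells to floats."""
    idx = section["columns"].index(name)
    return [float(row[idx]) for row in section["rows"]]


def to_records(section: dict) -> list[dict[str, str]]:
    """Zip each positional row with the header array.

    Assumes unique headers: a duplicate header would overwrite the
    earlier cell, which is the case the positional shape avoids.
    """
    return [dict(zip(section["columns"], row)) for row in section["rows"]]


print(column(latency, "p99 ms"))  # [45.0, 70.0]
print(to_records(latency)[0]["Operation"])  # check
```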
# Projecting a whole section body
Everything above names a slice and constrains it. The
opposite need is to capture a section’s whole body
without listing each node. Set projection: blocks on
the scope — or once at the schema root, as the default
for every section. The body projects as a typed,
recursive blocks list in document order: paragraphs,
code, lists, tables, quotes, and deeper headings (as
nested section blocks).
A schema-level switch projects a whole document in one
line. Declared sections still project as keyed objects,
and gain a blocks list. The sections the walker would
skip — wildcard and unlisted headings — now project
too. Each lands under its slug, its heading text in a
heading field:
---
title: Release notes
---
# Release notes
## Summary
Ships **block projection**.
## Details
Two new switches.

kinds:
  notebook:
    schema:
      projection: blocks
      sections:
        - heading: { regex: '^Summary$' }
kind-assignment:
  - glob: ["notes.md"]
    kinds: [notebook]

mdsmith extract notebook --format json notes.md
emits the H1 under title, the declared summary, and
the unlisted details (with its heading text):
{
"frontmatter": { "title": "Release notes" },
"title": "Release notes",
"summary": {
"blocks": [{ "block": "paragraph", "text": "Ships block projection." }]
},
"details": {
"heading": "Details",
"blocks": [{ "block": "paragraph", "text": "Two new switches." }]
}
}

A paragraph block defaults to flat text. Add
block-paragraphs: inline beside projection: blocks
to project each paragraph’s span list under inline
instead: the same span shape as above, but lenient
about images. The full grammar and its CUE contract
are in the extract reference.
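A consumer of the whole-document projection might gather the paragraph text of every section. The sketch below handles only the flat paragraph block shape shown above; other block kinds (code, lists, tables, quotes, nested sections) are skipped because their fields are defined in the extract reference rather than in this guide. `section_paragraphs` is a hypothetical name.

```python
def section_paragraphs(tree: dict) -> dict[str, list[str]]:
    """Collect flat paragraph text per section key from a blocks projection."""
    result: dict[str, list[str]] = {}
    for key, value in tree.items():
        # Sections are objects carrying a "blocks" list; the frontmatter
        # object and the title string do not, so they are skipped.
        if isinstance(value, dict) and "blocks" in value:
            result[key] = [
                block["text"]
                for block in value["blocks"]
                if block.get("block") == "paragraph" and "text" in block
            ]
    return result


notes = {
    "frontmatter": {"title": "Release notes"},
    "title": "Release notes",
    "summary": {
        "blocks": [{"block": "paragraph", "text": "Ships block projection."}]
    },
    "details": {
        "heading": "Details",
        "blocks": [{"block": "paragraph", "text": "Two new switches."}],
    },
}
print(section_paragraphs(notes))
```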
# When frontmatter is the right call
- Short scalars where YAML’s typing earns its keep: booleans (draft: true), dates (published: 2026-05-24), enums (status: "✅"), numbers.
- Metadata other tools read: title, kind, weight, tags; anything Hugo’s frontmatter, a release script, or a status dashboard consumes directly.
- Fields that participate in <?catalog?> directives: catalog templating reads frontmatter keys ({title}, {summary}).
- Strict, machine-controlled values: a generated version stamp, a hash, a per-file identifier; values an automated tool writes and a human should not edit by hand.
Prose paragraphs, multi-line copy, anything wider than one line, and anything that benefits from Markdown formatting (code, emphasis, links) all belong in the body.
# Frontmatter title and the H1
The worked example carries the same string twice:
title: Product copy in frontmatter and
# Product copy as the H1. Nothing checks the two
against each other by default, so they can drift
apart edit by edit.
The test from the previous section decides it. When
no catalog row, site template, or release script
reads frontmatter.title, delete the field; the H1
alone is the title. When a tool does read the
field, keep it and let MDS020 enforce the match.
# H1 title in the projection
When the schema roots at H2 (all inline schemas do),
mdsmith extract emits the document H1’s plain text
under a reserved title key beside frontmatter —
the "title": "Product copy" line in the
worked example
output above.
When there is no H1 the key is omitted, and a scope
that resolves to title (via slug or bind:) is
reported as a collision before any data is emitted;
rename the scope with bind: to resolve it.
An <?include?> with extract: title splices the
H1 plain text directly into a host file, exactly
like the tagline.text embed shown in
Reading a value back.
# Enforcing H1 ↔ frontmatter consistency
Enforcement requires a file-based schema. An inline
schema: starts matching at H2 (the H1 belongs to
first-line-heading), so the kind switches
to a proto.md rooted at the {title} placeholder.
A <?content?> directive row in a section body
declares the same content entry the inline
content: list declares, so the worked example’s
sections keep their text projections instead of
collapsing to empty objects:
# {title}
## Tagline
<?content
kind: paragraph
?>
## Lead
<?content
kind: paragraph
?>

kinds:
  product-copy:
    rules:
      required-structure:
        schema: copy-proto.md

The {title} row requires the frontmatter field
and checks the H1 text against its value. A drifted
H1 fails mdsmith check:

docs/copy/product.md:4:1 MDS020 heading does not match frontmatter: expected "Product copy" (from title), got "Product page copy"

The synced H1 also becomes data: mdsmith extract
projects the H1 scope under a title object, with the
captured heading text and each section’s paragraph
nested inside it.
One limit remains. Every schema source on a file must
declare the same root level, so an H1-rooted proto.md
cannot compose with an H2-rooted inline schema. The H1
sync and the body extraction both live on the
proto.md. And mdsmith cannot project the H1 text
without a frontmatter field behind it — a {title} row
with no title field matches any heading, and extract
skips wildcard scopes — so keep the title field when
the kind syncs the H1.
# bind: patterns
bind: renames the JSON key that a heading or
content entry projects under. Use it when the
human-readable heading and the consumer-friendly key
don’t match.
- Heading-to-key rename. ## VS Code slugs to vs-code; set bind: vscode-description so the consumer reads the field name its code expects.
- Collapse a wrapper. bind: "" on a parent scope hoists its children into the grandparent, for a heading that should not nest in the data tree.
- Repeating sections. A repeat: {min, max} section with a placeholder heading projects as an array; bind: renames the array key.
See the section-schema reference for the full grammar.
# Reading a value back into Markdown
mdsmith extract writes the projection out for a
release script, a Hugo data file, or any non-Markdown
consumer. The read-side counterpart lives on the
<?include?> directive: its extract: parameter walks
the same JSON tree and splices one leaf into the host
file’s Markdown body.
Re-using the product-copy example, a README embed reads the tagline directly:
<?include
file: docs/copy/product.md
extract: tagline.text
?>
Mark down your ideas; smith them into shipping docs.
<?/include?>

The directive runs the same projection rules, walks the dotted path, and splices the leaf, with no intermediate “fragment” file to keep in sync. The supported paths, content-key shortcuts, and lint-time errors are in generating-content.md.
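The dotted-path walk can be pictured with a few lines of Python. This is an illustrative sketch, not mdsmith's implementation: `resolve` is a hypothetical function, it handles only plain object keys, and it ignores the content-key shortcuts and lint-time errors that generating-content.md describes.

```python
def resolve(tree: dict, path: str):
    """Walk a dotted path such as "tagline.text" down a projection tree."""
    node = tree
    for part in path.split("."):
        # Each path segment must name a key of the current object.
        if not isinstance(node, dict) or part not in node:
            raise KeyError(f"no {part!r} while resolving {path!r}")
        node = node[part]
    return node


projection = {
    "frontmatter": {"title": "Product copy"},
    "title": "Product copy",
    "tagline": {"text": "Mark down your ideas; smith them into shipping docs."},
}
print(resolve(projection, "tagline.text"))
print(resolve(projection, "title"))
```

The same walk explains why extract: title works for the H1: title is a top-level string leaf, so a one-segment path reaches it.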
# See also
- mdsmith extract: the CLI reference, including default projection rules per content entry type (code → code, list → items, table → rows, paragraph → text or inline).
- Schemas guide: declaring the kind schema that doubles as the extraction contract.
- Generating Content with Directives: the <?include ... extract:?> read-side documentation.