MDS037: duplicated-content
Paragraphs should not repeat verbatim across Markdown files.
# Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| include | list | [] | glob patterns limiting which siblings to scan |
| exclude | list | [] | glob patterns of siblings to skip |
| min-chars | int | 200 | minimum normalized paragraph length to compare |
Before comparing, the rule normalizes each paragraph. Whitespace
collapses to single spaces. Letters become lowercase. Leading and
trailing space is trimmed. A paragraph shorter than min-chars runes
is skipped; short stubs would otherwise produce noise.
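
The normalization steps above can be sketched as follows. This is a minimal illustration, not the rule's actual implementation: the function names `normalize` and `is_comparable` are hypothetical, and Python's `len()` on a string (a count of Unicode code points) stands in for the rune count.

```python
def normalize(paragraph: str) -> str:
    """Collapse whitespace runs, trim the ends, and lowercase the text."""
    # str.split() with no arguments splits on any run of whitespace and
    # drops leading and trailing whitespace, so joining with a single
    # space both collapses and trims in one step.
    return " ".join(paragraph.split()).lower()


def is_comparable(paragraph: str, min_chars: int = 200) -> bool:
    """Return True when the normalized paragraph is long enough to compare."""
    # len() counts code points here, the closest analogue to runes.
    return len(normalize(paragraph)) >= min_chars
```

Under these assumptions, `normalize("  Hello\n   World\t")` yields `"hello world"`, and a short stub such as `"See also."` is not comparable at the default threshold of 200.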
The rule walks RootFS when the project root is known. Otherwise it
falls back to the file’s own directory. An include list narrows the
scan to matching paths. An exclude entry takes precedence over an include match.
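
The precedence between the two lists can be sketched like this. The helper `in_scope` is hypothetical, and Python's `fnmatchcase` only approximates the rule's glob syntax (its `*` also matches `/`). The sketch also assumes that an empty include list places no restriction on the scan.

```python
from fnmatch import fnmatchcase


def in_scope(path: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a sibling file takes part in the scan."""
    # An exclude match always wins, even if an include pattern also matches.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # Assumption: an empty include list means "scan everything".
    if not include:
        return True
    return any(fnmatchcase(path, pattern) for pattern in include)
```

With `include=["docs/**"]` and `exclude=["docs/generated/**"]`, the path `docs/guide/intro.md` is in scope, while `docs/generated/api.md` is skipped because the exclude entry overrides the include match.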
# Generated sections
Paragraphs inside `<?include?>` and `<?catalog?>` directive bodies are
skipped automatically. This applies to the file being checked and to
every corpus file scanned for matches.
These paragraphs are copies of content owned by another file. Flagging them would produce false positives on any project that uses generated sections. The same skip applies during corpus indexing: a host file’s generated body is not added to the index. That prevents the original source file from matching its own text in a host’s generated copy.
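
A rough sketch of how skipping generated paragraphs on both sides avoids false positives. Everything here is illustrative: the `Paragraph` type, its `generated` flag, and the functions `build_index` and `check` are hypothetical names, and the index is assumed to be built only from files other than the one being checked.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    file: str
    line: int
    text: str                # assumed to be normalized already
    generated: bool = False  # True inside a generated directive body


def build_index(others: list[Paragraph]) -> dict[str, Paragraph]:
    """Index paragraphs from the other files in scope."""
    index: dict[str, Paragraph] = {}
    for para in others:
        if para.generated:
            continue  # a host's generated copy never enters the index
        index.setdefault(para.text, para)  # keep the first occurrence
    return index


def check(paragraphs: list[Paragraph], index: dict[str, Paragraph]) -> list[str]:
    """Report paragraphs of the checked file that also appear in the index."""
    messages = []
    for para in paragraphs:
        if para.generated:
            continue  # generated paragraphs in the checked file are skipped
        match = index.get(para.text)
        if match is not None:
            messages.append(f"paragraph duplicated in {match.file}:{match.line}")
    return messages
```

In this sketch, a source file whose text is pulled into a host's generated section produces no diagnostics when checked against that host, because the host's copy was never indexed. A hand-written copy in a third file still matches the source and is reported.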
# Performance
For each checked file, the rule reads every other Markdown file in scope
(.md and .markdown). A project with N Markdown files therefore performs
O(N²) reads. Small and medium corpora stay fast. For large corpora, add
an exclude entry for generated or vendored directories.
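
To give a feel for the quadratic growth, here is a back-of-the-envelope count. It assumes no caching between checks and that every file in scope is itself checked; `total_reads` is a hypothetical helper, not part of the tool.

```python
def total_reads(n_files: int) -> int:
    """Count file reads when each of n files reads the other n - 1."""
    return n_files * (n_files - 1)
```

Under these assumptions, 10 files cost 90 reads and 1,000 files cost 999,000. Trimming the scope from 1,000 to 600 files brings the count down to 359,400, which is why excluding generated or vendored directories pays off on large corpora.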
# Config

    rules:
      duplicated-content:
        include:
          - "docs/**"
        exclude:
          - "docs/generated/**"
        min-chars: 200

Disable:

    rules:
      duplicated-content: false

# Examples
# Good

    # Simple Fixture

    One short fixture sits alone in its folder and exists to exercise
    the duplicate detector. Every other rule stays quiet because the
    text is simple and brief. The paragraph holds enough characters to
    pass two hundred runes after normalization. Each sentence is plain
    and ends early. No other file here repeats this wording.

# Bad – duplicated paragraph

    # Duplicate Fixture

    A distinctive paragraph appears in this file and in a sibling
    fixture, so MDS037 must flag the match and point at the other
    location. The wording stays above the default two-hundred character
    threshold after normalization. It stays unique relative to the
    other rule fixtures so nothing matches by accident across the test
    suite.

# Bad – duplicated source

    # Source Fixture

    A distinctive paragraph appears in this file and in a sibling
    fixture, so MDS037 must flag the match and point at the other
    location. The wording stays above the default two-hundred character
    threshold after normalization. It stays unique relative to the
    other rule fixtures so nothing matches by accident across the test
    suite.

# Diagnostics
| Condition | Message |
|---|---|
| paragraph repeats | paragraph duplicated in {other}:{line} |
| invalid glob | duplicated-content: {include,exclude}: invalid glob pattern “{pat}”: … |
# Meta-Information
- ID: MDS037
- Name: duplicated-content
- Status: ready
- Default: disabled (opt-in via .mdsmith.yml); include: [], exclude: [], min-chars: 200
- Fixable: no
- Implementation: source
- Category: prose