mdsmith
    v0.52.0

    MDS037: duplicated-content

    Paragraphs should not repeat verbatim across Markdown files.

    # Settings

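    | Setting   | Type | Default | Description                                    |
    |-----------|------|---------|------------------------------------------------|
    | include   | list | []      | glob patterns limiting which siblings to scan  |
    | exclude   | list | []      | glob patterns of siblings to skip              |
    | min-chars | int  | 200     | minimum normalized paragraph length to compare |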

    Before comparing, the rule normalizes each paragraph. Whitespace runs collapse to single spaces. Letters become lowercase. Leading and trailing whitespace is trimmed. A paragraph whose normalized length is below min-chars runes is skipped; short stubs would otherwise produce noise.
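
The normalization steps above can be sketched roughly as follows. This is an illustrative Python sketch, not the rule's actual implementation; the function names `normalize` and `is_comparable` are hypothetical, and Python's `len()` on a string (Unicode code points) is used as a stand-in for the rune count.

```python
import re

MIN_CHARS = 200  # default taken from the Settings table


def normalize(paragraph: str) -> str:
    """Collapse whitespace runs to single spaces, lowercase, and trim."""
    collapsed = re.sub(r"\s+", " ", paragraph)
    return collapsed.lower().strip()


def is_comparable(paragraph: str, min_chars: int = MIN_CHARS) -> bool:
    """Return True when the normalized paragraph is long enough to compare.

    len() counts Unicode code points, which approximates the rune count
    mentioned in the rule description.
    """
    return len(normalize(paragraph)) >= min_chars
```

With the default threshold, a one-line stub such as a short disclaimer is never compared, while a full paragraph that differs only in case or line wrapping normalizes to the same string as its copy.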

    The rule walks RootFS when the project root is known. Otherwise it falls back to the file’s own directory. An include list narrows the scan to matching paths. An exclude entry takes precedence over an include match.
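
The include/exclude precedence can be illustrated with a small Python sketch. The helper `in_scope` is hypothetical, and `fnmatchcase` is only an approximation of the rule's glob matching (in `fnmatch`, `*` also matches `/`). Treating an empty include list as "no restriction" is an assumption based on the default `[]`.

```python
from fnmatch import fnmatchcase


def in_scope(path: str, include: list[str], exclude: list[str]) -> bool:
    """Decide whether a sibling file takes part in the scan."""
    # An exclude match wins, even if an include pattern also matches.
    if any(fnmatchcase(path, pat) for pat in exclude):
        return False
    # Assumption: an empty include list places no restriction on the scan.
    if not include:
        return True
    return any(fnmatchcase(path, pat) for pat in include)
```

For example, with `include: ["docs/**"]` and `exclude: ["docs/generated/**"]`, `docs/guide/intro.md` is scanned while `docs/generated/api.md` is skipped.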

    # Generated sections

    Paragraphs inside <?include?> and <?catalog?> directive bodies are skipped automatically. This applies to the file being checked and to every corpus file scanned for matches.

    These paragraphs are copies of content owned by another file. Flagging them would produce false positives on any project that uses generated sections. The same skip applies during corpus indexing: a host file’s generated body is not added to the index. That prevents the original source file from matching its own text in a host’s generated copy.
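
One way to picture the skip is to blank out directive regions before paragraphs are extracted. The Python sketch below is illustrative only. The marker shapes are assumptions, not taken from the text above: an opening line beginning with `<?include` or `<?catalog`, and a closing line `<?/include?>` or `<?/catalog?>`. The function name `strip_generated` is hypothetical.

```python
import re

# Assumed marker shapes (hypothetical): an opening line that starts with
# "<?include" or "<?catalog", and a closing line "<?/include?>" or
# "<?/catalog?>".
OPEN = re.compile(r"^\s*<\?(include|catalog)\b")
CLOSE = re.compile(r"^\s*<\?/(include|catalog)\?>\s*$")


def strip_generated(lines: list[str]) -> list[str]:
    """Blank out directive markers and bodies, keeping line numbers stable."""
    kept = []
    inside = False
    for line in lines:
        if not inside and OPEN.match(line):
            inside = True
            kept.append("")
        elif inside:
            if CLOSE.match(line):
                inside = False
            kept.append("")
        else:
            kept.append(line)
    return kept
```

Applying the same filter to the checked file and to every corpus file mirrors the behaviour described above: generated copies neither trigger diagnostics nor enter the index.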

    # Performance

    For each checked file, the rule reads every other Markdown file in scope (.md and .markdown). A project with N Markdown files therefore performs O(N²) reads. Small and medium corpora stay fast. For large corpora, add an exclude entry for generated or vendored directories.
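
To get a feel for the quadratic growth, the read count can be estimated with a one-line calculation. This Python sketch assumes every file in scope is checked and that no reads are cached between files; `corpus_reads` is a hypothetical helper.

```python
def corpus_reads(n_files: int) -> int:
    """Reads performed when each of n files scans every other file once."""
    return n_files * (n_files - 1)
```

Under these assumptions, 50 files cost 2,450 reads, while 2,000 files cost 3,998,000, which is why excluding large generated or vendored trees pays off quickly.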

    # Config

    rules:
      duplicated-content:
        include:
          - "docs/**"
        exclude:
          - "docs/generated/**"
        min-chars: 200

    Disable:

    rules:
      duplicated-content: false

    # Examples

    # Good

    # Simple Fixture
    
    One short fixture sits alone in its folder and exists to exercise
    the duplicate detector. Every other rule stays quiet because the
    text is simple and brief. The paragraph holds enough characters to
    pass two hundred runes after normalization. Each sentence is plain
    and ends early. No other file here repeats this wording.

    # Bad – duplicated paragraph

    # Duplicate Fixture
    
    A distinctive paragraph appears in this file and in a sibling
    fixture, so MDS037 must flag the match and point at the other
    location. The wording stays above the default two-hundred character
    threshold after normalization. It stays unique relative to the
    other rule fixtures so nothing matches by accident across the test
    suite.

    # Bad – duplicated source

    # Source Fixture
    
    A distinctive paragraph appears in this file and in a sibling
    fixture, so MDS037 must flag the match and point at the other
    location. The wording stays above the default two-hundred character
    threshold after normalization. It stays unique relative to the
    other rule fixtures so nothing matches by accident across the test
    suite.

    # Diagnostics

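    | Condition         | Message                                                                  |
    |-------------------|--------------------------------------------------------------------------|
    | paragraph repeats | paragraph duplicated in {other}:{line}                                   |
    | invalid glob      | duplicated-content: {include,exclude}: invalid glob pattern “{pat}”: … |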
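
The "paragraph repeats" diagnostic could be produced along these lines. This is a simplified Python sketch, not the actual implementation: it treats blank lines as paragraph separators, ignores generated sections, and takes the corpus as an in-memory mapping. The names `paragraphs` and `check` are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Collapse whitespace, lowercase, and trim."""
    return re.sub(r"\s+", " ", text).lower().strip()


def paragraphs(text: str):
    """Yield (start_line, paragraph) pairs; blank lines separate paragraphs."""
    start, buf = None, []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            if start is None:
                start = lineno
            buf.append(line)
        elif buf:
            yield start, "\n".join(buf)
            start, buf = None, []
    if buf:
        yield start, "\n".join(buf)


def check(path: str, corpus: dict[str, str], min_chars: int = 200):
    """Return (line, message) pairs for paragraphs of `path` found elsewhere."""
    # Index every sufficiently long paragraph of the other files.
    index = {}
    for other, text in corpus.items():
        if other == path:
            continue
        for lineno, para in paragraphs(text):
            key = normalize(para)
            if len(key) >= min_chars:
                index.setdefault(key, (other, lineno))
    # Look up each paragraph of the checked file in that index.
    results = []
    for lineno, para in paragraphs(corpus[path]):
        hit = index.get(normalize(para))
        if hit:
            results.append((lineno, f"paragraph duplicated in {hit[0]}:{hit[1]}"))
    return results
```

Because both sides are normalized before lookup, a copy that differs only in case or line wrapping still yields a message pointing at the other file and the line where its paragraph starts.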

    # Meta-Information

    • ID: MDS037
    • Name: duplicated-content
    • Status: ready
    • Default: disabled (opt-in via .mdsmith.yml); include: [], exclude: [], min-chars: 200
    • Fixable: no
    • Implementation: source
    • Category: prose