Guide · 7 min read

Markdown chunking for LLMs - a practical guide

Published 2026-06-09 · By the DocDigest team

You converted your PDFs to Markdown. Now what? Whether you're building a RAG pipeline, generating embeddings, or pasting context into Claude or ChatGPT, how you split that Markdown matters, often more than the model you pick.

Why chunk Markdown at all?

Three reasons: context windows (even with Gemini's 1M-token window, you pay for every token you send), retrieval quality (embeddings work best on focused, single-topic passages), and attention dilution (models tend to recall material buried in the middle of a long context less reliably, which is what "needle-in-a-haystack" tests probe).

Strategy 1: Fixed-size chunking

Split every N tokens (commonly 512 or 1024) with a 10-20% overlap. Simple, fast, framework-agnostic. The downside: it slices through headings, code blocks, and tables - discarding the structure that made Markdown useful.
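A minimal sketch of fixed-size windowing, assuming the text has already been tokenized. The helper name `fixed_size_windows` is hypothetical, and the whitespace split at the bottom only stands in for a real tokenizer; in practice you would pass the token IDs produced by your target model's tokenizer.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fixed_size_windows(tokens: Sequence[T], size: int = 512, overlap: int = 64) -> List[List[T]]:
    """Return consecutive windows of at most `size` tokens that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows: List[List[T]] = []
    for start in range(0, len(tokens), step):
        windows.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window reached the end of the input
    return windows


# Whitespace "tokens" are an assumption for illustration, not a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
for window in fixed_size_windows(words, size=4, overlap=1):
    print(" ".join(window))
```

The defaults (512 tokens, 64 shared) give a 12.5% overlap, inside the 10-20% range above. Note that nothing here looks at the content, which is exactly why headings, tables, and code blocks get cut.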

Strategy 2: Heading-based chunking

Split at ## boundaries (or ### for finer granularity). Each chunk becomes a semantically coherent section. Pair this with a max-token cap so a runaway section gets sub-split. This is the default we recommend in DocDigest.
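A sketch of heading-based splitting that skips `## ` lines inside code fences and records each section's heading path. The function name `split_by_heading` and the returned dictionary shape are assumptions made for illustration, not DocDigest's implementation, and the max-token cap is left out for brevity.

```python
import re
from typing import Dict, List, Optional, Tuple

FENCE = re.compile(r"^\s*(```|~~~)")
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_by_heading(md: str, level: int = 2) -> List[Dict[str, object]]:
    """Split at headings of depth <= level; each section records its heading path."""
    sections: List[Dict[str, object]] = []
    stack: List[Tuple[int, str]] = []   # currently open headings as (depth, title)
    lines: List[str] = []               # lines of the section being built
    path: List[str] = []                # heading path of that section
    fence: Optional[str] = None         # marker of the open code fence, if any

    def flush() -> None:
        text = "\n".join(lines).strip()
        if text:
            sections.append({"path": path, "text": text})

    for line in md.splitlines():
        fence_match = FENCE.match(line)
        if fence is None and fence_match:
            fence = fence_match.group(1)          # entering a fenced block
        elif fence is not None:
            if line.strip().startswith(fence):
                fence = None                      # leaving the fenced block
        else:
            heading = HEADING.match(line)
            if heading:
                depth, title = len(heading.group(1)), heading.group(2)
                while stack and stack[-1][0] >= depth:
                    stack.pop()                   # close deeper or sibling headings
                stack.append((depth, title))
                if depth <= level:
                    flush()
                    lines, path = [], [t for _, t in stack]
        lines.append(line)
    flush()
    return sections


doc = "# Guide\nIntro\n## Install\n~~~sh\n## not a heading\n~~~\n### Linux\napt\n## Usage\nRun it\n"
for section in split_by_heading(doc):
    print(" > ".join(section["path"]), "|", len(section["text"]), "chars")
```

Headings deeper than `level` stay inside their parent section, so a `###` subsection travels with the `##` section that owns it.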

Strategy 3: Recursive structural split

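LangChain's RecursiveCharacterTextSplitter, configured with Markdown separators, tries the coarsest boundary first (headings), then falls back to progressively finer ones (paragraphs, lines, words, and finally characters) until each piece fits the size limit. A robust default for mixed-quality inputs.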
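To make the coarse-to-fine idea concrete, here is a simplified, dependency-free sketch. It is not LangChain's code: the separator list, the `recursive_split` name, and the character-based size limit are all assumptions chosen for illustration.

```python
from typing import List, Sequence

# Coarse-to-fine separators (an assumed list, much shorter than a real splitter's).
SEPARATORS = ("\n## ", "\n\n", "\n", " ")


def recursive_split(text: str, max_chars: int = 200,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator; recurse with finer ones on oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator so headings keep their "## " marker.
    pieces = [pieces[0]] + [sep + p for p in pieces[1:]]
    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        if len(buffer) + len(piece) <= max_chars:
            buffer += piece                  # greedily merge small neighbours
            continue
        if buffer.strip():
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks


doc = "# Title\n\nIntro paragraph.\n## A\n\n" + "word " * 100 + "\n## B\n\nShort."
for chunk in recursive_split(doc, max_chars=120):
    print(len(chunk), repr(chunk[:30]))
```

Chunks keep their leading separator, so callers will usually want to strip whitespace before embedding. A production splitter also handles overlap and measures length in tokens rather than characters.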

Pitfalls to avoid

Splitting tables mid-row. Markdown tables become garbage if cut between rows. Always treat tables as atomic blocks.
Splitting code fences. Half a code block is worse than no code block. Detect ``` fences and keep them whole.
Losing the source. Every chunk should carry its source file name and heading path so the LLM (and your debugger) knows where it came from.
Counting characters, not tokens. One character is not one token, and the ratio varies by language and content. Use the actual tokenizer for your target model (tiktoken for OpenAI models, Anthropic's token-counting API for Claude).
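The first two pitfalls can be avoided by grouping lines into atomic blocks before any size-based splitting. The sketch below is one way to do that; `atomic_blocks` is a hypothetical helper, and it assumes tables are written with outer pipes and that a fence is closed by the same marker that opened it.

```python
import re
from typing import List, Optional

FENCE = re.compile(r"^\s*(```|~~~)")
TABLE_ROW = re.compile(r"^\s*\|.*\|\s*$")   # assumes pipe-delimited rows with outer pipes


def atomic_blocks(md: str) -> List[str]:
    """Group lines into blocks, keeping each code fence and each table in one piece."""
    blocks: List[str] = []
    current: List[str] = []
    fence: Optional[str] = None   # marker of the currently open fence, if any

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in md.splitlines():
        if fence is not None:
            current.append(line)            # blank lines inside a fence are kept
            if line.strip().startswith(fence):
                fence = None
                flush()                     # fence closed: emit it whole
            continue
        opening = FENCE.match(line)
        if opening:
            flush()
            fence = opening.group(1)
            current.append(line)
        elif not line.strip():
            flush()                         # blank line ends a paragraph or table
        elif TABLE_ROW.match(line):
            if current and not TABLE_ROW.match(current[-1]):
                flush()                     # table starts right after a paragraph line
            current.append(line)
        else:
            if current and TABLE_ROW.match(current[-1]):
                flush()                     # paragraph starts right after a table
            current.append(line)
    flush()                                 # also emits an unterminated fence as-is
    return blocks


doc = "Intro text.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\n~~~py\nx = 1\n\ny = 2\n~~~\nAfter."
for block in atomic_blocks(doc):
    print(repr(block))
```

A chunker can then pack whole blocks into a token budget and only fall back to cutting inside a block when a single table or code fence exceeds the cap on its own.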

A reasonable default

For most RAG workloads: heading-based chunks capped at 800 tokens with 100-token overlap, tables and code blocks preserved atomically, each chunk prefixed with its source path. This is the approach DocDigest is built for: it emits Markdown and JSON digests with source headers that you can chunk with the strategy above.

chunk_markdown.py

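import re
import tiktoken

# Tokenizer for the target model; swap the model name to match yours.
enc = tiktoken.encoding_for_model("gpt-4o")

def chunk_markdown(md, max_tokens=800, overlap=100):
    """Split Markdown at H2 headings, windowing oversized sections by tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    # Split at H2 boundaries so each chunk is one coherent section.
    # Caveat: this also matches lines starting with "## " inside code fences.
    sections = re.split(r"(?m)^(?=## )", md)
    chunks = []
    for sec in sections:
        sec = sec.strip()
        if not sec:
            continue
        toks = enc.encode(sec)
        if len(toks) <= max_tokens:
            chunks.append(sec)
            continue
        # Long section: window it with overlap.
        for i in range(0, len(toks), step):
            chunks.append(enc.decode(toks[i : i + max_tokens]))
            if i + max_tokens >= len(toks):
                break  # window reached the end; skip a tail already covered
    return chunks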

Skip the chunking code

DocDigest converts PDFs, DOCX, and folders into a single Markdown digest with token counts and source headers - ready to chunk.
