Guide · 7 min read
Markdown chunking for LLMs — a practical guide
Published 2026-06-09 · By the DocDigest team
You converted your PDFs to Markdown. Now what? Whether you're building a RAG pipeline, generating embeddings, or pasting context into Claude or GPT-4o, how you split that Markdown matters — often more than the model you pick.
Why chunk Markdown at all?
Three reasons: context windows (even with Gemini's 1M-token window, you pay per token), retrieval quality (embeddings work best on focused, single-topic passages), and attention dilution (models tend to recall facts buried in the middle of a long context less reliably, which is what "needle-in-a-haystack" tests measure).
Strategy 1: Fixed-size chunking
Split every N tokens (commonly 512 or 1024) with a 10–20% overlap. Simple, fast, framework-agnostic. The downside: it slices through headings, code blocks, and tables — destroying exactly the structure that made Markdown useful in the first place.
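A minimal sketch of the sliding window described above, assuming the text has already been tokenized into a list. The function name `fixed_size_chunks` and the whitespace split used as a stand-in tokenizer are illustrative; in practice you would pass token IDs from your model's tokenizer.

```python
from typing import List


def fixed_size_chunks(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, size=4, overlap=1):
    print(chunk)
# ['one', 'two', 'three', 'four']
# ['four', 'five', 'six', 'seven']
# ['seven', 'eight', 'nine', 'ten']
```

Note that the window boundaries depend only on position, which is exactly why this approach can cut a table or code block in half.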
Strategy 2: Heading-based chunking
Split at ## boundaries (or ### for finer granularity). Each chunk becomes a semantically coherent section. Pair this with a max-token cap so a runaway section gets sub-split. This is the default we recommend in DocDigest.
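One way this could look in code, as a rough sketch rather than DocDigest's actual implementation. The helpers `split_by_heading`, `cap_section`, and `count_words` are hypothetical names, and word counting stands in for a real tokenizer. Oversized sections are sub-split on blank lines, so this sketch does not yet protect code blocks or tables inside a runaway section.

```python
import re
from typing import Callable, List

HEADING = re.compile(r"^#{1,2} ")  # split at # and ## headings


def count_words(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_by_heading(markdown: str) -> List[str]:
    """Start a new section at every # or ## heading outside a code fence."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence and HEADING.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def cap_section(section: str, max_tokens: int,
                count: Callable[[str], int] = count_words) -> List[str]:
    """Sub-split an oversized section on paragraph boundaries."""
    if count(section) <= max_tokens:
        return [section]
    pieces: List[str] = []
    current: List[str] = []
    for para in section.split("\n\n"):
        candidate = "\n\n".join(current + [para])
        if current and count(candidate) > max_tokens:
            pieces.append("\n\n".join(current))
            current = [para]
        else:
            current.append(para)  # a single oversized paragraph stays whole
    if current:
        pieces.append("\n\n".join(current))
    return pieces
```

Typical use would be `[p for s in split_by_heading(doc) for p in cap_section(s, 800)]`, swapping `count_words` for a function backed by your target model's tokenizer.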
Strategy 3: Recursive structural split
LangChain's RecursiveCharacterTextSplitter, configured with Markdown separators, tries heading boundaries first, then paragraph breaks, then line breaks, then spaces, and finally individual characters. It is a robust default for mixed-quality inputs.
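To make the fallback idea concrete, here is a small dependency-free sketch of a recursive split. It is not LangChain's implementation: the separator list, the `recursive_split` name, and the use of character counts as the size limit are all assumptions chosen for illustration. Separators are written as zero-width regex patterns so that no text is lost at the cut points.

```python
import re
from typing import List

# Coarsest to finest: before a heading, after a blank line, after a newline, after a space.
SEPARATORS = [r"(?=\n#{1,6} )", r"(?<=\n\n)", r"(?<=\n)", r"(?<= )"]


def recursive_split(text: str, max_chars: int,
                    separators: List[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator; recurse with finer ones for oversized parts."""
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing finer left: fall back to hard character cuts.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pattern, finer = separators[0], separators[1:]
    chunks: List[str] = []
    buffer = ""
    for part in re.split(pattern, text):
        if len(buffer) + len(part) <= max_chars:
            buffer += part  # merge small neighbours into one chunk
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(part) <= max_chars:
            buffer = part
        else:
            chunks.extend(recursive_split(part, max_chars, finer))
    if buffer:
        chunks.append(buffer)
    return chunks
```

Because the patterns are zero-width, joining the returned chunks reproduces the input exactly, which makes the splitter easy to sanity-check.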
Pitfalls to avoid
- Splitting tables mid-row. Markdown tables become garbage if cut between rows. Always treat tables as atomic blocks.
- Splitting code fences. Half a code block is worse than no code block. Detect ``` fences and keep them whole.
- Losing the source. Every chunk should carry its source file name and heading path so the LLM (and your debugger) knows where it came from.
- Counting characters, not tokens. 1 char ≠ 1 token. Use the actual tokenizer for your target model (tiktoken for OpenAI models, Anthropic's token-counting API for Claude).
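The first two pitfalls can be avoided by segmenting the document into blocks before any size-based splitting happens. The sketch below is one simple way to do that, under stated assumptions: fences are opened and closed by lines starting with three backticks, and a table is a run of consecutive lines with no blank line inside it. The name `atomic_blocks` is illustrative.

```python
from typing import List


def atomic_blocks(markdown: str) -> List[str]:
    """Split Markdown into blocks, keeping fenced code and tables whole."""
    blocks: List[str] = []
    current: List[str] = []
    in_fence = False

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            if not in_fence:
                flush()               # the opening fence starts a new block
                current.append(line)
                in_fence = True
            else:
                current.append(line)
                in_fence = False
                flush()               # the closing fence ends the block
        elif in_fence:
            current.append(line)      # blank lines inside a fence stay put
        elif stripped == "":
            flush()                   # a blank line ends a paragraph or table
        else:
            current.append(line)
    flush()
    return blocks
```

A chunker can then pack whole blocks into chunks up to the token limit and never cut inside one. Tilde fences and longer backtick fences are not handled here.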
A reasonable default
For most RAG workloads: heading-based chunks capped at 800 tokens with 100-token overlap, tables and code blocks preserved atomically, each chunk prefixed with its source path. This is what DocDigest emits when you enable RAG export on a digest.
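As an illustration of the "prefixed with its source path" part of that default, the sketch below tracks the heading trail while walking the document and attaches it to each chunk. It is not DocDigest's export format: the `Chunk` class, the `chunks_with_provenance` function, and the bracketed prefix layout are hypothetical, and the 800-token cap and overlap are left out to keep the example short.

```python
import re
from dataclasses import dataclass
from typing import List

HEADING = re.compile(r"^(#{1,6}) +(.*)$")


@dataclass
class Chunk:
    source: str
    heading_path: List[str]
    text: str

    def render(self) -> str:
        """Return the chunk text prefixed with its provenance (hypothetical format)."""
        path = " > ".join(self.heading_path) or "(no heading)"
        return f"[source: {self.source} | section: {path}]\n{self.text}"


def chunks_with_provenance(markdown: str, source: str) -> List[Chunk]:
    chunks: List[Chunk] = []
    path: List[str] = []   # current heading trail, outermost first
    body: List[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(source, list(path), text))
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                     # close the previous section
            level = len(match.group(1))
            del path[level - 1:]        # drop headings at this level or deeper
            path.append(match.group(2).strip())
            continue                    # the heading lives in the path, not the body
        body.append(line)
    flush()
    return chunks
```

Calling `render()` on each chunk before embedding keeps the file name and section trail attached to the text, so a retrieved passage can be traced back to where it came from.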
Skip the chunking code
DocDigest converts PDFs, DOCX, and folders into a single Markdown digest with token counts, source headers, and optional RAG chunks — ready to embed.