Markdown chunking for LLMs — a practical guide

Published 2026-06-09 · By the DocDigest team

You converted your PDFs to Markdown. Now what? Whether you're building a RAG pipeline, generating embeddings, or pasting context into Claude or GPT-4o, how you split that Markdown matters — often more than the model you pick.

Why chunk Markdown at all?

Three reasons: context windows (even with Gemini's 1M-token window, you pay per token), retrieval quality (embeddings work best on focused, single-topic passages), and attention dilution (models tend to recall facts buried in the middle of a long context less reliably, which is the weakness "needle in a haystack" tests probe).

Strategy 1: Fixed-size chunking

Split every N tokens (commonly 512 or 1024) with a 10–20% overlap. Simple, fast, framework-agnostic. The downside: it slices through headings, code blocks, and tables — destroying exactly the structure that made Markdown useful in the first place.
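To make the sliding window concrete, here is a minimal sketch. It assumes the text is already tokenized; the function name fixed_size_chunks and the whitespace "tokens" in the demo are our own stand-ins, so swap in your target model's tokenizer for real counts.

```python
from typing import List, Sequence


def fixed_size_chunks(tokens: Sequence[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Slide a window of `size` tokens, stepping forward by `size - overlap`."""
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer here.
words = "one two three four five six seven eight nine ten".split()
demo = fixed_size_chunks(words, size=4, overlap=1)
```

With size=4 and overlap=1, each chunk repeats the last token of the previous one. Nothing in the loop looks at headings, fences, or tables, which is exactly the weakness described above.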

Strategy 2: Heading-based chunking

Split at ## boundaries (or ### for finer granularity). Each chunk becomes a semantically coherent section. Pair this with a max-token cap so a runaway section gets sub-split. This is the default we recommend in DocDigest.
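A rough sketch of that idea follows. It is not DocDigest's actual code: split_by_heading and cap_section are hypothetical helpers, the default token counter is a whitespace word count used purely as a placeholder, and the sub-split packs paragraphs greedily at blank lines.

```python
import re
from typing import Callable, List

HEADING = re.compile(r"^## ")      # H2 only; r"^#{2,3} " would also split at ###
FENCE = re.compile(r"^(```|~~~)")  # lines that open or close a fenced code block


def split_by_heading(markdown: str) -> List[str]:
    """Start a new section at every ## line that is not inside a code fence."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]


def cap_section(section: str, max_tokens: int,
                count_tokens: Callable[[str], int] = lambda s: len(s.split())) -> List[str]:
    """Sub-split an oversized section at blank lines, packing paragraphs greedily."""
    if count_tokens(section) <= max_tokens:
        return [section]
    pieces: List[str] = []
    current = ""
    for para in section.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if current and count_tokens(candidate) > max_tokens:
            pieces.append(current)
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


doc = "# Title\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n```\n## not a heading\n```\n\n## Usage\n\nRun it."
sections = split_by_heading(doc)
capped = cap_section("## A\n\none two three\n\nfour five six", max_tokens=5)
```

Two limits of this sketch are worth knowing: a single paragraph larger than the cap is left whole, and the blank-line sub-split does not protect code fences that contain blank lines.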

Strategy 3: Recursive structural split

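LangChain's RecursiveCharacterTextSplitter, configured with Markdown separators, tries heading boundaries first, then falls back to progressively finer separators (paragraph breaks, line breaks, spaces, and finally individual characters) until each piece fits the size limit. A robust default for mixed-quality inputs.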
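The coarse-to-fine idea can be illustrated without any dependencies. The sketch below is our own simplified version, not LangChain's implementation: recursive_split, the SPLIT_POINTS list, and the character-based size limit are all assumptions chosen for the example.

```python
import re
from typing import List, Sequence

# Zero-width split points, coarse to fine: H2, H3, paragraph, line, word.
# Because they consume no characters, joining the chunks restores the input.
SPLIT_POINTS = [r"(?=\n## )", r"(?=\n### )", r"(?<=\n\n)", r"(?<=\n)", r"(?<= )"]


def recursive_split(text: str, max_chars: int,
                    levels: Sequence[str] = SPLIT_POINTS) -> List[str]:
    if len(text) <= max_chars:
        return [text]
    if not levels:
        # Nothing structural left: fall back to a hard character cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    pieces = [p for p in re.split(levels[0], text) if p]
    chunks: List[str] = []
    buffer = ""
    for piece in pieces:
        if len(buffer) + len(piece) <= max_chars:
            buffer += piece  # greedily pack small neighbours together
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # Still too large: retry this piece with the next, finer level.
            chunks.extend(recursive_split(piece, max_chars, levels[1:]))
    if buffer:
        chunks.append(buffer)
    return chunks


text = "# T\n\nintro words here\n## A\n\nalpha beta gamma delta\n## B\n\nshort"
chunks = recursive_split(text, 30)
```

On this sample every chunk fits the limit at the heading level, so no finer level is needed. A heading-free blob of text falls through all the way to the hard cut.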

Pitfalls to avoid

  • Splitting tables mid-row. Markdown tables become garbage if cut between rows. Always treat tables as atomic blocks.
  • Splitting code fences. Half a code block is worse than no code block. Detect ``` fences and keep them whole.
  • Losing the source. Every chunk should carry its source file name and heading path so the LLM (and your debugger) knows where it came from.
  • Counting characters, not tokens. 1 char ≠ 1 token. Use the actual tokenizer for your target model (for example, tiktoken for OpenAI models, or the token-counting endpoint in Anthropic's API for Claude).
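One way to respect the first two pitfalls is to segment the document into blocks before any size-based splitting, so that a chunker only ever cuts between blocks. The sketch below is illustrative: markdown_blocks is a hypothetical helper, and it assumes fences use triple backticks and that every table row starts with a pipe character.

```python
from typing import List, Tuple


def markdown_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Return (kind, text) blocks; fenced code and tables are never divided."""
    blocks: List[Tuple[str, str]] = []
    buf: List[str] = []
    kind = None

    def flush() -> None:
        nonlocal buf, kind
        if buf:
            blocks.append((kind, "\n".join(buf)))
        buf, kind = [], None

    for line in markdown.splitlines():
        stripped = line.strip()
        if kind == "code":
            buf.append(line)  # blank lines inside a fence stay in the block
            if stripped.startswith("```"):
                flush()       # the closing fence ends the code block
        elif stripped.startswith("```"):
            flush()
            kind, buf = "code", [line]
        elif stripped.startswith("|"):
            if kind != "table":
                flush()
                kind = "table"
            buf.append(line)
        elif not stripped:
            flush()           # a blank line ends a paragraph or table
        else:
            if kind != "text":
                flush()
                kind = "text"
            buf.append(line)
    flush()
    return blocks


sample = ("Intro line.\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\n"
          "```python\nx = 1\n\nprint(x)\n```\nAfter.")
blocks = markdown_blocks(sample)
kinds = [k for k, _ in blocks]
```

A chunker built on top of this would pack whole blocks up to the token cap and only fall back to splitting inside "text" blocks.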

A reasonable default

For most RAG workloads: heading-based chunks capped at 800 tokens with 100-token overlap, tables and code blocks preserved atomically, each chunk prefixed with its source path. This is what DocDigest emits when you enable RAG export on a digest.
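The source-prefix part of that default might look something like the sketch below. The header format, the chunks_with_source name, and the manual.md file name are invented for illustration and are not necessarily what DocDigest emits; the token cap and overlap are left out to keep the example short.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunks_with_source(markdown: str, source: str) -> List[Dict[str, str]]:
    """Emit one chunk per heading, prefixed with its file name and heading path."""
    trail: List[str] = []  # headings from the document root down to here
    chunks: List[Dict[str, str]] = []
    body: List[str] = []
    in_fence = False

    def emit() -> None:
        text = "\n".join(body).strip()
        if text:
            heading_path = " > ".join(trail)
            header = f"[source: {source} | section: {heading_path}]"  # hypothetical format
            chunks.append({"source": source, "heading_path": heading_path,
                           "text": f"{header}\n{text}"})
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            emit()                   # close the previous section under the old trail
            level = len(match.group(1))
            del trail[level - 1:]    # drop headings at this depth or deeper
            trail.append(match.group(2))
        else:
            body.append(line)
    emit()
    return chunks


doc = ("# Guide\n\nIntro.\n\n## Install\n\nRun the installer.\n\n"
       "### Linux\n\nUse apt.\n\n## Usage\n\nCall it.")
result = chunks_with_source(doc, "manual.md")
paths = [c["heading_path"] for c in result]
```

Keeping the path both in the chunk text and as separate metadata fields means the model sees it at inference time and your retrieval layer can filter on it.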

Skip the chunking code

DocDigest converts PDFs, DOCX, and folders into a single Markdown digest with token counts, source headers, and optional RAG chunks — ready to embed.