RAG document processing — from PDF to embeddings
Published 2026-06-09 · By the DocDigest team
Retrieval-augmented generation lives or dies on document quality. Pick a great LLM and a great vector DB, then feed them messy PDF text, and your answers will be confidently wrong. Here's the pipeline we recommend for production RAG, learned from shipping DocDigest.
1. Parse with layout awareness
Don't use pdf2txt or PyPDF for anything you care about. They strip layout, mangle tables, and lose reading order on multi-column documents. Use a layout-aware parser like Docling, Unstructured, or Marker. The output should preserve headings, lists, tables, and code blocks — ideally as Markdown.
2. Clean and normalize
- Strip repeated headers and footers (page numbers, "CONFIDENTIAL" stamps).
- De-hyphenate words broken across line ends.
- Collapse multiple blank lines.
- Normalize whitespace in tables so embedding models see consistent input.
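The four cleanup steps above can be sketched in plain Python. This is a minimal illustration, not DocDigest's actual implementation: the function name `clean_pages`, the `repeat_ratio` threshold, and the assumption that the parser hands you one string per page are all ours.

```python
import re
from collections import Counter


def _key(line: str) -> str:
    # Mask digits so "Page 3" and "Page 4" count as the same repeated line.
    return re.sub(r"\d+", "#", line.strip())


def _normalize_table_row(line: str) -> str:
    # Assumes simple pipe tables with no escaped "|" inside cells.
    cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
    return "| " + " | ".join(cells) + " |"


def clean_pages(pages: list[str], repeat_ratio: float = 0.6) -> str:
    """Clean per-page text and return one normalized document (hypothetical helper)."""
    split_pages = [page.splitlines() for page in pages]

    # 1. Find lines that open or close most pages (headers, footers, stamps).
    counts = Counter()
    for lines in split_pages:
        non_empty = [line for line in lines if line.strip()]
        counts.update({_key(line) for line in non_empty[:1] + non_empty[-1:]})
    threshold = max(2, repeat_ratio * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    # Strip those lines, but only from the top and bottom edge of each page.
    kept = []
    for lines in split_pages:
        body = list(lines)
        while body and (not body[0].strip() or _key(body[0]) in repeated):
            body.pop(0)
        while body and (not body[-1].strip() or _key(body[-1]) in repeated):
            body.pop()
        kept.extend(body)
    text = "\n".join(kept)

    # 2. Rejoin words hyphenated across line ends. Caveat: this also merges
    #    genuine compounds that happen to break at the hyphen.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)

    # 3. Normalize table rows; trim trailing whitespace everywhere else.
    lines = [
        _normalize_table_row(line) if line.lstrip().startswith("|") else line.rstrip()
        for line in text.split("\n")
    ]

    # 4. Collapse runs of blank lines into a single blank line.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()
```

Restricting header and footer removal to page edges is deliberate: a stamp like "CONFIDENTIAL" can also appear legitimately in body text, and the sketch leaves those occurrences alone.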
3. Chunk semantically, not arbitrarily
Heading-based chunks (split on ##) capped at ~800 tokens typically beat fixed-size splitting on retrieval benchmarks. Keep tables and code blocks atomic: never cut them mid-block. We wrote a deeper breakdown in "Markdown chunking for LLMs".
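Here is one way heading-based chunking with atomic blocks might look. Treat it as a sketch under stated assumptions: `chunk_markdown` and `split_blocks` are hypothetical names, and `count_tokens` approximates tokens with whitespace-separated words, so in practice you would swap in your embedding model's tokenizer.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for real tokenizer counts.
    return len(text.split())


def split_blocks(markdown: str) -> list[str]:
    """Split Markdown into blocks. Fenced code and contiguous table rows stay whole."""
    blocks, current, in_fence = [], [], False

    def flush() -> None:
        if current:
            blocks.append("\n".join(current))
            current.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence      # opening or closing fence
            current.append(line)
        elif in_fence:
            current.append(line)         # keep everything inside a fence together
        elif not line.strip():
            flush()                      # a blank line ends the current block
        else:
            if line.startswith("## "):
                flush()                  # a heading always starts a new block
            current.append(line)
    flush()
    return blocks


def chunk_markdown(markdown: str, max_tokens: int = 800) -> list[str]:
    """Start a new chunk at each '## ' heading or when the token cap would be exceeded."""
    chunks, current, size = [], [], 0
    for block in split_blocks(markdown):
        n = count_tokens(block)
        if current and (block.startswith("## ") or size + n > max_tokens):
            chunks.append("\n\n".join(current))
            current, size = [], 0
        # A single block larger than max_tokens is emitted whole, never cut.
        current.append(block)
        size += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Because fences are tracked line by line, a `## ` line inside a code block is not mistaken for a heading. The trade-off is that an oversized table or code block produces a chunk above the cap, which is usually preferable to splitting it.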
4. Enrich each chunk with metadata
Every chunk should carry: source filename, heading path (Item 7 > Risk Factors), page number, and a stable chunk ID. This metadata enables filtering, citation, and debugging — and you'll thank yourself the first time a user asks "where did this answer come from?"
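As an illustration, the metadata fields listed above fit naturally into a small dataclass. The `Chunk` class and its field names are hypothetical, and deriving the ID from a content hash is one possible choice among several.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str                    # source filename
    heading_path: tuple[str, ...]  # e.g. ("Item 7", "Risk Factors")
    page: int

    @property
    def chunk_id(self) -> str:
        # Hash source, heading path, and text so re-running the pipeline on
        # unchanged input yields the same ID. The page number is left out so
        # re-pagination alone does not change it.
        raw = "\x1f".join([self.source, " > ".join(self.heading_path), self.text])
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

    def to_record(self) -> dict:
        # Flat dict suitable for storing next to the vector.
        record = asdict(self)
        record["heading_path"] = " > ".join(self.heading_path)
        record["chunk_id"] = self.chunk_id
        return record


chunk = Chunk(
    text="Our business is subject to interest-rate risk.",
    source="10-K.pdf",
    heading_path=("Item 7", "Risk Factors"),
    page=42,
)
record = chunk.to_record()
```

A content-derived ID changes whenever the chunk text changes. That is handy for detecting stale embeddings, but if you need IDs that survive edits, a positional scheme (source plus heading path plus index) may suit you better.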
5. Embed with a model that matches your domain
For general English, OpenAI's text-embedding-3-large or Voyage's voyage-3 are strong defaults. For code, use a code-specific embedder. For multilingual corpora, consider Cohere's embed-multilingual-v3.0. Always benchmark on your own queries, because published MTEB scores don't reliably predict performance on your domain.
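Benchmarking on your own queries can be as simple as measuring recall@k for each candidate embedder. The sketch below assumes you wrap each model in a function that maps text to a vector. `recall_at_k` and `toy_embed` are hypothetical names, and `toy_embed` is only a placeholder so the example runs without an API key.

```python
import math
from typing import Callable, Sequence

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recall_at_k(
    embed: Embedder,
    chunks: dict[str, str],          # chunk_id -> chunk text
    queries: list[tuple[str, str]],  # (query, id of the chunk that answers it)
    k: int = 5,
) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results."""
    vectors = {chunk_id: embed(text) for chunk_id, text in chunks.items()}
    hits = 0
    for query, relevant_id in queries:
        q = embed(query)
        ranked = sorted(vectors, key=lambda cid: cosine(q, vectors[cid]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(queries)


def toy_embed(text: str) -> list[float]:
    # Placeholder embedder (letter counts). Replace with a call to a real model.
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]
```

Run `recall_at_k` once per candidate model over the same chunks and queries, then compare the numbers. Brute-force cosine search is fine at benchmark scale and keeps the vector database out of the comparison.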
6. Retrieve with hybrid search
Dense vector search alone misses exact-match queries (product codes, function names, error strings). Combine dense embeddings with BM25 keyword search and a re-ranker like Cohere Rerank or bge-reranker. The lift is usually 10–20% on real-world benchmarks.
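The text above does not prescribe how to combine the two result lists. One common option is reciprocal rank fusion (RRF), sketched here alongside a compact BM25 scorer. Both functions are illustrative: the whitespace tokenizer, the parameter defaults, and the hard-coded dense ranking are assumptions, and the re-ranking step is not shown.

```python
import math
from collections import Counter, defaultdict


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank doc IDs by BM25 score using naive lowercase whitespace tokens."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores the sum of 1 / (k + rank) over all lists."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = {
    "d1": "error code E4012 raised on timeout",
    "d2": "how to configure retries and backoff",
    "d3": "general troubleshooting guide",
}
keyword_ranking = bm25_rank("E4012", docs)  # the exact-match query finds d1
dense_ranking = ["d2", "d3", "d1"]          # stand-in for a vector search result
candidates = reciprocal_rank_fusion([dense_ranking, keyword_ranking])
```

RRF only needs ranks, so you never have to put BM25 scores and cosine similarities on a common scale. The fused `candidates` list is what you would hand to a re-ranker.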
7. Evaluate, don't vibe-check
Build a small (50–100 question) eval set with known correct answers from your corpus. Run it after every pipeline change. RAGAS and TruLens automate the metrics. Without an eval, you cannot tell whether your "improvement" actually improved anything.
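A first eval harness can be very small. The sketch below assumes your pipeline is exposed as a function from question to answer string. `run_eval` and `check_regression` are hypothetical names, and substring matching is a deliberately crude stand-in for the richer metrics RAGAS or TruLens compute.

```python
from typing import Callable


def run_eval(answer_fn: Callable[[str], str], eval_set: list[dict]) -> dict:
    """Score a pipeline on items shaped like {"question": ..., "expected": ...}."""
    failures = []
    for item in eval_set:
        answer = answer_fn(item["question"])
        # Crude check: the expected fact must appear verbatim (case-insensitive).
        if item["expected"].lower() not in answer.lower():
            failures.append(item["question"])
    accuracy = (len(eval_set) - len(failures)) / len(eval_set)
    return {"accuracy": accuracy, "failures": failures}


def check_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the new score is within `tolerance` of the stored baseline or better."""
    return current >= baseline - tolerance


eval_set = [
    {"question": "What is the refund window?", "expected": "30 days"},
    {"question": "Who signs the 10-K?", "expected": "CEO"},
]

# Stand-in for a real RAG pipeline, so the example runs on its own.
canned_answers = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Who signs the 10-K?": "I don't know.",
}
report = run_eval(lambda question: canned_answers[question], eval_set)
```

Store the accuracy from your last accepted pipeline version as the baseline and call `check_regression` after each change. The `failures` list tells you which questions to inspect first.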
DocDigest handles steps 1–3 for you
Upload PDFs, DOCX, or folders. Get a Docling-parsed, cleaned, heading-chunked Markdown digest with source metadata — ready to embed.