Guide · 9 min read

RAG document processing - from PDF to embeddings

Published 2026-06-09 · By the DocDigest team

Retrieval-augmented generation lives or dies on document quality. Pick a great LLM and a great vector DB, then feed them messy PDF text, and your answers will be wrong. Here's the pipeline we recommend for production RAG, learned from shipping DocDigest.

rag_pipeline.py

from docling.document_converter import DocumentConverter

# NOTE: chunk_markdown, embed, dense_search, keyword_search, and rerank are
# placeholders for your own implementations; they are not part of Docling.

# 1-2. Parse with layout awareness, export Markdown (run your cleanup on md)
md = DocumentConverter().convert("report.pdf").document.export_to_markdown()

# 3. Chunk on headings, capped at 800 tokens
chunks = chunk_markdown(md, max_tokens=800, overlap=100)

# 4-5. Enrich with metadata, then embed
records = [
    {"text": c, "source": "report.pdf", "vector": embed(c)}
    for c in chunks
]

# 6. Hybrid retrieve (dense + keyword), then re-rank
query = "What are the main risk factors?"  # example query
hits = rerank(query, dense_search(query) + keyword_search(query))

1. Parse with layout awareness

Don't use pdf2txt or PyPDF for anything you care about. They strip layout, mangle tables, and lose reading order on multi-column documents. Use a layout-aware parser like Docling, Unstructured, or Marker. The output should preserve headings, lists, tables, and code blocks - ideally as Markdown.

2. Clean and normalize

Strip repeated headers and footers (page numbers, "CONFIDENTIAL" stamps).
De-hyphenate words broken across line ends.
Collapse multiple blank lines.
Normalize whitespace in tables so embedding models see consistent input.
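
The cleanup rules above can be sketched in a few lines of standard-library Python. This is a minimal sketch, not DocDigest's actual implementation: the function name clean_pages, the per-page input format, and the 60% repeat threshold are all assumptions made for illustration. It treats any line that recurs on most pages (with digits masked, so "Page 1" and "Page 2" match) as a header or footer.

```python
import re
from collections import Counter


def clean_pages(pages: list[str], min_repeat_ratio: float = 0.6) -> str:
    """Drop repeated header/footer lines, de-hyphenate, and collapse blank lines."""
    page_lines = [page.splitlines() for page in pages]

    def key(line: str) -> str:
        # Mask digits so "Page 3" and "Page 4" count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    # Count on how many pages each non-blank line appears.
    counts: Counter = Counter()
    for lines in page_lines:
        counts.update({key(line) for line in lines if line.strip()})

    # Hypothetical threshold: a line seen on at least 60% of pages (and on 2+ pages).
    threshold = max(2, int(len(pages) * min_repeat_ratio))
    repeated = {k for k, n in counts.items() if n >= threshold}

    kept = []
    for lines in page_lines:
        kept.extend(line for line in lines if key(line) not in repeated)
    text = "\n".join(kept)

    # Join words split across a line end: "implemen-\ntation" -> "implementation".
    # Caveat: this also joins genuine compounds that happen to break at a hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Strip trailing whitespace, then collapse 3+ newlines into one blank line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The de-hyphenation rule is deliberately naive; if your corpus is full of hyphenated compounds, a dictionary check before joining is probably worth the extra work.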

3. Chunk semantically, not arbitrarily

Heading-based chunking (split on ##), capped at ~800 tokens, beats fixed-size splitting in almost every retrieval benchmark. Keep tables and code blocks atomic - never cut them mid-block. We wrote a deeper breakdown in Markdown chunking for LLMs.
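
Here is one way heading-based chunking with atomic blocks might look. Treat it as a sketch under stated assumptions: this chunk_markdown is a hypothetical implementation written for this guide, it approximates tokens by counting whitespace-separated words (swap in your embedder's tokenizer), and it leaves out chunk overlap. Blocks are separated by blank lines, so a table or a fenced code block always travels as one unit, even when that pushes a chunk over the cap.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def split_blocks(md: str) -> list[str]:
    """Split Markdown into blank-line-separated blocks, keeping fenced code whole."""
    blocks, current, in_fence = [], [], False
    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        if line.strip() == "" and not in_fence:
            # A blank line outside a fence ends the current block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def count_tokens(text: str) -> int:
    # Assumption: word count stands in for a real tokenizer.
    return len(text.split())


def chunk_markdown(md: str, max_tokens: int = 800) -> list[str]:
    """Start a chunk at each '## ' heading; split further only between blocks."""
    chunks: list[str] = []
    current: list[str] = []
    size = 0

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n\n".join(current))
        current, size = [], 0

    for block in split_blocks(md):
        n = count_tokens(block)
        # New chunk at every "## " heading, or when adding the block would exceed the cap.
        if block.startswith("## ") or (current and size + n > max_tokens):
            flush()
        current.append(block)
        size += n
    flush()
    return chunks
```

An oversized table or code block becomes its own chunk rather than being cut, which trades a strict token cap for intact structure.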

4. Enrich each chunk with metadata

Every chunk should carry: source filename, heading path (Item 7 > Risk Factors), page number, and a stable chunk ID. This metadata enables filtering, citation, and debugging - and you'll thank yourself the first time a user asks "where did this answer come from?"
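
As an illustration, the metadata above fits in a small record type. The ChunkRecord class, the make_record helper, and the hashing scheme are assumptions for this sketch rather than a prescribed format. The chunk ID is a hash of the source, heading path, and text, so re-running the pipeline on unchanged content produces the same ID.

```python
import hashlib
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source: str        # source filename
    heading_path: str  # e.g. "Item 7 > Risk Factors"
    page: int
    chunk_id: str


def make_record(text: str, source: str, heading_path: str, page: int) -> ChunkRecord:
    """Build a record whose ID depends only on source, heading path, and text."""
    # The page number is left out of the hash so re-pagination does not change IDs.
    payload = "\x1f".join([source, heading_path, text]).encode("utf-8")
    chunk_id = hashlib.sha256(payload).hexdigest()[:16]
    return ChunkRecord(text, source, heading_path, page, chunk_id)


record = make_record(
    "Market volatility may affect results.",
    "report.pdf",
    "Item 7 > Risk Factors",
    page=42,
)
metadata = asdict(record)  # plain dict, ready to store alongside the vector
```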

5. Embed with a model that matches your domain

For general English, OpenAI's text-embedding-3-large or Voyage's voyage-3 are strong defaults. For code, use a code-specific embedder. For multilingual corpora, consider Cohere's embed-multilingual-v3.0. Always benchmark on your own queries - published MTEB scores don't reliably predict performance on your domain.

6. Retrieve with hybrid search

Dense vector search alone misses exact-match queries (product codes, function names, error strings). Combine dense embeddings with BM25 keyword search and a re-ranker like Cohere Rerank or bge-reranker. The retrieval-quality lift is usually 10-20% on real-world benchmarks.
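
One common way to merge the dense and keyword result lists before re-ranking is reciprocal rank fusion (RRF). Nothing above prescribes a fusion method, so take this as one reasonable option rather than the required one. The function name is ours, and k=60 is a commonly used default, not a tuned value. Each input is a list of chunk IDs ordered best-first.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of chunk IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); higher ranks contribute more.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]    # hypothetical dense-search results
keyword_hits = ["c", "a", "d"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

RRF only needs ranks, not scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A re-ranker would then rescore the top of the fused list.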

7. Evaluate, don't guess

Build a small (50-100 question) eval set with known correct answers from your corpus. Run it after every pipeline change. RAGAS and TruLens automate the metrics. Without an eval, you cannot tell whether your change improved anything.
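
Before reaching for a framework, a retrieval-only check takes about a dozen lines. The sketch below computes hit rate at k: the fraction of questions whose known-correct chunk shows up in the top k results. The eval-set field names and the retrieve callable are assumptions for illustration, and this is not how RAGAS or TruLens compute their metrics.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk ID is in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for item in eval_set:
        top_k = retrieve(item["question"])[:k]
        if item["expected_chunk_id"] in top_k:
            hits += 1
    return hits / len(eval_set)


# Toy example with a canned retriever standing in for your real pipeline.
canned_results = {
    "What are the risk factors?": ["c7", "c2", "c9"],
    "Who audits the company?": ["c4", "c1", "c3"],
}
eval_set = [
    {"question": "What are the risk factors?", "expected_chunk_id": "c2"},
    {"question": "Who audits the company?", "expected_chunk_id": "c8"},
]
score = hit_rate_at_k(eval_set, lambda q: canned_results[q], k=3)
```

Tracking this one number after every pipeline change catches most retrieval regressions before they reach answer quality.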

DocDigest handles steps 1-2 for you

Upload PDFs, DOCX, or folders. Get a Docling-parsed, cleaned Markdown digest with source metadata and token counts - ready to chunk.
