Retrieval-Augmented Generation

An Engineer's Guide to Building RAG Systems with Your Own Data

This book is 100% complete. Last updated on 2026-06-12.

The engineer's guide to RAG systems that survive a deploy.

Minimum price

$19.00

$29.00

Also available for 1 book credit with a Reader Membership

PDF
EPUB
WEB

About the Book

Most teams trying to ship a RAG system stall at the prototype stage. The notebook works and the demo wins the meeting, but the system never reaches users at scale. The gap between "this works on my laptop" and "this runs reliably in production" is wide and full of engineering challenges. This book is about that gap.

It's written for engineers who need to ship something real. Not for researchers writing benchmarks, not for managers picking vendors. For the person at the keyboard who needs to make decisions about chunking strategy, vector store choice, evaluation methodology, and production operations, and who's tired of vendor-shaped blog posts and examples that don't survive a deploy.

Each chapter pairs concept with implementation. Real code on a real corpus, runnable end to end. The seven failure points of a RAG pipeline are introduced in chapter 1 and traced through every subsequent chapter, so you learn to recognize *where* things break, not just patch them when they do.

The book

Why standalone LLMs fail on private data, what RAG actually is, and the building blocks underneath: embeddings, chunking strategies, vector storage (FAISS vs pgvector vs Qdrant with measured benchmarks), and a complete ingestion pipeline that handles the messiness of real documents.

Wiring retrieval into generation. Sparse vs dense retrieval, BM25, hybrid search with reciprocal rank fusion, reranking with cross-encoders, query transformation patterns (multi-query, sub-question decomposition, HyDE). Every chapter measures the improvement instead of just describing it.
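To give a flavor of one technique named above, here is a minimal sketch of reciprocal rank fusion for hybrid search. It is not taken from the book: the function name, the document IDs, and the constant `k=60` (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) retriever and a dense retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
```

Because the fusion uses only rank positions, it needs no score normalization between retrievers whose raw scores are on different scales.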

Evaluation done right (separate retrieval and generation metrics, RAGAS, ablation testing). Hardening the pipeline (observability, semantic caching, citation systems, embedding staleness, cost optimization, load testing). Advanced retrieval patterns (GraphRAG, Corrective RAG, Self-RAG) with honest takes on when each earns its keep. Then agentic RAG with realistic guardrails for production.

By the end you'll be able to

  • Choose a chunking strategy on retrieval evidence, not intuition
  • Pick FAISS, pgvector, or Qdrant based on your actual constraints
  • Build a RAG pipeline that handles real PDFs with OCR artifacts, encoding issues, and dirty markdown
  • Evaluate retrieval quality separately from generation quality, and prove your changes help
  • Add reranking, hybrid search, and query transformation when (and only when) they earn it
  • Catch the seven failure points before they reach production
  • Scale, monitor, and cost-optimize a RAG system that survives a deploy

About the Author

Jeroen Herczeg

Jeroen Herczeg is a senior software engineer who builds AI systems for production.

He has 20 years of engineering experience across software platforms, distributed systems, microservices, Kubernetes, and product teams. His current work focuses on retrieval-augmented generation, AI agent orchestration, and practical AI engineering.

Most recently, he built the orchestrator agent for the Google + BBC AI Agents demo at IBC2025, winner of the Broadcast Tech Innovation Award. His interest in AI goes back to 2017, when he completed Udacity’s Artificial Intelligence Nanodegree. Today, that work has evolved into a focus on production RAG systems and AI agent orchestration.

He writes about practical AI engineering at herczeg.be/blog and lives in Belgium.

Table of Contents

Preface

  1. Who I am
  2. Who it is for
  3. How to read it

The problem RAG solves

  1. What an LLM can and cannot do
  2. Limitations of a standalone LLM
  3. The RAG mental model
  4. The RAG pipeline end-to-end
  5. RAG vs. fine-tuning vs. long-context prompting
  6. The seven failure points
  7. Common misconceptions
  8. Seeing the difference: standalone LLM vs. RAG
  9. Summary

Embeddings

  1. From words to vectors
  2. The bi-encoder architecture
  3. Generating embeddings locally
  4. Generating embeddings via API
  5. Cosine similarity and distance metrics
  6. Visualizing embedding space with UMAP
  7. Choosing an embedding model
  8. Similarity search from scratch
  9. Summary

Chunking strategies

  1. The chunk size tradeoff
  2. Fixed-size chunking
  3. Recursive character splitting
  4. Semantic chunking
  5. Document-structure-aware chunking
  6. Contextual chunking
  7. Comparing strategies: A retrieval test
  8. Summary

Vector storage and indexing

  1. Exact vs. approximate nearest neighbor
  2. The speed-accuracy-memory tradeoff
  3. Choosing a vector store
  4. How HNSW works
  5. Building a FAISS index from scratch
  6. pgvector: Vectors in PostgreSQL
  7. Qdrant: A purpose-built vector database
  8. Tuning index parameters
  9. Putting it all together: the comparison benchmark
  10. Summary

Building the ingestion pipeline

  1. The ingestion flow
  2. Parsing real-world documents
  3. Text cleaning and normalization
  4. The full pipeline: Parse, clean, chunk, embed, store
  5. Metadata extraction and storage
  6. Idempotent re-ingestion
  7. Running the complete pipeline
  8. Summary

Hybrid retrieval

  1. Keyword retrieval and the BM25 mental model
  2. Adding a search vector to the chunks table
  3. Side by side: each retriever fails the other’s queries
  4. Hybrid retrieval as candidate generation
  5. Filters as candidate-set scoping
  6. Putting it together: hybrid retrieval over the corpus
  7. Summary

Your first RAG pipeline

  1. Selecting context from the candidate pool
  2. Building the prompt
  3. The complete pipeline
  4. Five queries: where the pipeline succeeds and fails
  5. The failure catalog
  6. What this pipeline cannot do yet
  7. Summary

Reranking

  1. Why first-stage retrieval optimizes for recall
  2. Bi-encoder versus cross-encoder
  3. Adding a local reranker with bge-reranker-v2-m3
  4. Choosing K and N
  5. Latency and the cost of cross-encoders
  6. Did reranking actually improve answers?
  7. When reranking is not worth it
  8. The pipeline so far
  9. Summary

Query transformation

  1. Where query transformation belongs
  2. Query rewriting
  3. HyDE: search with a hypothetical answer
  4. Multi-query expansion
  5. Decomposition
  6. When transformation hurts
  7. A technique hierarchy
  8. Summary

Evaluating RAG systems

  1. Two evaluation surfaces
  2. Building an evaluation set
  3. Retrieval metrics
  4. Generation metrics
  5. The ablation table
  6. Regression tracking
  7. What evaluation will not tell you
  8. Summary

Hardening for production

  1. Stage-level observability
  2. Tracing across stages
  3. Failure modes and graceful degradation
  4. Configuration and secrets
  5. Model versioning and the silent-rebuild trap
  6. Security boundaries in RAG systems
  7. Deploying changes safely
  8. The production baseline
  9. Summary

Advanced retrieval patterns

  1. Parent-document retrieval
  2. Contextual retrieval
  3. Graph-based retrieval
  4. ColBERT and late interaction
  5. The complexity test
  6. Summary

Agentic RAG

  1. Retrieval as a tool call
  2. Multi-step reasoning loops
  3. Bounding agentic loops
  4. Observability for agents
  5. When agentic RAG is worth it
  6. Summary

Closing

  1. What stays true
  2. What does not work
  3. What to do next
