RAG Architecture Diagram Examples

These RAG architecture examples show how the same core building blocks — embedder, vector store, retriever, and LLM — get arranged differently as a retrieval system grows from a weekend prototype into a production pipeline.

Real examples

Naive RAG (the baseline)

Who uses it: Developer building a first retrieval prototype

Ingestion: docs → fixed-size chunker (512 tokens) → embedding → Chroma
Query: question → embedding → vector search (top-5) → LLM
Orchestrator: a single LangChain RetrievalQA chain
No reranker, no query rewriting — retrieved chunks go straight to the prompt
LLM: GPT-3.5-turbo for low cost

Why this works: Naive RAG is the right place to start — it has the fewest moving parts, so when answers are wrong you can tell whether the problem is retrieval (wrong chunks) or generation (right chunks, bad answer) before adding complexity.
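The two flows above can be sketched in plain Python. This is a minimal illustration, not the LangChain or Chroma API: `chunk`, `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, the "embedding" is a toy word-count vector, and the chunker counts whitespace-separated words rather than model tokens. The LLM call is left out on purpose so the retrieved chunks can be inspected before any generation happens.

```python
import math
from collections import Counter


def chunk(text: str, size: int = 512) -> list[str]:
    """Fixed-size chunker. Counts whitespace-separated words, not model tokens."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding (word counts); a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 5) -> list[str]:
    """Vector search: rank every stored chunk by similarity to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Retrieved chunks go straight into the prompt, with no reranking or rewriting."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Ingestion (offline): docs -> chunker -> embedding -> in-memory "vector store"
docs = [
    "the cache expires after ten minutes",
    "invoices are sent on the first of each month",
    "the api rate limit is one hundred requests per minute",
]
index = [(piece, embed(piece)) for doc in docs for piece in chunk(doc, size=8)]

# Query (online): question -> embedding -> vector search -> prompt for the LLM
question = "what is the api rate limit"
top_chunks = retrieve(question, index, k=2)
prompt = build_prompt(question, top_chunks)
print(top_chunks)  # inspect retrieval on its own before blaming generation
```

Printing `top_chunks` separately from the final answer is what makes the retrieval-versus-generation diagnosis possible: if the right chunk is missing here, no prompt change will fix the answer.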

RAG with reranking

Who uses it: ML engineer whose answers are missing relevant context

Query: question → embedding → vector search (top-20) → reranker → top-5 → LLM
Reranker: a cross-encoder (Cohere Rerank or bge-reranker) scores each candidate
Vector DB: Pinecone with metadata filters by document source
Why top-20 then rerank to 5: the wide first pass favors recall, and the reranker keeps only the five most relevant chunks for the prompt
Orchestrator: LangChain with a ContextualCompressionRetriever

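Why this works: Adding a reranker is often the highest-leverage upgrade to a naive RAG system. Vector search compares a query embedding against chunk embeddings that were computed independently, so it only measures coarse similarity. A cross-encoder reranker reads the query and each chunk together and reorders the candidates by how well each chunk actually addresses the query.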
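The top-20 then top-5 funnel can be sketched as a two-stage sort. This is a minimal illustration under stated assumptions: `retrieve_then_rerank`, `overlap_score`, and `phrase_aware_score` are hypothetical names, and both scorers are toy functions. In a real pipeline the first stage would be the vector search and the second would be a cross-encoder such as the ones named above.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    fetch_k: int = 20,
    final_k: int = 5,
) -> list[str]:
    """Cheap scorer selects fetch_k candidates; the reranker keeps the best final_k."""
    wide = sorted(candidates, key=lambda c: first_stage(query, c), reverse=True)[:fetch_k]
    return sorted(wide, key=lambda c: reranker(query, c), reverse=True)[:final_k]


def overlap_score(query: str, chunk: str) -> float:
    """Toy first stage: number of query words that also appear in the chunk."""
    return float(len(set(query.lower().split()) & set(chunk.lower().split())))


def phrase_aware_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: looks at query and chunk together
    and rewards chunks that contain the exact query phrase."""
    bonus = 1.0 if query.lower() in chunk.lower() else 0.0
    return overlap_score(query, chunk) + bonus


candidates = [
    "password rules and how often to reset them",
    "how to reset password from the login page",
    "billing address changes",
]
best = retrieve_then_rerank(
    "reset password", candidates, overlap_score, phrase_aware_score, fetch_k=2, final_k=1
)
print(best)
```

The first two candidates tie on word overlap, so the first stage cannot separate them. Only the second-stage scorer, which sees the query and chunk as a pair, promotes the chunk that actually contains the phrase.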

Hybrid search RAG

Who uses it: Team where keyword-exact queries (product codes, names) fail semantic search

Query runs two retrievers in parallel: dense (embeddings) + sparse (BM25)
Results merged with Reciprocal Rank Fusion before reranking
Vector DB: Weaviate with built-in hybrid search, or Pinecone + Elasticsearch
Ingestion indexes both embeddings and a keyword index
Reranker reorders the fused candidate set

Why this works: Hybrid search fixes the classic RAG failure where a user searches an exact term — a SKU, an error code, a person's name — and pure semantic search returns 'similar' but wrong results. BM25 catches the exact match, embeddings catch the meaning.
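The merge step can be sketched with Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over every ranking it appears in. The function name, the document IDs, and the two input rankings below are made up for illustration; k = 60 is the commonly used default constant, not a value taken from this page.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document receives 1 / (k + rank) from every list it appears in,
    with rank starting at 1. Only ranks are used, so dense and sparse
    scores never need to be put on a common scale.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for a query containing an exact product code.
dense_hits = ["faq_returns", "faq_shipping", "sku_ab123"]   # embeddings: similar meaning
sparse_hits = ["sku_ab123", "faq_returns", "spec_sheet"]    # BM25: exact term match

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear in both lists rise above documents that only one retriever found, which is why the exact-match result from BM25 survives even when the dense retriever ranked it last.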

Agentic RAG

Who uses it: Developer building an assistant that decides when and what to retrieve

Orchestrator: a ReAct agent that chooses whether retrieval is even needed
Agent can rewrite the query, retrieve, then decide to retrieve again
Multiple sources: vector DB + SQL database + web search as separate tools
Self-check step: agent verifies retrieved context answers the question
Falls back to a clarifying question if retrieval confidence is low

Why this works: Agentic RAG moves the retrieval decision into the LLM itself — instead of always retrieving, the agent reasons about whether it needs external data, which query to run, and whether the results are good enough, trading latency for accuracy on complex questions.
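The control flow described above can be sketched as a small loop. This is a simplified illustration, not a ReAct implementation: `agentic_rag` and `AgentResult` are hypothetical names, every decision the LLM would make (`needs_retrieval`, `rewrite_query`, `is_sufficient`) is passed in as a plain function, and the sketch queries every tool on each attempt, whereas a real agent would pick one tool at a time.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentResult:
    action: str          # "answer_directly", "answer_with_context", or "ask_clarifying_question"
    context: list[str]   # retrieved passages handed to the final generation step
    attempts: int        # number of retrieval rounds used


def agentic_rag(
    question: str,
    needs_retrieval: Callable[[str], bool],
    rewrite_query: Callable[[str, int], str],
    tools: dict[str, Callable[[str], list[str]]],
    is_sufficient: Callable[[str, list[str]], bool],
    max_attempts: int = 2,
) -> AgentResult:
    # Step 1: decide whether retrieval is needed at all.
    if not needs_retrieval(question):
        return AgentResult("answer_directly", [], 0)

    query = question
    for attempt in range(1, max_attempts + 1):
        # Step 2: rewrite the query, then call each source as a separate tool.
        query = rewrite_query(query, attempt)
        context = [hit for tool in tools.values() for hit in tool(query)]
        # Step 3: self-check that the retrieved context answers the question.
        if is_sufficient(question, context):
            return AgentResult("answer_with_context", context, attempt)

    # Step 4: retrieval never became good enough, so ask the user instead.
    return AgentResult("ask_clarifying_question", [], max_attempts)


# Toy stand-ins for the LLM decisions and the data sources.
tools = {
    "vector_db": lambda q: ["Order 42 shipped Monday"] if "status" in q else [],
    "sql": lambda q: [],
}
result = agentic_rag(
    "Where is my order",
    needs_retrieval=lambda q: "order" in q.lower(),
    rewrite_query=lambda q, attempt: q if attempt == 1 else q + " status",
    tools=tools,
    is_sufficient=lambda q, ctx: len(ctx) > 0,
)
print(result)
```

The `attempts` field makes the latency cost visible: each extra round is another set of tool calls and LLM decisions spent in exchange for a better chance of a grounded answer.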

Tips for better RAG architecture diagrams

  • Draw the online query flow and the offline ingestion flow as two separate paths. They run at different times, and merging them into one path is the most common RAG diagram mistake.
  • Put the vector database where both flows meet: ingestion writes to it, the retriever reads from it.
  • Show the reranker as a distinct step after vector search, not merged into it — they are different models doing different jobs.
  • Label the retrieval counts (top-20 → top-5) on the arrows so reviewers can see the funnel.
