RAG Architecture Diagram Examples

These RAG architecture examples show how the same core building blocks — embedder, vector store, retriever, and LLM — get arranged differently as a retrieval system grows from a weekend prototype into a production pipeline.


Real examples

Naive RAG (the baseline)

Who uses it: Developer building a first retrieval prototype

Ingestion: docs → fixed-size chunker (512 tokens) → embedding → Chroma

Query: question → embedding → vector search (top-5) → LLM

Orchestrator: a single LangChain RetrievalQA chain

No reranker, no query rewriting — retrieved chunks go straight to the prompt

LLM: GPT-3.5-turbo for low cost

Why this works: Naive RAG is the right place to start — it has the fewest moving parts, so when answers are wrong you can tell whether the problem is retrieval (wrong chunks) or generation (right chunks, bad answer) before adding complexity.
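The two flows above can be sketched in plain Python. This is a minimal illustration, not the LangChain or Chroma API: `embed` is a toy bag-of-words stand-in for a real embedding model, `ToyVectorStore` is a hypothetical in-memory replacement for Chroma, and whitespace-separated words stand in for tokens. The final LLM call is left out; the sketch stops at the prompt the retrieved chunks would go into.

```python
import math
from collections import Counter

def chunk_fixed(tokens, size=512):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def embed(text):
    """Toy stand-in for an embedding model (assumption): word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store standing in for a vector database."""

    def __init__(self):
        self.items = []  # list of (chunk_text, vector)

    def add(self, chunk_text):
        self.items.append((chunk_text, embed(chunk_text)))

    def search(self, question, top_k=5):
        q = embed(question)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]

def build_prompt(question, chunks):
    """Retrieved chunks go straight into the prompt: no reranker, no rewriting."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Ingestion: docs -> fixed-size chunker -> embedding -> store
store = ToyVectorStore()
doc = ("Refunds are issued within 14 days. Shipping takes 3 to 5 business days. "
       "Support is open on weekdays.")
for chunk in chunk_fixed(doc.split(), size=6):  # tiny chunk size for the demo
    store.add(" ".join(chunk))

# Query: question -> embedding -> vector search -> prompt for the LLM
question = "How long does shipping take?"
top_chunks = store.search(question, top_k=2)
prompt = build_prompt(question, top_chunks)
```

Inspecting `top_chunks` before the prompt reaches the LLM is what makes the retrieval-versus-generation diagnosis possible: if the right chunk is not in the list, the problem is upstream of the model.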

RAG with reranking

Who uses it: ML engineer whose answers are missing relevant context

Query: question → embedding → vector search (top-20) → reranker → top-5 → LLM

Reranker: a cross-encoder (Cohere Rerank or bge-reranker) scores each candidate

Vector DB: Pinecone with metadata filters by document source

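Why top-20 then rerank to 5: vector search is cheap enough to over-fetch for recall, so the slower cross-encoder only has to score 20 candidates and keeps the 5 most relevant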

Orchestrator: LangChain with a ContextualCompressionRetriever

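Why this works: Adding a reranker is often the highest-leverage upgrade to a naive RAG system. Vector search compares a query embedding with chunk embeddings that were computed independently, so it ranks by overall similarity; a cross-encoder reranker reads the query and each chunk together and reorders the candidates by how well the chunk actually answers the query.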
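The top-20 to top-5 funnel can be sketched as a small function that takes the vector search results and a pair scorer. The names here are illustrative assumptions: `retrieve_then_rerank` is not a LangChain function, and `toy_pair_score` is a word-overlap stand-in for a real cross-encoder such as the ones named above.

```python
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query: str,
    vector_hits: List[str],
    score_pair: Callable[[str, str], float],
    fetch_k: int = 20,
    keep_k: int = 5,
) -> List[Tuple[str, float]]:
    """Take the first `fetch_k` vector hits, rescore each (query, chunk) pair,
    and keep the `keep_k` highest-scoring chunks."""
    shortlist = vector_hits[:fetch_k]
    scored = [(chunk, score_pair(query, chunk)) for chunk in shortlist]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep_k]

def toy_pair_score(query: str, chunk: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query words
    that also appear in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / len(q_terms) if q_terms else 0.0

# Vector search placed the relevant chunk last in its top-20.
vector_hits = [f"filler chunk {i}" for i in range(19)]
vector_hits.append("reset your password from the account page")

reranked = retrieve_then_rerank("reset password", vector_hits, toy_pair_score)
```

In this example the relevant chunk sits at position 20 of the vector results, so a plain top-5 cut would have dropped it; over-fetching and rescoring moves it to the front.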

Hybrid search RAG

Who uses it: Team whose exact-keyword queries (product codes, names) fail under semantic search

Query runs two retrievers in parallel: dense (embeddings) + sparse (BM25)

Results merged with Reciprocal Rank Fusion before reranking

Vector DB: Weaviate with built-in hybrid search, or Pinecone + Elasticsearch

Ingestion indexes both embeddings and a keyword index

Reranker reorders the fused candidate set

Why this works: Hybrid search fixes the classic RAG failure where a user searches an exact term — a SKU, an error code, a person's name — and pure semantic search returns 'similar' but wrong results. BM25 catches the exact match, embeddings catch the meaning.
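Reciprocal Rank Fusion is simple enough to show in full. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below assumes k = 60, a commonly used default, and uses made-up document IDs; it is not the implementation inside Weaviate or any other product.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one.

    A document's fused score is the sum of 1 / (k + rank) over every list
    that contains it, with ranks starting at 1.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for a query containing an exact SKU.
dense = ["similar_1", "exact", "similar_2"]   # embeddings: close in meaning
sparse = ["exact", "manual_page"]             # BM25: exact term match first

fused = reciprocal_rank_fusion([dense, sparse])
```

Because RRF uses only ranks, it needs no calibration between BM25 scores and cosine similarities, which live on different scales. Here `exact` appears in both lists and rises to the top of `fused`.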

Agentic RAG

Who uses it: Developer building an assistant that decides when and what to retrieve

Orchestrator: a ReAct agent that chooses whether retrieval is even needed

Agent can rewrite the query, retrieve, then decide to retrieve again

Multiple sources: vector DB + SQL database + web search as separate tools

Self-check step: agent verifies retrieved context answers the question

Falls back to a clarifying question if retrieval confidence is low

Why this works: Agentic RAG moves the retrieval decision into the LLM itself — instead of always retrieving, the agent reasons about whether it needs external data, which query to run, and whether the results are good enough, trading latency for accuracy on complex questions.
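The control flow described above can be sketched as a bounded loop. Everything here is a hypothetical outline rather than a real ReAct framework: the four decision callables (`needs_retrieval`, `pick_tool`, `rewrite`, `is_sufficient`) would each be an LLM call in practice, and `tools` maps assumed tool names to retrieval functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class AgentResult:
    action: str                                   # "answer" or "clarify"
    context: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)

def agentic_retrieve(
    question: str,
    needs_retrieval: Callable[[str], bool],
    pick_tool: Callable[[str], str],
    rewrite: Callable[[str, int], str],
    is_sufficient: Callable[[str, List[str]], bool],
    tools: Dict[str, Callable[[str], List[str]]],
    max_rounds: int = 3,
) -> AgentResult:
    """Decide whether to retrieve, retrieve in rounds, and fall back to a
    clarifying question if the context never passes the self-check."""
    if not needs_retrieval(question):
        return AgentResult("answer", [], ["skip_retrieval"])

    context: List[str] = []
    steps: List[str] = []
    for round_no in range(max_rounds):
        query = rewrite(question, round_no)       # query rewriting
        tool_name = pick_tool(query)              # vector DB, SQL, web search...
        steps.append(f"{tool_name}:{query}")
        context.extend(tools[tool_name](query))
        if is_sufficient(question, context):      # self-check step
            return AgentResult("answer", context, steps)
    return AgentResult("clarify", context, steps)

# Stubbed usage: every decision function is a placeholder for an LLM call.
tools = {
    "vector_db": lambda q: ["chunk about refunds"] if "refund" in q else [],
    "sql": lambda q: [],
}
result = agentic_retrieve(
    "Refund policy?",
    needs_retrieval=lambda q: True,
    pick_tool=lambda q: "vector_db",
    rewrite=lambda q, n: q.lower(),
    is_sufficient=lambda q, ctx: len(ctx) > 0,
    tools=tools,
)
```

The `max_rounds` cap is the design choice that bounds the latency cost: each extra round is at least one more model call plus one more retrieval.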

Tips for better RAG architecture diagrams

Draw the online query flow and the offline ingestion flow as two separate paths. Ingestion runs when documents change and the query flow runs on every user request; merging them into one path is a common RAG diagram mistake.
Put the vector database where both flows meet: ingestion writes to it, the retriever reads from it.
Show the reranker as a distinct step after vector search, not merged into it — they are different models doing different jobs.
Label the retrieval counts (top-20 → top-5) on the arrows so reviewers can see the funnel.
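One way to sanity-check a diagram against these tips is to write it down as labeled edges first. The node names below are illustrative assumptions, with the query-side embedder drawn as its own box even though it is usually the same model as the ingestion embedder.

```python
# Each edge is (source, destination, label). Node names are hypothetical.
INGESTION = [
    ("docs", "chunker", ""),
    ("chunker", "embedder", ""),
    ("embedder", "vector_db", "write"),
]
QUERY = [
    ("question", "query_embedder", ""),
    ("query_embedder", "vector_db", "search"),
    ("vector_db", "reranker", "top-20"),
    ("reranker", "llm", "top-5"),
]

def nodes(edges):
    """Return the set of node names that appear in an edge list."""
    return {name for src, dst, _ in edges for name in (src, dst)}

# The vector database should be the only box the two flows share.
shared = nodes(INGESTION) & nodes(QUERY)

# The funnel labels that belong on the arrows of the query flow.
funnel = [label for _, _, label in QUERY if label.startswith("top-")]
```

If `shared` contains anything other than the vector database, or the reranker does not appear as its own node, the diagram has merged steps that the tips above say to keep apart.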
