Tutorial: Build a Production RAG App in 2 Hours

CallMissed
· 6 min read · Guide

This tutorial walks through building a production-grade RAG (Retrieval-Augmented Generation) app from scratch in roughly two hours. Not a toy: a system with chunking, hybrid retrieval, reranking, evaluation, and citations. Code samples are in Python, using libraries that are widely used as of 2026; substitute whatever you prefer.

What we are building

A RAG app that answers questions over a corpus of documents (PDFs, HTML, or markdown) with citations. The pipeline:

  1. Ingest documents → chunk → embed → index
  2. At query time → retrieve → rerank → ground answer with citations
  3. Eval the answers against a small graded test set

Stack choices for this tutorial

  • Python 3.12
  • OpenAI for embeddings and LLM (substitute Anthropic, local model, etc.)
  • Qdrant for vector storage (substitute pgvector, Pinecone, Weaviate)
  • Cohere or BGE for reranking
  • FastAPI for the API layer

Step 1 — Install and set up

bash
pip install openai qdrant-client cohere fastapi uvicorn \
  pypdf tiktoken rank-bm25 numpy
python
import os
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct

oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(host="localhost", port=6333)
COLLECTION = "docs"

Step 2 — Chunking

Don't embed whole documents, and don't chunk blindly on fixed character counts either. Split on paragraph boundaries and carry a small token overlap from one chunk into the next:

python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

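def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50):
    """Pack paragraphs into chunks of roughly max_tokens, with token overlap."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf, buf_len = [], [], 0
    for p in paras:
        tlen = len(enc.encode(p))
        if buf_len + tlen > max_tokens and buf:
            # Buffer is full: emit it as a chunk and start a new one.
            chunks.append("\n\n".join(buf))
            buf, buf_len = [], 0
            if overlap > 0:  # a [-0:] slice would copy the whole chunk
                # Seed the next chunk with the last `overlap` tokens.
                tail = enc.decode(enc.encode(chunks[-1])[-overlap:])
                buf, buf_len = [tail], len(enc.encode(tail))
        buf.append(p)
        buf_len += tlen
    if buf:
        chunks.append("\n\n".join(buf))
    return chunks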

Tune max_tokens to your retrieval pattern. 300-500 tokens balances recall vs. precision for most question-answering use cases.
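
One gap to be aware of: chunk_text packs whole paragraphs, so a single paragraph longer than max_tokens still ends up as one oversized chunk. A minimal sketch of a pre-splitting pass follows. The helper name split_oversized and the whitespace word counter are assumptions made for illustration; in practice you would pass a tiktoken-based counter instead.

```python
import re
from typing import Callable


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


def split_oversized(
    para: str,
    max_tokens: int = 400,
    count_tokens: Callable[[str], int] = count_words,
) -> list[str]:
    """Split one paragraph into sentence groups of at most max_tokens each.

    A single sentence longer than max_tokens is kept whole rather than cut
    mid-sentence, so the limit is best-effort.
    """
    if count_tokens(para) <= max_tokens:
        return [para]
    # Naive sentence split: ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", para.strip())
    pieces, buf, buf_len = [], [], 0
    for s in sentences:
        n = count_tokens(s)
        if buf and buf_len + n > max_tokens:
            pieces.append(" ".join(buf))
            buf, buf_len = [], 0
        buf.append(s)
        buf_len += n
    if buf:
        pieces.append(" ".join(buf))
    return pieces
```

Running each paragraph through split_oversized before the packing loop, with count_tokens=lambda s: len(enc.encode(s)), would keep the buffered units under the limit in most cases.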

Step 3 — Embedding and indexing

python
def embed(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    return [d.embedding for d in resp.data]

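import uuid

# Create the collection only if it is missing; recreate_collection would
# drop all indexed data every time this script runs.
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        # 3072 is the default output size of text-embedding-3-large.
        vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
    )

def index_doc(doc_id: str, text: str, metadata: dict):
    chunks = chunk_text(text)
    vectors = embed(chunks)
    points = [
        PointStruct(
            # Qdrant point IDs must be unsigned integers or UUIDs, so derive
            # a deterministic UUID from the readable "<doc_id>-<i>" key.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}-{i}")),
            vector=v,
            payload={"text": c, "doc_id": doc_id, "chunk_index": i, **metadata},
        )
        for i, (c, v) in enumerate(zip(chunks, vectors))
    ]
    qdrant.upsert(collection_name=COLLECTION, points=points)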

For production: batch your embedding calls (100-500 chunks per request), handle retries, and persist a content hash so you don't re-embed unchanged chunks.
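
The three production concerns above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the tutorial's own implementation: content_hash, batched, with_retries, and embed_new_chunks are hypothetical helper names, seen_hashes stands in for whatever store you persist hashes in, and embed_fn is any function with the same shape as embed.

```python
import hashlib
import time
from typing import Callable, Iterator


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retries(fn: Callable, attempts: int = 3, base_delay: float = 1.0) -> Callable:
    """Wrap fn with exponential backoff. Catches every Exception for brevity;
    narrow this to transient errors (timeouts, rate limits) in real code."""
    def wrapped(*args, **kwargs):
        for attempt in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)
    return wrapped


def embed_new_chunks(
    chunks: list[str],
    seen_hashes: set[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> dict[str, list[float]]:
    """Embed only chunks whose hash is not in seen_hashes.

    Returns {content_hash: vector} for the newly embedded chunks and
    records their hashes in seen_hashes.
    """
    new: dict[str, str] = {}
    for c in chunks:
        h = content_hash(c)
        if h not in seen_hashes and h not in new:
            new[h] = c  # dict keeps first-seen order and drops duplicates
    vectors: dict[str, list[float]] = {}
    for batch in batched(list(new), batch_size):
        for h, v in zip(batch, embed_fn([new[h] for h in batch])):
            vectors[h] = v
        seen_hashes.update(batch)
    return vectors
```

A call such as embed_new_chunks(chunks, seen, with_retries(embed)) would then skip unchanged chunks and retry failed batches; storing the hash in each point's payload is one way to rebuild seen after a restart.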

Step 4 — Hybrid retrieval

Pure dense retrieval misses keyword-heavy queries (acronyms, names, codes). Add a sparse component:

python
from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self):
        self.bm25 = None
        self.chunks = []  # dicts with "id" and "text"; ids must equal the Qdrant point IDs

    def fit_bm25(self, chunks):
        self.chunks = chunks
        tokenized = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query: str, k: int = 20):
        # Dense
        q_vec = embed([query])[0]
        dense = qdrant.search(
            collection_name=COLLECTION,
            query_vector=q_vec,
            limit=k,
        )
        dense_results = [
            {"text": h.payload["text"], "score": h.score, "id": h.id}
            for h in dense
        ]
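        # Sparse (requires fit_bm25 to have been called)
        if self.bm25 is None:
            raise RuntimeError("Call fit_bm25() before retrieve()")
        sparse_scores = self.bm25.get_scores(query.lower().split())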
        top_sparse_idx = sparse_scores.argsort()[::-1][:k]
        sparse_results = [
            {
                "text": self.chunks[i]["text"],
                "score": float(sparse_scores[i]),
                "id": self.chunks[i]["id"],
            }
            for i in top_sparse_idx
        ]
        # Reciprocal rank fusion
        return rrf_fuse(dense_results, sparse_results, k=k)

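def rrf_fuse(*result_lists, k: int = 20, rrf_k: int = 60):
    """Fuse ranked lists: each hit adds 1 / (rrf_k + rank), with rank starting at 1."""
    scores = {}
    payload = {}
    for results in result_lists:
        for rank, r in enumerate(results, start=1):
            scores[r["id"]] = scores.get(r["id"], 0) + 1 / (rrf_k + rank)
            # The stored "score" is the retriever's own score, not the fused one.
            payload[r["id"]] = r
    ranked = sorted(scores.items(), key=lambda x: -x[1])[:k]
    return [payload[i] for i, _ in ranked]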

Reciprocal rank fusion is simple, robust, and outperforms naive score blending in most setups.
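
A small worked example may make the fusion arithmetic concrete. It is a standalone sketch: rrf_scores is a hypothetical helper, the document IDs are made up, and ranks are 1-based, which is the usual convention.

```python
def rrf_scores(ranked_lists: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank) over every list in which an ID appears."""
    scores: dict[str, float] = {}
    for ids in ranked_lists:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return scores


dense = ["a", "b", "c"]   # dense retriever's order, best first
sparse = ["c", "a", "d"]  # BM25's order, best first

scores = rrf_scores([dense, sparse])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['a', 'c', 'b', 'd']
```

Here "a" scores 1/61 + 1/62 and "c" scores 1/63 + 1/61, so both outrank "b" and "d", which each appear in only one list. Only ranks enter the formula, so the incompatible score scales of cosine similarity and BM25 never need to be normalized.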

Step 5 — Reranking

The retriever is recall-tuned; the reranker is precision-tuned. A cross-encoder reranker reorders candidates by direct relevance to the query:

python
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query: str, candidates: list[dict], top_n: int = 5):
    docs = [c["text"] for c in candidates]
    resp = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n,
    )
    return [candidates[r.index] for r in resp.results]

Substitute a self-hosted reranker (BGE-reranker, Jina) if you need on-prem.
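
One way to keep the reranker swappable is to depend only on a scoring function that takes (query, document) pairs and returns one score per pair. The sketch below is an assumption-laden illustration: rerank_local and overlap_scorer are hypothetical names, and the toy scorer exists only so the example runs without downloading a model.

```python
from typing import Callable, Sequence

# Any callable mapping (query, document) pairs to one score per pair.
ScorePairs = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank_local(
    query: str,
    candidates: list[dict],
    score_pairs: ScorePairs,
    top_n: int = 5,
) -> list[dict]:
    """Reorder candidates by cross-encoder style scores, highest first."""
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    order = sorted(
        range(len(candidates)), key=lambda i: float(scores[i]), reverse=True
    )
    return [candidates[i] for i in order[:top_n]]


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in scorer: number of lowercase words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = [
    {"text": "cats sleep a lot"},
    {"text": "qdrant stores vectors"},
    {"text": "vectors are useful"},
]
top = rerank_local("qdrant vectors", docs, overlap_scorer, top_n=2)
```

With sentence-transformers installed, passing CrossEncoder("BAAI/bge-reranker-base").predict as score_pairs should fit this shape, since predict takes a list of text pairs and returns one score per pair; that combination is not exercised here.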

Step 6 — Grounded generation with citations

python
GROUNDED_PROMPT = """Answer the question using ONLY the provided sources.
Cite each claim with [n] where n is the source number.
If the sources do not contain the answer, say: "I don't have enough information to answer."

Sources:
{sources}

Question: {question}

Answer:"""

def answer(question: str, retriever: HybridRetriever) -> dict:
    candidates = retriever.retrieve(question, k=20)
    top = rerank(question, candidates, top_n=5)
    sources_text = "\n\n".join(
        f"[{i+1}] {c['text']}" for i, c in enumerate(top)
    )
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": GROUNDED_PROMPT.format(
                    sources=sources_text, question=question
                ),
            }
        ],
        temperature=0.0,
    )
    return {
        "answer": resp.choices[0].message.content,
        "sources": [{"id": c["id"], "text": c["text"]} for c in top],
    }

The [n] citation pattern is parser-friendly. Validate citations at the response boundary: every [n] should map to a real source; reject answers with phantom citations.
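
A minimal sketch of that validation step, assuming sources are numbered from 1 as in the prompt above. The function name validate_citations and the shape of its return value are illustrative choices, not part of any library.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def validate_citations(answer_text: str, num_sources: int) -> dict:
    """Return the cited source numbers and any that match no real source."""
    cited = sorted({int(n) for n in CITATION_RE.findall(answer_text)})
    phantom = [n for n in cited if not 1 <= n <= num_sources]
    return {"cited": cited, "phantom": phantom}


report = validate_citations("Qdrant stores vectors [1]. It scales well [3][7].", 5)
print(report)  # {'cited': [1, 3, 7], 'phantom': [7]}
```

An answer with a non-empty phantom list can be rejected or regenerated. An answer with no citations at all needs a separate decision: it is expected for the "I don't have enough information" reply and suspicious otherwise.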

Step 7 — Eval

Build a graded test set of 30-50 questions with expected answers. Score:

  • Faithfulness: does the answer follow from the cited sources? (LLM-as-judge with a strict rubric)
  • Recall: for questions whose answer is in the corpus, did we retrieve it? (manually labeled)
  • Citation accuracy: do [n] citations actually contain the cited claim?
python
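def faithfulness_score(answer: str, sources: list[dict]) -> float:
    # Number the sources so the judge can match each [n] citation to its text.
    numbered = "\n\n".join(
        f"[{i + 1}] {s['text']}" for i, s in enumerate(sources)
    )
    judge_prompt = f"""Given the answer and sources, score 0-10 how well each
claim in the answer is supported by the cited sources. Return only the integer.

Answer: {answer}

Sources:
{numbered}
"""
    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
    )
    try:
        score = float(resp.choices[0].message.content.strip())
    except (AttributeError, TypeError, ValueError):
        return 0.0  # empty or unparseable judge output counts as a failure
    # Clamp in case the judge returns a value outside 0-10.
    return min(max(score, 0.0), 10.0) / 10.0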

Run the eval after every prompt, retriever, or reranker change, before you ship it. Track scores over time.

Step 8 — Wrap with FastAPI

python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
retriever = HybridRetriever()  # initialize with your data

class Q(BaseModel):
    question: str

@app.post("/ask")
def ask(q: Q):
    return answer(q.question, retriever)

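Save the code as main.py and run it with uvicorn main:app --reload. Call retriever.fit_bm25(...) with your indexed chunks at startup; until then, retrieve() has no BM25 index to query. With that in place, you have a working RAG service.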

What to do next

Production hardening, in priority order:

  1. Persist BM25 index (don't rebuild on every restart)
  2. Add observability — log query, retrieved IDs, answer, latency
  3. Add caching for repeated queries
  4. Add query rewriting for multi-turn conversations
  5. Add metadata filtering for tenant isolation
  6. Add a fallback when retrieval returns no relevant context
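
Items 2 and 3 can start as a thin wrapper around the answer function. The following is a minimal in-process sketch under stated assumptions: make_cached_logged and normalize are hypothetical names, the cache is a plain dict with first-in-first-out eviction, and the wrapped function is assumed to return a dict with "answer" and "sources" keys like answer() above.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("rag")


def normalize(question: str) -> str:
    """Cache key: lowercase with collapsed whitespace."""
    return " ".join(question.lower().split())


def make_cached_logged(
    answer_fn: Callable[[str], dict], max_entries: int = 1024
) -> Callable[[str], dict]:
    """Wrap answer_fn with a small cache and one structured log line per call."""
    cache: dict[str, dict] = {}

    def wrapped(question: str) -> dict:
        key = normalize(question)
        start = time.perf_counter()
        hit = key in cache
        if hit:
            result = cache[key]
        else:
            result = answer_fn(question)
            if len(cache) >= max_entries:
                cache.pop(next(iter(cache)))  # evict the oldest inserted entry
            cache[key] = result
        logger.info(json.dumps({
            "query": question,
            "source_ids": [s["id"] for s in result.get("sources", [])],
            "answer": result.get("answer"),
            "cache_hit": hit,
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }))
        return result

    return wrapped
```

In the FastAPI handler this could be used as cached_answer = make_cached_logged(lambda q: answer(q, retriever)). The cache lives in one process only, so a multi-worker deployment would need a shared store instead, and entries should be invalidated whenever the corpus is re-indexed.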

This stack is what most production RAG systems converge on. The patterns are stable; the components are swappable. Start here, measure, then evolve.

Frequently Asked Questions

Why hybrid retrieval rather than just dense?
Pure dense retrieval misses keyword-heavy queries: acronyms, exact codes, rare terms. Sparse (BM25) catches these. Fusing both via reciprocal rank fusion gives better recall across query types than either alone.

How big should chunks be?
300-500 tokens balances recall and precision for most question-answering use cases. Smaller chunks fragment context; larger chunks reduce retrieval precision. Tune empirically against your eval set.

Do I need a reranker if I have hybrid retrieval?
For most production systems, yes. The retriever is recall-tuned to fetch a wide candidate set; the reranker is precision-tuned to order it. Skipping the reranker can work for very narrow corpora, but it typically costs 5-15% answer quality on diverse queries, so confirm the gap on your own eval set before dropping it.
