Tutorial: Build a Production RAG App in 2 Hours

CallMissed

This tutorial walks through building a production-grade RAG (Retrieval-Augmented Generation) app from scratch in roughly two hours. It is not a toy: the system includes chunking, hybrid retrieval, reranking, evaluation, and citations. Code samples are in Python and use libraries that are widely used as of 2026; substitute whatever you prefer.

What we are building

A RAG app that answers questions over a corpus of documents (PDFs, HTML, or markdown) with citations. The pipeline:

  • Ingest documents → chunk → embed → index
  • At query time → retrieve → rerank → ground answer with citations
  • Eval the answers against a small graded test set

Stack choices for this tutorial

  • Python 3.12
  • OpenAI for embeddings and LLM (substitute Anthropic, local model, etc.)
  • Qdrant for vector storage (substitute pgvector, Pinecone, Weaviate)
  • Cohere or BGE for reranking
  • FastAPI for the API layer

Step 1 — Install and set up

    bash
    pip install openai qdrant-client cohere fastapi uvicorn \
      pypdf tiktoken rank-bm25 numpy

    python
    import os
    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.http.models import Distance, VectorParams, PointStruct
    
    # The API key is read from the environment; a Qdrant server must be
    # listening on localhost:6333.
    oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    qdrant = QdrantClient(host="localhost", port=6333)
    COLLECTION = "docs"  # collection name used throughout the tutorial

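Step 2 — Chunking

Don't dump full documents into the index, and don't chunk blindly on fixed character counts either. Split on semantic boundaries (paragraphs, in this example) and carry a small overlap between consecutive chunks:

    python
    import tiktoken
    
    # Token counts are approximate for the embedding model, which uses a
    # different encoding than gpt-4o-mini; close enough for sizing chunks.
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    
    def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50):
        """Pack paragraphs into chunks of roughly max_tokens, with token overlap."""
        paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, buf, buf_len = [], [], 0
        for p in paras:
            tlen = len(enc.encode(p))
            # Flush the buffer when the next paragraph would exceed the budget
            if buf_len + tlen > max_tokens and buf:
                chunks.append("\n\n".join(buf))
                # Seed the next chunk with the last `overlap` tokens of this one
                # (a slice of [-0:] would copy the whole chunk, hence the guard)
                tail = enc.decode(enc.encode(chunks[-1])[-overlap:]) if overlap > 0 else ""
                buf = [tail] if tail else []
                buf_len = len(enc.encode(tail))
            # Note: a single paragraph longer than max_tokens is kept whole
            buf.append(p)
            buf_len += tlen
        if buf:
            chunks.append("\n\n".join(buf))
        return chunks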

Tune max_tokens to your retrieval pattern. A range of 300-500 tokens balances recall against precision for most question-answering use cases.

Step 3 — Embedding and indexing

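    python
    import uuid
    
    def embed(texts: list[str]) -> list[list[float]]:
        resp = oai.embeddings.create(
            model="text-embedding-3-large",
            input=texts,
        )
        return [d.embedding for d in resp.data]
    
    # Create the collection once. recreate_collection would drop existing data
    # on every run and is deprecated in recent qdrant-client releases.
    if not qdrant.collection_exists(COLLECTION):
        qdrant.create_collection(
            collection_name=COLLECTION,
            # text-embedding-3-large returns 3072-dimensional vectors
            vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
        )
    
    def chunk_id(doc_id: str, i: int) -> str:
        # Qdrant point IDs must be unsigned integers or UUIDs, so derive a
        # deterministic UUID from the document ID and chunk position.
        return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}-{i}"))
    
    def index_doc(doc_id: str, text: str, metadata: dict):
        chunks = chunk_text(text)
        vectors = embed(chunks)
        points = [
            PointStruct(
                id=chunk_id(doc_id, i),
                vector=v,
                payload={"text": c, "doc_id": doc_id, "chunk_index": i, **metadata},
            )
            for i, (c, v) in enumerate(zip(chunks, vectors))
        ]
        qdrant.upsert(collection_name=COLLECTION, points=points)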

For production, batch your embedding calls (100-500 chunks per request), handle retries, and persist a content hash so you don't re-embed unchanged chunks.
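The three production concerns above (batching, retries, and skipping unchanged chunks) can be sketched together. This is a minimal illustration and not part of the tutorial's pipeline: `embed_incremental`, `content_hash`, and the in-memory `seen` dict are hypothetical names, and a real system would persist the hash-to-vector map in a database. The `embed_fn` argument is assumed to behave like the `embed` function from Step 3.

```python
import hashlib
import time
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_incremental(
    chunks: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    seen: dict[str, list[float]],
    batch_size: int = 100,
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> list[list[float]]:
    """Embed only chunks whose hash is not already in `seen`.

    `seen` maps content hash -> vector and is updated in place
    (hypothetical store; persist it in production).
    """
    # Collect unique, not-yet-embedded chunks, preserving order
    pending: dict[str, str] = {}
    for c in chunks:
        h = content_hash(c)
        if h not in seen:
            pending[h] = c

    hashes = list(pending)
    for start in range(0, len(hashes), batch_size):
        batch = hashes[start:start + batch_size]
        texts = [pending[h] for h in batch]
        for attempt in range(max_retries):
            try:
                vectors = embed_fn(texts)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # give up after the last attempt
                time.sleep(base_delay * 2 ** attempt)  # exponential backoff
        for h, v in zip(batch, vectors):
            seen[h] = v

    # Return vectors in the same order as the input chunks
    return [seen[content_hash(c)] for c in chunks]
```

Catching every `Exception` keeps the sketch short; in practice you would likely retry only on rate-limit and transient network errors.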

Step 4 — Hybrid retrieval

Pure dense retrieval misses keyword-heavy queries (acronyms, names, codes). Add a sparse component:

    python
    from rank_bm25 import BM25Okapi
    
    class HybridRetriever:
        def __init__(self):
            self.bm25 = None
            self.chunks = []  # populated from your DB on init
    
        def fit_bm25(self, chunks):
            # Each chunk is a dict with "text" and "id"; "id" must equal the
            # Qdrant point ID so fusion can merge dense and sparse hits.
            self.chunks = chunks
            tokenized = [c["text"].lower().split() for c in chunks]
            self.bm25 = BM25Okapi(tokenized)
    
        def retrieve(self, query: str, k: int = 20):
            if self.bm25 is None:
                raise RuntimeError("call fit_bm25() before retrieve()")
            # Dense (recent qdrant-client releases favor query_points() over search())
            q_vec = embed([query])[0]
            dense = qdrant.search(
                collection_name=COLLECTION,
                query_vector=q_vec,
                limit=k,
            )
            dense_results = [
                {"text": h.payload["text"], "score": h.score, "id": h.id}
                for h in dense
            ]
            # Sparse: drop chunks with non-positive scores, which typically
            # means no query term matched
            sparse_scores = self.bm25.get_scores(query.lower().split())
            top_sparse_idx = [
                i for i in sparse_scores.argsort()[::-1][:k] if sparse_scores[i] > 0
            ]
            sparse_results = [
                {
                    "text": self.chunks[i]["text"],
                    "score": float(sparse_scores[i]),
                    "id": self.chunks[i]["id"],
                }
                for i in top_sparse_idx
            ]
            # Reciprocal rank fusion
            return rrf_fuse(dense_results, sparse_results, k=k)
    
    def rrf_fuse(*result_lists, k: int = 20, rrf_k: int = 60):
        scores = {}
        payload = {}
        for results in result_lists:
            # Ranks are 1-based: the top hit contributes 1 / (rrf_k + 1)
            for rank, r in enumerate(results, start=1):
                scores[r["id"]] = scores.get(r["id"], 0) + 1 / (rrf_k + rank)
                payload[r["id"]] = r
        ranked = sorted(scores.items(), key=lambda x: -x[1])[:k]
        return [payload[i] for i, _ in ranked]

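Reciprocal rank fusion is simple and robust: it uses only each result's rank, so cosine similarities and BM25 scores never need to be normalized onto a common scale. It outperforms naive score blending in most setups.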
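To make the fusion arithmetic concrete, here is a small worked example on two toy rankings. The helper name `rrf_scores` and the document IDs are made up for illustration; ranks are 1-based, as in the usual formulation of RRF, with the constant 60 used in this tutorial.

```python
def rrf_scores(rankings: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank) over every ranking a document appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return scores


dense_ranking = ["a", "b", "c"]   # hypothetical dense results, best first
sparse_ranking = ["c", "a", "d"]  # hypothetical BM25 results, best first

fused = rrf_scores([dense_ranking, sparse_ranking])
order = sorted(fused, key=fused.get, reverse=True)
print(order)  # ['a', 'c', 'b', 'd']
```

Document "a" wins with 1/61 + 1/62, narrowly ahead of "c" with 1/63 + 1/61. Both outrank "b" and "d", which appear in only one list, and that is the behavior you want from a hybrid retriever.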

Step 5 — Reranking

The retriever is tuned for recall; the reranker is tuned for precision. A cross-encoder reranker reorders candidates by direct relevance to the query:

    python
    import cohere
    
    co = cohere.Client(os.environ["COHERE_API_KEY"])
    
    def rerank(query: str, candidates: list[dict], top_n: int = 5):
        if not candidates:
            return []  # nothing retrieved, so skip the API call
        docs = [c["text"] for c in candidates]
        resp = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=docs,
            top_n=min(top_n, len(docs)),
        )
        # Each result carries the index of the document in the input list
        return [candidates[r.index] for r in resp.results]

Substitute a self-hosted reranker (BGE-reranker, Jina) if you need to run on-prem.
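One way to keep the reranker swappable is to pass the scoring function in, so the same wrapper works with any local cross-encoder. The sketch below is an assumption about how you might structure this, not the tutorial's code: `rerank_local` and `score_pairs` are hypothetical names, and `score_pairs` is assumed to take a list of (query, passage) pairs and return one relevance score per pair.

```python
from typing import Callable, Sequence


def rerank_local(
    query: str,
    candidates: list[dict],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[dict]:
    """Reorder candidates by a locally computed relevance score."""
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    # Sort candidate indices by descending score, then keep the top_n
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]


# Assumed wiring with the sentence-transformers package (not installed above):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("BAAI/bge-reranker-base")
#   top = rerank_local(question, candidates, model.predict, top_n=5)
```

Because the function returns the same list-of-dicts shape as the Cohere-based `rerank`, it can be dropped into the answer step without other changes.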

Step 6 — Grounded generation with citations

    python
    GROUNDED_PROMPT = """Answer the question using ONLY the provided sources.
    Cite each claim with [n] where n is the source number.
    If the sources do not contain the answer, say: "I don't have enough information to answer."
    
    Sources:
    {sources}
    
    Question: {question}
    
    Answer:"""
    
    def answer(question: str, retriever: HybridRetriever) -> dict:
        candidates = retriever.retrieve(question, k=20)
        top = rerank(question, candidates, top_n=5)
        sources_text = "\n\n".join(
            f"[{i+1}] {c['text']}" for i, c in enumerate(top)
        )
        resp = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "user",
                    "content": GROUNDED_PROMPT.format(
                        sources=sources_text, question=question
                    ),
                }
            ],
            temperature=0.0,
        )
        return {
            "answer": resp.choices[0].message.content,
            "sources": [{"id": c["id"], "text": c["text"]} for c in top],
        }

The [n] citation pattern is parser-friendly. Validate citations at the response boundary: every [n] should map to a real source, and answers with phantom citations should be rejected.
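A minimal sketch of that validation step, assuming sources are numbered from 1 as in the prompt above. The function name `validate_citations` and the shape of the returned dict are illustrative choices, not part of any library.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def validate_citations(answer_text: str, num_sources: int) -> dict:
    """Find every [n] marker and flag those that point at no real source."""
    cited = sorted({int(n) for n in CITATION_RE.findall(answer_text)})
    phantom = [n for n in cited if not 1 <= n <= num_sources]
    return {"cited": cited, "phantom": phantom, "ok": not phantom}


result = validate_citations("Paris is the capital [1]. It has a tower [3][7].", 5)
print(result)  # {'cited': [1, 3, 7], 'phantom': [7], 'ok': False}
```

An answer with no citations at all passes this check, which is what you want for the "I don't have enough information" refusal. If you also want to reject uncited factual answers, add a separate check on an empty `cited` list.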

Step 7 — Eval

Build a graded test set of 30-50 questions with expected answers. Score:

  • Faithfulness: does the answer follow from the cited sources? (LLM-as-judge with a strict rubric)
  • Recall: for questions whose answer is in the corpus, did we retrieve it? (manually labeled)
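  • Citation accuracy: do [n] citations actually contain the cited claim?

    python
    def faithfulness_score(answer: str, sources: list[dict]) -> float:
        # Number the sources the same way the answer prompt does, so that
        # each [n] in the answer lines up with the text the judge sees.
        sources_text = "\n\n".join(
            f"[{i+1}] {s['text']}" for i, s in enumerate(sources)
        )
        judge_prompt = f"""Given the answer and sources, score 0-10 how well each
    claim in the answer is supported by the cited sources. Return only the integer.
    
    Answer: {answer}
    Sources:
    {sources_text}
    """
        resp = oai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0.0,
        )
        try:
            score = float(resp.choices[0].message.content.strip()) / 10.0
        except (ValueError, AttributeError):
            return 0.0  # unparseable or empty judge output counts as a failure
        return min(max(score, 0.0), 1.0)  # clamp to [0, 1]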

Run the eval after every prompt change, retriever change, or reranker change, and compare against the previous run. Track scores over time.
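The recall metric from the list above needs no LLM at all. The sketch below computes a hit rate: the fraction of answerable questions for which at least one manually labeled relevant chunk appears in the top k results. The test-set shape (`question`, `relevant_ids`) and the function name `retrieval_recall` are assumptions for illustration; `retrieve_fn` is assumed to return dicts with an "id" key, like `HybridRetriever.retrieve`.

```python
from typing import Callable


def retrieval_recall(
    test_set: list[dict],
    retrieve_fn: Callable[[str], list[dict]],
    k: int = 20,
) -> float:
    """Fraction of answerable questions with a relevant chunk in the top k."""
    hits, total = 0, 0
    for case in test_set:
        relevant = set(case["relevant_ids"])
        if not relevant:
            continue  # answer is not in the corpus, so skip this question
        total += 1
        retrieved = {r["id"] for r in retrieve_fn(case["question"])[:k]}
        if relevant & retrieved:
            hits += 1
    return hits / total if total else 0.0
```

Usage would look like `retrieval_recall(test_set, lambda q: retriever.retrieve(q, k=20))`. Running it at several values of k shows whether a miss is a retrieval problem or a ranking problem.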

Step 8 — Wrap with FastAPI

    python
    from fastapi import FastAPI
    from pydantic import BaseModel
    
    app = FastAPI()
    # Load your chunks and call retriever.fit_bm25(chunks) before serving traffic
    retriever = HybridRetriever()
    
    class Q(BaseModel):
        question: str
    
    @app.post("/ask")
    def ask(q: Q):
        return answer(q.question, retriever)

Run with uvicorn main:app --reload (assuming the code is saved as main.py). You now have a working RAG service.

What to do next

Production hardening, in priority order:

  • Persist BM25 index (don't rebuild on every restart)
  • Add observability — log query, retrieved IDs, answer, latency
  • Add caching for repeated queries
  • Add query rewriting for multi-turn conversations
  • Add metadata filtering for tenant isolation
  • Add a fallback when retrieval returns no relevant context

This stack is what most production RAG systems converge on. The patterns are stable; the components are swappable. Start here, measure, then evolve.
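Two of the hardening items, observability and caching, can be sketched as a thin wrapper around the answer function. This is an illustrative in-process version under stated assumptions: `with_cache_and_logging` is a hypothetical name, the cache is a plain dict with oldest-first eviction, and `answer_fn` is assumed to return a dict with a "sources" list as in Step 6. A multi-worker deployment would more likely use a shared cache.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("rag")


def with_cache_and_logging(
    answer_fn: Callable[[str], dict], max_entries: int = 1024
) -> Callable[[str], dict]:
    """Wrap answer_fn with a small query cache and structured logging."""
    cache: dict[str, dict] = {}

    def wrapped(question: str) -> dict:
        # Normalize case and whitespace so trivially different queries match
        key = " ".join(question.lower().split())
        start = time.perf_counter()
        hit = key in cache
        if not hit:
            if len(cache) >= max_entries:
                cache.pop(next(iter(cache)))  # evict the oldest entry
            cache[key] = answer_fn(question)
        result = cache[key]
        logger.info(json.dumps({
            "query": question,
            "cache_hit": hit,
            "source_ids": [s["id"] for s in result.get("sources", [])],
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        }, default=str))
        return result

    return wrapped
```

Wiring it in could look like `cached_answer = with_cache_and_logging(lambda q: answer(q, retriever))`. Remember to clear the cache whenever the index changes, or stale answers will outlive the documents they cite.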

Frequently Asked Questions

Why hybrid retrieval rather than just dense?
Pure dense retrieval misses keyword-heavy queries such as acronyms, exact codes, and rare terms. Sparse retrieval (BM25) catches these. Fusing both via reciprocal rank fusion gives better recall across query types than either alone.

How big should chunks be?
300-500 tokens balances recall and precision for most question-answering use cases. Smaller chunks fragment context; larger chunks reduce retrieval precision. Tune empirically against your eval set.

Do I need a reranker if I have hybrid retrieval?
For most production systems, yes: the retriever is recall-tuned to fetch a wide candidate set, and the reranker is precision-tuned to order it. Skipping the rerank step works for very narrow corpora but typically loses 5-15% answer quality on diverse queries.
