Tutorial: Build a Production RAG App in 2 Hours

CallMissed · May 8, 2026


Tags: Tutorial, RAG, Python, AI Engineering, Vector Search

This tutorial walks through building a production-grade RAG (Retrieval-Augmented Generation) app from scratch in roughly two hours. It is not a toy: the system includes chunking, hybrid retrieval, reranking, evaluation, and citations. The code samples are Python and use libraries that are widely used as of 2026; substitute whatever you prefer.

What we are building

A RAG app that answers questions over a corpus of documents (PDFs, HTML, or markdown) with citations. The pipeline:

Ingest documents → chunk → embed → index

At query time → retrieve → rerank → ground answer with citations

Eval the answers against a small graded test set

Stack choices for this tutorial

Python 3.12

OpenAI for embeddings and LLM (substitute Anthropic, local model, etc.)

Qdrant for vector storage (substitute pgvector, Pinecone, Weaviate)

Cohere or BGE for reranking

FastAPI for the API layer

Step 1 — Install and set up

bash

pip install openai qdrant-client cohere fastapi uvicorn \
  pypdf tiktoken rank-bm25 numpy

python

import os
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams, PointStruct

oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
qdrant = QdrantClient(host="localhost", port=6333)
COLLECTION = "docs"

Step 2 — Chunking

Don't embed whole documents, and don't split blindly on fixed character counts either. Split on natural boundaries (paragraphs, in this example) and carry a small token overlap from one chunk into the next:

python

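import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o-mini")

def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> list[str]:
    """Pack paragraphs into chunks of roughly max_tokens tokens, with overlap."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, buf, buf_len = [], [], 0
    for p in paras:
        tlen = len(enc.encode(p))
        if buf_len + tlen > max_tokens and buf:
            chunks.append("\n\n".join(buf))
            buf, buf_len = [], 0
            if overlap > 0:
                # Carry the last `overlap` tokens into the next chunk.
                # Without the guard, overlap=0 would slice [-0:] and copy
                # the entire previous chunk.
                tail = enc.decode(enc.encode(chunks[-1])[-overlap:])
                buf, buf_len = [tail], len(enc.encode(tail))
        buf.append(p)
        buf_len += tlen
    if buf:
        chunks.append("\n\n".join(buf))
    # Note: a single paragraph longer than max_tokens is not split further.
    return chunks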

Tune max_tokens to your retrieval pattern. 300-500 tokens balances recall vs. precision for most question-answering use cases.
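One gap in paragraph-based packing is that a single paragraph longer than max_tokens passes through as one oversized chunk. The sketch below shows one possible pre-pass that splits such paragraphs at sentence ends. It is an illustration rather than part of the pipeline above: split_long_paragraph and count_words are hypothetical names, the sentence regex is a rough heuristic, and a whitespace word count stands in for the tokenizer so the example runs without dependencies.

```python
import re
from typing import Callable

def count_words(text: str) -> int:
    # Stand-in for a real tokenizer (assumption). To count model tokens,
    # pass something like: lambda t: len(enc.encode(t))
    return len(text.split())

def split_long_paragraph(
    para: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = count_words,
) -> list[str]:
    """Split one paragraph at sentence ends so pieces stay near max_tokens."""
    if count_tokens(para) <= max_tokens:
        return [para]
    # Rough heuristic: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", para.strip())
    pieces, buf = [], []
    for s in sentences:
        candidate = " ".join(buf + [s])
        if buf and count_tokens(candidate) > max_tokens:
            # Adding this sentence would overflow: close the current piece.
            pieces.append(" ".join(buf))
            buf = [s]
        else:
            buf.append(s)
    if buf:
        pieces.append(" ".join(buf))
    # A single sentence longer than max_tokens is still returned whole.
    return pieces

para = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
pieces = split_long_paragraph(para, max_tokens=6)
```

Running each paragraph through a pre-pass like this before the packing loop keeps chunks closer to the max_tokens budget.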

Step 3 — Embedding and indexing

python

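import uuid

def embed(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
    )
    return [d.embedding for d in resp.data]

# Create the collection only if it is missing. Recreating it here would
# drop all indexed data every time this module is imported.
if not qdrant.collection_exists(COLLECTION):
    qdrant.create_collection(
        collection_name=COLLECTION,
        # text-embedding-3-large returns 3072-dimensional vectors by default
        vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
    )

def index_doc(doc_id: str, text: str, metadata: dict):
    chunks = chunk_text(text)
    if not chunks:
        return  # nothing to embed for an empty document
    vectors = embed(chunks)
    points = []
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        chunk_id = f"{doc_id}-{i}"
        points.append(
            PointStruct(
                # Qdrant point IDs must be unsigned integers or UUIDs, so
                # derive a deterministic UUID from the readable chunk key.
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, chunk_id)),
                vector=v,
                payload={"text": c, "doc_id": doc_id, "chunk_id": chunk_id, **metadata},
            )
        )
    qdrant.upsert(collection_name=COLLECTION, points=points)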

For production: batch your embedding calls (100-500 chunks per request), handle retries, and persist a content hash so you don't re-embed unchanged chunks.
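As a rough sketch of the batching and content-hash advice, the example below embeds only chunks whose text has not been seen before and sends them in fixed-size batches. The names embed_new_chunks, content_hash, and batched are hypothetical, the in-memory dict stands in for whatever persistent store you use, and a fake embedder replaces the real API call so the example runs offline. Retries are left out.

```python
import hashlib
from typing import Callable, Iterator

def content_hash(text: str) -> str:
    # Stable fingerprint of a chunk's text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def batched(items: list[str], size: int) -> Iterator[list[str]]:
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_new_chunks(
    chunks: list[str],
    seen: dict[str, list[float]],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Return one vector per chunk, calling embed_fn only for unseen text."""
    todo, queued = [], set()
    for c in chunks:
        h = content_hash(c)
        if h not in seen and h not in queued:
            queued.add(h)
            todo.append(c)
    for batch in batched(todo, batch_size):
        for text, vec in zip(batch, embed_fn(batch)):
            seen[content_hash(text)] = vec
    return [seen[content_hash(c)] for c in chunks]

# Fake embedder (assumption, for illustration): records batch sizes and
# returns a one-dimensional "vector" holding the text length.
calls = []
def fake_embed(texts: list[str]) -> list[list[float]]:
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]

store: dict[str, list[float]] = {}
first = embed_new_chunks(["alpha", "beta", "alpha"], store, fake_embed, batch_size=2)
second = embed_new_chunks(["alpha", "gamma"], store, fake_embed, batch_size=2)
```

In the second call only "gamma" reaches the embedder; "alpha" is served from the store. In a real pipeline, embed_fn would be the embed function from this step and the store would live in a database.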

Step 4 — Hybrid retrieval

Pure dense retrieval misses keyword-heavy queries (acronyms, names, codes). Add a sparse component:

python

from rank_bm25 import BM25Okapi

class HybridRetriever:
    def __init__(self):
        self.bm25 = None
        self.chunks = []  # populated from your DB on init

    def fit_bm25(self, chunks):
        # Each chunk is a dict with "id" and "text". The "id" must equal the
        # Qdrant point ID of the same chunk, or fusion cannot merge duplicates.
        self.chunks = chunks
        tokenized = [c["text"].lower().split() for c in chunks]
        self.bm25 = BM25Okapi(tokenized)

    def retrieve(self, query: str, k: int = 20):
        if self.bm25 is None:
            raise RuntimeError("call fit_bm25() before retrieve()")
        # Dense: nearest neighbors by embedding similarity.
        # query_points supersedes the deprecated search() in recent
        # qdrant-client releases.
        q_vec = embed([query])[0]
        dense = qdrant.query_points(
            collection_name=COLLECTION,
            query=q_vec,
            limit=k,
        ).points
        dense_results = [
            {"text": h.payload["text"], "score": h.score, "id": h.id}
            for h in dense
        ]
        # Sparse: BM25 over lowercased, whitespace-split tokens
        sparse_scores = self.bm25.get_scores(query.lower().split())
        top_sparse_idx = sparse_scores.argsort()[::-1][:k]
        sparse_results = [
            {
                "text": self.chunks[i]["text"],
                "score": float(sparse_scores[i]),
                "id": self.chunks[i]["id"],
            }
            for i in top_sparse_idx
        ]
        # Reciprocal rank fusion
        return rrf_fuse(dense_results, sparse_results, k=k)

def rrf_fuse(*result_lists, k: int = 20, rrf_k: int = 60):
    scores = {}
    payload = {}
    for results in result_lists:
        # Ranks are 1-based, as in the usual RRF formula 1 / (rrf_k + rank)
        for rank, r in enumerate(results, start=1):
            scores[r["id"]] = scores.get(r["id"], 0.0) + 1.0 / (rrf_k + rank)
            payload[r["id"]] = r
    ranked = sorted(scores.items(), key=lambda x: -x[1])[:k]
    # Report the fused score; raw cosine and BM25 scores are not comparable
    return [{**payload[i], "score": s} for i, s in ranked]

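Reciprocal rank fusion is simple and robust: it uses only ranks, so cosine similarities and BM25 scores never need to be normalized onto a common scale. In most setups it outperforms naive score blending.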
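To make the arithmetic concrete, here is a small worked example of reciprocal rank fusion on two toy rankings. The document IDs and rankings are made up for illustration, rrf_scores is a hypothetical helper, and ranks are counted from 1.

```python
def rrf_scores(rankings: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank) over every list an ID appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return scores

dense = ["a", "b", "c"]    # hypothetical dense ranking, best first
sparse = ["c", "d", "b"]   # hypothetical BM25 ranking, best first

scores = rrf_scores([dense, sparse])
order = sorted(scores, key=scores.get, reverse=True)

# c: 1/63 + 1/61 ≈ 0.03227   (in both lists)
# b: 1/62 + 1/63 ≈ 0.03200   (in both lists)
# a: 1/61        ≈ 0.01639   (dense only)
# d: 1/62        ≈ 0.01613   (sparse only)
# order == ["c", "b", "a", "d"]
```

Documents that both retrievers agree on ("c" and "b") rise above "a", which tops the dense list but never appears in the sparse one.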

Step 5 — Reranking

The retriever is recall-tuned; the reranker is precision-tuned. A cross-encoder reranker reorders candidates by direct relevance to the query:

python

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

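def rerank(query: str, candidates: list[dict], top_n: int = 5):
    if not candidates:
        return []  # nothing to rerank; skip the API call
    docs = [c["text"] for c in candidates]
    resp = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=docs,
        top_n=top_n,
    )
    # Each result carries the index of the document in the list we sent
    return [candidates[r.index] for r in resp.results]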

If you need to run on-premises, substitute a self-hosted reranker such as BGE-reranker or Jina.
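One way to keep the reranker swappable is to pass the scoring function in as a parameter. The sketch below is an assumption about how you might structure that, not a library API: rerank_local and overlap_scorer are hypothetical names, and the word-overlap scorer is only a toy stand-in for a cross-encoder so the example runs without a model download.

```python
from typing import Callable, Sequence

# A scorer takes (query, document) pairs and returns one relevance score per pair.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank_local(
    query: str,
    candidates: list[dict],
    score_pairs: PairScorer,
    top_n: int = 5,
) -> list[dict]:
    """Order candidates by score_pairs(query, text), highest first."""
    if not candidates:
        return []
    pairs = [(query, c["text"]) for c in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

def overlap_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    # Toy stand-in: counts shared lowercase words. Not a real cross-encoder.
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]

cands = [
    {"id": 1, "text": "Billing and invoices"},
    {"id": 2, "text": "How to reset your API key"},
    {"id": 3, "text": "API rate limits"},
]
top = rerank_local("reset api key", cands, overlap_scorer, top_n=2)
```

With the sentence-transformers package installed, the predict method of a CrossEncoder loaded from a BGE reranker checkpoint takes the same list of pairs and could be passed as score_pairs; that path is an assumption and is not exercised here.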

Step 6 — Grounded generation with citations

python

GROUNDED_PROMPT = """Answer the question using ONLY the provided sources.
Cite each claim with [n] where n is the source number.
If the sources do not contain the answer, say: "I don't have enough information to answer."

Sources:
{sources}

Question: {question}

Answer:"""

def answer(question: str, retriever: HybridRetriever) -> dict:
    candidates = retriever.retrieve(question, k=20)
    top = rerank(question, candidates, top_n=5)
    sources_text = "\n\n".join(
        f"[{i+1}] {c['text']}" for i, c in enumerate(top)
    )
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": GROUNDED_PROMPT.format(
                    sources=sources_text, question=question
                ),
            }
        ],
        temperature=0.0,
    )
    return {
        "answer": resp.choices[0].message.content,
        "sources": [{"id": c["id"], "text": c["text"]} for c in top],
    }

The [n] citation pattern is parser-friendly. Validate citations at the response boundary: every [n] should map to a real source; reject answers with phantom citations.
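A minimal sketch of that validation step, assuming citations appear exactly as [n] with a 1-based source number. The helper name check_citations and the shape of its return value are illustrative choices, not part of any library.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")

def check_citations(answer_text: str, num_sources: int) -> dict:
    """Find [n] markers and flag any that point outside 1..num_sources."""
    cited = sorted({int(n) for n in CITATION_RE.findall(answer_text)})
    phantom = [n for n in cited if not 1 <= n <= num_sources]
    return {"cited": cited, "phantom": phantom, "valid": not phantom}

good = check_citations("Qdrant stores vectors [1]. BM25 is sparse [3].", num_sources=3)
bad = check_citations("The limit is 10 requests [2][7].", num_sources=5)
refusal = check_citations("I don't have enough information to answer.", num_sources=5)
```

Here bad["phantom"] is [7], so that answer would be rejected. A refusal carries no citations and passes; if you want to reject substantive answers that cite nothing, check for an empty "cited" list separately.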

Step 7 — Eval

Build a graded test set of 30-50 questions with expected answers. Score:

Faithfulness: does the answer follow from the cited sources? (LLM-as-judge with a strict rubric)

Recall: for questions whose answer is in the corpus, did we retrieve it? (manually labeled)

Citation accuracy: do [n] citations actually contain the cited claim?

python

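import re

def faithfulness_score(answer: str, sources: list[dict]) -> float:
    # Number the sources the same way the answer prompt does, so the judge
    # can match each [n] citation to its text.
    sources_text = "\n\n".join(
        f"[{i+1}] {s['text']}" for i, s in enumerate(sources)
    )
    judge_prompt = f"""Given the answer and sources, score 0-10 how well each
claim in the answer is supported by the cited sources. Return only the integer.

Answer: {answer}
Sources:
{sources_text}
"""
    resp = oai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0.0,
    )
    # The content can be None or contain extra text; take the first integer.
    raw = resp.choices[0].message.content or ""
    match = re.search(r"\d+", raw)
    if match is None:
        return 0.0  # unparseable judge output counts as a failure
    return min(int(match.group()), 10) / 10.0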

Rerun the eval after every prompt, retriever, or reranker change, and track the scores over time so regressions are visible.
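The recall metric from the list above needs no LLM at all. A minimal sketch, assuming each test case lists the IDs of the chunks a human labeled as containing the answer; recall_at_k, mean_recall, and the test-case format are hypothetical, and a canned lookup stands in for the retriever.

```python
from typing import Callable

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top-k retrieved IDs."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    hits = relevant_ids & set(retrieved_ids[:k])
    return len(hits) / len(relevant_ids)

def mean_recall(
    test_set: list[dict],
    retrieve_ids: Callable[[str], list[str]],
    k: int = 20,
) -> float:
    """Average recall@k over all test cases."""
    scores = [
        recall_at_k(retrieve_ids(case["question"]), set(case["relevant_ids"]), k)
        for case in test_set
    ]
    return sum(scores) / len(scores)

# Hypothetical labeled cases and a canned retriever, for illustration only.
test_set = [
    {"question": "q1", "relevant_ids": ["a", "b"]},
    {"question": "q2", "relevant_ids": ["c"]},
]
canned = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}

score_k2 = mean_recall(test_set, lambda q: canned[q], k=2)
score_k3 = mean_recall(test_set, lambda q: canned[q], k=3)
```

With the retriever from Step 4, retrieve_ids could be a lambda that returns the "id" of each result from retriever.retrieve(question). Comparing recall at a few values of k shows whether missing answers are a retrieval problem or a reranking problem.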

Step 8 — Wrap with FastAPI

python

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
retriever = HybridRetriever()  # call retriever.fit_bm25(chunks) with your data before serving

class Q(BaseModel):
    question: str

@app.post("/ask")
def ask(q: Q):
    return answer(q.question, retriever)

Save the code as main.py and run it with uvicorn main:app --reload (the --reload flag is meant for development only). You now have a working RAG service.

What to do next

Production hardening, in priority order:

Persist BM25 index (don't rebuild on every restart)

Add observability — log query, retrieved IDs, answer, latency

Add caching for repeated queries

Add query rewriting for multi-turn conversations

Add metadata filtering for tenant isolation

Add a fallback when retrieval returns no relevant context
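The observability and caching items can start as a thin wrapper around the answer function. This is a sketch under stated assumptions, not a production design: with_cache_and_logging is a hypothetical name, the cache is an in-process dict with first-in-first-out eviction, and it is never invalidated when the corpus changes.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("rag")

def with_cache_and_logging(
    answer_fn: Callable[[str], dict], max_entries: int = 1000
) -> Callable[[str], dict]:
    """Wrap answer_fn with a small in-memory cache and one log line per query."""
    cache: dict[str, dict] = {}

    def wrapped(question: str) -> dict:
        # Normalize case and whitespace so trivial variants share a cache key.
        key = " ".join(question.lower().split())
        start = time.perf_counter()
        hit = key in cache
        if not hit:
            if len(cache) >= max_entries:
                cache.pop(next(iter(cache)))  # evict the oldest entry (FIFO)
            cache[key] = answer_fn(question)
        result = cache[key]
        latency_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "question=%r cache_hit=%s source_ids=%s latency_ms=%.1f",
            question,
            hit,
            [s["id"] for s in result.get("sources", [])],
            latency_ms,
        )
        return result

    return wrapped

# Stub answer function (assumption) so the example runs without the pipeline.
calls = []
def fake_answer(question: str) -> dict:
    calls.append(question)
    return {"answer": "stub", "sources": [{"id": "doc-0", "text": "stub text"}]}

cached_answer = with_cache_and_logging(fake_answer)
r1 = cached_answer("What is RRF?")
r2 = cached_answer("what is  rrf?")  # same key after normalization
```

The second call is served from the cache, so fake_answer runs once. With the pipeline above, you would wrap a lambda that calls answer(question, retriever) and use the wrapped function inside the /ask handler.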

This stack is what most production RAG systems converge on. The patterns are stable; the components are swappable. Start here, measure, then evolve.

Frequently Asked Questions

Why hybrid retrieval rather than just dense?

Pure dense retrieval misses keyword-heavy queries — acronyms, exact codes, rare terms. Sparse (BM25) catches these. Fusing both via reciprocal rank fusion gives better recall across query types than either alone.

How big should chunks be?

300-500 tokens balances recall and precision for most question-answering use cases. Smaller chunks fragment context; larger chunks reduce retrieval precision. Tune empirically against your eval set.

Do I need a reranker if I have hybrid retrieval?

For most production systems, yes — the retriever is recall-tuned to fetch a wide candidate set; the reranker is precision-tuned to order it. Skipping rerank works for very narrow corpora but typically loses 5-15% answer quality on diverse queries.