Tutorial: Build a Production RAG App in 2 Hours
This tutorial walks through building a production-grade RAG (Retrieval-Augmented Generation) app from scratch in roughly two hours. Not a toy: a system with chunking, hybrid retrieval, reranking, eval, and citations. Code samples are Python with widely used 2026 libraries; substitute whatever you prefer.
What we are building
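A RAG app that answers questions over a corpus of documents (PDFs, HTML, or markdown) with citations. The pipeline, in the order the steps below build it: chunk the documents, embed and index the chunks, retrieve candidates with hybrid (dense plus BM25) search, rerank them, generate a grounded answer with citations, and evaluate the results.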
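Stack choices for this tutorial
The code below uses OpenAI (text-embedding-3-large for embeddings, gpt-4o-mini for generation, gpt-4o as the eval judge), Qdrant as the vector store, rank-bm25 for sparse retrieval, Cohere rerank-english-v3.0 for reranking, and FastAPI with uvicorn for serving.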
Step 1 — Install and set up
    pip install openai qdrant-client cohere fastapi uvicorn \
        pypdf tiktoken rank-bm25 numpy

    import os

    from openai import OpenAI
    from qdrant_client import QdrantClient
    from qdrant_client.http.models import Distance, VectorParams, PointStruct

    oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    qdrant = QdrantClient(host="localhost", port=6333)
    COLLECTION = "docs"

Step 2 — Chunking
Don't dump full documents. Don't chunk on fixed character counts blindly either. Use semantic boundaries with overlap:
    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o-mini")

    def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50):
        # Split on blank lines so chunks follow paragraph boundaries.
        paras = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, buf, buf_len = [], [], 0
        for p in paras:
            tlen = len(enc.encode(p))
            if buf_len + tlen > max_tokens and buf:
                chunks.append("\n\n".join(buf))
                # Carry the last `overlap` tokens into the next chunk.
                tail = enc.decode(enc.encode(chunks[-1])[-overlap:])
                buf, buf_len = [tail], len(enc.encode(tail))
            buf.append(p)
            buf_len += tlen
        if buf:
            chunks.append("\n\n".join(buf))
        return chunks

Tune max_tokens to your retrieval pattern. A range of 300-500 tokens is a reasonable starting point that balances recall against precision for most question-answering use cases.
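One gap worth noting: chunk_text appends whole paragraphs, so a single paragraph longer than max_tokens ends up as one oversized chunk. A minimal sketch of a sliding-window fallback follows. The helper name split_oversized is hypothetical, and the example uses whitespace-separated words as stand-in tokens; in this tutorial you would presumably pass enc.encode(p) and turn each window back into text with enc.decode.

```python
def split_oversized(tokens: list, max_tokens: int = 400, overlap: int = 50) -> list[list]:
    """Split a token list into windows of at most max_tokens tokens.

    Each window after the first starts with the last `overlap` tokens
    of the previous window. (Hypothetical helper, not part of any library.)
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    if not tokens:
        return []
    if len(tokens) <= max_tokens:
        return [tokens]

    step = max_tokens - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the token list.
        if start + max_tokens >= len(tokens):
            break
    return windows


# Stand-in tokens: whitespace-separated words instead of tokenizer IDs.
words = "one two three four five six seven eight nine ten".split()
windows = split_oversized(words, max_tokens=4, overlap=1)
```

Calling it only for paragraphs whose token count exceeds max_tokens keeps the paragraph-boundary behavior for everything else.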
Step 3 — Embedding and indexing
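    import uuid

    def embed(texts: list[str]) -> list[list[float]]:
        resp = oai.embeddings.create(
            model="text-embedding-3-large",
            input=texts,
        )
        return [d.embedding for d in resp.data]

    # Drops and recreates the collection on every run. Recent qdrant-client
    # releases deprecate this call in favor of collection_exists() plus
    # create_collection().
    qdrant.recreate_collection(
        collection_name=COLLECTION,
        # 3072 is the output dimension of text-embedding-3-large.
        vectors_config=VectorParams(size=3072, distance=Distance.COSINE),
    )

    def index_doc(doc_id: str, text: str, metadata: dict):
        chunks = chunk_text(text)
        vectors = embed(chunks)
        points = [
            PointStruct(
                # Qdrant point IDs must be unsigned integers or UUIDs, so derive
                # a deterministic UUID from the doc ID and chunk index.
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}-{i}")),
                vector=v,
                payload={"text": c, "doc_id": doc_id, "chunk_index": i, **metadata},
            )
            for i, (c, v) in enumerate(zip(chunks, vectors))
        ]
        qdrant.upsert(collection_name=COLLECTION, points=points)

For production: batch your embedding calls (100-500 chunks per request), handle retries, and persist a content hash so you don't re-embed unchanged chunks.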
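The three production points above (batching, retries, content hashing) can be sketched together as follows. This is an illustration under stated assumptions, not a drop-in implementation: embed_new_chunks and content_hash are hypothetical names, the seen cache is an in-memory dict standing in for whatever persistent store you use, and embed_fn is any callable with the same shape as the embed function in this step.

```python
import hashlib
import time
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_new_chunks(
    chunks: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    seen: dict[str, list[float]],
    batch_size: int = 100,
    max_retries: int = 3,
    backoff: float = 1.0,
) -> list[list[float]]:
    """Return one vector per chunk, calling embed_fn only for unseen text.

    `seen` maps content hash -> vector and is updated in place
    (assumption: in production this would be a persistent store).
    """
    # Collect chunks whose hash is not cached; the dict also dedupes repeats.
    pending: dict[str, str] = {}
    for chunk in chunks:
        h = content_hash(chunk)
        if h not in seen:
            pending[h] = chunk

    hashes = list(pending)
    for start in range(0, len(hashes), batch_size):
        batch = hashes[start:start + batch_size]
        texts = [pending[h] for h in batch]
        for attempt in range(max_retries):
            try:
                vectors = embed_fn(texts)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(backoff * 2 ** attempt)  # exponential backoff
        for h, v in zip(batch, vectors):
            seen[h] = v

    return [seen[content_hash(chunk)] for chunk in chunks]
```

In index_doc, this would presumably replace the direct embed(chunks) call, for example embed_new_chunks(chunks, embed, seen).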
Step 4 — Hybrid retrieval
Pure dense retrieval misses keyword-heavy queries (acronyms, names, codes). Add a sparse component:
    from rank_bm25 import BM25Okapi

    class HybridRetriever:
        def __init__(self):
            self.bm25 = None
            self.chunks = []  # populated from your DB on init

        def fit_bm25(self, chunks):
            # Each chunk is a dict with "text" and "id". The "id" must match
            # the Qdrant point ID so fusion can merge dense and sparse hits.
            self.chunks = chunks
            tokenized = [c["text"].lower().split() for c in chunks]
            self.bm25 = BM25Okapi(tokenized)

        def retrieve(self, query: str, k: int = 20):
            # Dense
            q_vec = embed([query])[0]
            # Recent qdrant-client releases deprecate search() in favor of
            # query_points(); adjust to the client version you have installed.
            dense = qdrant.search(
                collection_name=COLLECTION,
                query_vector=q_vec,
                limit=k,
            )
            dense_results = [
                {"text": h.payload["text"], "score": h.score, "id": h.id}
                for h in dense
            ]

            # Sparse (fit_bm25 must have been called first)
            sparse_scores = self.bm25.get_scores(query.lower().split())
            top_sparse_idx = sparse_scores.argsort()[::-1][:k]
            sparse_results = [
                {
                    "text": self.chunks[i]["text"],
                    "score": float(sparse_scores[i]),
                    "id": self.chunks[i]["id"],
                }
                for i in top_sparse_idx
            ]

            # Reciprocal rank fusion
            return rrf_fuse(dense_results, sparse_results, k=k)

    def rrf_fuse(*result_lists, k: int = 20, rrf_k: int = 60):
        scores = {}
        payload = {}
        for results in result_lists:
            # Standard RRF: 1 / (rrf_k + rank), with rank starting at 1.
            for rank, r in enumerate(results, start=1):
                scores[r["id"]] = scores.get(r["id"], 0) + 1 / (rrf_k + rank)
                payload[r["id"]] = r
        ranked = sorted(scores.items(), key=lambda x: -x[1])[:k]
        return [payload[i] for i, _ in ranked]

Reciprocal rank fusion is simple and robust: it uses only each result's rank, so cosine similarities and BM25 scores never need to be normalized onto a common scale, which is where naive score blending tends to go wrong.
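To make the fusion arithmetic concrete, here is a small self-contained worked example. It is a sketch on bare document IDs rather than the result dicts used above; rrf_scores is a hypothetical helper name, and ranks start at 1 with the conventional constant of 60.

```python
def rrf_scores(rankings: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank) over every ranking a document appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return scores


dense = ["a", "b", "c"]   # dense retriever's order, best first
sparse = ["c", "a", "d"]  # BM25's order, best first

scores = rrf_scores([dense, sparse])
fused = sorted(scores, key=scores.get, reverse=True)
print(fused)  # ['a', 'c', 'b', 'd']
```

Document "a" wins with 1/61 + 1/62, "c" follows with 1/63 + 1/61, and the documents found by only one retriever ("b" at 1/62, "d" at 1/63) land below both. Appearing in both lists matters more than topping one of them.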
Step 5 — Reranking
The retriever is recall-tuned; the reranker is precision-tuned. A cross-encoder reranker reorders candidates by direct relevance to the query:
    import cohere

    co = cohere.Client(os.environ["COHERE_API_KEY"])

    def rerank(query: str, candidates: list[dict], top_n: int = 5):
        docs = [c["text"] for c in candidates]
        resp = co.rerank(
            model="rerank-english-v3.0",
            query=query,
            documents=docs,
            top_n=top_n,
        )
        return [candidates[r.index] for r in resp.results]

Substitute a self-hosted reranker (BGE-reranker, Jina) if you need on-prem.
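One way to keep the reranker swappable is to depend only on a scoring function that takes a (query, text) pair and returns a relevance score. The sketch below assumes that shape; rerank_with and overlap_score are hypothetical names, and overlap_score is a toy word-overlap stand-in, not a real cross-encoder. A self-hosted model would supply its own scoring function in its place.

```python
from typing import Callable


def rerank_with(
    score_fn: Callable[[str, str], float],
    query: str,
    candidates: list[dict],
    top_n: int = 5,
) -> list[dict]:
    """Score each (query, candidate text) pair and return the top_n candidates."""
    scored = [(score_fn(query, c["text"]), i) for i, c in enumerate(candidates)]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[i] for _, i in scored[:top_n]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer: fraction of query words that appear in the text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


candidates = [
    {"id": 1, "text": "cats sleep a lot"},
    {"id": 2, "text": "qdrant stores vectors"},
    {"id": 3, "text": "qdrant is a database for vectors"},
]
top = rerank_with(overlap_score, "qdrant vectors", candidates, top_n=2)
```

Because the rest of the pipeline only sees a list of candidate dicts coming back, switching between a hosted API and a local model does not touch retrieval or generation code.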
Step 6 — Grounded generation with citations
    GROUNDED_PROMPT = """Answer the question using ONLY the provided sources.
    Cite each claim with [n] where n is the source number.
    If the sources do not contain the answer, say: "I don't have enough information to answer."

    Sources:
    {sources}

    Question: {question}
    Answer:"""

    def answer(question: str, retriever: HybridRetriever) -> dict:
        candidates = retriever.retrieve(question, k=20)
        top = rerank(question, candidates, top_n=5)
        sources_text = "\n\n".join(
            f"[{i+1}] {c['text']}" for i, c in enumerate(top)
        )
        resp = oai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "user",
                    "content": GROUNDED_PROMPT.format(
                        sources=sources_text, question=question
                    ),
                }
            ],
            temperature=0.0,
        )
        return {
            "answer": resp.choices[0].message.content,
            "sources": [{"id": c["id"], "text": c["text"]} for c in top],
        }

The [n] citation pattern is parser-friendly. Validate citations at the response boundary: every [n] should map to a real source; reject answers with phantom citations.
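The validation described above can be sketched with a regular expression. This is one possible shape, assuming sources are numbered from 1 as in the prompt; validate_citations is a hypothetical helper name.

```python
import re

CITATION_RE = re.compile(r"\[(\d+)\]")


def validate_citations(answer_text: str, num_sources: int) -> dict:
    """Find every [n] marker and flag those that point at no real source."""
    cited = sorted({int(n) for n in CITATION_RE.findall(answer_text)})
    # Valid source numbers run from 1 to num_sources inclusive.
    phantom = [n for n in cited if not 1 <= n <= num_sources]
    return {"cited": cited, "phantom": phantom, "ok": not phantom}


report = validate_citations(
    "Paris is the capital [1]. It has a famous tower [3][2].",
    num_sources=2,
)
```

Here report["phantom"] is [3], so the answer would be rejected. An answer with no citations at all passes this check, which keeps the "I don't have enough information" refusal valid; whether to require at least one citation otherwise is a separate policy decision.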
Step 7 — Eval
Build a graded test set of 30-50 questions with expected answers. Score:
Faithfulness: do the [n] citations actually contain the cited claim?

    def faithfulness_score(answer: str, sources: list[dict]) -> float:
        judge_prompt = f"""Given the answer and sources, score 0-10 how well each
    claim in the answer is supported by the cited sources. Return only the integer.
    Answer: {answer}
    Sources: {sources}
    """
        resp = oai.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": judge_prompt}],
            temperature=0.0,
        )
        try:
            # Treat an empty or non-numeric judge reply as a score of 0.
            return float((resp.choices[0].message.content or "").strip()) / 10.0
        except ValueError:
            return 0.0

Run the eval before shipping any prompt, retriever, or reranker change. Track scores over time.
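A minimal harness for running the test set and tracking scores might look like the sketch below. Everything here is an assumption for illustration: run_eval is a hypothetical name, test cases are assumed to be dicts with a "question" key, and the log is a JSON Lines file with one summary per run. It only aggregates the faithfulness score; comparing against expected answers would need its own scorer.

```python
import json
import time
from statistics import mean
from typing import Callable, Optional


def run_eval(
    test_set: list[dict],
    answer_fn: Callable[[str], dict],
    score_fn: Callable[[str, list], float],
    log_path: Optional[str] = None,
    label: str = "",
) -> dict:
    """Score every test question and optionally append a summary to a log file."""
    scores = []
    for case in test_set:
        result = answer_fn(case["question"])
        scores.append(score_fn(result["answer"], result["sources"]))

    summary = {
        "label": label,  # e.g. a prompt version or commit hash
        "n": len(scores),
        "mean_faithfulness": mean(scores) if scores else 0.0,
        "timestamp": time.time(),
    }
    if log_path:
        # One JSON object per line, so runs can be compared over time.
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(summary) + "\n")
    return summary
```

With the functions from this tutorial, a call would presumably look like run_eval(test_set, lambda q: answer(q, retriever), faithfulness_score, log_path="eval_log.jsonl", label="prompt-v2"), where the file name and label are placeholders.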
Step 8 — Wrap with FastAPI
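    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()
    # Initialize with your data: call retriever.fit_bm25(chunks) before serving,
    # otherwise the sparse half of retrieve() has nothing to score against.
    retriever = HybridRetriever()

    class Q(BaseModel):
        question: str

    @app.post("/ask")
    def ask(q: Q):
        return answer(q.question, retriever)

Run with uvicorn main:app --reload (this assumes the code lives in main.py). You now have a working RAG service.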
What to do next
Production hardening, in priority order:
This stack is what most production RAG systems converge on. The patterns are stable; the components are swappable. Start here, measure, then evolve.

