Problem
I work in regulated banking. Hosted LLMs aren’t an option for production knowledge work. Data has to stay on-prem, every answer needs an audit trail, and “trust me, the model knows” doesn’t fly with compliance.
I wanted hands-on familiarity with production RAG patterns for that context: embeddings, semantic search, retrieval ranking, prompt-engineering controls, and LLM evaluation. Reading about it wasn’t enough. I needed to build the whole pipeline end-to-end, on my own infrastructure, and stress-test the parts that matter when an auditor asks “where did this answer come from?”
Approach
Built a self-hosted RAG (Retrieval-Augmented Generation) stack from scratch on local infrastructure.
- Generation. Llama 3 served locally. Modest hardware, but the point was to validate the on-prem pattern, not to scale.
- Retrieval. Qdrant as the vector store. Documents chunked with overlap tuned per document type, embedded, and indexed. Hybrid search (semantic + keyword) so policy numbers and entity terms still matched. Pure-semantic retrieval misses those.
- API layer. FastAPI exposes a clean retrieval-and-answer endpoint. Stateless service, horizontally scalable in principle.
- Evaluation harness. Built a custom eval pipeline alongside the system. Synthetic Q&A sets graded for retrieval quality (did the right chunks come back?) and answer fidelity (did the model use them correctly?). The harness is the part I’m most proud of; it’s what would let you sign off on a production deployment.
- Citations. Every answer cites its source chunks. Non-negotiable for the regulated use case. The portfolio chat you’re using right now is built on the same citation pattern.
Architecture
What I learned
The eval harness is the design lesson I’d take to any production RAG project: you don’t ship retrieval you can’t grade. Hybrid retrieval beating semantic-only for entity-heavy content was the technical surprise. Embeddings alone miss policy numbers and acronyms. And citation rendering, done right, is what converts skeptics; the portfolio chat above this paragraph is the same pattern.
Reflections
- The eval harness paid for itself before I even “deployed” the prototype. Build it before you ship, not after.
- Hybrid retrieval (semantic + keyword) beat semantic-only for entity-heavy content. Policy numbers and acronyms slip through pure embeddings.
- Self-hosted is rarely a real constraint. It’s a feature you can sell to compliance.
- Citations sold the pattern more than answer quality did. If you can show your work, skeptics relax.