RAG Knowledge Assistant (Prototype)

Problem

I work in regulated banking. Hosted LLMs aren’t an option for production knowledge work. Data has to stay on-prem, every answer needs an audit trail, and “trust me, the model knows” doesn’t fly with compliance.

I wanted hands-on familiarity with production RAG patterns for that context: embeddings, semantic search, retrieval ranking, prompt-engineering controls, and LLM evaluation. Reading about it wasn’t enough. I needed to build the whole pipeline end-to-end, on my own infrastructure, and stress-test the parts that matter when an auditor asks “where did this answer come from?”

Approach

Built a self-hosted RAG (Retrieval-Augmented Generation) stack from scratch on local infrastructure.

Generation. Llama 3 served locally. Modest hardware, but the point was to validate the on-prem pattern, not to scale.
Retrieval. Qdrant as the vector store. Documents chunked with overlap tuned per document type, embedded, and indexed. Hybrid search (semantic + keyword) so that policy numbers and entity terms still match; pure-semantic retrieval misses those.
API layer. FastAPI exposes a clean retrieval-and-answer endpoint. Stateless service, horizontally scalable in principle.
Evaluation harness. Built a custom eval pipeline alongside the system. Synthetic Q&A sets graded for retrieval quality (did the right chunks come back?) and answer fidelity (did the model use them correctly?). The harness is the part I’m most proud of; it’s what would let you sign off on a production deployment.
Citations. Every answer cites its source chunks. Non-negotiable for the regulated use case. The portfolio chat you’re using right now is built on the same citation pattern.
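As an illustration of the chunking step described above, here is a minimal sketch of overlapping character windows with settings chosen per document type. The type names, sizes, and overlaps are hypothetical placeholders, not the values used in the prototype, and `Chunk` and `chunk_document` are names invented for this example.

```python
from dataclasses import dataclass

# Hypothetical per-document-type settings as (chunk size, overlap) in characters.
# The write-up only says overlap was tuned per type; these values are made up.
CHUNK_SETTINGS = {
    "policy": (400, 100),
    "faq": (200, 20),
}
DEFAULT_SETTINGS = (300, 50)


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int
    start: int  # character offset into the source document, useful for citations
    text: str


def chunk_document(doc_id: str, text: str, doc_type: str) -> list[Chunk]:
    """Split text into fixed-size windows that share `overlap` characters."""
    size, overlap = CHUNK_SETTINGS.get(doc_type, DEFAULT_SETTINGS)
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than chunk size")
    step = size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append(Chunk(doc_id, index, start, text[start:start + size]))
        if start + size >= len(text):
            break  # this window already reaches the end of the document
    return chunks
```

Keeping the character offset on each chunk is a design choice that pays off later: a citation can point back to an exact span in the source file rather than to the document as a whole.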

If you can’t show an auditor exactly which sentence informed an answer, the system isn’t ready for a regulated environment. Citations are the design constraint, not a feature.
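One way to treat citations as a constraint rather than a feature is to put chunk IDs into the prompt and then reject any answer whose citations do not resolve to a retrieved chunk. The sketch below assumes a bracketed ID convention such as `[kyc-policy:3]`; the prompt wording, the ID format, and the function names are illustrative assumptions, not the prototype's actual implementation.

```python
import re

# Assumed citation format: a chunk ID in square brackets, e.g. [kyc-policy:3].
CITATION = re.compile(r"\[([A-Za-z0-9_.:-]+)\]")


def build_prompt(question: str, chunks: dict[str, str]) -> str:
    """Render retrieved chunks with their IDs so the model can cite them."""
    sources = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer using only the sources below. "
        "After each claim, cite the supporting source ID in square brackets. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\nAnswer:"
    )


def check_citations(answer: str, chunks: dict[str, str]) -> dict:
    """Flag answers that cite nothing or cite IDs that were never retrieved."""
    cited = set(CITATION.findall(answer))
    unknown = sorted(cited - set(chunks))
    return {
        "cited": sorted(cited & set(chunks)),
        "unknown": unknown,
        "ok": bool(cited) and not unknown,
    }
```

A check like this only proves that every cited ID exists; whether the cited chunk actually supports the claim is a separate question for the evaluation harness.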

Architecture

Source Docs (local files) → Chunk & Embed (overlap-tuned) → Qdrant (vector store) → FastAPI (retrieve · rerank) → Llama 3 (generate) → Answer + citations
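For the retrieve and rerank stage in the flow above, the semantic and keyword result lists have to be merged into one ranking. How the prototype merged them is not stated here; the sketch below uses reciprocal rank fusion (RRF) as a stand-in because it needs only ranks, not comparable scores. The chunk IDs and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into a single ranking.

    Each list contributes 1 / (k + rank) for every chunk it contains, so a
    chunk found by only one retriever still surfaces in the fused result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for a query that contains a policy number.
semantic_hits = ["c-general", "c-overview", "c-glossary"]
keyword_hits = ["c-pol-4471", "c-general"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

In this made-up case the chunk containing the exact policy number never appears in the semantic results, yet it ranks second after fusion because the keyword retriever placed it first.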

What I learned

End-to-end: full RAG pipeline running on local infrastructure

100% of answers cite their source chunks (auditable retrieval)

0 bytes of indexed data leave the host machine

The eval harness is the design lesson I’d take to any production RAG project: you don’t ship retrieval you can’t grade. Hybrid retrieval beating semantic-only for entity-heavy content was the technical surprise. Embeddings alone miss policy numbers and acronyms. And citation rendering, done right, is what converts skeptics; the portfolio chat above this paragraph is the same pattern.
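To make "you don't ship retrieval you can't grade" concrete, here is a minimal sketch of the retrieval half of such a harness: each synthetic question carries the chunk IDs that should come back, and the retriever is scored on hit rate and mean reciprocal rank at a cutoff. The metric choice, the names `EvalCase` and `grade_retrieval`, and the cutoff `k=5` are assumptions for illustration; the prototype's actual grading may differ, and answer fidelity would need its own check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_chunk_ids: frozenset[str]  # chunks that should be retrieved


def grade_retrieval(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Score a retriever on hit rate and mean reciprocal rank within top k."""
    if not cases:
        raise ValueError("need at least one evaluation case")
    hits = 0
    reciprocal_rank_total = 0.0
    for case in cases:
        retrieved = retrieve(case.question)[:k]
        ranks = [
            rank
            for rank, chunk_id in enumerate(retrieved, start=1)
            if chunk_id in case.expected_chunk_ids
        ]
        if ranks:
            hits += 1
            reciprocal_rank_total += 1.0 / ranks[0]  # best-ranked expected chunk
    total = len(cases)
    return {"hit_rate": hits / total, "mrr": reciprocal_rank_total / total}
```

Because `retrieve` is just a callable from question to ranked chunk IDs, the same cases can be run against semantic-only and hybrid retrieval to compare them on equal terms.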

Reflections

The eval harness paid for itself before I even “deployed” the prototype. Build it before you ship, not after.
Hybrid retrieval (semantic + keyword) beat semantic-only for entity-heavy content. Policy numbers and acronyms slip through pure embeddings.
Self-hosting is rarely a real constraint. It’s a feature you can sell to compliance.
Citations sold the pattern more than answer quality did. If you can show your work, skeptics relax.
