Gaddiel
Project · 2025 – Present

RAG Knowledge Assistant (Prototype)

A self-hosted RAG knowledge assistant I built to learn end-to-end production AI patterns for regulated, on-prem contexts.

Status: Shipped
Dates: 2025 – Present
Outcome: Working prototype
Role: Self-directed

Problem

I work in regulated banking. Hosted LLMs aren’t an option for production knowledge work. Data has to stay on-prem, every answer needs an audit trail, and “trust me, the model knows” doesn’t fly with compliance.

I wanted hands-on familiarity with production RAG patterns for that context: embeddings, semantic search, retrieval ranking, prompt-engineering controls, and LLM evaluation. Reading about it wasn’t enough. I needed to build the whole pipeline end-to-end, on my own infrastructure, and stress-test the parts that matter when an auditor asks “where did this answer come from?”

Approach

Built a self-hosted RAG (Retrieval-Augmented Generation) stack from scratch on local infrastructure.

  • Generation. Llama 3 served locally. Modest hardware, but the point was to validate the on-prem pattern, not to scale.
  • Retrieval. Qdrant as the vector store. Documents chunked with overlap tuned per document type, embedded, and indexed. Hybrid search (semantic + keyword) so policy numbers and entity terms still matched. Pure-semantic retrieval misses those.
  • API layer. FastAPI exposes a clean retrieval-and-answer endpoint. Stateless service, horizontally scalable in principle.
  • Evaluation harness. Built a custom eval pipeline alongside the system. Synthetic Q&A sets graded for retrieval quality (did the right chunks come back?) and answer fidelity (did the model use them correctly?). The harness is the part I’m most proud of; it’s what would let you sign off on a production deployment.
  • Citations. Every answer cites its source chunks. Non-negotiable for the regulated use case. The chat assistant on this portfolio site is built on the same citation pattern.

If you can’t show an auditor exactly which sentence informed an answer, the system isn’t ready for a regulated environment. Citations are the design constraint, not a feature.
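
One way to treat citations as a constraint rather than a feature is to reject any answer whose citations cannot be traced back to the chunks that were actually retrieved. The sketch below is illustrative only and is not the project's code: the bracketed `[doc:chunk]` marker format, the `check_citations` name, and the sample IDs are all assumptions.

```python
import re

# Assumed citation format: a chunk ID in square brackets, e.g. [handbook:3].
CITATION = re.compile(r"\[([A-Za-z0-9_.:-]+)\]")


def check_citations(answer, retrieved_ids):
    """Report which chunk IDs an answer cites and whether all were retrieved."""
    cited = CITATION.findall(answer)
    unknown = [cid for cid in cited if cid not in retrieved_ids]
    # An answer passes only if it cites something and cites nothing unknown.
    return {"cited": cited, "unknown": unknown, "ok": bool(cited) and not unknown}


# Hypothetical retrieved chunk IDs and model answers.
retrieved = {"handbook:3", "handbook:4", "circular:0"}
good = check_citations(
    "Records are kept for seven years [handbook:3]. "
    "Exceptions need sign-off [circular:0].",
    retrieved,
)
uncited = check_citations("Records are kept for seven years.", retrieved)
invented = check_citations("Records are kept forever [memo:9].", retrieved)
```

A gate like this only proves that the cited chunks exist in the retrieved set. Whether the cited text supports the claim is a separate question for answer-fidelity grading.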

Architecture

Source Docs (local files) → Chunk & Embed (overlap-tuned) → Qdrant (vector store) → FastAPI (retrieve · rerank) → Llama 3 (generate) → Answer + citations
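
The "Chunk & Embed" stage can be sketched as a sliding window that records character offsets, so that a citation can later point back to an exact span of the source document. This is a minimal sketch under stated assumptions, not the project's implementation: the `Chunk` type, the `chunk_text` function, the character-based window, and the size and overlap values are all hypothetical (the original tuned overlap per document type and does not give numbers).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    index: int
    start: int  # character offset where the chunk begins in the source
    end: int    # character offset just past the chunk's last character
    text: str


def chunk_text(doc_id, text, size=200, overlap=50):
    """Split text into fixed-size windows that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(Chunk(doc_id, len(chunks), start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the document
        start += step
    return chunks


# Hypothetical 500-character document.
sample = "".join(str(i % 10) for i in range(500))
chunks = chunk_text("policy-handbook", sample, size=200, overlap=50)
spans = [(c.start, c.end) for c in chunks]
```

Keeping `start` and `end` on every chunk is what makes the final "Answer + citations" step auditable: a cited chunk ID resolves to a document and an exact character range.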

What I learned

  • End-to-end: full RAG pipeline running on local infrastructure
  • 100%: of answers cite their source chunks (auditable retrieval)
  • 0 bytes: of indexed data leave the host machine

The eval harness is the design lesson I’d take to any production RAG project: you don’t ship retrieval you can’t grade. The technical surprise was hybrid retrieval beating semantic-only retrieval for entity-heavy content, because embeddings alone miss policy numbers and acronyms. And citation rendering, done right, is what converts skeptics; the chat assistant on this portfolio site uses the same pattern.
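
Grading retrieval can start with something as small as a hit rate over a synthetic Q&A set: for each question, did at least one expected chunk appear in the top k results? The sketch below assumes that metric and a made-up case format. `hit_rate_at_k`, `fake_retrieve`, and the sample questions and chunk IDs are hypothetical stand-ins, not the harness described above.

```python
def hit_rate_at_k(cases, retrieve, k=5):
    """Fraction of cases where an expected chunk appears in the top k results."""
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        retrieved = retrieve(case["question"])[:k]
        if set(retrieved) & set(case["expected_chunks"]):
            hits += 1
    return hits / len(cases)


# Hypothetical stand-in for the real retriever: question -> ranked chunk IDs.
FAKE_INDEX = {
    "What is the retention period?": ["c7", "c2", "c9"],
    "Who approves exceptions?": ["c4", "c1", "c3"],
}


def fake_retrieve(question):
    return FAKE_INDEX.get(question, [])


# Hypothetical synthetic Q&A cases with the chunks a good retriever should return.
cases = [
    {"question": "What is the retention period?", "expected_chunks": ["c2"]},
    {"question": "Who approves exceptions?", "expected_chunks": ["c8"]},
]

score_at_3 = hit_rate_at_k(cases, fake_retrieve, k=3)
score_at_1 = hit_rate_at_k(cases, fake_retrieve, k=1)
```

Passing the retriever in as a function keeps the metric independent of the vector store, so the same cases can compare a semantic-only retriever against a hybrid one.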

Reflections

  • The eval harness paid for itself before I even “deployed” the prototype. Build it before you ship, not after.
  • Hybrid retrieval (semantic + keyword) beat semantic-only for entity-heavy content. Policy numbers and acronyms slip through pure embeddings.
  • Self-hosted is rarely a real constraint. It’s a feature you can sell to compliance.
  • Citations sold the pattern more than answer quality did. If you can show your work, skeptics relax.
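
The write-up does not say how the semantic and keyword result lists were merged. One common option is reciprocal rank fusion, sketched here purely as an assumption: the `reciprocal_rank_fusion` function, the constant `k=60`, and the chunk IDs are illustrative and not taken from the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs into a single ranking.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for a query that contains a policy number.
semantic_hits = ["chunk-a", "chunk-b", "chunk-c"]  # from vector search
keyword_hits = ["chunk-d", "chunk-a", "chunk-c"]   # from exact-term matching

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Here `chunk-d` is found only by the keyword search, which is the case the reflection describes: an exact policy number that embeddings alone would miss still reaches the fused list.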