Large language models are knowledgeable but not omniscient. They cannot answer questions about your company's internal documentation, your product catalog, or your customer support history. They hallucinate when they do not know the answer, confidently generating plausible but incorrect information.
Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your actual data. Instead of asking the model "What is our refund policy?", you first retrieve the relevant documentation, then ask the model to answer based on that documentation. The model generates a response using your data as context, not its training data.
FLIN implements RAG as a composable pipeline: retrieve relevant documents with semantic search, rerank them with a cross-encoder for precision, feed them to the LLM as context, and attribute the answer to its sources.
## The RAG Pipeline
```
fn answer_question(question) {
  // Step 1: Retrieve relevant documents
  docs = search question in Document by content limit 20

  // Step 2: Rerank for precision
  ranked = rerank(question, docs, field: "content", limit: 5)

  // Step 3: Build context
  context = ranked.map(d => d.content).join("\n\n---\n\n")

  // Step 4: Generate answer with attribution
  answer = ai_chat([
    { role: "system", content: "Answer questions based ONLY on the provided context. Cite sources by title. If the context doesn't contain the answer, say so." },
    { role: "user", content: "Context:\n" + context + "\n\nQuestion: " + question }
  ])

  { answer: answer, sources: ranked.map(d => { title: d.title, id: d.id }) }
}
```
Four steps. Retrieve, rerank, generate, attribute. Each step is a standard FLIN function call.
## Step 1: Semantic Retrieval
The first step retrieves candidate documents using the search keyword:
```
docs = search question in Document by content limit 20
```

This returns the 20 most semantically similar documents. The limit is intentionally generous -- we retrieve more candidates than we need because the reranking step will select the most relevant ones. Retrieving 20 and reranking to 5 is more accurate than retrieving 5 directly.
The retrieval step uses the HNSW index (covered in article 117), which runs in under 5 milliseconds even for large document collections.
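Under the hood, the search keyword is nearest-neighbor search over query and document embeddings; the HNSW index approximates the exhaustive scan. A brute-force Python sketch of the idea, with toy 2-D vectors standing in for real embeddings (`cosine` and `retrieve` are illustrative names, not FLIN built-ins):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, limit=20):
    # Rank every document by similarity to the query, keep the top `limit`.
    # HNSW avoids this O(n) scan by navigating a layered proximity graph.
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:limit]]
```

The brute-force scan is what HNSW approximates: same ranking criterion, but logarithmic rather than linear search time.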
## Step 2: Reranking
Semantic search uses bi-encoder models that embed the query and documents independently. This is fast (one embedding per query) but imprecise -- the query and document embeddings exist in the same vector space but are computed separately.
Reranking uses a cross-encoder model that scores query-document pairs together. This is slower (one inference per pair) but more accurate because the model can attend to the interaction between query and document.
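The two-stage retrieve-then-rerank pattern can be sketched generically. In this Python sketch, `bi_score` and `cross_score` are placeholder scoring functions (not FLIN APIs): the cheap one narrows the corpus, the expensive one reorders only the survivors:

```python
def rerank_pipeline(query, docs, bi_score, cross_score, k_retrieve=20, k_final=5):
    # Stage 1: cheap bi-encoder scores narrow the whole corpus to k_retrieve
    # candidates (in practice, document embeddings are precomputed and indexed).
    candidates = sorted(docs, key=lambda d: bi_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: the expensive cross-encoder scores only the candidate pairs.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k_final]
```

The cost structure is the point: stage 1 is one comparison per document against precomputed embeddings; stage 2 runs one model inference per candidate, so it only ever sees k_retrieve documents.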
```
ranked = rerank(question, docs, field: "content", limit: 5)
```

The rerank() function:
1. Takes the query and candidate documents.
2. Scores each query-document pair with a cross-encoder.
3. Returns the top N documents sorted by relevance score.
```rust
pub fn rerank(
    query: &str,
    documents: &[Entity],
    field: &str,
    limit: usize,
) -> Vec<RankedEntity> {
    let mut scored: Vec<(usize, f32)> = documents.iter()
        .enumerate()
        .map(|(i, doc)| {
            let text = doc.get_text(field);
            let score = cross_encoder_score(query, text);
            (i, score)
        })
        .collect();

    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(limit);

    scored.iter()
        .map(|(i, score)| RankedEntity {
            entity: documents[*i].clone(),
            score: *score,
        })
        .collect()
}
```
The cross-encoder is a small model (typically 50-100 MB) that runs locally via ONNX Runtime. Scoring 20 documents takes approximately 100 milliseconds -- acceptable for a question-answering interface where users expect a brief delay.
## Step 3: Context Building
The reranked documents are assembled into a context string that will be sent to the LLM:
```
context = ranked.map(d => d.content).join("\n\n---\n\n")
```

For longer documents, you may want to extract only the most relevant portions:
```
context = ranked.map(d => {
  // Extract the 500 characters around the most relevant passage
  passage = extract_relevant_passage(d.content, question, 500)
  "[Source: " + d.title + "]\n" + passage
}).join("\n\n---\n\n")
```

The extract_relevant_passage() function finds the passage within the document that is most similar to the query and returns a window of text around it.
## Step 4: Generation with Attribution
The LLM receives the context and the question, and generates an answer:
```
answer = ai_chat([
  { role: "system", content: "Answer questions based ONLY on the provided context. " +
    "Cite sources by title using [Source: title] format. " +
    "If the context doesn't contain the answer, say 'I don't have enough information to answer this question.'" },
  { role: "user", content: "Context:\n" + context + "\n\nQuestion: " + question }
])
```

The system prompt constrains the LLM to:

1. Only use information from the provided context.
2. Cite sources by title.
3. Explicitly state when the answer is not in the context.
This grounding sharply reduces hallucination: the LLM is instructed to answer only from the retrieved documents, and to refuse explicitly when they do not contain the answer.
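Because the refusal string is fixed by the system prompt, callers can detect it and fall back, for example to a broader retrieval. A minimal Python sketch (is_grounded_answer is a hypothetical helper, not part of FLIN):

```python
# The exact refusal string mandated by the system prompt above.
REFUSAL = "I don't have enough information to answer this question."

def is_grounded_answer(answer):
    # Returns False when the model declined for lack of context,
    # signaling the caller to retry with a larger retrieval limit.
    return REFUSAL not in answer
```

Pinning the refusal to an exact string is what makes this check reliable; a free-form "sorry, I'm not sure" would be much harder to detect.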
## Source Attribution
The response includes the source documents so the user can verify the answer:
```
{
  answer: "Our refund policy allows returns within 30 days of purchase [Source: Refund Policy v2.1]. For digital products, refunds are processed within 5 business days [Source: Digital Products FAQ].",
  sources: [
    { title: "Refund Policy v2.1", id: 42 },
    { title: "Digital Products FAQ", id: 78 },
    { title: "Terms of Service", id: 3 },
    { title: "Customer Support Guide", id: 91 },
    { title: "Payment Processing", id: 55 }
  ]
}
```

The frontend can render source links that let users click through to the original documents:
<div class="answer">
<p>{answer_data.answer}</p>