Large language models are knowledgeable but not omniscient. They cannot answer questions about your company's internal documentation, your product catalog, or your customer support history. They hallucinate when they do not know the answer, confidently generating plausible but incorrect information.
Retrieval-Augmented Generation (RAG) solves this by grounding LLM responses in your actual data. Instead of asking the model "What is our refund policy?", you first retrieve the relevant documentation, then ask the model to answer based on that documentation. The model generates a response using your data as context, not its training data.
FLIN implements RAG as a composable pipeline: retrieve relevant documents with semantic search, rerank them with a cross-encoder for precision, feed them to the LLM as context, and attribute the answer to its sources.
## The RAG Pipeline
```
fn answer_question(question) {
  // Step 1: Retrieve relevant documents
  docs = search question in Document by content limit 20

  // Step 2: Rerank for precision
  ranked = rerank(question, docs, field: "content", limit: 5)

  // Step 3: Build context
  context = ranked.map(d => d.content).join("\n\n---\n\n")

  // Step 4: Generate answer with attribution
  answer = ai_chat([
    { role: "system", content: "Answer questions based ONLY on the provided context. Cite sources by title. If the context doesn't contain the answer, say so." },
    { role: "user", content: "Context:\n" + context + "\n\nQuestion: " + question }
  ])

  { answer: answer, sources: ranked.map(d => { title: d.title, id: d.id }) }
}
```
Four steps. Retrieve, rerank, generate, attribute. Each step is a standard FLIN function call.
## Step 1: Semantic Retrieval
The first step retrieves candidate documents using the search keyword:
```
docs = search question in Document by content limit 20
```

This returns the 20 most semantically similar documents. The limit is intentionally generous -- we retrieve more candidates than we need because the reranking step will select the most relevant ones. Retrieving 20 and reranking to 5 is more accurate than retrieving 5 directly.
The retrieval step uses the HNSW index (covered in article 117), which runs in under 5 milliseconds even for large document collections.
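Under the hood, the search keyword is nearest-neighbor search over query and document embeddings; the HNSW index approximates the exhaustive scan. A brute-force Python sketch of the idea, with toy 2-D vectors standing in for real embeddings (`cosine` and `retrieve` are illustrative names, not FLIN built-ins):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, doc_vecs, limit=20):
    # Rank every document by similarity to the query, keep the top `limit`.
    # HNSW avoids this O(n) scan by navigating a layered proximity graph.
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]),
                    reverse=True)
    return [i for i, _ in scored[:limit]]
```

The brute-force scan is what HNSW approximates: same ranking criterion, but logarithmic rather than linear search time.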
## Step 2: Reranking
Semantic search uses bi-encoder models that embed the query and documents independently. This is fast (one embedding per query) but imprecise -- the query and document embeddings exist in the same vector space but are computed separately.
Reranking uses a cross-encoder model that scores query-document pairs together. This is slower (one inference per pair) but more accurate because the model can attend to the interaction between query and document.
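The two-stage retrieve-then-rerank pattern can be sketched generically. In this Python sketch, `bi_score` and `cross_score` are placeholder scoring functions (not FLIN APIs): the cheap one narrows the corpus, the expensive one reorders only the survivors:

```python
def rerank_pipeline(query, docs, bi_score, cross_score, k_retrieve=20, k_final=5):
    # Stage 1: cheap bi-encoder scores narrow the whole corpus to k_retrieve
    # candidates (in practice, document embeddings are precomputed and indexed).
    candidates = sorted(docs, key=lambda d: bi_score(query, d), reverse=True)[:k_retrieve]
    # Stage 2: the expensive cross-encoder scores only the candidate pairs.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:k_final]
```

The cost structure is the point: stage 1 is one comparison per document against precomputed embeddings; stage 2 runs one model inference per candidate, so it only ever sees k_retrieve documents.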
```
ranked = rerank(question, docs, field: "content", limit: 5)
```

The rerank() function:
1. Takes the query and candidate documents.
2. Scores each query-document pair with a cross-encoder.
3. Returns the top N documents sorted by relevance score.
```rust
pub fn rerank(
    query: &str,
    documents: &[Entity],
    field: &str,
    limit: usize,
) -> Vec<RankedEntity> {
    let mut scored: Vec<(usize, f32)> = documents.iter()
        .enumerate()
        .map(|(i, doc)| {
            let text = doc.get_text(field);
            let score = cross_encoder_score(query, text);
            (i, score)
        })
        .collect();

    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(limit);

    scored.iter()
        .map(|(i, score)| RankedEntity {
            entity: documents[*i].clone(),
            score: *score,
        })
        .collect()
}
```
The cross-encoder is a small model (typically 50-100 MB) that runs locally via ONNX Runtime. Scoring 20 documents takes approximately 100 milliseconds -- acceptable for a question-answering interface where users expect a brief delay.
## Step 3: Context Building
The reranked documents are assembled into a context string that will be sent to the LLM:
```
context = ranked.map(d => d.content).join("\n\n---\n\n")
```

For longer documents, you may want to extract only the most relevant portions:
```
context = ranked.map(d => {
  // Extract the 500 characters around the most relevant passage
  passage = extract_relevant_passage(d.content, question, 500)
  "[Source: " + d.title + "]\n" + passage
}).join("\n\n---\n\n")
```

The extract_relevant_passage() function finds the passage within the document that is most similar to the query and returns a window of text around it.
## Step 4: Generation with Attribution
The LLM receives the context and the question, and generates an answer:
```
answer = ai_chat([
  { role: "system", content: "Answer questions based ONLY on the provided context. " +
    "Cite sources by title using [Source: title] format. " +
    "If the context doesn't contain the answer, say 'I don't have enough information to answer this question.'" },
  { role: "user", content: "Context:\n" + context + "\n\nQuestion: " + question }
])
```

The system prompt constrains the LLM to:

1. Only use information from the provided context.
2. Cite sources by title.
3. Explicitly state when the answer is not in the context.
This grounding sharply reduces hallucination: the LLM is instructed to answer only from the retrieved documents, and to refuse explicitly when they do not contain the answer.
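Because the refusal string is fixed by the system prompt, callers can detect it and fall back, for example to a broader retrieval. A minimal Python sketch (is_grounded_answer is a hypothetical helper, not part of FLIN):

```python
# The exact refusal string mandated by the system prompt above.
REFUSAL = "I don't have enough information to answer this question."

def is_grounded_answer(answer):
    # Returns False when the model declined for lack of context,
    # signaling the caller to retry with a larger retrieval limit.
    return REFUSAL not in answer
```

Pinning the refusal to an exact string is what makes this check reliable; a free-form "sorry, I'm not sure" would be much harder to detect.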
## Source Attribution
The response includes the source documents so the user can verify the answer:
```
{
  answer: "Our refund policy allows returns within 30 days of purchase [Source: Refund Policy v2.1]. For digital products, refunds are processed within 5 business days [Source: Digital Products FAQ].",
  sources: [
    { title: "Refund Policy v2.1", id: 42 },
    { title: "Digital Products FAQ", id: 78 },
    { title: "Terms of Service", id: 3 },
    { title: "Customer Support Guide", id: 91 },
    { title: "Payment Processing", id: 55 }
  ]
}
```

The frontend can render source links that let users click through to the original documents:
<div class="answer">
<p>{answer_data.answer}</p>