By Thales & Claude -- CEO & AI CTO, ZeroSuite, Inc.
A Terminale student uploads a 40-page physics textbook chapter on electromagnetism. Two minutes later, she asks: "Explain the relationship between Faraday's law and Lenz's law using my document." The AI retrieves the three most relevant passages from her uploaded document, cites the specific sections, and constructs an explanation that references her textbook's notation and examples. She is not studying from a generic AI response. She is studying from her own material, augmented by AI understanding.
A chartered accountant uploads a 200-page OHADA Uniform Act on accounting law. He asks: "What are the provisions regarding consolidated financial statements for groups with subsidiaries in multiple OHADA member states?" The AI searches the document, retrieves Articles 74 through 82, and produces a structured summary with specific article citations.
This is retrieval-augmented generation -- RAG. The AI does not hallucinate answers from training data. It searches the user's own documents, retrieves relevant passages, and grounds its response in the retrieved content. Building a RAG pipeline that works reliably for both a high school student's textbook and a legal professional's 200-page statute required solving problems across the entire document lifecycle: ingestion, processing, chunking, embedding, storage, retrieval, and reranking.
---
Why pgvector, Not a Dedicated Vector Database
The first architectural decision was where to store embeddings. The obvious choice in 2025-2026 is a dedicated vector database: Pinecone, Weaviate, Qdrant, Milvus. These are purpose-built for similarity search and offer features like filtered search, namespace isolation, and automatic index optimization.
We chose pgvector -- the PostgreSQL extension for vector similarity search. This means our embeddings live in the same PostgreSQL 17 database as our users, conversations, and files. No separate service, no separate connection pool, no separate backup strategy, no separate failure mode.
The reasons are practical, not ideological.
First, operational simplicity. Deblo runs on a single Hetzner server. Every additional service is another process to monitor, another set of credentials to manage, another potential point of failure at 2 AM when the server needs attention. PostgreSQL is already running. Adding pgvector is a single CREATE EXTENSION vector; command. The marginal operational cost is zero.
Second, transactional consistency. When a user uploads a document, we create an UploadedFile record, process the document, create DocumentChunk records with embeddings, and update the file's processing status -- all within a single database transaction. If the chunking fails halfway through, everything rolls back cleanly. With a separate vector database, we would need distributed transaction coordination or accept eventual consistency.
Third, query flexibility. pgvector embeddings live in regular PostgreSQL tables. We can join them with user tables, filter by conversation ID, combine vector similarity with full-text search, and use standard SQL analytics. A query like "find the most relevant chunks from documents uploaded by this user in the last 30 days" is a single SQL statement, not an API call to one service followed by a join against another.
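As a concrete illustration, here is roughly what that "last 30 days" query could look like as one SQL statement. This is a sketch, not code from the Deblo codebase: the table and column names follow the schema described later in this post, `uf.created_at` is an assumed column on the files table, and `:user_id` / `:query_embedding` are bound parameters.

```sql
-- Illustrative only: vector similarity plus relational filters in one query.
SELECT dc.content,
       dc.embedding <=> :query_embedding AS distance  -- cosine distance
FROM document_chunks dc
JOIN uploaded_files uf ON uf.id = dc.file_id
WHERE uf.user_id = :user_id
  AND uf.created_at > now() - interval '30 days'
ORDER BY distance
LIMIT 10;
```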
The tradeoff is performance at scale. Pinecone handles billions of vectors. pgvector starts to degrade beyond a few million vectors per table without careful index tuning. For Deblo's current scale -- thousands of users, each with a handful of documents -- pgvector is more than sufficient. If we reach millions of documents, we can migrate to a dedicated vector store. But premature infrastructure complexity has killed more startups than premature optimization.
---
The Document Processing Pipeline
When a user uploads a file, it enters a four-stage pipeline: extraction, OCR (if needed), chunking, and embedding.
```
Upload -> Text Extraction -> OCR (images/scanned PDFs) -> Semantic Chunking -> Embedding -> Store
```

The pipeline handles four file types:
- PDF: Primary extraction with PyMuPDF, a Python PDF parser running server-side. If the extracted text is sparse (< 100 characters per page, indicating a scanned document), the PDF is routed through OCR.
- Images (PNG, JPG, WEBP): Always routed through OCR. No text extraction is possible without it.
- DOCX: Converted to plain text using mammoth, which preserves paragraph structure and headings while stripping formatting.
- Plain text (TXT, MD, CSV): Used directly, no extraction needed.
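The sparse-text heuristic that routes scanned PDFs to OCR is a few lines of code. This is a sketch of the rule described above, not the actual Deblo function:

```python
# Routing heuristic for PDFs: a file whose extracted text averages fewer
# than 100 characters per page is almost certainly a raster scan, so it
# goes through OCR instead. Function and constant names are illustrative.
SPARSE_CHARS_PER_PAGE = 100


def needs_ocr(extracted_text: str, page_count: int) -> bool:
    """Return True when a PDF looks scanned rather than text-based."""
    if page_count <= 0:
        return True  # Nothing extractable; let OCR try
    return len(extracted_text) / page_count < SPARSE_CHARS_PER_PAGE
```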
OCR is handled by two providers. The primary is Mistral OCR (via the Mistral API's document understanding endpoint), which produces high-quality structured text from scanned documents and images. The fallback is Replicate's OCR models, used when the Mistral API is unavailable or rate-limited. Both providers return plain text with paragraph boundaries preserved.
The extraction stage is the most failure-prone part of the pipeline. PDFs are a notoriously inconsistent format -- a "PDF" might be a vector document with extractable text, a raster scan of a printed page, a mix of both, or a corrupted file that crashes the parser. We handle this with defensive coding: every extraction is wrapped in a try/except, with graceful degradation to OCR if text extraction fails.
---
Semantic Chunking
After text extraction, the document is split into chunks for embedding. The chunking strategy determines retrieval quality more than any other component in the pipeline.
The naive approach is fixed-window chunking: split the text every N tokens with M tokens of overlap. This is fast and predictable but produces poor chunks. A fixed window might split a paragraph in the middle of a sentence, separate a theorem from its proof, or combine the end of one section with the beginning of an unrelated one.
We use semantic chunking via the Datalab API. Datalab analyzes the document's structure -- headings, paragraph boundaries, topic transitions -- and splits at semantically meaningful points. A chunk boundary falls between paragraphs, between sections, or at topic transitions. Each chunk contains a complete thought or a cohesive unit of information.
The chunking parameters:
- Target chunk size: 512 tokens (approximately 2,000 characters in French/English)
- Maximum chunk size: 1,024 tokens
- Minimum chunk size: 64 tokens (prevents degenerate single-sentence chunks)
- Overlap: 10% of chunk size at boundaries (approximately 50 tokens)
The overlap ensures that information near chunk boundaries is not lost. If a key sentence falls at the boundary between two chunks, it appears in both, ensuring that a similarity search can find it regardless of which chunk is retrieved.
When the Datalab API is unavailable, we fall back to a local chunking algorithm that splits on paragraph boundaries (double newlines) and merges small paragraphs to meet the minimum chunk size. This fallback produces adequate but not optimal chunks.
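The fallback can be sketched in a few lines. This is a simplified version of the approach described above, with sizes measured in characters rather than tokens for clarity (roughly 4 characters per token; the thresholds mirror the 64/1,024-token parameters):

```python
# Fallback chunker sketch: split on blank lines, accumulate paragraphs
# until the next one would overflow the maximum, and merge a degenerately
# small trailing chunk into its predecessor.
MIN_CHUNK_CHARS = 256    # ~64 tokens
MAX_CHUNK_CHARS = 4096   # ~1024 tokens


def fallback_chunk(text: str) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}".strip()
        if len(candidate) <= MAX_CHUNK_CHARS:
            current = candidate  # Keep merging small paragraphs
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    # Avoid a degenerate final chunk below the minimum size
    if len(chunks) >= 2 and len(chunks[-1]) < MIN_CHUNK_CHARS:
        chunks[-2] = chunks[-2] + "\n\n" + chunks[-1]
        chunks.pop()
    return chunks
```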
---
The DocumentChunk Model
Each chunk is stored with its embedding vector, metadata, and a reference to the source file:
```python
from uuid import uuid4

from pgvector.sqlalchemy import Vector
from sqlalchemy import Column, DateTime, ForeignKey, Index, Integer, Text, func
from sqlalchemy.dialects.postgresql import JSONB, UUID
from sqlalchemy.orm import relationship


class DocumentChunk(Base):
    __tablename__ = "document_chunks"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid4)
    file_id = Column(
        UUID(as_uuid=True),
        ForeignKey("uploaded_files.id", ondelete="CASCADE"),
        nullable=False,
    )
    chunk_index = Column(Integer, nullable=False)  # Order within the document

    content = Column(Text, nullable=False)  # The chunk text
    embedding = Column(Vector(1024), nullable=True)  # bge-m3 produces 1024-dim vectors

    # Metadata for filtering and context. SQLAlchemy reserves the attribute
    # name "metadata" on declarative models, so the Python attribute is
    # chunk_metadata while the database column is still named "metadata".
    # Example: {"page": 12, "section": "Chapter 3", "heading": "Loi de Faraday"}
    chunk_metadata = Column("metadata", JSONB, nullable=True, default=dict)

    token_count = Column(Integer, nullable=True)
    created_at = Column(DateTime(timezone=True), server_default=func.now())

    # Relationships
    file = relationship("UploadedFile", back_populates="chunks")

    # Index for vector similarity search
    __table_args__ = (
        Index(
            "ix_document_chunks_embedding",
            embedding,
            postgresql_using="ivfflat",
            postgresql_with={"lists": 100},
            postgresql_ops={"embedding": "vector_cosine_ops"},
        ),
    )
```
The Vector(1024) column stores the embedding as a native PostgreSQL vector type. The ivfflat index with 100 lists provides approximate nearest-neighbor search that is fast enough for our scale (sub-50ms for searches across 100,000 chunks) while maintaining high recall.
The metadata JSONB column stores structural information extracted during chunking: page number, section heading, chapter title. This metadata is returned alongside search results, enabling the AI to cite specific locations: "According to page 42, section 3.2 of your document..."
The CASCADE delete on file_id ensures that when a user deletes an uploaded file, all associated chunks and embeddings are automatically removed. No orphaned vectors, no manual cleanup.
---
Embedding Generation
Embeddings are generated using BAAI/bge-m3 via the OpenRouter API. bge-m3 is a multilingual embedding model that handles French, English, and Arabic -- the three primary languages of Deblo's user base -- with equal quality. It produces 1024-dimensional vectors.
```python
import httpx


async def generate_embedding(
    text: str,
    model: str = "baai/bge-m3",
) -> list[float]:
    """Generate an embedding vector for the given text.

    Uses OpenRouter's embedding API with bge-m3 as primary model
    and Mistral Embed as fallback.
    """
    # Truncate to model's max input (8192 tokens ~ 32K chars)
    if len(text) > 32000:
        text = text[:32000]

    try:
        async with httpx.AsyncClient(timeout=30) as client:
            response = await client.post(
                "https://openrouter.ai/api/v1/embeddings",
                headers={
                    "Authorization": f"Bearer {settings.OPENROUTER_API_KEY}",
                    "Content-Type": "application/json",
                },
                json={
                    "model": model,
                    "input": text,
                },
            )
            response.raise_for_status()
            data = response.json()
            return data["data"][0]["embedding"]
    except (httpx.HTTPError, KeyError, IndexError):
        # Fallback to Mistral Embed
        if model != "mistralai/mistral-embed":
            return await generate_embedding(text, model="mistralai/mistral-embed")
        raise
```
The fallback to mistralai/mistral-embed handles OpenRouter outages or rate limits on the bge-m3 model. Mistral Embed produces 1024-dimensional vectors as well, so the vectors are dimensionally compatible. However, embeddings from different models are not directly comparable -- a bge-m3 embedding and a Mistral Embed embedding for the same text will not be similar in vector space. We track which model generated each embedding in the chunk metadata and ensure that search queries use the same model as the stored embeddings.
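Keeping query and index in the same vector space reduces to recording the model at write time and reusing it at query time. A minimal sketch, assuming the model name is stored under an `embedding_model` key in the chunk metadata (the key name is an assumption):

```python
# Embed the query with whichever model embedded the stored chunks;
# vectors from different models are not comparable.
PRIMARY_EMBED_MODEL = "baai/bge-m3"


def query_model_for(chunk_metadata: dict) -> str:
    """Return the model to use for the query embedding."""
    return chunk_metadata.get("embedding_model", PRIMARY_EMBED_MODEL)
```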
In practice, the fallback triggers rarely (< 1% of embedding requests). When it does, the affected chunks are flagged for re-embedding with the primary model during the next off-peak period.
---
Semantic Search With Reranking
When the AI needs to search a user's documents -- either because the user explicitly asks or because the AI decides retrieval is needed -- it calls the search_user_files tool. The search proceeds in two stages: vector similarity retrieval followed by reranking.
```python
from uuid import UUID

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import contains_eager


async def search_documents(
    query: str,
    user_id: UUID,
    db: AsyncSession,
    top_k: int = 10,
    rerank_top_k: int = 3,
    file_ids: list[UUID] | None = None,
) -> list[dict]:
    """Semantic search across user's document chunks.

    Stage 1: pgvector cosine similarity -> top_k candidates
    Stage 2: Mistral Reranker -> rerank_top_k final results
    """
    # Generate query embedding
    query_embedding = await generate_embedding(query)

    # Stage 1: Vector similarity search
    stmt = (
        select(
            DocumentChunk,
            DocumentChunk.embedding.cosine_distance(query_embedding).label("distance"),
        )
        .join(UploadedFile, DocumentChunk.file_id == UploadedFile.id)
        # Populate chunk.file from the join so the filename access below
        # does not trigger a lazy load inside the async session.
        .options(contains_eager(DocumentChunk.file))
        .where(UploadedFile.user_id == user_id)
        .where(DocumentChunk.embedding.isnot(None))
    )

    if file_ids:
        stmt = stmt.where(DocumentChunk.file_id.in_(file_ids))

    stmt = stmt.order_by("distance").limit(top_k)

    result = await db.execute(stmt)
    candidates = result.all()

    if not candidates:
        return []

    # Stage 2: Rerank with Mistral Reranker
    reranked = await rerank_results(
        query=query,
        documents=[c.DocumentChunk.content for c in candidates],
        top_k=rerank_top_k,
    )

    # Build final results with metadata
    final = []
    for idx, score in reranked:
        chunk = candidates[idx].DocumentChunk
        final.append({
            "content": chunk.content,
            "file_id": str(chunk.file_id),
            "chunk_index": chunk.chunk_index,
            # chunk_metadata maps to the "metadata" JSONB column
            "metadata": chunk.chunk_metadata,
            "similarity_score": 1 - candidates[idx].distance,
            "rerank_score": score,
            "filename": chunk.file.filename,
        })

    return final
```
The two-stage approach is standard in production RAG systems, but the reasons are worth explaining.
Vector similarity search is fast but imprecise. Cosine similarity between embeddings captures semantic relatedness but not relevance to the specific query. A chunk about "Lenz's law" and a chunk about "Ohm's law" might have similar embeddings (both are physics laws about electromagnetism) but only one is relevant to a question about Lenz's law.
The reranker -- Mistral Reranker, a cross-encoder model -- takes the query and each candidate document as input and produces a relevance score. Cross-encoders are more accurate than bi-encoders (embedding similarity) because they jointly attend to the query and document tokens. The tradeoff is speed: a cross-encoder takes 50-200ms per query-document pair, making it impractical for searching thousands of chunks. But for reranking 10 candidates, the total reranking latency is under 500ms.
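The narrow-then-rerank shape is easy to see in isolation. A toy sketch, with `score_pair` standing in for the cross-encoder call (the production system calls Mistral's reranker; this function is illustrative, not the actual `rerank_results`):

```python
def rerank(query: str, candidates: list[str], score_pair, top_k: int = 3):
    """Score every (query, candidate) pair with a cross-encoder-style
    scorer and return (index, score) tuples, best first. Cost scales
    with len(candidates) -- which is why only the top-10 vector hits
    ever reach this stage."""
    scored = [(i, score_pair(query, doc)) for i, doc in enumerate(candidates)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```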
The combination gives us the best of both worlds: fast initial retrieval via pgvector (< 50ms for 100K chunks) followed by precise reranking via Mistral (< 500ms for 10 candidates). The final 3 chunks returned to the AI are the most relevant passages in the user's document collection.
---
The read_user_file Tool
Not all document access goes through vector search. Sometimes the AI needs to read a file directly -- to answer "What is on page 5 of my document?" or to process an uploaded image. The read_user_file tool handles direct file access:
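A minimal sketch of the truncation step it applies, using the rough 4-characters-per-token heuristic (the helper name is illustrative; production code would use a real tokenizer):

```python
def truncate_to_token_budget(text: str, max_tokens: int = 8_000) -> str:
    """Crude budget enforcement: ~4 characters per token. The cutoff
    behavior matches what a tokenizer-based implementation would do."""
    max_chars = max_tokens * 4
    return text if len(text) <= max_chars else text[:max_chars]
```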
The tool retrieves the file's extracted text (or base64-encoded image data for images) and truncates it to 8,000 tokens. This truncation is essential. Without it, a 200-page document would consume the entire LLM context window, leaving no room for the conversation history or the AI's response. The 8K limit is a balance between providing enough context for meaningful analysis and preserving context space for the rest of the conversation.
For files that exceed 8K tokens, the AI is instructed to use search_user_files instead -- searching for specific passages rather than loading the entire document. This guidance is in the system prompt: "For large documents (> 20 pages), prefer searching for specific sections rather than reading the entire file."
---
Context Compression
RAG solves the "find relevant information" problem but creates a new one: context overflow. A Deblo conversation can include the system prompt (2K tokens), conversation history (variable), retrieved document chunks (variable), tool results (variable), and the AI's current generation. The context window is not infinite.
DeepSeek V3's context window is 128K tokens. That sounds enormous, but a professional user with a long conversation history, multiple document retrievals, and several tool results can approach it. We implemented context compression to handle this:
```python
async def compress_conversation_context(
    messages: list[dict],
    max_tokens: int = 150_000,
) -> list[dict]:
    """Compress conversation history when it exceeds the token threshold.

    Strategy:
    1. Count total tokens in message history.
    2. If under max_tokens, return unchanged.
    3. If over, summarize the oldest N messages into a single system
       message, preserving the most recent messages verbatim.

    The summary is generated by a fast, cheap model (Mistral Large)
    to avoid consuming expensive tokens for housekeeping.
    """
    total_tokens = sum(count_tokens(m["content"]) for m in messages)

    if total_tokens <= max_tokens:
        return messages

    # Find the split point: keep the last 20 messages verbatim
    KEEP_RECENT = 20
    if len(messages) <= KEEP_RECENT:
        return messages  # Cannot compress further

    old_messages = messages[:-KEEP_RECENT]
    recent_messages = messages[-KEEP_RECENT:]

    # Summarize old messages
    summary_prompt = (
        "Summarize the following conversation history in 2-4 sentences. "
        "Preserve key facts, decisions, file references, and user preferences. "
        "Be concise but complete."
    )

    old_content = "\n".join(
        f"{m['role']}: {m['content'][:500]}" for m in old_messages
    )

    summary = await call_llm(
        model="mistralai/mistral-large-2512",
        messages=[
            {"role": "system", "content": summary_prompt},
            {"role": "user", "content": old_content},
        ],
        max_tokens=300,
    )

    # Replace old messages with summary
    summary_message = {
        "role": "system",
        "content": f"[Résumé de la conversation précédente: {summary}]",
    }

    return [summary_message] + recent_messages
```
The compression model is mistralai/mistral-large-2512, chosen for cost and speed. Summarizing old messages is a housekeeping task, not a user-facing generation. The cost is approximately $0.00005 per summary -- negligible. The latency is 1-2 seconds, which is absorbed into the overall response time without noticeable delay.
We compress proactively rather than waiting until the context window is actually full, because the LLM's quality degrades well before the hard limit. Empirically, response quality starts declining around 100K tokens of context. After compression, the history drops to approximately 30K tokens, which keeps the active context well within the quality zone and leaves headroom under DeepSeek V3's 128K limit.
---
AI Memory: Cross-Conversation Continuity
RAG retrieves information from documents. But what about information from previous conversations?
A student who spent 45 minutes discussing quadratic equations yesterday should not have to re-establish context today. A professional who told the AI about her client's corporate structure last week should not have to repeat it.
The AIMemory model stores auto-generated conversation summaries:
After each conversation, a background task generates a summary: a title (10-15 words) and a brief description (2-4 sentences) capturing the key topics, decisions, and outcomes. These summaries are stored in the AIMemory table, linked to the user.
When a user starts a new conversation, the system loads the last 10-20 memory entries and injects them into the system prompt as context. The AI sees: "In a previous conversation, the user discussed quadratic equations and struggled with the discriminant formula. In another conversation, the user asked about SYSCOHADA consolidation rules for Group Bamba." This gives the AI continuity without loading entire past conversations into context.
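Injecting memories into the system prompt is a string-building step. A sketch, under the assumption that each AIMemory row carries a title and a summary (the field names are assumptions):

```python
def build_memory_preamble(memories: list[dict], limit: int = 15) -> str:
    """Render recent memory entries as a compact system-prompt section.
    Each entry is one line, so 15 memories cost only a few hundred tokens."""
    lines = [f"- {m['title']}: {m['summary']}" for m in memories[:limit]]
    if not lines:
        return ""
    return "Context from previous conversations:\n" + "\n".join(lines)
```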
The memory model uses mistralai/mistral-large-2512 for summary generation -- the same model used for context compression. The cost is trivial. The value is significant: returning users feel like the AI remembers them, which builds trust and reduces friction.
---
The Full Retrieval Flow
Putting it all together, here is what happens when a user asks a question that requires document retrieval:
1. User sends: "D'après mon cours de physique, explique la loi de Faraday" ("Based on my physics course, explain Faraday's law").
2. The AI's system prompt includes 15 recent memory entries for context.
3. The AI decides to call search_user_files with query "loi de Faraday."
4. The search function generates an embedding for the query using bge-m3.
5. pgvector retrieves the 10 most similar chunks from the user's documents.
6. Mistral Reranker reranks the 10 candidates and returns the top 3.
7. The 3 chunks (approximately 1,500 tokens total) are returned to the AI as tool results.
8. The AI reads the chunks, identifies the relevant passages about Faraday's law.
9. The AI generates a response that references the specific document content, citing page numbers and section headings from the chunk metadata.
10. If the conversation exceeds 150K tokens, context compression runs before the next message.
Total added latency from RAG: approximately 800ms (embedding: 200ms, pgvector search: 50ms, reranking: 500ms, overhead: 50ms). This is negligible compared to the LLM generation time of 3-15 seconds.
---
What We Would Change
Two aspects of the current pipeline are candidates for improvement.
First, the chunking fallback. When the Datalab API is unavailable, our local paragraph-boundary chunking produces noticeably worse retrieval quality. Chunks may contain multiple unrelated topics or split a single topic across chunks. A local semantic chunking implementation -- using sentence embeddings to detect topic boundaries -- would eliminate this dependency.
Second, hybrid search. Currently, we use pure vector similarity. Adding BM25 (keyword-based) search and combining it with vector scores would improve retrieval for queries that contain specific terms -- article numbers, proper nouns, technical acronyms -- that embedding models sometimes fail to match. PostgreSQL's built-in full-text search (tsvector) would handle this without any additional infrastructure.
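A hybrid query could look roughly like the following. This is a sketch of the roadmap item, not existing code: the `content_tsv` column does not exist yet, the 0.5/0.5 weights are illustrative, and `:query` / `:query_embedding` / `:user_id` are bound parameters.

```sql
-- Illustrative hybrid scoring: blend full-text rank with vector similarity.
SELECT dc.content,
       0.5 * ts_rank(dc.content_tsv, plainto_tsquery('french', :query))
     + 0.5 * (1 - (dc.embedding <=> :query_embedding)) AS hybrid_score
FROM document_chunks dc
JOIN uploaded_files uf ON uf.id = dc.file_id
WHERE uf.user_id = :user_id
ORDER BY hybrid_score DESC
LIMIT 10;
```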
Both improvements are on the roadmap. Neither is blocking. The current pipeline works reliably for the document sizes and query patterns we see in production. But as the platform scales and users upload larger document collections, these refinements will become necessary.
---
This is Part 16 of a 20-part series on building Deblo.ai.
1. AI Tutoring for 250 Million African Students
2. 100 Sessions Later: The Architecture of an AI Education Platform
3. The Agentic Loop: 24 AI Tools in a Single Chat
4. System Prompts That Teach: Anti-Cheating, Socratic Method, and Grade-Level Adaptation
5. WhatsApp OTP and the African Authentication Problem
6. Credits, FCFA, and 6 African Payment Gateways
7. SSE Streaming: Real-Time AI Responses in SvelteKit
8. Voice Calls With AI: Ultravox, LiveKit, and WebRTC
9. Building a React Native K12 App in 7 Days
10. 101 AI Advisors: Professional Intelligence for Africa
11. Background Jobs: When AI Takes 30 Minutes to Think
12. From Abidjan to 250 Million: The Deblo.ai Story
13. Generating PDFs, Spreadsheets, and Slide Decks From a Chat Message
14. Organizations: Families, Schools, and Companies on One Platform
15. Interactive Quizzes With LaTeX: Testing Students Inside a Chat
16. RAG Pipeline: Document Search With pgvector and Semantic Chunking (you are here)
17. Six Languages, One Platform: i18n for Africa
18. Tasks, Goals, and Recurring Reminders
19. AI Memory and Context Compression
20. Observability: Tracking Every LLM Call in Production