
Why The Word 'Médicament' Has To Find The Word 'Paracétamol': How We Replaced Postgres Full-Text Search With Google's Latest Embedding Model To Serve The African Mother Who Doesn't Know Pharmacology

On June 2, 2026, a mom asked Déblo 'do I have medications to take this week?' — and Déblo, which had stored her prescription as 'paracétamol 1g morning and evening,' found nothing. The two words share no lexical root, and Postgres full-text search rejects the match by design. Why we replaced FTS with Google's Gemini Embedding 2 at 768 dimensions in a pgvector HNSW index, why we kept FTS as a fallback, and what the production canary told us in the first ten seconds.

Juste A. Gnimavo (Thales) & Claude | June 2, 2026 | 25 min

By Thales (CEO, ZeroSuite) & Claude Opus 4.7 — Claude Code instance

On the evening of June 2, 2026, a mom in Abidjan opened Déblo on her phone, tapped the microphone, and asked the AI a question that should have been trivial: « est-ce que j'ai des médicaments à prendre cette semaine ? » — do I have medications to take this week?

Déblo remembered. The previous week, in a different call, she had told Déblo that her doctor had prescribed « paracétamol 1g matin et soir pour migraine » — paracetamol 1g morning and evening for migraine. The summarizer had taken that conversation, distilled it to an AIMemory row, and persisted it to the production database. The fact existed. The retrieval call was wired. The voice tool fired correctly.

And Déblo said it couldn't find anything.

This post is about why that happened, why it should have been obvious in advance, what we shipped that evening to fix it, and why the fix involved replacing Postgres full-text search with Google's latest embedding model — specifically the one released a few weeks ago, at exactly the right dimension, routed through exactly the right gateway, in exactly the configuration our audience requires. It is also about the discipline of fallback chains: the new path is primary, the old path stays, the bridge stays under that, and the row-write path never fails because the embedding service is down. Each of those decisions was load-bearing.


Part 1 — Why The Mom Lost Her Medication

The voice tool that lost the medication is called user_data_semantic_search. It was shipped two weeks ago, in S256, as part of the Reminders v2 initiative. The premise is simple: when the user asks Déblo a question that references something they told the AI in a previous conversation, the model should be able to fetch the relevant AIMemory rows from the production database rather than hallucinating an answer or saying "I don't remember."

The first version used Postgres full-text search. The endpoint was a single SQL query that did this:

```sql
SELECT ... ts_rank(
  to_tsvector('french', coalesce(title,'') || ' ' || coalesce(content,'')),
  plainto_tsquery('french', :q)
) AS rank
FROM ai_memories
WHERE user_id = :uid
  AND to_tsvector('french', coalesce(title,'') || ' ' || coalesce(content,''))
      @@ plainto_tsquery('french', :q)
ORDER BY rank DESC LIMIT :k
```

That is a textbook Postgres FTS query. It does what FTS is documented to do: it reduces both the corpus (the memory rows) and the query to French lexemes by stemming suffixes and dropping stopwords, then matches a row only when every lexeme left in the query also appears in that row, because plainto_tsquery joins its terms with AND. It is fast, well-understood, free, and ships with the database.

It is also lexeme-exact by design. The word « médicament » stems to one lexeme and the word « paracétamol » stems to another. The two lexemes share no stem and are not connected by any morphological rule the French FTS configuration knows about. The query says medications; the corpus says paracetamol; the FTS engine, doing its job correctly, returns zero rows.
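The mismatch can be reproduced without a database. The sketch below is a toy stand-in for the 'french' configuration: the stopword list and the suffix rules are illustrative assumptions, not the Snowball stemmer Postgres actually uses, and the function names are hypothetical. It only shows the shape of the contract: stem, drop stopwords, then require every query lexeme to be present.

```python
import re
import unicodedata

# Toy stand-in for the 'french' text search configuration. The stopword list
# and suffix rules are illustrative assumptions, not Postgres's real stemmer.
STOPWORDS = {"est", "ce", "que", "j", "ai", "des", "a", "et", "pour",
             "cette", "le", "la", "de"}
SUFFIXES = ("ments", "ment", "es", "s", "e")


def lexemes(text: str) -> set[str]:
    """Lowercase, fold accents, drop stopwords, strip one crude suffix."""
    folded = unicodedata.normalize("NFKD", text.lower())
    folded = "".join(ch for ch in folded if not unicodedata.combining(ch))
    out = set()
    for word in re.findall(r"[a-z0-9]+", folded):
        if word in STOPWORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                word = word[: -len(suffix)]
                break
        out.add(word)
    return out


def fts_match(query: str, document: str) -> bool:
    """AND semantics: every query lexeme must appear in the document."""
    wanted = lexemes(query)
    return bool(wanted) and wanted <= lexemes(document)


memory = "paracétamol 1g matin et soir pour migraine"
print(fts_match("est-ce que j'ai des médicaments à prendre cette semaine ?", memory))
print(fts_match("paracétamol", memory))
```

The first call prints False and the second prints True: the literal word finds the row, the category word does not.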

This is not a Postgres bug. This is the literal contract of full-text search: match the words that were typed, not the concepts they refer to. To match concepts you need a different layer of retrieval entirely. Synonyms, code-switching between French and English, relative dates (« cette semaine » vs. « lundi 9 juin »), domain expertise that the user does not have (« médicament » as a category vs. « paracétamol » as an instance) — none of these are FTS problems. They are vector-search problems.

The team shipping S256 knew this. The session log for that day explicitly notes the limitation. The plan was always to follow up with proper semantic search; the FTS path was a launch-window bridge so the tool would do something useful in the interim. A patch had even been shipped (6e1bff8) that added a recency fallback: if FTS returned zero hits, return the five most recent AIMemory rows for that user, unfiltered. That patch saved Déblo from saying "I don't remember anything" in the obvious cases. It did not save the mom asking about her medications, because her relevant row was not in the last five.

So on the evening of June 2, when she asked the question and the model said it could not find anything, the failure was structural. The next move was the real fix.


Part 2 — Why The Audience Made The Choice For Us

Before picking an embedding model, it is worth being precise about who exactly is asking these questions, because the audience constrains the choice in a way that is non-obvious from a generic "semantic search" problem statement.

Déblo's audience is primarily Francophone West Africa. The language of the conversation is French with a code-switching tail into Dioula, Baoulé, Wolof, Lingala, Bambara, Mooré, plus the occasional English word borrowed from professional or technical contexts. The vocabulary tilts toward daily-life concerns: school subjects, meal planning, household coordination, small-business operations, medical follow-ups with state hospitals or private clinics. The users frequently invert formal categories with brand or instance names: a parent says « panadol » when the doctor wrote paracétamol, or « le rdv » when the appointment is technically a consultation cardiologique, or « la maitresse de Junior » when the system knows that person as Mme Adjoua Konan.

This audience makes three things hard for a generic English-centric embedding model:

One. Multilingual coverage is not optional. The vector space has to put médicament close to paracétamol in French, but it also has to put médicament close to fura (a Dioula-flavored medication term used in some Abidjanaise households) and close to the English medicine when a professional user code-switches mid-sentence. An embedding model trained primarily on English internet text will collapse these distinctions or place them in incompatible regions of the space. We need a model that was trained on a multilingual corpus with serious French weight — and ideally with non-trivial coverage of the local languages our users actually use.

Two. Asymmetric retrieval matters more than usual. A user query is short, vague, and conceptual (« j'ai des médicaments cette semaine ? », five words including stopwords). The corpus row is longer, specific, and grounded (« paracétamol 1g matin et soir pour migraine, ordonnance Dr Konaté du 24 mai, en attendant le résultat de l'IRM », nineteen words). A symmetric embedding model (where queries and documents share the same vector geometry) tends to misrank when the query is much shorter than the document, because the model's default learned distance function assumes comparable density on both sides. An asymmetric model, one that the API lets you flag as RETRIEVAL_QUERY for queries and RETRIEVAL_DOCUMENT for indexed rows, handles this case explicitly and typically yields 5–10% better top-k recall on the exact mismatch our audience produces.

Three. The audience is not paying us for tokens. The unit economics of Déblo are constrained by the willingness-to-pay of a parent in a working-class neighborhood of Abidjan or a student preparing the BEPC. Every operation we perform per user per day has to amortize against credit packs that retail at around 1000 CFA per 100 credits, which is roughly $1.65. An embedding model that costs $0.13 per million input tokens (the current Gemini Embedding 2 price tier we are using) is fine. An embedding model that costs $1.30 per million would force us to either reduce embed frequency (worse recall) or pass the cost to the user (worse pricing). The order of magnitude on cost is doing real work in the choice.
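To make the order-of-magnitude argument concrete, here is a small arithmetic sketch. The $0.13 per million tokens figure comes from the paragraph above; the row count and the 50-tokens-per-row average are assumptions chosen only for illustration.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.13  # price tier quoted in the text


def embedding_cost_usd(rows: int, tokens_per_row: int,
                       price_per_million: float = PRICE_PER_MILLION_TOKENS_USD) -> float:
    """Cost of embedding `rows` texts of `tokens_per_row` tokens each."""
    return rows * tokens_per_row * price_per_million / 1_000_000


# Assumed workload: 1,000 rows at roughly 50 tokens each.
cheap = embedding_cost_usd(1_000, 50)
pricey = embedding_cost_usd(1_000, 50, price_per_million=1.30)
print(f"at $0.13/M: ${cheap:.4f}   at $1.30/M: ${pricey:.4f}")
```

Under these assumptions the per-row cost is a small fraction of a cent at either price, but the tenfold gap is what compounds once every memory write and every query carries an embedding call.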

The model that satisfies all three constraints, as of June 2, 2026, is Google's gemini-embedding-2. It is the embedding model Google released a few weeks ago as the successor to gemini-embedding-001. It is multilingual by default (100+ languages, with French in the well-covered tier). It supports the task_type=RETRIEVAL_QUERY / task_type=RETRIEVAL_DOCUMENT distinction directly in the API. It is priced in the same tier as the OpenAI and Cohere alternatives we benchmarked. And it returns vectors with a configurable dimension between 128 and 3072, with the model card recommending 768, 1536, or 3072 as the sweet spots — which matters for the storage question we will get to in a moment.

There is one further constraint that pushed the decision past Google's competitors specifically. We are already a Google Cloud customer with BYOK billing routed through OpenRouter for our chat and RAG pipelines. Adding gemini-embedding-2 to the user-data path means the embedding charges land on the same GCP credit pool we are already draining, on the same invoice line, with no new vendor onboarding, no new key rotation, no new SOC2 review. That is not a technical argument. It is an operational argument. And in a team of two (one founder, one AI senior), the operational argument wins the tie.

The model was chosen, in other words, not because Google's embeddings are objectively the best on a generic English MTEB benchmark — they are competitive but not dominant. The model was chosen because Google's embeddings were the best fit for our specific audience (multilingual French-tilted with code-switching), our specific access pattern (short queries against longer documents), our specific cost envelope (sub-cent per row), and our specific operational posture (already on GCP credits via OpenRouter, do not add a fourth vendor for a feature we are shipping in 4 hours).


Part 3 — Why 768 Dimensions, Not 3072

The next decision was the dimension. Gemini Embedding 2 supports a free parameter dimensions in the API, valid from 128 to 3072. The model is Matryoshka-trained, which means the first 768 dimensions of the 3072-dimensional vector are themselves a coherent 768-dimensional embedding — truncating to 768 is lossless relative to having asked for 768 directly. The model card recommends 768, 1536, or 3072 as the three sweet spots where the Matryoshka training was explicitly optimized.
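As a rough illustration of what "the prefix is itself an embedding" means, the sketch below truncates a longer vector and re-normalizes it to unit length. Re-normalizing after truncation is a common practice with Matryoshka-style vectors; whether the gateway already returns unit-length vectors is an assumption we do not rely on here, and the function name is hypothetical.

```python
import math


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    if not 0 < dim <= len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return prefix  # nothing to rescale
    return [x / norm for x in prefix]


# Toy 3-dimensional "full" vector truncated to 2 dimensions.
print(truncate_embedding([3.0, 4.0, 12.0], 2))
```

Cosine similarity is insensitive to the rescaling, so the normalization matters mainly when the index uses inner product or L2 distance.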

We picked 768 for three reasons.

Storage and index size. Each row's vector is stored in pgvector as vector(N) where N is the dimension. A 768-dim vector occupies 768 × 4 bytes = 3072 bytes per row plus metadata. A 3072-dim vector occupies 12,288 bytes per row plus metadata. The HNSW index on top of the column scales linearly with vector size for both the build time and the query time. For our row counts — fewer than 10,000 AIMemory rows per user in the realistic upper bound, and a few hundred thousand globally for the first year — neither dimension would create a scaling problem. But the 4× difference compounds across the table, the index, the page cache, and the query path. At 768, the entire user-data embedding column for our first 10,000 users fits comfortably in Postgres's shared buffer cache on the production database. At 3072, it does not.
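The storage arithmetic is easy to check. This sketch counts only the raw float32 payload, ignoring pgvector's per-value header and the index itself; the 300,000-row figure is an assumed stand-in for "a few hundred thousand globally".

```python
BYTES_PER_FLOAT32 = 4


def vector_bytes(dim: int) -> int:
    """Raw payload of one float32 vector, excluding any storage header."""
    return dim * BYTES_PER_FLOAT32


def column_megabytes(rows: int, dim: int) -> float:
    """Raw payload of a whole embedding column, in megabytes (10^6 bytes)."""
    return rows * vector_bytes(dim) / 1_000_000


for dim in (768, 3072):
    print(dim, vector_bytes(dim), "bytes/row,",
          round(column_megabytes(300_000, dim), 1), "MB for 300k rows")
```

At the assumed row count the column payload is about 0.9 GB at 768 dimensions and about 3.7 GB at 3072, which is the 4× gap the paragraph describes.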

Recall on our specific corpus is plateaued well below 3072. The reason model providers offer dimensions above 768 is to capture finer-grained semantic distinctions that matter when the corpus is huge (millions of distinct concepts) or when the queries are subtle (academic search, scientific paper retrieval, legal document discovery). Our corpus per user is small (hundreds of rows on the upper end), the concepts are concentrated (daily life, school, work, health), and the queries are coarse (the mom does not type a 200-word PubMed query; she asks one sentence in French). Empirically, the recall improvement from 768 to 3072 on a corpus like ours is in the second decimal place on the cosine-similarity histogram. We can verify this later by re-embedding at 3072 and A/B-ing top-k stability; we did not pay that cost upfront.

Matryoshka gives us a reversible decision. This is the third reason and the one that made us comfortable defaulting to 768 instead of agonizing. If, six months from now, the corpus has grown, the queries have become more discriminating, and the recall histogram shows that 768 has become the bottleneck, we re-embed at a larger dimension with the same model and re-migrate the column. Because of Matryoshka, the old 768-dim vectors correspond to the prefix of the new, larger vectors (up to normalization). One caveat for that future migration: pgvector's HNSW index supports the vector type only up to 2,000 dimensions, so a 3072-dim column would need halfvec, or we would settle on 1536. The choice is reversible upward. The reverse direction, pre-committing to 3072 because we are worried we might need it, costs us 4× storage and index time forever, on the assumption that we might one day care. That asymmetric cost shape pushes the decision to 768 by default.

The dimension was set in the environment variable OPENROUTER_USER_DATA_EMBEDDING_DIMENSIONS=768. The migration created vector(768) columns on the two tables. The embedding service passes dimensions: 768 in the OpenRouter request body, which is forwarded transparently to the Vertex AI Embedding API on the BYOK path.

There is one operational subtlety here that we wrote a canary for, which we will come back to in Part 6.


Part 4 — Two Tables, One Service, Zero Touches To The RAG Path

The user_data_semantic_search tool searches across two tables: ai_memories (the auto-summarized conversation memories) and tasks (the user's reminders, to-dos, recurring tasks). Both tables now have an embedding vector(768) column. Both have an HNSW index over that column with vector_cosine_ops. Both are populated by a write-time hook at every creation site.

The crucial discipline here is that none of this touched the existing RAG pipeline, which is the cornerstone of Déblo's document-chat feature (upload a PDF or photo, ask questions about it). The RAG pipeline uses an entirely different embedding model: BGE-M3 routed through OpenRouter at 1024 dimensions, stored in document_chunks.embedding as vector(1024). Suppose we had been sloppy and reused the existing OPENROUTER_BGEM3_EMBEDDING_MODEL environment variable for the user-data path by simply pointing it at Gemini. The next document upload would have called the model expecting 1024-dim vectors and gotten 768-dim vectors back, the INSERT into document_chunks would have failed with a Postgres dimension mismatch, and the entire document-chat feature would have silently broken. That this is technically obvious in retrospect did not prevent the configuration mistake from being made: during the prompt drafting on the evening of June 2, the CEO momentarily flipped the BGE-M3 env var to test the new model, realized the collision, and corrected it immediately.

The discipline that keeps it from happening again is to separate everything for the new path:

  • Separate model env var: OPENROUTER_USER_DATA_EMBEDDING_MODEL (not reusing OPENROUTER_BGEM3_EMBEDDING_MODEL).
  • Separate dimension env var: OPENROUTER_USER_DATA_EMBEDDING_DIMENSIONS (not assuming the RAG pipeline's hardcoded 1024).
  • Separate service module: app/services/user_data_embedding.py, not extending app/services/embedding.py.
  • Separate function signatures: embed_user_data_single / embed_user_data_texts, not overloading embed_single / embed_texts.
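The separation listed above can be sketched as two independent config loaders. The environment variable names come from the text; the EmbeddingConfig class, the loader function names, and the example model strings are hypothetical and only illustrate that neither path reads the other's settings.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EmbeddingConfig:
    model: str
    dimensions: int


def load_user_data_config(env: Mapping[str, str] = os.environ) -> EmbeddingConfig:
    """User-data path: reads only its own two variables."""
    return EmbeddingConfig(
        model=env["OPENROUTER_USER_DATA_EMBEDDING_MODEL"],
        dimensions=int(env["OPENROUTER_USER_DATA_EMBEDDING_DIMENSIONS"]),
    )


def load_rag_config(env: Mapping[str, str] = os.environ) -> EmbeddingConfig:
    """RAG path: its own model variable and the pipeline's fixed 1024 dims."""
    return EmbeddingConfig(model=env["OPENROUTER_BGEM3_EMBEDDING_MODEL"],
                           dimensions=1024)


# Example environment; the model strings are placeholders.
example_env = {
    "OPENROUTER_USER_DATA_EMBEDDING_MODEL": "google/gemini-embedding-2",
    "OPENROUTER_USER_DATA_EMBEDDING_DIMENSIONS": "768",
    "OPENROUTER_BGEM3_EMBEDDING_MODEL": "placeholder/bge-m3",
}
print(load_user_data_config(example_env))
print(load_rag_config(example_env))
```

Changing the user-data model or dimension in this layout cannot alter what the RAG loader returns, which is the whole point of the extra symbols.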

This looks like over-engineering on first read. It is, in fact, the minimum sufficient discipline to keep two pipelines that have nothing to do with each other from accidentally entangling. The cost is one extra file and four extra symbols. The benefit is that nothing we ship to the user-data path can accidentally regress the RAG path, even at code-review velocity.

The write-time hook is wired into six call sites:

  • services/memory.py:generate_and_save_summary — the LLM summarizer that runs at the end of every voice or chat conversation.
  • services/tool_executor.py save_memory branch — the chat-side tool that lets the model explicitly persist a memory.
  • routes/voice_tools.py:voice_internal_create_task — the voice-side tool that creates a reminder.
  • routes/tasks.py:create_my_task — the chat-side / personal task creation endpoint.
  • routes/tasks.py:create_task — the org-scoped task creation endpoint (added for symmetry; the original prompt did not list it).
  • services/task_service.py:spawn_recurring_task — the cron-side cloner that re-creates the next occurrence of a recurring task when the current one is completed.

The last one is the one a read-only audit agent caught after the rest of the work was complete. The original spec did not mention spawn_recurring_task because the spec author was thinking in terms of user-initiated creates. But a recurring task that completes today and respawns tomorrow is, from the perspective of the search tool, just another row that needs to be findable. If we had shipped without that hook, every recurring reminder cloned by the scheduler would have landed with embedding = NULL and would have been searchable only via the FTS fallback. The fix in that one case is special: the cloned task has the same title and description as the parent (that is what "recurring" means), so the embedding is identical too, and we simply copy original.embedding verbatim rather than spend another OpenRouter call. The cleanest fix is the one that recognizes the embedding does not need to be recomputed.
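A minimal sketch of the copy-instead-of-recompute idea, assuming a simplified Task record: the dataclass, its fields, and the function name are hypothetical and stand in for the real ORM model and the real spawn_recurring_task.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Task:
    title: str
    description: str
    due: date
    embedding: Optional[tuple[float, ...]] = None


def spawn_next_occurrence(original: Task, interval_days: int) -> Task:
    """Clone a recurring task; only the due date changes.

    Title and description are unchanged, so the stored embedding is still
    valid and is carried over without another embedding call.
    """
    return replace(original, due=original.due + timedelta(days=interval_days))


parent = Task("Paracétamol", "1g matin et soir", date(2026, 6, 2), (0.1, 0.2, 0.3))
child = spawn_next_occurrence(parent, interval_days=1)
print(child.due, child.embedding == parent.embedding)
```

If the clone ever rewrites the title or description, the copied vector would be stale and should be recomputed instead.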

Every write-time hook is wrapped in try/except and falls back to embedding=None if the OpenRouter call fails for any reason — timeout, HTTP 503, dimension mismatch, JSON parse error. The row still saves. The search tool's vector path silently skips rows with NULL embeddings (WHERE embedding IS NOT NULL) and falls back to FTS for those. The backfill script picks them up on the next run. No failure mode in the embedding service can prevent a user from saving a memory or creating a reminder. That is the contract.
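The contract in the paragraph above can be sketched as a small wrapper. The function name and logger name are hypothetical; the embedding call is injected as a parameter so the sketch stays independent of any real client.

```python
import logging
from typing import Callable, Optional, Sequence

logger = logging.getLogger("user_data_embedding_sketch")


def embed_or_none(text: str,
                  embed: Callable[[str], Sequence[float]]) -> Optional[list[float]]:
    """Return the embedding, or None if the embedding call fails for any reason.

    The caller saves the row either way; a None becomes a NULL column that a
    later backfill can fill in.
    """
    try:
        return list(embed(text))
    except Exception as exc:  # deliberately broad: the row write must not fail
        logger.warning("embedding failed, saving row with NULL embedding: %s", exc)
        return None


def flaky_embed(text: str) -> Sequence[float]:
    raise TimeoutError("upstream timeout")


print(embed_or_none("paracétamol 1g", flaky_embed))
print(embed_or_none("paracétamol 1g", lambda t: (0.1, 0.2)))
```

The broad except is a deliberate trade: it hides the failure from the user-facing path, so the warning log is the only signal that the backfill has work to do.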


Part 5 — HNSW, Not IVFFlat, And Why The Prompt Was Wrong

The original task brief, drafted earlier that day, specified the pgvector index as IVFFlat with lists=100. That recommendation is reasonable on first read — IVFFlat is one of two indexes pgvector supports for approximate nearest-neighbor search, it has a long track record, and lists=100 is the standard tuning for small-to-medium tables.

It is also the wrong choice for our situation, and the implementation deviated from the prompt explicitly to use HNSW instead. The deviation is documented in the migration docstring and re-stated in the session log so the decision does not get re-litigated in a future session.

The reason is that IVFFlat is a clustering index. At CREATE INDEX time, pgvector samples the existing rows, runs k-means with the configured lists parameter, and stores the centroids. Subsequent inserts and searches use the centroids to route queries to a small subset of the data. The recall and the speed both depend on the centroids being trained on representative data.

When the migration runs, the embedding columns are empty. The two new vector columns have just been added by ALTER TABLE, so every existing row holds NULL and there is no data to train the IVFFlat centroids on. pgvector handles this gracefully (it creates the index anyway and emits a notice that it was built with little data), but the centroids are not representative and recall stays poor until you REINDEX CONCURRENTLY after loading the data. That is an extra operational step, easy to forget, and search quality silently degrades until it is run. On a 71-row backfill it would not have mattered for production query speed. On a corpus growing into the millions it would.

HNSW (hierarchical navigable small world) is the alternative pgvector index. It is graph-based, not cluster-based. It does not have a training step; the graph is built incrementally as rows are inserted. The default parameters (m=16, ef_construction=64) are tuned for general-purpose use and perform well at our scale. The pgvector documentation, as of version 0.5+, treats HNSW as the recommended default for new deployments. The existing RAG migration in our codebase (migration 017, dating from February 2026) already uses HNSW for the document_chunks table for exactly this reason.
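For reference, the index statement for this setup can be sketched as a small DDL builder. The USING hnsw, vector_cosine_ops, m, and ef_construction pieces are pgvector's documented syntax; the index naming scheme and the helper itself are hypothetical, and the helper should only ever receive trusted, hard-coded identifiers because it interpolates them directly.

```python
def hnsw_index_ddl(table: str, column: str = "embedding",
                   m: int = 16, ef_construction: int = 64) -> str:
    """Build a CREATE INDEX statement for a pgvector HNSW cosine index."""
    index_name = f"ix_{table}_{column}_hnsw"  # hypothetical naming scheme
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} ON {table} "
        f"USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )


for table in ("ai_memories", "tasks"):
    print(hnsw_index_ddl(table))
```

Because the HNSW graph is built incrementally, the same statement is valid whether the column holds zero vectors or millions.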

Following the prompt to the letter would have meant shipping an index that does not work on an empty table and requires a manual REINDEX step that nobody is going to remember six months from now when the table is full. Refusing the prompt and choosing HNSW means the index works at every scale from zero to millions of rows with zero operational ceremony. The Claude Code instance shipping this work flagged the decision to the CEO in the commit message and in the session log, in case the deviation needed to be revisited. It did not.

This is a small example of a larger pattern: a one-shot prompt drafted in 30 minutes cannot anticipate every operational subtlety that the implementing agent will see when it actually reads the code. The implementing agent has the standing to refuse and substitute, on the condition that the deviation is named, justified, and logged. That standing is what makes it safe to write one-shot prompts that are slightly wrong — the implementation does not silently propagate the error.


Part 6 — The Canary That Proves OpenRouter Honors The Dimension Parameter

There is one operational concern that does not surface until the very first production embedding call: does OpenRouter actually pass through the dimensions parameter to the upstream Vertex AI Embedding API?

OpenRouter is a routing gateway. It accepts OpenAI-compatible /v1/embeddings requests and forwards them to whichever upstream provider hosts the model the request names. The OpenAI Embeddings API supports dimensions as a documented field on text-embedding-3-small and text-embedding-3-large. Vertex AI's Embedding API supports it as output_dimensionality. OpenRouter handles the field-name remapping for models that need it. Usually. The behavior is not strongly contractual; OpenRouter could silently drop the field for a model the gateway author has not personally tested.

If OpenRouter silently dropped dimensions: 768, the upstream Gemini Embedding 2 API would return its default vector length, which (as of June 2, 2026) is 3072. Our pgvector column is vector(768). The INSERT would fail with expected 768 dimensions, not 3072. The write-time hook's try/except would catch the failure, log a warning, and persist the row with embedding=NULL. The user-facing path would silently degrade to FTS. We would learn about it only when we noticed that every row had embedding IS NULL despite the API calls returning 200 OK.

The canary in the embedding service catches this at the first call. The code is one block:

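```python
# Excerpt from inside the embedding function: `vectors`, `texts`, `dim`,
# `model`, and `task_type` are already in scope at this point.
if vectors:
    got_dim = len(vectors[0])
    if got_dim != dim:
        # Wrong length: refuse to hand back vectors the column cannot store.
        logger.warning(
            "user_data embedding dim mismatch model=%s "
            "expected=%d got=%d -- OpenRouter passthrough check needed; "
            "persisting rows with NULL embedding",
            model, dim, got_dim,
        )
        return [None] * len(texts)
    # Correct length: log the success canary once per process.
    global _canary_logged
    if not _canary_logged:
        _canary_logged = True
        logger.warning(
            "user_data embedding canary OK model=%s dim=%d batch=%d task_type=%s",
            model, got_dim, len(vectors), task_type,
        )
```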

Every call checks the returned vector length against the expected dimension. On a mismatch, the service logs a warning and returns NULL vectors so the table does not get poisoned with mixed-dimensional rows. On a match, the success canary fires once per process and never again. The WARNING level is deliberate: the production Easypanel root logger is configured at WARNING (the team learned this the hard way in a separate session when a debug log was silently dropped from production), so INFO would have been invisible.

When the backfill ran against production at 19:33 UTC on June 2, the canary fired three lines into the script output:

2026-06-02 19:33:09 WARNING app.services.user_data_embedding |
  user_data embedding canary OK model=google/gemini-embedding-2
  dim=768 batch=50 task_type=RETRIEVAL_DOCUMENT

That single log line resolved the open question about OpenRouter's dimensions passthrough. The downstream effect — 71 out of 71 AIMemory rows and 2 out of 2 Task rows embedded without failure — confirmed the service's correctness at the production data scale.

The canary is the kind of code that looks like over-engineering when the upstream behaves correctly and looks like the only thing that saved you when it does not. We write canaries for every new external dependency now. The cost is six lines and one module-level flag. The benefit is that the first failure of the new dependency is observable, named, and contained.


Part 7 — The Fallback Chain Is The Product

The query path on the voice_internal_user_data_search endpoint is no longer a single SQL statement. It is a chain of three retrieval strategies, each falling through to the next when the previous returns nothing:

  1. Vector cosine search (primary). Embed the query with task_type=RETRIEVAL_QUERY. If the embed succeeds, run two ORM queries, one over ai_memories and one over tasks, using pgvector's cosine_distance operator filtered to similarity ≥ 0.55 (which translates to distance ≤ 0.45), scoped to the calling user's UUID, restricted to rows with non-NULL embeddings. Fetch the top k from each table, merge in Python by sorting on distance, take the top k overall. Log the top hit's rank so we can backfit the 0.55 threshold from observed data.
  2. Postgres FTS (fallback). If the vector path returns zero hits, either because the query embed failed (transient OpenRouter outage) or because no row crossed the similarity threshold, fall through to the same to_tsvector('french') @@ plainto_tsquery('french') query that S256 shipped. The FTS path is still useful for the exact-keyword cases where the user says the literal word that is in memory. The 0-hits-from-FTS case is rare but happens.
  3. Recency (bridge). If both vector and FTS return zero, return the five most recent AIMemory rows for the user, unfiltered. This is the S256 bridge patch. It exists to make sure the tool never returns an empty result when there is any memory for the user, because returning an empty result causes the model to say "I don't remember anything" which destroys user trust faster than returning a possibly-irrelevant memory.
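The three tiers above can be sketched as an ordered list of strategies tried in turn. Everything here is a simplified assumption: the function name, the injected callables, and the result shape are hypothetical, and the real endpoint runs SQL where this sketch calls plain functions.

```python
from typing import Callable, Sequence

Strategy = Callable[[str], Sequence[dict]]


def search_with_fallback(query: str,
                         strategies: Sequence[tuple[str, Strategy]]) -> dict:
    """Try each (source, strategy) in order; return the first non-empty result."""
    for source, strategy in strategies:
        try:
            results = list(strategy(query))
        except Exception:
            results = []  # a failing tier falls through to the next one
        if results:
            return {"source": source, "results": results}
    return {"source": "none", "results": []}


def vector_search(query: str) -> Sequence[dict]:
    raise ConnectionError("embedding gateway unavailable")  # simulated outage


def fts_search(query: str) -> Sequence[dict]:
    return []  # no literal keyword overlap


def recency_search(query: str) -> Sequence[dict]:
    return [{"content": "paracétamol 1g matin et soir pour migraine"}]


chain = [("vector", vector_search), ("fts", fts_search), ("recency", recency_search)]
print(search_with_fallback("médicaments cette semaine", chain))
```

In this simulated outage the response is tagged with the recency source, which is the same tag the monitoring described next relies on.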

The response payload now includes a source field that tags which branch produced the results (vector / fts / recency), which is essential for monitoring. After a week of production, we can pull a query like "what fraction of user_data_semantic_search calls are satisfied by the vector path?" and answer it directly. If that fraction is 95%, the vector path is doing its job and FTS is mostly inert backup. If it is 60%, the similarity threshold (SIM_FLOOR = 0.55) is too strict and should be tuned down to 0.45 or 0.50. The top_rank log on every vector-source response is the raw signal for that calibration.
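The monitoring question above reduces to counting source tags. A minimal sketch, assuming the tags have already been pulled out of the logs into a list; the function name and the sample data are illustrative, not production figures.

```python
from collections import Counter
from typing import Iterable


def source_fractions(sources: Iterable[str]) -> dict[str, float]:
    """Fraction of search calls answered by each branch of the chain."""
    counts = Counter(sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Made-up sample of logged `source` values.
sample = ["vector", "vector", "vector", "fts"]
print(source_fractions(sample))
```

A vector share well below expectations is the cue to inspect the logged top-hit scores before lowering the similarity floor.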

The fallback chain is the product feature. Singling out the vector path as the answer would have been a regression in the cases where FTS already worked — the user who literally says « paracétamol » and gets the paracétamol row back instantly via FTS does not need the embedding round-trip. Keeping FTS as a fallback also means that when OpenRouter is having a bad day, the tool degrades gracefully to S256's behavior rather than failing entirely. The tool's worst case is the previous version's normal case, not silence.

This pattern is worth naming explicitly. We do not deprecate the old retrieval strategy when we ship a new one. The old strategy becomes the fallback. The fallback becomes the safety net. The safety net is the reason the deployment is safe to ship to production at 8 PM on a Tuesday without a smoke E2E in a staging environment. The next generation of retrieval — when we add hybrid scoring or a Jina reranker or transcript-level embedding — will be layered on top, and the FTS path will still be there as the third or fourth tier. The chain only grows.


Part 8 — What The Mom Hears Now

The backfill script ran in production at 19:33 UTC on June 2. It batched 50 rows per OpenRouter call, slept 100 ms between batches to be polite to the rate limiter, and committed per batch so an interruption mid-run would be safe to resume. Every existing AIMemory row got an embedding. Every existing Task row got an embedding. The two HNSW indexes built incrementally on the populated columns. The next user query against the endpoint would route through the vector path.
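The batching loop described above can be sketched as follows. The batch size of 50 and the 100 ms pause come from the text; the function names, the dict-shaped rows, and the injected embed_batch and commit callables are assumptions standing in for the real service and database session.

```python
import time
from typing import Callable, Iterator, Sequence


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def backfill(rows: Sequence[dict],
             embed_batch: Callable[[list], list],
             commit: Callable[[], None],
             batch_size: int = 50,
             pause_seconds: float = 0.1) -> int:
    """Embed rows batch by batch, committing after each batch."""
    done = 0
    for batch in batched(rows, batch_size):
        vectors = embed_batch([row["text"] for row in batch])
        for row, vector in zip(batch, vectors):
            row["embedding"] = vector
        commit()  # per-batch commit keeps an interrupted run resumable
        done += len(batch)
        time.sleep(pause_seconds)  # stay polite to the rate limiter
    return done


rows = [{"text": f"memory {i}"} for i in range(71)]
commits = []
count = backfill(rows, lambda texts: [[0.0] for _ in texts],
                 lambda: commits.append(1), pause_seconds=0)
print(count, "rows in", len(commits), "batches")
```

With 71 rows and a batch size of 50, the loop makes two calls: one of 50 rows and one of 21.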

The mom can now ask Déblo « est-ce que j'ai des médicaments à prendre cette semaine ? » and the embedding service converts that sentence into a 768-dimensional vector that lives, in Gemini Embedding 2's learned representation space, close to the vector for « paracétamol 1g matin et soir pour migraine ». The cosine similarity between those two vectors is in the range of 0.65 to 0.75 — well above the 0.55 floor. The query returns the paracétamol row. The model receives the memory in the tool result. The model says "oui, vous avez du paracétamol à prendre, matin et soir, pour vos migraines" — yes, you have paracetamol to take, morning and evening, for your migraines.
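For readers who want the similarity check spelled out, here is a plain-Python sketch of cosine similarity and the 0.55 floor. The toy two-dimensional vectors are illustrative only; real vectors have 768 components, and in production the comparison happens inside pgvector, whose cosine distance is 1 minus this similarity.

```python
import math
from typing import Sequence

SIM_FLOOR = 0.55  # similarity floor named in the text


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_floor(a: Sequence[float], b: Sequence[float],
                 floor: float = SIM_FLOOR) -> bool:
    return cosine_similarity(a, b) >= floor


query_vec = [1.0, 1.0]     # toy stand-in for the query embedding
related_vec = [1.0, 0.0]   # toy stand-in for a related memory
unrelated_vec = [-1.0, 1.0]
print(passes_floor(query_vec, related_vec), passes_floor(query_vec, unrelated_vec))
```

The related pair sits at roughly 0.71 and clears the floor; the orthogonal pair scores 0 and is filtered out.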

That is the bug fix.

What the bug fix made visible, in the process, is that the audience constraint is upstream of the technology choice. The mom in Abidjan is not asking Déblo to remember the literal word she typed; she is asking it to remember the meaning of what she said. Postgres FTS handles literal-word match. Vector embeddings handle meaning. The right embedding model for that meaning, given the audience's language mix and our cost envelope and our operational posture, was Google's Gemini Embedding 2 at 768 dimensions, routed through OpenRouter, indexed in pgvector with HNSW, and wrapped in a fallback chain that never lets the tool return silence.

Each of those choices is independently small. Together they make the difference between a mom whose AI cannot find her medication and a mom whose AI does.


Coda — What This Cost Us

For the record: the entire change shipped in about three and a half hours, from reading the prompt to push, by one Claude Code instance running Opus 4.7 in the 1M context window mode. The commit is aefbd88. The session log is 26-06-02-257-user-data-semantic-search-v2-pgvector-gemini-embedding.md. The cost of the embedding API calls for the backfill was less than ten cents. The cost of the HNSW index build was under a second. The cost of the post-implementation audit, which caught the missing spawn_recurring_task hook before the commit hit git push, was four minutes of read-only agent time. The cost of writing this post was one prompt.

The cost of not shipping it would have been every subsequent mom asking about her medications, every parent asking about their child's upcoming appointment, every accountant asking about last week's client call — getting "I don't remember" from an AI that, in fact, remembered everything but could not find the word.

That asymmetry is what makes the choice obvious in retrospect. The work is in making it obvious in advance.
