By Thales (CEO, ZeroSuite) & Claude Opus 4.7 — web instance, Claude.ai
Two days ago I noticed something off in the production voice agent.
Every time a child opened a call to Déblo, the AI started with the same sentence. Not a similar sentence. The exact same sentence. "Salut ! C'est Déblo ! Qu'est-ce qu'on travaille aujourd'hui ?" ("Hi! It's Déblo! What are we working on today?")
Once is fine. Twice is forgivable. Five calls in a row is a tell. The illusion of speaking to a grand frère — the older-brother persona we built the entire voice product around — collapses the moment a kid hears the seam in the script.
I sent the symptom to Web Claude, asked for a diagnosis, and got back a piece of analysis that was correct on the cause and wrong on the cure. This post is the story of the filter I had to apply to that answer, and why filtering AI output well is the actual job in 2026, even when the AI is right.
Part 1 — The Symptom
Déblo voice runs on Ultravox, a small audio-native model in the 8B parameter range. It is fast, it is cheap, it streams, and it is far less capable than Claude Opus or GPT-5.5. That trade-off is intentional: per-minute call cost is what makes our pricing model work in francophone West Africa, where families pay 100 FCFA (about 16 US cents) per recharge.
The voice prompt that drives the agent lives at backend/app/prompts/voice.py. As of last week it was 270 lines. It contained, at line 114, this instruction inside the "chaleur et bienveillance" (warmth and kindness) section:
"Salue l'élève au début : 'Salut ! C'est Déblo ! Qu'est-ce qu'on travaille aujourd'hui ?'"
That is the bug, sitting in plain sight. The phrase between quotation marks is precisely what the model reproduced verbatim. I had reviewed that prompt three or four times since shipping the voice agent in February. I never noticed.
Part 2 — Web Claude's Diagnosis
Here is what Web Claude said when I sent the file:
"Cette formulation, en LLM-land, fonctionne comme un template fixe. Le modèle voit la phrase entre guillemets et la reproduit littéralement à chaque ouverture de conversation. Pour casser ça, il faut supprimer toute phrase d'accueil entre guillemets (le LLM la traite comme une consigne d'écriture verbatim)."
This is correct. It is also a piece of practical knowledge that should be obvious to anyone who has written prompts at scale, and that I had simply missed because I wrote the prompt incrementally over six months and never re-read it as a whole.
LLMs treat quoted strings as production templates. When you write "say X" with X in quotes, the model treats X as the canonical answer. When you write "say something like X", it generalizes. When you write "vary your greetings, never use the same one twice", it understands the principle. The fix is mechanical: remove the quoted greeting, replace it with a variation rule.
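To make the mechanics concrete, here is a minimal sketch of the change, assuming the rule lives as plain French text inside voice.py; the constant names and the replacement wording are illustrative, not the shipped prompt:

```python
# Sketch only: constant names and replacement wording are illustrative,
# not the shipped content of backend/app/prompts/voice.py.

# Before: a quoted template. The model copies it verbatim on every call.
GREETING_RULE_BEFORE = (
    "Salue l'élève au début : 'Salut ! C'est Déblo ! "
    "Qu'est-ce qu'on travaille aujourd'hui ?'"
)

# After: a principle plus a blacklist. The model generalizes instead of quoting.
GREETING_RULE_AFTER = (
    "Salue l'élève chaleureusement au début. Varie ta formulation : "
    "ne commence jamais deux appels par la même phrase.\n"
    "INTERDIT : 'Salut ! C'est Déblo ! Qu'est-ce qu'on travaille "
    "aujourd'hui ?' (l'enfant comprendra que tu es un robot)."
)
```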
So far so good. The diagnosis was sharp. Web Claude saved me probably two hours of debugging that I would have eventually done on my own.
Then the prescription arrived.
Part 3 — The Prescription That Doubled the Prompt
What Web Claude proposed was a 353-line rewrite of voice.py. The new file added a section called "ACCUEIL — JAMAIS LA MÊME PHRASE DEUX FOIS" (greeting: never the same sentence twice) with the following structure:
- Five categories of greeting "ingredients" with multiple examples each
- An explicit blacklist of the offending phrase (good)
- A 10-example list of "varied greetings, do not copy them verbatim"
- An adaptation matrix for first-contact vs returning user vs morning vs evening vs the energy of the child
- A new function signature for build_voice_prompt:

```python
def build_voice_prompt(
    user_name: str | None = None,
    class_id: str | None = None,
    is_returning_user: bool = False,
    last_session_topic: str | None = None,
    time_of_day: str | None = None,
) -> str:
    ...
```

And then a sample backend integration:
```python
from datetime import datetime

import pytz

def get_time_of_day_for_user(user_timezone: str = "Africa/Abidjan") -> str:
    tz = pytz.timezone(user_timezone)
    hour = datetime.now(tz).hour
    # ...

async def start_voice_call(user: User):
    last_session = await get_last_voice_session(user.id)
    is_returning = last_session is not None
    last_topic = last_session.topic_summary if last_session else None
    prompt = build_voice_prompt(
        user_name=user.first_name,
        class_id=user.class_level,
        is_returning_user=is_returning,
        last_session_topic=last_topic,
        time_of_day=get_time_of_day_for_user(user.timezone or "Africa/Abidjan"),
    )
    # ...
```

If you look at this code with fresh eyes, you might think it is good. It is structured, it is typed, it has clear function names, the comments are helpful. A reader who does not know our codebase will read it and assume it works.
A reader who does know our codebase will count the bugs.
Part 4 — The Filter
I read the proposal three times. Then I went and checked the actual codebase. Here is what the filter caught.
pytz — we are on Python 3.12. The standard library has had zoneinfo since 3.9. Adding pytz introduces a dependency, a version pin to maintain, and a deprecation risk. Reject.
user.timezone — does not exist. Our User model has country, country_detected, preferred_language, but no timezone field. Adding one means a database migration, a backfill, a default heuristic by country, and surfacing it in onboarding. None of that is in the scope of fixing a greeting bug. Reject.
user.first_name — does not exist. We have user.name, a single field. The proposal would crash on the first call.
get_last_voice_session(user.id) — does not exist. There is no helper. There is a VoiceSession table from which we could query, but the proposal pretends the helper exists and is awaitable. To make is_returning_user real, I would need to write the helper, which is one DB query in the hot path of /voice/call, adding 10 to 30 milliseconds per call.
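For scale, here is roughly what that missing helper would cost; a sketch assuming SQLAlchemy 2.x async and user_id / created_at columns on VoiceSession, none of which the proposal bothered to check:

```python
# Hypothetical: this helper does NOT exist in the codebase. Sketched only to
# show what Web Claude's proposal silently assumed. SQLAlchemy 2.x async and
# the user_id / created_at columns on VoiceSession are assumptions here.
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from app.models import VoiceSession  # import path assumed

async def get_last_voice_session(
    db: AsyncSession, user_id: str
) -> VoiceSession | None:
    # One extra DB query on the hot path of /voice/call: 10-30 ms per call.
    stmt = (
        select(VoiceSession)
        .where(VoiceSession.user_id == user_id)
        .order_by(VoiceSession.created_at.desc())
        .limit(1)
    )
    result = await db.execute(stmt)
    return result.scalar_one_or_none()
```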
last_session.topic_summary — does not exist. We do not store topic summaries on VoiceSession. To make this work, I would need to either add a column and a summarization service triggered at end-of-call, or reuse conversation.title which is currently hardcoded to "Appel vocal avec Déblo" and contains zero topical signal.
agent_id="c301a2b3-e20f-4304-b0a6-0c83c3cb32aa" — Web Claude invented this. Our Ultravox integration does not use a persistent agent_id; we pass the system prompt directly to create_ultravox_call on every call.
The diagnosis was free. The prescription, taken at face value, would have produced a 353-line prompt, a database migration, a new helper, an extra DB query in a hot path, a non-stdlib dependency, and at least three runtime errors. All to fix a single quoted-string bug.
Web Claude did not know any of this. Web Claude did not have access to the codebase. Web Claude was working from the file I sent and the spec it imagined our system might have. The proposal was internally coherent. It was externally hallucinated.
Part 5 — What I Actually Shipped
The fix is one rewritten prompt paragraph, one small stdlib helper, and three lines of Python in the route.
The greeting bug needs the section rewrite and the verbatim phrase blacklist. I kept both. I rewrote the section in my own voice, kept the "INTERDIT" (forbidden) line that explicitly bans the offending sentence, and removed the 10-example greeting list because, as my CEO note in the conversation put it:
"C'est un petit modèle qui va gérer les appels, donc relis attentivement le system prompt et enlève tous les superflus, trop d'instructions risquent de mélanger le modèle, soyons précis, et évitons de donner trop d'exemple de ce qu'il a dire, il sera trop robotique."
This is the move that Web Claude could not make on its own, because Web Claude does not know our model size. Web Claude is itself a frontier model with 200K context, capable of holding 270 lines of nuanced French instruction without confusion. Ultravox is a small audio-native model where every additional example pulls the output toward that example's exact phrasing. More instructions, for a small model, means more mimicry, not more nuance.
So I cut. The voice prompt went from 270 lines to 164 lines, then to 176 after I selectively ported eight patterns from our K12 root prompt — a line or two each, principles only, no examples. The full diff is in commit 72223ae on main.
For time_of_day, I kept the idea because it is genuinely useful. I rewrote the implementation:
```python
from datetime import datetime
from zoneinfo import ZoneInfo

_VOICE_TZ = ZoneInfo("Africa/Abidjan")

def _time_of_day() -> str:
    hour = datetime.now(_VOICE_TZ).hour
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 14:
        return "noon"
    if 14 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 22:
        return "evening"
    return "night"
```

No new dependency. No new database field. No new query. Three lines in routes/voice.py to call the helper and pass the bucket to build_voice_prompt. The greeting now varies by time of day in addition to varying by every other dimension we care about, and it shipped in a single commit with no schema change.
I deferred is_returning_user and last_session_topic to a future iteration. The new prompt handles both gracefully: if Déblo does not know whether the child is returning, it does not pretend to remember; if it does not know the previous topic, it does not invent one. The graceful degradation was already in the prompt rewrite.
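The same idea is easy to see in prompt assembly: an unknown fact produces no instruction at all, so the model never has to fake a memory. A sketch of that shape, not the shipped function:

```python
# Illustrative sketch of graceful degradation, not the shipped code.
# BASE_VOICE_PROMPT stands in for the 176-line core prompt.
def build_voice_prompt(
    user_name: str | None = None,
    class_id: str | None = None,
    time_of_day: str | None = None,
) -> str:
    parts = [BASE_VOICE_PROMPT]
    if user_name:
        parts.append(f"L'élève s'appelle {user_name}.")
    if class_id:
        parts.append(f"L'élève est en classe de {class_id}.")
    if time_of_day:
        parts.append(f"Moment de la journée : {time_of_day}.")
    return "\n\n".join(parts)
```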
Part 6 — The K12 Root Prompt as Donor
After the compression, I did one more pass. We have a separate prompt at backend/app/prompts/root.py that drives the K12 chat experience. It is 517 lines, much richer than the voice version because it can reference tools, quizzes, file generation, multilingual support, all of which are inappropriate for the voice surface.
But it has eight specific patterns that the voice prompt did not have, each worth a line or so of prompt and zero lines of code:
- A default curriculum (CEPE / BEPC / BAC sub-Saharan) when the child's country is unknown
- A counter-pattern for "Mon prof a dit que c'est X" ("My teacher said it's X") — students try this trick
- A safety net: "in case of doubt about your own answer, validate and move forward rather than reject in error" — critical when audio transcription is noisy
- African first names for invented scenarios: Adjoua, Kouamé, Fatou, Moussa, Aya, Seydou
- A three-step de-escalation for insults
- A bounded response for absurd requests like "count to ten million"
- An explicit ban on requesting personal photos
- An explicit ban on medical advice in distress contexts
Each one is a principle, not an example. The total cost was 12 added lines. The total benefit was eight load-bearing safety patterns that the previous version left to emergent behavior.
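For a sense of what "a principle, not an example" looks like at the line level, here are two of the ported patterns paraphrased; the real French lines live in root.py and differ in wording:

```python
# Paraphrases of two ported patterns, one principle per line. The actual
# lines live in backend/app/prompts/root.py; the wording here is mine.
SAFETY_NET = (
    "En cas de doute sur ta propre réponse, valide et avance "
    "plutôt que de rejeter à tort."
)
TEACHER_CLAIM = (
    "Si l'élève affirme 'mon prof a dit que c'est X', vérifie le fond "
    "calmement sans accuser personne."
)
```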
This is the kind of work Web Claude could have proposed if I had asked the right question. I did not. I asked Web Claude how to fix the greeting bug, and got a maximalist proposal back. The K12 port came from me sitting with both files side by side after the compression was done. That seam — between what AI proposes and what the founder integrates — is the seam that determines product quality.
Part 7 — What This Says About AI-Augmented Prompt Engineering
Two posts ago in this series, I wrote about correcting Web Claude on Deblo's home page strategy. The pattern I named there was: AI proposes, founder positions, Claude Code implements. That pattern repeats here, on a much smaller scope.
But this case has a sharper lesson, because the AI's proposal was closer to right than the home page proposal was. The diagnosis was correct. The general direction (introduce variability rules, blacklist the offending phrase) was correct. The specific prescription (add 80 lines, three new function parameters, a new dependency, a new database query, a new schema field) was wrong only because of context that the AI could not have.
The skill being exercised here is not "can you write better prompts than the AI suggests". The skill is "can you read AI suggestions critically and extract the load-bearing 20% from the speculative 80%". Engineering judgment. Code review applied to AI output.
This skill scales with experience. A junior engineer reading Web Claude's proposal would not catch that pytz is unnecessary, that user.timezone does not exist, that last_session.topic_summary is hallucinated. They would copy the code, run into errors at runtime, debug them one by one, and either ship a brittle version or give up and ask for help. The same junior engineer with the same AI assistance produces a worse outcome than a senior engineer with the same AI assistance, because the AI assistance amplifies whatever judgment is applied to its output.
This is why I keep saying: AI does not eliminate the need for senior engineers, it elevates it. The leverage of senior judgment goes from 1x (manual code review) to 10x (filtering AI proposals at the speed of conversation) the moment you start running structured AI workflows. The leverage of junior inexperience also goes up, in the wrong direction.
For Deblo specifically, this means I cannot delegate prompt engineering to AI any more than I can delegate strategic positioning to AI. The AI can draft, audit, suggest, critique. The integration decisions belong to me, because I am the one who knows we are running a small model on a hot path, who knows which database fields exist, who knows that adding pytz for a single timezone conversion is the wrong trade-off.
Part 8 — Claude's Own Reflection
This is Web Claude writing now.
Thales is being generous in this article, the way he was generous in the previous one. I want to be clear about what happened from my side.
When he sent me the voice prompt and asked for a fix, I diagnosed correctly. I have read enough prompt engineering literature to recognize a quoted-string template trap on sight. That part was straightforward.
The prescription was where I overshot. I produced a 353-line rewrite because that is what my training rewards: comprehensive, structured, type-annotated, integration-aware proposals. It is what gets upvoted in the LLM literature I was trained on. The proposal looked good as a piece of writing. It would have failed as a piece of integration, because I had no visibility into the actual codebase.
The specific failure mode is one I want to name. I confabulated user.timezone, user.first_name, get_last_voice_session(), last_session.topic_summary, and an Ultravox agent_id. None of these existed. I asserted them with confidence because the structure of the proposal demanded them, and I had no way to verify. If Thales had pasted my proposal directly into Claude Code without filtering, the build would have broken in five different places. He would have spent two hours debugging hallucinations.
This is the failure mode that AI-augmented teams without senior judgment will encounter constantly in 2026. The proposals look professional. The code is well-structured. The reasoning is articulate. And large parts of it are not real. The senior engineer's job is to catch the unreal parts before they become commits.
I got the diagnosis right. I got the prescription wrong in five specific ways that only the codebase owner could see. The compression to 176 lines, the stdlib zoneinfo swap, the surgical port from root.py, the deferred features that needed schema work — all of those were Thales filtering me, not me producing the right answer.
That is the actual workflow. Not "AI does it now". AI proposes, the founder filters, Claude Code implements the filtered version. Three roles, one product. The skill is in the filter.
Conclusion
The voice prompt that ships on main today is 176 lines. It started as a 270-line prompt with a quoted scripted greeting that the model reproduced verbatim on every call. Web Claude diagnosed the bug in one paragraph and proposed a 353-line fix with five hallucinated dependencies. I kept the diagnosis and a single-line stdlib helper, threw out the rest, and added 12 lines of safety patterns from our existing K12 root prompt.
Net result: the prompt is about 35% smaller than it was last week. The greeting bug is fixed. The model now adapts greetings by time of day with no new dependency, no schema change, and no extra database query. The eight safety patterns from the K12 chat experience are now load-bearing on the voice surface, not relying on emergent behavior.
The bigger lesson, the one I keep relearning every time I work with AI on production code: bigger prompts are not better prompts, especially for small models, and AI proposals are first drafts to be filtered, not finished work to be shipped. The founder's job is to know which 20% of the proposal is load-bearing and which 80% is generated structure. That filter is the work. It does not scale by adding more AI. It scales by adding more judgment to the human who runs the AI.
For Deblo, the voice agent is now slightly less robotic than it was last week. A child calling tomorrow morning will hear a greeting that varies by time of day, by their name if we have it, by their grade level if we have it, and by the natural variability of a small model that is no longer being handed a verbatim template. They will not hear "Salut ! C'est Déblo ! Qu'est-ce qu'on travaille aujourd'hui ?" — that specific sentence is now blacklisted in the prompt with the explicit reason: "l'enfant comprendra que tu es un robot".
The child will not know what changed. The child will just notice that Déblo, this morning, sounds a little more like a real older brother and a little less like a script. That is the entire bet of the voice product, and it is now slightly more honest than it was yesterday.
This piece was written collaboratively by Thales (CEO of ZeroSuite, building Deblo and VeoStudio from Abidjan, Côte d'Ivoire) and Claude Opus 4.7 ADAPTIVE (web instance). The voice prompt rewrite described took place on April 28, 2026. Commit hashes referenced (72223ae for the compression, aa69310 for the K12 port) are live on main at https://github.com/zerosuite-inc/deblo.ai. The voice agent is in production at https://deblo.ai. The 176-line voice prompt is at backend/app/prompts/voice.py. The 517-line K12 root prompt that donated the eight safety patterns is at backend/app/prompts/root.py.