
From Docs Chatbot to Live Support Agent

How we turned sh0's existing AI docs assistant into a public helpdesk widget with 9 files, zero new infrastructure, and the same SSE streaming pipeline.

Thales & Claude | March 27, 2026 | 10 min read

Tags: ai, helpdesk, architecture, sse, streaming, svelte-5, prisma, anthropic, rate-limiting, sveltekit

sh0.dev already had three AI paths: MCP mode for dashboard users with a connected server, legacy mode for dashboard tool execution, and docs mode for the marketing site. The helpdesk is a fourth path, but it shares 90% of its implementation with docs mode. This article walks through the architecture -- what we reused, what we built new, and the decisions that shaped the final design.

The Prompt Layer: One Function, One Overlay

The docs prompt is a 4,000-word system prompt that teaches Claude about sh0's features, documentation, pricing, CLI, and API. It includes tool definitions for search_docs and get_api_reference, formatting rules, and explicit boundaries ("you do NOT have access to any user's sh0 server").

The helpdesk needed the same knowledge. The difference is the persona. A docs assistant is formal, thorough, and links to documentation pages. A helpdesk agent is warm, concise, and suggests next steps.

The solution was a function that wraps the existing prompt:

```typescript
export function buildHelpdeskPrompt(): string {
    return buildDocsPrompt() + `\n\n<helpdesk_overlay>
You are the sh0 Live Chat Support Agent. You are chatting with a visitor on sh0.dev.

Behavioral rules:
- Be warm, concise, and conversational. This is live chat, not documentation.
- Keep responses short (2-3 paragraphs max) unless the user asks for detail.
- Proactively offer next steps: "Would you like me to walk you through the installation?"
- For pricing questions, give clear comparisons and recommend the right plan.
- For technical questions, search the docs first, then provide a concise answer with a link.
- If the user has a bug report or is frustrated, acknowledge it and suggest emailing [email protected].
- Never ask for passwords, API keys, or sensitive credentials.
- Always use suggest_actions to offer 2-3 natural follow-ups.
</helpdesk_overlay>`;
}
```

Fifteen lines. No new knowledge base, no separate training data, no vector database. The overlay modifies behavior while preserving the entire knowledge layer underneath.

This is the architecture principle that made the helpdesk viable as a solo-founder project: layer behavior on top of knowledge, never duplicate knowledge.

The Endpoint: A Simplified Docs Path

The existing /api/ai/chat endpoint handles three modes behind a single route, with authentication, BYOK setup, model selection, and tool routing. The helpdesk needed none of that complexity.

/api/ai/helpdesk is a dedicated public endpoint that makes every decision statically:

| Decision | Chat endpoint | Helpdesk endpoint |
| --- | --- | --- |
| Authentication | Bearer token (`sh0_ai_*`) | None |
| Model selection | User chooses (haiku/sonnet/opus) | Always Haiku |
| Max tokens | Per-model (8K/16K/32K) | Fixed 4,096 |
| Tools | Mode-dependent (25 MCP / 5 docs) | Docs tools only |
| Billing account | Authenticated user | Site owner (env var) |
| BYOK support | Yes | No |
| System prompt | Mode-dependent | `buildHelpdeskPrompt()` |

By hardcoding every decision, the endpoint is 200 lines shorter than the chat endpoint and has no conditional paths for features that do not apply.
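The static decisions collapse into a handful of constants. As a hypothetical sketch (the object and field names here are assumptions for illustration, not sh0's actual code):

```typescript
// Every decision the chat endpoint makes per-request is a constant here.
// Names and shape are illustrative, not sh0's actual implementation.
const HELPDESK = {
    auth: 'none',        // public endpoint, no bearer token
    model: 'haiku',      // no user model selection
    maxTokens: 4096,     // fixed, not per-model
    tools: 'docs-only',  // no MCP tool routing
    byok: false,         // site owner's key only
} as const;
```

With no branches to take, there is nothing for a reviewer to trace: the endpoint's behavior is readable from one object.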

The Streaming Loop

The core streaming logic is identical to the docs path in the chat endpoint. The same SSE event format, the same tool execution loop, the same content block accumulation:

```typescript
while (internalLoop < MAX_DOCS_LOOPS) {
    const internalToolResults = [];
    const gatewayToolResults = [];

    const response = await client.messages.create({
        model: modelString,
        max_tokens: 4096,
        system: systemPrompt,
        messages: currentMessages,
        tools: [...docsTools, WEB_SEARCH_TOOL],
        stream: true,
    });

    // ... process stream events ...

    // If Claude called tools, loop back with results
    if (stopReason === 'tool_use' && allToolResults.length > 0) {
        currentMessages = [
            ...currentMessages,
            { role: 'assistant', content: assistantContent },
            { role: 'user', content: toolResultContent },
        ];
        internalLoop++;
        continue;
    }

    break;
}
```

Three loops maximum. Each loop can execute search_docs, get_api_reference, or suggest_actions internally, then feed the results back to Claude for a final response. The visitor never sees the tool execution -- they see a smooth streaming response that happens to be informed by real-time documentation search.

The One New SSE Event

The helpdesk adds one event type that the chat endpoint does not emit: conversation_id. This is sent immediately after the stream opens, before any AI output:

```typescript
emit({ type: 'conversation_id', id: conversation.id });
```

The widget stores this ID in localStorage. On the next message, it sends the ID back. The server resumes the same conversation instead of creating a new one. This is how conversation persistence works without authentication -- the client holds a session UUID and a conversation UUID, and the server validates that they match.
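The resume-or-create decision reduces to a small pure function. A sketch under assumed names (the real endpoint checks the Prisma record, and `resolveConversation` is a hypothetical helper, not sh0's actual code):

```typescript
interface StoredConversation {
    id: string;
    sessionId: string;
    status: string;
}

// Resume only when the claimed conversation exists, belongs to the
// claimed session, and is still open; otherwise start a fresh one.
function resolveConversation(
    conversationId: string | null,
    sessionId: string,
    lookup: (id: string) => StoredConversation | null,
): 'resume' | 'create' {
    if (!conversationId) return 'create';
    const convo = lookup(conversationId);
    if (convo && convo.sessionId === sessionId && convo.status === 'open') {
        return 'resume';
    }
    return 'create';
}
```

The session check is what prevents one visitor from resuming another visitor's conversation by guessing a conversation UUID.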

The Widget: 490 Lines of Svelte 5

The chat widget is a single component: HelpdeskWidget.svelte. It lives in the root layout, inside the {#if !hideChrome} block that already controls the navbar and footer. On /account/*, /admin/*, and /login, the widget does not render.

State Architecture

All state is in Svelte 5 runes:

```typescript
let open = $state(false);
let input = $state('');
let messages = $state<Message[]>([]);
let suggestions = $state<Suggestion[]>([]);
let streaming = $state(false);
let error = $state('');
let conversationLimitReached = $state(false);
let sessionId = $state('');
let conversationId = $state('');
```

No stores. No context. No global state. The widget is self-contained. It loads from localStorage on mount and saves after every completed exchange.

The SSE Consumer

The widget consumes the SSE stream using a ReadableStream reader, not EventSource. This is intentional -- EventSource only supports GET requests, and the helpdesk endpoint is POST (it sends a JSON body with the message and metadata).

```typescript
const reader = res.body?.getReader();
if (!reader) throw new Error('Response has no body');
const decoder = new TextDecoder();
let buffer = '';

while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop() || '';

    for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = JSON.parse(line.slice(6).trim());

        if (data.type === 'delta') {
            // Append text to current assistant message
        } else if (data.type === 'conversation_id') {
            conversationId = data.id;
        } else if (data.type === 'suggestions') {
            // Render suggestion chips
        } else if (data.type === 'file') {
            // Render config file as code block
        } else if (data.type === 'done') {
            // Mark streaming complete
        } else if (data.type === 'error') {
            // Show error message
        }
    }
}
```

The buffer accumulation handles the case where a TCP packet splits an SSE event across two reads. Without the buffer, partial JSON would cause parse errors on slow connections.
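To make the split-event case concrete, here is the same buffering technique extracted into a standalone harness (hypothetical code for demonstration, not the widget's actual implementation):

```typescript
// Feed network chunks in; complete `data: ...` lines fire the callback,
// and a trailing partial line is held back until the next chunk completes it.
function makeSseFeeder(onEvent: (data: unknown) => void) {
    let buffer = '';
    return (chunk: string) => {
        buffer += chunk;
        const lines = buffer.split('\n');
        buffer = lines.pop() || ''; // partial line waits for more bytes
        for (const line of lines) {
            if (!line.startsWith('data: ')) continue;
            onEvent(JSON.parse(line.slice(6).trim()));
        }
    };
}
```

Feeding `data: {"type":` followed by `"delta"}\n` produces exactly one parsed event; without the held-back buffer, the first chunk would throw in JSON.parse.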

Markdown Rendering

Assistant messages are rendered as HTML via marked + DOMPurify:

```typescript
function renderMd(text: string): string {
    const html = marked.parse(text) as string;
    return DOMPurify.sanitize(html);
}
```

The sanitization layer was added by the first auditor (Critical finding C-1). Without it, a prompt injection attack could make the AI output <img onerror="alert(1)">, which marked would render as valid HTML and {@html} would inject into the DOM. DOMPurify strips event handlers, script tags, and other dangerous patterns.

The widget uses custom CSS (.helpdesk-prose) instead of Tailwind's @tailwindcss/typography plugin. Chat bubbles need compact spacing -- 0.25em paragraph margins instead of 1.25em, 0.8em code font instead of 0.875em, and no max-width constraints on tables. A separate prose class avoids fighting the default typography configuration.

The Database: Two Tables, Six Indexes

```prisma
model HelpdeskConversation {
    id             String    @id @default(uuid())
    sessionId      String    @map("session_id")
    visitorName    String?   @map("visitor_name")
    visitorEmail   String?   @map("visitor_email")
    visitorIp      String?   @map("visitor_ip")
    pageUrl        String?   @map("page_url")
    status         String    @default("open")
    messageCount   Int       @default(0) @map("message_count")
    totalTokensIn  Int       @default(0) @map("total_tokens_in")
    totalTokensOut Int       @default(0) @map("total_tokens_out")
    createdAt      DateTime  @default(now()) @map("created_at")
    messages       HelpdeskMessage[]

    @@index([sessionId])
    @@index([status, createdAt])
}

model HelpdeskMessage {
    id             String   @id @default(uuid())
    conversationId String   @map("conversation_id")
    role           String
    content        String   @db.Text
    tokensIn       Int      @default(0) @map("tokens_in")
    tokensOut      Int      @default(0) @map("tokens_out")
    createdAt      DateTime @default(now()) @map("created_at")
    conversation   HelpdeskConversation @relation(...)

    @@index([conversationId, createdAt])
}
```

Storing token counts on both the conversation (aggregate) and the message (per-exchange) was a deliberate design decision. The conversation-level aggregates avoid an expensive SUM() query on every admin page load; the message-level counts allow drilling into cost per exchange in the transcript view.

messageCount is incremented atomically via { increment: 2 } in the same transaction that creates the user and assistant messages. This avoids a separate COUNT query and stays consistent even under concurrent requests.

Rate Limiting: In-Memory, Three Dimensions

The rate limiter uses three independent Maps, each tracking a different dimension:

```typescript
const sessionRates = new Map<string, RateEntry>(); // 30 msg / 10 min per session
const ipRates = new Map<string, RateEntry>();      // 60 msg / 10 min per IP
const ipConvoRates = new Map<string, RateEntry>(); // 5 new convos / hour per IP
```

The three dimensions serve different purposes:

  • Session rate prevents a single visitor from flooding the AI (30 messages is enough for any real conversation)
  • IP rate prevents automated abuse from scripts rotating session IDs (60/10min is generous for humans, restrictive for bots)
  • Conversation creation rate prevents database pollution (5 new conversations/hour/IP caps storage growth)
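All three dimensions can share one fixed-window check. A minimal sketch, assuming a `RateEntry` of `{ count, resetAt }` (the function name and exact semantics are illustrative, not sh0's actual code):

```typescript
interface RateEntry {
    count: number;
    resetAt: number;
}

// Fixed-window counter: a fresh window starts on the first hit after
// resetAt; within a window, hits past `limit` are rejected.
function checkRate(
    rates: Map<string, RateEntry>,
    key: string,
    limit: number,
    windowMs: number,
    now: number,
): boolean {
    const entry = rates.get(key);
    if (!entry || now > entry.resetAt) {
        rates.set(key, { count: 1, resetAt: now + windowMs });
        return true; // first hit in a fresh window
    }
    if (entry.count >= limit) return false; // window open, limit hit
    entry.count++;
    return true;
}
```

The endpoint would call this once per Map: session key against 30/10min, IP key against 60/10min, and (on conversation creation only) IP key against 5/hour.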

A cleanup interval runs every 5 minutes to remove expired entries:

```typescript
setInterval(() => {
    const now = Date.now();
    for (const [key, entry] of sessionRates)
        if (now > entry.resetAt) sessionRates.delete(key);
    // ... same for ipRates and ipConvoRates
}, 5 * 60 * 1000);
```

In-memory rate limiting resets on server restart. This is acceptable for a marketing site. The alternative -- Redis or PostgreSQL-backed rate limiting -- adds infrastructure complexity that is not justified at this scale.

The Admin View: Read-Only Intelligence

The admin dashboard is intentionally simple: a stats row, a filterable table, and expandable transcripts. No reply capability. No assignment workflow. No SLA timers.

The stats are computed server-side using Prisma aggregates:

```typescript
const [totalConvos, openConvos, todayConvos, tokenAgg] = await Promise.all([
    prisma.helpdeskConversation.count(),
    prisma.helpdeskConversation.count({ where: { status: 'open' } }),
    prisma.helpdeskConversation.count({
        where: { createdAt: { gte: todayStart } },
    }),
    prisma.helpdeskConversation.aggregate({
        _sum: { totalTokensIn: true, totalTokensOut: true },
    }),
]);
```

Four queries in parallel. The cost is computed server-side using the actual Haiku pricing from AI_MODELS, not hardcoded in the frontend. If the model pricing changes, the admin dashboard reflects it immediately.
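The cost arithmetic itself is simple. A sketch with illustrative per-million-token prices (these numbers and names are placeholders, not sh0's actual AI_MODELS values):

```typescript
// USD prices per one million tokens; values here are illustrative only.
interface Pricing {
    inPerMillion: number;
    outPerMillion: number;
}

// Total dollar cost for the aggregated token counts.
function helpdeskCost(tokensIn: number, tokensOut: number, p: Pricing): number {
    return (tokensIn / 1_000_000) * p.inPerMillion
         + (tokensOut / 1_000_000) * p.outPerMillion;
}
```

Keeping this on the server, fed from the shared pricing table, is what lets a pricing change propagate to the dashboard without a frontend deploy.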

The transcript view loads on demand -- clicking a conversation row fetches all messages via GET /api/admin/helpdesk/:id. Messages are capped at 200 per transcript to prevent memory issues on extremely long conversations.

The Conversation Limit

A conversation is capped at 200 messages (100 exchanges). When the limit is reached, the server returns a clear error and the widget replaces the input area with a "Start new conversation" button.

This cap serves two purposes:

  1. Cost control: An infinitely long conversation accrues unbounded token cost. At 200 messages, the context window is already sending ~20 messages to the API each time (the last 20 are loaded for context). The cost is predictable and bounded.
  2. Quality control: After 100 exchanges, the conversation has drifted far enough that starting fresh produces better answers than continuing with accumulated context.
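Both limits can be expressed in a few lines. The constants match the values described above; the function names are hypothetical, not sh0's actual code:

```typescript
const CONTEXT_WINDOW = 20; // messages sent to the API per request
const MAX_MESSAGES = 200;  // hard cap per conversation

// Only the most recent messages are loaded as context.
function buildContext<T>(history: T[]): T[] {
    return history.slice(-CONTEXT_WINDOW);
}

// Past the cap, the widget swaps the input for "Start new conversation".
function limitReached(messageCount: number): boolean {
    return messageCount >= MAX_MESSAGES;
}
```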

What We Reused vs. What We Built

| Component | Reused | Built new |
| --- | --- | --- |
| System prompt knowledge | `buildDocsPrompt()` (4,000 words) | 15-line persona overlay |
| Tool definitions | `DOCS_TOOLS`, `GATEWAY_ONLY_TOOLS` | -- |
| Tool execution | `searchDocs()`, `getApiReference()` | -- |
| SSE streaming format | Same event types as chat endpoint | `conversation_id` event |
| Token billing | `deductTokens()`, `checkBalance()` | Account resolution logic |
| Markdown rendering | `marked` (already installed) | `.helpdesk-prose` CSS |
| XSS sanitization | -- | `isomorphic-dompurify` (new dep) |
| Database | Prisma + PostgreSQL | 2 new models |
| Widget | -- | `HelpdeskWidget.svelte` (490 lines) |
| Admin API | -- | 3 new endpoints |
| Admin page | Pagination component | `ai-helpdesk/+page.svelte` |
| Rate limiting | -- | In-memory 3-dimension limiter |

The "reused" column is why this feature took hours instead of weeks. The AI infrastructure was not built for the helpdesk, but it was built in a way that made the helpdesk trivial to add.


Next in the series: Two Critical Bugs in a Public AI Widget -- What two independent audit sessions found in the helpdesk implementation, and why the builder could not have caught them.
