On March 31, 2026, Thales asked sh0's AI assistant to generate a complete deployment configuration -- a task that would take Claude Opus several minutes and thousands of tokens. Midway through, his WiFi flickered. The response vanished. Five minutes of generation, gone.
This was not a rare edge case. It was the fundamental fragility of how every AI chat application works: a single HTTP connection, streaming tokens in real time, with nothing persisted anywhere along the way. If that connection breaks -- WiFi reconnect, laptop sleep, proxy timeout, browser crash -- the tokens that already left Anthropic's servers disappear into the void. You cannot get them back. Anthropic already counted them. You already paid for them.
We decided this was unacceptable. Here is how we fixed it.
## The Anatomy of a Fragile Stream
Before the fix, sh0's AI gateway worked like every other AI chat application:
```
Browser ──SSE──> sh0.dev Gateway ──Stream──> Anthropic API
                        │
                        └── tokens flow through, nothing saved
```

The gateway was a pass-through. Anthropic sent tokens. The gateway formatted them as Server-Sent Events. The browser rendered them. If the browser disconnected, the gateway's `ReadableStream` controller threw an error, the `for await` loop over Anthropic's stream broke, and everything stopped.
Three specific problems made this architecture fragile for long generations:
1. No heartbeat. When Claude uses tools -- MCP server calls, web searches, URL fetches -- it can spend 30 to 60 seconds executing before sending the next token. During that silence, every proxy in the chain (Cloudflare at ~100 seconds, Caddy at ~60 seconds, the browser itself) starts wondering if the connection is dead. Cloudflare's SSE timeout is generous but not infinite. One slow tool execution on a congested network, and the proxy closes the connection.
2. No server-side persistence. The generated text lived in exactly one place: a JavaScript variable in the browser (`state.currentResponse`). The gateway held nothing. If you refreshed the page, the variable was gone. If you closed the tab, it was gone. The conversation was only saved to the database when the stream completed -- meaning a 4-minute generation that failed at minute 3 saved nothing.
3. No reconnection. When the SSE connection dropped, the client showed a red error toast: "Stream interrupted." That was it. No recovery path. No way to get back what was already generated. The user's only option was to send the same message again and pay for the entire generation a second time.
## The Fix: Server-Side Stream Jobs
The core insight is simple: decouple the Anthropic stream from the client connection. The gateway should persist the response to a database as it generates, regardless of whether anyone is listening.
```
Browser ──SSE──> sh0.dev Gateway ──Stream──> Anthropic API
   │                    │
   │      ┌─────────────┘
   │      │ tokens flow in
   │      ▼
   │  PostgreSQL
   │  (AiStreamJob row)
   │      │
   │      │ flushed every 2s
   │      ▼
   └── emit to client (if still connected)

Browser disconnects?  Gateway keeps streaming. Keeps flushing to DB.
Browser reconnects?   GET /api/ai/chat/job/:id → full accumulated text
```

### The Database Model
We added a single table:
```sql
CREATE TABLE ai_stream_jobs (
    id              UUID PRIMARY KEY,
    account_id      UUID NOT NULL REFERENCES accounts(id),
    conversation_id TEXT,
    model           TEXT NOT NULL,
    status          TEXT DEFAULT 'streaming', -- streaming | done | error
    text_content    TEXT DEFAULT '',
    events          TEXT DEFAULT '[]',        -- JSON: suggestions, files, tool calls
    tokens_in       INT DEFAULT 0,
    tokens_out      INT DEFAULT 0,
    error           TEXT,
    last_chunk_at   TIMESTAMP DEFAULT NOW(),
    created_at      TIMESTAMP DEFAULT NOW()
);
```

Every AI request creates a row. Every 2 seconds during generation, the accumulated text is flushed to this row. When the stream completes (or errors), the row is finalized.
### The Disconnect-Safe Emit
The most critical change was four lines of code:
```typescript
const emit = (data: Record<string, unknown>) => {
  try {
    controller.enqueue(encoder.encode(`data: ${JSON.stringify(data)}\n\n`));
  } catch {
    // Client disconnected. Continue generating server-side.
    clientDisconnected = true;
  }
};
```

Before this change, if `controller.enqueue()` threw (because the client closed the connection), the error propagated up, killed the Anthropic stream iterator, and stopped everything. Now, we catch the error silently. The `for await` loop over Anthropic's response continues. The `fullResponse` variable keeps accumulating. The periodic flush keeps writing to PostgreSQL. The client is gone, but the generation finishes.
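To make the interaction concrete, here is a sketch of the loop that calls `emit`. The event shape is simplified to `{ delta?: string }` (the real Anthropic SDK events are richer), and `runGeneration` and `onChunk` are illustrative names, not sh0's actual code. The key property: nothing in this loop depends on the client still being connected.

```typescript
async function runGeneration(
  stream: AsyncIterable<{ delta?: string }>,
  emit: (data: Record<string, unknown>) => void, // swallows errors, never throws
  onChunk: (full: string) => void,               // hook for the periodic DB flush
): Promise<string> {
  let fullResponse = '';
  for await (const event of stream) {
    if (!event.delta) continue;
    fullResponse += event.delta;                // server-side accumulation
    emit({ type: 'delta', text: event.delta }); // best-effort send to the client
    onChunk(fullResponse);
  }
  return fullResponse;
}
```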
### The Heartbeat
Every 15 seconds, the gateway emits a heartbeat event:
```typescript
const heartbeatInterval = setInterval(() => {
  emit({ type: 'heartbeat', ts: Date.now() });
}, 15_000);
```

This serves two purposes:

1. Keeps proxies alive. Cloudflare, Caddy, and nginx all have idle connection timeouts. A heartbeat every 15 seconds is well within every reasonable timeout threshold.
2. Enables client-side timeout detection. The client tracks when it last received any data. If 45 seconds pass with nothing -- no delta, no heartbeat, nothing -- the client knows the connection is dead and switches to recovery mode.
### The Recovery Path
When the client detects a disconnect (heartbeat timeout or stream read error), it does not show an error. Instead, it switches to polling:
```typescript
// Instead of: onError('Stream interrupted')
// We do:
if (currentJobId) {
  onDisconnect(currentJobId); // Triggers polling recovery
}
```

The polling loop hits `GET /api/ai/chat/job/:id` every 3 seconds:
```typescript
async function startPollingRecovery(jobId: string) {
  const poll = async () => {
    const result = await pollJob(jobId, apiKey);
    state.currentResponse = result.data.textContent; // Replace with server's version
    if (result.data.status === 'done') {
      // Finalize: save conversation, update wallet
      return true;
    }
    return false; // Keep polling
  };

  // Poll every 3 seconds until done
  const interval = setInterval(async () => {
    if (await poll()) clearInterval(interval);
  }, 3_000);
}
```

The user sees a subtle "Reconnecting -- server is still generating..." banner instead of an error. The text keeps appearing as the server flushes new content to the database. When the generation completes, the conversation is finalized exactly as if no disconnect had occurred.
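On the server side, the job endpoint only has to read the row and return its current state. This is a sketch of the assumed response shape, not the real handler; `JobRow` and `toJobResponse` are illustrative names.

```typescript
interface JobRow {
  id: string;
  status: 'streaming' | 'done' | 'error';
  textContent: string; // everything flushed so far
  tokensOut: number;
  error: string | null;
}

// Shape the DB row into the payload the polling client consumes.
function toJobResponse(row: JobRow | null) {
  if (!row) return { ok: false as const, status: 404 };
  return {
    ok: true as const,
    data: {
      status: row.status,
      textContent: row.textContent,
      tokensOut: row.tokensOut,
      error: row.error,
    },
  };
}
```

Because the response always carries the full accumulated text, the client can blindly replace its local copy on every poll; there is no delta bookkeeping to get wrong.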
### Crash Recovery
What if the browser crashes entirely? The tab is gone. The JavaScript variable is gone. Even the job ID is gone.
Two mechanisms protect against this:
1. Periodic saves. Every 10 seconds during streaming, the current partial response is saved to the sh0 instance's SQLite database (the same place conversations are persisted). If the browser crashes at minute 3 of a 5-minute generation, you lose at most 10 seconds of text.
2. localStorage job tracking. When a stream starts, the job ID is written to localStorage. On page mount, the dashboard checks for an active job:
```typescript
export async function recoverActiveJob(): Promise<boolean> {
  const activeJob = loadActiveJob(); // From localStorage
  if (!activeJob) return false;

  const result = await pollJob(activeJob.jobId, apiKey);

  if (result.data.status === 'done') {
    // Job finished while we were away -- show the full response
    state.messages.push({ role: 'assistant', content: result.data.textContent });
    return true;
  }
  if (result.data.status === 'streaming') {
    // Still going -- resume polling
    startPollingRecovery(activeJob.jobId);
    return true;
  }
  return false; // Errored or expired job -- nothing to recover
}
```

The user force-quits Chrome, reopens it, navigates to the dashboard -- and sees the complete response that was generated while they were gone.
## The Cost Problem: Prompt Caching
While we were fixing the streaming architecture, we noticed something else: every message in a conversation re-sends the entire system prompt and conversation history. sh0's system prompt is substantial -- it includes server context, tool definitions, agent overlays, and behavioral instructions. On a 20-message conversation, the input tokens were dominated by the system prompt being sent 20 times.
Anthropic's prompt caching solves this. By adding cache_control: { type: 'ephemeral' } to the system prompt and the last user message, Anthropic caches the prefix and reuses it for 5 minutes:
```typescript
const cachedSystem = [
  { type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } },
];
```

The first message in a conversation pays full price. Every subsequent message within 5 minutes gets a ~90% discount on input tokens for the cached portion. For a 20-message debugging session with an Opus model, this can reduce the total cost from $2+ to under $0.50.
## What We Shipped
Four changes, deployed in one session:
| Change | Impact |
|---|---|
| `AiStreamJob` table + DB flush | Server generates to completion even if client disconnects |
| 15s heartbeat | Prevents proxy timeouts during tool execution |
| Client polling recovery | Automatic reconnection with "Reconnecting..." UI |
| Prompt caching | ~90% input token cost reduction on multi-turn conversations |
The total implementation: ~350 lines of TypeScript across the gateway and dashboard. One Prisma migration. One new API endpoint. Zero breaking changes.
## Lessons
1. SSE is fragile by default. Server-Sent Events are beautiful for real-time streaming. They are terrible for long-running operations. Every proxy, every network hop, every laptop lid close is a potential kill point. If your SSE stream runs longer than 60 seconds, you need a persistence layer.
2. The server should not care about the client. The gateway's job is to call Anthropic and save the result. Whether a browser is listening is irrelevant. This is the same principle behind job queues: the producer does not care about the consumer. Decouple them.
3. Polling is underrated. We considered SSE reconnection with Last-Event-ID, WebSocket upgrade, and various push-based recovery mechanisms. Polling a single endpoint every 3 seconds is simpler, more resilient (works after server restarts), and fast enough that the user does not notice.
4. Cache your AI calls. If your system prompt is more than 1,000 tokens and your users have multi-turn conversations, prompt caching is not optional. It is a 10x cost reduction sitting there waiting to be enabled.
## The Methodology Note
This entire feature -- server-side persistence, client recovery, UI improvements, and prompt caching -- was designed, implemented, and tested in a single Claude Code session. No back-and-forth with a human engineer on architecture decisions. No PRs to review. The session explored the codebase, identified the root causes, designed the solution, implemented it across two repositories, verified the builds, and pushed to production.
This is what it looks like when an AI operates as CTO: it sees the problem end-to-end, from the Anthropic API contract to the Svelte 5 reactivity system to the Cloudflare proxy timeout, and ships a cohesive solution.
The next session will audit it. That is the methodology. Build, audit, audit, approve. Each AI session optimizes locally. The audit sessions catch what the builder missed. The CEO tests everything manually with a checklist. The system converges on the right answer.
But the builder session has to be good enough that the auditor's job is finding edge cases, not redesigning the architecture. Today, the architecture was right.
Nothing drops. Nothing is lost. The stream keeps flowing.