
Production Hardening Phase 1

Phase 1 of production hardening: crash prevention, graceful error handling, and stability.

Thales & Claude | March 25, 2026 · 10 min read

There is a moment in every software project where the question shifts from "does it work?" to "does it keep working?" The features are built. The tests pass. The demo is impressive. But the gap between a working demo and a production system is measured not in features but in failures -- specifically, in how the system behaves when things go wrong.

Session 244 marked the beginning of FLIN's production hardening arc. We had over 3,200 passing tests, a full-featured compiler and runtime, and an HTTP server capable of serving real applications. What we did not have was confidence that the system would survive contact with the unpredictable chaos of production traffic. Phase 1 focused on the most fundamental requirement: stability. The system must not crash.

The Crash Audit

Before writing a single line of hardening code, we conducted a systematic audit of every code path that could cause the FLIN runtime to panic. In Rust, a panic is an unrecoverable error that terminates the current thread -- and in a single-threaded HTTP server, that means the entire application goes down.

We categorized crash vectors into three tiers:

Tier 1 -- Guaranteed crashes: Division by zero, out-of-bounds array access, stack overflow from recursive FLIN code, and integer overflow on arithmetic operations. These were deterministic: given the right input, the crash was certain.

Tier 2 -- Probabilistic crashes: Memory exhaustion from unbounded allocations (a FLIN program allocating millions of entities in a loop), file descriptor exhaustion from leaked HTTP connections, and deadlocks in the WAL (write-ahead log) checkpoint process.

Tier 3 -- Edge-case crashes: Unicode boundary errors in string slicing, malformed HTTP requests triggering parser panics, and race conditions during concurrent entity saves with foreign key constraints.

The audit identified 47 distinct crash vectors. Each one needed a fix.

Wrapping Panics at the VM Boundary

The first and most impactful change was wrapping every VM execution in a panic boundary. When the FLIN runtime executes user code -- route handlers, template rendering, cron jobs -- that execution must never propagate a Rust panic to the HTTP server's accept loop.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

pub fn execute_route_handler(
    vm: &mut VM,
    handler: &CompiledHandler,
    request: &Request,
) -> Result<Response> {
    let result = catch_unwind(AssertUnwindSafe(|| {
        vm.execute_handler(handler, request)
    }));

    match result {
        Ok(Ok(response)) => Ok(response),
        Ok(Err(runtime_err)) => {
            // Normal FLIN error -- render error page
            log::warn!("Route handler error: {}", runtime_err);
            Ok(render_error_page(500, &runtime_err.to_string()))
        }
        Err(panic_info) => {
            // Rust panic -- this is a bug in FLIN itself
            let msg = extract_panic_message(&panic_info);
            log::error!("PANIC in route handler: {}", msg);
            metrics::increment("vm.panics");
            Ok(render_error_page(500, "Internal server error"))
        }
    }
}
```
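The `extract_panic_message` helper referenced above is not shown in the article; a minimal sketch is below, assuming the standard shape of Rust panic payloads. A payload from `catch_unwind` is a `Box<dyn Any + Send>`, and in practice it is almost always a `&str` (from `panic!("literal")`) or a `String` (from `panic!("{}", x)`):

```rust
use std::any::Any;

// Illustrative helper (name and exact signature assumed, not FLIN's actual
// code): pull a human-readable message out of a caught panic payload.
fn extract_panic_message(panic_info: &(dyn Any + Send)) -> String {
    if let Some(s) = panic_info.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = panic_info.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}
```

Any other payload type falls through to a generic message, so the logging path itself can never panic on an unexpected payload.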

This pattern -- catch the panic, log it, return a 500 -- meant that even if the FLIN compiler or runtime had a bug that caused a Rust panic, the HTTP server would continue serving other requests. The crash would be logged, the user would see an error page, and the system would remain operational.

We applied this boundary at every entry point: route handlers, WebSocket message processors, cron job executors, SSE stream generators, and the admin console API. Every place where user-written FLIN code is executed now has a panic boundary.

Division by Zero and Arithmetic Safety

FLIN supports integer and float arithmetic. In most languages, dividing an integer by zero causes a crash or an exception. We chose a different approach: division by zero in FLIN returns a runtime error that can be caught by the error handling system, but never crashes the VM.

```rust
fn execute_divide(&mut self) -> Result<(), RuntimeError> {
    let right = self.pop()?;
    let left = self.pop()?;

    match (&left, &right) {
        (Value::Int(a), Value::Int(b)) => {
            if *b == 0 {
                return Err(RuntimeError::new(
                    "DivisionByZero",
                    "Cannot divide by zero",
                    self.current_span(),
                ));
            }
            self.push(Value::Int(a / b));
        }
        (Value::Float(a), Value::Float(b)) => {
            // IEEE 754: division by zero yields infinity, not an error
            self.push(Value::Float(a / b));
        }
        _ => {
            return Err(RuntimeError::type_error(
                "divide", &left, &right, self.current_span(),
            ));
        }
    }
    Ok(())
}
```

For integers, division by zero is a catchable DivisionByZero error. For floats, we follow IEEE 754 and return infinity -- consistent with how every other numeric computing system works. This means a FLIN developer can write 10 / 0 in a route handler and the worst outcome is an error page, not a dead server.
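The same guard can be expressed outside the VM with Rust's checked arithmetic. The sketch below is illustrative, not FLIN's actual code; `checked_div` returns `None` both for division by zero and for the one signed-overflow case (`i64::MIN / -1`) mentioned in the Tier 1 audit:

```rust
// Hypothetical standalone helper: checked_div covers both b == 0 and the
// i64::MIN / -1 overflow, the two ways integer division can panic in Rust.
fn safe_div(a: i64, b: i64) -> Result<i64, String> {
    a.checked_div(b)
        .ok_or_else(|| "DivisionByZero or overflow".to_string())
}
```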

Array Bounds Checking

Out-of-bounds array access was the second most common crash vector. FLIN arrays are zero-indexed, and accessing index 5 of a 3-element array should never terminate the process.

```rust
fn execute_index_access(&mut self) -> Result<(), RuntimeError> {
    let index = self.pop()?;
    let collection = self.pop()?;

    match (&collection, &index) {
        (Value::List(items), Value::Int(i)) => {
            let idx = if *i < 0 {
                // Negative indexing: -1 = last element
                let positive = items.len() as i64 + i;
                if positive < 0 {
                    return Err(RuntimeError::new(
                        "IndexOutOfBounds",
                        &format!(
                            "Index {} is out of bounds for list of length {}",
                            i,
                            items.len()
                        ),
                        self.current_span(),
                    ));
                }
                positive as usize
            } else {
                *i as usize
            };

            if idx >= items.len() {
                return Err(RuntimeError::new(
                    "IndexOutOfBounds",
                    &format!(
                        "Index {} is out of bounds for list of length {}",
                        i,
                        items.len()
                    ),
                    self.current_span(),
                ));
            }

            self.push(items[idx].clone());
        }
        _ => {
            self.push(Value::None);
        }
    }
    Ok(())
}
```

We also added negative indexing -- list[-1] returns the last element, list[-2] the second-to-last -- which is both a convenience feature and a crash prevention measure. Before this change, developers would write list[list.len - 1] and occasionally get the arithmetic wrong.
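The normalization rule can be isolated into a few lines. This is a sketch of the same logic as a standalone function (names are illustrative, not FLIN's internals): negative indices count from the end, and anything still out of range is reported rather than dereferenced.

```rust
// Hypothetical standalone version of the bounds rule: returns None for any
// index that would be out of bounds, instead of panicking.
fn normalize_index(i: i64, len: usize) -> Option<usize> {
    let idx = if i < 0 { len as i64 + i } else { i };
    if idx >= 0 && (idx as usize) < len {
        Some(idx as usize)
    } else {
        None
    }
}
```

With a 3-element list, `normalize_index(-1, 3)` maps to the last element while `normalize_index(3, 3)` and `normalize_index(-4, 3)` are both rejected.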

Stack Overflow Protection

Recursive FLIN code can blow the call stack. A function that calls itself without a base case will eventually exhaust the thread's stack space, causing a segfault that bypasses Rust's panic handler entirely. This is one of the few crash vectors that catch_unwind cannot catch.

Our solution was a configurable call depth limit enforced by the VM:

```rust
const DEFAULT_MAX_CALL_DEPTH: usize = 512;

fn execute_call(&mut self, function_id: usize) -> Result<(), RuntimeError> {
    if self.call_stack.len() >= self.max_call_depth {
        return Err(RuntimeError::new(
            "StackOverflow",
            &format!(
                "Maximum call depth ({}) exceeded. \
                 Check for infinite recursion.",
                self.max_call_depth
            ),
            self.current_span(),
        ));
    }

    self.call_stack.push(CallFrame {
        function_id,
        return_ip: self.ip,
        base_pointer: self.stack.len(),
    });

    self.ip = self.functions[function_id].entry_point;
    Ok(())
}
```

The default limit of 512 nested calls is generous enough for any reasonable recursion (tree traversals, recursive parsers, mathematical computations) while catching runaway recursion before it crashes the process. The limit is configurable in flin.config for applications that genuinely need deep recursion.
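The same pattern works in any recursive code: thread the depth through the recursion and fail before the native stack does. A minimal sketch (illustrative, not FLIN's VM code), using naive Fibonacci as the recursive workload:

```rust
// Hypothetical depth-limited recursion: the depth parameter plays the role
// of the VM's call_stack.len(), and the limit is checked on every call.
fn fib(n: u64, depth: usize, max_depth: usize) -> Result<u64, String> {
    if depth >= max_depth {
        return Err(format!("Maximum call depth ({}) exceeded", max_depth));
    }
    Ok(match n {
        0 | 1 => n,
        _ => fib(n - 1, depth + 1, max_depth)? + fib(n - 2, depth + 1, max_depth)?,
    })
}
```

A call that fits under the limit succeeds; one that recurses past it gets a catchable error instead of a segfault.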

Memory Budget Enforcement

An unbounded allocation is a slow-motion crash. A FLIN program that creates entities in an infinite loop will eventually exhaust available memory, at which point the operating system kills the process. We needed a mechanism to detect and prevent runaway allocations.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct MemoryBudget {
    allocated: AtomicUsize,
    limit: usize,
}

impl MemoryBudget {
    pub fn new(limit_mb: usize) -> Self {
        Self {
            allocated: AtomicUsize::new(0),
            limit: limit_mb * 1024 * 1024,
        }
    }

    pub fn allocate(&self, bytes: usize) -> Result<(), RuntimeError> {
        let current = self.allocated.fetch_add(bytes, Ordering::Relaxed);
        if current + bytes > self.limit {
            // Roll back the reservation before reporting failure
            self.allocated.fetch_sub(bytes, Ordering::Relaxed);
            return Err(RuntimeError::new(
                "MemoryLimitExceeded",
                &format!(
                    "Memory budget exhausted ({} MB limit). \
                     Current usage: {} MB.",
                    self.limit / (1024 * 1024),
                    current / (1024 * 1024),
                ),
                Span::default(),
            ));
        }
        Ok(())
    }

    pub fn deallocate(&self, bytes: usize) {
        self.allocated.fetch_sub(bytes, Ordering::Relaxed);
    }
}
```

Every allocation that passes through the VM -- creating entities, building lists, concatenating strings, loading file contents -- checks against the memory budget. When the budget is exhausted, the operation fails with a catchable error instead of an OOM kill.

The default budget is 256 MB per VM instance. For the HTTP server, each request handler gets its own budget, preventing a single malicious or buggy request from consuming all available memory.

Graceful HTTP Error Recovery

Before hardening, a malformed HTTP request could cause the parser to panic. A request with an invalid Content-Length header, a body exceeding the maximum size, or malformed chunked transfer encoding would crash the connection handler.

We rewrote the HTTP parser to return Result types at every stage:

```
// Before hardening: crash on malformed request
// After hardening: return 400 Bad Request

route POST "/api/data" {
    // Body parsing is now safe -- returns error on invalid JSON
    data = body

    // If body is not valid JSON, the route handler never executes.
    // The runtime automatically returns:
    //   HTTP 400 { "error": "Invalid JSON in request body" }

    save Data { content: data.content }

    { success: true }
}
```

From the FLIN developer's perspective, nothing changed. The body variable still contains the parsed request body. But behind the scenes, every parsing step is wrapped in error handling that converts failures to HTTP 400 responses instead of process crashes.
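The shape of those parsing stages is simple to sketch. The function below is illustrative, not FLIN's actual parser: one hardened stage (validating a `Content-Length` header) that maps any failure to a 400 status and message instead of panicking.

```rust
// Hypothetical hardened parsing stage: a malformed Content-Length value
// becomes an (HTTP status, message) error instead of a process crash.
fn parse_content_length(value: &str) -> Result<usize, (u16, &'static str)> {
    value
        .trim()
        .parse::<usize>()
        .map_err(|_| (400, "Invalid Content-Length header"))
}
```

Chaining every stage with `?` means the first failure short-circuits into a 400 response and the connection handler keeps running.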

Connection Draining

The HTTP server previously accepted connections without limit. Under load, this could exhaust file descriptors, causing accept() to fail with EMFILE and bringing down the entire server. We added connection limits with graceful rejection:

When the connection count reaches the configured maximum (default: 1,024), new connections receive an HTTP 503 Service Unavailable response and are immediately closed. Existing connections continue to be served. The connection count is tracked atomically and decremented when connections close.

This simple change -- rejecting excess connections instead of trying to serve them all -- prevents cascading failures under load.
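The admission check described above can be sketched in a few lines, assuming an atomic connection counter (names and the exact memory ordering are illustrative; FLIN's server internals may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_CONNECTIONS: usize = 1024;

// Reserve a slot for a new connection; roll back and reject (the caller
// then sends 503 and closes) if the limit is already reached.
fn try_admit(active: &AtomicUsize) -> bool {
    if active.fetch_add(1, Ordering::AcqRel) >= MAX_CONNECTIONS {
        active.fetch_sub(1, Ordering::AcqRel);
        return false;
    }
    true
}

// Called when a connection closes, releasing its slot.
fn on_close(active: &AtomicUsize) {
    active.fetch_sub(1, Ordering::AcqRel);
}
```

Reserving with `fetch_add` before the comparison keeps the check race-free: two connections arriving at once can never both slip under the limit.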

The Results

After Phase 1, we re-ran the crash audit against all 47 identified vectors:

  • Tier 1 (guaranteed crashes): 12 of 12 fixed. Division by zero, array bounds, stack overflow, and integer overflow all produce catchable errors.
  • Tier 2 (probabilistic crashes): 8 of 11 fixed. Memory budget enforcement, connection limits, and file descriptor tracking addressed most cases. Three WAL-related edge cases were deferred to Phase 2.
  • Tier 3 (edge cases): 15 of 24 fixed. Unicode boundary checking, HTTP parser hardening, and foreign key validation covered the most likely scenarios. Remaining cases were documented for Phase 2.

Total: 35 of 47 crash vectors eliminated. The remaining 12 were edge cases that required the reliability work in Phase 2.

The test suite grew to accommodate the new error handling paths. Every crash vector that was fixed got a corresponding test that verifies the error is returned gracefully rather than crashing the process. These tests are intentionally adversarial -- they feed the system malformed input, trigger resource exhaustion, and exercise error recovery paths.

What Stability Means

Stability is not the absence of errors. It is the presence of graceful error handling at every boundary. A stable system can encounter division by zero, out-of-memory conditions, malformed input, and resource exhaustion, and continue operating. It logs the error, returns a meaningful response to the caller, and remains available for the next request.

Phase 1 established this property for FLIN. The runtime would no longer crash. It might return errors -- and those errors would be accurate, descriptive, and actionable -- but it would keep running.

Phase 2 would address the next question: when errors do occur, does the system's state remain consistent?

---

This is Part 181 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:
- Previous arc: [171-180] Standard Library and Ecosystem
- [181] Production Hardening Phase 1: Stability (you are here)
- [182] Production Hardening Phase 2: Reliability
