
Production Hardening Phase 1

Phase 1 of production hardening: crash prevention, graceful error handling, and stability.

Thales & Claude | March 25, 2026 · 10 min read

There is a moment in every software project where the question shifts from "does it work?" to "does it keep working?" The features are built. The tests pass. The demo is impressive. But the gap between a working demo and a production system is measured not in features but in failures -- specifically, in how the system behaves when things go wrong.

Session 244 marked the beginning of FLIN's production hardening arc. We had over 3,200 passing tests, a full-featured compiler and runtime, and an HTTP server capable of serving real applications. What we did not have was confidence that the system would survive contact with the unpredictable chaos of production traffic. Phase 1 focused on the most fundamental requirement: stability. The system must not crash.

The Crash Audit

Before writing a single line of hardening code, we conducted a systematic audit of every code path that could cause the FLIN runtime to panic. In Rust, a panic is an unrecoverable error that terminates the current thread -- and in a single-threaded HTTP server, that means the entire application goes down.

We categorized crash vectors into three tiers:

Tier 1 -- Guaranteed crashes: Division by zero, out-of-bounds array access, stack overflow from recursive FLIN code, and integer overflow on arithmetic operations. These were deterministic: given the right input, the crash was certain.

Tier 2 -- Probabilistic crashes: Memory exhaustion from unbounded allocations (a FLIN program allocating millions of entities in a loop), file descriptor exhaustion from leaked HTTP connections, and deadlocks in the WAL (write-ahead log) checkpoint process.

Tier 3 -- Edge-case crashes: Unicode boundary errors in string slicing, malformed HTTP requests triggering parser panics, and race conditions during concurrent entity saves with foreign key constraints.

The audit identified 47 distinct crash vectors. Each one needed a fix.

Wrapping Panics at the VM Boundary

The first and most impactful change was wrapping every VM execution in a panic boundary. When the FLIN runtime executes user code -- route handlers, template rendering, cron jobs -- that execution must never propagate a Rust panic to the HTTP server's accept loop.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

pub fn execute_route_handler(
    vm: &mut VM,
    handler: &CompiledHandler,
    request: &Request,
) -> Result<Response> {
    let result = catch_unwind(AssertUnwindSafe(|| {
        vm.execute_handler(handler, request)
    }));

    match result {
        Ok(Ok(response)) => Ok(response),
        Ok(Err(runtime_err)) => {
            // Normal FLIN error -- render error page
            log::warn!("Route handler error: {}", runtime_err);
            Ok(render_error_page(500, &runtime_err.to_string()))
        }
        Err(panic_info) => {
            // Rust panic -- this is a bug in FLIN itself
            let msg = extract_panic_message(&panic_info);
            log::error!("PANIC in route handler: {}", msg);
            metrics::increment("vm.panics");
            Ok(render_error_page(500, "Internal server error"))
        }
    }
}
```
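The `extract_panic_message` helper referenced above is not shown in the article; a minimal sketch is below, assuming the standard shape of Rust panic payloads. A payload from `catch_unwind` is a `Box<dyn Any + Send>`, and in practice it is almost always a `&str` (from `panic!("literal")`) or a `String` (from `panic!("{}", x)`):

```rust
use std::any::Any;

// Illustrative helper (name and exact signature assumed, not FLIN's actual
// code): pull a human-readable message out of a caught panic payload.
fn extract_panic_message(panic_info: &(dyn Any + Send)) -> String {
    if let Some(s) = panic_info.downcast_ref::<&str>() {
        (*s).to_string()
    } else if let Some(s) = panic_info.downcast_ref::<String>() {
        s.clone()
    } else {
        "unknown panic payload".to_string()
    }
}
```

Any other payload type falls through to a generic message, so the logging path itself can never panic on an unexpected payload.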

This pattern -- catch the panic, log it, return a 500 -- meant that even if the FLIN compiler or runtime had a bug that caused a Rust panic, the HTTP server would continue serving other requests. The crash would be logged, the user would see an error page, and the system would remain operational.

We applied this boundary at every entry point: route handlers, WebSocket message processors, cron job executors, SSE stream generators, and the admin console API. Every place where user-written FLIN code is executed now has a panic boundary.

Division by Zero and Arithmetic Safety

FLIN supports integer and float arithmetic. In most languages, dividing an integer by zero causes a crash or an exception. We chose a different approach: division by zero in FLIN returns a runtime error that can be caught by the error handling system, but never crashes the VM.

```rust
fn execute_divide(&mut self) -> Result<(), RuntimeError> {
    let right = self.pop()?;
    let left = self.pop()?;

    match (&left, &right) {
        (Value::Int(a), Value::Int(b)) => {
            if *b == 0 {
                return Err(RuntimeError::new(
                    "DivisionByZero",
                    "Cannot divide by zero",
                    self.current_span(),
                ));
            }
            self.push(Value::Int(a / b));
        }
        (Value::Float(a), Value::Float(b)) => {
            // IEEE 754: division by zero yields infinity, not an error
            self.push(Value::Float(a / b));
        }
        _ => {
            return Err(RuntimeError::type_error(
                "divide", &left, &right, self.current_span(),
            ));
        }
    }
    Ok(())
}
```

For integers, division by zero is a catchable DivisionByZero error. For floats, we follow IEEE 754 and return infinity -- consistent with how every other numeric computing system works. This means a FLIN developer can write 10 / 0 in a route handler and the worst outcome is an error page, not a dead server.
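The same guard can be expressed outside the VM with Rust's checked arithmetic. The sketch below is illustrative, not FLIN's actual code; `checked_div` returns `None` both for division by zero and for the one signed-overflow case (`i64::MIN / -1`) mentioned in the Tier 1 audit:

```rust
// Hypothetical standalone helper: checked_div covers both b == 0 and the
// i64::MIN / -1 overflow, the two ways integer division can panic in Rust.
fn safe_div(a: i64, b: i64) -> Result<i64, String> {
    a.checked_div(b)
        .ok_or_else(|| "DivisionByZero or overflow".to_string())
}
```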

Array Bounds Checking

Out-of-bounds array access was the second most common crash vector. FLIN arrays are zero-indexed, and accessing index 5 of a 3-element array should never terminate the process.

```rust
fn execute_index_access(&mut self) -> Result<(), RuntimeError> {
    let index = self.pop()?;
    let collection = self.pop()?;

    match (&collection, &index) {
        (Value::List(items), Value::Int(i)) => {
            let idx = if *i < 0 {
                // Negative indexing: -1 = last element
                let positive = items.len() as i64 + i;
                if positive < 0 {
                    return Err(RuntimeError::new(
                        "IndexOutOfBounds",
                        &format!(
                            "Index {} is out of bounds for list of length {}",
                            i,
                            items.len()
                        ),
                        self.current_span(),
                    ));
                }
                positive as usize
            } else {
                *i as usize
            };

            if idx >= items.len() {
                return Err(RuntimeError::new(
                    "IndexOutOfBounds",
                    &format!(
                        "Index {} is out of bounds for list of length {}",
                        i,
                        items.len()
                    ),
                    self.current_span(),
                ));
            }

            self.push(items[idx].clone());
        }
        _ => {
            self.push(Value::None);
        }
    }
    Ok(())
}
```

We also added negative indexing -- list[-1] returns the last element, list[-2] the second-to-last -- which is both a convenience feature and a crash prevention measure. Before this change, developers would write list[list.len - 1] and occasionally get the arithmetic wrong.
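The normalization rule can be isolated into a few lines. This is a sketch of the same logic as a standalone function (names are illustrative, not FLIN's internals): negative indices count from the end, and anything still out of range is reported rather than dereferenced.

```rust
// Hypothetical standalone version of the bounds rule: returns None for any
// index that would be out of bounds, instead of panicking.
fn normalize_index(i: i64, len: usize) -> Option<usize> {
    let idx = if i < 0 { len as i64 + i } else { i };
    if idx >= 0 && (idx as usize) < len {
        Some(idx as usize)
    } else {
        None
    }
}
```

With a 3-element list, `normalize_index(-1, 3)` maps to the last element while `normalize_index(3, 3)` and `normalize_index(-4, 3)` are both rejected.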

Stack Overflow Protection

Recursive FLIN code can blow the call stack. A function that calls itself without a base case will eventually exhaust the thread's stack space, causing a segfault that bypasses Rust's panic handler entirely. This is one of the few crash vectors that catch_unwind cannot catch.

Our solution was a configurable call depth limit enforced by the VM:

```rust
const DEFAULT_MAX_CALL_DEPTH: usize = 512;

fn execute_call(&mut self, function_id: usize) -> Result<(), RuntimeError> {
    if self.call_stack.len() >= self.max_call_depth {
        return Err(RuntimeError::new(
            "StackOverflow",
            &format!(
                "Maximum call depth ({}) exceeded. \
                 Check for infinite recursion.",
                self.max_call_depth
            ),
            self.current_span(),
        ));
    }

    self.call_stack.push(CallFrame {
        function_id,
        return_ip: self.ip,
        base_pointer: self.stack.len(),
    });

    self.ip = self.functions[function_id].entry_point;
    Ok(())
}
```

The default limit of 512 nested calls is generous enough for any reasonable recursion (tree traversals, recursive parsers, mathematical computations) while catching runaway recursion before it crashes the process. The limit is configurable in flin.config for applications that genuinely need deep recursion.
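The same pattern works in any recursive code: thread the depth through the recursion and fail before the native stack does. A minimal sketch (illustrative, not FLIN's VM code), using naive Fibonacci as the recursive workload:

```rust
// Hypothetical depth-limited recursion: the depth parameter plays the role
// of the VM's call_stack.len(), and the limit is checked on every call.
fn fib(n: u64, depth: usize, max_depth: usize) -> Result<u64, String> {
    if depth >= max_depth {
        return Err(format!("Maximum call depth ({}) exceeded", max_depth));
    }
    Ok(match n {
        0 | 1 => n,
        _ => fib(n - 1, depth + 1, max_depth)? + fib(n - 2, depth + 1, max_depth)?,
    })
}
```

A call that fits under the limit succeeds; one that recurses past it gets a catchable error instead of a segfault.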

Memory Budget Enforcement

An unbounded allocation is a slow-motion crash. A FLIN program that creates entities in an infinite loop will eventually exhaust available memory, at which point the operating system kills the process. We needed a mechanism to detect and prevent runaway allocations.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

pub struct MemoryBudget {
    allocated: AtomicUsize,
    limit: usize,
}

impl MemoryBudget {
    pub fn new(limit_mb: usize) -> Self {
        Self {
            allocated: AtomicUsize::new(0),
            limit: limit_mb * 1024 * 1024,
        }
    }

    pub fn allocate(&self, bytes: usize) -> Result<(), RuntimeError> {
        let current = self.allocated.fetch_add(bytes, Ordering::Relaxed);
        if current + bytes > self.limit {
            // Roll back the reservation before reporting failure
            self.allocated.fetch_sub(bytes, Ordering::Relaxed);
            return Err(RuntimeError::new(
                "MemoryLimitExceeded",
                &format!(
                    "Memory budget exhausted ({} MB limit). \
                     Current usage: {} MB.",
                    self.limit / (1024 * 1024),
                    current / (1024 * 1024),
                ),
                Span::default(),
            ));
        }
        Ok(())
    }

    pub fn deallocate(&self, bytes: usize) {
        self.allocated.fetch_sub(bytes, Ordering::Relaxed);
    }
}
```

Every allocation that passes through the VM -- creating entities, building lists, concatenating strings, loading file contents -- checks against the memory budget. When the budget is exhausted, the operation fails with a catchable error instead of an OOM kill.

The default budget is 256 MB per VM instance. For the HTTP server, each request handler gets its own budget, preventing a single malicious or buggy request from consuming all available memory.

Graceful HTTP Error Recovery

Before hardening, a malformed HTTP request could cause the parser to panic. A request with an invalid Content-Length header, a body exceeding the maximum size, or malformed chunked transfer encoding would crash the connection handler.

We rewrote the HTTP parser to return Result types at every stage:

```
// Before hardening: crash on malformed request
// After hardening: return 400 Bad Request

route POST "/api/data" {
    // Body parsing is now safe -- returns error on invalid JSON
    data = body

    // If body is not valid JSON, the route handler never executes.
    // The runtime automatically returns:
    //   HTTP 400 { "error": "Invalid JSON in request body" }

    save Data { content: data.content }

    { success: true }
}
```

From the FLIN developer's perspective, nothing changed. The body variable still contains the parsed request body. But behind the scenes, every parsing step is wrapped in error handling that converts failures to HTTP 400 responses instead of process crashes.
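The shape of those parsing stages is simple to sketch. The function below is illustrative, not FLIN's actual parser: one hardened stage (validating a `Content-Length` header) that maps any failure to a 400 status and message instead of panicking.

```rust
// Hypothetical hardened parsing stage: a malformed Content-Length value
// becomes an (HTTP status, message) error instead of a process crash.
fn parse_content_length(value: &str) -> Result<usize, (u16, &'static str)> {
    value
        .trim()
        .parse::<usize>()
        .map_err(|_| (400, "Invalid Content-Length header"))
}
```

Chaining every stage with `?` means the first failure short-circuits into a 400 response and the connection handler keeps running.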

Connection Draining

The HTTP server previously accepted connections without limit. Under load, this could exhaust file descriptors, causing accept() to fail with EMFILE and bringing down the entire server. We added connection limits with graceful rejection:

When the connection count reaches the configured maximum (default: 1,024), new connections receive an HTTP 503 Service Unavailable response and are immediately closed. Existing connections continue to be served. The connection count is tracked atomically and decremented when connections close.

This simple change -- rejecting excess connections instead of trying to serve them all -- prevents cascading failures under load.
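The admission check described above can be sketched in a few lines, assuming an atomic connection counter (names and the exact memory ordering are illustrative; FLIN's server internals may differ):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const MAX_CONNECTIONS: usize = 1024;

// Reserve a slot for a new connection; roll back and reject (the caller
// then sends 503 and closes) if the limit is already reached.
fn try_admit(active: &AtomicUsize) -> bool {
    if active.fetch_add(1, Ordering::AcqRel) >= MAX_CONNECTIONS {
        active.fetch_sub(1, Ordering::AcqRel);
        return false;
    }
    true
}

// Called when a connection closes, releasing its slot.
fn on_close(active: &AtomicUsize) {
    active.fetch_sub(1, Ordering::AcqRel);
}
```

Reserving with `fetch_add` before the comparison keeps the check race-free: two connections arriving at once can never both slip under the limit.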

The Results

After Phase 1, we re-ran the crash audit against all 47 identified vectors:

  • Tier 1 (guaranteed crashes): 12 of 12 fixed. Division by zero, array bounds, stack overflow, and integer overflow all produce catchable errors.
  • Tier 2 (probabilistic crashes): 8 of 11 fixed. Memory budget enforcement, connection limits, and file descriptor tracking addressed most cases. Three WAL-related edge cases were deferred to Phase 2.
  • Tier 3 (edge cases): 15 of 24 fixed. Unicode boundary checking, HTTP parser hardening, and foreign key validation covered the most likely scenarios. Remaining cases were documented for Phase 2.

Total: 35 of 47 crash vectors eliminated. The remaining 12 were edge cases that required the reliability work in Phase 2.

The test suite grew to accommodate the new error handling paths. Every crash vector that was fixed got a corresponding test that verifies the error is returned gracefully rather than crashing the process. These tests are intentionally adversarial -- they feed the system malformed input, trigger resource exhaustion, and exercise error recovery paths.

What Stability Means

Stability is not the absence of errors. It is the presence of graceful error handling at every boundary. A stable system can encounter division by zero, out-of-memory conditions, malformed input, and resource exhaustion, and continue operating. It logs the error, returns a meaningful response to the caller, and remains available for the next request.

Phase 1 established this property for FLIN. The runtime would no longer crash. It might return errors -- and those errors would be accurate, descriptive, and actionable -- but it would keep running.

Phase 2 would address the next question: when errors do occur, does the system's state remain consistent?

---

This is Part 181 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:
- Previous arc: [171-180] Standard Library and Ecosystem
- [181] Production Hardening Phase 1: Stability (you are here)
- [182] Production Hardening Phase 2: Reliability
