
Testing, Benchmarks, and Fuzzing

The built-in testing framework, benchmark runner, and fuzzer for FLIN applications.

Thales & Claude | March 25, 2026 | 9 min read
Tags: flin, testing, benchmarks, fuzzing, quality

A compiler that produces incorrect code is worse than a compiler that produces no code at all. Incorrect code runs, appears to work, and fails at the worst possible moment -- in production, with real users, with real data. The only defense against this is testing at every level: unit tests for individual components, integration tests for the full pipeline, benchmarks for performance regression, and fuzz testing for the inputs you never thought to try.

Session 022 was FLIN's quality reckoning. We went from 717 tests to 891 in a single session, adding end-to-end integration tests, performance benchmarks, edge case tests, and a fuzz testing infrastructure. The compiler came out the other side measurably more reliable.

End-to-End Integration Tests

The integration tests in tests/integration_e2e.rs cover the full compilation pipeline: source code goes in, the lexer tokenizes it, the parser builds an AST, the type checker validates it, the code generator emits bytecode, and the VM executes it. If any stage produces incorrect output, the test fails.

Seventy-six tests cover twelve categories:

// Variables: Basic declaration and initialization
#[test]
fn test_e2e_variable_declaration() {
    let result = compile_and_run("x = 42");
    assert_eq!(result.get_global("x"), Value::Int(42));
}

// Arithmetic: All numeric operations
#[test]
fn test_e2e_arithmetic_operations() {
    let result = compile_and_run("x = 10 + 5 * 2");
    assert_eq!(result.get_global("x"), Value::Int(20));
}

// Typed declarations: Explicit type annotations
#[test]
fn test_e2e_typed_variable() {
    let result = compile_and_run("x: int = 42");
    assert_eq!(result.get_global("x"), Value::Int(42));
}

// Lambdas: Function definitions and calls
#[test]
fn test_e2e_lambda_call() {
    let result = compile_and_run("add = (a, b) => a + b\nx = add(3, 5)");
    assert_eq!(result.get_global("x"), Value::Int(8));
}

The test helper compile_and_run encapsulates the entire pipeline. It takes a source string, runs it through all compiler stages, and returns a result object that allows inspection of global variables, return values, and output. This helper is the workhorse of FLIN's test suite.
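The shape of such a helper can be sketched as a chain of fallible stages. Everything below is illustrative: the stage functions, the `Value` enum, and the `RunResult` type are stand-ins for FLIN's real internals, and the sketch returns a `Result` so any stage failure surfaces immediately.

```rust
// Hypothetical sketch of a compile_and_run-style helper. The stages and
// types are stand-ins, not FLIN's actual API; the point is the `?`
// chaining: an error in any stage fails the whole run.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value { Int(i64), None }

struct RunResult { globals: HashMap<String, Value> }

impl RunResult {
    fn get_global(&self, name: &str) -> Value {
        self.globals.get(name).cloned().unwrap_or(Value::None)
    }
}

fn compile_and_run(source: &str) -> Result<RunResult, String> {
    let tokens = scan(source)?;    // lexer
    let ast = parse(&tokens)?;     // parser
    check(&ast)?;                  // type checker
    let bytecode = codegen(&ast)?; // code generator
    execute(&bytecode)             // VM
}

// Stub stages so the sketch compiles; each returns Err on bad input.
fn scan(src: &str) -> Result<Vec<String>, String> {
    if src.is_empty() { return Err("empty source".into()); }
    Ok(src.split_whitespace().map(String::from).collect())
}
fn parse(tokens: &[String]) -> Result<Vec<String>, String> { Ok(tokens.to_vec()) }
fn check(_ast: &[String]) -> Result<(), String> { Ok(()) }
fn codegen(ast: &[String]) -> Result<Vec<String>, String> { Ok(ast.to_vec()) }

fn execute(bytecode: &[String]) -> Result<RunResult, String> {
    // Toy interpreter: handles only "name = <int>" assignments.
    let mut globals = HashMap::new();
    if let [name, eq, value] = bytecode {
        if eq == "=" {
            let n: i64 = value.parse().map_err(|_| "bad int".to_string())?;
            globals.insert(name.clone(), Value::Int(n));
        }
    }
    Ok(RunResult { globals })
}
```

A test then reads exactly like the examples above: run a source string, inspect a global, assert on its value.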

The categories are deliberate. Each covers a different language feature that touches different parts of the compiler:

  • Variables test the scope chain and variable binding.
  • Arithmetic tests operator precedence and numeric operations.
  • Typed declarations test the type checker's annotation handling.
  • Lambdas test closure creation and function calls.
  • Match expressions test pattern matching and branch selection.
  • Lists test heap allocation and indexing.
  • Entities test the database system.
  • Control flow tests if/else branching.
  • String interpolation tests the template literal system.
  • Boolean operations test short-circuit evaluation.
  • Comparisons test relational operators.
  • Increment/decrement test mutation operators.

Edge Case Tests

Edge case tests are the tests that nobody wants to write and everybody needs. They cover the boundaries where things break: empty inputs, malformed syntax, type mismatches, runtime errors, and extreme inputs.

Seventy-four edge case tests live in tests/edge_cases.rs, organized by compiler stage:

Lexer edge cases (27 tests) probe the scanner with inputs designed to confuse it:

#[test]
fn test_edge_empty_source() {
    let tokens = scan("").unwrap();
    assert_eq!(tokens.len(), 1); // Just EOF
}

#[test]
fn test_edge_unterminated_string() {
    let result = scan("\"hello");
    assert!(result.is_err());
}

#[test]
fn test_edge_very_long_identifier() {
    let name = "a".repeat(1000);
    let tokens = scan(&name).unwrap();
    assert_eq!(tokens[0].lexeme(), name);
}

#[test]
fn test_edge_unicode_in_string() {
    let tokens = scan("\"café\"").unwrap();
    assert!(tokens[0].is_string());
}

Parser edge cases (18 tests) verify that the parser fails gracefully on malformed input rather than panicking:

#[test]
fn test_edge_empty_block() {
    let result = parse("if true { }");
    assert!(result.is_ok()); // Empty blocks are valid
}

#[test]
fn test_edge_unbalanced_parens() {
    let result = parse("x = (1 + 2");
    assert!(result.is_err());
}

#[test]
fn test_edge_deep_nesting() {
    // 50 levels of nesting
    let source = "if true { ".repeat(50) + &"}".repeat(50);
    let result = parse(&source);
    assert!(result.is_ok());
}

Runtime edge cases (14 tests) probe the VM's behavior at the limits:

  • Division by zero
  • Stack overflow via deep recursion
  • List index out of bounds
  • Field access on non-objects
  • Invalid opcodes

These tests are where we discover real bugs. Session 022 uncovered two: the VM panicked with "index out of bounds" when accessing a field on a non-object (it should return a TypeError), and accessing a non-existent field on a valid object caused a similar panic. Both are marked for fixing in subsequent sessions, but the tests ensure we know about them and can track the fix.
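The shape of the eventual fix is straightforward to sketch. The `Value`, `RuntimeError`, and `get_field` names below are illustrative stand-ins, not FLIN's actual VM internals: the point is that both failure modes become errors the VM can report, never panics.

```rust
// Hypothetical sketch of a panic-free GetField: both a non-object
// receiver and a missing field return a typed error instead of
// indexing blindly. Names are illustrative, not FLIN's real API.
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Object(HashMap<String, Value>),
}

#[derive(Debug, PartialEq)]
enum RuntimeError {
    TypeError(String),
    UnknownField(String),
}

fn get_field(value: &Value, field: &str) -> Result<Value, RuntimeError> {
    match value {
        // Missing field on a valid object: an error, not a panic.
        Value::Object(fields) => fields
            .get(field)
            .cloned()
            .ok_or_else(|| RuntimeError::UnknownField(field.to_string())),
        // Field access on a non-object: a TypeError, not a panic.
        other => Err(RuntimeError::TypeError(format!(
            "cannot access field '{}' on {:?}",
            field, other
        ))),
    }
}
```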

Performance Benchmarks

Twenty-four benchmarks in benches/benchmarks.rs measure the performance of every compiler stage:

Lexer (complex):     202 us avg (4,950 ops/sec)
Parser (todo):       84 us avg  (11,905 ops/sec)
VM (arithmetic):     4 us avg   (250,000 ops/sec)
Counter full:        40 us avg  (25,000 ops/sec)

The benchmarks use a Rust benchmarking harness (Criterion, or the nightly-only test::bench) to measure execution time with statistical rigor -- multiple iterations, warm-up periods, and outlier elimination.
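That measurement discipline can be illustrated with a hand-rolled timing loop: warm up first, then average over many timed iterations. This is only a sketch of the shape; Criterion adds sampling, statistical analysis, and outlier rejection on top of it.

```rust
// Minimal sketch of benchmark measurement discipline: warm-up
// iterations first, then timed iterations averaged together.
// A real harness (Criterion) does this far more rigorously.
use std::time::{Duration, Instant};

fn bench<F: FnMut()>(mut work: F, warmup: u32, iters: u32) -> Duration {
    // Warm-up: populate caches and let branch predictors settle.
    for _ in 0..warmup {
        work();
    }
    // Timed runs, averaged over all iterations.
    let start = Instant::now();
    for _ in 0..iters {
        work();
    }
    start.elapsed() / iters
}
```

Usage looks like `bench(|| { std::hint::black_box((0..1000u64).sum::<u64>()); }, 100, 1000)`, where `black_box` keeps the optimizer from deleting the workload.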

The benchmark categories cover:

Lexer performance -- How fast can we tokenize source code? Three tests measure simple inputs, complex inputs with many token types, and large inputs with thousands of characters. The lexer processes over 100,000 tokens per second, which is fast enough that it is never the bottleneck.

Parser performance -- How fast can we build an AST? Three tests measure a simple counter, a complex todo application, and arithmetic expressions. The parser handles nearly 12,000 parse operations per second.

Full compilation -- How fast is the complete pipeline from source to bytecode? Three tests measure simple, complex, and large programs. A typical application compiles in under 100 microseconds.

VM execution -- How fast can the VM run bytecode? Four tests measure arithmetic operations, list operations, string concatenation, and loops. The VM executes 250,000 arithmetic operations per second.

End-to-end -- How fast is the complete cycle from source to execution result? Two tests measure the counter application through the full pipeline. The entire round trip takes 40 microseconds.

Memory -- How much memory do heap allocations and map creation consume? Two tests measure the allocation overhead.

Throughput -- An assertion test that verifies the lexer processes at least 100,000 tokens per second. If this test fails, something has regressed.

These benchmarks serve as a regression guard. Every change to the compiler can be measured against the baseline. If a refactoring makes the parser 10% slower, the benchmark catches it before the change is merged.
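The throughput guard described above can be sketched as a plain assertion test: time a fixed workload, compute ops/sec, and fail if it drops below a floor. The whitespace-splitting "tokenizer" and the threshold here are stand-ins for FLIN's real lexer and its 100,000 tokens/sec floor.

```rust
// Hedged sketch of a throughput regression guard. The tokenizer is a
// whitespace-splitting stand-in for FLIN's real lexer; the threshold
// passed in by the caller is illustrative.
use std::time::Instant;

fn assert_min_throughput(min_ops_per_sec: f64) {
    let source = "x = 1 + 2 * 3 ".repeat(1_000); // ~7,000 tokens
    let start = Instant::now();
    let tokens: usize = source.split_whitespace().count();
    let elapsed = start.elapsed().as_secs_f64();
    let ops_per_sec = tokens as f64 / elapsed;
    assert!(
        ops_per_sec >= min_ops_per_sec,
        "throughput regressed: {:.0} ops/sec < {:.0}",
        ops_per_sec, min_ops_per_sec
    );
}
```

Run as an ordinary `#[test]`, this turns a performance regression into a red CI build rather than a silent slowdown.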

Fuzz Testing

Fuzz testing feeds random or semi-random input to the compiler and checks for crashes. It does not verify correctness -- it verifies that the compiler never panics, never segfaults, and never enters an infinite loop, regardless of the input.

FLIN's fuzz testing uses cargo-fuzz, the standard Rust fuzzing tool. Three fuzz targets cover different entry points:

// fuzz/fuzz_targets/fuzz_lexer.rs
fuzz_target!(|data: &[u8]| {
    if let Ok(source) = std::str::from_utf8(data) {
        let _ = flin::Scanner::new(source).scan_tokens();
    }
});

// fuzz/fuzz_targets/fuzz_parser.rs
fuzz_target!(|data: &[u8]| {
    if let Ok(source) = std::str::from_utf8(data) {
        if let Ok(tokens) = flin::Scanner::new(source).scan_tokens() {
            let _ = flin::Parser::new(tokens).parse();
        }
    }
});

// fuzz/fuzz_targets/fuzz_compiler.rs
fuzz_target!(|data: &[u8]| {
    if let Ok(source) = std::str::from_utf8(data) {
        let _ = flin::compile(source, "");
    }
});

Each fuzzer builds on the previous one. The lexer fuzzer feeds arbitrary strings to the scanner. The parser fuzzer feeds valid token streams to the parser. The compiler fuzzer feeds arbitrary strings through the full pipeline. If any stage panics instead of returning an error, the fuzzer catches it and produces a minimal reproducing input.

Running the fuzzer is a continuous process:

cargo +nightly fuzz run fuzz_lexer -- -max_len=4096

The fuzzer generates millions of inputs per hour, mutating and recombining them using coverage-guided feedback. Over time, it explores more and more of the code's execution paths, finding edge cases that no human tester would think to write.
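The core invariant the fuzzer enforces -- arbitrary bytes in, no panic out -- can be demonstrated with a toy, non-coverage-guided loop. The xorshift generator and the `never_panics` harness below are illustrative stand-ins; cargo-fuzz replaces them with libFuzzer's mutation engine and a real fuzz target.

```rust
// Toy fuzz loop illustrating the invariant cargo-fuzz enforces:
// arbitrary bytes in, no panic out. Not coverage-guided; the
// generator and harness are stand-ins for libFuzzer and a target.
fn xorshift(state: &mut u64) -> u64 {
    *state ^= *state << 13;
    *state ^= *state >> 7;
    *state ^= *state << 17;
    *state
}

// Mirrors a fuzz target: accept any bytes, return normally.
// Returning an error is fine; panicking is a bug.
fn never_panics(data: &[u8]) {
    if let Ok(source) = std::str::from_utf8(data) {
        // Stand-in for flin::Scanner::new(source).scan_tokens():
        let _token_count = source.split_whitespace().count();
    }
}

fn fuzz(iterations: u32) -> u32 {
    let mut state = 0x9E37_79B9_7F4A_7C15u64; // nonzero seed
    for _ in 0..iterations {
        let len = (xorshift(&mut state) % 64) as usize;
        let data: Vec<u8> = (0..len).map(|_| xorshift(&mut state) as u8).collect();
        never_panics(&data); // any panic here fails the run
    }
    iterations
}
```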

Test Organization

The test suite is organized by purpose, not by module:

tests/
  cli_tests.rs         # CLI command tests (35 tests)
  integration_vm.rs    # VM integration tests (35 tests)
  integration_e2e.rs   # Full E2E tests (76 tests)
  edge_cases.rs        # Edge case tests (74 tests)

benches/
  benchmarks.rs        # Performance benchmarks (24 tests)

fuzz/
  Cargo.toml           # Fuzz crate
  fuzz_targets/
    fuzz_lexer.rs      # Lexer fuzzer
    fuzz_parser.rs     # Parser fuzzer
    fuzz_compiler.rs   # Full compiler fuzzer

This organization makes it easy to run subsets of the test suite. During development, you might run only the unit tests (cargo test --lib) for fast feedback. Before committing, you run the full suite (cargo test --all). In CI, you run everything including the benchmarks.

Bugs Found

Testing is not about confirming that things work. It is about finding out where they break. Session 022's edge case tests found three issues:

1. VM GetField panic on non-objects. Accessing a field on a value that is not an object causes a panic with "index out of bounds" instead of returning a RuntimeError::TypeError. This is a safety issue -- the VM should never panic on user input.

2. VM GetField panic on missing fields. Accessing a non-existent field on a valid object causes a similar panic. The VM should return an error or none, not crash.

3. Undefined globals return None. Accessing an undefined global variable returns Value::None rather than an error. This was confirmed as a deliberate design decision, but the test documents the behavior explicitly so that future developers understand the semantics.

These findings demonstrate the value of systematic testing. The first two bugs would have caused runtime crashes in production. Without edge case tests specifically designed to probe error paths, they would have remained hidden until a user stumbled into them.

The Quality Confidence Curve

At 891 tests, FLIN's test suite provides high confidence in the compiler's correctness. But confidence is not binary -- it is a curve that grows with each test added:

  • 0-100 tests: Basic confidence. Major features work.
  • 100-500 tests: Moderate confidence. Most code paths are covered.
  • 500-1000 tests: High confidence. Edge cases and error paths are tested.
  • 1000+ tests: Production confidence. Regressions are caught automatically.

The combination of unit tests, integration tests, edge case tests, benchmarks, and fuzz testing creates multiple overlapping safety nets. A bug that escapes the unit tests will be caught by integration tests. A crash that escapes integration tests will be caught by the fuzzer. A performance regression that escapes all functional tests will be caught by benchmarks.

This layered approach is how you build a compiler that developers trust with their production code.

---

This is Part 174 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:

  • [173] The .flinc Binary Format
  • [174] Testing, Benchmarks, and Fuzzing (you are here)
  • [175] Documentation Comments in FLIN
