
Code-Aware Chunking for RAG

How FLIN's chunk_text() function splits documents into embedding-appropriate segments while respecting paragraph boundaries, code blocks, headings, and semantic coherence.

Thales & Claude | March 25, 2026

Embedding models have a fixed context window. Most models accept 512 tokens (approximately 400 words). Some accept 8,192 tokens. None accept an entire 50-page document as input. To embed a long document, you must split it into chunks that fit the model's context window.

Naive chunking -- splitting every 500 characters regardless of content -- produces terrible embeddings. A chunk that starts in the middle of a sentence and ends in the middle of a code block has no coherent meaning. The embedding is a blurred average of two unrelated topics. Search queries that match either topic will score poorly because the embedding does not strongly represent either one.

FLIN's chunk_text() function is aware of document structure. It splits text at semantic boundaries: paragraph breaks, heading boundaries, code block delimiters, and sentence endings. The result is chunks that each represent a single coherent idea, producing focused embeddings that retrieve accurately.

The chunk_text() Function

chunks = chunk_text(text, {
    max_size: 500,        // Maximum characters per chunk
    overlap: 50,          // Characters of overlap between chunks
    strategy: "semantic"  // "fixed", "paragraph", "semantic", "code"
})

Each chunk contains:

chunk.text      // The chunk content
chunk.position  // Start position in the original document
chunk.page      // Page number (if available from parsing)
chunk.index     // Sequential chunk index

Chunking Strategies

Fixed Size

The simplest strategy. Splits at exact character boundaries:

chunks = chunk_text(text, { max_size: 500, strategy: "fixed" })

Use this only for unstructured text with no headings, code blocks, or clear paragraph boundaries. It is fast but produces the lowest quality embeddings.
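
As a point of reference, the fixed strategy amounts to stepping through the text in `max_size` windows. A minimal sketch in Rust (the function name and signature are ours for illustration, not FLIN's internals):

```rust
/// Split `text` into windows of at most `max_size` characters,
/// with the last `overlap` characters repeated at the start of
/// the next chunk. Boundaries can land anywhere, even mid-word.
pub fn chunk_fixed(text: &str, max_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < max_size, "overlap must be smaller than max_size");
    // Work on chars so we never cut a UTF-8 code point in half.
    let chars: Vec<char> = text.chars().collect();
    let step = max_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + max_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}
```

Note that a boundary can fall in the middle of a word or a code block, which is exactly the weakness the structure-aware strategies below avoid.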

Paragraph-Aware

Splits at paragraph boundaries (double newlines), keeping paragraphs intact when possible:

chunks = chunk_text(text, { max_size: 500, strategy: "paragraph" })
fn chunk_by_paragraph(text: &str, max_size: usize, overlap: usize) -> Vec<Chunk> {
    let paragraphs: Vec<&str> = text.split("\n\n").collect();
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut position = 0;

    for paragraph in paragraphs {
        if current.len() + paragraph.len() + 2 > max_size && !current.is_empty() {
            chunks.push(Chunk {
                text: current.trim().to_string(),
                position,
                index: chunks.len(),
            });

            // Overlap: keep the last `overlap` characters
            let overlap_start = current.len().saturating_sub(overlap);
            current = current[overlap_start..].to_string();
            position += overlap_start;
        }

        if !current.is_empty() {
            current.push_str("\n\n");
        }
        current.push_str(paragraph);
    }

    if !current.is_empty() {
        chunks.push(Chunk {
            text: current.trim().to_string(),
            position,
            index: chunks.len(),
        });
    }

    chunks
}

Semantic (Default)

The most sophisticated strategy. Respects headings, paragraphs, lists, and natural semantic boundaries:

chunks = chunk_text(text, { max_size: 500, strategy: "semantic" })

The semantic chunker follows a priority hierarchy:

1. Never split in the middle of a code block. A code snippet is atomic -- splitting it produces two meaningless fragments.
2. Prefer splitting at headings. A heading marks the start of a new topic. Chunks should align with topic boundaries.
3. Prefer splitting at paragraph boundaries. Paragraphs are the natural unit of a single idea.
4. Prefer splitting at sentence boundaries. If a paragraph is too long, split between sentences rather than mid-sentence.
5. As a last resort, split at word boundaries. Never split in the middle of a word.
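
The chunker starts from a parse_into_blocks helper that is not shown in this article. A simplified sketch, assuming Markdown-style `#` headings and fenced code blocks (FLIN's real parser handles more structure than this):

```rust
#[derive(Debug, PartialEq)]
pub enum Block {
    Heading(String),
    Code(String),
    Paragraph(String),
}

/// Classify text into headings, fenced code blocks, and paragraphs.
pub fn parse_into_blocks(text: &str) -> Vec<Block> {
    let mut blocks = Vec::new();
    let mut in_code = false;
    let mut buf = String::new();

    // Flush any accumulated paragraph text as a Paragraph block.
    let flush_para = |buf: &mut String, blocks: &mut Vec<Block>| {
        if !buf.trim().is_empty() {
            blocks.push(Block::Paragraph(std::mem::take(buf)));
        }
        buf.clear();
    };

    for line in text.lines() {
        if line.starts_with("```") {
            if in_code {
                // Closing fence: emit the whole code block as one unit.
                buf.push_str(line);
                buf.push('\n');
                blocks.push(Block::Code(std::mem::take(&mut buf)));
            } else {
                // Opening fence: flush pending prose first.
                flush_para(&mut buf, &mut blocks);
                buf.push_str(line);
                buf.push('\n');
            }
            in_code = !in_code;
        } else if in_code {
            buf.push_str(line);
            buf.push('\n');
        } else if line.starts_with('#') {
            flush_para(&mut buf, &mut blocks);
            blocks.push(Block::Heading(line.to_string()));
        } else if line.trim().is_empty() {
            flush_para(&mut buf, &mut blocks);
        } else {
            buf.push_str(line);
            buf.push('\n');
        }
    }
    flush_para(&mut buf, &mut blocks);
    blocks
}
```

The main chunker then walks these blocks: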

fn chunk_semantic(text: &str, max_size: usize, overlap: usize) -> Vec<Chunk> {
    let blocks = parse_into_blocks(text);
    let mut chunks = Vec::new();
    let mut current = String::new();
    let mut position = 0;

    for block in blocks {
        match block {
            Block::Heading(h) => {
                // Always start a new chunk at a heading
                if !current.is_empty() {
                    chunks.push(make_chunk(&current, position, chunks.len()));
                    position += current.len();
                    current.clear();
                }
                current.push_str(&h);
                current.push('\n');
            }
            Block::Code(code) => {
                // Keep code blocks intact
                if current.len() + code.len() > max_size && !current.is_empty() {
                    chunks.push(make_chunk(&current, position, chunks.len()));
                    position += current.len();
                    current.clear();
                }
                current.push_str(&code);
                current.push('\n');
            }
            Block::Paragraph(p) => {
                if current.len() + p.len() > max_size {
                    if !current.is_empty() {
                        chunks.push(make_chunk(&current, position, chunks.len()));
                        position += current.len();
                        current.clear();
                    }
                    // If the paragraph itself exceeds max_size, split by sentence
                    if p.len() > max_size {
                        let sentence_chunks = split_by_sentences(&p, max_size);
                        for sc in sentence_chunks {
                            chunks.push(make_chunk(&sc, position, chunks.len()));
                            position += sc.len();
                        }
                        continue;
                    }
                }
                current.push_str(&p);
                current.push_str("\n\n");
            }
        }
    }

    if !current.is_empty() {
        chunks.push(make_chunk(&current, position, chunks.len()));
    }

    add_overlaps(&mut chunks, overlap);
    chunks
}
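
The split_by_sentences fallback is also left abstract above. A naive sketch that cuts at `.`, `!`, and `?` (an assumption about its behavior -- real sentence segmentation must also survive abbreviations like "e.g." and decimal numbers):

```rust
/// Greedily pack sentences into chunks of at most `max_size` characters.
/// A single sentence longer than `max_size` becomes its own chunk.
pub fn split_by_sentences(paragraph: &str, max_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    // split_inclusive keeps the terminator attached to its sentence.
    for sentence in paragraph.split_inclusive(&['.', '!', '?'][..]) {
        if current.len() + sentence.len() > max_size && !current.is_empty() {
            chunks.push(current.trim().to_string());
            current = String::new();
        }
        current.push_str(sentence);
    }
    if !current.trim().is_empty() {
        chunks.push(current.trim().to_string());
    }
    chunks
}
```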

Code-Aware

A specialized strategy for source code and technical documentation with heavy code content:

chunks = chunk_text(text, { max_size: 500, strategy: "code" })

The code-aware chunker recognizes:

- Function definitions -- a function and its body stay together.
- Class/struct definitions -- a type definition stays together.
- Import blocks -- grouped imports are a single chunk.
- Comments -- doc comments are attached to their associated code.

fn chunk_code_aware(text: &str, max_size: usize) -> Vec<Chunk> {
    let mut chunks = Vec::new();
    let mut in_code_block = false;
    let mut code_buffer = String::new();
    let mut text_buffer = String::new();

    for line in text.lines() {
        if line.starts_with("```") {
            if in_code_block {
                // End of code block
                code_buffer.push_str(line);
                code_buffer.push('\n');

                // Flush text before code
                if !text_buffer.is_empty() {
                    flush_buffer(&mut text_buffer, max_size, &mut chunks);
                }

                // Code block as single chunk (or split if very large)
                if code_buffer.len() <= max_size {
                    chunks.push(make_chunk(&code_buffer, 0, chunks.len()));
                } else {
                    // Split large code blocks by function boundaries
                    let code_chunks = split_code_by_functions(&code_buffer, max_size);
                    chunks.extend(code_chunks);
                }
                code_buffer.clear();
                in_code_block = false;
            } else {
                in_code_block = true;
                code_buffer.push_str(line);
                code_buffer.push('\n');
            }
        } else if in_code_block {
            code_buffer.push_str(line);
            code_buffer.push('\n');
        } else {
            text_buffer.push_str(line);
            text_buffer.push('\n');
        }
    }

    // Flush remaining buffers
    if !text_buffer.is_empty() {
        flush_buffer(&mut text_buffer, max_size, &mut chunks);
    }
    if !code_buffer.is_empty() {
        chunks.push(make_chunk(&code_buffer, 0, chunks.len()));
    }

    chunks
}
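
The split_code_by_functions helper is referenced but not shown. One simple interpretation cuts only on lines that open a new top-level function -- sketched here for Rust-style `fn` signatures; a real version would know each language's syntax:

```rust
/// Split an oversized code block, preferring to cut just before a
/// line that starts a new function definition.
pub fn split_code_by_functions(code: &str, max_size: usize) -> Vec<String> {
    let mut chunks = Vec::new();
    let mut current = String::new();
    for line in code.lines() {
        let starts_fn = line.trim_start().starts_with("fn ")
            || line.trim_start().starts_with("pub fn ");
        // Cut before a new function once the current piece would overflow.
        if starts_fn && !current.is_empty() && current.len() + line.len() > max_size {
            chunks.push(current.clone());
            current.clear();
        }
        current.push_str(line);
        current.push('\n');
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```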

Overlap Between Chunks

Chunks can overlap to ensure that information at chunk boundaries is not lost:

Chunk 1: "...the user authentication system uses JWT tokens for stateless..."
Chunk 2: "...uses JWT tokens for stateless verification. The token contains..."

The overlap (50 characters by default) means that a query about "JWT tokens for stateless verification" will match both chunks, even though the phrase spans the boundary.

Overlap increases the total number of chunks and storage requirements but significantly improves retrieval quality for queries that happen to align with chunk boundaries.
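
In the semantic listing earlier, this is the job of add_overlaps. A sketch of one way it could work -- prepend the tail of each chunk's text to its successor and shift the successor's position back accordingly (the Chunk struct mirrors the fields used throughout this article; the exact behavior is our assumption):

```rust
pub struct Chunk {
    pub text: String,
    pub position: usize,
    pub index: usize,
}

/// Copy the last `overlap` characters of each chunk onto the front
/// of the next one, so boundary-spanning phrases appear in both.
pub fn add_overlaps(chunks: &mut [Chunk], overlap: usize) {
    for i in 1..chunks.len() {
        let prev = &chunks[i - 1].text;
        let mut tail_start = prev.len().saturating_sub(overlap);
        // Step forward to a valid UTF-8 character boundary if needed.
        while !prev.is_char_boundary(tail_start) {
            tail_start += 1;
        }
        let tail = prev[tail_start..].to_string();
        chunks[i].position = chunks[i].position.saturating_sub(tail.len());
        chunks[i].text.insert_str(0, &tail);
    }
}
```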

Chunk Quality Metrics

FLIN provides quality metrics for chunks to help developers tune their chunking parameters:

chunks = chunk_text(text, { max_size: 500, strategy: "semantic" })

for chunk in chunks {
    log_info("Chunk {chunk.index}: {chunk.text.len} chars")
}

// Summary statistics
avg_size = chunks.map(c => c.text.len).sum / chunks.len
log_info("Average chunk size: {avg_size} characters")
log_info("Total chunks: {chunks.len}")

Ideal chunk sizes for embedding:

- Too small (< 100 characters): Not enough context for a meaningful embedding.
- Optimal (200-600 characters): Single coherent idea, good embedding quality.
- Too large (> 1000 characters): Multiple topics blurred together, poor retrieval precision.
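
When tuning, these bands can be checked mechanically. A tiny Rust sketch with thresholds taken from the list above (the function name is ours, and the treatment of the in-between ranges is our interpretation):

```rust
/// Map a chunk length in characters to a tuning hint.
/// 100-199 and 601-1000 fall between the article's bands and are
/// treated leniently here.
pub fn classify_chunk_size(len: usize) -> &'static str {
    if len < 100 {
        "too small: not enough context for a meaningful embedding"
    } else if len <= 600 {
        "good: a single coherent idea"
    } else if len <= 1000 {
        "borderline: consider lowering max_size"
    } else {
        "too large: multiple topics will blur together"
    }
}
```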

Practical Example: Indexing a Documentation Site

// Batch index all documentation
docs_dir = ".flindb/documents/"
files = list_files(docs_dir)

total_chunks = 0

for file_path in files {
    parsed = parse_document(file_path)

    doc = Document {
        title: parsed.metadata.title || file_name(file_path),
        file_path: file_path,
        format: parsed.format,
        full_text: parsed.text
    }
    save doc

    chunks = chunk_text(parsed.text, {
        max_size: 500,
        overlap: 50,
        strategy: if file_path.ends_with(".md") { "code" } else { "semantic" }
    })

    for chunk in chunks {
        save DocumentChunk {
            document_id: doc.id,
            content: chunk.text,      // semantic text -- auto-embedded
            position: chunk.position,
            chunk_index: chunk.index
        }
    }

    total_chunks = total_chunks + chunks.len
    log_info("Indexed {file_path}: {chunks.len} chunks")
}

log_info("Total: {files.len} documents, {total_chunks} chunks")

Chunking is the bridge between raw documents and searchable embeddings. Get it wrong, and your RAG system returns irrelevant results regardless of how good the embedding model or LLM is. Get it right, and every chunk represents a focused, retrievable piece of knowledge.

In the next article, we explore hybrid document search -- combining BM25 keyword search with semantic search for the best of both worlds.

---

This is Part 122 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

Series Navigation:

- [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML
- [122] Code-Aware Chunking for RAG (you are here)
- [123] Hybrid Document Search: BM25 + Semantic
- [124] AI-First Language Design
