Embedding models have context windows. Many popular open-source models accept 512 tokens; some accept 8,192. None accept 50,000. When a user uploads a 50-page PDF, you cannot embed the entire document as a single vector. The embedding model would truncate the text silently, and the resulting vector would represent only the first few pages -- making the rest of the document invisible to semantic search.
Chunking solves this by splitting documents into pieces that fit within the embedding model's context window. But chunking is not just "split every N characters." Bad chunking destroys meaning. Split a sentence in half and neither half makes sense to an embedding model. Split in the middle of a code block and you lose the function signature that gives the body its meaning.
Session 221 implemented FLIN's chunking module: the bridge between document extraction (which produces raw text) and embedding generation (which requires bounded-length inputs). Twenty tests, 400 lines of Rust, and two chunking strategies that handle everything from legal contracts to source code.
The Document Intelligence Pipeline
Chunking sits at stage 3 of FLIN's document intelligence pipeline:
1. Upload File -> body.file (PDF, DOCX, etc.)
2. Extract Text -> document_extract(file) -> raw text
3. Chunk Text -> chunk_text(text, options) -> [Chunk]
4. Embed Chunks -> embed(chunk.text) -> vector
5. Store Vectors -> VectorStore.store_embedding()
6. Semantic Search -> search "query" in Entity by field

Stages 1 and 2 were complete. Stages 4 and 5 existed from the embedding system built earlier. Stage 3 was the missing piece. Without chunking, the only way to embed a document was to truncate it to fit the model's context window -- losing most of the content.
The Chunk Struct
Every chunk carries metadata about its position in the original document:
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub text: String,        // The chunk content
    pub index: usize,        // Position in the chunk sequence (0, 1, 2, ...)
    pub start_offset: usize, // Character offset in original text (start, inclusive)
    pub end_offset: usize,   // Character offset in original text (end, exclusive)
    pub total_chunks: usize, // Total number of chunks
}
```

The offsets enable source attribution. When semantic search returns a chunk, the application can highlight the exact passage in the original document. The index and total count enable display logic like "showing result from chunk 7 of 23."
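Offset-based attribution can be sketched with a tiny standalone helper. The `passage` function below is hypothetical (it is not part of FLIN's API shown in this article), but it illustrates how the character offsets recover the original passage:

```rust
// Hypothetical helper (not part of FLIN): recover the exact passage a chunk
// came from, using its character offsets into the original document.
fn passage(original: &str, start: usize, end: usize) -> String {
    // Operate on chars, not bytes, to match the chunker's offset semantics.
    original.chars().skip(start).take(end - start).collect()
}

fn main() {
    let doc = "First part. Second part.";
    // A chunk with start_offset = 12, end_offset = 24:
    assert_eq!(passage(doc, 12, 24), "Second part.");
}
```

Note that the helper skips by characters rather than slicing bytes, matching the character-based offsets the chunker produces.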
Recursive Character Chunking
The primary chunking strategy is recursive character chunking. It splits text at character boundaries with configurable chunk size and overlap:
```rust
pub struct ChunkOptions {
    pub chunk_size: usize,                 // Target chunk size in characters
    pub overlap: usize,                    // Overlap between consecutive chunks
    pub respect_word_boundaries: bool,     // Split at word boundaries
    pub respect_sentence_boundaries: bool, // Split at sentence ends
}

impl ChunkOptions {
    pub fn new(chunk_size: usize, overlap: usize) -> Self {
        Self {
            chunk_size,
            overlap,
            respect_word_boundaries: true,
            respect_sentence_boundaries: false,
        }
    }
}
```
The default configuration is 1,000 characters per chunk with 200 characters of overlap. These numbers come from empirical observation across different embedding models:
- 1,000 characters is approximately 250 tokens, well within the 512-token window of models like all-MiniLM-L6-v2.
- 200 characters of overlap (20%) ensures that concepts split across chunk boundaries are still captured in at least one chunk.
Why Characters, Not Tokens
Chunk size is measured in characters, not tokens. This is a deliberate choice. Token counts depend on the tokenizer, which depends on the model. A chunk of 1,000 characters might be 230 tokens with one model and 280 with another. Character-based sizing is model-independent and produces consistent behavior regardless of which embedding model is configured.
```rust
pub fn chunk_text(text: &str, options: &ChunkOptions) -> Vec<Chunk> {
    if text.is_empty() {
        return vec![];
    }

    let char_count = text.chars().count();
    if char_count <= options.chunk_size {
        return vec![Chunk {
            text: text.to_string(),
            index: 0,
            start_offset: 0,
            end_offset: char_count,
            total_chunks: 1,
        }];
    }

    let mut chunks = Vec::new();
    let mut start = 0;
    let chars: Vec<char> = text.chars().collect();

    while start < chars.len() {
        let mut end = (start + options.chunk_size).min(chars.len());

        // Respect word boundaries: backtrack to last space
        if options.respect_word_boundaries && end < chars.len() {
            if let Some(space_pos) = find_last_space(&chars, start, end) {
                end = space_pos + 1;
            }
        }

        let chunk_text: String = chars[start..end].iter().collect();
        chunks.push(Chunk {
            text: chunk_text,
            index: chunks.len(),
            start_offset: start,
            end_offset: end,
            total_chunks: 0, // Set after all chunks are created
        });

        // Advance with overlap
        start = if end >= chars.len() {
            chars.len()
        } else {
            end - options.overlap.min(end - start)
        };
    }

    // Set total_chunks on all chunks
    let total = chunks.len();
    for chunk in &mut chunks {
        chunk.total_chunks = total;
    }

    chunks
}
```
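The stepping rule can be checked in isolation. The sketch below is a minimal standalone reduction (not FLIN's code, and it ignores word boundaries): each chunk starts `chunk_size - overlap` characters after the previous one, and the final chunk simply runs to the end of the text.

```rust
// Minimal sketch of the overlap stepping rule (word boundaries ignored):
// each chunk starts `chunk_size - overlap` characters after the previous one.
fn chunk_starts(len: usize, chunk_size: usize, overlap: usize) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut start = 0;
    while start < len {
        starts.push(start);
        let end = (start + chunk_size).min(len);
        if end >= len {
            break; // last chunk reaches the end of the text
        }
        start = end - overlap;
    }
    starts
}

fn main() {
    // 26 characters, chunk_size 10, overlap 2:
    // the chunks cover 0..10, 8..18, 16..26
    assert_eq!(chunk_starts(26, 10, 2), vec![0, 8, 16]);
}
```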
UTF-8 Safety
The function uses chars().count() instead of len() and indexes into a Vec&lt;char&gt; rather than byte slices. This is critical for multilingual content. A French document with accented characters, a Japanese document with multi-byte kanji, or a document with emoji all chunk correctly because the algorithm operates on Unicode code points, not bytes.

A byte-based approach would split in the middle of a multi-byte character, producing invalid UTF-8. The character-based approach costs a small performance penalty (collecting into a Vec&lt;char&gt;) but guarantees correctness across all languages.
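The divergence between bytes and characters is easy to demonstrate with standard library calls:

```rust
// Character counts vs byte lengths for multilingual text.
fn main() {
    let s = "héllo 世界 🌍";
    assert_eq!(s.chars().count(), 10); // 10 Unicode code points
    assert_eq!(s.len(), 18);           // 18 bytes in UTF-8
    // An arbitrary byte offset can fall inside a multi-byte character;
    // slicing there (e.g. &s[..2]) would panic at runtime.
    assert!(!s.is_char_boundary(2)); // byte 2 is inside 'é'
}
```

A chunker that advanced by byte offsets would hit exactly these mid-character positions.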
Sentence-Boundary Chunking
The second strategy splits at sentence boundaries. This produces chunks that are more semantically coherent because each chunk contains complete sentences:
```rust
pub fn chunk_by_sentences(text: &str, options: &ChunkOptions) -> Vec<Chunk> {
    let sentences = split_into_sentences(text);
    let mut chunks = Vec::new();
    let mut current_text = String::new();
    let mut current_start = 0;

    for sentence in &sentences {
        if current_text.chars().count() + sentence.chars().count() > options.chunk_size
            && !current_text.is_empty()
        {
            chunks.push(Chunk {
                text: current_text.trim().to_string(),
                index: chunks.len(),
                start_offset: current_start,
                end_offset: current_start + current_text.chars().count(),
                total_chunks: 0,
            });
            current_start += current_text.chars().count() - options.overlap;
            current_text = String::new();
        }
        current_text.push_str(sentence);
    }

    // Handle remaining text
    if !current_text.trim().is_empty() {
        chunks.push(Chunk {
            text: current_text.trim().to_string(),
            index: chunks.len(),
            start_offset: current_start,
            end_offset: current_start + current_text.chars().count(),
            total_chunks: 0,
        });
    }

    let total = chunks.len();
    for chunk in &mut chunks {
        chunk.total_chunks = total;
    }

    chunks
}
```
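The article does not show `split_into_sentences`; a minimal hedged sketch (hypothetical, not FLIN's implementation) would split after terminal punctuation:

```rust
// Hypothetical sketch of a sentence splitter: break after '.', '!', or '?'.
// Real sentence detection must also handle abbreviations, decimals, etc.
fn split_into_sentences(text: &str) -> Vec<String> {
    let mut sentences = Vec::new();
    let mut current = String::new();
    for c in text.chars() {
        current.push(c);
        if matches!(c, '.' | '!' | '?') {
            sentences.push(current.trim_start().to_string());
            current.clear();
        }
    }
    if !current.trim().is_empty() {
        sentences.push(current.trim_start().to_string());
    }
    sentences
}

fn main() {
    let s = split_into_sentences("One. Two! Three?");
    assert_eq!(s, vec!["One.", "Two!", "Three?"]);
}
```

Even this naive version shows why the strategy suits prose: each returned piece is a complete thought that can accumulate into a chunk without mid-sentence cuts.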
Sentence chunking is better for prose -- articles, legal documents, essays -- where meaning is organized in sentences and paragraphs. Character chunking is better for unstructured text or text where sentence detection is unreliable (log files, CSV data, code).
Overlap: Why It Matters
Consider a document about photosynthesis that says: "Chlorophyll absorbs red and blue light. This absorbed energy drives the conversion of carbon dioxide and water into glucose."
If the chunk boundary falls between these two sentences with no overlap, a search for "how chlorophyll produces glucose" might miss the connection because no single chunk contains both "chlorophyll" and "glucose." With overlap, the second chunk includes the first sentence as context, preserving the relationship.
```
// Without overlap: search might miss connections
chunk_1: "...Chlorophyll absorbs red and blue light."
chunk_2: "This absorbed energy drives the conversion..."

// With 200-char overlap: connections preserved
chunk_1: "...Chlorophyll absorbs red and blue light."
chunk_2: "...absorbs red and blue light. This absorbed energy drives..."
```
The trade-off is storage. A 200-character overlap means consecutive chunks advance by only 800 characters instead of 1,000, producing roughly 25% more chunks and 25% more embeddings. For most applications, this cost is negligible compared to the quality improvement in search results.
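The storage arithmetic can be made concrete with a small sketch (a simplification that ignores word-boundary backtracking):

```rust
// Quick arithmetic for the overlap storage cost: with chunk_size 1000 and
// overlap 200, consecutive chunks advance by 800 characters, so a document
// needs roughly len/800 chunks instead of len/1000.
fn approx_chunks(len: usize, chunk_size: usize, overlap: usize) -> usize {
    let step = chunk_size - overlap;
    (len + step - 1) / step // ceiling division
}

fn main() {
    // A 100,000-character document:
    assert_eq!(approx_chunks(100_000, 1000, 0), 100);   // no overlap
    assert_eq!(approx_chunks(100_000, 1000, 200), 125); // ~25% more chunks
}
```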
Estimating Tokens and Optimal Chunk Size
The module includes helpers for adapting chunk size to specific embedding models:
```rust
pub fn estimate_tokens(text: &str) -> usize {
    // Approximate: 1 token per 4 characters for English
    // This is a rough estimate; actual token count depends on the tokenizer
    (text.chars().count() + 3) / 4
}

pub fn optimal_chunk_size(model_context_window: usize) -> usize {
    // Use 80% of context window to leave room for overhead
    let target_tokens = (model_context_window as f64 * 0.8) as usize;
    // Convert tokens to characters (4 chars per token)
    target_tokens * 4
}
```
These are approximations. The 4-characters-per-token ratio is a reasonable average for English text with common tokenizers (GPT-style BPE, SentencePiece). For languages with more complex morphology or non-Latin scripts, the ratio can vary significantly. The 80% utilization factor leaves headroom for tokenizer-specific overhead and special tokens.
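The arithmetic can be checked by restating the two helpers verbatim and running them on concrete inputs:

```rust
// Restated from the module above so the arithmetic can be checked in isolation.
fn estimate_tokens(text: &str) -> usize {
    (text.chars().count() + 3) / 4 // ceiling of chars / 4
}

fn optimal_chunk_size(model_context_window: usize) -> usize {
    let target_tokens = (model_context_window as f64 * 0.8) as usize;
    target_tokens * 4
}

fn main() {
    assert_eq!(estimate_tokens("hello world"), 3); // 11 chars -> 3 tokens
    // A 512-token window: 80% = 409 tokens -> 1,636 characters
    assert_eq!(optimal_chunk_size(512), 1636);
}
```

Note that the 1,636-character result for a 512-token window comfortably contains the 1,000-character default chunk size, which is the headroom the defaults rely on.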
Default Configuration
For developers who do not want to think about chunk parameters, FLIN provides sensible defaults:
```rust
pub fn chunk_text_default(text: &str) -> Vec<Chunk> {
    chunk_text(text, &ChunkOptions::new(1000, 200))
}
```

One function call, no configuration. The defaults work well for English text with embedding models in the 384-to-1024 dimension range. For specialized use cases -- very short documents, very long documents, non-English text -- the full ChunkOptions API is available.
Connecting to the Pipeline
In FLIN application code, chunking is typically invisible. The semantic text field type triggers automatic chunking and embedding when a document is saved:
```
entity KnowledgeArticle {
    title: text
    content: semantic text
}

article = KnowledgeArticle.create({
    title: "Introduction to SYSCOHADA",
    content: extracted_text  // From document_extract()
})

save article  // Automatically: chunk -> embed -> store vectors
```
Behind the scenes, the runtime chunks the content, generates an embedding for each chunk, and stores the embeddings in the vector store with field names like content__chunk_0, content__chunk_1, and so on. When a semantic search query arrives, the search system queries across all chunks and returns results with source attribution back to the original entity.
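The per-chunk field naming is straightforward to sketch; the helper below is hypothetical (FLIN's internal naming code is not shown in this article), but it reproduces the content__chunk_N pattern described above:

```rust
// Hypothetical helper reproducing the field__chunk_N naming pattern.
fn chunk_field_name(field: &str, index: usize) -> String {
    format!("{}__chunk_{}", field, index)
}

fn main() {
    assert_eq!(chunk_field_name("content", 0), "content__chunk_0");
    assert_eq!(chunk_field_name("content", 7), "content__chunk_7");
}
```

Deriving the name from the field plus the chunk index is what lets a search hit map back to both the owning entity and the exact chunk that matched.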
The chunking module was built in 25 minutes during Session 221. Twenty tests verify every edge case: empty text, text shorter than chunk size, Unicode text, paragraph text, overlap behavior, and metadata consistency. The module is small (400 lines) but critical -- it is the bridge that makes semantic search work on real documents, not just toy examples.
In the next article, we wire the chunking module to the embedding system, completing the pipeline from raw document bytes to searchable vectors.
---
This is Part 130 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:

- [129] Download Grants and Access Keys
- [130] Text Chunking Strategies (you are here)
- [131] Chunk-Embedding Integration