#132 -- Extracting Text From CSV, XLSX, RTF, and XML

PDF and DOCX get all the attention in document parsing discussions. But the real world runs on spreadsheets, legacy word processors, and XML feeds. A SYSCOHADA accounting firm uploads trial balances as Excel files. A legal team exchanges drafts in RTF because it is the lowest common denominator between Word, LibreOffice, and Pages. A news aggregator ingests RSS feeds in XML. An import pipeline reads data from CSV exports.

FLIN's document extraction pipeline needed to handle all of these. Sessions 228, 230, and 231 added four new format parsers -- CSV, XLSX, RTF, and XML -- bringing the total supported formats to nine. Each format presents unique challenges, from multi-sheet workbooks to XPath query evaluation, but all feed into the same chunking and embedding pipeline.

The Extraction Dispatcher

Every document format flows through a single dispatcher function that detects the type and routes to the appropriate parser:

pub fn extract_document(
    bytes: &[u8],
    mime_type: Option<&str>,
    extension: Option<&str>,
) -> Result<String, String> {
    let doc_type = detect_document_type(mime_type, extension);

    match doc_type {
        DocumentType::Text => Ok(String::from_utf8_lossy(bytes).to_string()),
        DocumentType::Markdown => Ok(String::from_utf8_lossy(bytes).to_string()),
        DocumentType::Html => extract_html_text(bytes),
        DocumentType::Pdf => extract_pdf_text(bytes),
        DocumentType::Docx => extract_docx_text(bytes),
        DocumentType::Csv => extract_csv_text(bytes),
        DocumentType::Xlsx => extract_xlsx_text(bytes),
        DocumentType::Rtf => extract_rtf_text(bytes),
        DocumentType::Xml => extract_xml_text(bytes),
        // ... other formats
        DocumentType::Unknown => Err("Unsupported document type".to_string()),
    }
}

MIME type detection takes priority over file extension. If a file is uploaded with Content-Type: text/csv, it is treated as CSV even if the extension is .txt. Extension-based detection is the fallback when no MIME type is available.
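The priority rule can be sketched in a few lines. This is a minimal stand-in, not FLIN's actual detect_document_type: the enum is cut down to four formats, the helpers from_mime and from_extension are hypothetical names, and the sketch assumes that an unrecognized MIME type (for example application/octet-stream) also falls through to the extension.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum DocumentType {
    Text,
    Csv,
    Rtf,
    Xml,
    Unknown,
}

/// Map a MIME type to a document type (hypothetical helper).
fn from_mime(mime: &str) -> Option<DocumentType> {
    // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    let essence = mime.split(';').next().unwrap_or("").trim().to_ascii_lowercase();
    match essence.as_str() {
        "text/plain" => Some(DocumentType::Text),
        "text/csv" => Some(DocumentType::Csv),
        "application/rtf" => Some(DocumentType::Rtf),
        "application/xml" | "text/xml" => Some(DocumentType::Xml),
        _ => None,
    }
}

/// Map a file extension, with or without the leading dot (hypothetical helper).
fn from_extension(ext: &str) -> Option<DocumentType> {
    match ext.trim_start_matches('.').to_ascii_lowercase().as_str() {
        "txt" => Some(DocumentType::Text),
        "csv" => Some(DocumentType::Csv),
        "rtf" => Some(DocumentType::Rtf),
        "xml" => Some(DocumentType::Xml),
        _ => None,
    }
}

/// MIME type first, extension second, Unknown if neither matches.
fn detect_document_type(mime_type: Option<&str>, extension: Option<&str>) -> DocumentType {
    mime_type
        .and_then(from_mime)
        .or_else(|| extension.and_then(from_extension))
        .unwrap_or(DocumentType::Unknown)
}
```

With this shape, detect_document_type(Some("text/csv"), Some(".txt")) resolves to Csv, which is the behavior described above.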

CSV: Tabular Data as Searchable Text

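CSV files are deceptively simple. The format has no binding specification (RFC 4180 is an informational RFC, not a standard), and real-world CSV files use different delimiters, quote characters, line endings, and encodings. FLIN uses the csv crate, which copes with quoted fields, mixed line endings, and (with flexible(true)) rows of uneven length. The delimiter is not auto-detected: the reader below keeps the crate's default comma.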
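Since delimiters vary in the wild, one option is a small pre-pass that guesses the delimiter before building the reader. The sketch below is an assumption on our part, not something the FLIN code shown here does: sniff_delimiter is a hypothetical helper that counts candidate delimiters outside double quotes on the first non-empty line and keeps the most frequent one, defaulting to a comma.

```rust
/// Guess the field delimiter from the first non-empty line of a sample.
/// Hypothetical helper: neither the csv crate nor the FLIN code shown here provides it.
fn sniff_delimiter(sample: &str) -> u8 {
    const CANDIDATES: [u8; 4] = [b',', b';', b'\t', b'|'];

    let first_line = sample
        .lines()
        .find(|line| !line.trim().is_empty())
        .unwrap_or("");

    let mut best = b','; // default when nothing is found; also wins ties
    let mut best_count = 0usize;

    for &candidate in &CANDIDATES {
        let mut in_quotes = false;
        let mut count = 0usize;
        for byte in first_line.bytes() {
            if byte == b'"' {
                // Toggle on every quote; an escaped "" pair toggles twice and cancels out.
                in_quotes = !in_quotes;
            } else if byte == candidate && !in_quotes {
                count += 1;
            }
        }
        if count > best_count {
            best = candidate;
            best_count = count;
        }
    }
    best
}
```

The result could then be handed to csv::ReaderBuilder::delimiter, which takes a single byte. A one-line sample is a weak signal, so a production version would likely inspect several rows.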

The extraction converts CSV rows into tab-delimited text, preserving the tabular structure while producing text that embedding models can process:

pub fn extract_csv_text(bytes: &[u8]) -> Result<String, String> {
    // Handle UTF-8 BOM
    let data = if bytes.starts_with(&[0xEF, 0xBB, 0xBF]) {
        &bytes[3..]
    } else {
        bytes
    };

    let mut reader = csv::ReaderBuilder::new()
        .flexible(true)
        .trim(csv::Trim::All)
        .from_reader(data);

    let mut output = String::new();

    // Headers
    if let Ok(headers) = reader.headers() {
        output.push_str(&headers.iter().collect::<Vec<_>>().join("\t"));
        output.push('\n');
    }

    // Rows; records that fail to parse (for example invalid UTF-8) are skipped
    for record in reader.records() {
        if let Ok(record) = record {
            let fields: Vec<&str> = record.iter().collect();
            output.push_str(&fields.join("\t"));
            output.push('\n');
        }
    }

    Ok(output)
}

The output looks like this:

Product     Q1      Q2      Q3
Widget      100     150     120
Gadget      200     180     220

Tab-delimited output was chosen over comma-delimited for readability and because tabs rarely appear in data values. The embedding model sees the column headers alongside the values, which helps it understand that "120" means "Q3 sales for Widget" rather than just the number 120.
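Tabs are rare in data values, but nothing in CSV forbids them, and quoted fields may also contain line breaks. Either would break the one-row-per-line, tab-separated layout. A minimal sketch of a guard, assuming hypothetical helpers sanitize_field and join_row that are not part of the extractor shown above:

```rust
/// Replace tabs and line breaks inside a field with a single space so the
/// field cannot be mistaken for a column or row boundary.
/// Hypothetical helper, not part of the extractor shown above.
fn sanitize_field(field: &str) -> String {
    let mut out = String::with_capacity(field.len());
    let mut last_was_space = false;

    for ch in field.chars() {
        if matches!(ch, '\t' | '\n' | '\r') {
            // Emit at most one space for a run of separators.
            if !last_was_space {
                out.push(' ');
                last_was_space = true;
            }
        } else {
            out.push(ch);
            last_was_space = ch == ' ';
        }
    }
    out
}

/// Join already-parsed fields into one tab-delimited output line.
fn join_row(fields: &[&str]) -> String {
    fields
        .iter()
        .map(|field| sanitize_field(field))
        .collect::<Vec<_>>()
        .join("\t")
}
```

Calling join_row in place of fields.join("\t") would keep a multi-line note cell on a single output line.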

XLSX: Multi-Sheet Workbooks

Excel files are ZIP archives containing XML. The calamine crate handles the decompression and XML parsing, providing a clean API for reading cells.

The key challenge with XLSX is multi-sheet workbooks. A financial report might have separate sheets for revenue, expenses, and balance sheet. FLIN extracts all sheets and labels each one:

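// Assumes a calamine release in which the cell enum is still named DataType
// (later releases rename it to Data).
use calamine::{DataType, Reader, Xlsx};

pub fn extract_xlsx_text(bytes: &[u8]) -> Result<String, String> {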
    let cursor = std::io::Cursor::new(bytes);
    let mut workbook: Xlsx<_> = calamine::open_workbook_from_rs(cursor)
        .map_err(|e| format!("Failed to open XLSX: {}", e))?;

    let sheet_names = workbook.sheet_names().to_vec();
    let mut output = String::new();

    for name in &sheet_names {
        if let Ok(range) = workbook.worksheet_range(name) {
            output.push_str(&format!("=== Sheet: {} ===\n", name));

            for row in range.rows() {
                let cells: Vec<String> = row.iter()
                    .map(|cell| format_cell_value(cell))
                    .collect();
                output.push_str(&cells.join("\t"));
                output.push('\n');
            }
            output.push('\n');
        }
    }

    Ok(output)
}

fn format_cell_value(cell: &DataType) -> String {
    match cell {
        DataType::String(s) => s.clone(),
        DataType::Int(i) => i.to_string(),
        DataType::Float(f) => {
            if f.fract() == 0.0 { format!("{}", *f as i64) }
            else { format!("{:.2}", f) }
        }
        DataType::Bool(b) => if *b { "TRUE" } else { "FALSE" }.to_string(),
        // Excel stores dates as serial day numbers; this emits the raw serial, not a calendar date
        DataType::DateTime(dt) => format!("{:.0}", dt),
        DataType::Empty => String::new(),
        _ => String::new(),
    }
}

The output for a multi-sheet workbook:

=== Sheet: Sales Data ===
Product     Q1      Q2      Q3
Widget      100     150     120

=== Sheet: Summary ===
Total       370
Average     123.33

Sheet headers act as section markers. When this text is chunked and embedded, rows that share a chunk with their sheet header carry that context: "Widget Q1 100" is tied to "Sales Data" and "370" to "Summary." This context improves search quality for financial documents.
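One detail in format_cell_value deserves a note: the DateTime arm prints the Excel serial number, so a date cell shows up as a value like 45000 rather than a calendar date. If readable dates matter for search, the serial can be converted. The sketch below is an assumption on our part, not FLIN's code: excel_serial_to_iso_date is a hypothetical helper, it assumes the workbook uses the 1900 date system (not the 1904 one), and it declines serials below 61, which are affected by Excel's fictitious 1900-02-29.

```rust
/// Convert an Excel 1900-system serial number to an ISO 8601 date string.
/// Hypothetical helper, not part of calamine or of the FLIN code shown above.
fn excel_serial_to_iso_date(serial: f64) -> Option<String> {
    // Serials below 61 sit before or on Excel's fictitious 1900-02-29.
    if !serial.is_finite() || serial < 61.0 {
        return None;
    }
    // Serial 25569 is 1970-01-01; the fractional part is the time of day.
    let days_since_epoch = serial.floor() as i64 - 25_569;
    let (year, month, day) = civil_from_days(days_since_epoch);
    Some(format!("{:04}-{:02}-{:02}", year, month, day))
}

/// Days since 1970-01-01 to (year, month, day) in the Gregorian calendar.
fn civil_from_days(days: i64) -> (i64, i64, i64) {
    let z = days + 719_468; // shift the epoch to 0000-03-01
    let era = z.div_euclid(146_097); // 400-year cycles
    let doe = z.rem_euclid(146_097); // day of era, 0..=146096
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365; // year of era
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year, counted from March 1
    let mp = (5 * doy + 2) / 153; // month index, 0 = March
    let day = doy - (153 * mp + 2) / 5 + 1;
    let month = if mp < 10 { mp + 3 } else { mp - 9 };
    // January and February belong to the following calendar year.
    let year = yoe + era * 400 + if month <= 2 { 1 } else { 0 };
    (year, month, day)
}
```

The time-of-day fraction is dropped here; a fuller version would format it as well.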

RTF: Rich Text Without the Riches

RTF (Rich Text Format) is a Microsoft format from 1987 that encodes formatting in plain-text control words. A simple document might look like {\rtf1\ansi\b Hello\b0 World}, where \b starts bold and \b0 ends it. The extraction strips all formatting and returns plain text:

pub fn extract_rtf_text(bytes: &[u8]) -> Result<String, String> {
    let text = String::from_utf8_lossy(bytes).to_string();

    let document = rtf_parser::parse_rtf(text)
        .map_err(|e| format!("RTF parse error: {}", e))?;

    let raw_text = document.get_text();
    Ok(normalize_rtf_text(&raw_text))
}

fn normalize_rtf_text(text: &str) -> String {
    // Collapse multiple whitespace, preserve paragraph breaks
    let mut result = String::new();
    let mut last_was_space = false;

    for ch in text.chars() {
        if ch == '\n' {
            result.push('\n');
            last_was_space = false;
        } else if ch.is_whitespace() {
            if !last_was_space {
                result.push(' ');
            }
            last_was_space = true;
        } else {
            result.push(ch);
            last_was_space = false;
        }
    }

    result.trim().to_string()
}

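RTF extraction is simpler than the other formats because rtf-parser handles the heavy lifting. The normalization step cleans up the whitespace artifacts that RTF parsing often produces: runs of spaces and tabs left behind by removed formatting codes collapse to a single space, and leading and trailing whitespace is trimmed. Newlines pass through unchanged, so paragraph breaks survive, as do any empty lines left by removed tables.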
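Because normalize_rtf_text keeps every newline, runs of empty lines in the parsed text remain in the output. If that matters for chunking, a second pass can squeeze them. A minimal sketch, assuming a hypothetical helper collapse_blank_lines that is not part of the code above:

```rust
/// Collapse any run of blank lines into a single blank line, and drop
/// leading and trailing blank lines entirely.
/// Hypothetical follow-up pass, not part of the extractor shown above.
fn collapse_blank_lines(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    let mut pending_break = false; // true if blank lines were seen since the last content line

    for line in text.lines() {
        if line.trim().is_empty() {
            pending_break = true;
            continue;
        }
        if !out.is_empty() {
            out.push('\n');
            if pending_break {
                // Keep exactly one empty line as the paragraph separator.
                out.push('\n');
            }
        }
        pending_break = false;
        out.push_str(line.trim_end());
    }
    out
}
```

Paragraph boundaries are still visible to the chunker as one empty line, while long gaps no longer consume chunk space.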

XML: XPath-Powered Extraction

XML is the most versatile of the four formats. A generic XML file needs plain-text extraction (strip all tags). An RSS feed needs item-level extraction. A SOAP message needs element selection. FLIN handles all of these with a dual-library approach: roxmltree for fast parsing and sxd-xpath for XPath 1.0 query support.

Basic Text Extraction

The simplest operation strips all XML tags and returns the text content:

pub fn extract_xml_text(bytes: &[u8]) -> Result<String, String> {
    let text = String::from_utf8_lossy(bytes);
    let doc = roxmltree::Document::parse(&text)
        .map_err(|e| format!("XML parse error: {}", e))?;

    let mut output = String::new();
    collect_text_recursive(doc.root(), &mut output);
    Ok(output.trim().to_string())
}

XPath Queries

For structured extraction, FLIN supports XPath 1.0 queries:

pub fn extract_xml_by_xpath(
    bytes: &[u8],
    xpath: &str,
) -> Result<XPathResult, String> {
    let text = String::from_utf8_lossy(bytes);
    let package = sxd_document::parser::parse(&text)
        .map_err(|e| format!("XML parse error: {:?}", e))?;
    let document = package.as_document();

    let factory = sxd_xpath::Factory::new();
    let expression = factory.build(xpath)
        .map_err(|e| format!("XPath error: {:?}", e))?
        .ok_or("Empty XPath expression")?;

    let context = sxd_xpath::Context::new();
    let value = expression.evaluate(&context, document.root())
        .map_err(|e| format!("XPath evaluation error: {:?}", e))?;

    Ok(xpath_value_to_result(value))
}

XPath enables precise data extraction from structured XML:

// Extract all item titles from an RSS feed
titles = xml_xpath(rss_data, "//item/title/text()")

// Extract the href of every link whose rel attribute is 'stylesheet'
links = xml_xpath(data, "//link[@rel='stylesheet']/@href")

// Get the first matching element
first_item = xml_xpath_first(data, "//item[1]")

Subtype Detection

FLIN automatically detects XML subtypes by examining the root element:

pub fn detect_xml_subtype(bytes: &[u8]) -> XmlSubtype {
    let text = String::from_utf8_lossy(bytes);
    if let Ok(doc) = roxmltree::Document::parse(&text) {
        let root_name = doc.root_element().tag_name().name().to_lowercase();
        match root_name.as_str() {
            "rss" => XmlSubtype::Rss,
            "feed" => XmlSubtype::Atom,
            "envelope" => XmlSubtype::Soap,
            "svg" => XmlSubtype::Svg,
            "html" => XmlSubtype::Xhtml,
            _ => XmlSubtype::Generic,
        }
    } else {
        XmlSubtype::Generic
    }
}

This enables convenience functions like extract_rss_items and extract_atom_entries that know the structure of common XML formats and extract items without requiring the developer to write XPath queries.
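Matching on the local name alone treats any document whose root happens to be called feed or envelope as Atom or SOAP. One possible refinement, offered as a sketch and not as FLIN's behavior, also checks the root's namespace URI. The function classify_root is a hypothetical helper; with roxmltree, its two arguments could come from tag_name().name() and tag_name().namespace(). The enum is redeclared here only to keep the example self-contained.

```rust
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum XmlSubtype {
    Rss,
    Atom,
    Soap,
    Svg,
    Xhtml,
    Generic,
}

/// Classify a root element from its local name and namespace URI.
/// Hypothetical helper: stricter than the name-only match shown above.
fn classify_root(local_name: &str, namespace: Option<&str>) -> XmlSubtype {
    // Lowercasing mirrors the name-only version above.
    let name = local_name.to_ascii_lowercase();
    match (name.as_str(), namespace) {
        // An RSS 2.0 root element carries no namespace of its own.
        ("rss", _) => XmlSubtype::Rss,
        ("feed", Some("http://www.w3.org/2005/Atom")) => XmlSubtype::Atom,
        // SOAP 1.1 and SOAP 1.2 use different envelope namespaces.
        ("envelope", Some("http://schemas.xmlsoap.org/soap/envelope/"))
        | ("envelope", Some("http://www.w3.org/2003/05/soap-envelope")) => XmlSubtype::Soap,
        ("svg", Some("http://www.w3.org/2000/svg")) => XmlSubtype::Svg,
        ("html", Some("http://www.w3.org/1999/xhtml")) => XmlSubtype::Xhtml,
        _ => XmlSubtype::Generic,
    }
}
```

The trade-off is strictness: a feed document that omits the Atom namespace falls back to Generic instead of being parsed as Atom.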

Nine Formats, One Pipeline

With CSV, XLSX, RTF, and XML complete, FLIN's document extraction pipeline handles nine formats:

Format	Crate	MIME Types	Tests
Plain Text	(built-in)	text/plain	5
Markdown	(built-in)	text/markdown	5
HTML	(built-in)	text/html	12
PDF	pdf-extract	application/pdf	8
DOCX	docx-rs	application/vnd.openxml...	10
CSV	csv	text/csv	12
XLSX	calamine	application/vnd.openxml...	8
RTF	rtf-parser	application/rtf	22
XML	roxmltree + sxd-xpath	application/xml, text/xml	61

Total: 143 tests across all format parsers. Each parser handles its edge cases -- UTF-8 BOM in CSV, empty sheets in XLSX, malformed control words in RTF, namespace-heavy XML -- and feeds clean text into the chunking pipeline.

The unified dispatcher means that adding a new format requires three small changes: add a variant to DocumentType, add detection logic for its MIME types and extensions, and implement the extraction function with a matching arm in extract_document. The chunking, embedding, and search systems work with the new format automatically.

In the next article, we explore a different kind of conversion: semantic auto-conversion, where FLIN automatically detects semantic text fields and sets up embedding infrastructure without the developer writing any setup code.

This is Part 132 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.

