Cloud-based embedding APIs are convenient but come with three fundamental problems: latency (100-300 ms per call), cost (accumulates with volume), and privacy (your data is sent to a third party). For applications that generate thousands of embeddings daily, or that handle sensitive data, or that need sub-50ms search latency, cloud APIs are a bottleneck.
FastEmbed solves all three problems. It is an open-source library that runs embedding models locally, on the same machine as the FLIN runtime. No network call. No API key. No data leaving the server. A 384-dimension embedding is generated in 10-50 milliseconds, depending on text length and hardware.
FLIN integrates FastEmbed as the default local embedding provider, making it the recommended choice for production applications that need fast, private semantic search.
What FastEmbed Is
FastEmbed is an embedding inference library optimized for production use. It runs quantized ONNX models that produce high-quality embeddings at a fraction of the resource cost of full-precision models.
Key characteristics:
- Model size: 30-100 MB (vs 500 MB+ for full-precision)
- Inference time: 10-50 ms per embedding
- Memory usage: 100-300 MB at runtime
- Accuracy: >95% of full-precision model quality
- Dependencies: ONNX Runtime only
The models are downloaded once and cached locally. After the first run, there is no network dependency.
Configuration
Enabling FastEmbed in FLIN:
```
// flin.config
ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"  // 384 dimensions, 33 MB
    }
}
```

Available models:
| Model | Dimensions | Size | Quality | Speed |
|---|---|---|---|---|
| BAAI/bge-small-en-v1.5 | 384 | 33 MB | Good | Fast |
| BAAI/bge-base-en-v1.5 | 768 | 110 MB | Better | Medium |
| BAAI/bge-large-en-v1.5 | 1024 | 335 MB | Best | Slower |
| sentence-transformers/all-MiniLM-L6-v2 | 384 | 23 MB | Good | Fastest |
For most applications, bge-small-en-v1.5 provides the best balance of quality and speed. The 384-dimension vectors are small enough to index efficiently while capturing enough semantic information for accurate search.
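To see why 384-dimension vectors index efficiently, consider raw vector storage: each dimension is an f32 (4 bytes). A back-of-envelope sketch, ignoring the additional graph links an HNSW index keeps per vector:

```rust
// Back-of-envelope estimate of raw embedding storage: each dimension is
// an f32 (4 bytes). A real HNSW index adds graph links on top of this.
fn raw_vector_bytes(num_vectors: usize, dims: usize) -> usize {
    num_vectors * dims * std::mem::size_of::<f32>()
}

fn main() {
    // 100,000 products at 384 dims: 153,600,000 bytes, roughly 147 MB.
    println!("{} bytes", raw_vector_bytes(100_000, 384));
    // The same corpus at 1024 dims (bge-large) needs ~2.7x the memory.
    println!("{} bytes", raw_vector_bytes(100_000, 1024));
}
```

The numbers here are illustrative arithmetic, not measured FLIN index sizes; they show why the smaller models are the default recommendation.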
Integration with semantic text
When FastEmbed is configured, semantic text fields use it automatically:
```
entity Product {
    name: text
    description: semantic text  // Uses FastEmbed for embedding
}

product = Product {
    name: "Ergonomic Office Chair",
    description: "Adjustable lumbar support with breathable mesh back..."
}
save product  // Embedding generated locally via FastEmbed
```
The switch from cloud embeddings to FastEmbed is transparent. The save operation calls FastEmbed instead of an API. The search keyword uses the same HNSW index. The developer code does not change.
Implementation
The FastEmbed integration in the FLIN runtime:
```
use fastembed::{TextEmbedding, InitOptions, EmbeddingModel};

pub struct FastEmbedProvider {
    model: TextEmbedding,
    model_name: String,
}

impl FastEmbedProvider {
    pub fn new(model_name: &str) -> Result<Self, EmbeddingError> {
        // Load the quantized ONNX model (downloaded and cached on first use).
        // Assumes a From<fastembed::Error> impl for EmbeddingError elsewhere
        // in the runtime.
        let model = TextEmbedding::try_new(
            InitOptions::new(Self::parse_model(model_name)),
        )?;
        Ok(Self {
            model,
            model_name: model_name.to_string(),
        })
    }

    pub fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError> {
        // fastembed's embed() is batch-oriented: wrap the single text and
        // take the first (only) result.
        let mut embeddings = self.model.embed(vec![text], None)?;
        Ok(embeddings.remove(0))
    }

    pub fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError> {
        Ok(self.model.embed(texts.to_vec(), None)?)
    }

    fn parse_model(name: &str) -> EmbeddingModel {
        match name {
            "BAAI/bge-small-en-v1.5" => EmbeddingModel::BGESmallENV15,
            "BAAI/bge-base-en-v1.5" => EmbeddingModel::BGEBaseENV15,
            "BAAI/bge-large-en-v1.5" => EmbeddingModel::BGELargeENV15,
            "sentence-transformers/all-MiniLM-L6-v2" => EmbeddingModel::AllMiniLML6V2,
            _ => EmbeddingModel::BGESmallENV15, // Default
        }
    }
}
```
Batch Embedding for Imports
When importing existing data, generating embeddings one at a time would be slow. FastEmbed supports batch processing:
```
// Import 10,000 products with embeddings
products = load_csv("products.csv")

for batch in products.chunks(100) {
    for product in batch {
        save Product {
            name: product.name,
            description: product.description  // Batched embedding
        }
    }
}
```
The FLIN runtime detects batch save operations and groups the embedding calls:
```
pub fn embed_batch_on_save(
    provider: &FastEmbedProvider,
    entities: &mut [Entity],
    semantic_fields: &[&str],
) -> Result<(), EmbeddingError> {
    for field_name in semantic_fields {
        let texts: Vec<String> = entities.iter()
            .map(|e| e.get_text(field_name).to_string())
            .collect();

        let embeddings = provider.embed_batch(&texts)?;

        for (entity, embedding) in entities.iter_mut().zip(embeddings) {
            entity.set_embedding(field_name, embedding);
        }
    }
    Ok(())
}
```
Batch embedding is approximately 5x faster than individual embedding calls due to reduced overhead per invocation.
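The speedup follows from a simple cost model: every call pays a fixed per-invocation overhead (tokenizer setup, dispatch into the ONNX session) plus per-text inference time; batching pays the overhead once. A sketch with illustrative numbers, not measured FLIN figures:

```rust
// Illustrative cost model for batching. Overhead and inference times are
// made-up numbers chosen only to show the shape of the speedup.
fn individual_ms(n: u64, overhead_ms: u64, infer_ms: u64) -> u64 {
    // One call per text: each pays the fixed overhead.
    n * (overhead_ms + infer_ms)
}

fn batched_ms(n: u64, overhead_ms: u64, infer_ms: u64) -> u64 {
    // One call for the whole batch: overhead is paid once.
    overhead_ms + n * infer_ms
}

fn main() {
    let (n, overhead, infer) = (100, 8, 2);
    println!("individual: {} ms", individual_ms(n, overhead, infer)); // 1000 ms
    println!("batched:    {} ms", batched_ms(n, overhead, infer));    // 208 ms
}
```

With these hypothetical numbers the batch is about 4.8x faster, in line with the ~5x observed above; the exact ratio depends on how large the fixed overhead is relative to inference.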
Model Download and Caching
The first time a FastEmbed model is used, it is downloaded from Hugging Face and cached in `.flindb/models/`:

```
.flindb/
  models/
    BAAI--bge-small-en-v1.5/
      model.onnx       (33 MB)
      tokenizer.json   (400 KB)
      config.json      (1 KB)
```

Subsequent uses load from cache. The download progress is displayed in the FLIN development server console:
```
[FastEmbed] Downloading BAAI/bge-small-en-v1.5... 33.2 MB
[FastEmbed] Model cached at .flindb/models/BAAI--bge-small-en-v1.5/
[FastEmbed] Ready. First embedding: 12ms
```

For deployment, the model files should be included in the application bundle or pre-downloaded in the deployment script. FLIN will not attempt to download models in production if the cache directory already contains them.
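The cached directory name replaces the `/` in the Hugging Face model id with `--`. A deployment script that pre-downloads models can derive the expected cache path the same way; a minimal sketch (the helper name is hypothetical, not from the FLIN source):

```rust
// Hypothetical helper: map a Hugging Face model id to its local cache
// directory, following the `BAAI/bge-small-en-v1.5` ->
// `.flindb/models/BAAI--bge-small-en-v1.5` convention shown above.
fn model_cache_dir(model_id: &str) -> String {
    format!(".flindb/models/{}", model_id.replace('/', "--"))
}

fn main() {
    println!("{}", model_cache_dir("BAAI/bge-small-en-v1.5"));
    // A deploy script could check this path and skip the download if the
    // directory already contains model.onnx.
}
```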
Benchmarks: FastEmbed vs Cloud APIs
| Metric | FastEmbed (local) | OpenAI API | Cohere API |
|---|---|---|---|
| Latency (single) | 12 ms | 150 ms | 120 ms |
| Latency (batch 100) | 180 ms | 800 ms | 600 ms |
| Cost per 1M embeddings | $0 (hardware only) | $0.02-$0.13 | $0.10 |
| Privacy | Full (no data sent) | Data sent to OpenAI | Data sent to Cohere |
| Offline capable | Yes | No | No |
| Accuracy (MTEB avg) | 0.62 (small) | 0.63 (ada-002) | 0.64 (v3) |
FastEmbed matches cloud API quality within 2-3% while being 10x faster and completely private.
Hybrid Approach
FLIN supports using different embedding providers for different purposes:
```
ai {
    // FastEmbed for semantic text fields (fast, private)
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"
    }

    // Cloud API for Intent Engine (needs LLM, not just embeddings)
    provider = "openai"
    model = "gpt-4o-mini"
    api_key = env("OPENAI_API_KEY")
}
```
Semantic search uses FastEmbed (local, fast). The Intent Engine uses the cloud LLM (for natural language understanding). This hybrid approach gives the best of both worlds: fast search with private data, and powerful intent translation when needed.
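One way to picture the hybrid routing is a dispatch on the kind of AI task: embedding work stays local, generation goes to the cloud. A simplified sketch; the enum and names are illustrative, not FLIN's internal API:

```rust
// Illustrative sketch of hybrid provider routing. Types and names are
// hypothetical; FLIN's internal dispatch may differ.
#[derive(Debug, PartialEq)]
enum Provider {
    FastEmbedLocal, // embeddings: fast, private, offline-capable
    OpenAiCloud,    // generation: needs a full LLM
}

enum AiTask {
    Embed,    // semantic text fields and search queries
    Generate, // Intent Engine: natural language -> query
}

fn route(task: &AiTask) -> Provider {
    match task {
        AiTask::Embed => Provider::FastEmbedLocal,
        AiTask::Generate => Provider::OpenAiCloud,
    }
}

fn main() {
    println!("{:?}", route(&AiTask::Embed));
    println!("{:?}", route(&AiTask::Generate));
}
```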
Multilingual Embeddings
For applications serving multilingual content (common in Africa where users switch between French, English, and local languages), multilingual embedding models are available:
```
ai {
    embedding {
        provider = "fastembed"
        model = "BAAI/bge-small-en-v1.5"  // English
        // Future: BAAI/bge-m3 for multilingual
    }
}
```

The BGE-M3 model (when supported) handles over 100 languages in a single embedding space. A search for "chaise de bureau confortable" (French) would find products described in English as "comfortable office chair" because the meanings map to the same vector region.
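Cross-language matching works because search compares vectors, not words: if the French and English descriptions embed near each other, their cosine similarity is high. A minimal sketch of the comparison, using toy vectors in place of real 384-dimension embeddings:

```rust
// Cosine similarity between two embedding vectors. With a multilingual
// model, "chaise de bureau" and "office chair" would embed to nearby
// vectors and score close to 1.0; unrelated texts score near 0.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 3-dimensional vectors standing in for real embeddings.
    let fr = [0.9, 0.1, 0.2]; // "chaise de bureau confortable"
    let en = [0.8, 0.2, 0.2]; // "comfortable office chair"
    println!("{:.3}", cosine_similarity(&fr, &en)); // close to 1.0
}
```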
Why Local Embeddings Matter for Africa
Two practical reasons make local embeddings essential for FLIN's target market:
Internet reliability. Many African developers work with intermittent connectivity. A cloud-dependent embedding pipeline means semantic search stops working when the internet drops. FastEmbed works offline.
Data sovereignty. Enterprise customers in regulated industries (banking, healthcare, government) require that data does not leave their infrastructure. Local embeddings satisfy this requirement without sacrificing functionality.
FastEmbed transforms semantic search from a cloud dependency into a local capability. The embedding model is as much a part of the FLIN binary as the HTTP server or the database engine -- always available, always fast, always private.
In the next article, we explore RAG (Retrieval-Augmented Generation) -- how FLIN combines semantic search with LLM generation to answer questions from your application's data.
---
This is Part 119 of the "How We Built FLIN" series, documenting how a CEO in Abidjan and an AI CTO designed and built a programming language from scratch.
Series Navigation:
- [118] AI Gateway: 8 Providers, One API
- [119] FastEmbed Integration for Embeddings (you are here)
- [120] RAG: Retrieval, Reranking, and Source Attribution
- [121] Document Parsing: PDF, DOCX, CSV, JSON, YAML