
Graph Queries and Semantic Search

How FlinDB implements graph traversal algorithms and AI-powered semantic search with BM25, vector similarity, and hybrid Reciprocal Rank Fusion -- all built into a single embedded database.

Thales & Claude | March 25, 2026 | 9 min read

Tags: flin, flindb, graph, semantic-search, ai

Phase 3 and Phase 4 of Session 166 added capabilities that most databases charge extra for. Graph queries -- shortest path, PageRank, connected components, cycle detection, topological sort -- are typically the domain of dedicated graph databases like Neo4j. Semantic search -- vector embeddings, BM25 keyword ranking, hybrid retrieval -- is typically the domain of dedicated search engines like Elasticsearch or vector databases like Pinecone.

FlinDB has both. Built into the same embedded database. No additional services. No network calls. No configuration. Because if you are building an AI-powered application in FLIN, you should not need to set up three different databases to store data, traverse relationships, and search semantically.

Graph Queries: Treating Relationships as Edges

FlinDB's entity reference system naturally forms a graph. Every reference from one entity to another is an edge. Every entity is a node. The graph query engine operates on this implicit graph without requiring a separate graph model.

Shortest Path: BFS Between Entities

The find_path() method uses breadth-first search to find the shortest path between two entities through their references:

pub fn find_path(
    &self,
    entity_type: &str,
    from_id: u64,
    to_id: u64,
) -> DatabaseResult<Option<Vec<u64>>> {
    let mut visited = HashSet::new();
    let mut queue = VecDeque::new();
    let mut parent = HashMap::new();

    queue.push_back(from_id);
    visited.insert(from_id);

    while let Some(current) = queue.pop_front() {
        if current == to_id {
            return Ok(Some(reconstruct_path(&parent, from_id, to_id)));
        }

        // Find all entities that this entity references
        if let Ok(entity) = self.find_by_id(entity_type, current) {
            for (_field, value) in &entity.fields {
                if let Value::EntityRef(_, ref_id) = value {
                    if !visited.contains(ref_id) {
                        visited.insert(*ref_id);
                        parent.insert(*ref_id, current);
                        queue.push_back(*ref_id);
                    }
                }
            }
        }
    }

    Ok(None) // No path found
}
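The `reconstruct_path` helper is referenced but not shown. A minimal sketch of how such a helper could look, assuming the parent map records each node's BFS predecessor:

```rust
use std::collections::HashMap;

// Hypothetical helper: walk the parent map backwards from the target,
// then reverse to produce the path from `from_id` to `to_id`.
fn reconstruct_path(parent: &HashMap<u64, u64>, from_id: u64, to_id: u64) -> Vec<u64> {
    let mut path = vec![to_id];
    let mut current = to_id;
    while current != from_id {
        current = parent[&current]; // BFS guarantees every visited node has a parent
        path.push(current);
    }
    path.reverse();
    path
}

fn main() {
    // BFS discovered 2 via 1, and 4 via 2, so the path 1 -> 4 is [1, 2, 4]
    let parent = HashMap::from([(2u64, 1u64), (3, 1), (4, 2)]);
    assert_eq!(reconstruct_path(&parent, 1, 4), vec![1, 2, 4]);
}
```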

Use case: social networks ("how is Thales connected to this other user?"), organizational hierarchies ("what is the reporting chain from this employee to the CEO?"), dependency graphs ("what is the shortest dependency path between these two packages?").

Multi-Hop Traversal

traverse() walks the relationship graph to a configurable depth:

let related = db.traverse("Category", start_id, "parent", 3)?;
// Returns all categories reachable within 3 hops through the "parent" reference

This is essential for hierarchical data. Given a root category "Electronics", a 3-hop traversal through the parent reference finds "Electronics" -> "Phones" -> "Smartphones" -> "iPhone 16". Without graph traversal, you would need recursive queries or multiple round trips.
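The traversal itself amounts to a depth-limited BFS. A sketch over an explicit edge list (an illustrative stand-in; the real traverse() discovers edges by following entity references):

```rust
use std::collections::{HashSet, VecDeque};

// Depth-limited BFS: collect every node reachable within `max_hops`
// edges of `start`, never expanding a node past the hop limit.
fn traverse(edges: &[(u64, u64)], start: u64, max_hops: usize) -> HashSet<u64> {
    let mut visited = HashSet::from([start]);
    let mut queue = VecDeque::from([(start, 0usize)]);
    while let Some((node, depth)) = queue.pop_front() {
        if depth == max_hops {
            continue; // at the hop limit: keep the node, do not expand it
        }
        for &(from, to) in edges {
            if from == node && visited.insert(to) {
                queue.push_back((to, depth + 1));
            }
        }
    }
    visited
}

fn main() {
    // Electronics(1) -> Phones(2) -> Smartphones(3) -> iPhone 16(4) -> Case(5)
    let edges = [(1, 2), (2, 3), (3, 4), (4, 5)];
    // Node 5 is four hops away, so a 3-hop traversal excludes it
    assert_eq!(traverse(&edges, 1, 3), HashSet::from([1, 2, 3, 4]));
}
```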

PageRank: Influence Scoring

FlinDB implements the PageRank algorithm for computing influence scores across entity graphs:

PR(u) = (1-d) + d * SUM(PR(v) / out_degree(v))

The implementation uses configurable iteration count and damping factor:

pub fn page_rank(
    &self,
    entity_type: &str,
    ref_field: &str,
    iterations: usize,     // default 20
    damping: f64,           // default 0.85
) -> DatabaseResult<HashMap<u64, f64>>

Use case: ranking content by importance ("which articles are most referenced by other articles?"), identifying key entities ("which users have the most connections?"), prioritizing search results.
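The formula above can be sketched as a power iteration over an edge list. This is an illustrative stand-in, not FlinDB's implementation (which derives its edges from the ref_field):

```rust
use std::collections::HashMap;

// Power-iteration PageRank matching PR(u) = (1-d) + d * SUM(PR(v) / out_degree(v)).
fn page_rank(nodes: &[u64], edges: &[(u64, u64)], iterations: usize, damping: f64) -> HashMap<u64, f64> {
    let out_degree: HashMap<u64, f64> = nodes.iter()
        .map(|&n| (n, edges.iter().filter(|&&(f, _)| f == n).count() as f64))
        .collect();
    let mut rank: HashMap<u64, f64> = nodes.iter().map(|&n| (n, 1.0)).collect();
    for _ in 0..iterations {
        // Every node starts each round with the (1 - d) base score
        let mut next: HashMap<u64, f64> = nodes.iter().map(|&n| (n, 1.0 - damping)).collect();
        for &(from, to) in edges {
            // Each node distributes its rank evenly across its outgoing edges
            *next.get_mut(&to).unwrap() += damping * rank[&from] / out_degree[&from];
        }
        rank = next;
    }
    rank
}

fn main() {
    // Nodes 2 and 3 both reference node 1, so node 1 accumulates the most rank
    let ranks = page_rank(&[1, 2, 3], &[(2, 1), (3, 1)], 20, 0.85);
    assert!(ranks[&1] > ranks[&2]);
}
```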

Connected Components

connected_components() identifies groups of entities that are connected through references:

let components = db.connected_components("User", "follows")?;
// Returns: [[1, 2, 3], [4, 5], [6]]
// Three groups of connected users

Use case: community detection in social networks, identifying isolated data clusters, finding disconnected subgraphs.
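One standard way to compute such components is a flood fill that treats every reference as an undirected edge. A sketch under that assumption (node and edge lists stand in for entity references):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Undirected connected components via BFS flood fill.
fn connected_components(nodes: &[u64], edges: &[(u64, u64)]) -> Vec<Vec<u64>> {
    // Build an undirected adjacency list
    let mut adj: HashMap<u64, Vec<u64>> = HashMap::new();
    for &(a, b) in edges {
        adj.entry(a).or_default().push(b);
        adj.entry(b).or_default().push(a);
    }
    let mut seen = HashSet::new();
    let mut components = Vec::new();
    for &start in nodes {
        if !seen.insert(start) {
            continue; // already assigned to a component
        }
        let mut component = vec![start];
        let mut queue = VecDeque::from([start]);
        while let Some(node) = queue.pop_front() {
            for &next in adj.get(&node).into_iter().flatten() {
                if seen.insert(next) {
                    component.push(next);
                    queue.push_back(next);
                }
            }
        }
        components.push(component);
    }
    components
}

fn main() {
    // Mirrors the example above: {1, 2, 3}, {4, 5}, and isolated {6}
    let comps = connected_components(&[1, 2, 3, 4, 5, 6], &[(1, 2), (2, 3), (4, 5)]);
    assert_eq!(comps.len(), 3);
}
```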

Cycle Detection

find_cycles() detects circular references in entity graphs:

let cycles = db.find_cycles("Task", "depends_on")?;
// Returns: [[1, 3, 5, 1]]
// Task 1 depends on 3, which depends on 5, which depends on 1

Use case: dependency management ("do these tasks have circular dependencies?"), data validation ("is this category hierarchy acyclic?"), workflow engines ("will this process loop forever?").
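Directed cycle detection is classically a DFS with three-color marking; a sketch of that technique (an illustrative stand-in for the find_cycles() internals, returning only whether a cycle exists):

```rust
use std::collections::HashMap;

// DFS with coloring: 0 = unvisited, 1 = on the current path, 2 = done.
// An edge into a color-1 node is a back edge, i.e. a cycle.
fn has_cycle(nodes: &[u64], edges: &[(u64, u64)]) -> bool {
    fn dfs(node: u64, edges: &[(u64, u64)], color: &mut HashMap<u64, u8>) -> bool {
        color.insert(node, 1);
        for &(from, to) in edges {
            if from == node {
                let state = color.get(&to).copied().unwrap_or(0);
                if state == 1 || (state == 0 && dfs(to, edges, color)) {
                    return true;
                }
            }
        }
        color.insert(node, 2);
        false
    }
    let mut color = HashMap::new();
    nodes.iter().any(|&n| color.get(&n).copied().unwrap_or(0) == 0 && dfs(n, edges, &mut color))
}

fn main() {
    // Task 1 -> 3 -> 5 -> 1 forms the cycle from the example above
    assert!(has_cycle(&[1, 3, 5], &[(1, 3), (3, 5), (5, 1)]));
    assert!(!has_cycle(&[1, 2], &[(1, 2)]));
}
```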

Topological Sort

topological_sort() orders entities so that dependencies come before dependents:

let ordered = db.topological_sort("Task", "depends_on")?;
// Returns tasks in execution order: [5, 3, 1, 2, 4]

Use case: build systems, task scheduling, migration ordering, course prerequisite planning.
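Kahn's algorithm is one standard way to produce such an ordering: repeatedly emit nodes with no unresolved dependencies. A sketch under that assumption (not necessarily FlinDB's exact approach):

```rust
use std::collections::{HashMap, VecDeque};

// Kahn's algorithm. An edge (a, b) means "a depends on b", so b comes first.
// Returns None if the graph contains a cycle.
fn topological_sort(nodes: &[u64], depends_on: &[(u64, u64)]) -> Option<Vec<u64>> {
    // Count how many unresolved dependencies each node has
    let mut pending: HashMap<u64, usize> = nodes.iter()
        .map(|&n| (n, depends_on.iter().filter(|&&(a, _)| a == n).count()))
        .collect();
    let mut queue: VecDeque<u64> = nodes.iter().copied().filter(|n| pending[n] == 0).collect();
    let mut order = Vec::new();
    while let Some(done) = queue.pop_front() {
        order.push(done);
        for &(a, b) in depends_on {
            if b == done {
                let count = pending.get_mut(&a).unwrap();
                *count -= 1;
                if *count == 0 {
                    queue.push_back(a);
                }
            }
        }
    }
    // If some nodes never reached zero pending deps, the graph has a cycle
    (order.len() == nodes.len()).then_some(order)
}

fn main() {
    // 1 depends on 3, and 3 depends on 5: execution order must start with 5
    assert_eq!(topological_sort(&[1, 3, 5], &[(1, 3), (3, 5)]), Some(vec![5, 3, 1]));
}
```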

Graph Statistics

graph_stats() provides centrality metrics for the entity graph:

pub struct GraphStats {
    pub node_count: usize,
    pub edge_count: usize,
    pub avg_in_degree: f64,
    pub avg_out_degree: f64,
    pub max_in_degree: usize,
    pub max_out_degree: usize,
    pub highest_in_degree_node: Option<u64>,
    pub highest_out_degree_node: Option<u64>,
}

These statistics answer questions like "how connected is this graph?", "which entity has the most incoming references?", and "what is the average number of relationships per entity?".

Semantic Search: AI-Powered Queries

Phase 4 of Session 166 added semantic search -- the ability to find entities based on meaning rather than exact keyword matches. This is the feature that makes FlinDB an AI-native database.

BM25 Keyword Ranking

Before vector search, FlinDB implements BM25 -- the same ranking algorithm used by Elasticsearch and Solr. BM25 scores documents by term frequency, inverse document frequency, and document length:

score(D, Q) = SUM( IDF(qi) * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * |D|/avgdl)) )

where:
- IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1), with N the total document count and n(qi) the number of documents containing qi
- tf = frequency of qi in document D; |D| = length of D; avgdl = average document length
- k1 = 1.2 (term frequency saturation)
- b = 0.75 (length normalization)

The BM25Index maintains an inverted index of terms to document IDs, along with document frequency statistics:

pub struct BM25Index {
    inverted_index: HashMap<String, Vec<(u64, f64)>>,  // term -> [(doc_id, tf)]
    doc_lengths: HashMap<u64, usize>,
    avg_doc_length: f64,
    total_docs: usize,
}

BM25 search returns entities ranked by keyword relevance:

db.keyword_search("comfortable office chair", "Product", "description", 10)?;
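The scoring formula can be exercised in isolation with a small stand-alone function (an illustrative sketch, not the BM25Index internals; k1 and b use the defaults quoted above):

```rust
// BM25 contribution of a single (term, document) pair, per the formula above.
fn bm25_term_score(tf: f64, doc_len: f64, avg_doc_len: f64, n_docs: f64, docs_with_term: f64) -> f64 {
    let k1 = 1.2; // term frequency saturation
    let b = 0.75; // length normalization
    let idf = ((n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1.0).ln();
    idf * (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len))
}

fn main() {
    // A rare term (in 1 of 100 docs) outscores a common one (in 90 of 100)
    let rare = bm25_term_score(2.0, 100.0, 120.0, 100.0, 1.0);
    let common = bm25_term_score(2.0, 100.0, 120.0, 100.0, 90.0);
    assert!(rare > common);
}
```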

Vector Similarity Search

For semantic understanding beyond keyword matching, FlinDB generates vector embeddings for semantic text fields. The embedding configuration is pluggable:

pub enum EmbeddingConfig {
    Mock,                     // Hash-based, for testing
    OpenAI { api_key: String, model: String },
    Local { model_path: String },
}

When an entity with a semantic text field is saved, FlinDB automatically generates an embedding and stores it in the vector index:

// Setup
db.enable_semantic_search();
db.add_semantic_field("Product", "description");

// Save triggers automatic embedding
db.save("Product", None, {
    "name": "Ergonomic Office Chair",
    "description": "Comfortable seating with lumbar support..."
})?;

// Search by meaning
db.semantic_search(
    "comfortable seating for work",
    "Product",
    "description",
    10
)?;

The query "comfortable seating for work" has no words in common with "Ergonomic Office Chair with lumbar support," but semantic search finds it because the meaning is similar. The search converts the query to a vector, then finds the nearest neighbors in the vector index using cosine similarity.
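Cosine similarity itself is only a few lines, as commonly defined (the index lookup around it is internal to FlinDB):

```rust
// Cosine similarity: the dot product of two vectors divided by the
// product of their magnitudes. 1.0 = same direction, 0.0 = orthogonal.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f64 = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b: f64 = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    assert!((cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-9);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-9);
}
```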

Hybrid Search: Best of Both Worlds

Pure keyword search misses synonyms and paraphrases. Pure semantic search can miss exact matches and specific terminology. Hybrid search combines both using Reciprocal Rank Fusion (RRF):

pub fn hybrid_search(
    &self,
    query: &str,
    entity_type: &str,
    field: &str,
    limit: usize,
) -> DatabaseResult<Vec<HybridSearchResult>>

RRF merges the keyword ranking and the semantic ranking by reciprocal rank position:

RRF_score(d) = 1/(k + rank_keyword(d)) + 1/(k + rank_semantic(d))

Where k is a constant (typically 60) that prevents any single ranking from dominating. An entity that ranks #1 in keyword search and #5 in semantic search scores higher than an entity that ranks #2 in both -- because the #1 keyword ranking indicates a strong exact match.

results = db.hybrid_search(
    "comfortable seating for work",
    "Product",
    "description",
    10
)

for r in results {
    print("{r.entity_id}: keyword={r.keyword_score}, semantic={r.semantic_score}")
}
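The fusion step can be sketched independently of the database. The function below is an illustrative stand-in for the merge inside hybrid_search, applying the RRF formula above with k = 60:

```rust
use std::collections::HashMap;

// Reciprocal Rank Fusion: each document scores 1/(k + rank) per ranking
// it appears in (ranks are 1-based); results are sorted best-first.
fn rrf_fuse(keyword: &[u64], semantic: &[u64], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in [keyword, semantic] {
        for (rank, &id) in ranking.iter().enumerate() {
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    // Doc 7 appears in both rankings, so it fuses ahead of single-list docs
    let fused = rrf_fuse(&[7, 2, 3], &[9, 7], 60.0);
    assert_eq!(fused[0].0, 7);
}
```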

Persistence

The search indexes are persisted to disk and loaded on startup:

db.save_semantic_indexes(dir)?;  // Save BM25 + vector indexes
db.load_semantic_indexes(dir)?;  // Load on restart
db.reindex_semantic()?;          // Rebuild indexes from data

This ensures that search capabilities survive server restarts without re-embedding all entities.

Why These Features Belong Together

A traditional architecture for an AI-powered application with graph relationships would require:

1. PostgreSQL for relational data (CRUD, transactions)
2. Neo4j for graph queries (shortest path, PageRank)
3. Elasticsearch for keyword search (BM25, full-text)
4. Pinecone for vector search (semantic similarity)

Four databases. Four connection strings. Four schemas. Four deployment targets. Four sets of credentials. Four potential points of failure.

FlinDB provides all four capabilities in a single embedded database:

// CRUD
save user

// Graph
path = db.find_path("User", user1.id, user2.id)

// Keyword search
results = db.keyword_search("office chair", "Product", "description", 10)

// Semantic search
results = db.semantic_search("comfortable seating", "Product", "description", 10)

// Hybrid search
results = db.hybrid_search("comfortable office chair", "Product", "description", 10)

No additional services. No network latency between services. No data synchronization between databases. The data lives in one place, and all query modalities operate on the same data.

The Forty-Four Semantic Search Tests

The semantic search implementation alone required forty-four tests -- the most tests of any single feature in FlinDB:

BM25 tests (15): index construction, term frequency computation, inverse document frequency, document length normalization, multi-term queries, empty index, single document, duplicate terms, relevance ordering.

Vector search tests (8): embedding generation, cosine similarity, nearest neighbor, multiple results, empty index, dimension mismatch, threshold filtering.

Hybrid search tests (7): RRF computation, mixed rankings, keyword-only matches, semantic-only matches, both-match boosting, empty results, limit enforcement.

ZeroCore integration tests (7): enable/disable semantic search, auto-indexing on save, field registration, persistence roundtrip, reindexing.

Graph query tests (11): shortest path, multi-hop traversal, PageRank convergence, connected components, cycle detection, topological sort, graph statistics.

Combined with the 12 transaction tests, Session 166 added 94 tests in total -- bringing the running count to 2,270.

Performance Considerations

Graph queries and semantic search have different performance profiles:

Graph queries are bounded by the graph size and connectivity. BFS shortest path is O(V + E) where V is vertices and E is edges. PageRank is O(iterations * E). For FlinDB's embedded use case (thousands to tens of thousands of entities), these are sub-millisecond operations.

BM25 search is O(k * |V|) where k is the number of query terms and V is the vocabulary size. For moderate document collections, this is fast. For large collections, the inverted index keeps it efficient.

Vector search is O(n * d) where n is the number of documents and d is the embedding dimension. This is a brute-force scan. For large collections (millions of documents), an approximate nearest neighbor index like HNSW would be needed. For FlinDB's target scale, brute-force is sufficient and avoids the complexity of maintaining a separate index structure.

Hybrid search combines BM25 and vector search costs, plus an O(n log n) sort for RRF fusion. The fusion overhead is negligible compared to the search costs.

These performance characteristics are appropriate for an embedded database. FlinDB is not competing with Elasticsearch for billion-document search or Neo4j for billion-edge graph traversal. It is providing these capabilities at application scale -- where the convenience of having everything in one database far outweighs the performance advantage of specialized systems.

---

This is Part 9 of the "How We Built FlinDB" series, documenting how we built a complete embedded database engine for the FLIN programming language.

Series Navigation:

- [062] Relationships and Eager/Lazy Loading
- [063] Transactions and Continuous Backup
- [064] Graph Queries and Semantic Search (you are here)
- [065] The EAVT Storage Model
- [066] Database Encryption and Configuration
