Chapter 11: Graph Storage and Operations
Chapter 3 established the mathematical foundations for graph structures within Cognica's unified algebra, demonstrating that graph posting lists are isomorphic to document posting lists and that traversal, pattern matching, and path queries all produce sets amenable to Boolean composition. This chapter shifts from theory to implementation: how Cognica materializes the property graph model as document collections, how CRUD operations maintain graph invariants, how traversal algorithms navigate the adjacency structure, and how an in-memory adjacency cache accelerates multi-hop queries.
11.1 Property Graph Model in Cognica
11.1.1 From Algebra to Implementation
Chapter 3 defined a property graph as a tuple with vertices, edges, property assignments, labels, and edge types. The challenge of implementation is to map this abstract structure onto Cognica's concrete storage layer — document collections backed by LSM-trees — while preserving the algebraic properties that enable cross-paradigm query composition.
The central observation is that vertices and edges are both semi-structured entities carrying key-value properties, exactly like documents. A vertex with label "Person" and properties {name: "Alice", age: 30} differs from a document only in the additional graph metadata it carries (label, graph membership) and the structural relationships (edges) it participates in. This observation motivates a storage strategy where graph entities are stored as documents in specialized collections, leveraging all existing infrastructure for indexing, querying, and transaction management.
11.1.2 Property Graph Definition
Cognica implements the labeled property graph model, where:
- Vertices (nodes) carry a single label and a set of key-value properties
- Edges (relationships) carry a single type and a set of key-value properties
- Edges are directed: each edge has a source vertex and a target vertex
- Both vertices and edges have system-assigned unique identifiers
- Multiple edges of the same or different types may connect the same pair of vertices
This model is formalized as a tuple G = (V, E, id, λ, τ, π), where:
- V is the set of vertices
- E is the set of typed, directed edges, each with a source and target in V
- id assigns unique identifiers to vertices and edges
- λ : V → L assigns a label to each vertex
- τ : E → T assigns a type to each edge
- π assigns key-value properties to vertices and edges
11.1.3 Design Decisions: Why Property Graph
Several graph data models were evaluated during the design of Cognica's graph layer:
| Model | Strengths | Weaknesses |
|---|---|---|
| Property Graph | Rich properties, intuitive, industry standard | Multi-label support varies |
| RDF (Subject-Predicate-Object) | Standardized (SPARQL), reasoning | Verbose, no properties on edges natively |
| Hypergraph | Edges connect any number of vertices | Complexity, limited tooling |
| Bipartite Graph | Clean for specific domains | Too restrictive for general use |
The property graph model was selected for three reasons:
1. Document-vertex isomorphism: Properties on vertices and edges map directly to JSON document fields, enabling reuse of the document storage engine without an impedance mismatch.
2. Industry adoption: Neo4j, Amazon Neptune, Apache AGE, and TigerGraph all use the property graph model. Adopting it enables Apache AGE-compatible Cypher query support (described in the Cypher design document) and reduces the learning curve for users migrating from these systems.
3. Expressive power: The property graph model naturally represents the entities and relationships found in social networks, knowledge graphs, fraud detection, and recommendation systems — the dominant use cases for graph databases.
11.2 Storage Architecture
11.2.1 Dual-Collection Model
Each graph in Cognica is stored as a pair of document collections:
- `{graph}_nodes` — stores all vertices
- `{graph}_edges` — stores all edges
For example, a graph named social creates collections social_nodes and social_edges. This naming convention is enforced by constants in the implementation:
constexpr auto kNodesSuffix = "_nodes";
constexpr auto kEdgesSuffix = "_edges";
(graph_functions.cpp:34-35)
The dual-collection model has several advantages over alternatives:
- Separation of concerns: Node and edge schemas differ structurally, and separate collections allow independent indexing strategies.
- Query selectivity: Queries that access only nodes (or only edges) scan a single collection, avoiding the overhead of filtering a mixed entity store.
- Existing infrastructure: Each collection inherits the full capabilities of Cognica's document storage layer — primary and secondary indexes, cursor-based iteration, transactional operations — without requiring graph-specific storage code.
11.2.2 Document-Vertex Correspondence
Chapter 3 proved that graph posting lists are isomorphic to document posting lists. The storage architecture realizes this isomorphism concretely: every vertex is a document, and every document in a _nodes collection is a vertex. The mapping preserves identity (the document _id is the vertex identifier) and properties (document fields are vertex properties).
This correspondence means that graph vertices are queryable through the standard SQL interface:
-- Query vertices directly as documents
SELECT name, age FROM social_nodes WHERE _label = 'Person' AND age > 25;
The dual nature — vertices as documents, documents as vertices — enables the cross-paradigm queries that Chapter 3 motivated algebraically.
11.2.3 Node Document Format
A node document stores graph metadata in system fields (prefixed with underscore) and user properties as top-level fields:
{
"_id": "social:Person:1001",
"_graph": "social",
"_label": "Person",
"name": "Alice",
"age": 30,
"email": "alice@example.com"
}
The system fields are:
| Field | Type | Description |
|---|---|---|
| `_id` | String | Unique vertex identifier |
| `_graph` | String | Graph namespace membership |
| `_label` | String | Vertex label (e.g., "Person", "Company") |
All other fields are user-defined properties, stored flat at the top level of the document.
11.2.4 Edge Document Format
An edge document follows a similar pattern, with additional fields recording the source and target vertices:
{
"_id": "social:KNOWS:42",
"_graph": "social",
"_type": "KNOWS",
"_source": "social:Person:1001",
"_target": "social:Person:1002",
"since": 2020,
"strength": 0.85
}
The system fields are:
| Field | Type | Description |
|---|---|---|
| `_id` | String | Unique edge identifier |
| `_graph` | String | Graph namespace membership |
| `_type` | String | Edge type (e.g., "KNOWS", "WORKS_AT") |
| `_source` | String | Source vertex `_id` |
| `_target` | String | Target vertex `_id` |
Edge properties (since, strength in the example) are stored flat alongside the system fields.
11.2.5 Identifier Scheme
Vertex and edge identifiers follow the format {graph}:{label_or_type}:{sequence}:
social:Person:1 -- first Person vertex in the social graph
social:Company:1 -- first Company vertex in the social graph
social:KNOWS:1 -- first KNOWS edge in the social graph
social:WORKS_AT:1 -- first WORKS_AT edge in the social graph
The identifier generation is implemented with per-key atomic sequence counters:
auto make_node_id_(const std::string& graph, const std::string& label)
-> std::string {
auto seq = get_next_sequence_(graph + ":" + label);
return graph + ":" + label + ":" + std::to_string(seq);
}
(graph_functions.cpp:163-167)
This scheme provides several benefits:
- Namespacing: The graph prefix prevents identifier collisions across graphs.
- Type encoding: The label/type segment enables efficient prefix-based filtering.
- Monotonicity: Sequence numbers ensure identifiers are monotonically increasing within each category, which provides good LSM-tree write locality.
- Human readability: Unlike opaque integer or UUID identifiers, the structured format aids debugging and interactive querying.
11.2.6 Flat Property Storage
A key design decision is storing properties flat at the top level of the document rather than nested under a properties sub-object. Consider the two alternatives:
Cognica (flat storage):
{
"_id": "social:Person:1",
"_label": "Person",
"_graph": "social",
"name": "Alice",
"age": 30
}
Alternative (nested storage, used by Apache AGE internally):
{
"id": "social:Person:1",
"label": "Person",
"properties": {
"name": "Alice",
"age": 30
}
}
Flat storage was chosen for three reasons:
1. Index utilization: Cognica's secondary indexes operate on top-level document fields. Flat storage means `name` and `age` are directly indexable. Nested storage would require JSON path indexes or specialized index types.
2. Query simplicity: With flat storage, a Cypher expression `n.name` translates to a direct column reference `t0.name` in SQL. Nested storage would require `t0.properties->>'name'`, adding JSON extraction overhead.
3. Document-vertex unification: Flat storage means node documents are structurally identical to ordinary documents with a few extra system fields. This makes graph collections directly queryable with standard SQL, fulfilling the cross-paradigm promise.
The only cost is that system fields (prefixed with _) must be excluded when constructing a properties-only view for API compatibility. The implementation handles this with a simple prefix check:
for (auto it = doc.MemberBegin(); it != doc.MemberEnd(); ++it) {
const auto* name = it->name.GetString();
if (name[0] != '_') {
// This is a user property, include in output
}
}
(graph_functions.cpp:352-355)
11.2.7 Index Strategy
The graph_create function creates both collections with carefully chosen indexes:
Node collection indexes:
| Index Name | Fields | Purpose |
|---|---|---|
| Primary | `_id` | Vertex lookup by identifier |
| `idx_label` | `_label` | Filter vertices by label |
| `idx_graph` | `_graph` | Filter vertices by graph namespace |
Edge collection indexes:
| Index Name | Fields | Purpose |
|---|---|---|
| Primary | `_id` | Edge lookup by identifier |
| `idx_type` | `_type` | Filter edges by type |
| `idx_graph` | `_graph` | Filter edges by graph namespace |
| `idx_src_type` | `_source`, `_type` | Outgoing edge traversal with type filter |
| `idx_tgt_type` | `_target`, `_type` | Incoming edge traversal with type filter |
The composite indexes idx_src_type and idx_tgt_type are particularly important for graph traversal performance. When expanding outgoing edges from a vertex v with a type filter, the query {_source: v, _type: T} can be resolved entirely through the idx_src_type composite index, avoiding a full collection scan.
11.3 Graph Management
11.3.1 Graph Creation
The graph_create function atomically creates both the node and edge collections with their respective schemas and indexes:
SELECT * FROM graph_create('social');
The implementation first checks whether a graph with the given name already exists by verifying the presence of both collections. If either already exists, the operation fails with an error rather than silently succeeding or partially creating the graph.
auto nodes_schema
= db::document::SchemaBuilder {}
.set_collection_name(nodes_collection_name_(*graph_name))
.set_primary_key({kFieldId})
.add_secondary_key("idx_label", {kFieldLabel}, false)
.add_secondary_key("idx_graph", {kFieldGraph}, false)
.build();
(graph_functions.cpp:479-485)
The edge schema includes the composite indexes for efficient traversal:
auto edges_schema
= db::document::SchemaBuilder {}
.set_collection_name(edges_collection_name_(*graph_name))
.set_primary_key({kFieldId})
.add_secondary_key("idx_type", {kFieldType}, false)
.add_secondary_key("idx_graph", {kFieldGraph}, false)
.add_secondary_key("idx_src_type",
{kFieldSource, kFieldType}, false)
.add_secondary_key("idx_tgt_type",
{kFieldTarget, kFieldType}, false)
.build();
(graph_functions.cpp:492-502)
If the edges collection creation fails, the already-created nodes collection is dropped to maintain atomicity.
11.3.2 Graph Deletion
Graph deletion removes both collections and invalidates all adjacency cache entries for the graph:
SELECT * FROM graph_drop('social');
The order of operations matters: the adjacency cache is invalidated first, then the edges collection is dropped, then the nodes collection. This ordering ensures that no traversal operation can read cached data pointing to dropped collections.
GraphAdjacencyCache::instance().invalidate_graph(*graph_name);
auto edges_dropped
= doc_db->drop_collection(edges_collection_name_(*graph_name));
auto nodes_dropped
= doc_db->drop_collection(nodes_collection_name_(*graph_name));
(graph_functions.cpp:538-543)
11.3.3 Graph Discovery
Two functions support graph discovery:
graph_list() returns all graphs in the database by scanning collection names for the _nodes suffix and verifying that a corresponding _edges collection exists:
SELECT * FROM graph_list();
-- Returns: social, company, knowledge_base, ...
graph_exists(name) checks whether a specific graph exists:
SELECT * FROM graph_exists('social');
-- Returns: true or false
The existence check verifies both collections are present, not just one. This prevents a partially created or corrupted graph from appearing as valid.
11.4 Node and Edge Operations
11.4.1 Node Creation
Creating a node inserts a document into the _nodes collection with system fields and user properties:
SELECT * FROM graph_create_node('social', 'Person',
'{"name": "Alice", "age": 30}');
-- Returns: social:Person:1
The implementation generates a unique identifier, constructs a document with system fields, copies user properties to the top level, and inserts the document:
auto node_id = make_node_id_(*graph_name, *label);
auto doc = db::document::Document {};
add_string_member_(doc, kFieldId, node_id);
add_string_member_(doc, kFieldGraph, *graph_name);
add_string_member_(doc, kFieldLabel, *label);
// Copy user properties to top level
for (auto it = properties->MemberBegin();
it != properties->MemberEnd(); ++it) {
// ... copy key-value pair to doc
}
(graph_functions.cpp:634-650)
The function validates that the graph exists before insertion and returns the generated node identifier on success.
11.4.2 Node Retrieval
A node can be retrieved by its identifier:
SELECT * FROM graph_get_node('social', 'social:Person:1');
The result includes three columns compatible with the Apache AGE format:
| Column | Type | Description |
|---|---|---|
| `id` | String | Vertex identifier |
| `label` | String | Vertex label |
| `properties` | JSONB | User properties (system fields excluded) |
The properties column is constructed by iterating over the document and filtering out fields whose names begin with _.
11.4.3 Node Querying
The graph_nodes function queries vertices by label and optional property filters:
-- All Person nodes
SELECT * FROM graph_nodes('social', 'Person');
-- Person nodes with specific properties
SELECT * FROM graph_nodes('social', 'Person', '{"age": 30}');
-- All nodes regardless of label
SELECT * FROM graph_nodes('social', NULL);
The label and property filters are composed into a single query document and executed against the nodes collection, leveraging the idx_label secondary index for label filtering.
11.4.4 Node Update
Node properties can be updated in place:
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"age": 31, "title": "Engineer"}');
The update operation validates that no system fields (those prefixed with _) are being modified, with one exception: node updates allow changing the _label field to support vertex relabeling:
-- Relabel a Person node to Employee
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"_label": "Employee"}');
The $delete operator supports both single-property and multi-property deletion:
-- Delete a single property
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"$delete": "title"}');
-- Delete multiple properties at once
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"$delete": ["title", "email"]}');
The $delete operator also validates that reserved system fields cannot be removed. The validation logic checks both direct property keys and $delete targets against the reserved _ prefix:
if (key == "$delete") {
// Accepts a single string or an array of strings
// Each path is validated against reserved field names
}
if (key[0] == '_') {
if (allow_label_update && key == kFieldLabel) {
continue; // Node updates permit _label changes
}
return std::unexpected(/* system field error */);
}
(graph_functions.cpp:296-335)
After a successful update, the adjacency cache entry for the node is invalidated to ensure traversal operations see the current state.
11.4.5 Node Deletion
Deleting a node checks for connected edges before proceeding:
-- Fails if the node has any connected edges
SELECT * FROM graph_delete_node('social', 'social:Person:1');
-- Detach delete: removes the node and all connected edges
SELECT * FROM graph_delete_node('social', 'social:Person:1', true);
The detach delete operation collects all edges where the node appears as either source or target, removes them all, then removes the node. This is implemented as a three-step process:
1. Edge collection: Query `{_source: node_id}` and `{_target: node_id}` to find all connected edge identifiers, deduplicating with a hash set.
2. Edge removal: Delete each connected edge by its identifier.
3. Node removal: Delete the node document.
auto collect_connected_edge_ids = [&](auto* edges_view)
-> std::vector<std::string> {
auto source_query = make_query_doc_(kFieldSource, *node_id);
auto target_query = make_query_doc_(kFieldTarget, *node_id);
auto edge_ids = std::vector<std::string> {};
auto seen = absl::flat_hash_set<std::string> {};
// ... collect from both source and target queries
};
(graph_functions.cpp:937-963)
After deletion, the adjacency cache is invalidated for the removed node.
11.4.6 Edge Creation
Creating an edge requires specifying the graph, edge type, source vertex, target vertex, and properties:
SELECT * FROM graph_create_edge('social', 'KNOWS',
'social:Person:1', 'social:Person:2',
'{"since": 2020}');
-- Returns: social:KNOWS:1
The implementation performs referential integrity checks — both source and target nodes must exist before the edge is created. This prevents dangling edges that reference non-existent vertices:
if (auto err = verify_node_exists(*source_id, source_query);
err.has_value()) {
return std::unexpected("graph_create_edge: source node '"
+ *source_id + "' does not exist");
}
(graph_functions.cpp:1106-1108)
After edge insertion, the adjacency cache is invalidated for both the source and target nodes to ensure cache consistency.
11.4.7 Edge Retrieval and Querying
Individual edges are retrieved by identifier:
SELECT * FROM graph_get_edge('social', 'social:KNOWS:1');
The result follows the Apache AGE-compatible format:
| Column | Type | Description |
|---|---|---|
| `id` | String | Edge identifier |
| `start_id` | String | Source vertex identifier |
| `end_id` | String | Target vertex identifier |
| `label` | String | Edge type |
| `properties` | JSONB | User properties |
Edges connected to a specific node can be queried with direction control:
-- Outgoing edges from Alice
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'outgoing');
-- Incoming edges to Alice
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'incoming');
-- All edges (both directions)
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'both');
The direction parameter controls which index is queried:
| Direction | Index Used | Query Field |
|---|---|---|
| `outgoing` | `idx_src_type` | `_source` |
| `incoming` | `idx_tgt_type` | `_target` |
| `both` | Both indexes | `_source` and `_target` |
11.4.8 Edge Update and Deletion
Edge updates follow a similar pattern to node updates but with stricter validation. Unlike node updates, edge updates do not permit changes to any system field (including _type). The properties object must also be non-empty:
SELECT * FROM graph_update_edge('social', 'social:KNOWS:1',
'{"since": 2025}');
Before applying the update, the implementation loads the edge metadata to determine the source and target vertex identifiers. If the specified edge does not exist, the operation fails with an explicit error rather than silently succeeding:
auto load_edge_metadata = [&](auto* edges_view) {
auto cursor = edges_view->find(query);
if (!cursor || !cursor->is_valid()) {
return;
}
found_edge = true;
// Extract source_id and target_id for cache invalidation
};
(graph_functions.cpp:1385-1398)
Edge updates also support the $delete operator for removing individual properties, with the same reserved-field protection as node updates. The $delete operator accepts either a single field name or an array of field names.
Edge deletion loads the edge metadata first (to determine source and target for cache invalidation), then removes the edge document:
SELECT * FROM graph_delete_edge('social', 'social:KNOWS:1');
Both operations invalidate cache entries for the source and target nodes of the affected edge.
11.4.9 Transaction-Aware Operations
All graph CRUD functions participate in the database transaction context when one is available. Each table function receives a TableFunctionContext that may carry an active transaction and workspace identifier. When a transaction context is present, all collection lookups and mutations are routed through the transaction rather than directly through the document database:
auto find_collection_(const TableFunctionContext& ctx,
const std::string& collection_name)
-> std::shared_ptr<db::document::Collection> {
auto* doc_db = resolve_document_db_(ctx);
if (doc_db == nullptr) {
return nullptr;
}
if (has_transaction_context_(ctx)) {
return ctx.transaction->find_collection(
ctx.workspace_id, collection_name);
}
return doc_db->find_collection(collection_name);
}
(graph_functions.cpp:195-208)
This design enables several capabilities:
1. Atomic multi-step graph mutations: A Cypher `CREATE` statement that creates multiple nodes and edges can execute all mutations within a single transaction. If any step fails, the entire operation rolls back.
2. Read-your-own-writes: Within a LATERAL join chain, a graph mutation in one step is visible to subsequent steps in the same query. For example, creating a node and then creating an edge to that node within the same query works correctly because both operations share the same transaction context.
3. Isolation from concurrent queries: Graph mutations in progress are not visible to other sessions until the transaction commits, preventing partial graph states from being observed.
The transaction routing is transparent to the function implementations. Each CRUD function uses resolve_document_db_() to obtain the database handle and begin_collection_transaction_() to obtain a collection-scoped transaction when available. When no transaction context is present (standalone queries outside an explicit transaction), the functions fall back to direct collection access, preserving backward compatibility.
11.4.10 Scalar JSON Helpers for Path Materialization
In addition to the table functions described above, four scalar functions provide JSON-formatted access to graph elements. These functions are used internally by the Cypher query compiler to materialize path expressions, where nodes and edges must be returned as structured JSON values rather than tabular rows:
| Function | Arguments | Returns | Description |
|---|---|---|---|
| `graph_get_node_json` | graph TEXT, id TEXT | JSONB | Single node as JSON object |
| `graph_get_nodes_json` | graph TEXT, ids TEXT[] | JSONB | Array of nodes as JSON array |
| `graph_get_edge_json` | graph TEXT, id TEXT | JSONB | Single edge as JSON object |
| `graph_get_edges_json` | graph TEXT, ids TEXT[] | JSONB | Array of edges as JSON array |
The JSON representation follows a structured format distinct from the flat document storage:
Node JSON format:
{
"id": "social:Person:1",
"label": "Person",
"properties": {"name": "Alice", "age": 30}
}
Edge JSON format:
{
"id": "social:KNOWS:1",
"start_id": "social:Person:1",
"end_id": "social:Person:2",
"label": "KNOWS",
"properties": {"since": 2020}
}
Unlike the flat storage format used in document collections, these JSON representations nest user properties under a properties key. This matches the Apache AGE convention for Cypher path output and enables clean separation of system metadata from user data in query results.
The scalar functions are registered with kVolatile volatility because they read from mutable graph collections. They are also transaction-aware, using the same SessionContext plumbing as the table functions to ensure read-your-own-writes consistency within multi-step Cypher queries.
11.5 Traversal Algorithms
Graph traversal is the distinguishing capability that separates graph databases from document databases. Cognica implements several traversal algorithms as SQL table functions, each optimized for specific access patterns.
11.5.1 Neighbor Discovery: graph_neighbors
The graph_neighbors function performs breadth-first neighbor discovery from a starting vertex:
SELECT * FROM graph_neighbors('social', 'social:Person:1',
'KNOWS', 'outgoing', 2);
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start_node` | TEXT | required | Starting vertex ID |
| `edge_type` | TEXT | NULL | Edge type filter (NULL = all types) |
| `direction` | TEXT | 'outgoing' | 'outgoing', 'incoming', or 'both' |
| `max_depth` | INT | 1 | Maximum traversal depth |
Algorithm: The implementation uses BFS with a visited set to avoid cycles and a frontier deque to track vertices at each depth level. The algorithm operates as follows:
1. Initialize the visited set with the start node and push it onto the frontier with depth 0.
2. Pop the front of the frontier. If the depth equals `max_depth`, skip expansion.
3. For each adjacent vertex not yet visited, add it to the visited set and the result set, then push it onto the back of the frontier with incremented depth.
4. Repeat until the frontier is empty.
auto visited = absl::flat_hash_set<std::string> {};
visited.insert(*start_node);
auto frontier = std::deque<
std::tuple<std::string, int64_t, std::vector<std::string>>> {};
frontier.emplace_back(*start_node, 0,
std::vector<std::string> {*start_node});
while (!frontier.empty()) {
auto [current_node, depth, path] = std::move(frontier.front());
frontier.pop_front();
if (depth >= max_depth) {
continue;
}
// ... expand neighbors
}
(graph_functions.cpp:1593-1608)
Result columns: Each discovered neighbor is returned with:
| Column | Type | Description |
|---|---|---|
| `id` | String | Neighbor vertex ID |
| `label` | String | Neighbor vertex label |
| `properties` | JSONB | Neighbor properties |
| `depth` | INT | Distance from start node |
| `path` | ARRAY | Vertex IDs along the traversal path |
Complexity: For a graph with maximum degree d and traversal depth k:
- Time: O(d^k) — exponential in depth, as each level expands up to d neighbors
- Space: O(d^k) — the frontier and visited set grow proportionally
11.5.2 Configurable Traversal: graph_traverse
The graph_traverse function extends neighbor discovery with support for both BFS and DFS strategies, edge type filtering via arrays, and configurable depth limits:
-- BFS traversal following only KNOWS edges
SELECT * FROM graph_traverse('social', 'social:Person:1',
ARRAY['KNOWS'], 'outgoing', 3, 'bfs');
-- DFS traversal following KNOWS and FOLLOWS edges
SELECT * FROM graph_traverse('social', 'social:Person:1',
ARRAY['KNOWS', 'FOLLOWS'], 'outgoing', 5, 'dfs');
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start_node` | TEXT | required | Starting vertex ID |
| `edge_types` | TEXT[] | NULL | Array of edge types (NULL = all) |
| `direction` | TEXT | 'outgoing' | 'outgoing', 'incoming', or 'both' |
| `max_depth` | INT | 10 | Maximum traversal depth |
| `strategy` | TEXT | 'bfs' | 'bfs' or 'dfs' |
BFS vs DFS: The implementation uses a single std::deque for both strategies. The only difference is where elements are removed:
if (strategy == "bfs") {
state = std::move(frontier.front());
frontier.pop_front();
} else {
state = std::move(frontier.back());
frontier.pop_back();
}
(graph_functions.cpp:1765-1771)
BFS pops from the front (FIFO), guaranteeing that vertices are discovered in order of increasing depth. DFS pops from the back (LIFO), exploring each branch to its maximum depth before backtracking.
Edge type filtering: When the edge type array is non-empty, a hash set is constructed for type membership checks during traversal:
auto type_filter = absl::flat_hash_set<std::string> {};
for (const auto& t : edge_types) {
type_filter.insert(t);
}
(graph_functions.cpp:1738-1741)
During expansion, each edge's type is checked against this filter set. Edges with types not in the set are skipped.
Complexity: Same as neighbor discovery — O(d^k) time and space — but the effective degree d may be reduced by edge type filtering.
11.5.3 Shortest Path: graph_shortest_path
The graph_shortest_path function finds the shortest path between two vertices using bidirectional BFS:
SELECT * FROM graph_shortest_path('path_graph',
'path_graph:Node:1', 'path_graph:Node:5',
NULL, 'outgoing', 10);
Algorithm: Bidirectional BFS simultaneously expands from both the start and end vertices, meeting in the middle. This reduces the search space from O(d^k) to O(d^(k/2)) for a graph with degree d and shortest path length k.
The algorithm maintains two separate search frontiers and visited maps:
auto forward = absl::flat_hash_map<std::string, PathNode> {};
auto backward = absl::flat_hash_map<std::string, PathNode> {};
forward[*start_node] = PathNode {*start_node, "", "", 0};
backward[*end_node] = PathNode {*end_node, "", "", 0};
auto forward_frontier = std::vector<std::string> {*start_node};
auto backward_frontier = std::vector<std::string> {*end_node};
(graph_functions.cpp:1930-1937)
Each PathNode stores the node ID, its parent in the search tree, the edge used to reach it, and its depth. This information enables path reconstruction once the two searches meet.
Alternating expansion: The main loop alternates between expanding the forward and backward frontiers. After each expansion, it checks whether any newly discovered vertex exists in the opposite frontier:
for (int64_t depth = 0; depth < max_depth / 2 + 1; ++depth) {
// Expand forward frontier
expand_frontier(forward_frontier, forward,
kFieldSource, kFieldTarget);
// Check for meeting point
for (const auto& node : forward_frontier) {
if (backward.contains(node)) {
return reconstruct_path(node);
}
}
// Expand backward frontier
expand_frontier(backward_frontier, backward,
kFieldTarget, kFieldSource);
// Check for meeting point
for (const auto& node : backward_frontier) {
if (forward.contains(node)) {
return reconstruct_path(node);
}
}
}
(graph_functions.cpp:2027-2061)
Path reconstruction: When the two frontiers meet at a vertex m, the path is reconstructed by tracing parents backward from m to the start (forward path) and from m to the end (backward path):
1. Trace the forward path: m → ... → start
2. Reverse it to get: start → ... → m
3. Trace the backward path: m → ... → end
4. Concatenate: start → ... → m → ... → end
Trivial case: When start equals end, the function immediately returns a path of length 0 containing just the single vertex.
Result columns:
| Column | Type | Description |
|---|---|---|
| `nodes` | ARRAY | Ordered vertex IDs along the path |
| `edges` | ARRAY | Ordered edge IDs along the path |
| `length` | INT | Number of edges in the path |
| `total_weight` | FLOAT | Sum of edge weights (defaults to hop count) |
Complexity: For a graph with average degree d and shortest path length k:
- Time: O(d^(k/2)) — the square root of unidirectional BFS's O(d^k)
- Space: O(d^(k/2)) — for the two visited maps
The bidirectional approach provides substantial speedup for long paths in dense graphs. For a graph with d = 100 and k = 6:
- Unidirectional BFS: 100^6 = 10^12 vertex expansions
- Bidirectional BFS: 2 × 100^3 = 2 × 10^6 vertex expansions
This represents a roughly millionfold reduction.
11.5.4 All Paths: graph_all_paths
The graph_all_paths function enumerates all simple paths between two vertices within depth constraints:
SELECT * FROM graph_all_paths('path_graph',
'path_graph:Node:1', 'path_graph:Node:5',
NULL, 1, 5, 100);
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start` | TEXT | required | Start vertex ID |
| `end` | TEXT | required | End vertex ID |
| `edge_types` | TEXT[] | NULL | Edge type filter |
| `min_depth` | INT | 1 | Minimum path length |
| `max_depth` | INT | 5 | Maximum path length |
| `limit` | INT | 100 | Maximum number of paths to return |
Algorithm: DFS with per-path cycle detection. Unlike the traversal functions which maintain a global visited set, graph_all_paths uses a per-path visited set to allow the same vertex to appear in different paths:
struct DFSState {
  std::string node;
  std::vector<std::string> path_nodes;
  std::vector<std::string> path_edges;
  absl::flat_hash_set<std::string> visited;
};
(graph_functions.cpp:2114-2119)
Each stack entry carries its own visited set. When a neighbor has already been visited on the current path, it is skipped — except when the neighbor is the target vertex, which is allowed to terminate the path:
if (state.visited.contains(neighbor) && neighbor != *end_node) {
  continue;
}
(graph_functions.cpp:2167-2168)
This cycle detection ensures that only simple paths (no repeated vertices except the terminal) are enumerated.
Complexity: In the worst case (a complete graph), the number of simple paths between two vertices grows super-exponentially: approximately e · (n − 2)! for n vertices. The limit parameter bounds the output, but the search itself may explore exponentially many partial paths. The max_depth parameter provides a tighter bound by limiting path length.
11.5.5 Reachability: graph_reachable
The graph_reachable function checks whether a path exists between two vertices, optimized for early termination:
SELECT * FROM graph_reachable('social',
'social:Person:1', 'social:Person:5');
-- Returns: true or false
The implementation delegates to graph_shortest_path and checks whether any path was found:
auto path_result = fn_graph_shortest_path(ctx, args);
if (!path_result) {
  return std::unexpected(path_result.error());
}
auto& paths = std::get<GraphPathSeries>(*path_result);
auto result = JsonValueSeries {};
result.push_back(make_bool_value(!paths.empty()));
(graph_functions.cpp:2199-2208)
This leverages the bidirectional BFS for efficiency while providing a simpler Boolean interface.
Complexity: Same as graph_shortest_path — O(b^(d/2)), where d is the shortest path length. The bidirectional BFS terminates as soon as any path is found, providing implicit early termination.
11.5.6 Complexity Summary
| Function | Algorithm | Time | Space |
|---|---|---|---|
| graph_neighbors | BFS | O(b^d) | O(min(b^d, V)) |
| graph_traverse | BFS/DFS | O(b^d) | O(min(b^d, V)) |
| graph_shortest_path | Bidirectional BFS | O(b^(d/2)) | O(b^(d/2)) |
| graph_all_paths | DFS + cycle detection | O(b^d) worst case | O(b · d) stack entries |
| graph_reachable | Bidirectional BFS | O(b^(d/2)) | O(b^(d/2)) |
Where b = maximum degree, d = traversal depth or path length, V = total vertices.
11.6 Adjacency Cache
Multi-hop graph traversals require repeated edge lookups for each vertex in the frontier. Without caching, a 3-hop traversal expanding 100 neighbors at each level generates over 10,000 index lookups (1 + 100 + 100²) against the edge collection. The adjacency cache dramatically reduces this cost by storing edge adjacency lists in memory.
11.6.1 Cache Architecture
The GraphAdjacencyCache is a process-wide singleton that stores both outgoing and incoming edge lists for cached nodes:
class GraphAdjacencyCache final {
  // ...

 private:
  struct CacheEntry {
    std::vector<EdgeInfo> outgoing;
    std::vector<EdgeInfo> incoming;
    uint64_t access_time;
  };

  std::unordered_map<std::string, CacheEntry> cache_;
  mutable std::mutex mutex_;
  size_t max_entries_ = 1000000;  // 1M nodes
};
(graph_functions.hpp:277-289)
Each EdgeInfo stores the minimal information needed for traversal:
struct EdgeInfo {
  std::string edge_id;
  std::string neighbor_id;
  std::string edge_type;
};
(graph_functions.hpp:216-220)
The cache key is a composite string "graph:node_id", ensuring that identically-named nodes in different graphs are cached independently.
11.6.2 LRU Eviction
When the cache reaches its capacity (default 1M entries), the eviction policy removes the least recently used entry. Each access updates an access_time counter:
void GraphAdjacencyCache::evict_if_needed_() {
  if (cache_.size() < max_entries_) {
    return;
  }
  auto oldest_it = cache_.begin();
  auto oldest_time = oldest_it->second.access_time;
  for (auto it = cache_.begin(); it != cache_.end(); ++it) {
    if (it->second.access_time < oldest_time) {
      oldest_time = it->second.access_time;
      oldest_it = it;
    }
  }
  cache_.erase(oldest_it);
}
(graph_functions.cpp:2530-2548)
The eviction scan is O(n) in the cache size. For production workloads that frequently trigger eviction, this could be improved with a doubly-linked list maintaining LRU order. However, for the current implementation, evictions are rare relative to cache hits because the 1M entry capacity accommodates most practical graph sizes.
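The doubly-linked-list improvement mentioned above is the textbook O(1) LRU construction: a list keeps keys in recency order and a hash map points each key at its list node. A minimal sketch (illustrative only — int values stand in for adjacency lists, and locking is omitted):

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

// Textbook O(1) LRU: a doubly-linked list keeps keys in recency order
// (most recent at the front); a hash map points each key at its node.
class LruCache final {
 public:
  explicit LruCache(std::size_t capacity) : capacity_ {capacity} {}

  void put(const std::string& key, int value) {
    if (auto it = index_.find(key); it != index_.end()) {
      it->second->second = value;
      touch_(it->second);
      return;
    }
    if (order_.size() == capacity_) {
      index_.erase(order_.back().first);  // evict least recently used
      order_.pop_back();
    }
    order_.emplace_front(key, value);
    index_[key] = order_.begin();
  }

  std::optional<int> get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    touch_(it->second);  // splice does not invalidate the iterator
    return it->second->second;
  }

 private:
  using Node = std::pair<std::string, int>;

  void touch_(std::list<Node>::iterator it) {
    order_.splice(order_.begin(), order_, it);  // move to front, O(1)
  }

  std::size_t capacity_;
  std::list<Node> order_;
  std::unordered_map<std::string, std::list<Node>::iterator> index_;
};
```

Both put and get are O(1): the map locates the node, and splice relinks it at the front without copying.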
11.6.3 Cache Warming
The graph_warm_cache function pre-populates the cache by iterating over all nodes in a graph and loading their adjacency lists:
SELECT * FROM graph_warm_cache('social', 100000);
-- Returns: cached = 100000, time_ms = 1500
The implementation scans the nodes collection, and for each node, queries its outgoing and incoming edges:
for (const auto& node_id : node_ids) {
  auto outgoing = std::vector<EdgeInfo> {};
  auto incoming = std::vector<EdgeInfo> {};

  auto out_query = make_query_doc_(kFieldSource, node_id);
  auto out_cursor = edges_coll->find(out_query);
  for (; out_cursor->is_valid(); out_cursor->next()) {
    // ... collect EdgeInfo
  }

  auto in_query = make_query_doc_(kFieldTarget, node_id);
  auto in_cursor = edges_coll->find(in_query);
  for (; in_cursor->is_valid(); in_cursor->next()) {
    // ... collect EdgeInfo
  }

  cache_node(graph, node_id, outgoing, incoming);
}
(graph_functions.cpp:2468-2503)
Cache warming is intended for use during low-traffic periods, such as application startup or after a graph bulk load. The max_nodes parameter limits the number of nodes cached to control memory usage and warming time.
11.6.4 Cache Invalidation
Cache consistency is maintained through eager invalidation on write operations. Every mutation that affects a node's adjacency triggers invalidation:
| Operation | Invalidation Scope |
|---|---|
| graph_create_edge | Source node, Target node |
| graph_update_edge | Source node, Target node |
| graph_delete_edge | Source node, Target node |
| graph_update_node | The updated node |
| graph_delete_node | The deleted node |
| graph_drop | All nodes in the graph |
The invalidation is per-node, not per-edge. When an edge is created between vertices u and v, both u's and v's cache entries are removed entirely (not surgically updated). This conservative approach avoids subtle consistency bugs at the cost of occasional additional cache misses:
auto& cache = GraphAdjacencyCache::instance();
cache.invalidate_node(*graph_name, *source_id);
cache.invalidate_node(*graph_name, *target_id);
(graph_functions.cpp:1152-1154)
For graph_drop, all entries with the graph's prefix are removed:
void GraphAdjacencyCache::invalidate_graph(const std::string& graph) {
  auto lock = std::lock_guard {mutex_};
  auto prefix = graph + ":";
  for (auto it = cache_.begin(); it != cache_.end();) {
    if (it->first.compare(0, prefix.size(), prefix) == 0) {
      it = cache_.erase(it);
    } else {
      ++it;
    }
  }
}
(graph_functions.cpp:2513-2523)
11.6.5 Thread Safety
All cache operations are protected by a mutex:
auto GraphAdjacencyCache::get_outgoing(
    const std::string& graph, const std::string& node_id)
    -> std::optional<std::vector<EdgeInfo>> {
  auto lock = std::lock_guard {mutex_};
  auto key = make_key_(graph, node_id);
  auto it = cache_.find(key);
  if (it == cache_.end()) {
    ++miss_count_;
    return std::nullopt;
  }
  ++hit_count_;
  it->second.access_time = ++access_counter_;
  return it->second.outgoing;
}
(graph_functions.cpp:2397-2410)
The single-mutex design is simple and correct. For workloads with very high traversal concurrency, a sharded lock design (partitioning the cache by hash of the key) could reduce contention. However, cache lookups are fast relative to the I/O they avoid, so the mutex overhead is typically negligible.
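For illustration, a sharded-lock cache along the lines described might look like the following sketch (hypothetical, not Cognica code): the key space is partitioned by hash, each shard owns an independent mutex, and lookups for keys that land in different shards never contend on the same lock.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Sharded cache: keys are partitioned by hash across N shards, each
// guarded by its own mutex, reducing contention under high concurrency.
template <std::size_t NumShards = 16>
class ShardedCache final {
 public:
  void put(const std::string& key, int value) {
    auto& shard = shard_(key);
    auto lock = std::lock_guard {shard.mutex};
    shard.map[key] = value;
  }

  std::optional<int> get(const std::string& key) {
    auto& shard = shard_(key);
    auto lock = std::lock_guard {shard.mutex};
    auto it = shard.map.find(key);
    if (it == shard.map.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, int> map;
  };

  // Route a key to its shard by hashing; a key always maps to one shard.
  Shard& shard_(const std::string& key) {
    return shards_[std::hash<std::string> {}(key) % NumShards];
  }

  std::array<Shard, NumShards> shards_;
};
```

The trade-off is that whole-cache operations (statistics, graph-wide invalidation) must now visit every shard, taking each lock in turn.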
11.6.6 Cache Statistics and Monitoring
The graph_cache_stats function exposes cache performance metrics:
SELECT * FROM graph_cache_stats();
| Column | Type | Description |
|---|---|---|
| entries | INT | Current number of cached nodes |
| hits | INT | Total cache hit count |
| misses | INT | Total cache miss count |
| hit_rate | FLOAT | Hit rate as a decimal (0.0 to 1.0) |
These metrics enable operators to assess whether the cache is sized appropriately and whether cache warming is effective for their workload.
11.6.7 Performance Impact
The adjacency cache provides the greatest benefit for multi-hop traversals. Consider a 3-hop BFS traversal on a graph where each node has 50 outgoing edges:
Without cache: Each hop requires an index lookup against the edges collection.
- Total lookups: 1 + 50 + 50² = 2,551
- Each lookup involves LSM-tree index traversal: multiple disk reads or memory-mapped page accesses
With warm cache: Each hop requires a hash table lookup.
- Total lookups: 2,551 (same count)
- Each lookup: single hash table probe, expected O(1) time, no disk I/O
The speedup is proportional to the ratio of index lookup time to hash table lookup time, which can be 100x or more depending on whether the index data is in the OS page cache.
11.7 Aggregation Operations
11.7.1 Degree Counting: graph_degree
The graph_degree function counts the number of edges connected to a vertex, optionally filtered by edge type and direction:
-- Count all edges (both directions)
SELECT * FROM graph_degree('social', 'social:Person:1', NULL, 'both');
-- Count only outgoing KNOWS edges
SELECT * FROM graph_degree('social', 'social:Person:1', 'KNOWS', 'outgoing');
The implementation iterates over matching edges and counts them:
auto count_edges = [&](const char* field) {
  auto query = db::document::Document {};
  add_string_member_(query, field, *node_id);
  if (edge_type) {
    add_string_member_(query, kFieldType, *edge_type);
  }
  auto cursor = edges_coll->find(query);
  for (; cursor->is_valid(); cursor->next()) {
    ++degree;
  }
};
(graph_functions.cpp:2248-2260)
The degree computation leverages the composite indexes (idx_src_type, idx_tgt_type) when an edge type filter is provided, making the operation efficient even for high-degree nodes.
Complexity: O(deg(v)), where deg(v) is the degree of vertex v in the specified direction. The index limits the scan to edges matching the filter.
11.7.2 Common Neighbor Computation: graph_common_neighbors
The graph_common_neighbors function finds vertices that are outgoing neighbors of both input vertices:
SELECT * FROM graph_common_neighbors('social',
'social:Person:1', 'social:Person:2', 'KNOWS');
The algorithm is a straightforward set intersection:
- Collect outgoing neighbors of vertex a into set A.
- Collect outgoing neighbors of vertex b into set B.
- Return A ∩ B — vertices present in both sets.
auto neighbors_a = get_neighbors(*node_a);
auto neighbors_b = get_neighbors(*node_b);

for (const auto& neighbor : neighbors_a) {
  if (neighbors_b.contains(neighbor)) {
    // Fetch neighbor node and add to result
  }
}
(graph_functions.cpp:2340-2345)
The implementation uses absl::flat_hash_set for expected O(1)-time membership testing, making the intersection operation efficient.
Complexity: O(deg(a) + deg(b) + k · F), where k is the number of common neighbors and F is the cost of fetching a node document.
Applications: Common neighbor counting is a fundamental building block for link prediction algorithms. The Jaccard coefficient between two vertices a and b is computed as:
J(a, b) = |Γ(a) ∩ Γ(b)| / |Γ(a) ∪ Γ(b)|
where Γ(v) denotes the neighbor set of v. The Adamic-Adar index weights each common neighbor by the inverse log of its degree:
AA(a, b) = Σ over c ∈ Γ(a) ∩ Γ(b) of 1 / log(deg(c))
Both metrics can be computed efficiently using graph_common_neighbors combined with graph_degree.
11.8 Design Trade-offs
11.8.1 Flat vs Nested Property Storage
The choice to store properties flat alongside system fields (rather than nested under a properties key) is Cognica's most distinctive deviation from other graph databases.
Advantages of flat storage:
- Direct indexability of property fields
- Simpler SQL translation (n.name becomes t.name, not t.properties->>'name')
- Unified document-graph model (graph vertices are ordinary documents)
- No JSON extraction overhead for property access
Advantages of nested storage (used by Apache AGE internally):
- Clean separation between system and user fields
- No naming conflicts between user properties and system fields
- Simpler introspection (all user data is under one key)
Cognica mitigates the disadvantage of flat storage by reserving the _ prefix for system fields and validating that user properties cannot overwrite system fields during updates:
if (key[0] == '_') {
  return std::unexpected(std::string {fn_name}
                         + ": cannot update reserved graph system fields");
}
(graph_functions.cpp:328-330)
11.8.2 Collection-Based vs Native Graph Storage
Cognica stores graphs as document collections rather than implementing a purpose-built graph storage engine. This is a fundamentally different approach from systems like Neo4j, which uses a native graph storage format with fixed-size records and pointer-based adjacency traversal.
Collection-based storage (Cognica):
| Aspect | Characteristic |
|---|---|
| Adjacency traversal | Index lookup per hop |
| Vertex access | Index lookup by _id |
| Storage overhead | Standard document overhead |
| Implementation cost | Minimal (reuses existing engine) |
| Transaction support | Full (inherits from document layer) |
| Cross-paradigm queries | Native (vertices are documents) |
Native graph storage (Neo4j-style):
| Aspect | Characteristic |
|---|---|
| Adjacency traversal | Pointer chase, O(1) per hop |
| Vertex access | Direct by internal ID |
| Storage overhead | Fixed-size records, compact |
| Implementation cost | Substantial (custom engine) |
| Transaction support | Custom implementation required |
| Cross-paradigm queries | Requires data duplication |
Cognica's approach trades per-hop traversal speed for implementation simplicity, full transaction support, and seamless cross-paradigm query composition. The adjacency cache closes much of the traversal performance gap by providing in-memory adjacency lookups that approach the speed of native pointer-based traversal.
11.8.3 Cache vs Disk Trade-offs
The adjacency cache introduces a classic space-time trade-off:
| Configuration | Memory | Traversal Speed | Write Overhead |
|---|---|---|---|
| No cache | Minimal | Slow (disk/index per hop) | None |
| Selective cache | Moderate | Fast for cached nodes | Per-write invalidation |
| Full cache | High (proportional to edge count) | Fastest | Per-write invalidation |
The 1M entry default limit assumes approximately 100 bytes per entry (edge lists for both directions), yielding roughly 100 MB of cache memory. This accommodates graphs with up to 1 million active vertices. For larger graphs, the LRU eviction policy ensures that frequently accessed vertices remain cached while infrequently accessed vertices are evicted.
11.8.4 Comparison with Existing Systems
| Feature | Cognica | Apache AGE | Neo4j |
|---|---|---|---|
| Storage backend | Document collections (LSM-tree) | PostgreSQL heap tables | Native fixed-size records |
| Property storage | Flat (top-level fields) | Nested (agtype) | Native property store |
| Index type | B-tree secondary indexes | PostgreSQL indexes | Native index-free adjacency |
| Query language | SQL + table functions + Cypher | SQL + Cypher | Cypher |
| Transaction model | Document-level MVCC | PostgreSQL MVCC | Custom MVCC |
| Cross-paradigm | Native (same storage layer) | Via PostgreSQL tables | Requires connector |
| Adjacency cache | LRU in-memory cache | None (relies on PostgreSQL buffer) | Index-free adjacency |
| Traversal method | BFS/DFS via collection queries | Custom scan operators | Native traversal engine |
Apache AGE stores graph data in PostgreSQL heap tables with a custom agtype data type that wraps JSON-like values. Each vertex and edge is a row with an agtype properties column. Cognica's flat property storage avoids the agtype extraction overhead but requires system field naming discipline.
Neo4j uses a native storage format where each vertex record contains a pointer to its first relationship, and each relationship record contains pointers to the next relationships for both the source and target vertices. This forms a doubly-linked list that enables adjacency traversal without index lookups. Cognica's adjacency cache provides similar in-memory traversal when warmed, at the cost of cache memory.
11.9 Summary
This chapter presented Cognica's graph storage and operations layer, tracing the path from the mathematical property graph model defined in Chapter 3 to a concrete implementation built on the document storage engine described in Chapter 6.
The key architectural decisions are:
- Dual-collection model: Each graph is stored as a pair of document collections (_nodes and _edges), inheriting all document storage capabilities including indexing, transactions, and SQL queryability.
- Flat property storage: User properties are stored as top-level document fields rather than nested under a properties key. This enables direct indexing, simple SQL translation, and a unified document-graph data model.
- Structured identifiers: The {graph}:{label}:{sequence} naming scheme provides namespace isolation, type encoding, and human readability while maintaining monotonic ordering for LSM-tree locality.
- Traversal algorithms: BFS neighbor discovery, configurable BFS/DFS traversal, bidirectional BFS shortest path, DFS all-paths enumeration with per-path cycle detection, and BFS reachability checking cover the fundamental graph query patterns.
- Adjacency cache: An LRU cache storing outgoing and incoming edge lists for up to 1 million nodes dramatically accelerates multi-hop traversals by replacing index lookups with hash table probes.
- Eager invalidation: Write operations eagerly invalidate affected cache entries, maintaining consistency between the cache and the persistent storage layer.
- Transaction-aware CRUD: All graph mutation functions participate in the database transaction context, enabling atomic multi-step operations and read-your-own-writes consistency within complex query pipelines.
- Scalar JSON helpers: Dedicated scalar functions materialize graph elements as structured JSON values for Cypher path expressions, bridging the gap between the flat document storage model and the nested representation expected by graph query languages.
The graph layer is accessible through SQL table functions, enabling composition with standard SQL operations:
-- Cross-paradigm query: full-text search + graph traversal
SELECT n.id, n.properties->>'name' AS name, n.depth
FROM social_nodes docs
JOIN fts_search('social_nodes', 'database expert') s ON docs._id = s._id
CROSS JOIN LATERAL graph_neighbors('social', docs._id, 'KNOWS', 'both', 2) n
WHERE n.depth <= 2
ORDER BY s._score DESC;
Graph mutations can also be composed through LATERAL joins, enabling complex multi-step operations within a single SQL statement:
-- Create a node and an edge in a single query pipeline
WITH src AS (SELECT 'social:Person:1' AS source_id)
SELECT refreshed.since
FROM src
CROSS JOIN LATERAL graph_create_node(
'social', 'Person', '{"name": "Bob"}') AS cn(id)
CROSS JOIN LATERAL graph_create_edge(
'social', 'KNOWS', src.source_id, cn.id, '{}') AS ce(id)
CROSS JOIN LATERAL graph_update_edge(
'social', ce.id, '{"since": 2024}') AS ue(success)
JOIN LATERAL (
SELECT e.since FROM social_edges e WHERE e._id = ce.id
) refreshed ON true;
This pattern is the foundation for Cypher-to-SQL compilation: each Cypher clause (CREATE, MERGE, SET) is lowered to a LATERAL join step that can read results from previous steps and feed values to subsequent steps, all within a single transactional context.
This composability is the practical realization of the algebraic unification described in Chapter 3: graph posting lists, document posting lists, and full-text search posting lists all participate in the same query evaluation framework, enabling queries that span paradigms without data duplication or cross-system orchestration.