Chapter 11: Graph Storage and Operations
Chapter 3 established the mathematical foundations for graph structures within Cognica's unified algebra, demonstrating that graph posting lists are isomorphic to document posting lists and that traversal, pattern matching, and path queries all produce sets amenable to Boolean composition. This chapter shifts from theory to implementation: how Cognica materializes the property graph model as document collections, how CRUD operations maintain graph invariants, how traversal algorithms navigate the adjacency structure, and how an in-memory adjacency cache accelerates multi-hop queries.
11.1 Property Graph Model in Cognica
11.1.1 From Algebra to Implementation
Chapter 3 defined a property graph as a tuple with vertices, edges, property assignments, labels, and edge types. The challenge of implementation is to map this abstract structure onto Cognica's concrete storage layer — document collections backed by LSM-trees — while preserving the algebraic properties that enable cross-paradigm query composition.
The central observation is that vertices and edges are both semi-structured entities carrying key-value properties, exactly like documents. A vertex with label "Person" and properties {name: "Alice", age: 30} differs from a document only in the additional graph metadata it carries (label, graph membership) and the structural relationships (edges) it participates in. This observation motivates a storage strategy where graph entities are stored as documents in specialized collections, leveraging all existing infrastructure for indexing, querying, and transaction management.
11.1.2 Property Graph Definition
Cognica implements the labeled property graph model, where:
- Vertices (nodes) carry a single label and a set of key-value properties
- Edges (relationships) carry a single type and a set of key-value properties
- Edges are directed: each edge has a source vertex and a target vertex
- Both vertices and edges have system-assigned unique identifiers
- Multiple edges of the same or different types may connect the same pair of vertices
This model is formalized as a tuple G = (V, E, id, λ, τ, π), where:
- V is the set of vertices
- E is the set of typed, directed edges, each with a source and target in V
- id assigns unique identifiers to vertices and edges
- λ : V → L assigns a label to each vertex
- τ : E → T assigns a type to each edge
- π assigns key-value properties to vertices and edges
11.1.3 Design Decisions: Why Property Graph
Several graph data models were evaluated during the design of Cognica's graph layer:
| Model | Strengths | Weaknesses |
|---|---|---|
| Property Graph | Rich properties, intuitive, industry standard | Multi-label support varies |
| RDF (Subject-Predicate-Object) | Standardized (SPARQL), reasoning | Verbose, no properties on edges natively |
| Hypergraph | Edges connect any number of vertices | Complexity, limited tooling |
| Bipartite Graph | Clean for specific domains | Too restrictive for general use |
The property graph model was selected for three reasons:
1. Document-vertex isomorphism: Properties on vertices and edges map directly to JSON document fields, enabling reuse of the document storage engine without an impedance mismatch.
2. Industry adoption: Neo4j, Amazon Neptune, Apache AGE, and TigerGraph all use the property graph model. Adopting it enables Apache AGE-compatible Cypher query support (described in the Cypher design document) and reduces the learning curve for users migrating from these systems.
3. Expressive power: The property graph model naturally represents the entities and relationships found in social networks, knowledge graphs, fraud detection, and recommendation systems — the dominant use cases for graph databases.
11.2 Storage Architecture
11.2.1 Dual-Collection Model
Each graph in Cognica is stored as a pair of document collections:
- `{graph}_nodes` — stores all vertices
- `{graph}_edges` — stores all edges
For example, a graph named social creates collections social_nodes and social_edges. This naming convention is enforced by constants in the implementation:
constexpr auto kNodesSuffix = "_nodes";
constexpr auto kEdgesSuffix = "_edges";
(graph_functions.cpp:34-35)
The dual-collection model has several advantages over alternatives:
- Separation of concerns: Node and edge schemas differ structurally, and separate collections allow independent indexing strategies.
- Query selectivity: Queries that access only nodes (or only edges) scan a single collection, avoiding the overhead of filtering a mixed entity store.
- Existing infrastructure: Each collection inherits the full capabilities of Cognica's document storage layer — primary and secondary indexes, cursor-based iteration, transactional operations — without requiring graph-specific storage code.
11.2.2 Document-Vertex Correspondence
Chapter 3 proved that graph posting lists are isomorphic to document posting lists. The storage architecture realizes this isomorphism concretely: every vertex is a document, and every document in a _nodes collection is a vertex. The mapping preserves identity (the document _id is the vertex identifier) and properties (document fields are vertex properties).
This correspondence means that graph vertices are queryable through the standard SQL interface:
-- Query vertices directly as documents
SELECT name, age FROM social_nodes WHERE _label = 'Person' AND age > 25;
The dual nature — vertices as documents, documents as vertices — enables the cross-paradigm queries that Chapter 3 motivated algebraically.
11.2.3 Node Document Format
A node document stores graph metadata in system fields (prefixed with underscore) and user properties as top-level fields:
{
"_id": "social:Person:1001",
"_graph": "social",
"_label": "Person",
"name": "Alice",
"age": 30,
"email": "alice@example.com"
}
The system fields are:
| Field | Type | Description |
|---|---|---|
| `_id` | String | Unique vertex identifier |
| `_graph` | String | Graph namespace membership |
| `_label` | String | Vertex label (e.g., "Person", "Company") |
All other fields are user-defined properties, stored flat at the top level of the document.
11.2.4 Edge Document Format
An edge document follows a similar pattern, with additional fields recording the source and target vertices:
{
"_id": "social:KNOWS:42",
"_graph": "social",
"_type": "KNOWS",
"_source": "social:Person:1001",
"_target": "social:Person:1002",
"since": 2020,
"strength": 0.85
}
The system fields are:
| Field | Type | Description |
|---|---|---|
| `_id` | String | Unique edge identifier |
| `_graph` | String | Graph namespace membership |
| `_type` | String | Edge type (e.g., "KNOWS", "WORKS_AT") |
| `_source` | String | Source vertex `_id` |
| `_target` | String | Target vertex `_id` |
Edge properties (since, strength in the example) are stored flat alongside the system fields.
11.2.5 Identifier Scheme
Vertex and edge identifiers follow the format {graph}:{label_or_type}:{sequence}:
social:Person:1 -- first Person vertex in the social graph
social:Company:1 -- first Company vertex in the social graph
social:KNOWS:1 -- first KNOWS edge in the social graph
social:WORKS_AT:1 -- first WORKS_AT edge in the social graph
The identifier generation is implemented with per-key atomic sequence counters:
auto make_node_id_(const std::string& graph, const std::string& label)
-> std::string {
auto seq = get_next_sequence_(graph + ":" + label);
return graph + ":" + label + ":" + std::to_string(seq);
}
(graph_functions.cpp:163-167)
This scheme provides several benefits:
- Namespacing: The graph prefix prevents identifier collisions across graphs.
- Type encoding: The label/type segment enables efficient prefix-based filtering.
- Monotonicity: Sequence numbers ensure identifiers are monotonically increasing within each category, which provides good LSM-tree write locality.
- Human readability: Unlike opaque integer or UUID identifiers, the structured format aids debugging and interactive querying.
11.2.6 Flat Property Storage
A key design decision is storing properties flat at the top level of the document rather than nested under a properties sub-object. Consider the two alternatives:
Cognica (flat storage):
{
"_id": "social:Person:1",
"_label": "Person",
"_graph": "social",
"name": "Alice",
"age": 30
}
Alternative (nested storage, used by Apache AGE internally):
{
"id": "social:Person:1",
"label": "Person",
"properties": {
"name": "Alice",
"age": 30
}
}
Flat storage was chosen for three reasons:
1. Index utilization: Cognica's secondary indexes operate on top-level document fields. Flat storage means `name` and `age` are directly indexable. Nested storage would require JSON path indexes or specialized index types.
2. Query simplicity: With flat storage, a Cypher expression `n.name` translates to a direct column reference `t0.name` in SQL. Nested storage would require `t0.properties->>'name'`, adding JSON extraction overhead.
3. Document-vertex unification: Flat storage means node documents are structurally identical to ordinary documents with a few extra system fields. This makes graph collections directly queryable with standard SQL, fulfilling the cross-paradigm promise.
The only cost is that system fields (prefixed with _) must be excluded when constructing a properties-only view for API compatibility. The implementation handles this with a simple prefix check:
for (auto it = doc.MemberBegin(); it != doc.MemberEnd(); ++it) {
const auto* name = it->name.GetString();
if (name[0] != '_') {
// This is a user property, include in output
}
}
(graph_functions.cpp:352-355)
11.2.7 Index Strategy
The graph_create function creates both collections with carefully chosen indexes:
Node collection indexes:
| Index Name | Fields | Purpose |
|---|---|---|
| Primary | `_id` | Vertex lookup by identifier |
| `idx_label` | `_label` | Filter vertices by label |
| `idx_graph` | `_graph` | Filter vertices by graph namespace |
Edge collection indexes:
| Index Name | Fields | Purpose |
|---|---|---|
| Primary | `_id` | Edge lookup by identifier |
| `idx_type` | `_type` | Filter edges by type |
| `idx_graph` | `_graph` | Filter edges by graph namespace |
| `idx_src_type` | `_source`, `_type` | Outgoing edge traversal with type filter |
| `idx_tgt_type` | `_target`, `_type` | Incoming edge traversal with type filter |
The composite indexes idx_src_type and idx_tgt_type are particularly important for graph traversal performance. When expanding outgoing edges from a vertex v with a type filter, the query {_source: v, _type: T} can be resolved entirely through the idx_src_type composite index, avoiding a full collection scan.
11.3 Graph Management
11.3.1 Graph Creation
The graph_create function atomically creates both the node and edge collections with their respective schemas and indexes:
SELECT * FROM graph_create('social');
The implementation first checks whether a graph with the given name already exists by verifying the presence of both collections. If either already exists, the operation fails with an error rather than silently succeeding or partially creating the graph.
auto nodes_schema
= db::document::SchemaBuilder {}
.set_collection_name(nodes_collection_name_(*graph_name))
.set_primary_key({kFieldId})
.add_secondary_key("idx_label", {kFieldLabel}, false)
.add_secondary_key("idx_graph", {kFieldGraph}, false)
.build();
(graph_functions.cpp:479-485)
The edge schema includes the composite indexes for efficient traversal:
auto edges_schema
= db::document::SchemaBuilder {}
.set_collection_name(edges_collection_name_(*graph_name))
.set_primary_key({kFieldId})
.add_secondary_key("idx_type", {kFieldType}, false)
.add_secondary_key("idx_graph", {kFieldGraph}, false)
.add_secondary_key("idx_src_type",
{kFieldSource, kFieldType}, false)
.add_secondary_key("idx_tgt_type",
{kFieldTarget, kFieldType}, false)
.build();
(graph_functions.cpp:492-502)
If the edges collection creation fails, the already-created nodes collection is dropped to maintain atomicity.
11.3.2 Graph Deletion
Graph deletion removes both collections and invalidates all adjacency cache entries for the graph:
SELECT * FROM graph_drop('social');
The order of operations matters: the adjacency cache is invalidated first, then the edges collection is dropped, then the nodes collection. This ordering ensures that no traversal operation can read cached data pointing to dropped collections.
GraphAdjacencyCache::instance().invalidate_graph(*graph_name);
auto edges_dropped
= doc_db->drop_collection(edges_collection_name_(*graph_name));
auto nodes_dropped
= doc_db->drop_collection(nodes_collection_name_(*graph_name));
(graph_functions.cpp:538-543)
11.3.3 Graph Discovery
Two functions support graph discovery:
graph_list() returns all graphs in the database by scanning collection names for the _nodes suffix and verifying that a corresponding _edges collection exists:
SELECT * FROM graph_list();
-- Returns: social, company, knowledge_base, ...
graph_exists(name) checks whether a specific graph exists:
SELECT * FROM graph_exists('social');
-- Returns: true or false
The existence check verifies both collections are present, not just one. This prevents a partially created or corrupted graph from appearing as valid.
11.4 Node and Edge Operations
11.4.1 Node Creation
Creating a node inserts a document into the _nodes collection with system fields and user properties:
SELECT * FROM graph_create_node('social', 'Person',
'{"name": "Alice", "age": 30}');
-- Returns: social:Person:1
The implementation generates a unique identifier, constructs a document with system fields, copies user properties to the top level, and inserts the document:
auto node_id = make_node_id_(*graph_name, *label);
auto doc = db::document::Document {};
add_string_member_(doc, kFieldId, node_id);
add_string_member_(doc, kFieldGraph, *graph_name);
add_string_member_(doc, kFieldLabel, *label);
// Copy user properties to top level
for (auto it = properties->MemberBegin();
it != properties->MemberEnd(); ++it) {
// ... copy key-value pair to doc
}
(graph_functions.cpp:634-650)
The function validates that the graph exists before insertion and returns the generated node identifier on success.
11.4.2 Node Retrieval
A node can be retrieved by its identifier:
SELECT * FROM graph_get_node('social', 'social:Person:1');
The result includes three columns compatible with the Apache AGE format:
| Column | Type | Description |
|---|---|---|
| `id` | String | Vertex identifier |
| `label` | String | Vertex label |
| `properties` | JSONB | User properties (system fields excluded) |
The properties column is constructed by iterating over the document and filtering out fields whose names begin with _.
11.4.3 Node Querying
The graph_nodes function queries vertices by label and optional property filters:
-- All Person nodes
SELECT * FROM graph_nodes('social', 'Person');
-- Person nodes with specific properties
SELECT * FROM graph_nodes('social', 'Person', '{"age": 30}');
-- All nodes regardless of label
SELECT * FROM graph_nodes('social', NULL);
The label and property filters are composed into a single query document and executed against the nodes collection, leveraging the idx_label secondary index for label filtering.
11.4.4 Node Update
Node properties can be updated in place:
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"age": 31, "title": "Engineer"}');
The update operation validates that no system fields (those prefixed with _) are being modified, with one exception: node updates allow changing the _label field to support vertex relabeling:
-- Relabel a Person node to Employee
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"_label": "Employee"}');
The $delete operator supports both single-property and multi-property deletion:
-- Delete a single property
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"$delete": "title"}');
-- Delete multiple properties at once
SELECT * FROM graph_update_node('social', 'social:Person:1',
'{"$delete": ["title", "email"]}');
The $delete operator also validates that reserved system fields cannot be removed. The validation logic checks both direct property keys and $delete targets against the reserved _ prefix:
if (key == "$delete") {
// Accepts a single string or an array of strings
// Each path is validated against reserved field names
}
if (key[0] == '_') {
if (allow_label_update && key == kFieldLabel) {
continue; // Node updates permit _label changes
}
return std::unexpected(/* system field error */);
}
(graph_functions.cpp:296-335)
After a successful update, the adjacency cache entry for the node is invalidated to ensure traversal operations see the current state.
11.4.5 Node Deletion
Deleting a node checks for connected edges before proceeding:
-- Fails if the node has any connected edges
SELECT * FROM graph_delete_node('social', 'social:Person:1');
-- Detach delete: removes the node and all connected edges
SELECT * FROM graph_delete_node('social', 'social:Person:1', true);
The detach delete operation collects all edges where the node appears as either source or target, removes them all, then removes the node. This is implemented as a three-step process:
1. Edge collection: Query `{_source: node_id}` and `{_target: node_id}` to find all connected edge identifiers, deduplicating with a hash set.
2. Edge removal: Delete each connected edge by its identifier.
3. Node removal: Delete the node document.
auto collect_connected_edge_ids = [&](auto* edges_view)
-> std::vector<std::string> {
auto source_query = make_query_doc_(kFieldSource, *node_id);
auto target_query = make_query_doc_(kFieldTarget, *node_id);
auto edge_ids = std::vector<std::string> {};
auto seen = absl::flat_hash_set<std::string> {};
// ... collect from both source and target queries
};
(graph_functions.cpp:937-963)
After deletion, the adjacency cache is invalidated for the removed node.
11.4.6 Edge Creation
Creating an edge requires specifying the graph, edge type, source vertex, target vertex, and properties:
SELECT * FROM graph_create_edge('social', 'KNOWS',
'social:Person:1', 'social:Person:2',
'{"since": 2020}');
-- Returns: social:KNOWS:1
The implementation performs referential integrity checks — both source and target nodes must exist before the edge is created. This prevents dangling edges that reference non-existent vertices:
if (auto err = verify_node_exists(*source_id, source_query);
err.has_value()) {
return std::unexpected("graph_create_edge: source node '"
+ *source_id + "' does not exist");
}
(graph_functions.cpp:1106-1108)
After edge insertion, the adjacency cache is invalidated for both the source and target nodes to ensure cache consistency.
11.4.7 Edge Retrieval and Querying
Individual edges are retrieved by identifier:
SELECT * FROM graph_get_edge('social', 'social:KNOWS:1');
The result follows the Apache AGE-compatible format:
| Column | Type | Description |
|---|---|---|
| `id` | String | Edge identifier |
| `start_id` | String | Source vertex identifier |
| `end_id` | String | Target vertex identifier |
| `label` | String | Edge type |
| `properties` | JSONB | User properties |
Edges connected to a specific node can be queried with direction control:
-- Outgoing edges from Alice
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'outgoing');
-- Incoming edges to Alice
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'incoming');
-- All edges (both directions)
SELECT * FROM graph_edges('social', 'social:Person:1', 'KNOWS', 'both');
The direction parameter controls which index is queried:
| Direction | Index Used | Query Field |
|---|---|---|
| `outgoing` | `idx_src_type` | `_source` |
| `incoming` | `idx_tgt_type` | `_target` |
| `both` | Both indexes | `_source` and `_target` |
11.4.8 Edge Update and Deletion
Edge updates follow a similar pattern to node updates but with stricter validation. Unlike node updates, edge updates do not permit changes to any system field (including _type). The properties object must also be non-empty:
SELECT * FROM graph_update_edge('social', 'social:KNOWS:1',
'{"since": 2025}');
Before applying the update, the implementation loads the edge metadata to determine the source and target vertex identifiers. If the specified edge does not exist, the operation fails with an explicit error rather than silently succeeding:
auto load_edge_metadata = [&](auto* edges_view) {
auto cursor = edges_view->find(query);
if (!cursor || !cursor->is_valid()) {
return;
}
found_edge = true;
// Extract source_id and target_id for cache invalidation
};
(graph_functions.cpp:1385-1398)
Edge updates also support the $delete operator for removing individual properties, with the same reserved-field protection as node updates. The $delete operator accepts either a single field name or an array of field names.
Edge deletion loads the edge metadata first (to determine source and target for cache invalidation), then removes the edge document:
SELECT * FROM graph_delete_edge('social', 'social:KNOWS:1');
Both operations invalidate cache entries for the source and target nodes of the affected edge.
11.4.9 Transaction-Aware Operations
All graph CRUD functions participate in the database transaction context when one is available. Each table function receives a TableFunctionContext that may carry an active transaction and workspace identifier. When a transaction context is present, all collection lookups and mutations are routed through the transaction rather than directly through the document database:
auto find_collection_(const TableFunctionContext& ctx,
const std::string& collection_name)
-> std::shared_ptr<db::document::Collection> {
auto* doc_db = resolve_document_db_(ctx);
if (doc_db == nullptr) {
return nullptr;
}
if (has_transaction_context_(ctx)) {
return ctx.transaction->find_collection(
ctx.workspace_id, collection_name);
}
return doc_db->find_collection(collection_name);
}
(graph_functions.cpp:195-208)
This design enables several capabilities:
1. Atomic multi-step graph mutations: A Cypher `CREATE` statement that creates multiple nodes and edges can execute all mutations within a single transaction. If any step fails, the entire operation rolls back.
2. Read-your-own-writes: Within a LATERAL join chain, a graph mutation in one step is visible to subsequent steps in the same query. For example, creating a node and then creating an edge to that node within the same query works correctly because both operations share the same transaction context.
3. Isolation from concurrent queries: Graph mutations in progress are not visible to other sessions until the transaction commits, preventing partial graph states from being observed.
The transaction routing is transparent to the function implementations. Each CRUD function uses resolve_document_db_() to obtain the database handle and begin_collection_transaction_() to obtain a collection-scoped transaction when available. When no transaction context is present (standalone queries outside an explicit transaction), the functions fall back to direct collection access, preserving backward compatibility.
11.4.10 Scalar JSON Helpers for Path Materialization
In addition to the table functions described above, four scalar functions provide JSON-formatted access to graph elements. These functions are used internally by the Cypher query compiler to materialize path expressions, where nodes and edges must be returned as structured JSON values rather than tabular rows:
| Function | Arguments | Returns | Description |
|---|---|---|---|
| `graph_get_node_json` | graph TEXT, id TEXT | JSONB | Single node as JSON object |
| `graph_get_nodes_json` | graph TEXT, ids TEXT[] | JSONB | Array of nodes as JSON array |
| `graph_get_edge_json` | graph TEXT, id TEXT | JSONB | Single edge as JSON object |
| `graph_get_edges_json` | graph TEXT, ids TEXT[] | JSONB | Array of edges as JSON array |
The JSON representation follows a structured format distinct from the flat document storage:
Node JSON format:
{
"id": "social:Person:1",
"label": "Person",
"properties": {"name": "Alice", "age": 30}
}
Edge JSON format:
{
"id": "social:KNOWS:1",
"start_id": "social:Person:1",
"end_id": "social:Person:2",
"label": "KNOWS",
"properties": {"since": 2020}
}
Unlike the flat storage format used in document collections, these JSON representations nest user properties under a properties key. This matches the Apache AGE convention for Cypher path output and enables clean separation of system metadata from user data in query results.
The scalar functions are registered with kVolatile volatility because they read from mutable graph collections. They are also transaction-aware, using the same SessionContext plumbing as the table functions to ensure read-your-own-writes consistency within multi-step Cypher queries.
11.5 Traversal Algorithms
Graph traversal is the distinguishing capability that separates graph databases from document databases. Cognica implements several traversal algorithms as SQL table functions, each optimized for specific access patterns.
11.5.1 Neighbor Discovery: graph_neighbors
The graph_neighbors function performs breadth-first neighbor discovery from a starting vertex:
SELECT * FROM graph_neighbors('social', 'social:Person:1',
'KNOWS', 'outgoing', 2);
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start_node` | TEXT | required | Starting vertex ID |
| `edge_type` | TEXT | NULL | Edge type filter (NULL = all types) |
| `direction` | TEXT | 'outgoing' | 'outgoing', 'incoming', or 'both' |
| `max_depth` | INT | 1 | Maximum traversal depth |
Algorithm: The implementation uses BFS with a visited set to avoid cycles and a frontier deque to track vertices at each depth level. The algorithm operates as follows:
1. Initialize the visited set with the start node and push it onto the frontier with depth 0.
2. Pop the front of the frontier. If the depth equals `max_depth`, skip expansion.
3. For each adjacent vertex not yet visited, add it to the visited set and the result set, then push it onto the back of the frontier with incremented depth.
4. Repeat until the frontier is empty.
auto visited = absl::flat_hash_set<std::string> {};
visited.insert(*start_node);
auto frontier = std::deque<
std::tuple<std::string, int64_t, std::vector<std::string>>> {};
frontier.emplace_back(*start_node, 0,
std::vector<std::string> {*start_node});
while (!frontier.empty()) {
auto [current_node, depth, path] = std::move(frontier.front());
frontier.pop_front();
if (depth >= max_depth) {
continue;
}
// ... expand neighbors
}
(graph_functions.cpp:1593-1608)
Result columns: Each discovered neighbor is returned with:
| Column | Type | Description |
|---|---|---|
| `id` | String | Neighbor vertex ID |
| `label` | String | Neighbor vertex label |
| `properties` | JSONB | Neighbor properties |
| `depth` | INT | Distance from start node |
| `path` | ARRAY | Vertex IDs along the traversal path |
Complexity: For a graph with maximum degree d and traversal depth k:
- Time: O(d^k) — exponential in depth, as each level expands up to d neighbors
- Space: O(d^k) — the frontier and visited set grow proportionally
11.5.2 Configurable Traversal: graph_traverse
The graph_traverse function extends neighbor discovery with support for both BFS and DFS strategies, edge type filtering via arrays, and configurable depth limits:
-- BFS traversal following only KNOWS edges
SELECT * FROM graph_traverse('social', 'social:Person:1',
ARRAY['KNOWS'], 'outgoing', 3, 'bfs');
-- DFS traversal following KNOWS and FOLLOWS edges
SELECT * FROM graph_traverse('social', 'social:Person:1',
ARRAY['KNOWS', 'FOLLOWS'], 'outgoing', 5, 'dfs');
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start_node` | TEXT | required | Starting vertex ID |
| `edge_types` | TEXT[] | NULL | Array of edge types (NULL = all) |
| `direction` | TEXT | 'outgoing' | 'outgoing', 'incoming', or 'both' |
| `max_depth` | INT | 10 | Maximum traversal depth |
| `strategy` | TEXT | 'bfs' | 'bfs' or 'dfs' |
BFS vs DFS: The implementation uses a single std::deque for both strategies. The only difference is where elements are removed:
if (strategy == "bfs") {
state = std::move(frontier.front());
frontier.pop_front();
} else {
state = std::move(frontier.back());
frontier.pop_back();
}
(graph_functions.cpp:1765-1771)
BFS pops from the front (FIFO), guaranteeing that vertices are discovered in order of increasing depth. DFS pops from the back (LIFO), exploring each branch to its maximum depth before backtracking.
Edge type filtering: When the edge type array is non-empty, a hash set is constructed for type membership checks during traversal:
auto type_filter = absl::flat_hash_set<std::string> {};
for (const auto& t : edge_types) {
type_filter.insert(t);
}
(graph_functions.cpp:1738-1741)
During expansion, each edge's type is checked against this filter set. Edges with types not in the set are skipped.
Complexity: Same as neighbor discovery — O(d^k) time and space — but the effective degree d may be reduced by edge type filtering.
11.5.3 Shortest Path: graph_shortest_path
The graph_shortest_path function finds the shortest path between two vertices using bidirectional BFS:
SELECT * FROM graph_shortest_path('path_graph',
'path_graph:Node:1', 'path_graph:Node:5',
NULL, 'outgoing', 10);
Algorithm: Bidirectional BFS simultaneously expands from both the start and end vertices, meeting in the middle. This reduces the search space from O(d^k) to O(d^(k/2)) for a graph with degree d and shortest path length k.
The algorithm maintains two separate search frontiers and visited maps:
auto forward = absl::flat_hash_map<std::string, PathNode> {};
auto backward = absl::flat_hash_map<std::string, PathNode> {};
forward[*start_node] = PathNode {*start_node, "", "", 0};
backward[*end_node] = PathNode {*end_node, "", "", 0};
auto forward_frontier = std::vector<std::string> {*start_node};
auto backward_frontier = std::vector<std::string> {*end_node};
(graph_functions.cpp:1930-1937)
Each PathNode stores the node ID, its parent in the search tree, the edge used to reach it, and its depth. This information enables path reconstruction once the two searches meet.
Alternating expansion: The main loop alternates between expanding the forward and backward frontiers. After each expansion, it checks whether any newly discovered vertex exists in the opposite frontier:
for (int64_t depth = 0; depth < max_depth / 2 + 1; ++depth) {
// Expand forward frontier
expand_frontier(forward_frontier, forward,
kFieldSource, kFieldTarget);
// Check for meeting point
for (const auto& node : forward_frontier) {
if (backward.contains(node)) {
return reconstruct_path(node);
}
}
// Expand backward frontier
expand_frontier(backward_frontier, backward,
kFieldTarget, kFieldSource);
// Check for meeting point
for (const auto& node : backward_frontier) {
if (forward.contains(node)) {
return reconstruct_path(node);
}
}
}
(graph_functions.cpp:2027-2061)
Path reconstruction: When the two frontiers meet at a vertex m, the path is reconstructed by tracing parents backward from m to the start (forward path) and from m to the end (backward path):
1. Trace the forward path: m → ... → start
2. Reverse it to get: start → ... → m
3. Trace the backward path: m → ... → end
4. Concatenate: start → ... → m → ... → end
Trivial case: When start equals end, the function immediately returns a path of length 0 containing just the single vertex.
Result columns:
| Column | Type | Description |
|---|---|---|
| `nodes` | ARRAY | Ordered vertex IDs along the path |
| `edges` | ARRAY | Ordered edge IDs along the path |
| `length` | INT | Number of edges in the path |
| `total_weight` | FLOAT | Sum of edge weights (defaults to hop count) |
Complexity: For a graph with average degree d and shortest path length k:
- Time: O(d^(k/2)) — the square root of unidirectional BFS's O(d^k)
- Space: O(d^(k/2)) — for the two visited maps
The bidirectional approach provides substantial speedup for long paths in dense graphs. For a graph with d = 100 and k = 6:
- Unidirectional BFS: 100^6 = 10^12 vertex expansions
- Bidirectional BFS: 2 × 100^3 = 2 × 10^6 vertex expansions
This represents a roughly millionfold reduction.
11.5.4 All Paths: graph_all_paths
The graph_all_paths function enumerates all simple paths between two vertices within depth constraints:
SELECT * FROM graph_all_paths('path_graph',
'path_graph:Node:1', 'path_graph:Node:5',
NULL, 1, 5, 100);
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `graph` | TEXT | required | Graph name |
| `start` | TEXT | required | Start vertex ID |
| `end` | TEXT | required | End vertex ID |
| `edge_types` | TEXT[] | NULL | Edge type filter |
| `min_depth` | INT | 1 | Minimum path length |
| `max_depth` | INT | 5 | Maximum path length |
| `limit` | INT | 100 | Maximum number of paths to return |
Algorithm: DFS with per-path cycle detection. Unlike the traversal functions which maintain a global visited set, graph_all_paths uses a per-path visited set to allow the same vertex to appear in different paths:
struct DFSState {
  std::string node;
  std::vector<std::string> path_nodes;
  std::vector<std::string> path_edges;
  absl::flat_hash_set<std::string> visited;
};
(graph_functions.cpp:2114-2119)
Each stack entry carries its own visited set. When a neighbor has already been visited on the current path, it is skipped — except when the neighbor is the target vertex, which is allowed to terminate the path:
if (state.visited.contains(neighbor) && neighbor != *end_node) {
  continue;
}
(graph_functions.cpp:2167-2168)
This cycle detection ensures that only simple paths (no repeated vertices except the terminal) are enumerated.
Complexity: In the worst case (a complete graph), the number of simple paths between two vertices grows super-exponentially: approximately e · (n − 2)! for n vertices. The limit parameter bounds the output, but the search itself may explore exponentially many partial paths. The max_depth parameter provides a tighter bound by limiting path length.
11.5.5 Reachability: graph_reachable
The graph_reachable function checks whether a path exists between two vertices, optimized for early termination:
SELECT * FROM graph_reachable('social',
'social:Person:1', 'social:Person:5');
-- Returns: true or false
The implementation delegates to graph_shortest_path and checks whether any path was found:
auto path_result = fn_graph_shortest_path(ctx, args);
if (!path_result) {
  return std::unexpected(path_result.error());
}
auto& paths = std::get<GraphPathSeries>(*path_result);
auto result = JsonValueSeries {};
result.push_back(make_bool_value(!paths.empty()));
(graph_functions.cpp:2199-2208)
This leverages the bidirectional BFS for efficiency while providing a simpler Boolean interface.
Complexity: Same as graph_shortest_path — O(b^(d/2)), where d is the shortest path length. The bidirectional BFS terminates as soon as any path is found, providing implicit early termination.
11.5.6 Complexity Summary
| Function | Algorithm | Time | Space |
|---|---|---|---|
| graph_neighbors | BFS | O(b^d) | O(min(b^d, V)) |
| graph_traverse | BFS/DFS | O(b^d) | O(min(b^d, V)) |
| graph_shortest_path | Bidirectional BFS | O(b^(d/2)) | O(b^(d/2)) |
| graph_all_paths | DFS + cycle detection | O(b^d) worst case | O(b · d) stack entries |
| graph_reachable | Bidirectional BFS | O(b^(d/2)) | O(b^(d/2)) |
Where b = maximum degree, d = traversal depth or path length, V = total vertices.
11.6 Adjacency Cache
Multi-hop graph traversals require repeated edge lookups for each vertex in the frontier. Without caching, a 3-hop traversal expanding 100 neighbors at each level generates over 10,000 index lookups (1 + 100 + 100²) against the edge collection. The adjacency cache dramatically reduces this cost by storing edge adjacency lists in memory.
11.6.1 Cache Architecture
The GraphAdjacencyCache is a process-wide singleton that stores both outgoing and incoming edge lists for cached nodes:
class GraphAdjacencyCache final {
  // ...

 private:
  struct CacheEntry {
    std::vector<EdgeInfo> outgoing;
    std::vector<EdgeInfo> incoming;
    uint64_t access_time;
  };

  std::unordered_map<std::string, CacheEntry> cache_;
  mutable std::mutex mutex_;
  size_t max_entries_ = 1000000;  // 1M nodes
};
(graph_functions.hpp:277-289)
Each EdgeInfo stores the minimal information needed for traversal:
struct EdgeInfo {
  std::string edge_id;
  std::string neighbor_id;
  std::string edge_type;
};
(graph_functions.hpp:216-220)
The cache key is a composite string "graph:node_id", ensuring that identically-named nodes in different graphs are cached independently.
11.6.2 LRU Eviction
When the cache reaches its capacity (default 1M entries), the eviction policy removes the least recently used entry. Each access updates an access_time counter:
void GraphAdjacencyCache::evict_if_needed_() {
  if (cache_.size() < max_entries_) {
    return;
  }
  auto oldest_it = cache_.begin();
  auto oldest_time = oldest_it->second.access_time;
  for (auto it = cache_.begin(); it != cache_.end(); ++it) {
    if (it->second.access_time < oldest_time) {
      oldest_time = it->second.access_time;
      oldest_it = it;
    }
  }
  cache_.erase(oldest_it);
}
(graph_functions.cpp:2530-2548)
The eviction scan is O(n) in the cache size. For production workloads that frequently trigger eviction, this could be improved with a doubly-linked list maintaining LRU order. However, for the current implementation, evictions are rare relative to cache hits because the 1M entry capacity accommodates most practical graph sizes.
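The doubly-linked-list improvement mentioned above is the textbook O(1) LRU construction: a list keeps keys in recency order and a hash map points each key at its list node. A minimal sketch (illustrative only — int values stand in for adjacency lists, and locking is omitted):

```cpp
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

// Textbook O(1) LRU: a doubly-linked list keeps keys in recency order
// (most recent at the front); a hash map points each key at its node.
class LruCache final {
 public:
  explicit LruCache(std::size_t capacity) : capacity_ {capacity} {}

  void put(const std::string& key, int value) {
    if (auto it = index_.find(key); it != index_.end()) {
      it->second->second = value;
      touch_(it->second);
      return;
    }
    if (order_.size() == capacity_) {
      index_.erase(order_.back().first);  // evict least recently used
      order_.pop_back();
    }
    order_.emplace_front(key, value);
    index_[key] = order_.begin();
  }

  std::optional<int> get(const std::string& key) {
    auto it = index_.find(key);
    if (it == index_.end()) return std::nullopt;
    touch_(it->second);  // splice does not invalidate the iterator
    return it->second->second;
  }

 private:
  using Node = std::pair<std::string, int>;

  void touch_(std::list<Node>::iterator it) {
    order_.splice(order_.begin(), order_, it);  // move to front, O(1)
  }

  std::size_t capacity_;
  std::list<Node> order_;
  std::unordered_map<std::string, std::list<Node>::iterator> index_;
};
```

Both put and get are O(1): the map locates the node, and splice relinks it at the front without copying.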
11.6.3 Cache Warming
The graph_warm_cache function pre-populates the cache by iterating over all nodes in a graph and loading their adjacency lists:
SELECT * FROM graph_warm_cache('social', 100000);
-- Returns: cached = 100000, time_ms = 1500
The implementation scans the nodes collection, and for each node, queries its outgoing and incoming edges:
for (const auto& node_id : node_ids) {
  auto outgoing = std::vector<EdgeInfo> {};
  auto incoming = std::vector<EdgeInfo> {};

  auto out_query = make_query_doc_(kFieldSource, node_id);
  auto out_cursor = edges_coll->find(out_query);
  for (; out_cursor->is_valid(); out_cursor->next()) {
    // ... collect EdgeInfo
  }

  auto in_query = make_query_doc_(kFieldTarget, node_id);
  auto in_cursor = edges_coll->find(in_query);
  for (; in_cursor->is_valid(); in_cursor->next()) {
    // ... collect EdgeInfo
  }

  cache_node(graph, node_id, outgoing, incoming);
}
(graph_functions.cpp:2468-2503)
Cache warming is intended for use during low-traffic periods, such as application startup or after a graph bulk load. The max_nodes parameter limits the number of nodes cached to control memory usage and warming time.
11.6.4 Cache Invalidation
Cache consistency is maintained through eager invalidation on write operations. Every mutation that affects a node's adjacency triggers invalidation:
| Operation | Invalidation Scope |
|---|---|
| graph_create_edge | Source node, Target node |
| graph_update_edge | Source node, Target node |
| graph_delete_edge | Source node, Target node |
| graph_update_node | The updated node |
| graph_delete_node | The deleted node |
| graph_drop | All nodes in the graph |
The invalidation is per-node, not per-edge. When an edge is created between vertices u and v, both u's and v's cache entries are removed entirely (not surgically updated). This conservative approach avoids subtle consistency bugs at the cost of occasional additional cache misses:
auto& cache = GraphAdjacencyCache::instance();
cache.invalidate_node(*graph_name, *source_id);
cache.invalidate_node(*graph_name, *target_id);
(graph_functions.cpp:1152-1154)
For graph_drop, all entries with the graph's prefix are removed:
void GraphAdjacencyCache::invalidate_graph(const std::string& graph) {
  auto lock = std::lock_guard {mutex_};
  auto prefix = graph + ":";
  for (auto it = cache_.begin(); it != cache_.end();) {
    if (it->first.compare(0, prefix.size(), prefix) == 0) {
      it = cache_.erase(it);
    } else {
      ++it;
    }
  }
}
(graph_functions.cpp:2513-2523)
11.6.5 Thread Safety
All cache operations are protected by a mutex:
auto GraphAdjacencyCache::get_outgoing(
    const std::string& graph, const std::string& node_id)
    -> std::optional<std::vector<EdgeInfo>> {
  auto lock = std::lock_guard {mutex_};
  auto key = make_key_(graph, node_id);
  auto it = cache_.find(key);
  if (it == cache_.end()) {
    ++miss_count_;
    return std::nullopt;
  }
  ++hit_count_;
  it->second.access_time = ++access_counter_;
  return it->second.outgoing;
}
(graph_functions.cpp:2397-2410)
The single-mutex design is simple and correct. For workloads with very high traversal concurrency, a sharded lock design (partitioning the cache by hash of the key) could reduce contention. However, cache lookups are fast relative to the I/O they avoid, so the mutex overhead is typically negligible.
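For illustration, a sharded-lock cache along the lines described might look like the following sketch (hypothetical, not Cognica code): the key space is partitioned by hash, each shard owns an independent mutex, and lookups for keys that land in different shards never contend on the same lock.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Sharded cache: keys are partitioned by hash across N shards, each
// guarded by its own mutex, reducing contention under high concurrency.
template <std::size_t NumShards = 16>
class ShardedCache final {
 public:
  void put(const std::string& key, int value) {
    auto& shard = shard_(key);
    auto lock = std::lock_guard {shard.mutex};
    shard.map[key] = value;
  }

  std::optional<int> get(const std::string& key) {
    auto& shard = shard_(key);
    auto lock = std::lock_guard {shard.mutex};
    auto it = shard.map.find(key);
    if (it == shard.map.end()) {
      return std::nullopt;
    }
    return it->second;
  }

 private:
  struct Shard {
    std::mutex mutex;
    std::unordered_map<std::string, int> map;
  };

  // Route a key to its shard by hashing; a key always maps to one shard.
  Shard& shard_(const std::string& key) {
    return shards_[std::hash<std::string> {}(key) % NumShards];
  }

  std::array<Shard, NumShards> shards_;
};
```

The trade-off is that whole-cache operations (statistics, graph-wide invalidation) must now visit every shard, taking each lock in turn.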
11.6.6 Cache Statistics and Monitoring
The graph_cache_stats function exposes cache performance metrics:
SELECT * FROM graph_cache_stats();
| Column | Type | Description |
|---|---|---|
| entries | INT | Current number of cached nodes |
| hits | INT | Total cache hit count |
| misses | INT | Total cache miss count |
| hit_rate | FLOAT | Hit rate as a decimal (0.0 to 1.0) |
These metrics enable operators to assess whether the cache is sized appropriately and whether cache warming is effective for their workload.
11.6.7 Performance Impact
The adjacency cache provides the greatest benefit for multi-hop traversals. Consider a 3-hop BFS traversal on a graph where each node has 50 outgoing edges:
Without cache: Each hop requires an index lookup against the edges collection.
- Total lookups: 1 + 50 + 50² = 2,551
- Each lookup involves LSM-tree index traversal: multiple disk reads or memory-mapped page accesses
With warm cache: Each hop requires a hash table lookup.
- Total lookups: 2,551 (same count)
- Each lookup: single hash table probe, expected O(1) time, no disk I/O
The speedup is proportional to the ratio of index lookup time to hash table lookup time, which can be 100x or more depending on whether the index data is in the OS page cache.
11.7 Aggregation Operations
11.7.1 Degree Counting: graph_degree
The graph_degree function counts the number of edges connected to a vertex, optionally filtered by edge type and direction:
-- Count all edges (both directions)
SELECT * FROM graph_degree('social', 'social:Person:1', NULL, 'both');
-- Count only outgoing KNOWS edges
SELECT * FROM graph_degree('social', 'social:Person:1', 'KNOWS', 'outgoing');
The implementation iterates over matching edges and counts them:
auto count_edges = [&](const char* field) {
  auto query = db::document::Document {};
  add_string_member_(query, field, *node_id);
  if (edge_type) {
    add_string_member_(query, kFieldType, *edge_type);
  }
  auto cursor = edges_coll->find(query);
  for (; cursor->is_valid(); cursor->next()) {
    ++degree;
  }
};
(graph_functions.cpp:2248-2260)
The degree computation leverages the composite indexes (idx_src_type, idx_tgt_type) when an edge type filter is provided, making the operation efficient even for high-degree nodes.
Complexity: O(deg(v)), where deg(v) is the degree of vertex v in the specified direction. The index limits the scan to edges matching the filter.
11.7.2 Common Neighbor Computation: graph_common_neighbors
The graph_common_neighbors function finds vertices that are outgoing neighbors of both input vertices:
SELECT * FROM graph_common_neighbors('social',
'social:Person:1', 'social:Person:2', 'KNOWS');
The algorithm is a straightforward set intersection:
- Collect outgoing neighbors of vertex a into set A.
- Collect outgoing neighbors of vertex b into set B.
- Return A ∩ B — vertices present in both sets.
auto neighbors_a = get_neighbors(*node_a);
auto neighbors_b = get_neighbors(*node_b);

for (const auto& neighbor : neighbors_a) {
  if (neighbors_b.contains(neighbor)) {
    // Fetch neighbor node and add to result
  }
}
(graph_functions.cpp:2340-2345)
The implementation uses absl::flat_hash_set for expected O(1)-time membership testing, making the intersection operation efficient.
Complexity: O(deg(a) + deg(b) + k · F), where k is the number of common neighbors and F is the cost of fetching a node document.
Applications: Common neighbor counting is a fundamental building block for link prediction algorithms. The Jaccard coefficient between two vertices a and b is computed as:
J(a, b) = |Γ(a) ∩ Γ(b)| / |Γ(a) ∪ Γ(b)|
where Γ(v) denotes the neighbor set of v. The Adamic-Adar index weights each common neighbor by the inverse log of its degree:
AA(a, b) = Σ over c ∈ Γ(a) ∩ Γ(b) of 1 / log(deg(c))
Both metrics can be computed efficiently using graph_common_neighbors combined with graph_degree.
11.8 Design Trade-offs
11.8.1 Flat vs Nested Property Storage
The choice to store properties flat alongside system fields (rather than nested under a properties key) is Cognica's most distinctive deviation from other graph databases.
Advantages of flat storage:
- Direct indexability of property fields
- Simpler SQL translation (n.name becomes t.name, not t.properties->>'name')
- Unified document-graph model (graph vertices are ordinary documents)
- No JSON extraction overhead for property access
Advantages of nested storage (used by Apache AGE internally):
- Clean separation between system and user fields
- No naming conflicts between user properties and system fields
- Simpler introspection (all user data is under one key)
Cognica mitigates the disadvantage of flat storage by reserving the _ prefix for system fields and validating that user properties cannot overwrite system fields during updates:
if (key[0] == '_') {
  return std::unexpected(std::string {fn_name}
                         + ": cannot update reserved graph system fields");
}
(graph_functions.cpp:328-330)
11.8.2 Collection-Based vs Native Graph Storage
Cognica stores graphs as document collections rather than implementing a purpose-built graph storage engine. This is a fundamentally different approach from systems like Neo4j, which uses a native graph storage format with fixed-size records and pointer-based adjacency traversal.
Collection-based storage (Cognica):
| Aspect | Characteristic |
|---|---|
| Adjacency traversal | Index lookup per hop |
| Vertex access | Index lookup by _id |
| Storage overhead | Standard document overhead |
| Implementation cost | Minimal (reuses existing engine) |
| Transaction support | Full (inherits from document layer) |
| Cross-paradigm queries | Native (vertices are documents) |
Native graph storage (Neo4j-style):
| Aspect | Characteristic |
|---|---|
| Adjacency traversal | Pointer chase, O(1) per hop |
| Vertex access | Direct by internal ID |
| Storage overhead | Fixed-size records, compact |
| Implementation cost | Substantial (custom engine) |
| Transaction support | Custom implementation required |
| Cross-paradigm queries | Requires data duplication |
Cognica's approach trades per-hop traversal speed for implementation simplicity, full transaction support, and seamless cross-paradigm query composition. The adjacency cache closes much of the traversal performance gap by providing in-memory adjacency lookups that approach the speed of native pointer-based traversal.
11.8.3 Cache vs Disk Trade-offs
The adjacency cache introduces a classic space-time trade-off:
| Configuration | Memory | Traversal Speed | Write Overhead |
|---|---|---|---|
| No cache | Minimal | Slow (disk/index per hop) | None |
| Selective cache | Moderate | Fast for cached nodes | Per-write invalidation |
| Full cache | High (proportional to edge count) | Fastest | Per-write invalidation |
The 1M entry default limit assumes approximately 100 bytes per entry (edge lists for both directions), yielding roughly 100 MB of cache memory. This accommodates graphs with up to 1 million active vertices. For larger graphs, the LRU eviction policy ensures that frequently accessed vertices remain cached while infrequently accessed vertices are evicted.
11.8.4 Comparison with Existing Systems
| Feature | Cognica | Apache AGE | Neo4j |
|---|---|---|---|
| Storage backend | Document collections (LSM-tree) | PostgreSQL heap tables | Native fixed-size records |
| Property storage | Flat (top-level fields) | Nested (agtype) | Native property store |
| Index type | B-tree secondary indexes | PostgreSQL indexes | Native index-free adjacency |
| Query language | SQL + table functions + Cypher | SQL + Cypher | Cypher |
| Transaction model | Document-level MVCC | PostgreSQL MVCC | Custom MVCC |
| Cross-paradigm | Native (same storage layer) | Via PostgreSQL tables | Requires connector |
| Adjacency cache | LRU in-memory cache | None (relies on PostgreSQL buffer) | Index-free adjacency |
| Traversal method | BFS/DFS via collection queries | Custom scan operators | Native traversal engine |
Apache AGE stores graph data in PostgreSQL heap tables with a custom agtype data type that wraps JSON-like values. Each vertex and edge is a row with an agtype properties column. Cognica's flat property storage avoids the agtype extraction overhead but requires system field naming discipline.
Neo4j uses a native storage format where each vertex record contains a pointer to its first relationship, and each relationship record contains pointers to the next relationships for both the source and target vertices. This forms a doubly-linked list that enables adjacency traversal without index lookups. Cognica's adjacency cache provides similar in-memory traversal when warmed, at the cost of cache memory.
11.9 Summary
This chapter presented Cognica's graph storage and operations layer, tracing the path from the mathematical property graph model defined in Chapter 3 to a concrete implementation built on the document storage engine described in Chapter 6.
The key architectural decisions are:
- Dual-collection model: Each graph is stored as a pair of document collections (_nodes and _edges), inheriting all document storage capabilities including indexing, transactions, and SQL queryability.
- Flat property storage: User properties are stored as top-level document fields rather than nested under a properties key. This enables direct indexing, simple SQL translation, and a unified document-graph data model.
- Structured identifiers: The {graph}:{label}:{sequence} naming scheme provides namespace isolation, type encoding, and human readability while maintaining monotonic ordering for LSM-tree locality.
- Traversal algorithms: BFS neighbor discovery, configurable BFS/DFS traversal, bidirectional BFS shortest path, DFS all-paths enumeration with per-path cycle detection, and BFS reachability checking cover the fundamental graph query patterns.
- Adjacency cache: An LRU cache storing outgoing and incoming edge lists for up to 1 million nodes dramatically accelerates multi-hop traversals by replacing index lookups with hash table probes.
- Eager invalidation: Write operations eagerly invalidate affected cache entries, maintaining consistency between the cache and the persistent storage layer.
- Transaction-aware CRUD: All graph mutation functions participate in the database transaction context, enabling atomic multi-step operations and read-your-own-writes consistency within complex query pipelines.
- Scalar JSON helpers: Dedicated scalar functions materialize graph elements as structured JSON values for Cypher path expressions, bridging the gap between the flat document storage model and the nested representation expected by graph query languages.
The graph layer is accessible through SQL table functions, enabling composition with standard SQL operations:
-- Cross-paradigm query: full-text search + graph traversal
SELECT n.id, n.properties->>'name' AS name, n.depth
FROM social_nodes docs
JOIN fts_search('social_nodes', 'database expert') s ON docs._id = s._id
CROSS JOIN LATERAL graph_neighbors('social', docs._id, 'KNOWS', 'both', 2) n
WHERE n.depth <= 2
ORDER BY s._score DESC;
Graph mutations can also be composed through LATERAL joins, enabling complex multi-step operations within a single SQL statement:
-- Create a node and an edge in a single query pipeline
WITH src AS (SELECT 'social:Person:1' AS source_id)
SELECT refreshed.since
FROM src
CROSS JOIN LATERAL graph_create_node(
'social', 'Person', '{"name": "Bob"}') AS cn(id)
CROSS JOIN LATERAL graph_create_edge(
'social', 'KNOWS', src.source_id, cn.id, '{}') AS ce(id)
CROSS JOIN LATERAL graph_update_edge(
'social', ce.id, '{"since": 2024}') AS ue(success)
JOIN LATERAL (
SELECT e.since FROM social_edges e WHERE e._id = ce.id
) refreshed ON true;
This pattern is the foundation for Cypher-to-SQL compilation: each Cypher clause (CREATE, MERGE, SET) is lowered to a LATERAL join step that can read results from previous steps and feed values to subsequent steps, all within a single transactional context.
This composability is the practical realization of the algebraic unification described in Chapter 3: graph posting lists, document posting lists, and full-text search posting lists all participate in the same query evaluation framework, enabling queries that span paradigms without data duplication or cross-system orchestration.