The Missing Layer: A Structural Gap in the AI Infrastructure Stack

The Layer No One Owns
If you sketch the architecture of a modern AI system, two layers stand out clearly. At the top: models — LLMs, embedding models, multimodal architectures, and the agent frameworks that orchestrate them. At the bottom: hardware — GPUs, TPUs, NPUs, and the new generation of on-device accelerators shipping in Apple Silicon and Qualcomm Hexagon chips. Both layers attract hundreds of billions of dollars of investment, enormous volumes of research, and crowded fields of competing startups.
Between them, there is a third layer that should be clearly defined but isn't. The layer a model must pass through to reach real user data. The layer that, sitting on top of the hardware, stores, indexes, and queries data. Traditionally we have called this the database layer.
This layer is broken. More precisely, it exists in an ambiguous state — neither absent nor properly filled. In the datacenter, the situation is one of fragmentation: relational databases, full-text search engines, vector databases, and graph databases each followed separate historical trajectories, and modern AI applications stitch them together with application-layer glue code. On-device, the layer is nearly absent: each app keeps its data in a private silo, the OS provides only a permissions model, and the model sees only whatever fits in a context window.
This essay is an observation about why this layer has ended up in its current shape, why this is a structural problem rather than a transitional one that will resolve as the market matures, and what the conditions for a proper solution are. At the end, I present UQA and the Cognica engine as one concrete response to these conditions — as one data point, not the unique answer.
1. A History of Fragmentation
What we now call "data infrastructure" is a collection of engines built in different eras to solve different problems.
Relational databases emerged in the 1970s on top of Codd's relational algebra, designed for enterprise business data — finance, HR, inventory. ACID transactions, joins, aggregation, normalized schemas. This paradigm has been the default meaning of "database" for fifty years.
Full-text search crystallized in the 1990s as the rise of the web brought decades of information retrieval research into practical use. TF-IDF, BM25, inverted indexes, posting lists. Lucene became the standard in the 2000s, and Elasticsearch layered operational usability on top to become the de facto standard for enterprise search. That it ran in a different engine from the relational database was taken as self-evident.
Vector search is much more recent. Word2Vec, Transformers, and large language models made it practical to map text into continuous vector spaces, turning "find semantically similar things" into a real query primitive. ANN algorithms like HNSW and IVF made fast nearest-neighbor search feasible in these spaces, and dedicated vector databases — Pinecone, Weaviate, Qdrant, Milvus — appeared. This is a phenomenon of the early 2020s.
Graph databases followed yet another trajectory. Neo4j, Apache AGE, TigerGraph — systems built for relationship-heavy data (social graphs, knowledge graphs, dependency graphs) with their own query languages (Cypher, Gremlin, SPARQL) and storage structures.
Spatial databases are another parallel track. PostGIS, Oracle Spatial, SpatiaLite — systems that grew out of geographic information system (GIS) requirements, providing R-Tree indexes, spatial predicates (intersects, within, contains), and distance functions as specialized operations. Location-based services have made spatial queries an everyday feature of mobile applications, but spatial engines still live in isolation from the other paradigms.
These five paradigms evolved over decades, each with its own academic lineage, implementation tradition, and vendor ecosystem. There was no reason for them to mix in a single system. Each was optimized within the scope of what it did well, and the problems of other paradigms were solved by calling a different engine.
Then AI broke this stable division of labor.
2. The New Demands of AI Workloads
If you look carefully at the queries that arise the moment an LLM-based application actually touches real data, nearly every one of them demands multiple paradigms simultaneously. Consider an example.
An agent receives the user's question and needs to find: "the conversations I had with Alex last week, and among those, the ones where we discussed the project from back in March." This single query requires all of the following:
- Relational filter: sender = Alex, time range = last 7 days
- Full-text search: keyword matching on "project"
- Vector search: semantic similarity to descriptions of the project referenced in March
- Graph traversal: following the "conversation with Alex" relationship
- Temporal reasoning: interpreting what "back in March" refers to
In today's stack, the application code handles this roughly as follows. First, pull Alex's messages from the past week out of the relational database. Then feed each message into a vector database to compute similarity against the "March project description." Also send a keyword match for "project" to the full-text engine. Combine the three result sets in application memory. The combination is usually implemented as a weighted sum of scores returned by each engine.
The problem with this approach is not surface inconvenience. It is a structural defect.
First, latency accumulates linearly. Each engine call requires a network round-trip, and if result sets are large, data transfer itself becomes the bottleneck. When an agent workflow chains dozens of such queries, user-perceived latency stretches from seconds into tens of seconds.
Second, score fusion is not principled. A BM25 score, a cosine similarity, and a PageRank value live on different scales and different distributions. Combining them as 0.3 * BM25 + 0.5 * cosine + 0.2 * pagerank is an expedient with no mathematical justification whatsoever. The weights are mostly chosen by intuition and A/B testing, and must be retuned whenever the data distribution shifts. Principled inference about the joint distribution of multiple signals is impossible through weighted summation.
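The fragility is easy to see in a toy example (the scores, weights, and normalization constant below are hypothetical, not taken from any real system):

```python
# Why weighted sums of raw signals are fragile: BM25 is unbounded,
# cosine similarity lives in [0, 1], so the text signal dominates
# unless the weights secretly absorb the scale difference.

def fuse(bm25, cosine, w_text=0.5, w_vec=0.5):
    return w_text * bm25 + w_vec * cosine

# Document A: mediocre text match, excellent semantic match.
# Document B: strong text match, weak semantic match.
a = fuse(bm25=8.0, cosine=0.95)   # 4.475
b = fuse(bm25=14.0, cosine=0.20)  # 7.1
assert b > a  # B wins purely because BM25's scale dwarfs cosine's

# "Fix" it by normalizing BM25 with an observed maximum -- a property
# of the current corpus, not of the query semantics:
MAX_BM25 = 25.0
a2 = fuse(bm25=8.0 / MAX_BM25, cosine=0.95)   # 0.635
b2 = fuse(bm25=14.0 / MAX_BM25, cosine=0.20)  # 0.38
assert a2 > b2  # the ranking flips once the constant changes
```

The point is not that one ranking is right and the other wrong; it is that the outcome hinges on a constant with no probabilistic meaning, which must be re-measured whenever the corpus shifts.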
Third, storage is duplicated. The same document lives once in the relational DB, once in the full-text engine, once in the vector DB. Synchronizing the three engines becomes a project of its own, and a recurring source of consistency bugs. Storage cost is literally tripled.
Fourth, no engine's optimizer can perform global optimization. The relational DB optimizes its own query. The vector DB optimizes its own. The "combination" that happens at the application layer is outside the view of any of them. Truly important optimizations — for example, "restrict vector search to 1,000 candidates, then score only those that pass the relational filter" — have to be expressed manually, and most teams fail to get them right.
The common root of these four problems: the data model is being reassembled at the application layer. But the application layer does not know the real shape of the data. That is the engine's job. When combination happens in a layer ignorant of the data's actual structure, the quality of the combination is left to luck.
3. Why Current "Solutions" Aren't Actually Solutions
The problem has not gone unrecognized. Many attempts exist, each with partial progress. But none of them solves it structurally.
RAG frameworks (LangChain, LlamaIndex, and similar). These are orchestration layers, not storage layers. They sit atop N specialized engines and inherit every impedance mismatch between them. In practice, these projects amount to Python-level glue code labeled a "framework." And they are nearly unusable on-device: Python runtimes don't fit on mobile, and there isn't memory to run multiple engines at once.
Vector databases with text and filters bolted on (Pinecone, Weaviate, Qdrant, Milvus). These started as vector databases and grafted hybrid search on as a "feature extension." Without an algebraic foundation, score fusion remains weighted summation. Relational capability rarely rises above filters, and graph is mostly absent. None were designed for embedded deployment.
Relational databases with vector bolted on (pgvector, MySQL HeatWave, SQL Server). The reverse approach. Genuine relational capability is present, but vector functionality is bolt-on. Full-text search is usually at a legacy tsquery level; modern BM25 with probabilistic calibration is essentially absent. There is no algebraic integration.
Lucene/Elasticsearch with vector. Best-in-class full-text search, decent vector support, no relational capability, no graph, a heavy JVM footprint. On-device is out of the question.
Specialized "AI databases" (LanceDB, Chroma, Marqo). Vector-first, everything else secondary. Often no durability guarantees for transactional workloads. Theoretical foundations are thin.
The common thread: every one of these is a local optimum of an existing paradigm. None of them started from the question "What is the right unified framework for data access in the AI era?" They either grafted features onto existing engines, or, when building new engines, kept one paradigm at the center and treated the others as peripherals.
This is also why it isn't a problem that will resolve as the market matures. A mature vector database will still treat relational capability as secondary. A refined pgvector will still inherit the limits of ad-hoc score fusion. Structural problems do not heal with time.
4. On-Device: The Layer Is Simply Absent
Everything so far has been about the datacenter. On-device, the layer doesn't just fragment — it barely exists.
Look carefully at the AI features running on iOS and Android devices today — Apple Intelligence, Gemini Nano, the various app-embedded models — and you will find that they all hit the same fundamental wall. The model is on the device, but there is no data layer the model can reach into.
Devices accumulate years of personal data: messages, emails, documents, calendars, photo metadata, browser history, app activity logs, location histories. This is an enormous reservoir of context, and with proper access it could be the foundation for a genuinely personalized AI experience. But under the current architecture, there is no way to access this data in a unified fashion.
Each app keeps its data in a private silo — this is partly by design, for privacy. The OS provides only a permissions model; cross-app data access happens only through explicit intents. APIs like Spotlight, Core Data, Core Spotlight, and App Intents attempt partial integration in their own ways, but these are "search" or "intent handling" at best, not queries. Vector search across multiple apps' data? Graph traversal? Principled multimodal fusion? None of it exists.
Worse, the resource constraints of mobile devices make it impossible to simply port server-side solutions. You cannot run an Elasticsearch instance on a phone. There isn't enough battery to run Postgres + pgvector + Neo4j simultaneously. The design pattern where each engine maintains its own memory-resident indexes collapses in an environment where background apps are forcibly reclaimed by the OS.
The result is that today's on-device LLMs function as stateless chatbots. The only way to give them user context is to stuff it into the context window; the context window has a fixed size; it has to be refilled from scratch on every query. Years of personal data sit right next to the model, but the model cannot query them. This is a failure of architecture, not a failure of the model.
5. What a Proper Solution Must Look Like
With the problem stated clearly, let me enumerate the structural conditions a proper solution must satisfy. This is not a prelude to recommending a particular system. It is an attempt to establish evaluation criteria that apply to anything serious in this space.
Condition 1: Algebraic unification, not bolt-on. Relational, full-text, vector, graph, and spatial queries must all be expressible on the same mathematical structure. Each paradigm must be an operator under a single optimizer, not a separate engine. Global optimization is only possible on this basis.
Condition 2: Principled score fusion. Scores from different signals must be expressible as calibrated probabilities. Instead of arbitrary weighted sums, mathematically grounded combination rules — combination in log-odds space, say, or combination of Bayesian posteriors — must be supported. Without this, hybrid ranking remains a perpetual tuning exercise.
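A minimal sketch of what such a combination rule can look like, assuming each signal has already been calibrated to a relevance probability. This is an illustration of log-odds combination under a naive-Bayes independence assumption, not Cognica's actual fuse_log_odds implementation:

```python
import math

def logit(p):
    """Probability -> log-odds."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    """Log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse_log_odds(probs, prior=0.5):
    """Combine independent calibrated relevance probabilities.

    Assuming conditional independence of the signals, evidence adds in
    log-odds space: each signal contributes (logit(p) - logit(prior))
    on top of the prior. No per-signal weights are needed, because the
    calibration step already put every signal on the same scale.
    """
    total = logit(prior) + sum(logit(p) - logit(prior) for p in probs)
    return sigmoid(total)

# Two moderately positive signals reinforce each other:
assert fuse_log_odds([0.7, 0.7]) > 0.7

# A positive and an equally negative signal cancel back to the prior:
assert abs(fuse_log_odds([0.7, 0.3]) - 0.5) < 1e-9
```

Contrast this with the weighted sum: the behavior (reinforcement, cancellation) falls out of the probability model instead of being tuned in by hand.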
Condition 3: The same engine from server to embedded. If the datacenter and the on-device environment run engines of different architectures, developers have to learn two worlds and query semantics diverge between them. SQLite became the de facto standard for "embedded relational database" precisely because the same engine ran everywhere. The data layer of the AI era needs the same property.
Condition 4: Theoretical grounding. Integration of this scope cannot be achieved by accumulating "nice-to-have features." A mathematical definition of what is being unified must come first. Can posting lists serve as a universal abstraction? Can relational algebra, Boolean algebra, and vector spaces be expressed in a single structure? These questions must be answered first for the engine's implementation to hold a coherent shape.
Condition 5: Production-grade implementation. Theory alone convinces no one. Performance under real workloads, transactional guarantees, compatibility with existing tools (PostgreSQL wire protocol and similar), and operational experience are all necessary.
Condition 6: Verifiable openness. The data layer is one of the deepest dependencies an application has. The code at this layer must be open, the theory must be published, and external validation must exist. A closed commercial system is inadequate for solving problems of this depth — if the community cannot read, question, and contribute to the code, integration at this scale cannot be trusted.
Does any system currently on the market satisfy all six conditions? As far as I know, none does. Systems that satisfy some of them exist, but no system satisfies them all.
6. One Concrete Response: UQA and the Cognica Engine
One concrete response to this gap is Unified Query Algebra (UQA) and its production implementation, the Cognica engine. The two names are deliberately kept separate. UQA is a mathematical specification whose semantic equivalence is maintained across three implementations — in Python, TypeScript, and C++. The Cognica engine is the C++23 production implementation of that specification. I present this not as the unique answer, but as a data point showing how the six conditions above can be addressed precisely.
Response to Condition 1 — Algebraic Unification. UQA takes posting lists (ordered sequences of (id, payload) pairs) as its universal abstraction. Relational filters, BM25 scoring, vector search, graph traversal, and spatial queries all compile into operators over posting lists.
The difference this unification makes is clearest when the same query is written two ways side by side. Consider the following: "Papers from 2020 onward whose titles relate to 'attention,' are semantically close to a given embedding vector, and are reachable within 2 hops via citation from a specific paper, ranked by score." Implemented on today's fragmented stack, it looks roughly like this:
```python
# 1. Candidate retrieval from vector DB (year filter applied upfront)
vec = pinecone.query(
    vector=query_emb,
    top_k=200,  # over-fetch to leave room for downstream filters
    filter={"year": {"$gte": 2020}},
)
cand_ids = [r.id for r in vec.matches]
vec_score = {r.id: r.score for r in vec.matches}

# 2. BM25 scoring on the same candidates via Elasticsearch
es = elasticsearch.search(
    index="papers",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"title": "attention"}}],
                "filter": [{"terms": {"_id": cand_ids}}],
            }
        },
        "size": 200,
    },
)
text_score = {h["_id"]: h["_score"] for h in es["hits"]["hits"]}

# 3. Reachability within 2 hops from paper 1, via Neo4j
g = neo4j.run(
    """
    MATCH (src:Paper {id: $s})-[:CITED_BY*1..2]->(p:Paper)
    WHERE p.id IN $ids
    RETURN DISTINCT p.id AS id
    """,
    s=1,
    ids=cand_ids,
)
graph_hit = {r["id"] for r in g}

# 4. Display metadata from PostgreSQL
rows = pg.query(
    "SELECT id, title FROM papers WHERE id = ANY(%s)",
    (cand_ids,),
)
title = {r.id: r.title for r in rows}

# 5. Score fusion in application memory
#    - Vector: cosine similarity ~ [0, 1]
#    - Text:   BM25, unbounded; normalize by observed maximum
#    - Graph:  binary indicator
#    Weights hand-tuned through offline A/B testing
MAX_BM25 = 25.0
out = []
for id in cand_ids:
    if id not in title:
        continue
    s = (
        0.40 * vec_score.get(id, 0.0)
        + 0.40 * (text_score.get(id, 0.0) / MAX_BM25)
        + 0.20 * (1.0 if id in graph_hit else 0.0)
    )
    out.append((title[id], s))
out.sort(key=lambda x: -x[1])
```
Five systems (vector DB, full-text engine, graph DB, relational DB, application runtime), four network round-trips, and three weights chosen with no mathematical basis. A number of hidden arbitrary decisions are baked into this code. Why should cand_ids come from the vector database? Either of the other two systems could have played the role of "primary candidate generator," and the choice completely changes which candidates the remaining systems will miss. How was top_k=200 in the vector search decided? Too small and genuinely relevant papers never reach downstream filters; too large and latency accumulates. MAX_BM25 = 25.0 is a constant tuned by observation, and must be retuned whenever the data distribution shifts. And because this entire body of code lives outside any engine's optimizer, no one can optimize this query as a whole. The responsibility falls on the author alone.
The same query, in UQA, is expressed as follows:
```sql
SELECT title, _score
FROM papers
WHERE fuse_log_odds(
        text_match(title, 'attention'),
        bayesian_knn_match(embedding, $1, 10),
        traverse_match(1, 'cited_by', 2)
      )
  AND year >= 2020
ORDER BY _score DESC;
```
Text matching, vector similarity, graph reachability, and relational filtering are all reordered by cost under a single optimizer, and processed in a single execution engine as a single transaction. The combination rule of fuse_log_odds is a mathematically justified joint distribution built on Bayesian BM25's probabilistic calibration, not an arbitrary weighted sum. Arbitrary constants like top_k and MAX_BM25 do not appear. Neither does the decision about candidate-generator ordering — that is the optimizer's job. The four problems enumerated earlier — latency accumulation, non-principled fusion, duplicated storage, and the impossibility of global optimization — are all addressed in this single contrast. The difference is structural, not rhetorical.
Response to Condition 2 — Principled Score Fusion. The Cognica team has worked on this problem for years. Bayesian BM25 transforms raw BM25 scores into calibrated probabilities in [0, 1] through a sigmoid transformation. Vector score calibration uses distributional statistics from the ANN index to derive likelihood ratios that convert vector similarities into relevance probabilities. And combination rules in log-odds space provide mathematically justified joint distributions. Part of this work — Bayesian BM25 — has been merged into the Apache Lucene core, adopted as the official baseline in the MTEB benchmark, and incorporated into information retrieval frameworks including Vespa.ai and txtai. These are not academic claims; they are results validated in industry.
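As a rough sketch of the calibration step alone: the shape of the idea is a logistic transform from an unbounded score to a probability. The parameters below are hand-picked for illustration; the published Bayesian BM25 work derives them from the score distribution rather than guessing, so treat this as a Platt-style stand-in, not the actual method:

```python
import math

def calibrate_bm25(score, a=0.35, b=-3.0):
    """Map a raw BM25 score to a probability in (0, 1) via a logistic
    transform. Slope `a` and offset `b` are hypothetical placeholders;
    Bayesian BM25 determines the mapping from distributional statistics
    instead of hand-tuning."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))

# Raw BM25 scores are unbounded and corpus-dependent...
raw = [2.0, 8.0, 14.0, 25.0]
probs = [calibrate_bm25(s) for s in raw]

# ...but the calibrated outputs are probabilities: bounded, monotone
# in the raw score, and directly comparable with other calibrated signals.
assert all(0.0 < p < 1.0 for p in probs)
assert probs == sorted(probs)
```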
Response to Condition 3 — The same engine from server to embedded. The Cognica C++ engine is modular enough that server-side components like the JIT compiler and bytecode VM can be stripped out to produce an embedded build. Same UQA semantics, different form factor. In parallel, there is a Python reference implementation (cognica-io/uqa) and a TypeScript browser implementation (cognica-io/uqa-js). The latter uses SQLite as its key-value storage and on-disk IVF instead of HNSW, optimized for browser and mobile environments. That the same UQA specification maintains semantic equivalence across three implementations in C++, Python, and TypeScript demonstrates empirically that UQA is an implementation-independent specification.
Response to Condition 4 — Theoretical Grounding. UQA rests on five theoretical papers.
- A unified mathematical framework for the relational, text, vector, and graph paradigms. Posting lists serve as the universal abstraction, and each paradigm is expressed on top of a Boolean algebra.
- Extension of the graph algebra through a posting-list-graph isomorphism. Graph traversal and pattern matching map to posting list operations.
- The probabilistic framework of Bayesian BM25. It derives the mathematical necessity of transforming BM25 scores into calibrated probabilities.
- From Bayesian inference to neural computation. The result that the structure of neural networks emerges analytically from multi-signal Bayesian inference. Activation functions are derived as answers to probabilistic questions.
- Vector scores as likelihood ratios. An index-aware method that uses distributional statistics from ANN indexes to convert vector similarities into relevance probabilities.
The foundation of this structure is built entirely on set theory. From the single definition that a posting list is in bijection with the power set of documents, all the properties of Boolean algebra follow, and each paradigm finds its place naturally on top of that algebra. This minimality is itself the strongest evidence of the structure's foundational character. That no elaborate foundation is required means the unification is closer to having been seen than to having been constructed.
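The bijection-with-the-power-set claim can be made concrete in a few lines. This is a sketch over bare ID sets, ignoring payloads and the engine's actual on-disk representation:

```python
# A posting list over a document universe, payloads aside, is a set of
# document IDs -- one element of the power set of documents. The Boolean
# algebra then comes for free: AND is intersection, OR is union, NOT is
# complement against the universe.

universe = frozenset(range(10))  # documents 0..9

def AND(a, b): return a & b
def OR(a, b):  return a | b
def NOT(a):    return universe - a

contains_attention = frozenset({1, 3, 5, 7})  # hypothetical term posting
year_ge_2020       = frozenset({3, 4, 5, 9})  # hypothetical filter posting

# A conjunctive query is a set intersection:
assert AND(contains_attention, year_ge_2020) == {3, 5}

# De Morgan's law holds, as it must in any Boolean algebra:
assert NOT(OR(contains_attention, year_ge_2020)) == \
       AND(NOT(contains_attention), NOT(year_ge_2020))
```

A real engine stores these as sorted sequences and intersects them with galloping or SIMD merges, but the algebraic laws the optimizer relies on are exactly the set-theoretic ones above.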
Section 7 of the first paper introduces a supplementary perspective using lattice theory. It reinterprets the query space as a complete lattice over a partial order — and the origin of this perspective was not theoretical generalization but a practical concern. Once you pursue query optimization seriously, you start asking what structure exists among queries, and the natural answer to that question is a complete lattice. What matters is that this was not a new discovery but a recognition that posting list Boolean algebra already carried the lattice structure within it. A practical problem revealed a facet of a structure that had always been there.
This kind of thing — a new problem domain turning out to have been a face of an existing structure all along — happens repeatedly throughout the UQA framework. The extensions to graph, probabilistic calibration, neural computation, and vector calibration all follow the same pattern. Each extension was not the construction of a new structure but a demonstration of how an existing structure represents the new domain. The coherence running through the five papers is less designed consistency than the repetition of this pattern.
The most recent example is the integration of spatial queries. R-Tree spatial indexes, spatial predicates (within, intersects, contains), and distance functions have been incorporated as additional operators over the posting list algebra. This, too, was not the addition of a new axiom to the framework but a demonstration of how the existing structure handles spatial information. A spatial predicate is a filter selecting a subset of a posting list. A distance function is a score. A spatial index is simply an access path for evaluating the filter efficiently. This is the most direct evidence that the pattern is ongoing — and a signal that this framework has room to absorb other kinds of data yet to come — sensor streams, time series, tensor fields — in the same way.
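The three claims in that paragraph can be sketched directly (hypothetical data layout, not Cognica's API):

```python
from math import hypot

# Posting list: ordered (id, payload) pairs, payload = (x, y) coordinates
postings = [(1, (0.0, 0.0)), (2, (3.0, 4.0)), (3, (10.0, 10.0))]

def within_radius(postings, center, r):
    """Spatial predicate 'within' as a posting-list filter: select the
    sub-posting-list whose points fall inside a circle, attaching the
    distance as the score. An R-Tree would only change how candidates
    are enumerated (the access path), not what the operator means."""
    cx, cy = center
    out = []
    for doc_id, (x, y) in postings:
        d = hypot(x - cx, y - cy)
        if d <= r:
            out.append((doc_id, d))  # (id, score) posting
    return out

hits = within_radius(postings, center=(0.0, 0.0), r=6.0)
assert [doc_id for doc_id, _ in hits] == [1, 2]
assert hits[1][1] == 5.0  # distance to (3, 4) is exactly 5
```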
The importance of this is not merely aesthetic. That the integration rests on a minimal foundation, and that extensions into different domains repeatedly follow the same pattern, means that new paradigms yet to emerge — ones we don't even have names for today — are likely to find their place in this same framework naturally. For the same reason set theory can serve as the foundation of modern mathematics as a whole, Boolean algebra over sets has room to encompass nearly every kind of computationally meaningful data structure.
Response to Condition 5 — Production-grade implementation. The Cognica engine is over 800,000 lines of C++23, supporting SQL that is nearly fully compatible with PostgreSQL 17 (95 of 103 target features), ACID transactions, the DPccp join-ordering algorithm, Copy-and-Patch JIT compilation, the Raft consensus protocol, MVCC-based isolation levels, Cypher graph queries (Apache AGE compatible), R-Tree spatial indexes with spatial predicates, Apache Arrow columnar execution, and federation with external data via DuckDB and Arrow Flight SQL. This is not a research prototype. It is an engine handling real workloads. Both server mode (PostgreSQL wire protocol) and embedded mode (SQLite/DuckDB-style library embedding) are supported.
Response to Condition 6 — Verifiable Openness. The UQA reference implementation is released under AGPL-3.0. All five theoretical papers are public. And a 34-chapter textbook, Cognica Database Internals, documenting the engine's internal architecture, is freely available — covering LSM-tree storage, the CVM bytecode interpreter, Copy-and-Patch JIT, Zero-Copy JOINs, WAND/BMW evaluation strategies, Raft consensus, Cypher-over-SQL rewriting, vector score calibration, every layer. This is a degree of openness almost unheard of among commercial database companies. It means a reader who wants to verify the claims can cross-check theory, implementation, and exposition against each other.
7. Open Questions and Limitations
Having enumerated responses to the six conditions, I do not want to claim the Cognica engine has solved the problem. Several honest limitations must be stated.
Scale validation. The Cognica engine is technically designed to handle 100TB on a single node, but validation under petabyte-scale distributed workloads across dozens of nodes is still ongoing. Proving the system at the scales demanded by the largest datacenter customers is work for the future.
Maturity of on-device deployment. The embedded mode is architecturally possible, but whether actual deployments on iOS and Android have reached production-grade levels of battery usage, memory footprint, and cold-start latency is a separate question. This area is under active development.
Ecosystem. SQLite's ubiquity is not only a matter of technology, but also of decades of tooling — drivers, ORMs, admin tools, textbooks. UQA can leverage much of the existing ecosystem through PostgreSQL wire protocol compatibility, but tools specialized for an "AI-native data layer" — hybrid query debuggers, signal calibration analyzers, embedded index visualizers — do not yet exist.
Verifiable external adoption. Bayesian BM25 merging into the Lucene core and being adopted by Vespa are strong signals, but the Cognica engine as a whole has not yet been validated in external production systems. This is a stage that requires collaboration with early adopters.
These limitations do not mean Cognica's structural approach is wrong. On the contrary, they are the kinds of problems that sustained engineering can solve once the foundation is correctly laid. A system built on a misunderstood structure cannot be saved by any amount of engineering. A system built on a correctly understood structure matures over time.
Conclusion: A Beginning, Not an Answer
The purpose of this essay has not been to argue "use Cognica." It has been two things.
First, to make clear that there is an invisible structural gap in the AI infrastructure stack. The data layer that should exist between model and hardware is fragmented, and on-device it barely exists at all. This is not "a minor inconvenience we currently tolerate" — it is a structural problem, and the current generation of RAG frameworks and vector database extensions will not resolve it.
Second, to make explicit the conditions a proper solution must satisfy. Algebraic unification. Principled score fusion. Server-to-embedded unification. Theoretical grounding. Production grade. Openness. These six conditions are evaluation criteria. Whatever system one is considering, one should ask how well it satisfies each of them.
The Cognica engine is one response to these conditions. Probably not the only one. If the structural nature of the problem becomes broadly recognized, other teams will arrive at their own answers, and through that process the right shape of this space will emerge. I hope the conversation moves in that direction.
What you can do right away, if anything, is evaluate how well the AI system you are currently building satisfies the six conditions above. In most cases the answer will be "not by much," and the next step is to distinguish whether the gap is incidental or structural. A structural gap cannot be closed with glue code.
References
- UQA open-source implementation: cognica-io/uqa (Python, AGPL-3.0), cognica-io/uqa-js (TypeScript)
- UQA theoretical papers and documentation: cognica-io.github.io/uqa
- Cognica Database Internals textbook: cognica.io/docs/cognica-internals
- Interactive browser demo: cognica.io/demo/uqa
- Cognica Insights (a code analysis tool built on UQA): insights.demo.cognica.io