Architecture
Understanding the architecture of External Virtual Tables helps you make informed decisions about when and how to use them. Cognica implements External Virtual Tables through two distinct backends, each optimized for different types of data sources.
The Two-Backend Design
Cognica uses two separate backends, Apache Arrow for file-based sources and DuckDB for database and data lake sources, because each is the stronger tool for its class of source.
The Arrow Backend excels at reading columnar file formats like Parquet, ORC, and Arrow IPC. Apache Arrow provides highly optimized readers for these formats and supports predicate pushdown that uses the metadata and statistics stored in the file. When you query a Parquet file, Arrow can skip entire row groups whose statistics show that no row can match your filter conditions, without reading the underlying data. The Arrow backend also handles partitioned datasets natively: it understands Hive-style partition layouts and applies partition pruning automatically.
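Row-group skipping is easier to see in miniature. The following is a minimal sketch in plain Python, not Arrow's actual reader: the RowGroup class and scan_greater_than function are hypothetical stand-ins that mimic how per-group min/max statistics let a scan avoid reading groups that cannot contain a match.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for one Parquet row group and its statistics."""
    min_value: int    # smallest value of the filtered column in this group
    max_value: int    # largest value of the filtered column in this group
    rows: List[int]   # the column values, only touched if the group is read


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Statistics check: if even the maximum is <= threshold,
        # no row in this group can satisfy "value > threshold".
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v > threshold)
    return matches, groups_read


groups = [
    RowGroup(1, 10, [1, 5, 10]),
    RowGroup(11, 20, [11, 15, 20]),
    RowGroup(21, 30, [21, 25, 30]),
]
result, groups_read = scan_greater_than(groups, 18)
print(result, groups_read)
```

In this sketch the first group is skipped on statistics alone, so only two of the three groups are read. Partition pruning follows the same idea one level up, using directory names instead of in-file statistics.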
The DuckDB Backend handles database connectivity and data lake formats. DuckDB's extension ecosystem provides battle-tested connectors for PostgreSQL, MySQL, and SQLite, as well as readers for Delta Lake and Apache Iceberg. By embedding DuckDB as a query processing engine, Cognica gains access to this entire ecosystem while maintaining a consistent SQL interface. DuckDB runs as an in-memory instance within the Cognica process, eliminating inter-process communication overhead.
Query Execution Flow
When you execute a query against an External Virtual Table, the planner resolves each referenced table to a backend, pushes as much work as possible down to the source, and completes the remaining operations inside Cognica:
First, the SQL parser analyzes your query and identifies which tables are referenced. For each table, the query planner consults the virtual table registry to determine whether it is a native collection, an Arrow-backed external table, or a DuckDB-backed external table.
For DuckDB-backed sources, the planner extracts filter conditions that can be pushed down and generates a DuckDB SQL query. This query is executed against DuckDB's in-memory instance, which in turn connects to the external source, executes the translated query, and streams results back.
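As an illustration of the pushdown step, the sketch below splits a list of filter conditions into those that can be rendered into the generated SQL and those left for the executor. Everything here is an assumption made for the example: the Filter class, the PUSHABLE_OPS set, the build_pushdown_query function, and the table name pg.orders are hypothetical and do not reflect Cognica's planner internals.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Assumed set of operators the source can evaluate; the real set is not documented here.
PUSHABLE_OPS = {"=", "<>", "<", "<=", ">", ">="}


@dataclass(frozen=True)
class Filter:
    column: str
    op: str
    value: Union[int, float, str]


def render_literal(value: Union[int, float, str]) -> str:
    """Render a Python value as a SQL literal, escaping single quotes in strings."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return repr(value)


def build_pushdown_query(
    table: str, columns: List[str], filters: List[Filter]
) -> Tuple[str, List[Filter]]:
    """Return the generated SQL and the filters that could not be pushed down."""
    pushed = [f for f in filters if f.op in PUSHABLE_OPS]
    residual = [f for f in filters if f.op not in PUSHABLE_OPS]
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if pushed:
        where = " AND ".join(
            f"{f.column} {f.op} {render_literal(f.value)}" for f in pushed
        )
        sql += f" WHERE {where}"
    return sql, residual


sql, residual = build_pushdown_query(
    "pg.orders",  # hypothetical attached table name
    ["id", "total"],
    [
        Filter("status", "=", "shipped"),
        Filter("total", ">", 100),
        Filter("note", "REGEX", "^rush"),  # not pushable in this sketch
    ],
)
print(sql)
```

The two comparison filters end up in the WHERE clause, while the third is returned as a residual condition to be applied after the rows come back.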
For Arrow-backed sources, the planner creates an Arrow dataset scanner with the appropriate filters and projections, which reads directly from the file system or cloud storage.
Results from external sources are materialized as Cognica cursors, which can then participate in joins with other external sources or native collections. The query executor handles the orchestration, applying any remaining operations that could not be pushed down to the sources.
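The sketch below shows, under simplifying assumptions, how rows from an external source and a native collection might be combined once both are available as cursors. Cursors are modeled as plain Python iterators of dictionaries, and hash_join and apply_residual are hypothetical helpers, not Cognica APIs. The join strategy is a textbook hash join chosen for brevity; the source does not state which join algorithm the executor uses.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


def hash_join(left: Iterable[Row], right: Iterable[Row], key: str) -> Iterator[Row]:
    """Inner join two row streams on `key` by building a hash index over `right`."""
    index: Dict[object, List[Row]] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield {**match, **row}


def apply_residual(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Apply a filter that could not be pushed down to either source."""
    return (row for row in rows if predicate(row))


# Stand-in for a cursor over an external table.
external_orders = iter([
    {"user_id": 1, "total": 50},
    {"user_id": 2, "total": 150},
    {"user_id": 3, "total": 75},
])
# Stand-in for a cursor over a native collection.
native_users = iter([
    {"user_id": 1, "name": "Ada"},
    {"user_id": 2, "name": "Lin"},
])

joined = hash_join(external_orders, native_users, "user_id")
final_rows = list(apply_residual(joined, lambda r: str(r["name"]).startswith("L")))
print(final_rows)
```

Because both inputs are consumed through the same iterator interface, the join does not need to know which backend produced each side.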
The Cursor Provider Chain
Cognica's query execution system uses a chain-of-responsibility pattern for cursor creation. When the executor needs to read from a table, it asks the cursor provider chain. The providers are consulted in order, each handles a specific type of table, and the first one that claims the table creates the cursor:
SystemCatalogCursorProvider
-> DuckDBCursorProvider
-> ExternalTableCursorProvider
-> TransactionCursorProvider
The SystemCatalogCursorProvider handles queries against system catalogs like _sys.virtual_tables and _sys.sequences. These are internal metadata tables that describe the database structure itself.
The DuckDBCursorProvider intercepts requests for tables that are registered as DuckDB-backed virtual tables. It checks the virtual table registry, and if the requested table is a DuckDB-backed source such as PostgreSQL, MySQL, Delta Lake, or Iceberg, it generates the appropriate DuckDB query and returns a cursor over the results.
The ExternalTableCursorProvider handles Arrow-backed file sources. It creates Arrow dataset scanners for Parquet, CSV, ORC, and other file formats.
The TransactionCursorProvider is the final link in the chain, handling native Cognica collections. If no upstream provider claims a table, it must be a native collection, and this provider creates a cursor using the transaction's snapshot of the collection.
This chain architecture provides clean separation of concerns and makes it easy to add new types of external sources in the future.
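The chain described above can be sketched in a few lines. This is an illustrative model only: the classes below borrow the provider names from the text but are hypothetical Python stand-ins, the VIRTUAL_TABLES registry and its table names are invented for the example, and each provider returns a label rather than a real cursor.

```python
from typing import Dict, List

# Hypothetical virtual table registry: table name -> backend kind.
VIRTUAL_TABLES: Dict[str, str] = {
    "pg_orders": "duckdb",
    "events_parquet": "arrow",
}


class CursorProvider:
    name = "base"

    def claims(self, table: str) -> bool:
        raise NotImplementedError


class SystemCatalogCursorProvider(CursorProvider):
    name = "system_catalog"

    def claims(self, table: str) -> bool:
        return table.startswith("_sys.")


class DuckDBCursorProvider(CursorProvider):
    name = "duckdb"

    def claims(self, table: str) -> bool:
        return VIRTUAL_TABLES.get(table) == "duckdb"


class ExternalTableCursorProvider(CursorProvider):
    name = "arrow"

    def claims(self, table: str) -> bool:
        return VIRTUAL_TABLES.get(table) == "arrow"


class TransactionCursorProvider(CursorProvider):
    name = "native"

    def claims(self, table: str) -> bool:
        # Final link: anything not claimed upstream is treated as a native collection.
        return True


CHAIN: List[CursorProvider] = [
    SystemCatalogCursorProvider(),
    DuckDBCursorProvider(),
    ExternalTableCursorProvider(),
    TransactionCursorProvider(),
]


def resolve_provider(table: str) -> str:
    """Walk the chain in order and return the name of the first provider that claims the table."""
    for provider in CHAIN:
        if provider.claims(table):
            return provider.name
    raise LookupError(table)  # unreachable while the fallback provider is last


for table in ["_sys.virtual_tables", "pg_orders", "events_parquet", "users"]:
    print(table, "->", resolve_provider(table))
```

Supporting a new kind of source in this model means adding one provider class and inserting it ahead of the fallback, which is the extensibility property the chain is meant to provide.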