Supported Data Sources
Cognica supports a diverse range of external data sources, selected to cover the most common enterprise data management scenarios. Each source type has its own characteristics, capabilities, and configuration requirements, and understanding them will help you use each source effectively.
Database Sources
Database sources allow you to query tables in external relational database management systems. This is particularly valuable for accessing operational data that cannot be easily replicated, or for querying systems of record in real time.
PostgreSQL is the most feature-rich database source, leveraging DuckDB's postgres extension. This extension uses PostgreSQL's native wire protocol to establish connections and execute queries. It supports the full range of PostgreSQL data types including arrays, JSON, and user-defined types. The extension handles connection management, query translation, and result streaming automatically.
PostgreSQL virtual tables are ideal for accessing transactional data, master data, and any information maintained in PostgreSQL-based systems. Common use cases include querying CRM databases, ERP systems, and custom applications built on PostgreSQL.
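To give a concrete sense of what "native wire protocol" means here, the sketch below builds a PostgreSQL v3.0 StartupMessage, the first packet any client (including DuckDB's postgres extension) sends when opening a connection. This is an illustration of the protocol framing only; the function name and parameters are hypothetical and not part of Cognica's API.

```python
import struct

def build_startup_message(user: str, database: str) -> bytes:
    """Build a PostgreSQL v3.0 StartupMessage: the first packet a
    client sends when opening a connection over the wire protocol."""
    # Protocol version 3.0 is encoded as (major << 16) | minor.
    version = struct.pack(">I", (3 << 16) | 0)
    # Parameters are null-terminated key/value pairs, ended by a final NUL.
    params = b""
    for key, value in (("user", user), ("database", database)):
        params += key.encode() + b"\x00" + value.encode() + b"\x00"
    params += b"\x00"
    body = version + params
    # The leading Int32 length counts itself plus the body.
    return struct.pack(">I", len(body) + 4) + body

packet = build_startup_message("analyst", "crm")
```

Everything after this handshake (query execution, result streaming) follows the same length-prefixed message framing, which is what lets the extension stream results without buffering entire result sets.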
MySQL support is provided through DuckDB's mysql extension. Like the PostgreSQL extension, it uses the native MySQL protocol for efficient communication. MySQL virtual tables work well for accessing e-commerce platforms, content management systems, and legacy applications that use MySQL as their data store.
One important consideration with MySQL is character set handling. The extension assumes UTF-8 encoding, which matches MySQL's default in modern versions. If you are connecting to a database with different encoding settings, you may need to configure character set conversion on the MySQL server side.
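The encoding mismatch described above is easy to reproduce outside MySQL. In this small illustration (not MySQL-specific code), bytes produced under Latin-1 fail to decode as UTF-8, which is the same failure mode a UTF-8-assuming reader hits against a differently encoded database:

```python
# Text containing a non-ASCII character, encoded under Latin-1.
text = "café"
latin1_bytes = text.encode("latin-1")  # b'caf\xe9'

# A reader that assumes UTF-8 cannot decode the lone 0xE9 byte.
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

# The same text round-trips cleanly when both sides agree on UTF-8.
utf8_roundtrip = text.encode("utf-8").decode("utf-8")
```

This is why the fix belongs on the server side: converting the character set at the source ensures the bytes on the wire are valid UTF-8 before any UTF-8-assuming client sees them.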
SQLite is unique among database sources because it operates on local database files rather than network connections. This makes it useful for accessing embedded databases, application-specific data stores, and database backups. SQLite virtual tables are read-only by default, protecting the source file from accidental modifications.
SQLite support is built into DuckDB without requiring an additional extension, making it immediately available without any download or installation step.
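You can see the same read-only protection with Python's standard-library sqlite3 module, which opens a database file in read-only mode via a URI. The file path here is a temporary example database, not a Cognica convention:

```python
import os
import sqlite3
import tempfile

# Create a small example database file (stands in for an application's store).
path = os.path.join(tempfile.mkdtemp(), "app.db")
with sqlite3.connect(path) as con:
    con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind TEXT)")
    con.execute("INSERT INTO events (kind) VALUES ('login'), ('logout')")

# Reopen the same file read-only via a URI: reads work, writes raise,
# protecting the source file from accidental modification.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
rows = ro.execute("SELECT kind FROM events ORDER BY id").fetchall()

try:
    ro.execute("INSERT INTO events (kind) VALUES ('oops')")
    writable = True
except sqlite3.OperationalError:
    writable = False
```

Read-only access by default is a sensible posture for virtual tables generally: the source file remains exactly as the producing application left it.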
Data Lake Formats
Data lake formats represent a significant evolution in how organizations store and manage large-scale analytical data. Unlike traditional file formats, data lake formats provide ACID transaction semantics, schema evolution, and time travel capabilities on top of object storage.
Delta Lake is an open-source storage layer developed by Databricks that brings reliability to data lakes. Delta Lake stores data as Parquet files with a transaction log that tracks all changes. This transaction log enables atomic writes, schema enforcement, and the ability to query historical versions of the data.
When you create a Delta Lake virtual table, Cognica uses DuckDB's delta extension to read the transaction log and identify which Parquet files constitute the current version of the table. Filter conditions are evaluated against Delta Lake's file-level statistics, enabling efficient partition pruning and data skipping.
Delta Lake virtual tables are particularly valuable for accessing data produced by Spark-based ETL pipelines, streaming ingestion systems, and any process that writes to Delta format. The ability to query Delta tables directly eliminates the need to maintain separate exports or copies of the data.
Apache Iceberg is another open table format, originally developed by Netflix and now an Apache project. Iceberg takes a different approach to metadata management, storing schema, partitioning, and snapshot information in a manifest structure that scales to extremely large tables.
Iceberg's hidden partitioning is particularly powerful: partition transforms are applied automatically based on the table's partitioning spec, so you don't need to include partition columns in your queries. For example, if a table is partitioned by month based on an event_timestamp column, Iceberg automatically applies partition pruning when you filter on event_timestamp, even though you never explicitly mention the partition.
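A minimal sketch of that month-transform pruning, assuming hypothetical file metadata (Iceberg's real manifests carry considerably more structure), looks like this:

```python
from datetime import datetime

def month_ordinal(ts: datetime) -> int:
    """Iceberg's month transform: months since 1970-01, as an integer."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

# Each data file records its partition value in table metadata.
files = [
    {"path": "a.parquet", "month": month_ordinal(datetime(2024, 1, 5))},
    {"path": "b.parquet", "month": month_ordinal(datetime(2024, 2, 10))},
    {"path": "c.parquet", "month": month_ordinal(datetime(2024, 3, 2))},
]

def prune(files, lo: datetime, hi: datetime):
    """The query filters on event_timestamp; the month transform is applied
    to the filter bounds automatically, so the query never names the
    partition column. Keep only files whose month overlaps [lo, hi]."""
    lo_m, hi_m = month_ordinal(lo), month_ordinal(hi)
    return [f["path"] for f in files if lo_m <= f["month"] <= hi_m]

kept = prune(files, datetime(2024, 2, 1), datetime(2024, 2, 28))
```

The key point is that the transform is part of the table's spec, not the query: the engine rewrites the timestamp predicate into a partition-value predicate on your behalf.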
The iceberg extension in DuckDB provides read access to Iceberg tables, supporting snapshot isolation and the ability to query specific snapshots for time travel analysis.
File Formats
For simpler use cases where full data lake capabilities are not required, Cognica supports direct access to common file formats through the Arrow backend.
Parquet is a columnar storage format that has become the de facto standard for analytical workloads. Parquet files include rich metadata with column statistics, enabling predicate pushdown and efficient data skipping. The Arrow backend's Parquet reader is highly optimized, supporting parallel column reads and vectorized decompression.
Parquet virtual tables excel at querying exported data, archived datasets, and any data stored in Parquet format without the overhead of a full data lake format.
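The data skipping mentioned above works because Parquet keeps min/max statistics for each row group. This sketch uses hypothetical statistics and a made-up predicate (amount > 500) to show the decision a reader makes before decompressing anything:

```python
# Hypothetical per-row-group statistics: (min, max) for each column.
row_groups = [
    {"id": 0, "stats": {"amount": (10, 120)}},
    {"id": 1, "stats": {"amount": (90, 800)}},
    {"id": 2, "stats": {"amount": (600, 950)}},
]

def groups_to_read(row_groups, column, threshold):
    """For a predicate `column > threshold`, keep only row groups whose
    max could satisfy it; the rest are skipped without being read."""
    return [g["id"] for g in row_groups if g["stats"][column][1] > threshold]

selected = groups_to_read(row_groups, "amount", 500)
```

Here group 0 is skipped entirely because its maximum (120) can never satisfy the predicate, which is exactly the saving that makes columnar statistics so valuable on large files.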
CSV (Comma-Separated Values) remains ubiquitous despite its limitations. The Arrow backend provides robust CSV parsing with automatic type inference, configurable delimiters, quote handling, and header detection. CSV virtual tables are useful for querying data exports, log files, and data from systems that only support text-based formats.
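Automatic type inference typically works by trying progressively wider types over a sample of values. This is a simplified sketch of that promotion ladder (int, then float, then string), not the Arrow reader's actual algorithm:

```python
import csv
import io

def infer_type(values):
    """Promote through int -> float -> str: the usual inference ladder."""
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"

sample = "id,price,label\n1,9.99,widget\n2,12,gadget\n"
rows = list(csv.DictReader(io.StringIO(sample)))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
```

Note that a single non-conforming value demotes the whole column (here "9.99" forces price to float even though "12" parses as an integer), which is why real readers sample many rows before committing to a schema.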
ORC (Optimized Row Columnar) is another columnar format, originally developed for Apache Hive. ORC provides similar benefits to Parquet with some differences in compression and encoding strategies. ORC virtual tables provide access to data produced by Hive-based systems and older Hadoop ecosystems.
JSON and Newline-Delimited JSON (NDJSON) formats are supported for querying semi-structured data. Each line in an NDJSON file is parsed as a separate JSON object, making this format suitable for log files, API response archives, and event streams.
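Because each NDJSON line is an independent document, a reader can parse and filter records one line at a time without loading the whole file. A minimal sketch with made-up event records:

```python
import json

# Three NDJSON records, one JSON object per line.
ndjson = (
    '{"event": "login", "user": "alice"}\n'
    '{"event": "purchase", "user": "bob", "amount": 42}\n'
    '{"event": "logout", "user": "alice"}\n'
)

# Parse line by line; blank lines are skipped.
records = [json.loads(line) for line in ndjson.splitlines() if line.strip()]
purchases = [r for r in records if r["event"] == "purchase"]
```

This line-at-a-time property is also what makes NDJSON append-friendly: producers such as log shippers can add records without rewriting or re-parsing anything that came before.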
Arrow IPC files contain data serialized in Arrow's native format. This format has minimal parsing overhead since the data is already in Arrow's in-memory representation. Arrow IPC virtual tables are ideal for data produced by other Arrow-based systems or for high-performance data exchange.