Chapter 8: SQL Parser and Semantic Analysis
This chapter examines how Cognica transforms SQL text into executable query plans. We explore the parsing infrastructure built on PostgreSQL's battle-tested parser, the Abstract Syntax Tree representation, and the semantic analysis phase that validates and enriches queries with type information.
8.1 The SQL Processing Pipeline
SQL is a declarative language - users specify what data they want, not how to retrieve it. The database must bridge this gap through multiple transformation stages.
8.1.1 Pipeline Overview
Each stage transforms the query into a more concrete representation:
| Stage | Input | Output | Purpose |
|---|---|---|---|
| Lexical Analysis | SQL text | Tokens | Break into words |
| Parsing | Tokens | Parse Tree | Recognize grammar |
| AST Construction | Parse Tree | AST | Structured representation |
| Semantic Analysis | AST | Annotated AST | Type checking, name resolution |
| Logical Planning | Annotated AST | Logical Plan | Relational operators |
| Optimization | Logical Plan | Optimized Plan | Cost-based transformations |
| Physical Planning | Optimized Plan | Physical Plan | Execution strategies |
| Code Generation | Physical Plan | Bytecode | Executable instructions |
8.1.2 Why Multiple Stages?
Separation of concerns enables:
Modularity: Each stage can be tested and optimized independently.
Reusability: The same AST serves multiple backends (CVM, Volcano iterator).
Extensibility: New optimizations can be added without modifying parsing.
Debugging: Problems can be isolated to specific stages.
8.2 Lexical Analysis and Parsing
Cognica leverages libpg_query, a standalone extraction of PostgreSQL's parser, providing full PostgreSQL 17 compatibility.
8.2.1 Why PostgreSQL's Parser?
PostgreSQL's SQL parser is:
- Battle-tested: 25+ years of production use
- Standards-compliant: ANSI SQL with PostgreSQL extensions
- Feature-rich: Supports CTEs, window functions, JSON operators, arrays
- Well-maintained: Regular updates with PostgreSQL releases
Building a custom parser would require years of effort to match this maturity.
8.2.2 libpg_query Integration
The ParseResult class wraps libpg_query with RAII semantics:
class ParseResult {
public:
// Parse SQL string, returns result with parse tree or error
static auto parse(std::string_view sql) -> ParseResult;
// Access parse tree as JSON
auto get_parse_tree() const -> std::string_view;
// Error information
auto has_error() const -> bool;
auto get_error_message() const -> std::string_view;
auto get_error_line() const -> int32_t;
auto get_error_column() const -> int32_t;
private:
PgQueryParseResult result_; // libpg_query C struct
};
Memory Safety: The destructor automatically frees libpg_query's internal allocations.
8.2.3 Parse Tree Format
libpg_query produces a JSON representation of PostgreSQL's internal parse tree:
SELECT id, name FROM users WHERE status = 'active'
Produces (simplified):
{
"stmts": [{
"stmt": {
"SelectStmt": {
"targetList": [
{"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": {"sval": "id"}}]}}}},
{"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": {"sval": "name"}}]}}}}
],
"fromClause": [
{"RangeVar": {"relname": "users"}}
],
"whereClause": {
"A_Expr": {
"kind": "AEXPR_OP",
"name": [{"String": {"sval": "="}}],
"lexpr": {"ColumnRef": {"fields": [{"String": {"sval": "status"}}]}},
"rexpr": {"A_Const": {"sval": {"sval": "active"}}}
}
}
}
}
}]
}
8.2.4 Additional libpg_query Features
Query Fingerprinting:
class FingerprintResult {
static auto fingerprint(std::string_view sql) -> FingerprintResult;
auto get_fingerprint() const -> uint64_t; // Hash for cache key
};
Fingerprints normalize queries for caching:
SELECT * FROM users WHERE id = 1
SELECT * FROM users WHERE id = 2
Both produce the same fingerprint, enabling prepared statement caching.
Tokenization:
class ScanResult {
static auto scan(std::string_view sql) -> ScanResult;
auto get_tokens() const -> std::vector<Token>;
};
Useful for syntax highlighting and query analysis tools.
Deparsing:
class DeparseResult {
static auto deparse(const ProtobufParseResult& tree) -> DeparseResult;
auto get_sql() const -> std::string_view;
};
Converts parse trees back to normalized SQL, useful for query logging.
8.3 Abstract Syntax Tree
The AST provides a type-safe, navigable representation of SQL queries.
8.3.1 Node Hierarchy
Node: Base class with type discriminator.
Expr: Expressions that produce values.
Stmt: Complete SQL statements.
8.3.2 Node Type Enumeration
Cognica supports 56+ node types:
enum class NodeType : uint8_t {
// Statements
kSelectStmt, kInsertStmt, kUpdateStmt, kDeleteStmt,
kCreateStmt, kDropStmt, kAlterStmt, kIndexStmt,
kTruncateStmt, kRenameStmt, kTransactionStmt,
kExplainStmt, kAnalyzeStmt, kCopyStmt,
kViewStmt, kCreateMatViewStmt, kRefreshMatViewStmt,
kCreateFunctionStmt, kDropFunctionStmt,
kCreateTriggerStmt, kDropTriggerStmt,
kGrantStmt, kRevokeStmt, kCreateRoleStmt,
kCreatePolicyStmt, kAlterPolicyStmt,
// Expressions
kColumnRef, kConstant, kParamRef,
kBinaryExpr, kUnaryExpr, kFunctionCall,
kCaseExpr, kCastExpr, kSubqueryExpr,
kArrayExpr, kArraySubscript, kArraySlice,
kCoalesceExpr, kNullIfExpr, kTypedLiteral,
// Other
kJoinExpr, kRangeVar, kTargetEntry, kOrderByItem,
kWindowDef, kGroupingSet, kCommonTableExpr,
// ...
};
8.3.3 Expression Nodes
ColumnRef - Table column references:
class ColumnRef : public Expr {
std::optional<std::string> schema_name; // Optional schema
std::optional<std::string> table_name; // Optional table/alias
std::string column_name; // Column name
};
Examples:
id->{column_name: "id"}users.id->{table_name: "users", column_name: "id"}public.users.id->{schema_name: "public", table_name: "users", column_name: "id"}
Constant - Literal values:
class Constant : public Expr {
enum class Type { kNull, kBool, kInt64, kDouble, kString, kArray };
Type value_type;
std::variant<
std::monostate, // NULL
bool, // Boolean
int64_t, // Integer
double, // Floating point
std::string, // String
std::vector<Constant> // Array
> value;
};
BinaryExpr - Binary operators:
class BinaryExpr : public Expr {
BinaryOpType op;
ExprPtr left;
ExprPtr right;
};
With 50+ operator types:
enum class BinaryOpType {
// Arithmetic
kAdd, kSubtract, kMultiply, kDivide, kModulo,
// Comparison
kEqual, kNotEqual, kLessThan, kLessEqual,
kGreaterThan, kGreaterEqual,
// Logical
kAnd, kOr,
// String
kLike, kILike, kSimilarTo, kRegexMatch,
// JSON (12 operators)
kJSONExtract, // ->
kJSONExtractText, // ->>
kJSONExtractPath, // #>
kJSONExtractPathText,// #>>
kJSONContains, // @>
kJSONContainedBy, // <@
kJSONKeyExists, // ?
kJSONAnyKeyExists, // ?|
kJSONAllKeysExist, // ?&
kJSONConcat, // ||
kJSONDelete, // -
kJSONDeletePath, // #-
// Array
kArrayContains, kArrayConcat,
// Other
kIn, kNotIn, kBetween, kNotBetween,
kIsDistinctFrom, kIsNotDistinctFrom,
};
FunctionCall - Function invocations:
class FunctionCall : public Expr {
std::string function_name;
ExprList arguments;
bool is_distinct; // COUNT(DISTINCT x)
bool is_star; // COUNT(*)
ExprPtr filter; // FILTER (WHERE ...)
WindowDef window_def; // OVER clause
bool agg_within_group; // WITHIN GROUP (ORDER BY ...)
};
8.3.4 Statement Nodes
SelectStmt - The most complex statement type:
class SelectStmt : public Stmt {
// WITH clause
std::vector<CommonTableExpr> with_clause;
bool with_recursive;
// SELECT list
std::vector<TargetEntry> target_list;
bool distinct;
ExprList distinct_on;
// FROM clause
std::vector<RangeVar> from_clause;
std::vector<RangeSubselect> from_subqueries;
std::vector<RangeTableFunc> from_table_funcs;
std::vector<JoinExpr> joins;
// Filtering
ExprPtr where_clause;
ExprList group_by;
std::vector<GroupingSet> grouping_sets;
ExprPtr having_clause;
// Window definitions
std::vector<NamedWindowDef> window_clause;
// Ordering and limits
std::vector<OrderByItem> order_by;
std::optional<int64_t> limit;
std::optional<int64_t> offset;
// Set operations
SetOperation set_op;
SelectStmtPtr set_left;
SelectStmtPtr set_right;
};
JoinExpr - Join specifications:
enum class JoinType {
kInner, kLeft, kRight, kFull, kCross,
kSemi, kAnti // Generated by subquery optimization
};
class JoinExpr {
JoinType join_type;
RangeVar left;
RangeVar right;
ExprPtr join_condition; // ON clause
std::vector<std::string> using_columns; // USING(...)
bool is_natural; // NATURAL JOIN
};
8.3.5 Smart Pointer Conventions
The AST uses unique ownership:
using NodePtr = std::unique_ptr<Node>;
using ExprPtr = std::unique_ptr<Expr>;
using StmtPtr = std::unique_ptr<Stmt>;
using ExprList = std::vector<ExprPtr>;
using StmtList = std::vector<StmtPtr>;
Unique pointers ensure:
- No accidental sharing of mutable state
- Automatic cleanup when AST is destroyed
- Clear ownership semantics
8.4 AST Builder
The AST builder converts libpg_query's JSON parse tree into Cognica's typed AST.
8.4.1 Builder Architecture
class ASTBuilder {
public:
// Build single statement (returns nullptr for multiple statements)
auto build(const ParseTree& tree, std::string_view original_sql) -> StmtPtr;
// Build multiple statements
auto build_all(const ParseTree& tree) -> std::vector<StmtPtr>;
// Error handling
auto has_error() const -> bool;
auto get_last_error() const -> const std::string&;
private:
// Statement builders (27 types)
auto build_select_stmt_(const Value& node) -> StmtPtr;
auto build_insert_stmt_(const Value& node) -> StmtPtr;
auto build_update_stmt_(const Value& node) -> StmtPtr;
auto build_delete_stmt_(const Value& node) -> StmtPtr;
// ... 23 more
// Expression builders (25+ types)
auto build_expr_(const Value& node) -> ExprPtr;
auto build_column_ref_(const Value& node) -> ExprPtr;
auto build_constant_(const Value& node) -> ExprPtr;
auto build_binary_expr_(const Value& node) -> ExprPtr;
auto build_function_call_(const Value& node) -> ExprPtr;
// ... 20 more
};
8.4.2 Recursive Descent
The builder uses recursive descent to traverse the JSON tree:
auto ASTBuilder::build_expr_(const Value& node) -> ExprPtr {
// Determine expression type from JSON structure
if (node.HasMember("ColumnRef")) {
return build_column_ref_(node["ColumnRef"]);
}
if (node.HasMember("A_Const")) {
return build_constant_(node["A_Const"]);
}
if (node.HasMember("A_Expr")) {
return build_binary_expr_(node["A_Expr"]);
}
if (node.HasMember("FuncCall")) {
return build_function_call_(node["FuncCall"]);
}
if (node.HasMember("SubLink")) {
return build_subquery_expr_(node["SubLink"]);
}
// ... handle other expression types
set_error("Unknown expression type");
return nullptr;
}
8.4.3 Operator Mapping
PostgreSQL operator names must be mapped to Cognica's enum:
auto ASTBuilder::parse_operator_(std::string_view op_name) -> BinaryOpType {
static const std::unordered_map<std::string_view, BinaryOpType> mapping = {
// Arithmetic
{"+", BinaryOpType::kAdd},
{"-", BinaryOpType::kSubtract},
{"*", BinaryOpType::kMultiply},
{"/", BinaryOpType::kDivide},
{"%", BinaryOpType::kModulo},
// Comparison
{"=", BinaryOpType::kEqual},
{"<>", BinaryOpType::kNotEqual},
{"!=", BinaryOpType::kNotEqual},
{"<", BinaryOpType::kLessThan},
{"<=", BinaryOpType::kLessEqual},
{">", BinaryOpType::kGreaterThan},
{">=", BinaryOpType::kGreaterEqual},
// JSON operators
{"->", BinaryOpType::kJSONExtract},
{"->>", BinaryOpType::kJSONExtractText},
{"#>", BinaryOpType::kJSONExtractPath},
{"#>>", BinaryOpType::kJSONExtractPathText},
{"@>", BinaryOpType::kJSONContains},
{"<@", BinaryOpType::kJSONContainedBy},
{"?", BinaryOpType::kJSONKeyExists},
{"?|", BinaryOpType::kJSONAnyKeyExists},
{"?&", BinaryOpType::kJSONAllKeysExist},
// Pattern matching
{"~~", BinaryOpType::kLike},
{"~~*", BinaryOpType::kILike},
{"~", BinaryOpType::kRegexMatch},
{"~*", BinaryOpType::kRegexMatchCI},
};
auto it = mapping.find(op_name);
return it != mapping.end() ? it->second : BinaryOpType::kUnknown;
}
8.4.4 Complex Node Handling
CASE Expressions:
CASE status
WHEN 'active' THEN 1
WHEN 'pending' THEN 2
ELSE 0
END
auto ASTBuilder::build_case_expr_(const Value& node) -> ExprPtr {
auto result = std::make_unique<CaseExpr>();
// Simple CASE: CASE expr WHEN value THEN result
if (node.HasMember("arg")) {
result->case_expr = build_expr_(node["arg"]);
}
// WHEN clauses
for (const auto& when : node["args"].GetArray()) {
CaseWhen case_when;
case_when.condition = build_expr_(when["CaseWhen"]["expr"]);
case_when.result = build_expr_(when["CaseWhen"]["result"]);
result->when_clauses.push_back(std::move(case_when));
}
// ELSE clause
if (node.HasMember("defresult")) {
result->else_result = build_expr_(node["defresult"]);
}
return result;
}
Window Functions:
SUM(amount) OVER (PARTITION BY category ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
auto ASTBuilder::build_window_def_(const Value& node) -> WindowDef {
WindowDef def;
// PARTITION BY
if (node.HasMember("partitionClause")) {
for (const auto& expr : node["partitionClause"].GetArray()) {
def.partition_by.push_back(build_expr_(expr));
}
}
// ORDER BY
if (node.HasMember("orderClause")) {
for (const auto& item : node["orderClause"].GetArray()) {
def.order_by.push_back(build_order_by_item_(item));
}
}
// Frame specification
def.frame_options = node.HasMember("frameOptions")
? node["frameOptions"].GetInt()
: 0;
if (node.HasMember("startOffset")) {
def.frame_start_offset = build_expr_(node["startOffset"]);
}
if (node.HasMember("endOffset")) {
def.frame_end_offset = build_expr_(node["endOffset"]);
}
return def;
}
8.5 Semantic Analysis
Semantic analysis validates the AST and enriches it with type information.
8.5.1 Validation Tasks
| Task | Description | Example Error |
|---|---|---|
| Table existence | Verify tables exist | "Table 'foo' not found" |
| Column existence | Verify columns exist | "Column 'bar' not found in 'users'" |
| Type compatibility | Check operator types | "Cannot compare string to integer" |
| Ambiguity resolution | Resolve unqualified names | "Column 'id' is ambiguous" |
| Aggregate validation | Check GROUP BY usage | "Column must appear in GROUP BY" |
8.5.2 Semantic Analyzer Interface
class SemanticAnalyzer {
public:
explicit SemanticAnalyzer(const SchemaProvider* schema_provider);
// Analyze any statement
auto analyze(const ast::Stmt& stmt) -> Status;
// Analyze SELECT with column resolution
auto analyze_select(const ast::SelectStmt& stmt) -> Status;
// Analyze and resolve column references
auto analyze_and_resolve(ast::SelectStmt* stmt) -> Status;
private:
const SchemaProvider* schema_provider_;
};
8.5.3 Schema Provider Interface
The analyzer queries schema information through an abstract interface:
class SchemaProvider {
public:
virtual ~SchemaProvider() = default;
// Table existence
virtual auto collection_exists(const std::string& name) const -> bool = 0;
// Table enumeration
virtual auto get_collection_names() const
-> std::unordered_set<std::string> = 0;
// Column enumeration
virtual auto get_column_names(const std::string& table_name) const
-> std::unordered_set<std::string> = 0;
};
This abstraction enables:
- Testing with mock schemas
- Different backends (document DB, relational tables, external tables)
- Schema caching
8.5.4 Name Resolution
Unqualified column references must be resolved:
SELECT id, name FROM users u JOIN orders o ON u.id = o.user_id
The analyzer must determine that:
idrefers tou.id(or error if ambiguous)namerefers tou.nameuser_idrefers too.user_id
Resolution Algorithm:
auto SemanticAnalyzer::resolve_column_(
const ColumnRef& col,
const std::vector<TableScope>& scopes
) -> ResolvedColumn {
std::vector<ResolvedColumn> candidates;
for (const auto& scope : scopes) {
if (col.table_name.has_value()) {
// Qualified: match table name
if (scope.alias == col.table_name || scope.table_name == col.table_name) {
if (scope.has_column(col.column_name)) {
candidates.push_back({scope.alias, col.column_name});
}
}
} else {
// Unqualified: search all tables
if (scope.has_column(col.column_name)) {
candidates.push_back({scope.alias, col.column_name});
}
}
}
if (candidates.empty()) {
throw SemanticError("Column '" + col.column_name + "' not found");
}
if (candidates.size() > 1) {
throw SemanticError("Column '" + col.column_name + "' is ambiguous");
}
return candidates[0];
}
8.5.5 Scope Management
Queries can have nested scopes (subqueries, CTEs):
WITH active_users AS (
SELECT id, name FROM users WHERE status = 'active'
)
SELECT a.name, COUNT(o.id)
FROM active_users a
JOIN orders o ON a.id = o.user_id
GROUP BY a.name
Scope stack:
- CTE scope:
active_users(id, name) - FROM scope:
a(id, name), o(id, user_id, amount, ...)
class ScopeManager {
std::vector<Scope> scope_stack_;
public:
void push_scope(Scope scope) {
scope_stack_.push_back(std::move(scope));
}
void pop_scope() {
scope_stack_.pop_back();
}
auto resolve_column(const std::string& name) -> ResolvedColumn {
// Search from innermost to outermost scope
for (auto it = scope_stack_.rbegin(); it != scope_stack_.rend(); ++it) {
if (auto col = it->find_column(name)) {
return *col;
}
}
throw SemanticError("Column not found: " + name);
}
};
8.5.6 Aggregate Validation
Aggregates impose constraints on non-aggregated columns:
-- Valid: name is in GROUP BY
SELECT name, COUNT(*) FROM users GROUP BY name
-- Invalid: email is not in GROUP BY
SELECT name, email, COUNT(*) FROM users GROUP BY name
Validation Algorithm:
auto SemanticAnalyzer::validate_grouping_(
const SelectStmt& stmt
) -> Status {
if (stmt.group_by.empty() && !has_aggregates(stmt.target_list)) {
return Status::OK(); // No grouping, no aggregates
}
std::unordered_set<std::string> grouped_columns;
for (const auto& expr : stmt.group_by) {
if (auto* col = dynamic_cast<ColumnRef*>(expr.get())) {
grouped_columns.insert(col->column_name);
}
}
for (const auto& target : stmt.target_list) {
if (!is_aggregate(target.expr.get())) {
auto columns = extract_columns(target.expr.get());
for (const auto& col : columns) {
if (grouped_columns.find(col) == grouped_columns.end()) {
return Status::SemanticError(
"Column '" + col + "' must appear in GROUP BY clause"
);
}
}
}
}
return Status::OK();
}
8.6 Type System
Cognica's type system bridges SQL types and internal representations.
8.6.1 SQL Type Categories
| Category | SQL Types | Internal Type |
|---|---|---|
| Integer | INTEGER, BIGINT, SMALLINT | int64_t |
| Float | REAL, DOUBLE PRECISION | double |
| String | VARCHAR, CHAR, TEXT | std::string |
| Boolean | BOOLEAN | bool |
| JSON | JSON, JSONB | Document |
| Array | INTEGER[], VARCHAR[] | std::vector |
| Temporal | DATE, TIME, TIMESTAMP | int64_t (epoch) |
8.6.2 Type Inference
Expressions have inferred types based on operators and operands:
Type Promotion Rules:
| Left Type | Right Type | Result Type |
|---|---|---|
| INTEGER | INTEGER | INTEGER |
| INTEGER | DOUBLE | DOUBLE |
| INTEGER | STRING | ERROR |
| STRING | STRING | STRING |
| JSON | STRING | JSON |
auto infer_binary_type(
BinaryOpType op,
CVMType left_type,
CVMType right_type
) -> CVMType {
switch (op) {
case BinaryOpType::kAdd:
case BinaryOpType::kSubtract:
case BinaryOpType::kMultiply:
case BinaryOpType::kDivide:
// Numeric operations
if (left_type == CVMType::kDouble || right_type == CVMType::kDouble) {
return CVMType::kDouble;
}
return CVMType::kInt64;
case BinaryOpType::kEqual:
case BinaryOpType::kLessThan:
// ... other comparisons
return CVMType::kBool;
case BinaryOpType::kLike:
case BinaryOpType::kConcat:
return CVMType::kString;
case BinaryOpType::kJSONExtract:
return CVMType::kJSON;
case BinaryOpType::kJSONExtractText:
return CVMType::kString;
default:
return CVMType::kUnknown;
}
}
8.6.3 Implicit Casting
Some type mismatches can be resolved through implicit casts:
SELECT * FROM users WHERE id = '123' -- String compared to integer
The analyzer inserts an implicit cast:
auto maybe_insert_cast_(ExprPtr& expr, CVMType target_type) -> Status {
CVMType source_type = infer_type(expr.get());
if (source_type == target_type) {
return Status::OK();
}
if (can_implicit_cast(source_type, target_type)) {
expr = std::make_unique<CastExpr>(std::move(expr), target_type);
return Status::OK();
}
return Status::TypeError(
"Cannot convert " + type_name(source_type) +
" to " + type_name(target_type)
);
}
Implicit Cast Rules:
| From | To | Allowed |
|---|---|---|
| INTEGER | DOUBLE | Yes |
| STRING | INTEGER | Yes (with validation) |
| STRING | DOUBLE | Yes (with validation) |
| INTEGER | STRING | Yes |
| BOOLEAN | INTEGER | No |
| JSON | STRING | Yes (serialization) |
8.7 Function Resolution
The analyzer resolves function calls to their implementations.
8.7.1 Function Registry
class ScalarFunctionRegistry {
public:
// Function lookup
auto is_registered(const std::string& name) const -> bool;
auto get_function_info(const std::string& name) const -> FunctionInfo;
// Execution
auto evaluate(
const std::string& name,
const std::vector<Value>& args
) -> ScalarFunctionResult;
};
struct FunctionInfo {
ScalarFunctionPtr func;
int32_t min_args;
int32_t max_args; // -1 for variadic
FunctionVolatility volatility;
};
8.7.2 Function Categories
Cognica provides 150+ built-in functions:
String Functions (30+):
concat,upper,lower,trim,ltrim,rtrimsubstr,substring,left,rightlength,char_length,octet_lengthreplace,translate,reverselike,ilike,regexp_matchsplit_part,string_to_array
Math Functions (20+):
abs,ceil,floor,round,truncsqrt,cbrt,power,exp,ln,logsin,cos,tan,asin,acos,atanmod,div,gcd,lcm
DateTime Functions (25+):
now,current_date,current_time,current_timestampextract,date_part,date_truncage,date_add,date_subto_char,to_date,to_timestamp
JSON Functions (30+):
jsonb_extract_path,jsonb_extract_path_textjsonb_set,jsonb_insert,jsonb_delete_pathjsonb_agg,jsonb_object_aggjsonb_array_length,jsonb_array_elementsjsonb_typeof,jsonb_pretty
Aggregate Functions:
count,sum,avg,min,maxarray_agg,string_agg,jsonb_aggbool_and,bool_or,everystddev,variance,corr,covar_pop
8.7.3 Function Volatility
Functions are classified by side effects:
enum class FunctionVolatility {
kImmutable, // Pure function, can be constant-folded
kStable, // Same within transaction, can cache per-query
kVolatile // May return different results each call
};
Examples:
| Function | Volatility | Reason |
|---|---|---|
abs(x) | Immutable | Deterministic |
now() | Stable | Same within transaction |
random() | Volatile | Different each call |
nextval() | Volatile | Side effects |
Volatility affects optimization:
-- Immutable: can fold
SELECT * FROM t WHERE x = abs(-5) -- Becomes: WHERE x = 5
-- Stable: can cache
SELECT * FROM t WHERE created_at > now() - interval '1 day'
-- Volatile: cannot optimize
SELECT * FROM t WHERE x = random() -- Must evaluate per row
8.7.4 Overload Resolution
Functions can be overloaded by argument types:
auto resolve_overload(
const std::string& name,
const std::vector<CVMType>& arg_types
) -> const FunctionOverload* {
auto& overloads = registry_.get_overloads(name);
// Exact match first
for (const auto& overload : overloads) {
if (types_match_exactly(overload.param_types, arg_types)) {
return &overload;
}
}
// Implicit conversion match
for (const auto& overload : overloads) {
if (types_match_with_conversion(overload.param_types, arg_types)) {
return &overload;
}
}
return nullptr; // No matching overload
}
8.8 Query Rewriting
Before optimization, certain query patterns are rewritten for efficiency.
8.8.1 View Expansion
Views are expanded inline:
-- View definition
CREATE VIEW active_users AS
SELECT id, name, email FROM users WHERE status = 'active';
-- Query
SELECT name FROM active_users WHERE email LIKE '%@example.com';
-- Expanded
SELECT name FROM (
SELECT id, name, email FROM users WHERE status = 'active'
) AS active_users WHERE email LIKE '%@example.com';
8.8.2 Star Expansion
SELECT * is expanded to explicit columns:
-- Original
SELECT * FROM users
-- Expanded
SELECT id, name, email, status, created_at FROM users
This ensures consistent column ordering and enables projection pushdown.
8.8.3 Subquery Flattening
Certain subqueries can be flattened:
-- Original
SELECT * FROM (SELECT id, name FROM users WHERE status = 'active') t
WHERE t.name LIKE 'A%'
-- Flattened
SELECT id, name FROM users
WHERE status = 'active' AND name LIKE 'A%'
8.8.4 IN to JOIN Transformation
IN subqueries often become joins:
-- Original
SELECT * FROM orders
WHERE user_id IN (SELECT id FROM users WHERE status = 'active')
-- Transformed
SELECT o.* FROM orders o
INNER JOIN users u ON o.user_id = u.id
WHERE u.status = 'active'
This enables join optimization algorithms.
8.8.5 Cypher Query Rewriting
When the parser encounters a cypher() function call in a FROM clause, it rewrites the call into a standard SQL subselect. The cypher() function accepts a graph name, a Cypher query string, and optional parameters, and the parser delegates to a Cypher rewriter that transforms the Cypher query into an equivalent SQL subquery operating on the graph's underlying collections.
-- Original: Cypher function call in FROM
SELECT v.name, v.age
FROM cypher('social', 'MATCH (n:Person) RETURN n.name, n.age') AS v(name, age)
-- Rewritten: SQL subselect over graph collections
SELECT v.name, v.age
FROM (SELECT ... FROM __graph_social_vertices ... ) AS v(name, age)
The rewriter enforces several constraints:
cypher()is only valid in FROM clauses; using it in SELECT or WHERE produces a clear error.- Mutating Cypher queries (CREATE, DELETE, SET) cannot appear directly in JOINs or multi-source FROM clauses — they must be wrapped in a CTE to ensure deterministic execution order.
- The function requires an explicit column definition list so the parser knows the output schema.
auto ASTBuilder::build_cypher_range_subselect_(
const JsonValue& node, bool disallow_mutating_join_usage)
-> std::unique_ptr<ast::RangeSubselect>;
The is_cypher_range_function_() helper inspects the parse tree to determine whether a RangeFunction node refers to cypher(), enabling the parser to route the node through the Cypher rewriting path instead of the standard table function path.
8.8.6 Expression-Based LIMIT and OFFSET
LIMIT and OFFSET clauses now accept arbitrary expressions in addition to integer literals. The parser builds AST expression nodes for these clauses while retaining the integer shortcut for constant values:
-- Constant LIMIT (both limit and limit_expr are set)
SELECT * FROM users LIMIT 10
-- Expression LIMIT (only limit_expr is set; limit is empty)
SELECT * FROM users LIMIT (SELECT count(*) / 10 FROM users)
The AST representation reflects this dual approach:
class SelectStmt : public Stmt {
ExprPtr limit_expr; // Full expression tree
ExprPtr offset_expr; // Full expression tree
std::optional<int64_t> limit; // Constant shortcut (if available)
std::optional<int64_t> offset; // Constant shortcut (if available)
};
When a constant value is available, the planner uses it directly for cost estimation and optimization. When only the expression is present, the physical plan evaluates it at execution time during the open() phase of PhysicalLimit. The expression must not reference query columns — it is evaluated against an empty document context, and references to row-dependent expressions produce a clear error.
8.9 Error Handling
Good error messages are crucial for usability.
8.9.1 Error Categories
enum class ErrorCategory {
kSyntax, // Parse errors
kSemantic, // Type/name errors
kExecution, // Runtime errors
kInternal // Bug in database
};
8.9.2 Error Context
Errors include context for debugging:
struct QueryError {
ErrorCategory category;
std::string message;
std::optional<int32_t> line;
std::optional<int32_t> column;
std::optional<std::string> hint;
std::optional<std::string> detail;
};
Example:
SELECT * FROM users WHERE sttaus = 'active'
ERROR: Column 'sttaus' not found in table 'users'
LINE 1: SELECT * FROM users WHERE sttaus = 'active'
^
HINT: Did you mean 'status'?
8.9.3 Error Recovery
The parser attempts error recovery for better diagnostics:
SELECT id name FROM users WHERE status =
Rather than failing at the first error (id name missing comma), the parser continues to report the incomplete WHERE clause.
8.10 Summary
This chapter examined Cognica's SQL parsing and semantic analysis infrastructure. Key takeaways:
-
PostgreSQL compatibility via libpg_query provides battle-tested SQL parsing with full support for CTEs, window functions, JSON operators, and more.
-
The AST provides a type-safe, navigable representation with 56+ node types covering all SQL constructs.
-
AST building recursively transforms JSON parse trees using 70+ specialized builder methods.
-
Semantic analysis validates queries by checking table/column existence, resolving ambiguous references, and validating aggregate usage.
-
The type system bridges SQL types to internal representations with inference rules and implicit casting.
-
Function resolution matches 150+ built-in functions with overload resolution and volatility classification.
-
Query rewriting transforms views, expands stars, flattens subqueries, and rewrites Cypher graph queries into SQL subselects before optimization.
-
Expression-based LIMIT/OFFSET extends the parser to accept arbitrary expressions, not only integer literals, enabling dynamic pagination and computed bounds.
The parsed and validated AST feeds into the logical planner, where the declarative SQL is transformed into a tree of relational operators — the subject of our next chapter.