Chapter 8: SQL Parser and Semantic Analysis

This chapter examines how Cognica transforms SQL text into executable query plans. We explore the parsing infrastructure built on PostgreSQL's battle-tested parser, the Abstract Syntax Tree representation, and the semantic analysis phase that validates and enriches queries with type information.

8.1 The SQL Processing Pipeline

SQL is a declarative language - users specify what data they want, not how to retrieve it. The database must bridge this gap through multiple transformation stages.

8.1.1 Pipeline Overview

Loading diagram...

Each stage transforms the query into a more concrete representation:

StageInputOutputPurpose
Lexical AnalysisSQL textTokensBreak into words
ParsingTokensParse TreeRecognize grammar
AST ConstructionParse TreeASTStructured representation
Semantic AnalysisASTAnnotated ASTType checking, name resolution
Logical PlanningAnnotated ASTLogical PlanRelational operators
OptimizationLogical PlanOptimized PlanCost-based transformations
Physical PlanningOptimized PlanPhysical PlanExecution strategies
Code GenerationPhysical PlanBytecodeExecutable instructions

8.1.2 Why Multiple Stages?

Separation of concerns enables:

Modularity: Each stage can be tested and optimized independently.

Reusability: The same AST serves multiple backends (CVM, Volcano iterator).

Extensibility: New optimizations can be added without modifying parsing.

Debugging: Problems can be isolated to specific stages.

8.2 Lexical Analysis and Parsing

Cognica leverages libpg_query, a standalone extraction of PostgreSQL's parser, providing full PostgreSQL 17 compatibility.

8.2.1 Why PostgreSQL's Parser?

PostgreSQL's SQL parser is:

  • Battle-tested: 25+ years of production use
  • Standards-compliant: ANSI SQL with PostgreSQL extensions
  • Feature-rich: Supports CTEs, window functions, JSON operators, arrays
  • Well-maintained: Regular updates with PostgreSQL releases

Building a custom parser would require years of effort to match this maturity.

8.2.2 libpg_query Integration

The ParseResult class wraps libpg_query with RAII semantics:

class ParseResult {
public:
  // Parse SQL string, returns result with parse tree or error
  static auto parse(std::string_view sql) -> ParseResult;

  // Access parse tree as JSON
  auto get_parse_tree() const -> std::string_view;

  // Error information
  auto has_error() const -> bool;
  auto get_error_message() const -> std::string_view;
  auto get_error_line() const -> int32_t;
  auto get_error_column() const -> int32_t;

private:
  PgQueryParseResult result_;  // libpg_query C struct
};

Memory Safety: The destructor automatically frees libpg_query's internal allocations.

8.2.3 Parse Tree Format

libpg_query produces a JSON representation of PostgreSQL's internal parse tree:

SELECT id, name FROM users WHERE status = 'active'

Produces (simplified):

{
  "stmts": [{
    "stmt": {
      "SelectStmt": {
        "targetList": [
          {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": {"sval": "id"}}]}}}},
          {"ResTarget": {"val": {"ColumnRef": {"fields": [{"String": {"sval": "name"}}]}}}}
        ],
        "fromClause": [
          {"RangeVar": {"relname": "users"}}
        ],
        "whereClause": {
          "A_Expr": {
            "kind": "AEXPR_OP",
            "name": [{"String": {"sval": "="}}],
            "lexpr": {"ColumnRef": {"fields": [{"String": {"sval": "status"}}]}},
            "rexpr": {"A_Const": {"sval": {"sval": "active"}}}
          }
        }
      }
    }
  }]
}

8.2.4 Additional libpg_query Features

Query Fingerprinting:

class FingerprintResult {
  static auto fingerprint(std::string_view sql) -> FingerprintResult;
  auto get_fingerprint() const -> uint64_t;  // Hash for cache key
};

Fingerprints normalize queries for caching:

SELECT * FROM users WHERE id = 1
SELECT * FROM users WHERE id = 2

Both produce the same fingerprint, enabling prepared statement caching.

Tokenization:

class ScanResult {
  static auto scan(std::string_view sql) -> ScanResult;
  auto get_tokens() const -> std::vector<Token>;
};

Useful for syntax highlighting and query analysis tools.

Deparsing:

class DeparseResult {
  static auto deparse(const ProtobufParseResult& tree) -> DeparseResult;
  auto get_sql() const -> std::string_view;
};

Converts parse trees back to normalized SQL, useful for query logging.

8.3 Abstract Syntax Tree

The AST provides a type-safe, navigable representation of SQL queries.

8.3.1 Node Hierarchy

Loading diagram...

Node: Base class with type discriminator.

Expr: Expressions that produce values.

Stmt: Complete SQL statements.

8.3.2 Node Type Enumeration

Cognica supports 56+ node types:

enum class NodeType : uint8_t {
  // Statements
  kSelectStmt, kInsertStmt, kUpdateStmt, kDeleteStmt,
  kCreateStmt, kDropStmt, kAlterStmt, kIndexStmt,
  kTruncateStmt, kRenameStmt, kTransactionStmt,
  kExplainStmt, kAnalyzeStmt, kCopyStmt,
  kViewStmt, kCreateMatViewStmt, kRefreshMatViewStmt,
  kCreateFunctionStmt, kDropFunctionStmt,
  kCreateTriggerStmt, kDropTriggerStmt,
  kGrantStmt, kRevokeStmt, kCreateRoleStmt,
  kCreatePolicyStmt, kAlterPolicyStmt,

  // Expressions
  kColumnRef, kConstant, kParamRef,
  kBinaryExpr, kUnaryExpr, kFunctionCall,
  kCaseExpr, kCastExpr, kSubqueryExpr,
  kArrayExpr, kArraySubscript, kArraySlice,
  kCoalesceExpr, kNullIfExpr, kTypedLiteral,

  // Other
  kJoinExpr, kRangeVar, kTargetEntry, kOrderByItem,
  kWindowDef, kGroupingSet, kCommonTableExpr,
  // ...
};

8.3.3 Expression Nodes

ColumnRef - Table column references:

class ColumnRef : public Expr {
  std::optional<std::string> schema_name;   // Optional schema
  std::optional<std::string> table_name;    // Optional table/alias
  std::string column_name;                  // Column name
};

Examples:

  • id -> {column_name: "id"}
  • users.id -> {table_name: "users", column_name: "id"}
  • public.users.id -> {schema_name: "public", table_name: "users", column_name: "id"}

Constant - Literal values:

class Constant : public Expr {
  enum class Type { kNull, kBool, kInt64, kDouble, kString, kArray };

  Type value_type;
  std::variant<
    std::monostate,     // NULL
    bool,               // Boolean
    int64_t,            // Integer
    double,             // Floating point
    std::string,        // String
    std::vector<Constant>  // Array
  > value;
};

BinaryExpr - Binary operators:

class BinaryExpr : public Expr {
  BinaryOpType op;
  ExprPtr left;
  ExprPtr right;
};

With 50+ operator types:

enum class BinaryOpType {
  // Arithmetic
  kAdd, kSubtract, kMultiply, kDivide, kModulo,

  // Comparison
  kEqual, kNotEqual, kLessThan, kLessEqual,
  kGreaterThan, kGreaterEqual,

  // Logical
  kAnd, kOr,

  // String
  kLike, kILike, kSimilarTo, kRegexMatch,

  // JSON (12 operators)
  kJSONExtract,        // ->
  kJSONExtractText,    // ->>
  kJSONExtractPath,    // #>
  kJSONExtractPathText,// #>>
  kJSONContains,       // @>
  kJSONContainedBy,    // <@
  kJSONKeyExists,      // ?
  kJSONAnyKeyExists,   // ?|
  kJSONAllKeysExist,   // ?&
  kJSONConcat,         // ||
  kJSONDelete,         // -
  kJSONDeletePath,     // #-

  // Array
  kArrayContains, kArrayConcat,

  // Other
  kIn, kNotIn, kBetween, kNotBetween,
  kIsDistinctFrom, kIsNotDistinctFrom,
};

FunctionCall - Function invocations:

class FunctionCall : public Expr {
  std::string function_name;
  ExprList arguments;
  bool is_distinct;           // COUNT(DISTINCT x)
  bool is_star;               // COUNT(*)
  ExprPtr filter;             // FILTER (WHERE ...)
  WindowDef window_def;       // OVER clause
  bool agg_within_group;      // WITHIN GROUP (ORDER BY ...)
};

8.3.4 Statement Nodes

SelectStmt - The most complex statement type:

class SelectStmt : public Stmt {
  // WITH clause
  std::vector<CommonTableExpr> with_clause;
  bool with_recursive;

  // SELECT list
  std::vector<TargetEntry> target_list;
  bool distinct;
  ExprList distinct_on;

  // FROM clause
  std::vector<RangeVar> from_clause;
  std::vector<RangeSubselect> from_subqueries;
  std::vector<RangeTableFunc> from_table_funcs;
  std::vector<JoinExpr> joins;

  // Filtering
  ExprPtr where_clause;
  ExprList group_by;
  std::vector<GroupingSet> grouping_sets;
  ExprPtr having_clause;

  // Window definitions
  std::vector<NamedWindowDef> window_clause;

  // Ordering and limits
  std::vector<OrderByItem> order_by;
  std::optional<int64_t> limit;
  std::optional<int64_t> offset;

  // Set operations
  SetOperation set_op;
  SelectStmtPtr set_left;
  SelectStmtPtr set_right;
};

JoinExpr - Join specifications:

enum class JoinType {
  kInner, kLeft, kRight, kFull, kCross,
  kSemi, kAnti  // Generated by subquery optimization
};

class JoinExpr {
  JoinType join_type;
  RangeVar left;
  RangeVar right;
  ExprPtr join_condition;           // ON clause
  std::vector<std::string> using_columns;  // USING(...)
  bool is_natural;                  // NATURAL JOIN
};

8.3.5 Smart Pointer Conventions

The AST uses unique ownership:

using NodePtr = std::unique_ptr<Node>;
using ExprPtr = std::unique_ptr<Expr>;
using StmtPtr = std::unique_ptr<Stmt>;
using ExprList = std::vector<ExprPtr>;
using StmtList = std::vector<StmtPtr>;

Unique pointers ensure:

  • No accidental sharing of mutable state
  • Automatic cleanup when AST is destroyed
  • Clear ownership semantics

8.4 AST Builder

The AST builder converts libpg_query's JSON parse tree into Cognica's typed AST.

8.4.1 Builder Architecture

class ASTBuilder {
public:
  // Build single statement (returns nullptr for multiple statements)
  auto build(const ParseTree& tree, std::string_view original_sql) -> StmtPtr;

  // Build multiple statements
  auto build_all(const ParseTree& tree) -> std::vector<StmtPtr>;

  // Error handling
  auto has_error() const -> bool;
  auto get_last_error() const -> const std::string&;

private:
  // Statement builders (27 types)
  auto build_select_stmt_(const Value& node) -> StmtPtr;
  auto build_insert_stmt_(const Value& node) -> StmtPtr;
  auto build_update_stmt_(const Value& node) -> StmtPtr;
  auto build_delete_stmt_(const Value& node) -> StmtPtr;
  // ... 23 more

  // Expression builders (25+ types)
  auto build_expr_(const Value& node) -> ExprPtr;
  auto build_column_ref_(const Value& node) -> ExprPtr;
  auto build_constant_(const Value& node) -> ExprPtr;
  auto build_binary_expr_(const Value& node) -> ExprPtr;
  auto build_function_call_(const Value& node) -> ExprPtr;
  // ... 20 more
};

8.4.2 Recursive Descent

The builder uses recursive descent to traverse the JSON tree:

auto ASTBuilder::build_expr_(const Value& node) -> ExprPtr {
  // Determine expression type from JSON structure
  if (node.HasMember("ColumnRef")) {
    return build_column_ref_(node["ColumnRef"]);
  }
  if (node.HasMember("A_Const")) {
    return build_constant_(node["A_Const"]);
  }
  if (node.HasMember("A_Expr")) {
    return build_binary_expr_(node["A_Expr"]);
  }
  if (node.HasMember("FuncCall")) {
    return build_function_call_(node["FuncCall"]);
  }
  if (node.HasMember("SubLink")) {
    return build_subquery_expr_(node["SubLink"]);
  }
  // ... handle other expression types

  set_error("Unknown expression type");
  return nullptr;
}

8.4.3 Operator Mapping

PostgreSQL operator names must be mapped to Cognica's enum:

auto ASTBuilder::parse_operator_(std::string_view op_name) -> BinaryOpType {
  static const std::unordered_map<std::string_view, BinaryOpType> mapping = {
    // Arithmetic
    {"+", BinaryOpType::kAdd},
    {"-", BinaryOpType::kSubtract},
    {"*", BinaryOpType::kMultiply},
    {"/", BinaryOpType::kDivide},
    {"%", BinaryOpType::kModulo},

    // Comparison
    {"=", BinaryOpType::kEqual},
    {"<>", BinaryOpType::kNotEqual},
    {"!=", BinaryOpType::kNotEqual},
    {"<", BinaryOpType::kLessThan},
    {"<=", BinaryOpType::kLessEqual},
    {">", BinaryOpType::kGreaterThan},
    {">=", BinaryOpType::kGreaterEqual},

    // JSON operators
    {"->", BinaryOpType::kJSONExtract},
    {"->>", BinaryOpType::kJSONExtractText},
    {"#>", BinaryOpType::kJSONExtractPath},
    {"#>>", BinaryOpType::kJSONExtractPathText},
    {"@>", BinaryOpType::kJSONContains},
    {"<@", BinaryOpType::kJSONContainedBy},
    {"?", BinaryOpType::kJSONKeyExists},
    {"?|", BinaryOpType::kJSONAnyKeyExists},
    {"?&", BinaryOpType::kJSONAllKeysExist},

    // Pattern matching
    {"~~", BinaryOpType::kLike},
    {"~~*", BinaryOpType::kILike},
    {"~", BinaryOpType::kRegexMatch},
    {"~*", BinaryOpType::kRegexMatchCI},
  };

  auto it = mapping.find(op_name);
  return it != mapping.end() ? it->second : BinaryOpType::kUnknown;
}

8.4.4 Complex Node Handling

CASE Expressions:

CASE status
  WHEN 'active' THEN 1
  WHEN 'pending' THEN 2
  ELSE 0
END
auto ASTBuilder::build_case_expr_(const Value& node) -> ExprPtr {
  auto result = std::make_unique<CaseExpr>();

  // Simple CASE: CASE expr WHEN value THEN result
  if (node.HasMember("arg")) {
    result->case_expr = build_expr_(node["arg"]);
  }

  // WHEN clauses
  for (const auto& when : node["args"].GetArray()) {
    CaseWhen case_when;
    case_when.condition = build_expr_(when["CaseWhen"]["expr"]);
    case_when.result = build_expr_(when["CaseWhen"]["result"]);
    result->when_clauses.push_back(std::move(case_when));
  }

  // ELSE clause
  if (node.HasMember("defresult")) {
    result->else_result = build_expr_(node["defresult"]);
  }

  return result;
}

Window Functions:

SUM(amount) OVER (PARTITION BY category ORDER BY date ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
auto ASTBuilder::build_window_def_(const Value& node) -> WindowDef {
  WindowDef def;

  // PARTITION BY
  if (node.HasMember("partitionClause")) {
    for (const auto& expr : node["partitionClause"].GetArray()) {
      def.partition_by.push_back(build_expr_(expr));
    }
  }

  // ORDER BY
  if (node.HasMember("orderClause")) {
    for (const auto& item : node["orderClause"].GetArray()) {
      def.order_by.push_back(build_order_by_item_(item));
    }
  }

  // Frame specification
  def.frame_options = node.HasMember("frameOptions")
    ? node["frameOptions"].GetInt()
    : 0;

  if (node.HasMember("startOffset")) {
    def.frame_start_offset = build_expr_(node["startOffset"]);
  }
  if (node.HasMember("endOffset")) {
    def.frame_end_offset = build_expr_(node["endOffset"]);
  }

  return def;
}

8.5 Semantic Analysis

Semantic analysis validates the AST and enriches it with type information.

8.5.1 Validation Tasks

TaskDescriptionExample Error
Table existenceVerify tables exist"Table 'foo' not found"
Column existenceVerify columns exist"Column 'bar' not found in 'users'"
Type compatibilityCheck operator types"Cannot compare string to integer"
Ambiguity resolutionResolve unqualified names"Column 'id' is ambiguous"
Aggregate validationCheck GROUP BY usage"Column must appear in GROUP BY"

8.5.2 Semantic Analyzer Interface

class SemanticAnalyzer {
public:
  explicit SemanticAnalyzer(const SchemaProvider* schema_provider);

  // Analyze any statement
  auto analyze(const ast::Stmt& stmt) -> Status;

  // Analyze SELECT with column resolution
  auto analyze_select(const ast::SelectStmt& stmt) -> Status;

  // Analyze and resolve column references
  auto analyze_and_resolve(ast::SelectStmt* stmt) -> Status;

private:
  const SchemaProvider* schema_provider_;
};

8.5.3 Schema Provider Interface

The analyzer queries schema information through an abstract interface:

class SchemaProvider {
public:
  virtual ~SchemaProvider() = default;

  // Table existence
  virtual auto collection_exists(const std::string& name) const -> bool = 0;

  // Table enumeration
  virtual auto get_collection_names() const
      -> std::unordered_set<std::string> = 0;

  // Column enumeration
  virtual auto get_column_names(const std::string& table_name) const
      -> std::unordered_set<std::string> = 0;
};

This abstraction enables:

  • Testing with mock schemas
  • Different backends (document DB, relational tables, external tables)
  • Schema caching

8.5.4 Name Resolution

Unqualified column references must be resolved:

SELECT id, name FROM users u JOIN orders o ON u.id = o.user_id

The analyzer must determine that:

  • id refers to u.id (or error if ambiguous)
  • name refers to u.name
  • user_id refers to o.user_id

Resolution Algorithm:

auto SemanticAnalyzer::resolve_column_(
    const ColumnRef& col,
    const std::vector<TableScope>& scopes
) -> ResolvedColumn {

  std::vector<ResolvedColumn> candidates;

  for (const auto& scope : scopes) {
    if (col.table_name.has_value()) {
      // Qualified: match table name
      if (scope.alias == col.table_name || scope.table_name == col.table_name) {
        if (scope.has_column(col.column_name)) {
          candidates.push_back({scope.alias, col.column_name});
        }
      }
    } else {
      // Unqualified: search all tables
      if (scope.has_column(col.column_name)) {
        candidates.push_back({scope.alias, col.column_name});
      }
    }
  }

  if (candidates.empty()) {
    throw SemanticError("Column '" + col.column_name + "' not found");
  }
  if (candidates.size() > 1) {
    throw SemanticError("Column '" + col.column_name + "' is ambiguous");
  }

  return candidates[0];
}

8.5.5 Scope Management

Queries can have nested scopes (subqueries, CTEs):

WITH active_users AS (
  SELECT id, name FROM users WHERE status = 'active'
)
SELECT a.name, COUNT(o.id)
FROM active_users a
JOIN orders o ON a.id = o.user_id
GROUP BY a.name

Scope stack:

  1. CTE scope: active_users(id, name)
  2. FROM scope: a(id, name), o(id, user_id, amount, ...)
class ScopeManager {
  std::vector<Scope> scope_stack_;

public:
  void push_scope(Scope scope) {
    scope_stack_.push_back(std::move(scope));
  }

  void pop_scope() {
    scope_stack_.pop_back();
  }

  auto resolve_column(const std::string& name) -> ResolvedColumn {
    // Search from innermost to outermost scope
    for (auto it = scope_stack_.rbegin(); it != scope_stack_.rend(); ++it) {
      if (auto col = it->find_column(name)) {
        return *col;
      }
    }
    throw SemanticError("Column not found: " + name);
  }
};

8.5.6 Aggregate Validation

Aggregates impose constraints on non-aggregated columns:

-- Valid: name is in GROUP BY
SELECT name, COUNT(*) FROM users GROUP BY name

-- Invalid: email is not in GROUP BY
SELECT name, email, COUNT(*) FROM users GROUP BY name

Validation Algorithm:

auto SemanticAnalyzer::validate_grouping_(
    const SelectStmt& stmt
) -> Status {

  if (stmt.group_by.empty() && !has_aggregates(stmt.target_list)) {
    return Status::OK();  // No grouping, no aggregates
  }

  std::unordered_set<std::string> grouped_columns;
  for (const auto& expr : stmt.group_by) {
    if (auto* col = dynamic_cast<ColumnRef*>(expr.get())) {
      grouped_columns.insert(col->column_name);
    }
  }

  for (const auto& target : stmt.target_list) {
    if (!is_aggregate(target.expr.get())) {
      auto columns = extract_columns(target.expr.get());
      for (const auto& col : columns) {
        if (grouped_columns.find(col) == grouped_columns.end()) {
          return Status::SemanticError(
            "Column '" + col + "' must appear in GROUP BY clause"
          );
        }
      }
    }
  }

  return Status::OK();
}

8.6 Type System

Cognica's type system bridges SQL types and internal representations.

8.6.1 SQL Type Categories

CategorySQL TypesInternal Type
IntegerINTEGER, BIGINT, SMALLINTint64_t
FloatREAL, DOUBLE PRECISIONdouble
StringVARCHAR, CHAR, TEXTstd::string
BooleanBOOLEANbool
JSONJSON, JSONBDocument
ArrayINTEGER[], VARCHAR[]std::vector
TemporalDATE, TIME, TIMESTAMPint64_t (epoch)

8.6.2 Type Inference

Expressions have inferred types based on operators and operands:

type(a+b)=promote(type(a),type(b))\text{type}(a + b) = \text{promote}(\text{type}(a), \text{type}(b))

Type Promotion Rules:

Left TypeRight TypeResult Type
INTEGERINTEGERINTEGER
INTEGERDOUBLEDOUBLE
INTEGERSTRINGERROR
STRINGSTRINGSTRING
JSONSTRINGJSON
auto infer_binary_type(
    BinaryOpType op,
    CVMType left_type,
    CVMType right_type
) -> CVMType {

  switch (op) {
    case BinaryOpType::kAdd:
    case BinaryOpType::kSubtract:
    case BinaryOpType::kMultiply:
    case BinaryOpType::kDivide:
      // Numeric operations
      if (left_type == CVMType::kDouble || right_type == CVMType::kDouble) {
        return CVMType::kDouble;
      }
      return CVMType::kInt64;

    case BinaryOpType::kEqual:
    case BinaryOpType::kLessThan:
    // ... other comparisons
      return CVMType::kBool;

    case BinaryOpType::kLike:
    case BinaryOpType::kConcat:
      return CVMType::kString;

    case BinaryOpType::kJSONExtract:
      return CVMType::kJSON;

    case BinaryOpType::kJSONExtractText:
      return CVMType::kString;

    default:
      return CVMType::kUnknown;
  }
}

8.6.3 Implicit Casting

Some type mismatches can be resolved through implicit casts:

SELECT * FROM users WHERE id = '123'  -- String compared to integer

The analyzer inserts an implicit cast:

auto maybe_insert_cast_(ExprPtr& expr, CVMType target_type) -> Status {
  CVMType source_type = infer_type(expr.get());

  if (source_type == target_type) {
    return Status::OK();
  }

  if (can_implicit_cast(source_type, target_type)) {
    expr = std::make_unique<CastExpr>(std::move(expr), target_type);
    return Status::OK();
  }

  return Status::TypeError(
    "Cannot convert " + type_name(source_type) +
    " to " + type_name(target_type)
  );
}

Implicit Cast Rules:

FromToAllowed
INTEGERDOUBLEYes
STRINGINTEGERYes (with validation)
STRINGDOUBLEYes (with validation)
INTEGERSTRINGYes
BOOLEANINTEGERNo
JSONSTRINGYes (serialization)

8.7 Function Resolution

The analyzer resolves function calls to their implementations.

8.7.1 Function Registry

class ScalarFunctionRegistry {
public:
  // Function lookup
  auto is_registered(const std::string& name) const -> bool;
  auto get_function_info(const std::string& name) const -> FunctionInfo;

  // Execution
  auto evaluate(
    const std::string& name,
    const std::vector<Value>& args
  ) -> ScalarFunctionResult;
};

struct FunctionInfo {
  ScalarFunctionPtr func;
  int32_t min_args;
  int32_t max_args;  // -1 for variadic
  FunctionVolatility volatility;
};

8.7.2 Function Categories

Cognica provides 150+ built-in functions:

String Functions (30+):

  • concat, upper, lower, trim, ltrim, rtrim
  • substr, substring, left, right
  • length, char_length, octet_length
  • replace, translate, reverse
  • like, ilike, regexp_match
  • split_part, string_to_array

Math Functions (20+):

  • abs, ceil, floor, round, trunc
  • sqrt, cbrt, power, exp, ln, log
  • sin, cos, tan, asin, acos, atan
  • mod, div, gcd, lcm

DateTime Functions (25+):

  • now, current_date, current_time, current_timestamp
  • extract, date_part, date_trunc
  • age, date_add, date_sub
  • to_char, to_date, to_timestamp

JSON Functions (30+):

  • jsonb_extract_path, jsonb_extract_path_text
  • jsonb_set, jsonb_insert, jsonb_delete_path
  • jsonb_agg, jsonb_object_agg
  • jsonb_array_length, jsonb_array_elements
  • jsonb_typeof, jsonb_pretty

Aggregate Functions:

  • count, sum, avg, min, max
  • array_agg, string_agg, jsonb_agg
  • bool_and, bool_or, every
  • stddev, variance, corr, covar_pop

8.7.3 Function Volatility

Functions are classified by side effects:

enum class FunctionVolatility {
  kImmutable,  // Pure function, can be constant-folded
  kStable,     // Same within transaction, can cache per-query
  kVolatile    // May return different results each call
};

Examples:

FunctionVolatilityReason
abs(x)ImmutableDeterministic
now()StableSame within transaction
random()VolatileDifferent each call
nextval()VolatileSide effects

Volatility affects optimization:

-- Immutable: can fold
SELECT * FROM t WHERE x = abs(-5)  -- Becomes: WHERE x = 5

-- Stable: can cache
SELECT * FROM t WHERE created_at > now() - interval '1 day'

-- Volatile: cannot optimize
SELECT * FROM t WHERE x = random()  -- Must evaluate per row

8.7.4 Overload Resolution

Functions can be overloaded by argument types:

auto resolve_overload(
    const std::string& name,
    const std::vector<CVMType>& arg_types
) -> const FunctionOverload* {

  auto& overloads = registry_.get_overloads(name);

  // Exact match first
  for (const auto& overload : overloads) {
    if (types_match_exactly(overload.param_types, arg_types)) {
      return &overload;
    }
  }

  // Implicit conversion match
  for (const auto& overload : overloads) {
    if (types_match_with_conversion(overload.param_types, arg_types)) {
      return &overload;
    }
  }

  return nullptr;  // No matching overload
}

8.8 Query Rewriting

Before optimization, certain query patterns are rewritten for efficiency.

8.8.1 View Expansion

Views are expanded inline:

-- View definition
CREATE VIEW active_users AS
  SELECT id, name, email FROM users WHERE status = 'active';

-- Query
SELECT name FROM active_users WHERE email LIKE '%@example.com';

-- Expanded
SELECT name FROM (
  SELECT id, name, email FROM users WHERE status = 'active'
) AS active_users WHERE email LIKE '%@example.com';

8.8.2 Star Expansion

SELECT * is expanded to explicit columns:

-- Original
SELECT * FROM users

-- Expanded
SELECT id, name, email, status, created_at FROM users

This ensures consistent column ordering and enables projection pushdown.

8.8.3 Subquery Flattening

Certain subqueries can be flattened:

-- Original
SELECT * FROM (SELECT id, name FROM users WHERE status = 'active') t
WHERE t.name LIKE 'A%'

-- Flattened
SELECT id, name FROM users
WHERE status = 'active' AND name LIKE 'A%'

8.8.4 IN to JOIN Transformation

IN subqueries often become joins:

-- Original
SELECT * FROM orders
WHERE user_id IN (SELECT id FROM users WHERE status = 'active')

-- Transformed
SELECT o.* FROM orders o
INNER JOIN users u ON o.user_id = u.id
WHERE u.status = 'active'

This enables join optimization algorithms.

8.8.5 Cypher Query Rewriting

When the parser encounters a cypher() function call in a FROM clause, it rewrites the call into a standard SQL subselect. The cypher() function accepts a graph name, a Cypher query string, and optional parameters, and the parser delegates to a Cypher rewriter that transforms the Cypher query into an equivalent SQL subquery operating on the graph's underlying collections.

-- Original: Cypher function call in FROM
SELECT v.name, v.age
FROM cypher('social', 'MATCH (n:Person) RETURN n.name, n.age') AS v(name, age)

-- Rewritten: SQL subselect over graph collections
SELECT v.name, v.age
FROM (SELECT ... FROM __graph_social_vertices ... ) AS v(name, age)

The rewriter enforces several constraints:

  • cypher() is only valid in FROM clauses; using it in SELECT or WHERE produces a clear error.
  • Mutating Cypher queries (CREATE, DELETE, SET) cannot appear directly in JOINs or multi-source FROM clauses — they must be wrapped in a CTE to ensure deterministic execution order.
  • The function requires an explicit column definition list so the parser knows the output schema.
auto ASTBuilder::build_cypher_range_subselect_(
    const JsonValue& node, bool disallow_mutating_join_usage)
    -> std::unique_ptr<ast::RangeSubselect>;

The is_cypher_range_function_() helper inspects the parse tree to determine whether a RangeFunction node refers to cypher(), enabling the parser to route the node through the Cypher rewriting path instead of the standard table function path.

8.8.6 Expression-Based LIMIT and OFFSET

LIMIT and OFFSET clauses now accept arbitrary expressions in addition to integer literals. The parser builds AST expression nodes for these clauses while retaining the integer shortcut for constant values:

-- Constant LIMIT (both limit and limit_expr are set)
SELECT * FROM users LIMIT 10

-- Expression LIMIT (only limit_expr is set; limit is empty)
SELECT * FROM users LIMIT (SELECT count(*) / 10 FROM users)

The AST representation reflects this dual approach:

class SelectStmt : public Stmt {
  ExprPtr limit_expr;                 // Full expression tree
  ExprPtr offset_expr;                // Full expression tree
  std::optional<int64_t> limit;       // Constant shortcut (if available)
  std::optional<int64_t> offset;      // Constant shortcut (if available)
};

When a constant value is available, the planner uses it directly for cost estimation and optimization. When only the expression is present, the physical plan evaluates it at execution time during the open() phase of PhysicalLimit. The expression must not reference query columns — it is evaluated against an empty document context, and references to row-dependent expressions produce a clear error.

8.9 Error Handling

Good error messages are crucial for usability.

8.9.1 Error Categories

enum class ErrorCategory {
  kSyntax,      // Parse errors
  kSemantic,    // Type/name errors
  kExecution,   // Runtime errors
  kInternal     // Bug in database
};

8.9.2 Error Context

Errors include context for debugging:

struct QueryError {
  ErrorCategory category;
  std::string message;
  std::optional<int32_t> line;
  std::optional<int32_t> column;
  std::optional<std::string> hint;
  std::optional<std::string> detail;
};

Example:

SELECT * FROM users WHERE sttaus = 'active'
ERROR: Column 'sttaus' not found in table 'users'
LINE 1: SELECT * FROM users WHERE sttaus = 'active'
                                  ^
HINT: Did you mean 'status'?

8.9.3 Error Recovery

The parser attempts error recovery for better diagnostics:

SELECT id name FROM users WHERE status =

Rather than failing at the first error (id name missing comma), the parser continues to report the incomplete WHERE clause.

8.10 Summary

This chapter examined Cognica's SQL parsing and semantic analysis infrastructure. Key takeaways:

  1. PostgreSQL compatibility via libpg_query provides battle-tested SQL parsing with full support for CTEs, window functions, JSON operators, and more.

  2. The AST provides a type-safe, navigable representation with 56+ node types covering all SQL constructs.

  3. AST building recursively transforms JSON parse trees using 70+ specialized builder methods.

  4. Semantic analysis validates queries by checking table/column existence, resolving ambiguous references, and validating aggregate usage.

  5. The type system bridges SQL types to internal representations with inference rules and implicit casting.

  6. Function resolution matches 150+ built-in functions with overload resolution and volatility classification.

  7. Query rewriting transforms views, expands stars, flattens subqueries, and rewrites Cypher graph queries into SQL subselects before optimization.

  8. Expression-based LIMIT/OFFSET extends the parser to accept arbitrary expressions, not only integer literals, enabling dynamic pagination and computed bounds.

The parsed and validated AST feeds into the logical planner, where the declarative SQL is transformed into a tree of relational operators — the subject of our next chapter.

Copyright (c) 2023-2026 Cognica, Inc.