Chapter 21: Vector Search and HNSW Index

Introduction

The emergence of deep learning has fundamentally transformed information retrieval. Dense vector embeddings—learned representations that capture semantic meaning in continuous vector spaces—enable similarity search that transcends lexical matching. A query about "automobile repairs" can match documents discussing "car maintenance" because their vector representations lie close together in embedding space, even though they share no common terms.

This chapter explores Cognica's vector search infrastructure, built around the Hierarchical Navigable Small World (HNSW) algorithm. We examine the theoretical foundations of approximate nearest neighbor search, the mathematical properties that make HNSW effective, and the engineering decisions that enable production-scale deployment. The implementation supports multiple backends, quantization strategies, and hybrid indexing approaches, providing flexibility for diverse workloads from small embeddings to billion-scale indices.

21.1 Vector Similarity Search Foundations

21.1.1 The Curse of Dimensionality

Exact nearest neighbor search in high-dimensional spaces faces fundamental computational barriers. For n vectors in d dimensions, a naive linear scan requires O(nd) distance computations per query. Tree-based methods such as KD-trees degrade to linear scan when d > 20 due to the curse of dimensionality: high-dimensional spaces become increasingly sparse, rendering spatial partitioning ineffective.

The curse manifests mathematically in the concentration of distances. For uniformly distributed points in a d-dimensional hypercube, the ratio of maximum to minimum distance converges to 1 as d increases:

\lim_{d \to \infty} \frac{\max_{i,j} \|x_i - x_j\|}{\min_{i,j} \|x_i - x_j\|} = 1

This concentration renders distance-based discrimination meaningless for exact methods in high dimensions.

21.1.2 Approximate Nearest Neighbor Search

Approximate Nearest Neighbor (ANN) algorithms trade exactness for efficiency, accepting a small probability of missing the true nearest neighbors in exchange for sublinear query time. The quality-performance tradeoff is characterized by recall, the fraction of the true k nearest neighbors returned:

\text{Recall@}k = \frac{|S_{\text{approx}} \cap S_{\text{exact}}|}{k}

where S_{\text{approx}} denotes the approximate result set and S_{\text{exact}} the true nearest neighbors.

Modern ANN algorithms achieve 95-99% recall with query times 100-1000x faster than exact search, making semantic similarity search practical at scale.
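The recall metric above can be computed directly from two result-id sets. The sketch below is illustrative only; recall_at_k is a hypothetical helper, not part of Cognica's API:

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Recall@k: fraction of the true k nearest neighbors that appear in the
// approximate result set. k is implied by the size of the exact set.
double recall_at_k(std::vector<int64_t> approx, std::vector<int64_t> exact) {
  // set_intersection requires sorted ranges
  std::sort(approx.begin(), approx.end());
  std::sort(exact.begin(), exact.end());

  std::vector<int64_t> hits;
  std::set_intersection(approx.begin(), approx.end(),
                        exact.begin(), exact.end(),
                        std::back_inserter(hits));
  return static_cast<double>(hits.size()) / exact.size();
}
```

For example, an approximate result {1, 2, 3, 5} against ground truth {1, 2, 3, 4} yields Recall@4 = 0.75.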

21.1.3 Distance Metrics

Cognica supports three distance metrics for vector similarity:

enum class MetricType : int32_t {
  kInnerProduct,  // Maximum inner product search (MIPS)
  kL2,            // Squared Euclidean distance
  kL1,            // Manhattan distance
};

Squared Euclidean Distance (L2)

The most common metric for embedding similarity:

d_{L2}(x, y) = \|x - y\|_2^2 = \sum_{i=1}^{d} (x_i - y_i)^2

Squared distance avoids the square root computation while preserving ranking order.

Inner Product (IP)

For normalized embeddings, inner product equals cosine similarity:

\text{IP}(x, y) = x \cdot y = \sum_{i=1}^{d} x_i y_i

When \|x\| = \|y\| = 1:

\text{IP}(x, y) = \cos(\theta_{xy})

The implementation converts inner product similarity to distance for consistent semantics:

if (options_.metric_type == MetricType::kInnerProduct) {
  std::transform(dists.cbegin(), dists.cend(), dists.begin(),
                 [](auto dist) noexcept {
                   return 1.f - dist;  // Convert similarity to distance
                 });
}

Manhattan Distance (L1)

Less common but useful for sparse embeddings:

d_{L1}(x, y) = \|x - y\|_1 = \sum_{i=1}^{d} |x_i - y_i|
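The three metrics can be written in scalar reference form. The functions below are illustrative sketches of the semantics, not Cognica's SIMD-accelerated kernels:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Squared Euclidean distance: no sqrt, since ranking order is preserved
float l2_squared(const std::vector<float>& x, const std::vector<float>& y) {
  float sum = 0.f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    float diff = x[i] - y[i];
    sum += diff * diff;
  }
  return sum;
}

// Inner product: equals cos(theta) when both vectors are unit-normalized
float inner_product(const std::vector<float>& x, const std::vector<float>& y) {
  float sum = 0.f;
  for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
  return sum;
}

// Manhattan distance: sum of per-dimension absolute differences
float l1_distance(const std::vector<float>& x, const std::vector<float>& y) {
  float sum = 0.f;
  for (std::size_t i = 0; i < x.size(); ++i) sum += std::fabs(x[i] - y[i]);
  return sum;
}
```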

21.2 HNSW Algorithm Theory

21.2.1 Navigable Small World Graphs

The HNSW algorithm builds on the small-world network phenomenon—the observation that most nodes in real-world networks can be reached through a small number of hops. A navigable small world (NSW) graph augments this property with greedy routing: starting from any node, repeatedly moving to the neighbor closest to the target eventually reaches the target or its neighborhood.

For an NSW graph with n nodes, the expected number of hops for greedy search scales as:

E[\text{hops}] = O(\log n)

provided the graph maintains appropriate connectivity structure.
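Greedy routing itself is compact. The following sketch uses hypothetical names, with adjacency lists standing in for the NSW graph and a distance callback standing in for the metric; it terminates at a local minimum with respect to the target:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Greedy routing on an NSW-style graph: from `start`, repeatedly hop to the
// neighbor closest to the target; stop when no neighbor improves.
std::size_t greedy_route(
    std::size_t start, std::size_t target,
    const std::vector<std::vector<std::size_t>>& neighbors,
    const std::function<float(std::size_t, std::size_t)>& dist) {
  std::size_t current = start;
  float best = dist(current, target);

  for (;;) {
    std::size_t next = current;
    for (std::size_t n : neighbors[current]) {
      float d = dist(n, target);
      if (d < best) {
        best = d;
        next = n;
      }
    }
    if (next == current) return current;  // Local minimum reached
    current = next;
  }
}
```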

21.2.2 Hierarchical Structure

HNSW extends NSW with a hierarchical layer structure that provides logarithmic entry point selection. The hierarchy consists of layers L_0, L_1, \ldots, L_{\max} where:

  • Layer L_0 contains all n vectors
  • Each successive layer contains approximately 1/M of the nodes from the layer below
  • The maximum layer for a node is determined probabilistically

The layer assignment follows an exponential distribution:

\ell = \lfloor -\ln(\text{uniform}(0, 1)) \cdot m_L \rfloor

where m_L = 1/\ln(M) is the level multiplier. This ensures that on average:

E[|L_i|] = n \cdot \left(\frac{1}{M}\right)^i
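The layer assignment formula translates directly into code. A minimal sketch, with sample_level as a hypothetical name:

```cpp
#include <cmath>
#include <limits>
#include <random>

// Sample a node's maximum layer: l = floor(-ln(u) * m_L), m_L = 1 / ln(M).
// With M = 32, roughly 97% of nodes receive layer 0 only.
int sample_level(std::mt19937& rng, int M) {
  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  double m_L = 1.0 / std::log(static_cast<double>(M));
  double u = uniform(rng);
  // Guard against u == 0, where -ln(u) diverges
  if (u <= 0.0) u = std::numeric_limits<double>::min();
  return static_cast<int>(-std::log(u) * m_L);
}
```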

21.2.3 Search Algorithm

HNSW search proceeds in two phases:

Phase 1: Coarse Search (Layers L_{\max} to L_1)

Starting from the entry point at the top layer, greedy search descends through layers:

function SEARCH_LAYER(q, entry, ef, layer):
    candidates = min-heap({entry})
    visited = {entry}
    result = max-heap({entry})

    while candidates not empty:
        c = candidates.pop_min()
        f = result.peek_max()

        if distance(c, q) > distance(f, q):
            break  // All remaining candidates are farther

        for neighbor in get_neighbors(c, layer):
            if neighbor not in visited:
                visited.add(neighbor)
                f = result.peek_max()

                if distance(neighbor, q) < distance(f, q) or |result| < ef:
                    candidates.push(neighbor)
                    result.push(neighbor)

                    if |result| > ef:
                        result.pop_max()

    return result

Phase 2: Fine Search (Layer L_0)

At the base layer, the search expands with larger beam width (ef_search) to improve recall.

21.2.4 Key Parameters

M (Connectivity)

The maximum number of bidirectional connections per node. Cognica sets:

  • connectivity = M for layers L_1 and above
  • connectivity_base = 2M for layer L_0

config.connectivity = options_.M;          // Default: 32
config.connectivity_base = options_.M * 2; // Layer 0: 64

Higher M increases recall but also memory usage and construction time:

\text{Memory per node} \approx M \cdot \text{sizeof(id)} \cdot \text{avg\_layers}

ef_construction

The beam width during index construction. Larger values improve graph quality at the cost of construction time:

config.expansion_add = options_.ef_construction;  // Default: 40

ef_search

The beam width during search. This parameter directly controls the recall-latency tradeoff:

config.expansion_search = options_.ef_search;  // Default: 16

The relationship between ef_search and recall is approximately:

\text{Recall} \approx 1 - \exp\left(-\frac{\text{ef\_search}}{c \cdot k}\right)

where c is a dataset-dependent constant and k is the number of neighbors requested.

21.3 Index Architecture

21.3.1 Index Type Hierarchy

Cognica supports three HNSW index variants:

enum class HNSWIndexType : int32_t {
  kHNSW,           // Pure HNSW graph index
  kIVF_HNSW,       // Inverted File + HNSW hybrid
  kHNSW_USEARCH,   // USearch native implementation
};

The configuration structure encapsulates all index parameters:

struct HNSWOptions {
  HNSWIndexType index_type = HNSWIndexType::kHNSW;

  // Core HNSW parameters
  int32_t dims = 512;              // Vector dimensionality
  int32_t M = 32;                  // Max connections per node
  int32_t ef_construction = 40;    // Construction beam width
  int32_t ef_search = 16;          // Search beam width

  // IVF parameters (for kIVF_HNSW)
  int32_t num_inv_lists = 65536;   // Number of partitions
  int32_t num_segments = 32;       // PQ code segments
  int32_t num_bits = 8;            // Bits per PQ segment
  int32_t num_probe = 10;          // Partitions to search

  MetricType metric_type = MetricType::kL2;
  QuantizerType quantizer_type = QuantizerType::kFlat;
  bool normalize = false;
  int32_t shards = 1;
};

21.3.2 Quantization Strategies

Vector quantization reduces memory footprint and can improve search speed through cache efficiency:

enum class QuantizerType : int32_t {
  kFlat,        // Full precision (32-bit float)
  k4Bit,        // 4-bit scalar quantization
  k6Bit,        // 6-bit scalar quantization
  k8Bit,        // 8-bit scalar quantization
  k8BitDirect,  // Direct 8-bit (no scaling)
  k16Bit,       // 16-bit precision (float16)
};

Scalar Quantization

Maps each dimension independently to a fixed-bit representation:

q_i = \text{round}\left(\frac{x_i - \min_i}{\max_i - \min_i} \cdot (2^b - 1)\right)

where b is the number of bits. The memory savings are substantial:

Quantizer   Bits/Dim   Memory (512D)   Relative
kFlat       32         2048 bytes      100%
k16Bit      16         1024 bytes      50%
k8Bit       8          512 bytes       25%
k4Bit       4          256 bytes       12.5%
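Scalar quantization and its inverse can be sketched as follows. quantize and dequantize are illustrative helpers assuming a single global [lo, hi] range; a production quantizer would learn per-dimension statistics from a training sample:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Map floats in [lo, hi] onto b-bit integer codes (b <= 8 here)
std::vector<uint8_t> quantize(const std::vector<float>& x,
                              float lo, float hi, int bits) {
  float levels = static_cast<float>((1 << bits) - 1);
  std::vector<uint8_t> codes;
  codes.reserve(x.size());
  for (float v : x) {
    float t = std::clamp((v - lo) / (hi - lo), 0.f, 1.f);
    codes.push_back(static_cast<uint8_t>(std::lround(t * levels)));
  }
  return codes;
}

// Reconstruct approximate floats from the codes
std::vector<float> dequantize(const std::vector<uint8_t>& codes,
                              float lo, float hi, int bits) {
  float levels = static_cast<float>((1 << bits) - 1);
  std::vector<float> x;
  x.reserve(codes.size());
  for (uint8_t c : codes) {
    x.push_back(lo + (static_cast<float>(c) / levels) * (hi - lo));
  }
  return x;
}
```

The round trip loses at most half a quantization step per dimension, which is why 8-bit codes typically cost only a small recall penalty.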

Product Quantization (PQ)

For IVF_HNSW indices, Product Quantization provides aggressive compression:

enum class PQTrainType : int32_t {
  kDefault,       // Standard k-means training
  kHotStart,      // Pre-initialized centroids
  kShared,        // Shared codebook across segments
  kHypercube,     // Hypercube initialization
  kHypercubePCA,  // Hypercube + PCA rotation
};

PQ divides the vector into m segments and quantizes each segment to one of 2^b centroids:

\text{PQ}(x) = (c_1(x_{1:d/m}), c_2(x_{d/m+1:2d/m}), \ldots, c_m(x_{(m-1)d/m+1:d}))

Memory per vector: m \cdot b bits. With m = 32 segments and b = 8 bits, each vector compresses to 32 bytes regardless of dimensionality, enabling billion-scale indices in memory.

21.3.3 USearch Backend Integration

The primary backend uses the USearch library for efficient HNSW operations:

class HNSWIndexProxyUSearch final : public HNSWIndexProxy {
private:
  std::unique_ptr<unum::usearch::index_dense_t> index_;
  HNSWOptions options_;

  // Buffered insertion for batching
  std::vector<float> vectors_;
  std::vector<DocID> doc_ids_;
  absl::flat_hash_set<DocID> doc_ids_set_;

  // Training thresholds
  static constexpr auto kMinTrainingSetSize = 10'000;
  static constexpr auto kBatchSizeForBulkLoading = 100;
};

Index initialization configures USearch with Cognica's options:

HNSWIndexProxyUSearch::HNSWIndexProxyUSearch(...)
    : index_([&]() {
        using namespace unum::usearch;

        // Map metric type
        auto metric_type = metric_kind_t::l2sq_k;
        if (options_.metric_type == MetricType::kInnerProduct) {
          metric_type = metric_kind_t::ip_k;
        }

        // Map scalar precision
        auto scalar_type = scalar_kind_t::f32_k;
        switch (options_.quantizer_type) {
          case QuantizerType::k8Bit:
            scalar_type = scalar_kind_t::i8_k;
            break;
          case QuantizerType::k16Bit:
            scalar_type = scalar_kind_t::f16_k;
            break;
          default:
            break;  // Remaining quantizers keep full f32 precision
        }

        // Configure HNSW parameters
        auto config = index_dense_config_t {};
        config.connectivity = options_.M;
        config.connectivity_base = options_.M * 2;
        config.expansion_add = options_.ef_construction;
        config.expansion_search = options_.ef_search;

        auto metric = index_dense_t::metric_t {
            static_cast<size_t>(options_.dims),
            metric_type,
            scalar_type,
        };

        return std::make_unique<index_dense_t>(
            index_dense_t::make(metric, config));
      }()) {}

21.3.4 FAISS Backend Integration

For advanced use cases, FAISS provides additional capabilities:

class HNSWIndexProxyFaiss final : public HNSWIndexProxy {
private:
  std::vector<std::unique_ptr<faiss::Index>> shards_hnsw_;
  std::vector<std::unique_ptr<faiss::IndexIDMap>> shards_index_;
  std::unique_ptr<faiss::IndexShards> index_;
  std::unique_ptr<DynamicRoaringBitSet> deleted_doc_ids_;
};

FAISS integration enables:

  • Sharded indices for parallel search
  • Custom inverted list storage via KV database
  • IVF+PQ compression for billion-scale indices

21.4 Vector Storage and Analysis

21.4.1 Vector Encoding

Vectors are stored as raw binary float arrays for efficient access:

struct DenseVectorCodec {
  static auto encode(const std::vector<float>& vector,
                     std::optional<int32_t> dims = std::nullopt)
      -> std::expected<std::string, Status> {
    auto vector_dims = static_cast<int32_t>(vector.size());

    // Validate dimension alignment
    if (dims.has_value() && vector_dims % dims.value()) {
      return std::unexpected {
          Status::InvalidArgument("Invalid dimensions")
      };
    }

    auto output = std::string {};
    auto size = sizeof(float) * vector_dims;
    output.reserve(size);

    encoding::put_binary(&output, vector.data(), size);
    return output;
  }

  static int32_t get_dim(const std::string_view& data) {
    return static_cast<int32_t>(data.size() / sizeof(float));
  }
};

The binary format stores vectors contiguously without headers, enabling zero-copy access when aligned:

\text{offset}(v_i) = i \cdot d \cdot \text{sizeof(float)}
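A hypothetical decode counterpart to DenseVectorCodec::encode illustrates how the headerless layout maps back to floats. The memcpy avoids alignment assumptions; a zero-copy reinterpret_cast view is valid only for suitably aligned buffers:

```cpp
#include <cstring>
#include <string_view>
#include <vector>

// Reinterpret the raw byte payload as a vector of floats (illustrative
// sketch; not part of the DenseVectorCodec shown above)
std::vector<float> decode_vector(std::string_view data) {
  std::vector<float> out(data.size() / sizeof(float));
  std::memcpy(out.data(), data.data(), out.size() * sizeof(float));
  return out;
}
```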

21.4.2 Dense Vector Analyzer

The analyzer pipeline converts JSON arrays to binary vector format:

class DenseVectorAnalyzer final : public Analyzer {
public:
  Tokens tokenize(const Value& value) const override {
    if (!value.IsArray()) {
      THROW_EXCEPTION(std::invalid_argument(
          "Vector must be an array"));
    }

    const auto& array = value.GetArray();
    auto vectors = std::vector<float> {};

    // Support 1D vectors: [v1, v2, v3, ...]
    if (is_valid_1d_(value)) {
      vectors.reserve(array.Size());
      for (const auto& e : array) {
        vectors.emplace_back(e.GetFloat());
      }
    }
    // Support 2D vectors: [[v1, v2], [v3, v4], ...]
    else if (is_valid_2d_(value)) {
      vectors.reserve(array.Size() * dims_);
      for (const auto& v : array) {
        for (const auto& e : v.GetArray()) {
          vectors.emplace_back(e.GetFloat());
        }
      }
    }

    // Validate dimensions
    if (vectors.size() % static_cast<uint32_t>(dims_)) {
      THROW_EXCEPTION(std::invalid_argument(
          fmt::format("Invalid dimension: expected {}, got {}",
                      dims_, vectors.size())));
    }

    auto output = DenseVectorCodec::encode(vectors, dims_);
    if (!output.has_value()) {
      THROW_EXCEPTION(std::invalid_argument("Failed to encode vector"));
    }

    auto size = static_cast<int32_t>(output->size());
    return Tokens {
        Token {
            TokenType::kDenseVector,
            std::move(output.value()),
            std::nullopt,
            0,
            Offset {0, size, size},
            Offset {0, size, size},
        },
    };
  }

private:
  int32_t dims_;
};

21.5 K-NN Query Execution

21.5.1 Query Processing Pipeline

Vector queries flow through the standard FTS query pipeline:

class DenseVectorQuery final : public Query {
public:
  auto create_weight(Transaction* txn,
                     const IndexSearcher* searcher) const
      -> std::shared_ptr<Weight> override {

    const auto& context = searcher->get_context();
    const auto* hnsw_index = context->get_hnsw_index();

    // Configure search parameters
    constexpr auto kMaxSearchSize = 100uz;
    auto top_k = static_cast<int32_t>(
        options_->top_k.value_or(kMaxSearchSize));
    auto ef_search = get_field_option_value_<int32_t>("ef_search");

    // Execute HNSW search
    auto result = hnsw_index->search(term_, top_k, ef_search);

    // Sort by doc_id for boolean query integration
    ranges::sort(
        ranges::views::zip(result.doc_ids, result.distances),
        std::less<> {},
        [](const auto& row) { return std::get<0>(row); });

    return std::make_shared<DenseVectorWeight>(
        index_stats.value(),
        std::move(result.doc_ids),
        std::move(result.distances));
  }
};

21.5.2 Search Result Structure

Search results include both distances and probabilistic scores:

struct HNSWSearchResult {
  std::vector<DocID> doc_ids {};    // Document identifiers
  std::vector<float> distances {};  // Distances to query
  std::vector<float> probs {};      // Probability scores
};

21.5.3 K-NN Search Implementation

The search algorithm orchestrates buffer flushing, dimension validation, and result conversion:

HNSWSearchResult HNSWIndexProxyUSearch::search(
    const std::string_view& vector,
    int32_t k,
    std::optional<int32_t> ef_search) const {

  // Flush buffered vectors for consistency
  if (options.db.fts.hnsw_index.flush_before_search) {
    const_cast<HNSWIndexProxyUSearch*>(this)->flush(true);
  }

  // Validate query dimensions
  auto dim = DenseVectorCodec::get_dim(vector);
  if (dim != static_cast<int32_t>(index_->dimensions())) {
    THROW_EXCEPTION(std::runtime_error(
        fmt::format("Dimension mismatch: expected {}, got {}",
                    index_->dimensions(), dim)));
  }

  // Execute USearch search
  const float* query = reinterpret_cast<const float*>(vector.data());
  auto result = index_->search(query, k);

  // Convert to internal format
  auto doc_ids = std::vector<DocID> {};
  auto dists = std::vector<float> {};

  for (auto i = 0uz; i < result.size(); ++i) {
    doc_ids.emplace_back(result[i].member.key);
    dists.emplace_back(result[i].distance);
  }

  // Convert IP similarity to distance
  if (options_.metric_type == MetricType::kInnerProduct) {
    std::transform(dists.cbegin(), dists.cend(), dists.begin(),
                   [](auto d) { return 1.f - d; });
  }

  // Compute softmax probabilities
  auto probs = compute_softmax_probs(dists);

  return {
      std::move(doc_ids),
      std::move(dists),
      std::move(probs),
  };
}

21.5.4 Probabilistic Scoring via Softmax

Distances are converted to probability distributions using temperature-scaled softmax:

auto compute_score_prob_softmax(const std::vector<float>& values,
                                float temperature = 1.f)
    -> std::vector<float> {
  auto probs = std::vector<float> {};
  probs.reserve(values.size());

  // Numerically stable softmax over negated distances: subtracting the
  // minimum makes the largest exponent exp(0)
  auto min_value = *std::min_element(values.cbegin(), values.cend());

  // Compute scaled exponentials; smaller distances yield larger weights
  std::transform(values.cbegin(), values.cend(),
                 std::back_inserter(probs),
                 [min_value, temperature](auto value) {
                   return std::exp(-(value - min_value) / temperature);
                 });

  // Normalize to a probability distribution
  auto sum = std::accumulate(probs.cbegin(), probs.cend(), 0.f);
  if (sum > 0.f) {
    std::transform(probs.cbegin(), probs.cend(), probs.begin(),
                   [sum](auto p) { return p / sum; });
  }

  return probs;
}

The softmax formula for converting distances to probabilities:

p_i = \frac{\exp(-d_i / T)}{\sum_{j=1}^{k} \exp(-d_j / T)}

where d_i is the distance to the i-th result and T is the temperature parameter.

Temperature Effects:

Temperature   Distribution   Interpretation
T → 0         One-hot        Winner-take-all
T = 1         Standard       Balanced confidence
T → ∞         Uniform        Maximum entropy
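The temperature effects in the table can be checked with a direct transcription of the softmax formula. softmax_probs is an illustrative standalone helper, not the production implementation:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// p_i = exp(-d_i / T) / sum_j exp(-d_j / T), stabilized by subtracting
// the minimum distance before exponentiation
std::vector<float> softmax_probs(const std::vector<float>& dists, float T) {
  float min_d = dists[0];
  for (float d : dists) min_d = std::min(min_d, d);

  std::vector<float> probs;
  probs.reserve(dists.size());
  float sum = 0.f;
  for (float d : dists) {
    float p = std::exp(-(d - min_d) / T);
    probs.push_back(p);
    sum += p;
  }
  for (float& p : probs) p /= sum;
  return probs;
}
```

At low temperature the nearest result absorbs nearly all probability mass; at high temperature the distribution flattens toward uniform.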

21.6 Index Maintenance

21.6.1 Batched Insertion

Vectors are buffered before batch insertion for efficiency:

Status HNSWIndexProxyUSearch::add(DocID doc_id,
                                  const std::string_view& data,
                                  bool write_to_wal) {
  auto vector_dim = static_cast<int32_t>(index_->dimensions());
  auto dim = DenseVectorCodec::get_dim(data);

  if (dim != vector_dim) {
    return Status::Error("Invalid vector size");
  }

  // Buffer vector
  doc_ids_.emplace_back(doc_id);
  doc_ids_set_.emplace(doc_id);

  auto offset = vectors_.size();
  vectors_.resize(offset + vector_dim);
  memcpy(vectors_.data() + offset, data.data(),
         vector_dim * sizeof(float));

  // Write-ahead logging
  if (write_to_wal) {
    wal_->append(HNSWIndexWALOperationType::kAdd, doc_id,
                 data.data(), vector_dim * sizeof(float));
  }

  dirty_ = true;
  need_flush_ = true;

  return Status::OK();
}

21.6.2 Parallel Flush

Batch insertion uses thread pools for parallel HNSW graph construction:

Status HNSWIndexProxyUSearch::flush(bool force) {
  if (!need_flush_) return Status::OK();

  auto n = doc_ids_.size();
  if (n == 0) {
    need_flush_ = false;
    return Status::OK();
  }

  // Minimum training set requirement
  auto min_training = options_.training_set_size
      .value_or(kMinTrainingSetSize);  // Default: 10,000
  if (!force && n < min_training) {
    return Status::OK();
  }

  // Reserve capacity
  auto dims = index_->dimensions();
  auto count = vectors_.size() / dims;
  auto new_capacity = index_->size() + count;
  if (index_->capacity() < new_capacity) {
    index_->reserve(new_capacity + count);  // Over-allocate
  }

  // Parallel insertion (256 vectors per partition)
  constexpr auto kPartitionSize = 256uz;
  auto partitions = (count + kPartitionSize - 1) / kPartitionSize;

  auto& pool = NamedThreadPools::get(ThreadPoolName::kBatchWrite);
  auto futures = std::vector<std::future<void>> {};

  for (auto i = 0uz; i < partitions; ++i) {
    futures.emplace_back(pool.invoke([&, i]() {
      for (auto j = 0uz; j < kPartitionSize; ++j) {
        auto idx = i * kPartitionSize + j;
        if (idx >= count) break;

        const auto* vec = vectors_.data() + idx * dims;
        auto doc_id = doc_ids_[idx];

        index_->add(doc_id, vec,
                    unum::usearch::index_dense_t::any_thread(),
                    false);
      }
    }));
  }

  for (auto& f : futures) f.get();

  vectors_.clear();
  doc_ids_.clear();
  doc_ids_set_.clear();
  need_flush_ = false;

  return Status::OK();
}

21.6.3 Write-Ahead Logging

Durability is ensured through WAL before index modification:

enum class HNSWIndexWALOperationType : uint8_t {
  kAdd = 0,
  kRemove = 1,
};

class HNSWIndexWAL {
public:
  Status append(HNSWIndexWALOperationType op,
                DocID doc_id,
                const char* buffer,
                size_t size);

  Status apply(const FunctionRef<
      Status(HNSWIndexWALOperationType, DocID,
             const char*, size_t)>& callback);

  Status flush();
};

Recovery replays the WAL to reconstruct buffered state:

Status HNSWIndexProxyUSearch::recover() {
  return wal_->apply([this](auto op, auto doc_id,
                            const char* data, size_t size) {
    switch (op) {
      case HNSWIndexWALOperationType::kAdd:
        return add(doc_id, {data, size}, false);
      case HNSWIndexWALOperationType::kRemove:
        return remove(doc_id, false);
    }
    return Status::OK();
  });
}

21.6.4 Index Persistence

Committed indices are written atomically using temporary files:

Status HNSWIndexProxyUSearch::commit(const std::string_view& field) {
  if (!dirty_) return Status::OK();

  // Flush remaining vectors
  auto status = flush(true);
  if (!status.ok()) return status;

  // Parallel disk I/O
  auto& pool = NamedThreadPools::get(ThreadPoolName::kDiskIO);
  auto futures = std::vector<std::future<Status>> {};

  for (auto shard_id = 0; shard_id < get_shard_count(); ++shard_id) {
    futures.emplace_back(pool.invoke([this, field, shard_id]() {
      auto filename = get_index_filename(field, shard_id);
      auto tmp = std::filesystem::path(filename)
          .replace_extension(".tmp");

      // Write to temp file
      auto config = unum::usearch::
          index_dense_serialization_config_t {};
      auto result = index_->save(tmp.c_str(), config);

      if (!result) {
        return Status::IOError("Failed to save HNSW index");
      }
      return Status::OK();
    }));
  }

  // Collect results
  for (auto& f : futures) {
    status = f.get();
    if (!status.ok()) return status;
  }

  // Atomic rename
  auto txn = HNSWIndexTransaction{get_filenames(field)};
  status = txn.prepare();
  if (!status.ok()) return status;

  status = txn.commit();
  if (!status.ok()) return status;

  // Flush WAL after successful commit
  status = wal_->flush();

  dirty_ = false;
  return status;
}

21.6.5 Document Deletion

HNSW supports soft deletion with graph connectivity preservation:

Status HNSWIndexProxyUSearch::remove(DocID doc_id,
                                     bool write_to_wal) {
  if (doc_id == kInvalidDocID) {
    return Status::OK();
  }

  // Remove from buffer
  remove_doc_from_buffer_(doc_id);

  // WAL entry
  if (write_to_wal) {
    wal_->append(HNSWIndexWALOperationType::kRemove,
                 doc_id, nullptr, 0);
  }

  // Remove from index (marks as deleted)
  index_->remove(doc_id);

  dirty_ = true;
  need_flush_ = true;

  return Status::OK();
}

Updates are implemented as delete-then-insert:

Status HNSWIndex::update(const DocValue& doc) {
  auto status = remove(doc);
  if (!status.ok()) return status;

  return add(doc);
}

21.7 IVF-HNSW Hybrid Index

21.7.1 Architecture Overview

The IVF-HNSW hybrid combines Inverted File (IVF) partitioning with HNSW acting as the coarse quantizer: an HNSW graph over the partition centroids locates the nearest inverted lists, which are then scanned for candidates.

21.7.2 Custom Inverted Lists

FAISS inverted lists are backed by the document database:

class InvertedLists final : public faiss::InvertedLists {
public:
  size_t add_entries(size_t list_no, size_t n_entry,
                     const faiss::idx_t* ids,
                     const uint8_t* code) override {
    auto txn = db_.begin_write_batch();

    for (auto i = 0uz; i < n_entry; ++i) {
      auto key = HNSWIndexIVFKeyCodecV1::encode(
          guid_, field_name_, list_no, ids[i]);
      auto value_view = Slice {
          reinterpret_cast<const char*>(code + i * code_size),
          code_size,
      };
      auto value = HNSWIndexIVFValueCodecV1::encode(value_view);
      txn->put(key, value);
    }

    return txn->commit().ok() ? n_entry : 0;
  }

  auto get_iterator(size_t list_no, void* ctx) const
      -> faiss::InvertedListsIterator* override {
    return new InvertedListsIterator(
        &db_, guid_, field_name_, list_no, code_size);
  }
};

21.7.3 Search with IVF

IVF search probes multiple partitions:

\text{Search}(q, k, \text{nprobe}) = \bigcup_{i=1}^{\text{nprobe}} \text{TopK}(q, L_{c_i}, k)

where c_1, \ldots, c_{\text{nprobe}} are the nearest centroids to q.
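The probe-and-union step can be sketched as follows. This is an illustrative simplification: precomputed centroid and candidate distances stand in for the coarse quantizer and for PQ-decoded list entries:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Rank partitions by centroid distance, scan the nprobe closest lists,
// and keep the k best (distance, id) candidates overall
std::vector<std::pair<float, int>> ivf_search(
    const std::vector<float>& centroid_dist,  // distance from q to centroid i
    const std::vector<std::vector<std::pair<float, int>>>& lists,
    std::size_t nprobe, std::size_t k) {
  // Order partitions by centroid proximity
  std::vector<std::size_t> order(centroid_dist.size());
  for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
  std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return centroid_dist[a] < centroid_dist[b];
  });

  // Union the candidates from the nprobe nearest lists, then truncate to k
  std::vector<std::pair<float, int>> candidates;
  for (std::size_t i = 0; i < nprobe && i < order.size(); ++i) {
    const auto& list = lists[order[i]];
    candidates.insert(candidates.end(), list.begin(), list.end());
  }
  std::sort(candidates.begin(), candidates.end());
  if (candidates.size() > k) candidates.resize(k);
  return candidates;
}
```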

21.8 Performance Characteristics

21.8.1 Complexity Analysis

Operation   Time Complexity           Space Complexity
Build       O(n log n · M · ef_c)     O(n · M · ℓ̄)
Search      O(log n + k · M · ef_s)   O(ef_s)
Insert      O(log n · M · ef_c)       O(M · ℓ̄)
Delete      O(M)                      O(1)

where ℓ̄ is the average layer count per node.

21.8.2 Memory Footprint

Memory per vector in HNSW:

\text{Memory} = d \cdot \text{sizeof(scalar)} + M \cdot \bar{\ell} \cdot \text{sizeof(id)}

For 512-dimensional vectors with M = 32, float32 storage, \bar{\ell} \approx 1.5, and 8-byte ids:

\text{Memory} = 512 \times 4 + 32 \times 1.5 \times 8 = 2432 \text{ bytes/vector}

With 8-bit quantization:

\text{Memory} = 512 \times 1 + 32 \times 1.5 \times 8 = 896 \text{ bytes/vector}
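These footprint estimates follow from a one-line calculation; bytes_per_vector is a hypothetical helper mirroring the formula above:

```cpp
#include <cstddef>

// Back-of-envelope memory per vector: d * sizeof(scalar) for the data plus
// M * avg_layers * sizeof(id) for the graph links
std::size_t bytes_per_vector(std::size_t dims, std::size_t scalar_bytes,
                             std::size_t M, double avg_layers,
                             std::size_t id_bytes) {
  return dims * scalar_bytes +
         static_cast<std::size_t>(M * avg_layers * id_bytes);
}
```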

21.8.3 Recall vs Latency Tradeoff

The ef_search parameter controls the recall-latency tradeoff:

ef_search   Recall@10   Latency (ms)
16          ~85%        0.1
64          ~95%        0.3
256         ~99%        1.0
1024        ~99.9%      4.0

Values are representative for 1M vectors in 512 dimensions.

21.9 Configuration and Best Practices

21.9.1 Parameter Selection Guidelines

Dimensionality (dims)

Match the embedding model output dimension. Common values:

  • 384: Sentence transformers (MiniLM)
  • 512: Custom models, reduced BERT
  • 768: BERT base, RoBERTa
  • 1024: BERT large
  • 1536: OpenAI text-embedding-3-small

Connectivity (M)

Higher M increases recall at the cost of memory:

  • M=16: Memory-constrained, lower recall acceptable
  • M=32: Balanced (default)
  • M=64: High-recall applications

ef_construction

Higher values improve graph quality:

  • ef_construction=40: Fast indexing (default)
  • ef_construction=100: Balanced
  • ef_construction=200+: Maximum quality

ef_search

Tune per query based on latency budget:

  • ef_search=16: Lowest latency (default)
  • ef_search=64: Balanced
  • ef_search=256+: Maximum recall

21.9.2 Index Configuration Example

{
  "vector_field": {
    "analyzer": {
      "type": "DenseVectorAnalyzer",
      "options": {
        "dims": 512
      }
    },
    "hnsw": {
      "index_type": "HNSW_USEARCH",
      "dims": 512,
      "M": 32,
      "ef_construction": 100,
      "ef_search": 64,
      "metric_type": "InnerProduct",
      "quantizer_type": "k8Bit",
      "normalize": true
    }
  }
}

21.10 Summary

This chapter explored Cognica's vector search infrastructure:

  1. Theoretical foundations: The curse of dimensionality motivates approximate methods; HNSW provides logarithmic search through hierarchical small-world graphs

  2. Algorithm parameters: M controls connectivity, ef_construction determines graph quality, ef_search trades recall for latency

  3. Multi-backend architecture: USearch provides the default implementation; FAISS enables advanced features like IVF-PQ compression

  4. Quantization strategies: Scalar quantization (4-16 bit) reduces memory; Product Quantization enables billion-scale indices

  5. Probabilistic scoring: Temperature-scaled softmax converts distances to calibrated probability distributions

  6. Durability guarantees: WAL-based recovery and atomic commits ensure crash consistency

  7. Performance characteristics: Sublinear search complexity enables real-time similarity search at scale

The next chapter examines hybrid search, combining the semantic understanding of vector search with the precision of lexical matching through principled score fusion.

Copyright (c) 2023-2026 Cognica, Inc.